John Harshman: The Phylogeny of Crocodiles

Continuing the discussion from Reviving Office Hours:

See the full paper here:

True and False Gharials: A Nuclear Gene Phylogeny of Crocodylia

We are privileged to have @John_Harshman here, who has graciously offered to answer questions about this study. Why is it important?

So let’s look at crocodiles together!


What is the K-H test?

Why doesn’t this discrepancy count as evidence against common descent?


Short answer: it’s a pairwise statistical test used to determine whether a given data set supports one tree over a second tree, both trees having been chosen prior to examining the data. There are both likelihood and parsimony-based versions.

Kishino H., Hasegawa M. Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in Hominoidea. Journal of Molecular Evolution 1989; 29:170-179.
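For readers who want to see the mechanics, here is a minimal sketch of the likelihood version of the test. It assumes you already have per-site log-likelihoods for the two trees (the function name `kh_test` and the input lists are hypothetical, not from the paper) and it uses the normal approximation rather than a resampling procedure, so treat it as an illustration and not as what any particular phylogenetics package does.

```python
import math

def kh_test(site_lnl_tree1, site_lnl_tree2):
    """Kishino-Hasegawa style comparison of two trees chosen in advance.

    Inputs are per-site log-likelihoods of the same alignment on each tree.
    Returns (delta, z, p): the total log-likelihood difference, its
    standardized value, and a two-sided p-value (normal approximation).
    """
    if len(site_lnl_tree1) != len(site_lnl_tree2):
        raise ValueError("both trees must be scored on the same sites")
    n = len(site_lnl_tree1)
    if n < 2:
        raise ValueError("need at least two sites")

    # Per-site differences in log-likelihood between the two trees.
    diffs = [a - b for a, b in zip(site_lnl_tree1, site_lnl_tree2)]
    delta = sum(diffs)
    mean = delta / n

    # Variance of the summed difference, estimated from the site-to-site spread.
    var_delta = n / (n - 1) * sum((d - mean) ** 2 for d in diffs)
    if var_delta == 0.0:
        # No site-to-site variation: the test statistic is undefined.
        return delta, float("nan"), float("nan")

    z = delta / math.sqrt(var_delta)
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail of a standard normal
    return delta, z, p

# Made-up per-site values, purely for illustration.
tree1 = [-1.0, -2.0, -1.5, -1.2]
tree2 = [-1.1, -2.3, -1.4, -1.6]
delta, z, p = kh_test(tree1, tree2)
print(delta, z, p)
```

A small p-value says the data prefer one tree over the other by more than site-to-site noise would explain. The test is only valid when both trees were picked before looking at the data.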

Considered in isolation, one might say that it does. However, when one considers the general match of molecules to morphology over all of life, the rarity of such exceptions as this one argues that we should consider explanations of the discrepancy other than absence of a real phylogeny. And the fact that multiple independently gathered molecular data sets agree among themselves is an argument favoring the molecular tree. Finally, morphological data have a strong secondary signal agreeing with the molecular tree. For the latter argument, in addition to my paper, you can consult Gatesy J., Amato G., Norell M., DeSalle R., Hayashi C. Combined support for wholesale taxic atavism in gavialine crocodylians. Systematic Biology 2003; 52:403-422.


Can you quantify the rarity?

What are possible explanations?

What is the difference between a true and false gharial? What is a gharial?


No. I’ll just say that when you find conflict it’s worth a whole publication, but when you don’t find conflict it isn’t. So the literature is a biased sample of the incidence of conflict. Nevertheless, conflicts are generally limited to a few nodes on large trees. In this case, for example, the only real conflict is the sole question of gharial relationships, and the rest of the crocodylians have non-conflicting relationships.

You know many of them with respect to conflict among molecular studies. Morphology would add coding errors. Coding of morphological characters can be difficult and subjective.

“Gharial” is the name for two species of crocodile with long, narrow snouts. The difference is that they are separate genera, easily distinguished by a host of characters. Incidentally, the issue of Systematic Biology in which both our paper and John Gatesy’s paper appeared has a dramatic painting of both species on the cover, if you can find it.

True gharial is on the bottom.


So if I understand correctly, the issue is really that the false Gharial has retained a number of morphological characteristics, apart from the snout morphology, that made it group more closely with other crocodiles in the Crocodylidae family, despite its snout being more similar to the true Gharial’s.

So when the molecular tree comes out with both Gharials grouped most closely together as sister taxa, this also shows that morphology really can, in principle, be encoded by gene-sequences that yield phylogenies that don’t have to mirror the morphological ones. In this case the false Gharial has morphological characteristics that group it somewhere inside the Crocodylidae family, yet that morphology is encoded by gene-sequences that imply it is actually most closely related to the true Gharial.

Which in turn makes the Gharial morphology-vs-molecules conflict a good example of an “exception that proves the rule”: there is really only one good explanation for consilience of independent phylogenies. The genes yield similar trees not because they’re functionally constrained to track each other, but because of shared genealogical history.


That makes so much sense. Thanks for the excellent explanation.

I recall a documentary some years ago on India’s gharials. A pond at some gharial sanctuary was so saturated and churning with such creatures that it looked like something out of a horror movie—or perhaps a Hindu analogue to Dante’s levels of hell.


Yes, and “retained” is the correct term, as the true gharial has a large number of reversals from the derived form seen in Tomistoma and Crocodylus to the primitive form seen in the common ancestor of crocs and alligators.

Well, no it doesn’t, as the sequence we used in that paper encodes no morphology and has nothing to do with the characters that were used in the morphological tree. It would require considerable genomics work to determine what sequences were responsible for particular morphological differences, but then you could try a tree using those sequences. I expect it would result in the molecular tree. But I have no direct evidence for that expectation.

That would be a good inference from the sort of data set you suggest, but nobody has produced such a data set, and it would require untold years of work to identify the proper data.


Mind you, they eat fish almost exclusively.


I see what you mean. So I’ll have to change my argument.

It shows that, whatever the function of the c-myc gene sequence is (I understand from your paper it’s a transcription factor), it’s not constrained in its sequence to yield a phylogeny that is identical with the phylogeny derived from the morphological characteristics used to elucidate the traditional Crocodilia topology.
So even if there really are gene sequences under such constraints, this isn’t one of them (or the false Gharial morphological and c-myc molecular trees should have been identical), which implies that at least the c-myc sequence really is independent of morphology.

Re-reading that last sentence I realize I have to relax that claim even more to a much weaker statement that, if there are any constraints operating on the c-myc gene sequences in the species used in your paper that could at all bias them to yield phylogenetic trees similar to the ones derived from the morphological characteristics, those constraints aren’t strong enough to force the trees to be completely identical.


True, no constraint other than that all these characters evolved on the same phylogeny. Yes, c-myc is a transcription factor; it’s also one of the few genes actually known to have functional alternative splicing. But it isn’t independent of morphology; clearly, transcription factors have something to do with morphology. It’s just that we can’t tie any differences in c-myc sequences to any particular differences in crocodylian morphology. It’s possible it could be involved, but I further suspect that none of the characters in the morphological data set has anything to do with differences in c-myc sequence. Also, the majority of the sequence is not part of the protein-coding exon and is either spliced out or untranslated.

And we also have no idea what any constraints could possibly be or how they would work.


Question 1:

Could you comment on whether a certain amount of anomaly is expected when using multiple, different methods of tree-building?

The reason I ask is this: stochastic phenomena with very large repetitions manifest (what look like) anomalies at a certain rate. For example, if I flip a fair coin 10 times, 10 heads in a row would be anomalous. If I flipped a fair coin 10,000 times and then selected 10 consecutive flips at random, I would expect to find 10 heads in a row at a certain frequency. (In fact, roughly once every 1000 selections, since the probability is 1 in 1024.)

In the realm of evolutionary biology, we also expect incomplete lineage sorting, a type of anomaly, to show up with a certain frequency, right?

Question 2:

Do the data you collected rule out the null hypothesis (no nested hierarchy–i.e., no common ancestry) at a level of statistical significance?


I’m not quite clear on what you mean here, as it isn’t the sort of language we use in phylogenetics to describe what you may be talking about. We would expect a certain amount of difference between trees built by different methods, largely because the methods have different assumptions about the nature of the data which may be violated to greater or lesser degree by the data sets. That’s quite different from what you describe in coin flips.

Now, if I understand the coin flip analogy here, it’s similar to the various sites of a DNA sequence being thought of as pulled from some ideal distribution, which methods in fact usually assume. Purely by chance, we may get a data set that isn’t representative of the distribution, and thus analysis of that data would produce a false tree. Is that what you mean? Of course that’s possible, though the probability decreases as sample size increases and as the extent of the required bias increases. In other words, 100 heads in a row is less probable than 10 heads in a row, and 10 out of 10 coins being heads is less probable than 7 out of 10. I think our data in the croc paper surpasses any significant probability of that error on both counts.
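The coin arithmetic in this exchange is easy to check exactly. The sketch below is just the binomial formula for a fair coin; the helper name `prob_exact_heads` is made up for illustration.

```python
from fractions import Fraction
from math import comb

def prob_exact_heads(k, n):
    """Probability of exactly k heads in n independent fair coin flips."""
    return Fraction(comb(n, k), 2 ** n)

p_10_of_10 = prob_exact_heads(10, 10)      # 1/1024, i.e. roughly 1 in 1000
p_7_of_10 = prob_exact_heads(7, 10)        # 120/1024 = 15/128
p_100_of_100 = prob_exact_heads(100, 100)  # 1/2**100, vastly smaller

print(p_10_of_10, p_7_of_10, float(p_100_of_100))
```

Both trends from the reply show up: the more flips that all have to come up heads, and the more lopsided the required outcome, the smaller the probability.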

Incomplete lineage sorting is quite another question. We do expect it to happen, with a frequency inversely related to branch lengths between divergence events and directly to population size. The way to deal with that is by assessing many independently assorting loci. That’s where that mention of prior publications comes in.
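To make that dependence concrete, here is a small sketch using the standard three-taxon coalescent result, which is not from the croc paper: for a diploid population with constant effective size N, the chance that a gene tree disagrees with the species tree is (2/3)·exp(−t/2N), where t is the length in generations of the internal branch between the two divergence events. The function name and the example numbers are assumptions chosen only for illustration.

```python
import math

def discordance_probability(generations, effective_pop_size):
    """Chance that a single-locus gene tree conflicts with the species tree.

    Three-taxon case, diploid population of constant effective size.
    generations: length of the internal branch between the two splits.
    """
    t = generations / (2.0 * effective_pop_size)  # branch length in coalescent units
    return (2.0 / 3.0) * math.exp(-t)

# Short internal branch or large population: conflict is common.
print(discordance_probability(10_000, 100_000))
# Long internal branch or small population: conflict is rare.
print(discordance_probability(1_000_000, 10_000))
```

Because each independently assorting locus draws its own gene tree, sampling many loci lets the species tree show through as the most frequent topology.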

Not through any formal test, but I’d say the null expectation would be that the data would not support any tree over any other, so the various tests of support we performed should serve. Check the Kishino-Hasegawa test and the non-parametric phylogenetic bootstraps.


Hi John,

Appreciate the illumination!

I’m not sure I agree with this. Wouldn’t the null expectation be that the trees would have a randomized distribution of some kind? In other words, some trees would seem a little less likely, some would seem a little more likely, all the way up to one tree being the most likely.

However, this randomized distribution should be recognizable, and the most likely trees that emerge from your study would (presumably) not conform to it.

Does this make sense?



@Chris_Falter do you know what a homoplasy is? It is a feature that does not fall into the tree formed by other features. How likely do you think homoplasies are in sequences not constructed by common descent?


I love the Socratic method!

Homoplasy would be the result of convergent evolution where the sequences are the result of common descent, I postulate.

I would also postulate that homoplasy would be far more common in trees not formed from common descent.


Yes, but the differences would not be expected to be significant. That’s what I meant by “support”. In fact a randomized data set will generally provide you with a single most likely tree, but its difference in likelihood from a randomly selected other tree would be small.


By technical definition it’s “similarity not due to common descent”. In order to infer homoplasy, we must first estimate the true tree. The assumption there is that the data supporting the true tree will outweigh the homoplasies supporting any one of various false trees.


I’d like to suggest an experiment (perhaps @evograd, @davecarlson or @Jordan can help).

Let’s generate two datasets. One is sequences generated by a tree. The other is sequences generated by random distribution.

Let’s then see how strong the inferred tree is using a phylogeny package. @John_Harshman, what should we expect to see?
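As a starting point, here is a toy version of that experiment that needs no phylogeny package. Everything in it is an assumption made for illustration: four taxa, a true tree ((A,B),(C,D)), the same per-branch change probability on every branch, and a simple count of parsimony-informative site patterns in place of a real tree search.

```python
import random

BASES = "ACGT"

def mutate(seq, p, rng):
    """Copy seq, changing each site to a different base with probability p."""
    out = []
    for base in seq:
        if rng.random() < p:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return out

def simulate_on_tree(n_sites, p, rng):
    """Evolve sequences for A, B, C, D down the tree ((A,B),(C,D))."""
    root = [rng.choice(BASES) for _ in range(n_sites)]
    ab = mutate(root, p, rng)  # ancestor of A and B
    cd = mutate(root, p, rng)  # ancestor of C and D
    return [mutate(ab, p, rng), mutate(ab, p, rng),
            mutate(cd, p, rng), mutate(cd, p, rng)]

def simulate_random(n_sites, rng):
    """Four sequences drawn independently, with no shared history."""
    return [[rng.choice(BASES) for _ in range(n_sites)] for _ in range(4)]

def quartet_support(seqs):
    """Count informative sites favoring each of the three possible quartet trees."""
    counts = {"AB|CD": 0, "AC|BD": 0, "AD|BC": 0}
    for a, b, c, d in zip(*seqs):
        if a == b and c == d and a != c:
            counts["AB|CD"] += 1
        elif a == c and b == d and a != b:
            counts["AC|BD"] += 1
        elif a == d and b == c and a != b:
            counts["AD|BC"] += 1
    return counts

rng = random.Random(1)
print("tree:  ", quartet_support(simulate_on_tree(5000, 0.1, rng)))
print("random:", quartet_support(simulate_random(5000, rng)))
```

Under these assumptions the tree-generated data should put most informative sites behind AB|CD, while the random data should split them about evenly across the three groupings. One grouping will still come out slightly ahead in the random set, but only by a small margin.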

And most snakes and spiders are harmless to humans—but that doesn’t get in the way of popular horror movie tropes. :wink:
