John Harshman: The Phylogeny of Crocodiles

swamidass · July 30, 2019, 3:20pm

Continuing the discussion from Reviving Office Hours:

See the full paper here: https://www.researchgate.net/publication/10734528_True_and_False_Gharials_A_Nuclear_Gene_Phylogeny_of_Crocodylia.

True and False Gharials: A Nuclear Gene Phylogeny of Crocodylia

The phylogeny of Crocodylia offers an unusual twist on the usual molecules versus morphology story. The true gharial (Gavialis gangeticus) and the false gharial (Tomistoma schlegelii), as their common names imply, have appeared in all cladistic morphological analyses as distantly related species, convergent upon a similar morphology. In contrast, all previous molecular studies have shown them to be sister taxa. We present the first phylogenetic study of Crocodylia using a nuclear gene. We cloned and sequenced the c-myc proto-oncogene from Alligator mississippiensis to facilitate primer design and then sequenced an 1,100-base pair fragment that includes both coding and noncoding regions and informative indels for one species in each extant crocodylian genus and six avian outgroups. Phylogenetic analyses using parsimony, maximum likelihood, and Bayesian inference all strongly agreed on the same tree, which is identical to the tree found in previous molecular analyses: Gavialis and Tomistoma are sister taxa and together are the sister group of Crocodylidae. Kishino–Hasegawa tests rejected the morphological tree in favor of the molecular tree. We excluded long-branch attraction and variation in base composition among taxa as explanations for this topology. To explore the causes of discrepancy between molecular and morphological estimates of crocodylian phylogeny, we examined puzzling features of the morphological data using a priori partitions of the data based on anatomical regions and investigated the effects of different coding schemes for two obvious morphological similarities of the two gharials.

We are privileged to have @John_Harshman here, who has graciously offered to answer questions about this study. Why is it important?

So let’s look at crocodiles together!

swamidass · July 30, 2019, 4:13pm

What is the K-H test?

Why doesn’t this discrepancy count as evidence against common descent?

John_Harshman · July 31, 2019, 1:32am

Short answer: it’s a pairwise statistical test used to determine whether a given data set supports one tree over a second tree, both trees having been chosen prior to examining the data. There are both likelihood and parsimony-based versions.

Kishino H., Hasegawa M. Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in Hominoidea. Journal of Molecular Evolution 1989; 29:170-179.
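
For readers who want to see the shape of the calculation, here is a minimal sketch of the likelihood-based version, assuming the normal-approximation form of the test: take per-site log-likelihoods under each of two trees fixed in advance, sum the per-site differences, and compare that sum to its estimated standard error. The function name `kh_test` and the input numbers are hypothetical illustrations, not values or code from the paper, and phylogenetics packages often use resampling variants instead.

```python
from math import erfc, sqrt
from statistics import variance


def kh_test(site_lnl_tree1, site_lnl_tree2):
    """Normal-approximation Kishino-Hasegawa test on per-site log-likelihoods.

    Both trees must have been chosen before looking at the data.
    Returns (delta, z, p): the total log-likelihood difference, its
    z-score, and a two-sided p-value.
    """
    if len(site_lnl_tree1) != len(site_lnl_tree2):
        raise ValueError("both trees need one log-likelihood per site")
    n = len(site_lnl_tree1)
    if n < 2:
        raise ValueError("need at least two sites")

    # Per-site difference in log-likelihood between the two trees.
    diffs = [a - b for a, b in zip(site_lnl_tree1, site_lnl_tree2)]
    delta = sum(diffs)

    # Variance of the sum of n site differences, from the sample variance.
    site_var = variance(diffs)
    if site_var == 0:
        raise ValueError("site differences have zero variance")
    std_err = sqrt(n * site_var)

    z = delta / std_err
    p_two_sided = erfc(abs(z) / sqrt(2))
    return delta, z, p_two_sided


if __name__ == "__main__":
    # Hypothetical per-site log-likelihoods for six sites.
    tree1 = [-2.5, -2.3, -2.4, -2.6, -2.2, -2.4]
    tree2 = [-3.0, -3.0, -3.0, -3.0, -3.0, -3.0]
    delta, z, p = kh_test(tree1, tree2)
    print(f"delta lnL = {delta:.2f}, z = {z:.2f}, p = {p:.3g}")
```

A large positive `delta` with a small p-value means the data favor the first tree; a `delta` near zero means they do not discriminate between the two.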

Considered in isolation, one might say that it does. However, when one considers the general match of molecules to morphology over all of life, the rarity of such exceptions as this one argues that we should consider explanations of the discrepancy other than absence of a real phylogeny. And the fact that multiple independently gathered molecular data sets agree among themselves is an argument favoring the molecular tree. Finally, morphological data have a strong secondary signal agreeing with the molecular tree. For the latter argument, in addition to my paper, you can consult Gatesy J., Amato G., Norell M., DeSalle R., Hayashi C. Combined support for wholesale taxic atavism in gavialine crocodylians. Systematic Biology 2003; 52:403-422.

swamidass · July 31, 2019, 5:53pm

Can you quantify the rarity?

What are possible explanations?

What is the difference between a true and false gharial? What is a gharial?

John_Harshman · July 31, 2019, 7:24pm

No. I’ll just say that when you find conflict it’s worth a whole publication, but when you don’t find conflict it isn’t. So the literature is a biased sample of the incidence of conflict. Nevertheless, conflicts are generally limited to a few nodes on large trees. In this case, for example, the only real conflict is the sole question of gharial relationships, and the rest of the crocodylians have non-conflicting relationships.

You know many of them with respect to conflict among molecular studies. Morphology would add coding errors. Coding of morphological characters can be difficult and subjective.

“Gharial” is the name for two species of crocodile with long, narrow snouts. The difference is that they are separate genera, easily distinguished by a host of characters. Incidentally, the issue of Systematic Biology in which both our paper and John Gatesy’s paper appeared has a dramatic painting of both species on the cover, if you can find it.

True gharial is on the bottom.

Rumraket · July 31, 2019, 8:25pm

So if I understand correctly, the issue is really that the false Gharial had retained a number of morphological characteristics apart from the snout morphology that made it group more closely with other crocodiles in the Crocodylidae family, even despite its snout being more similar to the true Gharial.

So when the molecular tree comes out with both Gharials grouped most closely together as sister taxa, this also shows that morphology really can, in principle, be encoded by gene-sequences that yield phylogenies that don’t have to mirror the morphological ones. In this case the false Gharial has morphological characteristics that group it somewhere inside the Crocodylidae family, yet that morphology is encoded by gene-sequences that imply it is actually most closely related to the true Gharial.

Which in turn makes the case of the Gharials morphology-vs-molecules conflict a good example of an “exception that proves the rule”: that there is really only one good explanation for consilience of independent phylogenies. The genes yield similar trees not because they’re functionally constrained to track each other, but because of shared genealogical history.

AllenWitmerMiller · July 31, 2019, 9:07pm

That makes so much sense. Thanks for the excellent explanation.

I recall a documentary some years ago on India’s gharials. A pond at some gharial sanctuary was so saturated and churning with such creatures that it looked like something out of a horror movie—or perhaps a Hindu analogue to Dante’s levels of hell.

John_Harshman · July 31, 2019, 10:12pm

Yes, and “retained” is the correct term, as the true gharial has a large number of reversals from the derived form seen in Tomistoma and Crocodylus to the primitive form seen in the common ancestor of crocs and alligators.

Well, no it doesn’t, as the sequence we used in that paper encodes no morphology and has nothing to do with the characters that were used in the morphological tree. It would require considerable genomics work to determine what sequences were responsible for particular morphological differences, but then you could try a tree using those sequences. I expect it would result in the molecular tree. But I have no direct evidence for that expectation.

That would be a good inference from the sort of data set you suggest, but nobody has produced such a data set, and it would require untold years of work to identify the proper data.

John_Harshman · July 31, 2019, 10:13pm

Mind you, they eat fish almost exclusively.

Rumraket · July 31, 2019, 10:52pm

I see what you mean. So I’ll have to change my argument.

It shows that, whatever the function of the c-myc gene sequence is (I understand from your paper it’s a transcription factor), it’s not constrained in its sequence to yield a phylogeny that is identical with the phylogeny derived from the morphological characteristics used to elucidate the traditional Crocodilia topology.
So even if there really are such gene sequences under such constraints, it’s not the case for this one (or the false Gharial morphological and c-myc molecular trees should have been identical), which implies that at least for the c-myc sequence it really is independent of morphology.

Re-reading that last sentence I realize I have to relax that claim even more to a much weaker statement that, if there are any constraints operating on the c-myc gene sequences in the species used in your paper that could at all bias them to yield phylogenetic trees similar to the ones derived from the morphological characteristics, those constraints aren’t strong enough to force the trees to be completely identical.

John_Harshman · July 31, 2019, 11:08pm

True, no constraint other than that all these characters evolved on the same phylogeny. Yes, c-myc is a transcription factor; it’s also one of the few genes actually known to have functional alternative splicing. But it isn’t independent of morphology; clearly, transcription factors have something to do with morphology. It’s just that we can’t tie any differences in c-myc sequences to any particular differences in crocodylian morphology. It’s possible it could be involved, but I suspect that none of the characters in the morphological data set has anything to do with differences in c-myc sequence. Also, the majority of the sequence is not part of the protein-coding exon and is either spliced out or untranslated.

And we also have no idea what any constraints could possibly be or how they would work.

Chris_Falter · August 1, 2019, 10:27pm

Question 1:

Could you comment on whether a certain amount of anomaly is expected when using multiple, different methods of tree-building?

The reason I ask is this: stochastic phenomena with very large repetitions manifest (what look like) anomalies at a certain rate. For example, if I flip a fair coin 10 times, 10 heads in a row would be anomalous. If I flipped a fair coin 10,000 times and then selected 10 consecutive flips at random, I would expect to find 10 heads in a row at a certain frequency. (In fact, roughly once every 1,000 selections.)
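
The "roughly every 1000" figure can be checked directly: ten fair flips are all heads with probability 1/2^10 = 1/1024. The sketch below computes that value and, as a sanity check, counts all-heads windows of length 10 across simulated runs of 10,000 flips. The function names, the number of simulated sequences, and the seed are arbitrary choices made for illustration.

```python
import random


def exact_all_heads_probability(window: int = 10) -> float:
    """Probability that `window` consecutive fair flips are all heads."""
    return 0.5 ** window


def simulated_all_heads_rate(n_flips: int = 10_000, window: int = 10,
                             n_sequences: int = 200, seed: int = 1) -> float:
    """Fraction of length-`window` windows that are all heads, pooled over
    `n_sequences` independent sequences of `n_flips` fair coin flips."""
    rng = random.Random(seed)
    hits = 0
    windows = 0
    for _ in range(n_sequences):
        run = 0  # length of the current run of heads
        for i in range(n_flips):
            run = run + 1 if rng.random() < 0.5 else 0
            if i >= window - 1:      # a full window ends at position i
                windows += 1
                if run >= window:    # that window is all heads
                    hits += 1
    return hits / windows


if __name__ == "__main__":
    print("exact:    ", exact_all_heads_probability())
    print("simulated:", simulated_all_heads_rate())
```

Overlapping windows are correlated (a run of 11 heads contains two all-heads windows), so the simulated rate fluctuates more than independent draws would, but it should stay close to 1/1024.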

In the realm of evolutionary biology, we also expect incomplete lineage sorting, a type of anomaly, to show up with a certain frequency, right?

Question 2:

Do the data you collected rule out the null hypothesis (no nested hierarchy–i.e., no common ancestry) at a level of statistical significance?

John_Harshman · August 2, 2019, 2:17am

I’m not quite clear on what you mean here, as it isn’t the sort of language we use in phylogenetics to describe what you may be talking about. We would expect a certain amount of difference between trees built by different methods, largely because the methods have different assumptions about the nature of the data which may be violated to greater or lesser degree by the data sets. That’s quite different from what you describe in coin flips.

Now, if I understand the coin flip analogy here, it’s similar to the various sites of a DNA sequence being thought of as pulled from some ideal distribution, which methods in fact usually assume. Purely by chance, we may get a data set that isn’t representative of the distribution, and thus analysis of that data would produce a false tree. Is that what you mean? Of course that’s possible, though the probability decreases as sample size increases and as the extent of the required bias increases. In other words, 100 heads in a row is less probable than 10 heads in a row, and 10 out of 10 coins being heads is less probable than 7 out of 10. I think our data in the croc paper surpasses any significant probability of that error on both counts.
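
Both comparisons in that paragraph are binomial probabilities for a fair coin, so they can be put in numbers. This is only a sketch of the coin arithmetic, with hypothetical helper names; it says nothing about the actual error probability for the crocodylian data set.

```python
from math import comb


def prob_exactly(k: int, n: int, p: float = 0.5) -> float:
    """Binomial probability of exactly k heads in n flips."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def prob_at_least(k: int, n: int, p: float = 0.5) -> float:
    """Binomial probability of k or more heads in n flips."""
    return sum(prob_exactly(j, n, p) for j in range(k, n + 1))


if __name__ == "__main__":
    # Larger sample: an unbroken run gets rarer as it gets longer.
    print("10 of 10 heads:  ", prob_exactly(10, 10))
    print("100 of 100 heads:", prob_exactly(100, 100))
    # Larger required bias: 10 of 10 is rarer than 7 of 10.
    print("exactly 7 of 10: ", prob_exactly(7, 10))
    print("at least 7 of 10:", prob_at_least(7, 10))
```

With a fair coin, 10 of 10 heads has probability 1/1024, exactly 7 of 10 has probability 120/1024, and at least 7 of 10 has probability 176/1024.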

Incomplete lineage sorting is quite another question. We do expect it to happen, with a frequency inversely related to branch lengths between divergence events and directly to population size. The way to deal with that is by assessing many independently assorting loci. That’s where that mention of prior publications comes in.
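
The dependence on branch length and population size can be made concrete with the standard three-taxon coalescent result, which is brought in here as an outside assumption rather than something stated in the thread: for a diploid population, the chance that a single gene tree disagrees with the species tree is (2/3)·exp(−T), where T = t / (2·Ne) is the internal branch length in coalescent units, t is that branch in generations, and Ne is the effective population size. The function name and example numbers are hypothetical.

```python
from math import exp


def discordance_probability(generations: float,
                            effective_pop_size: float) -> float:
    """Probability that one gene tree conflicts with a three-taxon species
    tree because of incomplete lineage sorting (diploid coalescent).

    `generations` is the internal branch between the two divergence
    events; `effective_pop_size` is Ne of the ancestral population.
    """
    if effective_pop_size <= 0:
        raise ValueError("effective population size must be positive")
    if generations < 0:
        raise ValueError("branch length cannot be negative")
    coalescent_units = generations / (2 * effective_pop_size)
    return (2 / 3) * exp(-coalescent_units)


if __name__ == "__main__":
    # Hypothetical values: shorter branches and larger populations both
    # raise the chance of a conflicting gene tree.
    for t, ne in [(1_000, 10_000), (100_000, 10_000), (100_000, 100_000)]:
        p = discordance_probability(t, ne)
        print(f"t={t:>7} generations, Ne={ne:>7}: P(discordant) = {p:.4f}")
```

Because each locus is an independent draw from this distribution, sampling many independently assorting loci lets the species tree emerge as the most frequent topology.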

Not through any formal test, but I’d say the null expectation would be that the data would not support any tree over any other, so the various tests of support we performed should serve. Check the Kishino-Hasegawa test and the non-parametric phylogenetic bootstraps.

Chris_Falter · August 2, 2019, 2:06pm

Hi John,

Appreciate the illumination!

I’m not sure I agree with this. Wouldn’t the null expectation be that the trees would have a randomized distribution of some kind? In other words, some trees would seem a little less likely, some would seem a little more likely, all the way up to one tree being the most likely.

However, this randomized distribution should be recognizable, and the most likely trees that emerge from your study would (presumably) not conform to it.

Does this make sense?

Thanks,
Chris

swamidass · August 2, 2019, 2:31pm

@Chris_Falter do you know what a homoplasy is? It is a feature that does not fall into the tree formed by other features. How likely do you think homoplasies are in sequences not constructed by common descent?

Chris_Falter · August 2, 2019, 6:04pm

I love the Socratic method!

Homoplasy would be the result of convergent evolution where the sequences are the result of common descent, I postulate.

I would also postulate that homoplasy would be far more common in trees not formed from common descent.

John_Harshman · August 3, 2019, 12:11am

Yes, but the differences would not be expected to be significant. That’s what I meant by “support”. In fact a randomized data set will generally provide you with a single most likely tree, but its difference in likelihood from a randomly selected other tree would be small.

John_Harshman · August 3, 2019, 12:14am

By technical definition it’s “similarity not due to common descent”. In order to infer homoplasy, we must first estimate the true tree. The assumption there is that the data supporting the true tree will outweigh the homoplasies supporting any one of various false trees.

swamidass · August 3, 2019, 12:32am

I’d like to suggest an experiment (perhaps @evograd, @davecarlson or @Jordan can help).

Let’s generate two datasets. One is sequences generated by a tree. The other is sequences generated from a random distribution.

Let’s then see how strong the inferred tree is using a phylogeny package. @John_Harshman, what should we expect to see?
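
A toy version of this experiment can be run without a phylogeny package. The sketch below is a deliberately simplified stand-in, not the analysis proposed above: it uses four taxa, a crude equal-rates substitution process, and counts of parsimony-informative site patterns instead of a likelihood or bootstrap analysis. All names, branch parameters, and the seed are hypothetical. Sequences evolved on the tree ((A,B),(C,D)) should put most informative sites behind the AB|CD split, while four independent random sequences should support the three possible splits about equally.

```python
import random

BASES = "ACGT"


def random_sequence(length, rng):
    """Independent, uniformly random bases."""
    return [rng.choice(BASES) for _ in range(length)]


def evolve(seq, p_change, rng):
    """Copy a sequence, changing each site to a different base with
    probability p_change (a crude equal-rates substitution model)."""
    return [rng.choice([b for b in BASES if b != base])
            if rng.random() < p_change else base
            for base in seq]


def simulate_on_tree(length, p_internal, p_tip, rng):
    """Evolve four tip sequences down the tree ((A,B),(C,D))."""
    root = random_sequence(length, rng)
    left = evolve(root, p_internal, rng)    # ancestor of A and B
    right = evolve(root, p_internal, rng)   # ancestor of C and D
    return {"A": evolve(left, p_tip, rng), "B": evolve(left, p_tip, rng),
            "C": evolve(right, p_tip, rng), "D": evolve(right, p_tip, rng)}


def simulate_random(length, rng):
    """Four sequences with no shared history at all."""
    return {name: random_sequence(length, rng) for name in "ABCD"}


def split_support(seqs):
    """Count parsimony-informative sites supporting each of the three
    possible unrooted four-taxon trees."""
    counts = {"AB|CD": 0, "AC|BD": 0, "AD|BC": 0}
    for a, b, c, d in zip(seqs["A"], seqs["B"], seqs["C"], seqs["D"]):
        if a == b and c == d and a != c:
            counts["AB|CD"] += 1
        elif a == c and b == d and a != b:
            counts["AC|BD"] += 1
        elif a == d and b == c and a != b:
            counts["AD|BC"] += 1
    return counts


if __name__ == "__main__":
    rng = random.Random(7)
    print("tree-generated:", split_support(simulate_on_tree(5000, 0.1, 0.1, rng)))
    print("random:        ", split_support(simulate_random(5000, rng)))
```

In the random data set one split will usually come out slightly ahead by chance, but its margin over the other two is small compared with the margin seen in the tree-generated data.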

AllenWitmerMiller · August 3, 2019, 12:33am

And most snakes and spiders are harmless to humans—but that doesn’t get in the way of popular horror movie tropes.
