John Harshman: The Phylogeny of Crocodiles

Please explain CI and how it relates to the number of homoplastic mutations?

CI (the consistency index) is a property of data mapped onto a tree. It’s the ratio of the minimum possible number of transformations for the characters on any tree to the number of transformations required when those characters are mapped onto the current tree. For each character, the minimum possible number of transformations is the number of observed states minus one. Most of the additional transformations above the minimum would be considered homoplasy. CI ranges from 0 to 1; higher CI means less homoplasy. A CI of 0.95 means very little implied homoplasy; 0.53, quite a lot.
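Written out as a formula, the definition above might look like the following. This is a sketch that assumes the usual ensemble form, in which the per-character counts are summed before the ratio is taken; the symbols k_i, m_i, s_i and ci_i are labels chosen here, not notation from the post.

```latex
% k_i: number of states observed for character i
% m_i: minimum possible changes for character i on any tree
% s_i: changes for character i as mapped onto the current tree
\[
m_i = k_i - 1, \qquad
ci_i = \frac{m_i}{s_i}, \qquad
CI = \frac{\sum_i m_i}{\sum_i s_i}
\]
```

The excess steps, s_i - m_i, are what would mostly be read as homoplasy, so CI reaches 1 only when no character needs more than its minimum number of changes on the tree.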

Can you please work that out with a small example?

Sure. Suppose you have a site at which there are two states, say T and C, with some species having T and others C. Further suppose that you know the root state to be C. (That isn’t necessary and doesn’t affect the calculations, but it makes things easier to explain.) The minimum number of changes to explain the distribution is one, from C to T. Now suppose you have a tree that best fits all the data, and on this tree the species with T do not form a single group. We may have to suppose that the change from C to T happened twice independently in different parts of the tree, or we may have to suppose that C changed to T and, at some point, back to C. In either case, there are two changes necessary to explain the distribution of C and T over that tree. The consistency index for that site would be 1/2.
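To make the arithmetic concrete, here is a minimal Python sketch of the same example. It assumes a rooted, fully resolved tree written as nested pairs of species names and counts changes with Fitch parsimony, which does not need the root state to be known. The function names, the species labels `sp1` to `sp4` and the two toy trees are all made up for illustration; this is not how PAUP or any other program implements it.

```python
def fitch_changes(tree, states):
    """Return (possible ancestral states, minimum changes) for one site.

    `tree` is a rooted binary tree written as nested 2-tuples of leaf names;
    `states` maps each leaf name to its observed state.
    """
    if isinstance(tree, str):  # a leaf: its state is known, no changes yet
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch_changes(left, states)
    right_set, right_changes = fitch_changes(right, states)
    shared = left_set & right_set
    if shared:  # the children can agree, so no change is needed at this node
        return shared, left_changes + right_changes
    # the children cannot agree: count one change and keep both options open
    return left_set | right_set, left_changes + right_changes + 1


def consistency_index(tree, states):
    """CI for one site on one tree: (observed states - 1) / changes on the tree."""
    minimum = len(set(states.values())) - 1
    _, on_tree = fitch_changes(tree, states)
    if on_tree == 0:
        return None  # constant site: the ratio is undefined
    return minimum / on_tree


# Hypothetical site: two species with T, two with C.
site = {"sp1": "T", "sp2": "T", "sp3": "C", "sp4": "C"}
grouped = (("sp1", "sp2"), ("sp3", "sp4"))    # the T species form one group
scattered = (("sp1", "sp3"), ("sp2", "sp4"))  # the T species are split up

print(consistency_index(grouped, site))    # 1.0
print(consistency_index(scattered, site))  # 0.5
```

The same site scores 1.0 on one tree and 0.5 on the other, which is why a CI always belongs to a data set together with a particular tree.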

That was excellent. What program does one use to obtain a CI for a data set?

I used PAUP. Note that the CI isn’t for the data set; it’s for the data set combined with some particular tree.

Thanks, I understand.

Okay, here is a follow-up.

I used the T-REX web server to generate an arbitrary (but fully resolved) tree with 10 taxa.
I then used the evolver program from the PAML package to simulate 10,000 bp of sequence along the branches of the tree. Finally, I used IQ-TREE to infer a maximum-likelihood phylogeny with 100 bootstrap replicates.

Here is the tree (with midpoint rooting for easy visualization):

As expected, the tree is completely resolved with 100% bootstrap scores.

Next, I repeated the same procedure but used a completely unresolved tree (i.e., a polytomy) to simulate the DNA sequences. Here is the tree resulting from that data set:

This time the internal branches are all very short, and the low bootstrap scores suggest that all the relationships are very uncertain. Again, as expected.

What would be the CI values you get from each of those two trees?

I don’t currently have PAUP installed and don’t know which other programs calculate CI, but if I get a chance, I’ll install PAUP and try to check.

For laypeople, it might be clearer to simply point out that we would mis-score 2 mutations (a mutation and a reversion to the original base or amino-acid residue) as 0 mutations if we have not sampled a species with the mutation. This is noise, but it is washed out by the enormous amount of data we can collect.

Glad to hear it seemed cool. It is fairly old by now (the latest release is about a decade old), but I am working on a new version with Java interfaces for the programs. I do like to think that it has the best documentation in the (phylogeny) industry.
