Please explain CI and how it relates to the number of homoplastic mutations?
CI (the consistency index) is a property of data mapped onto a tree. It's the ratio of the minimum possible number of transformations for each character on any tree to the number of transformations as mapped onto the current tree. The minimum possible number of transformations is the number of observed states minus one. Most of the additional transformations above the minimum would be considered homoplasy. CI ranges from 0 to 1; higher CI means less homoplasy. A CI of 0.95 means very little implied homoplasy; 0.53, quite a lot.
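To make the ratio concrete, here is a minimal Python sketch of the arithmetic. It assumes the usual ensemble form, where the minimum steps and the steps on the tree are each summed over characters before dividing. The function name and the three example characters are made up for illustration and do not come from any particular program.

```python
def consistency_index(min_steps, observed_steps):
    """Ensemble CI: summed minimum steps divided by summed steps on the tree."""
    total_min = sum(min_steps)
    total_obs = sum(observed_steps)
    if total_obs == 0:
        raise ValueError("no changes on the tree; CI is undefined")
    return total_min / total_obs

# Three hypothetical characters, each with two observed states,
# so the minimum for each is 2 - 1 = 1 step.
min_steps = [1, 1, 1]
# Steps needed on some particular tree; the second character
# requires one extra (homoplastic) change.
observed_steps = [1, 2, 1]

ci = consistency_index(min_steps, observed_steps)
print(ci)  # 0.75
```

Three minimum steps against four steps on the tree gives 0.75: one of the four changes is extra, and that extra change is the implied homoplasy.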
Can you please work that out with a small example?
Sure. Suppose you have a site at which there are two states, say T and C, with some species having T and others C. Further suppose that you know the root state to be C. (That isn't necessary and doesn't affect the calculations, but it makes things easier to explain.) The minimum number of changes to explain the distribution is one, from C to T. Now suppose you have a tree that best fits all the data, and on this tree the species with T do not form a single group. We may have to suppose that the change from C to T happened twice independently in different parts of the tree, or we may have to suppose that C changed to T and, at some point, back to C. In either case, there are two changes necessary to explain the distribution of C and T over that tree. The consistency index for that site would be 1/2.
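If it helps, the same example can be worked through in a few lines of Python. This is only a sketch: it counts changes with Fitch's parsimony algorithm on a rooted binary tree written as nested tuples, and the species names (sp1 to sp5) and both tree shapes are invented for illustration. It is not how PAUP is implemented, just the same counting done by hand.

```python
def fitch_steps(tree, states):
    """Return (possible state set, number of changes) for a rooted binary tree.

    A tree is either a leaf name (str) or a pair (left, right) of subtrees.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_steps = fitch_steps(left, states)
    right_set, right_steps = fitch_steps(right, states)
    common = left_set & right_set
    if common:
        # The children agree on at least one state: no change needed here.
        return common, left_steps + right_steps
    # The children disagree: one change is required at this node.
    return left_set | right_set, left_steps + right_steps + 1

def site_ci(tree, states):
    """CI for one site: (observed states - 1) / changes on this tree."""
    min_changes = len(set(states.values())) - 1
    _, changes = fitch_steps(tree, states)
    return min_changes / changes

# One site: two hypothetical species have T, three have C.
states = {"sp1": "T", "sp2": "C", "sp3": "T", "sp4": "C", "sp5": "C"}

# Tree on which the T species do NOT form a single group.
scattered = ((("sp1", "sp2"), ("sp3", "sp4")), "sp5")
# Tree on which the T species DO form a single group.
grouped = ((("sp1", "sp3"), "sp2"), ("sp4", "sp5"))

steps_scattered = fitch_steps(scattered, states)[1]
steps_grouped = fitch_steps(grouped, states)[1]
ci_scattered = site_ci(scattered, states)
ci_grouped = site_ci(grouped, states)
print(steps_scattered, ci_scattered)  # 2 0.5
print(steps_grouped, ci_grouped)      # 1 1.0
```

The data are identical in both cases; only the tree differs. That is why the CI belongs to the data plus a particular tree rather than to the data alone.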
That was excellent. What program does one use to obtain a CI for a data set?
I used PAUP. Note that the CI isn't for the data set; it's for the data set combined with some particular tree.
Thanks, I understand.
Okay, here is a follow-up.
I used the T-REX webserver to generate an arbitrary (but resolved) tree with 10 taxa.
I then used PAML's evolver program to simulate 10,000 bp of sequence along each branch of the tree. Then, I used IQ-TREE to perform maximum likelihood phylogenetic inference with 100 bootstrap replicates.
Here is the tree (with midpoint rooting for easy visualization):
As expected, the tree is completely resolved with 100% bootstrap scores.
Next, I repeated the same procedure but used a completely unresolved tree (i.e., a polytomy) to simulate the DNA sequences. Here is the tree resulting from that data set:
This time the internal branches are all very short, and the low bootstrap scores suggest that all the relationships are very uncertain. Again, as expected.
Edit: fixed some wording
What would be the CI values you get from each of those two trees?
I don't currently have PAUP installed and don't know which other programs estimate CI, but if I get a chance, I'll install PAUP and try to check.
For laypeople, it might be clearer to simply point out that we would mis-score 2 mutations (a mutation and a reversion to the original base or amino-acid residue) as 0 mutations if we have not sampled a species with the mutation. This is noise, but it is washed out by the enormous amount of data we can collect.
Glad to hear it seemed cool. It is fairly old right now (the latest release is about a decade old), but I am working on a new version with Java interfaces for the programs. I do like to think that it has the best documentation in the (phylogeny) industry.