John Harshman: The Phylogeny of Crocodiles

Well, not all data generated on a tree could be expected to be informative. As you doubtless know, multiple hits lose information, so the shorter the branch and the longer ago it happened the less likely it is to be recovered from the data. But in general, we would expect any test of support to show lots of it on the tree from real data and little on the tree from random data. Bootstrapping, for example, should show many nodes supported at >70% for the real data and few, if any, for the random data. (Bootstrapping is a test of data consistency, i.e. of whether different samplings of the data agree on the same tree.)

Incidentally, an easy way to construct a random data set is to individually randomize the sites among taxa in a real data set.


I would be interested to know how one might go about doing this, perhaps typical software or methods I could google, at least something to point me in the right direction.


I asked Joe Felsenstein, and he said “Seqboot in PHYLIP will do that nicely (see menu option J).” Just take a real data matrix in PHYLIP format, and that option will scramble the sites among species to produce a meaningless matrix.


Thanks @Joe_Felsenstein.

OK @John_Harshman and @swamidass, remember that I essentially know nothing about what I’m doing :wink:

I took the mitochondrial DNA sequence from GenBank for:

  • Gavialis gangeticus (labeled T for true gharial),
  • Tomistoma schlegelii (labeled F for false gharial), and
  • Crocodylus porosus (labeled C for croc, 'cause I’m not a biologist)

I did a sequence alignment using Kalign, got it into PHYLIP format, and built a tree using dnapars. I used parsimony instead of maximum likelihood because it came first in the documentation. I got the following tree:

I then used Seqboot in PHYLIP to scramble the data a little, although I’m not exactly sure how well: I used the bootstrap method with defaults and had it generate a single set of 3 sequences sampled from the 3 I put in. I then took the three “random” sequences (labeled R1, R2, and R3), added them to the previous real set, re-ran dnapars, and got the following tree:

Does any of that make sense?


Actually, you can’t read that off the tree. I’m sure that claim is true for most of the species shown, but the tree doesn’t show it. What it shows are inferred numbers of changes along particular branches, and the sums of those changes will not match the pairwise distances between two taxa. Nor does “based on nucleotide differences” mean what you think it does. I suspect that this tree was built by least squares fit of a simple matching distance matrix, but one can’t be sure without a real reference.

Sadly, no. First off, three sequences can produce only one tree, with three branches that meet at one internal point. You need at least four sequences to say anything. Second, the bootstrap method doesn’t produce random sequences but sequences resembling the original ones. In fact, bootstrapping is a way to measure the strength of signal in data. You have to invoke option J when you set it up in order to get randomized sequences. Finally, I can’t tell what happened in your final analysis, but given that your “random” sequences are grouped, it appears you did manage to produce randomized sequences and must have incorrectly described the process.
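On the first point, a quick way to see why three sequences tell you nothing is to count the possible unrooted, fully bifurcating trees, which is (2n − 5)!! for n labeled tips. The following is only a small sketch of that standard formula; the function name is made up for illustration.

```python
def num_unrooted_trees(n_taxa: int) -> int:
    """Count distinct unrooted, fully bifurcating trees for n_taxa labeled tips."""
    if n_taxa < 3:
        return 1  # one or two tips allow only a single trivial arrangement
    count = 1
    # (2n - 5)!! = 1 * 3 * 5 * ... * (2n - 5)
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


for n in (3, 4, 5, 6):
    print(n, num_unrooted_trees(n))
```

With three tips there is exactly one tree, so no data set can prefer one topology over another. Four tips give three alternatives, which is the smallest case where the data can discriminate.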


sigh I figured.

I did look at the Seqboot documentation for the options under J. I went with the default (bootstrap), which is described as:

It involves creating a new data set by sampling N characters randomly with replacement, so that the resulting data set has the same size as the original, but some characters have been left out and others are duplicated. The random variation of the results from analyzing these bootstrapped data sets can be shown statistically to be typical of the variation that you would get from collecting new data sets.

Which made sense to me from what little I know of bootstrapping from the non-biological contexts where I’ve seen it used. My reading was that this option wasn’t doing the bootstrapping analysis, only sampling the data prior to doing the analysis.

Amongst the other options these two also seemed like they might be what we want:

Permuting species within characters. This method of resampling (well, OK, it may not be best to call it resampling) was introduced by Archie (1989) and Faith (1990; see also Faith and Cranston, 1991). It involves permuting the columns of the data matrix separately. This produces data matrices that have the same number and kinds of characters but no taxonomic structure. It is used for different purposes than the bootstrap, as it tests not the variation around an estimated tree but the hypothesis that there is no taxonomic structure in the data: if a statistic such as number of steps is significantly smaller in the actual data than it is in replicates that are permuted, then we can argue that there is some taxonomic structure in the data (though perhaps it might be just the presence of a pair of sibling species).

Permuting characters separately for each species. This is a method introduced by Steel, Lockhart, and Penny (1993) to permute data so as to destroy all phylogenetic structure, while keeping the base composition of each species the same as before. It shuffles the character order separately for each species.

OK, I think I see what you mean. I put in Crocodylus porosus because I thought I would need another sequence, but looking at the result I see what you mean. Would this be where I’d want to just have the random sequences or do we want to compare one tree without random sequence to another with them (i.e. I need to find another species’ sequence)?

@John_Harshman, thanks for the guidance here, I’m just taking a stab at this as an exercise in learning about phylogenetic trees and some of the tools you all use. This was the first time I’d ever done a sequence alignment, it was really interesting to see the sequences laid out with similarities and differences.


I take it you’ve responded to scd’s post here in the wrong thread?

He got the figure from this webpage btw: Evolution - Evolutionary trees


Correct. And it’s just like any bootstrap: resampling the data with replacement. However, this doesn’t produce a randomized data set. It produces a data set that is supposedly drawn from the same distribution as the real data set, or something like that distribution. Not what you want.
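For anyone who wants to see what "resampling the data with replacement" does to an alignment, here is a minimal sketch. It is not Seqboot's code; the function name and the toy sequences are invented for illustration (T, F, C follow the labels used earlier in the thread, Al is a made-up fourth taxon).

```python
import random


def bootstrap_columns(alignment, rng):
    """Build one bootstrap replicate by sampling sites (columns) with replacement."""
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    # Draw n_sites column indices; some repeat and some never appear.
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(alignment[name][i] for i in picks) for name in names}


# Toy, made-up aligned sequences (not real GenBank data).
toy = {
    "T": "ACGTACGTAC",
    "F": "ACGTACGAAC",
    "C": "ACGAACGAAT",
    "Al": "TCGAACTAAT",
}
replicate = bootstrap_columns(toy, random.Random(1))
for name, seq in replicate.items():
    print(name, seq)
```

Every column in the replicate is an intact column from the original, so whatever species-to-species pattern each site carries is preserved. That is why a bootstrap replicate resembles the real data rather than random data.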

Yes, that’s what you want, though I’m not clear on the difference between “permuting species within characters” and “permuting characters separately for each species.” What we want is to independently scramble the order of cells in a column, i.e. the character states among species. You have to keep entering “J” until you get permutation. Then you get a randomized data set that you can then run through a parsimony (or other) analysis. Ideally, you would test for signal by bootstrapping (again, sampling sites with replacement) and then analyzing each of 100 or so bootstrapped replicates and constructing a majority-rule consensus tree of those 100 replicates.

Depends on what you’re trying for. If you want to compare real data with random data you need to construct a real data set, bootstrap and analyze the replicates, then permute the original data set, bootstrap and analyze those replicates. Then compare the consensus trees resulting from those analyses. The prediction is that there will be a strong consensus from the real data but no real consensus from the permuted data.

I’m intending to do that with my c-myc data whenever I can get a chance.


I think PAML’s evolver package can do this. When I get a little time, I’ll try working on this.


You’re welcome. That program, in addition to its usual bootstrap and jackknife options, has Permute capabilities – it can either permute each column (states of one character) or each row (states of all characters in a species). It can also do fractional jackknifes (jackknives?) and some changes of file format. Free in the package.
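The column-versus-row distinction can be sketched in a few lines. This is only an illustration of the two ideas as described in the documentation quoted earlier, not Seqboot's implementation; the function names and toy sequences are assumptions made up for the example.

```python
import random


def permute_within_columns(alignment, rng):
    """Shuffle each site (column) independently among the species."""
    names = list(alignment)
    columns = [list(col) for col in zip(*(alignment[n] for n in names))]
    for col in columns:
        rng.shuffle(col)  # the states at this site are reassigned to random species
    rows = zip(*columns)
    return {name: "".join(row) for name, row in zip(names, rows)}


def permute_within_rows(alignment, rng):
    """Shuffle the order of sites separately for each species."""
    shuffled = {}
    for name, seq in alignment.items():
        sites = list(seq)
        rng.shuffle(sites)  # keeps this species' base composition unchanged
        shuffled[name] = "".join(sites)
    return shuffled


# Toy, made-up aligned sequences (not real GenBank data).
toy = {
    "T": "ACGTACGTAC",
    "F": "ACGTACGAAC",
    "C": "ACGAACGAAT",
    "Al": "TCGAACTAAT",
}
rng = random.Random(0)
by_col = permute_within_columns(toy, rng)
by_row = permute_within_rows(toy, rng)
```

The column version keeps the number and kinds of characters but removes the association between species; the row version keeps each species' base composition but destroys the alignment between sites.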


Thanks again @Joe_Felsenstein.

We have an experiment on the table as you can see. I expect that, compared to the tree data, the permuted or random data will:

  1. Have more homoplastic mutations that don’t fit the tree.

  2. Have a more flattened tree, with sequences more equidistant from one another and the root node.

These are alternate solutions to the same problem of missing nested clades. So I suspect that different tree building methods may result in more of one or the other error signal.

Do you agree with me here? Are my instincts correct? What might I have missed?

This should be verified (and I have no reason to doubt it) then bolded and emphasized. This means that the method definitely does not assume a tree, and can determine that “no statistically significant tree determined.”


If my inference of the reason you are promoting this (education) is correct, IMO the current conversational format is unlikely to educate anyone unless a detailed walkthrough with screenshots is provided.


OK, I did the analyses. First tree: permuted data bootstrap; second tree: regular data bootstrap. The first tree shows a complete polytomy; that is, no structure is supported, as expected. The second tree is completely resolved, with high bootstrap percentages (the measure of support) for each branch, most of them 100%.


What it actually means is that the method doesn’t assume that any tree is better supported by the data than any other, and a resampling of the data is unlikely to produce the same tree as another resampling of the data. A bootstrap consensus collapses all branches supported by less than 50% of bootstrapped trees. Which in the case of randomized data is all of them.
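The collapsing step can be sketched as simple counting. This is a simplified illustration, not what any particular program does internally: each replicate tree is represented only as a set of clades (groups of taxon names), the names and the toy replicates are made up, and a clade is kept only if it appears in strictly more than half of the replicates.

```python
from collections import Counter


def majority_rule_clades(replicate_trees, threshold=0.5):
    """Return clades found in more than `threshold` of the replicate trees."""
    counts = Counter(clade for tree in replicate_trees for clade in tree)
    n = len(replicate_trees)
    return {clade: c / n for clade, c in counts.items() if c / n > threshold}


tf = frozenset({"T", "F"})
tc = frozenset({"T", "C"})
fc = frozenset({"F", "C"})

# Hypothetical replicates: data with signal mostly agree on one clade...
with_signal = [{tf}, {tf}, {tf}, {tc}]
# ...while permuted data scatter support across incompatible clades.
permuted = [{tf}, {tc}, {fc}, {tc}]

print(majority_rule_clades(with_signal))
print(majority_rule_clades(permuted))
```

In the second case no clade passes the threshold, so every branch collapses and the consensus is a polytomy.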


Was the alligator sequence included in the permutation?

Can you measure and report the number of homoplastic mutations?

Of course it was. Don’t you see it on the tree? Of course, permuting all the taxa but one would have the same effect as permuting all the taxa anyway.

I could report the consistency index; that’s easy enough. Why?
Permuted: CI .5337; regular: CI .9531.
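For readers unfamiliar with the consistency index: it is the minimum number of changes the characters could possibly require (number of states minus one, summed over characters) divided by the number of changes actually required on the tree, so homoplasy pushes it below 1. Here is a minimal sketch using Fitch parsimony on a toy tree. The tree, the sequences, and the function names are made up for illustration, and real programs differ in details such as whether uninformative characters are counted, so this will not reproduce the values above.

```python
def fitch_steps(tree, states):
    """Return (possible ancestral states, step count) for one character.

    tree: nested 2-tuples with tip names (strings) as leaves.
    states: dict mapping tip name -> observed state.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_steps = fitch_steps(tree[0], states)
    right_set, right_steps = fitch_steps(tree[1], states)
    common = left_set & right_set
    if common:
        return common, left_steps + right_steps
    # No shared state: one change is needed on a branch below this node.
    return left_set | right_set, left_steps + right_steps + 1


def consistency_index(tree, alignment):
    """CI = (sum of minimum possible steps) / (sum of observed steps)."""
    names = list(alignment)
    min_steps = 0
    observed = 0
    for column in zip(*(alignment[n] for n in names)):
        n_states = len(set(column))
        if n_states < 2:
            continue  # constant sites require no changes on any tree
        min_steps += n_states - 1
        observed += fitch_steps(tree, dict(zip(names, column)))[1]
    if observed == 0:
        raise ValueError("no variable sites, CI is undefined")
    return min_steps / observed


# Toy, made-up data: site 1 fits the tree, site 2 is constant, site 3 conflicts.
toy_tree = (("T", "F"), ("C", "Al"))
toy = {"T": "AAC", "F": "AAT", "C": "GAC", "Al": "GAT"}
ci = consistency_index(toy_tree, toy)
print(round(ci, 4))
```

In this toy case the conflicting site needs two changes where one would be the minimum, which is exactly the kind of homoplasy the index is meant to summarize.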


@Joe_Felsenstein, thanks for the clarification, and thanks for PHYLIP. I’m just a chemist playing around with population genetics trying to learn something, but I spent a few hours today using it and it’s really cool. The documentation was quite helpful.


If by “the tree” you mean whatever tree is recovered from the permuted data, then yes. More noise, less “signal” (actually zero signal, but by chance some of the noise will favor one or a few trees slightly better than others, and that will translate to a weak signal).

Yes, though this is not something one would ever describe as a main feature of permuted data.

Not sure I understand that one.
