Well, not all data generated on a tree could be expected to be informative. As you doubtless know, multiple hits lose information, so the shorter the branch and the longer ago it happened the less likely it is to be recovered from the data. But in general, we would expect any test of support to show lots of it on the tree from real data and little on the tree from random data. Bootstrapping, for example, should show many nodes supported at >70% for the real data and few, if any, for the random data. (Bootstrapping is a test of data consistency, i.e. of whether different samplings of the data agree on the same tree.)
Incidentally, an easy way to construct a random data set is to individually randomize the sites among taxa in a real data set.
I would be interested to know how one might go about doing this, perhaps typical software or methods I could google, at least something to point me in the right direction.
I asked Joe Felsenstein, and he said "Seqboot in PHYLIP will do that nicely (see menu option J)." Just take a real data matrix in PHYLIP format, and that option will scramble the sites among species to produce a meaningless matrix.
OK @John_Harshman and @swamidass, remember that I essentially know nothing about what I'm doing.
I took the mitochondrial DNA sequence from GenBank for:
Gavialis gangeticus (labeled T for true gharial),
Tomistoma schlegelii (labeled F for false gharial), and
Crocodylus porosus (labeled C for croc, 'cause I'm not a biologist)
I aligned the sequences with Kalign, converted the alignment to PHYLIP format, and ran dnapars. I used parsimony instead of maximum likelihood because it came first in the documentation. I got the following tree:
I then used Seqboot in PHYLIP to scramble the data a little (although I'm not exactly sure how well: I used the bootstrap method with defaults and just had it generate a single set of 3 sequences sampled from the 3 I put in). I then took the three "random" sequences (labeled R1, R2, and R3), added them to the previous real set, re-ran dnapars, and got the following tree:
Actually, you can't read that off the tree. I'm sure that claim is true for most of the species shown, but the tree doesn't show it. What it shows are inferred numbers of changes along particular branches, and the sums of those changes will not match the pairwise distances between two taxa. Nor does "based on nucleotide differences" mean what you think it does. I suspect that this tree was built by least squares fit of a simple matching distance matrix, but one can't be sure without a real reference.
Sadly, no. First off, three sequences can produce only one tree, with three branches that meet at one internal point. You need at least four sequences to say anything. Second, the bootstrap method doesn't produce random sequences but sequences resembling the original ones. In fact bootstrapping is a way to measure the strength of signal in data. You have to invoke option J when you set it up in order to get randomized sequences. Finally, I can't tell what happened in your final analysis, but given that your "random" sequences are grouped, it appears you did manage to produce randomized sequences and must have incorrectly described the process.
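The first point above, that three sequences allow only one tree, follows from the standard count of unrooted, fully bifurcating trees for n labeled tips, which is (2n - 5)!!. A minimal sketch of that count (the function name is just for illustration):

```python
def count_unrooted_trees(n_taxa: int) -> int:
    """Number of distinct unrooted, fully bifurcating trees for n_taxa labeled tips."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # (2n - 5)!! = 1 * 3 * 5 * ... * (2n - 5)
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


for n in (3, 4, 5, 6):
    print(n, count_unrooted_trees(n))
# 3 taxa -> 1 tree, 4 -> 3, 5 -> 15, 6 -> 105
```

With three taxa there is nothing for the data to choose between; four taxa is the smallest case where one topology can be preferred over another.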
I did look at the Seqboot documentation for the options under J. I went with the default (bootstrap), which is described as:
It involves creating a new data set by sampling N characters randomly with replacement, so that the resulting data set has the same size as the original, but some characters have been left out and others are duplicated. The random variation of the results from analyzing these bootstrapped data sets can be shown statistically to be typical of the variation that you would get from collecting new data sets.
Which made sense to me from what little I know of bootstrapping in the non-biological contexts where I've seen it used. My reading was that this option wasn't doing the bootstrapping analysis, only sampling the data prior to doing the analysis.
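For concreteness, the resampling step in the quoted documentation can be sketched in a few lines of Python. This is only an illustration of "sampling N characters randomly with replacement", not Seqboot's actual code; the dictionary layout and the toy sequences are assumptions made for the example.

```python
import random


def bootstrap_columns(alignment, rng=None):
    """Resample alignment columns (sites) with replacement.

    alignment: dict mapping taxon name -> aligned sequence (all equal length).
    Returns a new dict with the same taxa and the same sequence length.
    """
    rng = rng or random.Random()
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    n_sites = lengths.pop()
    # Draw N site indices with replacement: some sites repeat, some are left out.
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    # Every taxon uses the same picks, so each column stays intact.
    return {taxon: "".join(seq[i] for i in picks) for taxon, seq in alignment.items()}


# Toy alignment (hypothetical data, not the GenBank sequences).
toy = {"T": "ACGTAC", "F": "ACGTTC", "C": "ATGTTA"}
replicate = bootstrap_columns(toy, random.Random(1))
print(replicate)
```

Because whole columns are copied, every column in a replicate is a column from the original data, which is why the result resembles the real data set rather than a scrambled one.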
Amongst the other options these two also seemed like they might be what we want:
Permuting species within characters. This method of resampling (well, OK, it may not be best to call it resampling) was introduced by Archie (1989) and Faith (1990; see also Faith and Cranston, 1991). It involves permuting the columns of the data matrix separately. This produces data matrices that have the same number and kinds of characters but no taxonomic structure. It is used for different purposes than the bootstrap, as it tests not the variation around an estimated tree but the hypothesis that there is no taxonomic structure in the data: if a statistic such as number of steps is significantly smaller in the actual data than it is in replicates that are permuted, then we can argue that there is some taxonomic structure in the data (though perhaps it might be just the presence of a pair of sibling species).
Permuting characters separately for each species. This is a method introduced by Steel, Lockhart, and Penny (1993) to permute data so as to destroy all phylogenetic structure, while keeping the base composition of each species the same as before. It shuffles the character order separately for each species.
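One way to see the difference between these two options is to look at what each one keeps fixed. The sketch below is my own illustration of the two descriptions, not Seqboot's implementation; the function names and toy data are hypothetical.

```python
import random


def permute_species_within_characters(alignment, rng=None):
    """Shuffle each column independently among the taxa.

    Each column keeps the same set of states, but which taxon
    holds which state is randomized.
    """
    rng = rng or random.Random()
    taxa = list(alignment)
    columns = [list(col) for col in zip(*(alignment[t] for t in taxa))]
    for col in columns:
        rng.shuffle(col)
    return {t: "".join(col[i] for col in columns) for i, t in enumerate(taxa)}


def permute_characters_within_species(alignment, rng=None):
    """Shuffle the site order independently for each taxon.

    Each taxon keeps its own base composition, but sites no longer
    line up across taxa.
    """
    rng = rng or random.Random()
    out = {}
    for taxon, seq in alignment.items():
        sites = list(seq)
        rng.shuffle(sites)
        out[taxon] = "".join(sites)
    return out


# Toy alignment (hypothetical data).
toy = {"T": "ACGTAC", "F": "ACGTTC", "C": "ATGTTA", "X": "GTGATA"}
by_column = permute_species_within_characters(toy, random.Random(2))
by_row = permute_characters_within_species(toy, random.Random(2))
print(by_column)
print(by_row)
```

The first function preserves the contents of every column, the second preserves the contents of every row; both break the association between taxa that a tree would explain.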
OK, I think I see what you mean. I put in Crocodylus porosus because I thought I would need another sequence, but looking at the result I see what you mean. Would this be where I'd want to just have the random sequences, or do we want to compare one tree without random sequences to another with them (i.e. I need to find another species' sequence)?
@John_Harshman, thanks for the guidance here. I'm just taking a stab at this as an exercise in learning about phylogenetic trees and some of the tools you all use. This was the first time I'd ever done a sequence alignment; it was really interesting to see the sequences laid out with similarities and differences.
Correct. And it's just like any bootstrap: resampling the data with replacement. However, this doesn't produce a randomized data set. It produces a data set that is supposedly drawn from the same distribution as the real data set, or something like that distribution. Not what you want.
Yes, that's what you want, though I'm not clear on the difference between "permuting species within characters" and "permuting characters separately for each species." What we want is to independently scramble the order of cells in a column, i.e. the character states among species. You have to keep entering "J" until you get permutation. Then you get a randomized data set that you can then run through a parsimony (or other) analysis. Ideally, you would test for signal by bootstrapping (again, sampling sites with replacement) and then analyzing each of 100 or so bootstrapped replicates and constructing a majority-rule consensus tree of those 100 replicates.
Depends on what you're trying for. If you want to compare real data with random data you need to construct a real data set, bootstrap and analyze the replicates, then permute the original data set, bootstrap and analyze those replicates. Then compare the consensus trees resulting from those analyses. The prediction is that there will be a strong consensus from the real data but no real consensus from the permuted data.
I'm intending to do that with my c-myc data whenever I can get a chance.
You're welcome. That program, in addition to its usual bootstrap and jackknife options, has Permute capabilities: it can either permute each column (states of one character) or each row (states of all characters in a species). It can also do fractional jackknifes (jackknives?) and some changes of file format. Free in the package.
We have an experiment on the table as you can see. I expect that, compared to the tree data, the permuted or random data will:
Have more homoplastic mutations that don't fit the tree.
Have a more flattened tree, with sequences more equidistant from one another and the root node.
These are alternate solutions to the same problem of missing nested clades. So I suspect that different tree building methods may result in more of one or the other error signal.
Do you agree with me here? Are my instincts correct? What might I have missed?
This should be verified (and I have no reason to doubt it) then bolded and emphasized. This means that the method definitely does not assume a tree, and can determine that "no statistically significant tree determined."
If my inference of the reason you are promoting this (education) is correct, IMO the current conversational format is unlikely to educate anyone unless a detailed walkthrough with screenshots is provided.
OK, I did the analyses. First tree: permuted data bootstrap; second tree: regular data bootstrap. The first tree shows a complete polytomy; that is, no structure is supported, as expected. The second tree is completely resolved, with high bootstrap percentages (the measure of support) for each branch, most of them 100%.
What it actually means is that the method doesn't assume that any tree is better supported by the data than any other, and a resampling of the data is unlikely to produce the same tree as another resampling of the data. A bootstrap consensus collapses all branches supported by less than 50% of bootstrapped trees. Which in the case of randomized data is all of them.
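The collapsing rule can be sketched as a simple count over replicate trees. This is a toy illustration under a strong simplifying assumption: each tree is represented as a set of clades (a hypothetical representation chosen for brevity), whereas real consensus programs work on parsed tree files.

```python
from collections import Counter


def majority_rule_clades(replicate_trees, threshold=0.5):
    """Return clades found in more than `threshold` of the replicate trees.

    replicate_trees: list of trees, each given as a set of clades,
    where a clade is a frozenset of taxon names.
    Returns a dict mapping each surviving clade to its support (0..1).
    """
    n = len(replicate_trees)
    counts = Counter(clade for tree in replicate_trees for clade in set(tree))
    return {clade: c / n for clade, c in counts.items() if c / n > threshold}


AB, CD = frozenset("AB"), frozenset("CD")
AC, BD = frozenset("AC"), frozenset("BD")
AD, BC = frozenset("AD"), frozenset("BC")

# Hypothetical replicates: strong agreement vs. no agreement.
signal = [{AB, CD}] * 9 + [{AC, BD}]
noise = [{AB, CD}, {AC, BD}, {AD, BC}]

print(majority_rule_clades(signal))  # AB and CD survive with support 0.9
print(majority_rule_clades(noise))   # nothing survives: a complete polytomy
```

When no grouping recurs in more than half of the replicates, nothing is kept, and the consensus is the unresolved polytomy seen for the permuted data.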
Of course it was. Don't you see it on the tree? Of course, permuting all the taxa but one would have the same effect as permuting all the taxa anyway.
I could report the consistency index; that's easy enough. Why?
Permuted: CI .5337; regular: CI .9531.
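For readers who haven't met it, the consistency index as usually defined is the minimum number of changes the characters could require on any tree, divided by the number of changes they actually require on the tree in question. A value near 1 means little homoplasy. A minimal sketch, with made-up per-character step counts (these are not the numbers behind the values above):

```python
def consistency_index(min_steps, observed_steps):
    """Ensemble CI: sum of minimum possible steps / sum of observed steps.

    min_steps[i]: fewest changes character i could need on any tree.
    observed_steps[i]: changes character i needs on the tree being scored.
    """
    if len(min_steps) != len(observed_steps):
        raise ValueError("need one minimum and one observed count per character")
    total_observed = sum(observed_steps)
    if total_observed == 0:
        raise ValueError("no changes observed; CI is undefined")
    return sum(min_steps) / total_observed


# Hypothetical example: three characters, two of them with extra steps.
print(round(consistency_index([1, 1, 2], [1, 2, 4]), 3))  # 0.571
```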
@Joe_Felsenstein, thanks for the clarification, and thanks for PHYLIP. I'm just a chemist playing around with population genetics trying to learn something, but I spent a few hours today using it and it's really cool. The documentation was quite helpful.
If by "the tree" you mean whatever tree is recovered from the permuted data, then yes. More noise, less "signal" (actually zero signal, but by chance some of the noise will favor one or a few trees slightly better than others, and that will translate to a weak signal).
Yes, though this is not something one would ever describe as a main feature of permuted data.