A 100% resolved whole-genome phylogeny of placental mammals

Rumraket · May 2, 2023, 2:05pm

https://www.science.org/doi/10.1126/science.abl8189
(paywalled, but a preprint version is available on bioRxiv here: https://www.biorxiv.org/content/10.1101/2022.08.10.503388v1.full)

Legend to figure 1 says 100% bootstrap support for all nodes:

Placental mammal phylogeny based on coalescent analysis of nearly-neutral sites.

(A) 50% Majority-rule consensus tree from a SVDquartets analysis of 411,110 genome-wide, nearly-neutral sites from the human referenced alignment of 241 species. Bootstrap support is 100% for all nodes. Superordinal clades are labelled and identified in four colors. Nodes corresponding to Boreoeutheria and Atlantogenata are indicated by black circles. (B) The frequency at which eight superordinal clades (numbered 1-8 in Fig. 1A) were recovered as monophyletic in 2,164 window-based maximum likelihood trees from representative autosomes (Chr1, Chr21 and Chr22) and ChrX. Dotted lines indicate relationships that differ from the concatenated Maximum Likelihood analysis.

So a perfectly consistent nested hierarchy of placental mammals based on whole genome sequences. Is that evidence for common descent?

John_Harshman · May 2, 2023, 6:51pm

Bootstrap isn’t really a very good measure of confidence for very large data sets, which tend to have 100% support for every node, so I would view that with caution. Still evidence for common descent, though.

Faizal_Ali · May 2, 2023, 6:55pm

I’m not sure why I never asked this before: What does “bootstrap” mean?

Dan_Eastwood · May 2, 2023, 6:59pm

Bootstrap is a statistical resampling method for estimating the distribution of a statistic from the data itself, rather than by assuming a distribution. The name is a reference to “lifting yourself up by your bootstraps”.
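
A minimal sketch of the general idea described above, using only the Python standard library. The sample values and the function names `bootstrap_distribution` and `percentile_interval` are made up for illustration; they do not come from the paper or from any particular statistics package.

```python
import random
import statistics


def bootstrap_distribution(data, statistic, n_replicates=1000, seed=0):
    """Resample `data` with replacement and recompute `statistic` each time."""
    rng = random.Random(seed)
    n = len(data)
    # Each resample has the same size as the original data set.
    return [statistic(rng.choices(data, k=n)) for _ in range(n_replicates)]


def percentile_interval(values, lower=0.025, upper=0.975):
    """Read a rough interval straight off the sorted bootstrap values."""
    ordered = sorted(values)
    lo_index = int(lower * (len(ordered) - 1))
    hi_index = int(upper * (len(ordered) - 1))
    return ordered[lo_index], ordered[hi_index]


# Hypothetical, deliberately skewed sample (not normally distributed).
sample = [1.2, 0.8, 1.1, 0.9, 1.0, 5.6, 1.3, 0.7, 1.4, 9.8]

medians = bootstrap_distribution(sample, statistics.median)
low, high = percentile_interval(medians)
print(f"observed median: {statistics.median(sample)}")
print(f"rough 95% bootstrap interval: {low} to {high}")
```

The distribution of the median is estimated from the data alone, with no assumption about the shape of the underlying population.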

Dan_Eastwood · May 2, 2023, 7:12pm

Bootstrap is often used when the data may not meet all the assumptions for a particular statistical method. Non-normal data or not completely independent data are common examples where bootstrap might be used.

Rumraket · May 2, 2023, 10:19pm

Interesting. I can’t seem to work out why increasing the size of the data set would tend towards 100% bootstrap support. Can you explain?

Rumraket · May 2, 2023, 10:19pm

As applied to phylogenetic inference it’s this:

"I'm treating the mutation rate as a substitution rate" - Dr. Nathaniel Jeanson Conversation

What @John_Harshman said. But I figure it actually takes a bit of explaining. The bootstrap is basically a test of tree consistency. You have to understand something about how the phylogeny is made in the first place, and then how it’s tested for consistency. As it says in the paper it’s inferred on the basis of 54 nuclear genes from each of those 191 species. It’s important to note that those 54 nuclear genes come from different chromosomes, have wildly different functions, some come from coding regions and some from non-coding regions. Introns, exons, genes that regulate other genes and control development, genes that function as enzymes, genes that take part in DNA replication, etc. Some from X and some from Y chromosomes. So whatever functional constraint you might imagine operates on some gene, it’s rather difficult to see how the same constraint can be operating on another, particularly in a sense where that constraint should force trees inferred from each of those 54 loci to g…

John_Harshman · May 2, 2023, 10:20pm

As Dan explains, it’s a general term for creation of a distribution through resampling with replacement. In a phylogenetic bootstrap, individual characters in a data matrix are resampled with replacement to produce a new data set of the same size as the original one, many times. Each new data set, called a pseudoreplicate, is then analyzed to produce a phylogenetic tree. The bootstrap value for a branch of the original tree is the percentage of pseudoreplicates for which that branch appears. It’s more or less a measure of the consistency of the matrix’s phylogenetic signal.

But it’s also sensitive to the size of the data set.
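
The resampling procedure described in this post can be sketched for the smallest possible case: four taxa, where only three unrooted trees exist. This is a toy under stated assumptions, not the method used in the paper. The alignment is invented, and `best_quartet_split` is a hypothetical stand-in for a real tree search that simply picks the split backed by the most informative sites.

```python
import random
from collections import Counter


def best_quartet_split(columns):
    """Return the four-taxon split supported by the most informative sites.

    Each column is a 4-character string giving the states of taxa A, B, C, D.
    Ties go to the split listed first in `counts`.
    """
    counts = {"AB|CD": 0, "AC|BD": 0, "AD|BC": 0}
    for a, b, c, d in columns:
        if a == b and c == d and a != c:
            counts["AB|CD"] += 1
        elif a == c and b == d and a != b:
            counts["AC|BD"] += 1
        elif a == d and b == c and a != b:
            counts["AD|BC"] += 1
    return max(counts, key=counts.get)


def bootstrap_support(columns, n_replicates=200, seed=1):
    """Percentage of pseudoreplicates in which each split is recovered."""
    rng = random.Random(seed)
    n = len(columns)
    tally = Counter()
    for _ in range(n_replicates):
        # Resample characters (columns) with replacement, same size as original.
        pseudoreplicate = rng.choices(columns, k=n)
        tally[best_quartet_split(pseudoreplicate)] += 1
    return {split: 100.0 * count / n_replicates for split, count in tally.items()}


# Invented alignment: 12 sites favour AB|CD, 5 favour AC|BD,
# 3 favour AD|BC, and 30 are invariant (uninformative).
columns = ["AAGG"] * 12 + ["ACAC"] * 5 + ["AGGA"] * 3 + ["AAAA"] * 30

original_split = best_quartet_split(columns)
support = bootstrap_support(columns)
print("split in original tree:", original_split)
print("bootstrap support (%):", support)
```

The bootstrap value for the original split is simply the share of pseudoreplicates that recover it, which is why it reflects how consistently the matrix points to one answer.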

Dan_Eastwood · May 2, 2023, 10:29pm

That makes sense. A larger dataset will be less dependent on one (or a few) observations, and less likely to find different branching.
A larger data set, other things being equal, should also have more complete data on the true branching.

John_Harshman · May 2, 2023, 11:06pm

I can offer a hypothesis, though the main support is just empirical: different large data sets, or the same data sets analyzed in different ways, can produce 100% bootstrap results for contradictory branches. I’m guessing that this is because very small biases in some small percentage of the data are more likely to be well sampled in a large data set. If most of the data are indecisive, that small bias can control the outcome.

Faizal_Ali · May 3, 2023, 5:31pm

Thanks for the explanations, everyone!

swamidass · May 3, 2023, 5:50pm

It’s related to the law of averages.

Rumraket · May 4, 2023, 2:12am

Assuming the data that implies conflicting branches isn’t sampled together, in the bootstrap, you mean? Otherwise I don’t follow.

Why would it be more likely to be sampled in a large rather than small data set? That seems counterintuitive to me. If some X% of the data has some bias and the rest is indecisive, and we’re randomly sampling this data set when building a new alignment, why would you be more likely to sample this biased portion when you pull from an alignment of a larger fraction of the genome, than if you pull from a smaller fraction?

Mark_Sturtevant · May 4, 2023, 9:28pm

Thanks for the link to the full article. Figuring out the deep phylogeny of this group has been very challenging, and attempts to do so based on morphology do not work well. But genetic analysis has given us a tree that suggests that early branching of placentals happened while the northern land masses were breaking up, beginning ~ 130 million years ago. As a result we get large clades like Afrotheria, originating around Africa, and Laurasiatheria, which is a clade whose origins map to what is now Asia. There is this other paper that goes over this:
http://faculty.chas.uni.edu/~spradlin/evolution/Readings.blocked/mammaltrees.pdf, and Wikipedia also gives a comprehensible review of this matter under placental mammals.
A nice pattern that emerges is that, besides divergent evolution within the different northern land areas, there is also some convergent evolution between isolated land areas. For example: moles within the Laurasiatheria are similar to the golden mole within the Afrotheria, shrews within the Laurasiatheria are much like the shrew tenrec within the Afrotheria, and pangolins within the Laurasiatheria resemble the aardvark within the Afrotheria.

John_Harshman · May 4, 2023, 9:29pm

Let us suppose that there’s a slight bias in some percentage of the data and again that the rest of the data are indecisive. If that bias is slight, it might take a lot of it to provide a strong signal. A small data set might not even contain any of those data or, if it did, not enough that the average pseudoreplicate had enough of it to be decisive. But resampling a large data set would be almost certain to include that biased data, in its actual proportion, in every pseudoreplicate. That is, there would be a smaller variance among pseudoreplicates.

Again, I can’t say that this is the actual explanation, but it seems plausible to me.
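
The hypothesis above can be illustrated with a toy simulation. Every number here is an assumption chosen for illustration: each site is treated as a near coin flip between two conflicting splits, X and Y, with a small tilt (`bias`) toward X, and a pseudoreplicate "recovers" X when more than half of its resampled sites favour X. The names `simulate_sites` and `bootstrap_support_for_x` are hypothetical, and nothing here models the actual data in the paper.

```python
import random


def simulate_sites(n_sites, bias, rng):
    """Each site weakly favours split X (True) or Y (False).

    `bias` is a small tilt toward X; with bias=0 the data are indecisive.
    """
    return [rng.random() < 0.5 + bias for _ in range(n_sites)]


def bootstrap_support_for_x(sites, n_replicates, rng):
    """Percentage of pseudoreplicates in which split X wins a majority."""
    n = len(sites)
    wins = 0
    for _ in range(n_replicates):
        resampled = rng.choices(sites, k=n)  # resample with replacement
        if sum(resampled) * 2 > n:
            wins += 1
    return 100.0 * wins / n_replicates


rng = random.Random(42)
bias = 0.05  # assumed: 55% of sites favour X, 45% favour Y
results = {}
for n_sites in (50, 500, 20000):
    sites = simulate_sites(n_sites, bias, rng)
    results[n_sites] = bootstrap_support_for_x(sites, n_replicates=100, rng=rng)
    print(f"{n_sites:>6} sites: bootstrap support for X = {results[n_sites]}%")
```

In this sketch the same slight tilt gives variable support in the small data sets and saturates at 100% in the large one, because the variance among pseudoreplicates shrinks as the number of sites grows. The support says nothing about whether the tilt reflects true history or a systematic bias.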

Tim · May 5, 2023, 3:05am

Is the problem here one with the statistical method, or with the data though?

In this case, the only ‘signal’ would appear to be due to the “bias”. If that signal is statistically significant, shouldn’t a valid statistical method detect it – whether the signal is due to a genuine effect or bad data? Garbage in, garbage out – but it’s the statistical method’s job to treat the data as valid.

(I seem to remember methods specifically designed to be insensitive to outliers, Ridge Regression is the phrase that comes to mind – but it’s so long ago that my memory is hazy, but that’s a whole different kettle of fish.)

John_Harshman · May 6, 2023, 5:02pm

The purpose of phylogenetic bootstrapping is to determine the consistency of the data and the strength of the common signal. If the data are not in fact consistent and there is no common signal, there ought not to be a high bootstrap value. It’s conceivable that jackknifing would do better. It would be nice if @Joe_Felsenstein would weigh in on all this.

Rumraket · May 6, 2023, 5:03pm

Ahh yeah I get it now. Even if there are contradictory branches in the data, they’re not likely to get sampled all that differently between each pseudoreplicate. Makes sense.

system · May 13, 2023, 5:04pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.
