Heliocentric Certainty Against a Bottleneck of Two?

Maximum or Median TMRCA (and TMR4A)?

This discussion of TMR4A brought us to this phenomenal paper, which inferred phylogenies (and ancestral recombination) across the whole genomes of 54 individuals. This is really hard to do: it took quite a bit of methodological innovation and a lot of computer time.

The key data the authors report are in this table, which gives the maximum TMRCAs across the genome. The values are in units of “generations,” where a generation is 25 years.

Multiplying by 25, we find TMRCAs ranging from 7.5 million to 15 million years ago (mya). Dividing the largest of these by 4 (to approximate a TMR4A) would seem to put a limit on a bottleneck at 3.75 mya. Is this, however, valid reasoning?
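To make the naive arithmetic explicit, here is a minimal sketch of the conversion. The generation counts are hypothetical: they are back-calculated from the 7.5 to 15 mya range quoted above, not copied from the paper's table, and the function names are only for illustration.

```python
YEARS_PER_GENERATION = 25  # generation time stated in the text


def generations_to_mya(generations, years_per_generation=YEARS_PER_GENERATION):
    """Convert a time in generations to millions of years ago (mya)."""
    return generations * years_per_generation / 1e6


def naive_tmr4a(tmrca_mya):
    """Rule-of-thumb TMR4A used in this thread: one quarter of the TMRCA."""
    return tmrca_mya / 4


# Hypothetical endpoints, back-calculated from the 7.5-15 mya range (assumption).
low_generations, high_generations = 300_000, 600_000

low_mya = generations_to_mya(low_generations)
high_mya = generations_to_mya(high_generations)

print(f"TMRCA range: {low_mya} to {high_mya} mya")
print(f"Naive TMR4A from the largest TMRCA: {naive_tmr4a(high_mya)} mya")
```

This only reproduces the arithmetic of the naive argument; it says nothing about whether applying the divide-by-4 rule to a maximum is justified.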

Four things are important to keep in mind:

  1. These are maximum TMRCAs reported from a genomewide scan, which, by definition, are not representative. They are much larger than the average or median TMRCA, which is what we really want.
  2. These are estimated TMRCAs, which are drawn (essentially) from a random distribution. Even if all true TMRCAs were less than 1 mya, for example, we would still expect to see estimated TMRCAs greater than 1 mya. In fact, if the estimation error is roughly centered on the true value, about 50% of estimated TMRCAs would be expected to be greater than the true TMRCA.
  3. In genomewide analysis, outlier TMRCAs are likely erroneous; they are the places where the assumptions required for dating are most likely violated. For example, several of these positions are subject to balancing selection (and therefore need to be dated differently), and the top positions have high copy number variation (CNV). High copy number variation is an example of a type of sequence that will be erroneously given a high TMRCA just because the method used here does not take it into account.
  4. As we will see, for outlier TMRCAs, our approximation of TMR4A breaks down. For balancing selection, the gap between TMR4A and TMRCA can be arbitrarily large. The same is true for trans-species variation. We really need to measure the TMR4A directly to know what it is in these cases.
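Points 1 and 2 can be illustrated with a toy simulation. This is a sketch under loud assumptions, not the paper's method: every locus is given the same true TMRCA of 1 mya, and estimation error is modeled as multiplicative lognormal noise with median 1 (so half of the estimates land above the truth). The noise level `sigma` and the number of loci are arbitrary illustrative choices.

```python
import random
import statistics


def simulate_estimates(true_tmrca_mya, n_loci, sigma, seed=0):
    """Return noisy TMRCA estimates for n_loci loci sharing one true TMRCA.

    Assumption: the error is lognormal with median 1, so each estimate is
    equally likely to fall above or below the true value.
    """
    rng = random.Random(seed)
    return [true_tmrca_mya * rng.lognormvariate(0.0, sigma) for _ in range(n_loci)]


TRUE_TMRCA = 1.0  # mya, identical at every locus (toy assumption)
estimates = simulate_estimates(TRUE_TMRCA, n_loci=10_000, sigma=0.5)

fraction_above = sum(e > TRUE_TMRCA for e in estimates) / len(estimates)
median_estimate = statistics.median(estimates)
max_estimate = max(estimates)

print(f"fraction of estimates above the truth: {fraction_above:.3f}")
print(f"median estimate: {median_estimate:.3f} mya")
print(f"maximum estimate: {max_estimate:.3f} mya")
```

In this toy setup the median sits close to the true value while the maximum lands well above it, even though no locus is truly older than 1 mya. That is the sense in which a genomewide maximum is not a usable bound.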

So moving from this table to a limit on bottleneck time is not valid reasoning. There are methodological and statistical problems with that leap. Using the maximum TMRCA for this purpose is the definition of cherry-picking data.

What we really need is the distribution of TMRCAs (and eventually TMR4As) across the genome. The good news is that we found exactly that in the supplementary data of the paper. From there, we can get to our first plausible genomewide estimate of TMR4A.
