Heliocentric Certainty Against a Bottleneck of Two?

Richard Buggs writes:

This is exactly my point. Thank you for stating it so concisely. To my mind, the way ahead would be to write a programme that computes the TMR4A for each haplotype block of the human genome, and work out a reasonable time frame using data from all blocks. Until that has been done, I do not think we can say that the bottleneck hypothesis has been rigorously tested.
Adam, Eve and Population Genetics: A Reply to Dr. Richard Buggs (Part 1) - #264 by RichardBuggs - Faith & Science Conversation - The BioLogos Forum

Now we are prepared to take him up on that challenge. It will be really interesting to see what the data shows. Here is a key schematic to show how this relates with the phylogenies we collected. Moreover, beware that I have reversed numbering of coalescents from standard notation (as I learned it) for clarity here.

image

Notice that the 4th coalescent is the TMR4A. That is what we are after. Also notice how “bottom heavy” the phylogeny is. The trunks (at the top here) are usually long, but the branches are usually short. We are after the age of fourth node in the tree, the 4th coalescent.

A Genome-Wide Estimate of TMRCA

So first, off, we can take a look at the TMRCA distribution. Here, the CDF is on the left and the PDF of the distribution is on the right. Time 0 is present day and time 3,000 at the end of the graph is 3 mya. The jagged black line is the CDF of the data, which is a step function because it is quantized (not smooth). The smooth colored line is a sigmoid function fit to the CDF (in log space), and the colored line dropping down shows where the median estimate is based on the curve. Using this method, the estimated genome-wide TMRCA is 1323 kya, or 1.3 million years.

For reference, this is really close to the value computed from S17. Overlaying the two figures, it looks as if the genome-wide distribution is skewed a bit to the right at the top portion, which increases our estimates by a small amount. That is to be expected, as S17 used a small set of hand checked neutral regions. These genomewide estimate includes everything. I’m including a graph approximately scaled to S17 for reference (mind the change in units).

A Genome-Wide Estimate of TMR4A

We can do the same thing for TMR4A. The graph here is on the same scale, for comparison to the prior figure. Using this method, the estimated genome-wide TMR4A is 431 kya, well within the 500 kya that Richard Buggs hypothesized as a point beyond which we may not have as much confidence in ruling out a bottleneck.

We want to be sure this is a robust result, not overly affected by all the quantization errors. Notice that there is higher certainty in the TMR4A estimate than the TMRCA estimate because:

  1. The variance of the TMR4A is much lower than TMRCA. Visually, we see that in the reduced spread of the TMR4A PDF (right) and increased steepness in the CDF.

  2. The closer fit between the fit curve and the underlying data.

  3. Notice also that the outlier TMR4A’s are much less extreme. There are fewer outliers, and the max is not too extreme. It is high, but not nearly as bad an outlier as the highest TMRCA (13 million). That is good sign, suggesting we are measuring real distribution around a real quantity of the data. We might look at these more closely later, but that is not our focus here.

Just to be sure, and raised confidence further, I computed TMR4A another way. this time averaging three estimates together. If our first estimate was good, it should be robust to these sorts of refinements.

Improved_TMR4A = ( TMR3A * 0.74 + TMR4A + TMR5A * 1.26 ) / 3

The estimates were defined earlier (Heliocentric Certainty Against a Bottleneck of Two? - #5 by swamidass). Here we use the 3rd and 5th coalescents to estimate the 4th coalescent, averaging them all together. Below, we see this does a lot to smooth out the distribution; but it does not succeed completely. This improved estimate is 437 kya, nearly identical to the first estimate. This increases our confidence that this is a reasonable estimate; it is robust to manipulations like this.

A large number of variations like this were attempted, and they gave similar results. That is good news. This is not the last word on TMR4A, but it is pretty good “first” word on the matter.

What is the Confidence Interval?

So we arrive at a genomewide TMR4A estimate of about 430 kya. Given all the uncertainties involved, I would say that the confidence interval here is probably +/- 100 kya. That is based on the recognition:

  1. ArgWeaver is merely approximating the TMRCA, with a quantization. That certainly adds error here.
  2. Mutation rate is assumed constant across 10kb windows, but is known to vary on shorter distance scales.
  3. Most importantly, we expect mutation rate to be different at times in the past. For this reason, we cannot know it for sure. Especially as we are considering times in the deep past (say 300 kya) but not so deep that we can expect the central limit theorem to save us (say if we are going 5 mya).
  4. This does not correct for balancing selection, by making sure we are only looking at neutral regions of the genome. This might skew the results, but it also protects against bias and makes this analysis more reproducible.
  5. At the moment, all coalescents are weighted equally. This biases the averages to high recombination areas, which might be biased towards upward errors in TMRCA estimates. It would be wiser to average weighting by the length of the DNA segment to which the phylogeny applies.
  6. There may be other ways to improve these estimates. Likewise, the quantization could introduce systematic bias that is not fully corrected for in our approach. Using a median of a fitted curve helps, so does averaging multiple estimates of the coalescent. It might be better, however, to find a way to refine the trees further. It’s not clear, however, how to do this. It may be infeasible with current approaches.

Based on these concerns, I can give my expert opinion, which is nonetheless only an opinion. I’d estimate that there is about 20% error one way or another, at least.

Keeping in mind that some scientists think that Homo sapiens might arise as early as 340 kya, it is plausible (but not proven) to imagine a bottleneck of one couple at that time. Some will debate the exact reasoning here. This, I should add, is not my view of our origins. This, however, is a plausible scientific argument.

Not All the Evidence

It is critical to point out that this is NOT all the evidence at play. For example, this is not an estimate of effective population size (which would look at the distribution of coalescents over time). It also ignores trans-species variation. Trans-species variation is shared by humans and chimps, like that we see in the MHC, and might provide the strongest evidence against a single couple bottleneck. Those parts of the genomes are some of the outliers. To full consider this hypothesis, we have to think about that data too.

We will consider these two types of evidence in time. However, it is important not to overstate what has been shown here. The key point I’ve put forward here is that we should not overinterpret TMRCA as evidence against a bottleneck in our lineage before 0.5 mya. A very recent bottleneck (say 50 kya) seems impossible, but a more ancient bottleneck of our ancestors (if very brief) at 500 kya might be consistent with the evidence. Sometime before 500 kya, this couple would not be Homo sapiens, but they might (exact dates debatable) be the common ancestor of Homo sapiens, Denisovans, and Neanderthals.

Moreover, data from TMRCA on mitochondria and Y-chromosomes should put a bound at least at 100 kya, after which a bottleneck is not very likely. So we still are very certain that a very recent bottleneck is inconsistent with the evidence.