Heliocentric Certainty Against a Bottleneck of Two?

With the first claim set aside for now, the second claim is more interesting…

Our ancestors as a whole do not dip down to a single couple between 300 kya and 3 mya with very high confidence, but maybe not as high.

Was there a bottleneck among our ancestors as a whole, both Homo sapiens and non-sapiens. What is the timeline over which we are certain there was no bottleneck? 100 kya? 500 kya? 2 mya? or 13 mya? How certain should we be in this evidence?

TMRCA or TMR4A?

Population size estimates in the past from genetic sequences are fairly involved. It is not easy to explain concisely how these estimates are made, and that will be left for a future effort. However, one common way a bottleneck is argued against is using the estimated TMRCA of segments of DNA. To be clear, this is not really the strongest evidence against a single couple bottleneck. For that we have to look elsewhere.

Dennis, nonetheless, pointed to this paper as evidence for no bottleneck in the last 3.2 million years.

image

http://www.genetics.org/content/172/2/1139.long

This is a phylogeny that shows the history of a piece of DNA, which appears to have a history going back 3.3 million years. If we take the tip (at 3.2 million years), that is were we expect to find the genetic common ancestor. That is how some put a bound on a bottleneck.

However, if we are looking to put a minimum time on a bottleneck, the bottleneck couple certainly could have been heterozygous, carrying 4 alleles at each loci (in their autosomal genome). If that is the case, it is not the time to most recent common ancestor (TMRCA) that matters, but the time to most recent four alleles (TMR4A). That would correspond to the horizontal line which would correspond to when 4 lineages arise (which is in the much more recent past).

In this case, also, there is not enough data to precisely visualize TMR4A in the phylogeny. When there is not enough data, it is common to branch the phylogeny more than twice. Taking that uncertainty into account, the TMR4A It could easily be within 0.5 million years ago.

Estimating TMR4A

TMR4A is not regularly computed in genetics studies. Usually only TMRCA values are put forward. Is there a way to estimate TMR4A from TMRCA? One heuristic is that, on average:

TMR4A = TMRCA / 4 

I put this forward in the conversation, but did not have the math worked out. It was an educated guess. Adam, Eve and Population Genetics: A Reply to Dr. Richard Buggs (Part 1) - #248 by Swamidass - Faith & Science Conversation - The BioLogos Forum

I was honestly not comfortable with that, even after looking at some phylogenies that seem to validate the general number. So I wrote up some code to compute what the average expectation is. Assuming constant population size, we expect on average:

TMR4A = 0.24 * TMRCA

Which is almost exactly what I estimated: 1/4. Keep in mind, however, that this is a noisy estimate. And it will not hold up in practice. In a future post, I test this empirically, and find that on the ArgWeaver dataset that the actual relationship is closer to:

TMR4A = 0.38 * TMRCA

Regardless, this is only an initial heuristic. In fact, we will see that TMR4A is much more stable than TMRCA, and stays in a tighter range. The higher TMRCA is the lower the ratio TMR4A / TMRCA. So, this quite substantially reduces the time bound on a bottleneck using the TMRCA approach.

There can be a wide gap between this estimate and the true TMR4A. It has to be estimated directly from the data. For example, in the figure, the TMR4A is actually less than TMRCA / 4 (450 kya / 3.2 mya ~= 0.14, not 0.25), because this TMRCA is a high outlier. I caution strongly against “eyeballing” these graphs to get anything more than a general understanding of what we are calculating. Ultimately, we need to compute TMR4A across the whole autosomal GENOME, and understand that distribution to understand what the data shows.

The bad news is that this is very hard to do. The good news is that this is becoming easier with new markov chain inferences of recombination graphs. More on that later though.

Note: the move to TMR4A is not justified when using DNA from X-chromosome (where TMR3A would be relevant) or from mitochondrial and or Y chromosome (where TMRCA is just fine). The y-MRCA and m-MRCA do seem to put a bound a single couple bottleneck to sometime after 100 kya to 150 kya.

The Math

Here is some python code to compute the numbers (it requires numpy to run).

# Any large number will do, because the coalescent ages quickly converge.
expected_age = n_kingman_ages(107) 

# What do we multiply TMRCA by to estimate TMR4A?
# 4th coalescent corresponds with TMR4A
print expected_age[3] / expected_age[0]
>>> 0.242990654206

# By what do we multiply the first 6 coalescents to estimate TMR4A?
print expected_age[3] / expected_age[:6]
>>> [ 0.24299065  0.49056604  0.74285714  1.          1.26213592  1.52941176]

TMR4A - Replit

For reference, a phylogeny with 54 diploid individuals has 54 * 2 - 1 = 107 coalescents. There 54 individuals in the Personal Genomics data used by ArgWeaver, which is why I use 107 here. Any large number, however, will do, as this summation converges very quickly.

The functions are defined as:

import numpy as np

def n_kingman_waits(n):
  """the expected wait times / Ne of the 1 through n coalescents. 
  The n-th element is the (n+1) coalecent time, and is the expected time / Ne
  for n+2 alleles to coalesce to n+1. Ne is the effective population size (alleles).
  """
  k = np.arange(n) + 1
  return 4. / (k * (k+1)) 

def n_kingman_ages(n):
  # get the expected ages / Ne of the n-coalescents 
  return np.cumsum(n_kingman_waits(n)[::-1])[::-1]