Heliocentric Certainty Against a Bottleneck of Two?

Richard Buggs writes:

This is exactly my point. Thank you for stating it so concisely. To my mind, the way ahead would be to write a programme that computes the TMR4A for each haplotype block of the human genome, and work out a reasonable time frame using data from all blocks. Until that has been done, I do not think we can say that the bottleneck hypothesis has been rigorously tested.
Adam, Eve and Population Genetics: A Reply to Dr. Richard Buggs (Part 1) - #264 by RichardBuggs - Faith & Science Conversation - The BioLogos Forum

Let’s give it a shot. First, however, we need some data…

Exploring Genome-Wide Phylogenies

I reached out to the authors of the paper under question. To their credit, they responded back immediately, and were very helpful. It is not as common as it should be, but Dr. Adam Seipel from Cold Springs Harbor Laboratories, was immensely helpful, answering several questions about the data.

It turns out that they had already made genome wide phylogenies available on the web, for anyone to download. compgen.cshl.edu/ARGweaver/CG_results/download/. The data looks, utlimately, like this:

image

There were two problems.

  1. The concatenated results file is 424 GB (half a terabyte). Downloading it in on shot takes a prohibitively long time, because downloads are quickly throttled to a snail’s pace.

  2. Processing the trees with my library of choice (dendropy, dendropy.org/) was much slower than expected. To give a sense of the scale, there are 12,500,000 trees covering about 200 bp each, and conditioned to be similar to one another if they are adjacent on a chromosomal arm. That is just for one sample, but there are 200 samples from the model. So this is a lot of data that is hard to process quickly.

The good news is that both these hurdles were manageable. It turns out that a single genome wide sample (about 1.5 GB) can be downloaded (I chose sample from iteration 2400) in pieces (getting past the throttle) and only about three hours of processing time is required (on a super fast compute node rented from Amazon). From here, I extracted the first 50 coalescents in to a 126 MB file.

I’m going explain the data involved, and then dive into the key results.

The Distribution of TMRCA

The first thing to do with new data is to look at its distribution. Here is the distribution (PDF, Probability density function - Wikipedia) of TMRCA’s across the genome. This is the age of the 1st-coalescent, the way I am indexing things.

Oddly, this is not a smooth distribution. It is not spread out, but it looks like the results are quantized. This becomes more clear when we look at the distribution for log TMRCAs. It looks like TMRCAs come in groups, spaced out by about 0.5 log units. That is exactly the case. It turns out that that the way that ArgWeaver (the program used in this study) can work so quickly is by quantizing the results. This substantially simplifies the problem. A close read would have clarified this, but I had not read the paper closely enough. The data however makes this clear.

This is a profoundly important point, as this will introduce substantial error into any individual TMRCA estimates. It highlights again the value of looking at distribution of TMRCAs, not just individual outliers.

The Distribution of TMR4A

We see a similar distribution for TMR4A, which corresponds with the 4th-coalescent. The results are quantized, instead of being nicely distributed across the data. This going to present a problem as we move to estimating the TMR4A across the genome. We have to correct for this quantization somehow.

We can also ask what the relationship between TMRCA and TMR4A is, by plotting the distribution of TMR4A / TMRCA.

Once again we see same quantization, even spaced in log scale. So it seems that ArgWeaver is fixing TMR4A at some multiple of TMRCA. In this dataset, on average, TMRCA * 0.37 = TMR4A. Moreover, we also see a fairly strong relationship between TMR4A/TMRCA and TMRCA. This is evident on both standard and log scale (Pearson R -0.63 and -0.73, respectively), but is clearest in log scale:

image

The key way to understand this is that the larger TMRCA grows the smaller TMR4A is relative to TMRCA. That is an important point. If we care about TMR4A, we cannot interpret large TMRCAs easily. We have to have the underlying data directly to estimate it.

That is, however, exactly what we have, and what we can use now to make an unbiased estimate of TMRCA and TMR4A across the genome.