Heliocentric Certainty Against a Bottleneck of Two?

Systematically Correcting for Errors

One of my good friends in this conversation is Dr. AJ Roberts from Reasons to Believe. She is a cautious and careful voice, and though we have our disagreements I pay close attention to her thoughts and questions.

She raises an important point, which I will step through here. The goal is to show that the median TMR4A estimate is robust to concerns like those AJ raises.

What about the Whole Genome?

AJ makes the important point that this is not the whole genome. Most obviously, it does not include the Y-Chromosome and Mitochondrial DNA. Less obvious to those outside the field, “whole genome sequences” (WGS) are not actually the entire genome. About 10% of the genome (give or take) is not included in WGS. Why is that?

Why isn’t WGS the whole genome?

Some might be concerned about cherry picking: perhaps scientists are excluding data that does not fit some conclusion. That is not the case, however. The real reason is technical. It's very hard to sequence and assemble about 10% of the genome. Some of this DNA is very tightly bound to protein, and we cannot easily separate it. Some of this DNA is very repetitive, which makes assembling a final sequence very difficult. So the problem is not intentional cherry picking, but a real technical challenge.

Still, it's a valid question to ask how much adding this data could spoil our results. Maybe if we saw those missing DNA sequences it would change the results. How much could the results shift?

Adding in 10% Simulated Data

One reason I chose to use a median estimate (rather than mean or mode) is that it is remarkably stable, and also eases error analysis for questions just like this.

Let’s say there is 10% of the genome missing, and we can add that in someday. How much would the estimate be expected to change? Well, if the processes at work on 90% of the genome also apply to this 10%, we would expect to have nearly the same result. That, however, makes a guess about the data we do not have. Honestly, we know this DNA is different, so it is possible it is changing under different rules.

A better way to frame the question is to ask: if we got substantially lower TMR4A estimates from the hidden 10% of the genome, what is the maximum the median TMR4A estimate could possibly change? That turns out to be easy to answer. In this case, the extended data (green simulated, and purple real) would just shift the TMR4A median a little bit to the left. We even simulate a case where the extra 10% of DNA has a TMR4A of 1 generation (unrealistically low)…

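[Image: distribution of TMR4A estimates, with the real data in purple and the extended data including simulated DNA in green]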

This means, at most, adding in 10% more of the genome (green) would decrease the median TMR4A estimate by 20 kya, or about 2 kya per 1% of the genome we add. Essentially, this is a very robust estimate that we do not expect to change much by adding in data like this.
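The worst case can be sketched numerically. The snippet below is a minimal illustration, not the analysis behind the figure: it uses a synthetic log-normal sample as a stand-in for the per-locus TMR4A estimates, and the distribution, its parameters, and the names `worst_case_median` and `extra_fraction` are all assumptions made for illustration. It appends extra values pinned near zero and shows that the new median is simply a lower quantile of the original data.

```python
import random
import statistics

def worst_case_median(values, extra_fraction, floor_value=0.0):
    """Median after appending extra_fraction * len(values) copies of floor_value."""
    n_extra = round(extra_fraction * len(values))
    return statistics.median(list(values) + [floor_value] * n_extra)

random.seed(0)
# Synthetic stand-in for per-locus TMR4A estimates, in kya (NOT real data).
tmr4a = [random.lognormvariate(6.2, 0.5) for _ in range(100_000)]

before = statistics.median(tmr4a)
# Worst case: the extra 10% all has a TMR4A of effectively zero.
after = worst_case_median(tmr4a, extra_fraction=0.10)

# Equivalent view: the new median sits at the (1 - 0.10) / 2 = 45th
# percentile of the original data, because the added values all fall below it.
q45 = sorted(tmr4a)[int(0.45 * len(tmr4a))]

print(f"median before: {before:.1f}")
print(f"median after:  {after:.1f}")
print(f"45th percentile of original: {q45:.1f}")
```

The size of the shift depends on the shape of the real distribution, so the synthetic numbers here are not expected to match the 20 kya figure. What carries over is the bound: however low the added values are, the median cannot fall below the 45th percentile of the existing data.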

What about the Y chromosome and Mitochondrial DNA?

These types of DNA are different for three reasons.

  1. They have no recombination, and for this reason do not require ArgWeaver to estimate the tree.
  2. Because they have no recombination, they are single-locus measurements that each produce only a single estimate, instead of the millions of trees that ArgWeaver gives us. Adding two more estimates to this distribution has essentially zero impact on the median TMR4A.
  3. A couple will only have one copy of each of these pieces of DNA. So we do not want to look at TMR4A here, but at TMRCA. They provide an independent test, and end up being well over 100 kya. However, as single-locus estimates, their error will always be much higher than what we get from the genome-wide TMR4A.
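Point 2 in the list above can be checked with a small sketch. It uses synthetic values rather than real ArgWeaver output: the sample, and the two extra values standing in for single-locus estimates, are made up for illustration. It shows that two extra values move the median by only one position in the sorted list, even when they are placed far from the bulk of the data.

```python
import random
import statistics

random.seed(1)
# Synthetic stand-in for a large set of per-locus estimates (NOT real data).
# 200,000 values keeps this quick; the real distribution has millions.
estimates = [random.lognormvariate(6.2, 0.5) for _ in range(200_000)]

before = statistics.median(estimates)
# Two hypothetical single-locus values, deliberately placed far below the bulk.
after = statistics.median(estimates + [1.0, 1.0])

print(f"median before: {before:.4f}")
print(f"median after:  {after:.4f}")
print(f"shift: {before - after:.6f}")
```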

For these reasons, it's best to use Y-chromosomes and mitochondrial DNA as an independent sanity check, but not to refine these estimates. Likewise, these results on TMR4A, because they represent 12.5 million locations in the genome, are much higher confidence than the Y-chromosome and mitochondrial DNA estimates.

Adjustment for Functional DNA

There are many other adjustments we can discuss. One example that will be interesting to AJ will be the effect of correcting for the percentage of the genome that is “functional”.

If the precise sequence of a part of the genome is important (because it is important to gene regulation or because it encodes a gene), then we will not see as much variation at those points. For this reason, we expect the TMRCAs at these regions to be much lower. One way to correct for this is by estimating the percentage of the genome with low TMRCA we should exclude from the analysis.

For example, we could say about 10% of the genome has enough function that it should be excluded from the analysis. Using our shortcut formula above, that would increase our TMR4A estimate from 495 kya to 515 kya. Not too much of a change, which is good news. I did not make this adjustment, though, because it’s not fair to only make adjustments in one direction. The fact is we can make adjustments in both directions, and they are going to (for the most part) just cancel out.

Of course, those who think that more than 10% of the genome is dense with functionally critical sequence might rule out even more of the genome than 10%. That could end up increasing the TMR4A estimate further.
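To see how the median responds as more of the genome is excluded, here is a minimal sketch under a worst-case assumption: the excluded regions are exactly the loci with the lowest estimates. The data is synthetic (a log-normal stand-in, not the real TMR4A distribution), and the function name `median_after_excluding_lowest` is hypothetical.

```python
import random
import statistics

def median_after_excluding_lowest(values, fraction):
    """Median after dropping the lowest `fraction` of values (worst-case exclusion)."""
    ordered = sorted(values)
    n_drop = round(fraction * len(ordered))
    return statistics.median(ordered[n_drop:])

random.seed(2)
# Synthetic stand-in for per-locus TMR4A estimates, in kya (NOT real data).
tmr4a = [random.lognormvariate(6.2, 0.5) for _ in range(100_000)]

print(f"no exclusion: median {statistics.median(tmr4a):.1f}")
for pct in (10, 20, 30):
    shifted = median_after_excluding_lowest(tmr4a, pct / 100)
    print(f"exclude lowest {pct}%: median {shifted:.1f}")
```

Dropping the lowest fraction f moves the median up to the (1 + f) / 2 quantile of the original data, so excluding 10% lands on the 55th percentile. If the functional regions are not all at the low end, the shift is smaller.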

An Easy Estimate to Adjust

For this reason, our estimate of 495 kya TMR4A is easy to refine. We can use this approximate formula:

495 kya + 2 kya per 1% genome adjustment 

This gives us the maximum the model can change with new data. In fact, the change may be much less. As we can see, this estimate is very stable when excluding large amounts of the genome or adding more data in. We demonstrated this by computing the maximum change in median TMR4A from adding in 10% more of the genome, and applied it to compute how much functional parts of the genome could be biasing results downward.
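The shortcut formula can be written as a small helper. This is only a sketch of the arithmetic in this post: the function and parameter names are hypothetical, and the sign convention (excluding low-TMR4A regions raises the estimate, adding low-TMR4A data lowers it) is my reading of the two cases worked through above.

```python
BASELINE_KYA = 495.0     # median TMR4A estimate from the text
KYA_PER_PERCENT = 2.0    # maximum shift per 1% of the genome, from the text

def adjusted_tmr4a(percent_excluded=0.0, percent_added=0.0):
    """Approximate bound on the median TMR4A (kya) after an adjustment.

    percent_excluded: % of the genome with low TMR4A removed from the analysis.
    percent_added:    % of extra genome with low TMR4A added to the analysis.
    """
    return BASELINE_KYA + KYA_PER_PERCENT * (percent_excluded - percent_added)

print(adjusted_tmr4a(percent_excluded=10))                    # 515.0
print(adjusted_tmr4a(percent_added=10))                       # 475.0
print(adjusted_tmr4a(percent_excluded=10, percent_added=10))  # 495.0
```

The last line illustrates the point about adjustments in both directions: equal and opposite corrections cancel and leave the estimate at 495 kya.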

The good news is that none of these effects are large. Tweaks here are not going to change our estimate much.