Heliocentric Certainty Against a Bottleneck of Two?

There were concerns offered by Richard about errors in inferring recombination.

What About Recombination?

Sometimes ArgWeaver has a hard time identifying recombinations, but these do not have a strong effect on TMRCA estimates. Case in point is S4:

These figures show that when u/p (the ratio of mutation to recombination rate) is near one, some of the of the recombinations are missed (middle bottom). However, look at the figure (right bottom) where it shows the true ARG branch length. We see that the variance of the estimate increases but there is no systematic error that is shifting estimates upwards from the true value. This is a critical point. Remember we are taking the median of the TMR4A, which only depends on where the distribution is centered, not its spread. That is why we chose median, not mean. This graph is very strong evidence that the recombination inference errors are NOT increasing the TMR4A.

In fact, even when recombinations are underidentified (bottom row, middle column), ARG length is still measured about the same, but with higher variance (bottom row, right column). We chose an estimator that does not depend on variance, however, so this has zero impact on TMR4A.

We see the exact same pattern in S7.

The correlation between the true and estimated TMRCA drops a little, from 0.87 to 0.78, but if there is systematic error, it is very low. We cannot tell precisely this graph, but we can from the prior graph. The median of each distribution is going to be very close to each other. Taken with the prior figure, this is evidence that the under inference of recombination is not a major source of error. In fact, these data points show that recombination inference mistakes do not change the average/median TMRCA estimates. Also, TMRCA has much higher variance than TMR4A, and TMR4A will be even less susceptible to these types of errors.

This hypothesis seems false. We can see from figure S4 and S7 that about 50% are less parsimonious than the correct tree (higher TMRCA) and about 50% are more parsimonious (lower TMRCA). Remember, we do not expect the trees to be precisely correct. There are just estimates, and we hope (with good reason) that the errors one way are largely cancelled by the errors the other way when we aggregate lots of estimates.

Finally, we are aggregating a lot of estimates together to compute the TMR4A across the whole genome. This is important, because by aggregating across 12.5 million trees, we reduce the error. While our estimate in a specific part of the genome might have high error, that error cancels out when we measure across the 12.5 million trees. This is a critical point.

The statistics here substantially increases our confidence in these numbers.

Just about any source of error we can identify will push some of the TMRCA estimates up and some of them down. However, because we are looking at the median of all these estimates, this increase in variance does not affect the accuracy much. A great example of this is mutation rates.

A Closer Look

I should also add that the referenced supplementary figures (S4 and S7) appear to be using only 20 sequences. Accuracy improves dramatically as more sequences are added. For the data we used, there were 108 sequences, so we expect better accuracy than the figures shown.

Also, S6 is an important figure, that shows the inferred vs true recombination rates for simulation using the known distribution of recombination rates across a stretch of the genome.

A few things to note about this.

  1. For most of the genome, recombination rate is low (corresponding to high u/p), but only jumps up at recombination hotspots (the places where u/p is low).

  2. The program is good at inferring recombination in most of the genome, but sometimes misses recombinations that make no difference to TMRCA in recombination hotspots.

  3. The model picks up some of the recombinations, but not all of them. Most of the recombination inference errors are there, in the recombination hotspots, which are confined to a very small proportion of the genome.

  4. At recombination hotspots, the trees will span shorter amounts of the genome than the rest of the genome. Trees with low bp span are signature for high recombination rate.

  5. That means that most of the genome has a high u/p and is being estimated accurately, but there is only difficulty at recombination hotspots where u/p is low.

  6. By weighting trees by the number of base pairs they cover, we can dramatically reduce any error that might be introduced by recombination inferences. That’s because recombination hotspots are were the vast majority of the errors are, and these hotpots are just a few percent of the genome.

And that is exactly what I did. Rather than reducing the TMR4A estimate, downweighting the error prone recombination hotspots (by weighting by bp span of trees) increases the TMR4A estimate (from about 420 kya ot 500 kya).

I’m going back over all this to point out I was already thinking about the effect of recombination and correcting for it in a plausible way. There are always sources of error in any measurement. This is no exception. The fact there is error however, does not mean the error is large. Clearly, we are only computing an estimate, but this is a good estimate of TMR4A.

Of note, correcting for recombination errors by downweighting recombination hotspots increases the TMR4A estimate (from about 420 kya about 500 kya). It does not decrease it. Thats because for trees spanning only short segments of the genome, they will be more influenced by the prior. That’s because in short segments of the genome, there is not enough data/evidence to overwhelm the prior, so it takes over. On longer genome segments, the data is strong enough to disagree with the prior. As we have seen, the prior pull the TMR4A estimates downwards on real data. So in the end, reducing the effect of recombination hotspots just increases the TMR4A estimate. This is appropriate, because we want the TMR4A least dependent on the prior.

This may seem surprising, and in conflict with the the S4 and S7 data. It is not. In the S4 and S7 experiments, the prior matched the simulation, and did not pull the results up or down. In the real data, the prior pull the TMR4A estimates down, and pulls them down most in recombination hotspots because their bp spans are smallest. So this counterintuitive effect makes sense as an interaction with the prior and recombination hotspots. This error is important to understand, because unlike most types of errors:

  1. it is biased in one direction (towards artificially lowering TMR4A)

  2. its impact is large (about 70 kya, or about 15% relative effect)

Note, also, that I identified this source of error and corrected for it several weeks ago. Even in my first estimate, I disclosed it was going to be an issue.

Before I looked at the prior, however, I guessed wrong on the direction of the effect. I cannot identify any other sources of error likely to have this large an effect. Also, this adjustment was within my +/-20 confidence interval, which shows even my original estimate was not overstated.

Moreover, I have at this point corrected for it. A better correction might take this further, by just excluding the trees with small bp lengths, thereby excluding all regions where recombination rate is high. This refinement, will certainly increase the TMR4A estimate. I’m more inclined to improve this estimate with a different program first. That would likely have more value in the long run.