ArgWeaver Does Not Assume Large Population
In response to the work done here, there were concerns that ArgWeaver did not allow for bottlenecks of 2, and would be biased against it. RIchard Buggs wrote…
It appears he was drawing upon an observation by Andrew Jones at the DI, who writes:
However, a little digging into how ARGweaver works reveals that it too assumes a constant population, and uses this assumption to assign probabilities to ancestry trees. Therefore, again, it is not clear if it is really appropriate for asking questions about Adam and Eve. The particular reason why it is a problem is a bit technical: coalescence (branching but backwards in time) happens much more slowly in a large population. In a large population, the last few coalescents could take thousands of generations. But what if you have a small number of generations, drawing to a smaller and smaller population and terminating in a single couple? All the lineages will coalesce (down to at most four as explained above) but at a faster rate.
On Prejudiced Models and Human Origins | Evolution News
The program itself does require a population size number, and the authors used 10,000, when it is run. However, this skepticism turns out not to be the correct assessment.
ArgWeaver is using a prior on the trees, that is parameterized by population size (N = 10,000). The language of “assumes a large population size” is just incorrect. It is more accurate to say that is starts with a weak prior belief of a population size of 10,000. It is a weak prior belief, because it is designed to be quickly overcome by data. Here is what it looks like:
Notice that the prior median (black) is at about 100 kya, but the data shows a TMR4A of about 420 kya (as we have shown before). From here, it becomes clear why there is no reason to doubt these results:
-
As a prior, this is not an assumption, but a starting belief that is meant to be overridden by the data. The only way that the ArgWeaver program uses the population size is in computing this prior. Population size is neither simulated nor modeled in the program except for placing this weak prior on population size. Remember, priors are not assumptions or constraints. That is why the measured
-
The ArgWeaver output files tell us the strength of the prior vs. the data, and it is just about 5%. That means the model output is dominated 95% by the data, and not by the prior (as it is designed).
-
The prior distribution for TMR4A is at about 100 kya, but we measured the TMR4A at about 420 kya. That means the data is pulling the estimate upwards from the prior, not downwards.
This last point should end any confusion. To draw analogy, it’s like we measured the weight of widgets, with the weak starting belief that the average weight of these widgets is 200 lb. After weighing several of them, and taking the prior into account, we compute the average weight is 420 lb. The fact we used a prior could be an argument that the real average is greater than 420 lb, but that is not a plausible argument that the true average is less than 420 lb. The prior, in our case is biasing the results downwards, not upwards.
With that in mind Dr. Jones was just mistaken when he writes:
The tool used, ARGweaver, is fantastic in that it combines an enormous amount of real genetic information to model the past genetic history of humans. For this reason it gives the impression of being truly objective, and so when I first read it, I thought he had proved that there could be no bottleneck earlier than 300,000 years…However, a little digging into how ARGweaver works reveals that it too assumes a constant population, and uses this assumption to assign probabilities to ancestry trees.
Given what I have just explained, this is not a reason to doubt the results that I put forward. I do believe this data shows there could be no bottleneck earlier than 300,000 years without either miracles or our ancestors have vastly different mutation rates than us. Both those possibilities, however, are off the table right now.
Technical Details
Getting the prior required spelunking a bit into the ArgWeaver code. The key function is included below, and translated from the original C code here…
def prob_coal_counts( a, b, t, n):
'''The probability of going from 'a' lineages to 'b' lineages in time 't'
with population size 'n'
Original code was written in C and can be found here: https://github.com/mdrasmus/argweaver/blob/fee3d32c10fbf854d39ebdf9650e499fedd485a3/src/argweaver/total_prob.cpp
'''
C = 1.0;
for y in xrange(b):
C *= (b+y)*(a-y)/float(a+y);
s = np.exp(-b*(b-1)*t/2.0/n) * C;
for k in xrange(b+1, a+1, 1):
k1 = float(k - 1); k = float(k)
C *= (b+k1)*(a-k1)/(a+k1)/(b-k);
s += np.exp(-k*k1*t/2.0/n) * (2*k-1) / (k1+b) * C;
for i in xrange(1, b, 1):
s /= i
return s
From here, we can compute the prior distribution of the TMR4A with this function call:
Probability_Denisty = prob_coal_counts(5,4, N_Generations, 10000 * 2)
The multiple of 2 is used because humans are diploid, and the time units used internally by ArgWeaver are generations. Following the paper, we also use a generation time of 25 years and a mutation rate of 1.8e-8 per generation.
Its not entirely clear why the prior comes out this low, at 100 kya. I would have thought it would be closer to 250 kya. There may be a bug in the argweaver code for computing prior, which shifts the prior downward. Even if this is a bug, this code is computing the prior used in the data we computed TMR4A, so it is ultimately what we need to understand how to interpret the impact of the prior on these results.