There was ongoing confusion about how ArgWeaver was using Ne (effective population size) in doing its computation. Was this an assumption that reduced our confidence in the results, or something else?
The information in this post eventually cleared up the confusion for Dr. Buggs.
You are right, that is the key point. I'm glad we are getting a chance to explain it.
That is exactly right. That is what they are doing, with a few bells and whistles. Essentially, this is exactly what MrBayes does (http://mrbayes.sourceforge.net/), except that unlike MrBayes, ArgWeaver can handle recombination. Technically, it is constructing ARGs (ancestral recombination graphs), not phylogenetic trees. ARGs (of the sort ArgWeaver computes) can be represented as sequential trees along the genome. That's a convenient representation that is easier for most of us to wrap our heads around, but the actual entity it is constructing is the ARG.
Except, as you are coming to see, this is not a coalescence simulation at all.
To clarify for observers, there are three types of activities/programs relevant here.
1. Phylogenetic tree inference. Starting from DNA sequences → find the best-fitting phylogenetic tree (or ARG, when using recombination) → assign mutations to legs of the tree (or ARG) → use the number of mutations to determine the lengths of the legs. (see for example MrBayes)
2. Coalescence simulation. Starting from a known population history → simulated phylogenetic trees (or ARGs, when using recombination) → simulated DNA sequences. (see for example ms, msms, and msprime; a minimal sketch follows this list)
3. Demographic history inference. Many methods are used, but one common way is… starting from DNA sequences → infer phylogenetic trees / ARGs (task #1) → compute the coalescent rate in time windows in the past → Ne is the reciprocal of the coalescent rate. (see for example psmc and msmc)
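To make the contrast concrete, here is a minimal sketch of activity #2 using msprime (mentioned above). This is my own illustration, not anything from the ArgWeaver paper, and the parameter values are arbitrary. The point is the direction of the arrows: we start from an assumed population history and produce simulated trees and sequences.

```python
# A minimal sketch of activity #2 (coalescence simulation) with msprime's
# classic API. Note the direction: we START from a known/assumed population
# history (Ne) and PRODUCE simulated trees and sequences -- the opposite of
# ArgWeaver, which starts from sequences and infers the trees.
import msprime

ts = msprime.simulate(
    sample_size=10,           # 10 sampled haploid sequences
    Ne=10_000,                # the assumed effective population size
    length=1e4,               # 10 kb of sequence
    recombination_rate=1e-8,  # per bp per generation
    mutation_rate=1.25e-8,    # per bp per generation
    random_seed=42,
)

# The simulated ARG can be viewed as sequential trees along the genome.
for tree in ts.trees():
    left, right = tree.interval
    print(f"[{left:.0f}, {right:.0f}) TMRCA = {tree.time(tree.root):,.0f} generations")
```

Here the Ne really is a modeling input that drives the results, because simulation runs the arrows in the opposite direction from inference.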
It seems that there was some confusion about what ArgWeaver was doing. Some people thought it was doing #2 or #3, but it is actually just doing #1. The confusion arose because it used a fixed Ne as a parameter, which seemed only to make sense if it was doing #2, and might make its results suspect if it was doing #3. However, ArgWeaver was never designed to do #2 or #3. It is doing #1.
So what is the Ne for?
One of the features of ArgWeaver is that it uses a prior, which is good statistical practice. They were using Ne to tune the shape of the prior, but ultimately this does not have a large effect on the trees. It is only important, in the end, when there are low amounts of data. The prior they used pushed the TMR4A downwards from what the data showed, too.
How This All Gets Confusing…
In defense of the confused, the confusing reality of population genetics is that the same quantities can be expressed in several different units. Often they are all used interchangeably without clear explanation, and it's really up to the listener to sort out from context what is going on.
This is all possible because of that key equation:
D = R * T
However, it means we have flexibility in the units we choose to measure the lengths of the legs of a phylogenetic tree. To help explain, let's go back to a figure from much earlier in the conversation:
[Figure: a phylogenetic tree of human DNA sequences, with mutations shown as dots on the legs and a scale bar in years]
In this figure, the dots are mutations assigned to legs in the tree, the scale bar is in units of time (years in this case), and the leaves of the tree are observed DNA sequences obtained from actual humans. I've seen several units of tree length pop up in this conversation and the literature…
- Number of mutations (dots in figure, or D in my formula)
- Years (scale bar in figure)
- Generations (argweaver)
- Coalescence units (number of mutations / sequence length, or D in my formula)
A critical point is that in ArgWeaver the mutations are observed in the data, and the number along each leg is used to estimate the time using the formula. This is just unit conversion, provided we specify the mutation rate, the length of the sequence, and (sometimes) the generation time. So these units are interconvertible once we settle on the mutation rate. If we express lengths as coalescence units or number of mutations, we do not even need to specify a mutation rate, as this is a fundamental property of the data itself.
Though, as we have discussed, we have reasonable estimates of mutation rates determined by direct measurement. ArgWeaver, for example, uses a generation time of 25 years / generation and a mutation rate of 1.25e-8 / bp / generation, which is equivalent to a mutation rate of 0.5e-9 / bp / year. Knowing these values, we can easily convert back and forth between its estimates and the number of mutations observed in the data.
Maximum Likelihood Estimation (MLE) of Lengths
One of the easiest ways to estimate a leg length is with an MLE estimate. Let's imagine we observe 10 mutations in a 10,000 bp (1e4) block. For illustration, we can convert this into all the units we've mentioned, using the ArgWeaver values (a sketch of the arithmetic follows this list).
- 1e-3 coalescent units (or 10 mutations / 1e4 bp).
- 2,000,000 years (1e-3 coalescent units / 0.5e-9 mutation rate per year)
- 80,000 generations (1e-3 coalescent units / 1.25e-8 mutation rate per generation)
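To make the arithmetic explicit, here is the same conversion as a few lines of Python (my own sketch, using the ArgWeaver values quoted above):

```python
# Converting an MLE branch length between units, using ArgWeaver's values.
MU_PER_GEN = 1.25e-8                  # mutations / bp / generation
GEN_TIME = 25.0                       # years / generation
MU_PER_YEAR = MU_PER_GEN / GEN_TIME   # 0.5e-9 mutations / bp / year

mutations, length_bp = 10, 1e4        # 10 mutations observed in a 10,000 bp block

coalescent_units = mutations / length_bp      # 1e-3, a property of the data alone
years = coalescent_units / MU_PER_YEAR        # 2,000,000 years
generations = coalescent_units / MU_PER_GEN   # 80,000 generations

print(coalescent_units, years, generations)   # 0.001 2000000.0 80000.0
```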
In actual trees, it is a little more complex, because branch points have multiple legs. In this case, we are going to average lengths computed across the data in each leg; we are building an ultrametric tree (the distance from the root to each leaf is the same). In this application, the ultrametric constraint makes a lot of sense because we all agree these alleles are related, and it gives us a way to pool data together to get a higher confidence estimate that is not sensitive to population-specific variation in mutation rates.
Nonetheless, these units are so trivially interchangeable that they are not consistently used. While coalescence units are the most germane to the data, they are also the most arcane. It is common for programs to use different units to display results more understandably. ArgWeaver and msprime, for example, use "generations."
Maximum A Posteriori (MAP) Length
MLE is great when we have lots of data, but it is very unstable when there are only small amounts of data. For example (a short sketch follows this list)…
- What if the number of bp we are looking at is really small, say exactly zero? In this case, what is the mutation rate? 0 mutations / 0 bp is mathematically undefined, and this creates real problems once we take recombination into account: some trees can end up spanning 0 bp in high-recombination areas.
- What if the number of bp is just 100, and the number of observed mutations is zero? What is the mutation rate then? From the data alone we would say zero, but that is not true. We know it is low, but it's not zero.
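A two-line sketch makes the failure mode obvious:

```python
def mle_length(mutations, length_bp):
    # The MLE of branch length in coalescent units is just mutations / bp.
    return mutations / length_bp

print(mle_length(0, 100))  # 0.0 -- claims the branch has exactly zero length
print(mle_length(0, 0))    # ZeroDivisionError -- mathematically undefined
```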
So how do we deal with these problems? One way is to add a weak prior to the mutation rate computation. There is a whole lot of math involved in deriving this analytically in a formal way (using a beta prior), but I'll show you a mathematically equivalent solution that uses something called pseudocounts.
With pseudocounts, we preload the estimate with some fake data: pseudo data. If the mutation rate is 0.5e-9 / bp / year and we think this leg should be about 10,000 years long, we can use this to make our fake data. In this case, we will say the fake data is a 100 bp stretch where we observed 0.0005 mutations (100 bp * 10,000 years * 0.5e-9). This is fake data, so we can make fractional observations like this. We choose just 100 bp so that the pseudo data is easily overwhelmed by the actual data.
Now, we estimate the mutation rate by looking at the data + pseudo data, instead of the data alone. If, for example, we are looking at no data at all, we end up with a length of 10,000 years instead of the nasty undefined 0/0 we get with the MLE. Likewise, consider a real tree over a 2,000 bp region where 3 mutations are observed (a sketch follows this list):
- We can make an MLE estimate of the length in coalescent units, at 0.0015 (or 3 / 2000), which is equivalent to 3 million years.
- We can also make a MAP estimate of its length (using our pseudocounts), at 0.001428 (or 3.0005 / 2100), which is equivalent to about 2.86 million years.
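Here is the same computation as code, continuing my pseudocount illustration (a sketch of the technique, not ArgWeaver's actual implementation):

```python
# A MAP estimate via pseudocounts. The pseudo data encodes a weak prior:
# a 100 bp stretch carrying the 0.0005 mutations expected on a
# 10,000 year leg at 0.5e-9 mutations / bp / year.
MU_PER_YEAR = 0.5e-9
PSEUDO_BP = 100
PSEUDO_MUTS = PSEUDO_BP * 10_000 * MU_PER_YEAR   # 0.0005 fake mutations

def map_length(mutations, length_bp):
    return (mutations + PSEUDO_MUTS) / (length_bp + PSEUDO_BP)

mle = 3 / 2000                  # 0.0015 coalescent units
map_est = map_length(3, 2000)   # 0.0014288... coalescent units

print(mle / MU_PER_YEAR)                # 3,000,000 years
print(map_est / MU_PER_YEAR)            # ~2,860,000 years
print(map_length(0, 0) / MU_PER_YEAR)   # 10000.0 years -- no more 0/0
```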
There are a few observations to make about this example.
- These numbers can be converted into other units as discussed above. However, they are all based directly on observed data: the number of mutations in a region.
- The MLE estimate and the MAP estimate are pretty close. The more data there is, the closer they will be; they are asymptotically equivalent (see the sketch after this list).
- Even though our prior was 10,000 years, it is totally overwhelmed by the data in this case, giving an estimate of millions of years. We still see a small imprint of the prior on the MAP estimate, pulling it downwards compared to the MLE. Because the prior sits below the data, we know the MLE estimate will be higher than the MAP estimate in this case.
- Only a few mutations are enough to increase the estimate of the length substantially, which is why individual estimates have very high error (they will land both above and below the true value). We really need to see estimates from across the whole genome. This example is not quite typical (it is just for illustration): 3 mutations in a tiny stretch of 2,000 bp is a really high number of mutations.
- In the end, we want to choose a prior that has little impact on the final results, but helps us in the corner cases where the MLE estimate blows up. That is why we use a weak prior (low pseudocounts).
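To see both the asymptotic equivalence and the weakness of the prior in action, we can scale up the same data (this sketch continues from the code above):

```python
# As the data grows (same mutation density, more sequence), the MAP and MLE
# estimates converge, and the weak prior's imprint vanishes.
for scale in (1, 10, 100, 1000):
    muts, bp = 3 * scale, 2000 * scale
    mle = muts / bp
    map_est = (muts + PSEUDO_MUTS) / (bp + PSEUDO_BP)
    print(f"{bp:>9} bp: MLE = {mle:.6f}, MAP = {map_est:.6f}")
# 2000 bp: MLE = 0.001500, MAP = 0.001429
# ...by 2,000,000 bp the two agree to better than one part in 10,000.
```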
This is just an illustration, designed to be easy to understand without requiring statistical training. It is not precisely how ArgWeaver works, but it is a very close theoretical analogy.
ArgWeaver Works Like MAP
ArgWeaver works very much like a MAP estimate, and our median TMR4A estimate is very much like a MAP estimate of TMR4A. Here is how what ArgWeaver does differs from the MAP procedure described above…
- The MAP and MLE estimates are designed to find the mode of the distribution, but we are using the median instead. The median is more stable than the mode, and there are also some theoretical reasons for choosing it here (it is an unbiased estimator in this case). Also, as we will see, using the median improves our estimate of error tolerance.
- ArgWeaver is not making a single MLE or MAP estimate (as described above). Instead, it is sampling ARGs based on fit to the data (the likelihood) and the prior. This is called Markov chain Monte Carlo (MCMC), and it is closely related to a MAP estimate when a prior is used in the sampling (as it is here). A toy sketch of this logic follows this list.
- The ArgWeaver prior is not implemented using pseudocounts; instead, it uses an explicit prior distribution. Using a prior distribution (rather than pseudocounts) is the preferred way of doing this, as it is less ad hoc, more flexible, has clear theoretical justification, and clarifies upfront the starting point of the algorithm.
- The ArgWeaver prior does not use a fixed time (we used 10,000 years above), but a range of times. This is where the Ne comes in: they use the distribution of times expected from a fixed population of 11,534. I have no idea why they chose such a specific number, and note that I misreported this as 10,000 earlier.
- The ArgWeaver prior is on the time of coalescence, not the length of a single leg in the tree. This is a subtle distinction, because the TMR4A is actually a weighted sum of several legs in the tree. The prior ArgWeaver uses says that we expect (before looking at data) the TMR4A (which is a sum of leg lengths in the tree) to be at about 100 kya. As implemented, it is a weak prior, and it is overwhelmed by the data. Ultimately, the tree lengths computed by ArgWeaver are not strongly influenced by the prior.
- Though I have explained this as actions on trees, ArgWeaver applies this to branch lengths on the ARG (the ancestral recombination graph). This is important because ARGs end up using more information (longer stretches of sequence) to estimate each length than naively estimating phylogenetic branch lengths independently for each tree. The trees we have been using are an alternative representation of an ARG that is less efficient, but easier to use for many purposes (like estimating TMR4A).
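To illustrate the MCMC point above, here is a toy Metropolis sampler for a single branch length. This is my own sketch; ArgWeaver's actual sampler works on whole ARGs, but the statistical logic of sampling in proportion to likelihood × prior and summarizing with the median is analogous. It uses a Gamma prior whose posterior mean reproduces the pseudocount estimate from earlier.

```python
# Toy Metropolis sampler: draw branch lengths (in coalescent units) in
# proportion to likelihood * prior, then summarize with the posterior median.
# NOT ArgWeaver's sampler -- just the same statistical idea in miniature.
import math, random

MU_PER_YEAR = 0.5e-9
BP, MUTS = 2000, 3         # the same 2,000 bp / 3 mutation example
ALPHA, BETA = 0.0005, 100  # Gamma prior matching the pseudocounts above

def log_posterior(theta):  # theta = branch length in coalescent units
    if theta <= 0:
        return -math.inf
    log_lik = MUTS * math.log(theta * BP) - theta * BP        # Poisson, no constant
    log_prior = (ALPHA - 1) * math.log(theta) - BETA * theta  # Gamma, no constant
    return log_lik + log_prior

random.seed(0)
theta, samples = 0.001, []
for _ in range(100_000):
    prop = theta + random.gauss(0, 0.0005)  # symmetric random-walk proposal
    if math.log(random.random()) < log_posterior(prop) - log_posterior(theta):
        theta = prop                        # Metropolis accept
    samples.append(theta)

samples.sort()
median = samples[len(samples) // 2]
print(f"posterior median: {median:.6f} coalescent units "
      f"= {median / MU_PER_YEAR:,.0f} years")  # roughly 2.5 million years
```

Notice that the weak prior (centered near 10,000 years) barely restrains a posterior in the millions of years, which is the same behavior we saw with the pseudocounts.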
In the end, to ease interpretation, ArgWeaver reports results in "generations," but it is converting using the equations I have already given. So we can easily convert back and forth between any of these units. Most importantly, at its core, we are just using the fundamental formula:
D = R * T
Mutational distance is the product of mutation rate and time. That is all that is going on here. That is what enables the conversions.
The fact that ArgWeaver makes the surprising decision to use Ne to parameterize its weak prior is just a non-issue. As I have explained, the prior it uses for TMR4A is lower than the TMR4A the data shows, so it is pulling the estimate down anyway. Getting rid of it would only increase the estimate, and only by a small amount. MAP estimates, moreover, are considered vastly superior to MLE estimates, so it makes no sense to doubt this result for using a better statistical technique.
A Prior Is Not an Assumption
It should be clear now why it is just incorrect (despite that footnote in the paper) to call a prior an assumption. It is also incorrect to say that ArgWeaver is "simulating" a large population. All it is doing is using a weak prior on the tree lengths, and that is a good thing to do: it makes the results more stable.
The language of prior and posterior is chosen intentionally. The terms are defined in relation to taking the data into account. In Bayesian analysis, the prior is updated by the data into the posterior. Then the posterior becomes the new prior, and we can look at new data to update it again. So priors, by definition, are not assumptions. They are starting beliefs that are updated and improved as we look at more data. It is simply an error to call them assumptions.
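A tiny continuation of the pseudocount sketch shows what this updating means in practice: the posterior from one batch of data becomes the prior for the next, and the result is the same as analyzing all the data at once.

```python
# Sequential Bayesian updating, in pseudocount form: posterior -> new prior.
prior_muts, prior_bp = 0.0005, 100   # the weak prior from before

# Update with a first batch of data (3 mutations / 2,000 bp)...
post_muts, post_bp = prior_muts + 3, prior_bp + 2000

# ...then treat that posterior as the new prior for a second batch.
post2_muts, post2_bp = post_muts + 5, post_bp + 4000   # 5 muts / 4,000 bp

print(post2_muts / post2_bp)                   # ~0.001312
print((0.0005 + 3 + 5) / (100 + 2000 + 4000))  # identical: order does not matter
```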
With that in mind, we can visualize how the prior (black) for TMR4A is updated by the data into the posterior (red). Notice that the prior is lower than the posterior. That is how we know an MLE estimate would be higher than this MAP estimate, likely by just a small amount. Also notice that the Ne used here matches the one used in ArgWeaver, which puts the median prior TMR4A at about 120 kya (not 100 kya as stated before).