Window Size For Effective Population Size Estimates?

I had a focused question for population geneticists. It is fairly technical, but @Joe_Felsenstein might have the answer tucked away somewhere.

The N_e (effective population size) is the number of individuals in an idealized population (large, panmictic, diploid, constant in size) that would reproduce the statistic we are looking at.

In reference to coalescence-based methods of approximating N_e at different points in the past (e.g., SMC and PSMC), consider a population that is idealized in all respects, except that its size changes from generation to generation.

To fix our understanding here, consider this case: for 1,000 generations the population is 10,000 individuals, and then for 200 generations it is 5,000 individuals. Assuming an ideal coalescence-based method working from 100 present-day samples, what do we theoretically expect the estimated N_e over time to be?

As I understand it, N_e at generation g in the past will be a harmonic mean (1 / \text{mean}[1/N]) of the population size across a window of times. It seems that the width of that window should increase as we go back in time, and should be related to the number of coalescences near time g.
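For reference, with the numbers above, the harmonic mean taken over the entire 1,200 generations (my own back-of-the-envelope arithmetic, i.e., the window stretched across the whole period) would be

N_e = 1200 / (1000/10000 + 200/5000) = 1200 / 0.14 \approx 8571

so the answer clearly depends on how wide a window the estimate effectively averages over.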

It seems the window size should be a function of the number of uncoalesced lineages (from the samples) at time g, and it seems like there should be a theoretical result. Of course, thinking of this as a “window width” is somewhat of a simplification; it might be better understood as a kernel function centered on time g that defines a weighted harmonic mean as the theoretical value of N_e in a population of varying N.
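For concreteness, here is a minimal sketch of what I mean by a kernel-weighted harmonic mean (the Gaussian kernel and the specific numbers are purely illustrative assumptions, not a claim about what any inference program actually does):

```python
import numpy as np

def weighted_harmonic_ne(N, kernel):
    """Weighted harmonic mean of a per-generation population-size trajectory N."""
    w = np.asarray(kernel, dtype=float)
    w = w / w.sum()                      # normalize the kernel weights
    return 1.0 / np.sum(w / np.asarray(N, dtype=float))

# Illustrative trajectory: 10,000 individuals with a 200-generation drop to 5,000.
N = np.full(1200, 10_000.0)
N[1000:] = 5_000.0
t = np.arange(N.size)

# Hypothetical Gaussian kernel centered on generation g with width sigma.
g, sigma = 1050, 100
kernel = np.exp(-0.5 * ((t - g) / sigma) ** 2)

print(weighted_harmonic_ne(N, kernel))   # the "theoretical" N_e at generation g
```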

So then, is there an analytic formula published anywhere that tells us the width of that sliding window (or the kernel function) at g?

1 Like

OK, I see there is an issue here. At time g if the effective population size is N, then the rate of coalescence in that generation is calculated from that N and not from a window of generations around it.

2 Likes

Josh, the rate of coalescence in a single random-mating population whose effective population size is N is k(k-1)/(4N) when there are k lineages. You are right to use Sewall Wright’s harmonic mean calculation. What it calculates is the population size that would result in the same expected amount of genetic drift, and the same expected amount of coalescence, if it were the size throughout that period. I am not sure I understand what you want from this “window”.
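A quick numerical check of this equivalence (a sketch only; it treats the number of uncoalesced lineages k as fixed over the period, which is an approximation):

```python
import numpy as np

# Per-generation coalescence rate for k lineages in a population of size N_t
# is k*(k-1)/(4*N_t); summing over generations gives the expected number of
# coalescence events over the period (holding k fixed as an approximation).
k = 10
N = np.concatenate([np.full(1000, 10_000.0), np.full(200, 5_000.0)])

expected_events_varying = np.sum(k * (k - 1) / (4.0 * N))

N_harm = len(N) / np.sum(1.0 / N)                      # Wright's harmonic mean (~8571)
expected_events_constant = len(N) * k * (k - 1) / (4.0 * N_harm)

# The two are identical (up to floating point), which is the sense in which the
# harmonic mean is the "equivalent" constant population size.
print(expected_events_varying, expected_events_constant)
```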

1 Like

@Joe_Felsenstein thanks for the help.

The issue is that coalescence estimates don’t give point estimates in the past.

That is true, but not what I am talking about. I’m trying to figure out the theoretical ideal for the N_e estimate, which is not going to give information about a single generation, but about several (but how many?).

Let me give you some data. These are the results of a coalescent analysis of a set of individual samples. The red line shows the true population size, with a short, sharp bottleneck (we’ve considered all sorts of variations of this, and the details here don’t really matter to my point). The black line is the average estimated N_e over several simulation runs.

[Figure: true population size with a brief bottleneck (red) vs. mean estimated N_e across simulation runs (black)]

You can see the first region (out to 10,000 years) wildly overestimates the N_e, and this is a well-known limitation of most coalescence-based methods.

Then you do see a dip in N_e, but it is offset forward in time from the bottleneck, and it is not as deep as it should be. It is off by about 3 orders of magnitude.

So how do we interpret this data?

It seems reasonable to think of the N_e estimate as a harmonic average over a time window on this graph. We would not really expect it to ever be as low as the actual population dip, but how low should it go? We cannot compute an expectation here without knowing the window size of the average (or a kernel function).
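To illustrate how strongly the answer depends on the window, here is a toy calculation (entirely hypothetical numbers, not the settings behind the figure): the harmonic mean over a window of W generations that contains an L-generation bottleneck at size N_b, against a background size N_0.

```python
# Harmonic mean over a window of W generations containing an L-generation
# bottleneck at size Nb, with background size N0 (toy numbers only, not the
# simulation settings behind the figure).
def window_harmonic_ne(W, L, N0, Nb):
    return W / ((W - L) / N0 + L / Nb)

print(window_harmonic_ne(W=500, L=10, N0=10_000, Nb=2))    # ~99
print(window_harmonic_ne(W=5000, L=10, N0=10_000, Nb=2))   # ~909
```

The same 10-generation bottleneck looks an order of magnitude shallower when the effective window is ten times wider, which is roughly the kind of discrepancy I am trying to explain.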

Another way to think about it is that we are trying to pin down the measurement “resolution” of demographic inference.

That is what I am after here. I hope it would help us untangle how much of the discrepancy is due to the intrinsic limits of coalescence methods, and how much is due to other factors particular to this method.

So how do you think about that discrepancy? How is it that the N_e estimates are so far away from the true population sizes in this case?

1 Like

Additional question: were the multiple simulations all with the same coalescent tree, or were both the coalescent tree and the sequences resimulated each time?

OK, on rereading I see that you used widely-used SMC or PSMC programs, and that in the figure you used a very short bottleneck down to 2 individuals. I am not surprised that this stresses the programs. I assume they infer a model with discrete periods, each with a constant N. But I don’t know whether those particular programs have some built-in “smoothing” that causes them to prefer to have population size changes be modest at the end of each time segment.

Josh, where did you get the “inferred population sizes” that you obtained from your simulated data?

On more careful inspection, I see that they infer log population sizes changing linearly as a function of log time, with these segments constrained to connect their ends into a continuous curve. (BTW, Josh approved my comments in reverse order, so that adds to the oddness of my response.)

Well, I wasn’t the one who approved them. Perhaps go back and collect them into a single post, at which point I’ll delete this post!

Each of these methods gives an estimate that is discretized, both in the population size estimates and in the time points. You can see that in the individual estimate lines. They tend to “click” to the same population size estimates, and they tend to “click” to particular time points.

Other methods (e.g., ABC), to reduce the simulation space, usually enforce explicitly that population size changes must be modest between time points. Coalescence-based methods don’t enforce this explicitly, but a similar underlying parsimony assumption is used in methods development in the field.

Are you aware of any method that doesn’t use discrete periods? I’ve looked at all I can find. None of them give point estimates. All of them use discretization to reduce computational complexity.

Even if they were not to use discretization, that would not get rid of the underlying uncertainty of the data, which is related to \mu / \rho. That ratio puts constraints on our certainty of any coalescent time estimate.
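A back-of-the-envelope way to see this, for a pair of lineages (my own sketch of the standard argument, with \mu and \rho per base pair per generation): if the pair coalesces at time t, the total branch length is 2t, so a tract of \ell base pairs accumulates about 2 \mu t \ell mutations, while the expected distance to the nearest recombination breakpoint is roughly 1/(2 \rho t) base pairs. Multiplying, the expected number of mutations on a non-recombined segment is about (2 \mu t) \cdot 1/(2 \rho t) = \mu / \rho, independent of t, which is why that ratio caps how precisely any single coalescence time can be dated.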

These were different coalescent trees, across a 2.2 million base pair genome, with \mu / \rho = 2.0. Averaging across 20 runs is somewhat equivalent to an estimate across a 44 million base pair genome. In practice, these results are applied across a whole genome of billions of base pairs, so it does seem that the average run is the most important to look at. The individual runs that occasionally pick up a sharp bottleneck (at the wrong time) look to be just stochastic noise that gets averaged out.

Of course, this does not include any selection (positive or negative), which would have the effect of distorting the estimate even more.

@Joe_Felsenstein, do you know of any theoretical results on this? That would be helpful.

That is approximately the case, which is what leads me to think of this, to first approximation, as a kernel-weighted harmonic average.

We can fit the shape of that kernel function empirically from the data itself, but I want to know if there is a theoretical computation published we can compare with. If not, I want to see if we can derive it theoretically, and if that is interesting to the field.

Any one coalescent tree, from one part of the genome, would be inferred as coalescing at a time depending on what bases changed in the tree – it would be expected to be somewhat wrong in detecting the sharp bottleneck you see, because of the stochastic nature of base substitution. I would have thought, though, that the coalescences they inferred would show a cluster around the time of the bottleneck.

1 Like

I thought the same:

Yes, the shift forward in time is hard to explain as anything other than systematic error in the estimation methods. It is a robust error, though. We see it across several methods and simulations. Because it is systematic, however, it should be possible to correct. Essentially, knowing the kernel function, we should be able to deconvolve the estimates to get a corrected estimate.
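To show the kind of correction I have in mind (purely a sketch of the idea with a made-up Gaussian kernel, not our actual pipeline):

```python
import numpy as np

# Toy deconvolution: if the estimated log10(N_e) curve is (roughly) the true
# log10(N) curve smoothed by a known kernel matrix K, a ridge-regularized
# least-squares fit can attempt to undo the smoothing.
def deconvolve(observed, K, lam=1e-2):
    """observed: smoothed curve on a time grid; K: (n x n) matrix whose rows are
    the kernel centered on each grid point; lam: ridge penalty strength."""
    n = K.shape[1]
    A = np.vstack([K, np.sqrt(lam) * np.eye(n)])
    b = np.concatenate([observed, np.zeros(n)])
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x

# Hypothetical example: a Gaussian kernel (width 5 grid steps) smoothing a sharp dip.
n, sigma = 100, 5.0
t = np.arange(n)
K = np.exp(-0.5 * ((t[:, None] - t[None, :]) / sigma) ** 2)
K = K / K.sum(axis=1, keepdims=True)

true_log_n = np.full(n, 4.0)
true_log_n[60:65] = 1.0                  # a brief, deep bottleneck
observed = K @ true_log_n                # what a kernel-smoothed estimate might look like
recovered = deconvolve(observed, K)

# Compare the depth of the dip in the true, smoothed, and recovered curves.
print(true_log_n.min(), observed.min(), recovered.min())
```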

This may have some important implications for how we interpret, for example, the genetic signal for the population drop after the Toba eruption, though we have not looked at this yet.

I think you are noting that if there were no recombination, the coalescent genealogy would be a tree and if the genome had enough sites changing, one could infer the coalescence times more precisely. With different sections of the genome showing different coalescent trees one has an “ancestral recombination graph”. Each tree within it is inferred fuzzily as there are not too many sites changing in it. The statistical inference is then not of the actual trees that are pasted together in the ARG, it is of what the population was doing – such as its size at various times.

Question: in cases like the one in your figure, was the bottleneck so far back that only a few lineages were still uncoalesced back then?

1 Like

I think I’ve posted this preprint from my friend and former colleague Elise Lauterbur previously. Perhaps it’s a little bit relevant here?

Abstract

Population genetics employs two major models for conceptualizing genetic relationships among individuals – outcome-driven (coalescent) and process-driven (forward). These models are complementary, but the basic Kingman coalescent and its extensions make fundamental assumptions to allow analytical approximations: a constant effective population size much larger than the sample size. These make the probability of multiple coalescent events per generation negligible. Although these assumptions are often violated in species of conservation concern, conservation genetics often uses coalescent models of effective population sizes and trajectories in endangered species. Despite this, the effect of very small effective population sizes, and their interaction with bottlenecks and sample sizes, on such analyses of genetic diversity remains unexplored. Here, I use simulations to analyze the influence of small effective population size, population decline, and their relationship with sample size, on coalescent-based estimates of genetic diversity. Compared to forward process-based estimates, coalescent models significantly overestimate genetic diversity in oversampled populations with very small effective sizes. When sampled soon after a decline, coalescent models overestimate genetic diversity in small populations regardless of sample size. Such overestimates artificially inflate estimates of both bottleneck and population split times. For conservation applications with small effective population sizes, forward simulations that do not make population size assumptions are computationally tractable and should be considered instead of coalescent-based models. These findings underscore the importance of the theoretical basis of analytical techniques as applied to conservation questions.

1 Like

I think this graph is what you want, but it needs some explanation. This graph shows the simulated (not estimated) coalescent times to the “critical TMRxA”. The critical TMRxA is the time at which the dataset coalesces down to x alleles (or lineages, if you prefer), where x = 2 \cdot N_{\min}. So for a five-person bottleneck, it is the TMR10A.

Of course, across the whole genome, because of recombination, there are different cTMRxA values. So we have to aggregate somehow. This graph shows the weighted mean and the weighted median.

The key thing to observe is that, without the bottleneck (right part of the graph), coalescent times are more ancient than with the bottleneck (left part of the graph). This pattern holds for all TMRxA (including the non-critical ones), but the effect is greatly diminished for x < 2 \cdot N_{\min}, and for all TMRxA when the bottleneck time is greater than the expected cTMRxA without a bottleneck.


NOTE: there is a known bug in this particular graph in computing the mean/median/TMRxA, so ignore the strange kinks and some other features. The exact details are really beside the point here, though. The corrected graph looks somewhat like this one, without some of those bizarre features. I just don’t have the corrected one on hand.
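For anyone trying to follow the definition, here is how I would compute a TMRxA from a single tree (a sketch in my own code, not the script behind the figure; the random times are just placeholders):

```python
import numpy as np

def tmrxa(coal_times, n_samples, x):
    """Time at which a tree on n_samples lineages has coalesced down to x lineages.
    coal_times: the (n_samples - 1) coalescence times of the tree, in any order."""
    times = np.sort(np.asarray(coal_times))
    k = n_samples - x              # coalescence events needed to reach x lineages
    return times[k - 1]

# Hypothetical example for a 5-individual bottleneck (N_min = 5, so x = 10):
rng = np.random.default_rng(0)
fake_times = np.cumsum(rng.exponential(size=199))   # placeholder times, 200 samples
print(tmrxa(fake_times, n_samples=200, x=10))        # the (critical) TMR10A
```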

Really helpful, thanks. She compares coalescent vs. forward simulation and shows that they diverge dramatically. That’s a known issue, and she illustrates it nicely. I won’t get into our simulation method here, except to say that it avoids these problems.

1 Like

Have you validated the simulation method?

Yes, we have validated them and are continuing to do so. But as you know, there are many ways to validate. What specific sort of validation are you asking about?