Dr Michael G Strauss: Adam and Eve: Genetic Evidence

It’s certainly true that genetics is modeling much more complex systems than is typical for physics, and genetics models are correspondingly less precise. Even so, your statement seems a little sweeping. QCD was considered a viable and valuable model for decades before it could be used to calculate something as basic as the mass of the proton to within an order of magnitude, wasn’t it?

You seem to have confused two different sigmas. The sigma you’re talking about is the sample standard deviation, while the relevant sigma is the standard error on the mean, which is sigma/sqrt(N), where N seems to be about 10 million. In HEP terms, it’s analogous to reconstructing 10 million decays of a broad resonance and asking how confident you are that you haven’t gotten the rest mass wrong by a factor of five. In reality, this case is a little different, since we really want to know the maximum true age of four non-coalesced lineages, since that sets a limit on when a two-person bottleneck could have occurred. As I recall, @swamidass used the median of the estimated ages, which seems a conservative choice.

Note that the model of heterozygosity in the Mouflon sheep study assumed neutral evolution, which may well be wrong in this case. By contrast, the results that @swamidass describes make no assumption about neutrality, and in fact the researchers report finding loci with both purifying and directional selection. The primary assumption in those results is that the mutation rate has been more or less constant, something that we have multiple reasons for thinking must be true.

I’m not confusing sigmas. You cannot look at just the sigma on the mean. That is not the relevant thing here. You have to look at the sigma of the width. That gives the range of reasonable values. Again, if I’m interpreting things correctly, the statistical analysis used would never hold up for a physics master’s degree paper. I’m not trying to be demeaning or antagonistic, just observing the analysis.

You are confusing perturbative QCD with non-perturbative QCD. Perturbative QCD has made mathematical predictions from the beginning. Non-perturbative QCD is a low energy approximation and we know it is only an approximation that actually follows my previous statement precisely. It has been continually refined with complexity added because the early versions were not very accurate.


Sorry, but as a physicist this paragraph doesn’t make sense. Does the Mouflon sheep support or differ from the models used to make predictions about genetic diversity and the size of the human bottleneck? I’ve heard that the Mouflon sheep data differs from the predictions made by those models.

But you’re not interpreting things correctly. The values in @swamidass’s histogram are not independent estimates of the same quantity. They are estimates of the time to 4 lineages for different segments of the genome, each of which has its own history and its own time to that coalescence point. The distribution of those values has a very large variance, so there is indeed a large intrinsic width to this distribution. What we want to know – what sets the limit on the time to a 2-person bottleneck – is the true upper edge of that distribution, which has no obvious connection to what you’re calculating. Even if there were no intrinsic width, though, and these were all independent, noisy measurements of a single age of the genome, your use of sigma here would still be incorrect: the uncertainty on the mean would tell you the uncertainty on the time to the bottleneck, whereas the standard deviation on the pictured distribution would represent the uncertainty in any given measurement.
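To make the distinction in that last sentence concrete, here is a minimal sketch of the hypothetical "no intrinsic width" case only: independent, noisy measurements of a single value. The numbers (a true value of 100, a scatter of 20) and the function name are invented for illustration and have nothing to do with the actual histogram, whose values, as noted above, are not independent estimates of one quantity.

```python
import math
import random
import statistics


def spread_and_standard_error(values):
    """Return (sample standard deviation, standard error of the mean)."""
    sd = statistics.stdev(values)          # uncertainty of any single measurement
    sem = sd / math.sqrt(len(values))      # uncertainty of the estimated mean
    return sd, sem


# Hypothetical noisy measurements of one true value (100) with scatter 20.
random.seed(0)
small = [random.gauss(100.0, 20.0) for _ in range(100)]
large = [random.gauss(100.0, 20.0) for _ in range(10_000)]

sd_small, sem_small = spread_and_standard_error(small)
sd_large, sem_large = spread_and_standard_error(large)

print(f"N=100:    sd={sd_small:.2f}  sem={sem_small:.3f}")
print(f"N=10000:  sd={sd_large:.2f}  sem={sem_large:.3f}")
```

The standard deviation stays near the scatter of the individual measurements however many are taken, while the standard error on the mean shrinks as 1/sqrt(N).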

Don’t assume, by the way, that physicists are necessarily more competent at statistics than biologists. I’ve been both a high energy physicist and a geneticist, and I can assure you that good population geneticists know at least as much about statistics as experimental physicists, and some of them quite a bit more.

I didn’t confuse perturbative and non-perturbative QCD – I just said QCD, i.e. the full model. Perturbative QCD is an approximation of QCD that works well at high energies, while other approximations have to be used for the non-perturbative regime, including for calculating hadron masses. The latter could not be calculated until quite recently, using any approximation. The earliest reasonable calculation that I can find of the proton mass was in 2008. None of this affects my (quite unimportant) point, which is that physicists too are willing to accept models that can’t estimate some things accurately if there’s nothing better available. Anyway, this is a tangent and not worth pursuing.

The Mouflon sheep study is orthogonal to the methods that we’re discussing for setting limits on the timing of a tight human bottleneck. The model used in the sheep study, the one that was in conflict with observation, makes much stronger assumptions.


@MStrauss I’m sorry I have not been able to participate in this conversation. I’m currently out of pocket. I want to comment on a couple things.

  1. @glipsnort is a highly competent and thoughtful scientist in this area, who has consistently demonstrated himself honest with the evidence. I suggest engaging most directly with him.

  2. You discuss difficulty understanding the validity of the models involved. I’m happy to explain them to you sometime, so you can make sense of it. This is really not that complex, but it has often been explained in very poor ways.

  3. You are looking at Yadam and Meve as a potential bottleneck. The problem, however, is that there is much more information to consider, such as the other 95% of the genome. That changes the conclusion substantially. Until we have engaged with that 95% of the data, we can’t be certain of much from only looking at Yadam and Meve.

Anyhow, thanks for joining us here. Peace.


Dr. Strauss,
In the biomedical sciences it is rarely feasible to plan a study with a 5-sigma significance level. Biological data, and in particular human subjects data, introduces too many sources of variability, and distributional assumptions are often violated. We generally don’t have the option of letting a clinical trial run for another 500 patients, or easily expanding the size of a genetic database (and when we do, planned interim analysis is required).

You are correct that 2-sigma is not very convincing, and statisticians are aware of this problem. In these situations we can only try to make sure that conservative statistical assumptions are used and that a scientific rationale exists, avoid overstating the results, and wait for a confirmatory study to come along.

My personal level of credibility is about 3-sigma, or a Bayes Factor>10, with a meaningful effect size to go along with it. :slight_smile:
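For readers who want the sigma levels as probabilities, here is a small sketch that converts them to two-sided p-values under a normal distribution. The two-sided convention and the function name are my assumptions for illustration; particle physics results are often quoted one-sided, which halves these numbers.

```python
import math


def two_sided_p_value(n_sigma):
    """Two-sided tail probability of a standard normal beyond +/- n_sigma."""
    return math.erfc(n_sigma / math.sqrt(2.0))


for n_sigma in (2, 3, 5):
    print(f"{n_sigma}-sigma -> p = {two_sided_p_value(n_sigma):.2e}")
```

On this convention 2-sigma is roughly p = 0.05, 3-sigma roughly p = 0.003, and 5-sigma is below one in a million.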

Edit: Apologies, I neglected to notice I was intruding on an Office Hours topic. I don’t have any other relevant expertise beyond statistics in the biosciences, so I will return to the peanut gallery.

Let me also point out that there are more than four MHC alleles shared with other ape species, which of itself invalidates a 2-person bottleneck at any time.

@MStrauss, you’ve had the privilege of interacting with @glipsnort on this for a while, a leading scientist in this area. Do you have any follow up questions for me at this time?

How many times must we go over this, @John_Harshman? This conclusion of yours neglects convergent evolution of MHC alleles, and is not valid without much more analysis.

Just to clarify: the only reason the Mouflon sheep come up is an article by Dennis Venema, where he explained PSMC (a population inference method). PSMC estimates population sizes based on the size distribution of homozygous stretches of the genome. Homozygosity goes up with a bottleneck. Its opposite is heterozygosity.
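As a rough illustration of why heterozygosity is expected to fall after a bottleneck, here is a sketch of the textbook neutral-drift expectation, in which each generation of size N loses a fraction 1/(2N) of the remaining heterozygosity. This is only the standard idealized result, ignoring mutation and selection; it is not PSMC and not necessarily the exact model used in the sheep paper, and the population sizes below are made up.

```python
def expected_heterozygosity(h0, population_sizes):
    """Expected heterozygosity under pure neutral drift.

    h0 is the starting heterozygosity; population_sizes lists the number of
    diploid individuals in each successive generation (hypothetical values).
    Mutation and selection are ignored.
    """
    h = h0
    for n in population_sizes:
        h *= 1.0 - 1.0 / (2.0 * n)
    return h


# Hypothetical history: a single founding pair, then rapid growth.
history = [2, 10, 50, 200]
print(expected_heterozygosity(1.0, history))
```

Most of the loss happens in the generations when the population is smallest, which is why a brief bottleneck followed by quick growth removes less diversity than intuition might suggest.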

In response to Venema, the mouflon sheep paper was noted by Fuz Rana (of RTB), and it shows that heterozygosity after a bottleneck was higher than expected by a purely neutral model. On this basis, Rana is skeptical of the PSMC estimates, which show a large human population going back 2 million years.

There are several problems with this line of reasoning:

  1. The mouflon paper itself notes that a model with purifying selection accounted for the observations. So it is not as if we don’t understand what caused the increased amount of heterozygosity. This article does not call into question our basic understanding of population genetics.

  2. We can’t be entirely sure there was a bottleneck as low as a single couple. This is, as I understand it, what the recorded history tells us. However, we do not know if additional individuals were introduced into the population or not.

  3. No one actually ran the PSMC algorithm to see if it could or would detect a bottleneck. Maybe it would. No one has actually done the test to find out.

  4. The PSMC algorithm is already well known to fail on bottlenecks more recent than 10,000 years ago. So even if PSMC failed to detect the bottleneck, this doesn’t tell us anything about its ability to detect bottlenecks in the more distant past.

The problem with PSMC is more fundamental. Despite assurances to the contrary, no one had done rigorous studies to establish the confidence of PSMC in detecting brief bottlenecks. It turns out that PSMC has very low power to detect them. For reasons totally different from those Fuz raised with the Mouflon sheep, PSMC is not a valid way of disproving a bottleneck.

We designed TMR4A to get around this pitfall. It makes far fewer assumptions, and does not rely on heterozygosity. More importantly, it does not estimate the average population size (like PSMC). Instead, it estimates the minimum possible population size. This is an important distinction. With this tooling in place, we find that a bottleneck of a single couple is ruled out more recently than 500 kya. More ancient than this? We don’t have data that tells us one way or another.

Convergence is unfortunately a common response to phylogenetic results one doesn’t like. We should discuss your evidence that these alleles are convergent. While the conclusion of homologous alleles may not be quite certain, would you agree that it’s still a live hypothesis and can’t be ruled out?

There is strong evidence of convergence in this case. And also very strong quantitative evidence that it is driving evolution here (e.g. high Ka / Ks ratio). I’m not grasping here for a loophole.

Yes. But you can’t present it as a settled conclusion. We discussed the studies that could settle the question. I’d rather you explained those studies as a way that might eventually disprove a bottleneck at any point. If that is what you did, you’d get no argument from me. You can’t, however, say that trans-species variation as we currently understand it rules out a bottleneck.

I don’t recall the citation for this. Can you remind me?

Fair enough. Can we say that it renders a bottleneck unlikely, including a bottleneck older than 500,000 years?

I summarize the key paper here: Trans-Species Variation or Convergent Evolution?. Though there are many more in the literature.

Of close relevance to this, the MHC Ka/Ks ratio can be as high as 10 or 12, which indicates the uncalibrated age of alleles can be as high as 10x to 12x the true age. It also indicates a high degree of selection, creating the conditions for multiple cases of convergent evolution.

No. It is not possible to interpret the MHC this way, not without much more study. It might all be an illusion caused by convergence (from common recent selective pressures). Or it might be extended or confirmed with MHC introns. Or we might find good reason to think introns can’t tell us one way or another. We can hypothesize, but we just do not know yet. You are banking on the hypothesis that it will rule out a bottleneck. I have no problem with this. Perhaps you are right. Just do not present your hypothesis as a confirmed hypothesis, when it is no such thing.

So what can we say that appropriately tempers enthusiasm? We can say: "With further study, it is possible that trans-species variation will strongly challenge a single couple bottleneck at any point in human history, including as far back as millions of years ago."

I’d even encourage you to say, “the studies required to test this are difficult, so we can’t do them right now. We encourage creationists who care about a bottleneck, however, to do this scientific work to find out. We would respect and advise such an effort.”

I also emphasize that I have no dog in this fight. I personally don’t care one way or another how a study like this would go. I’m just insisting that we are honest with the public. If we aren’t, how can @MStrauss and others trust us on conclusions that are difficult for them? That is why this is so important. Do not overstate the evidence. Show intellectual empathy. Explain the evidence, but do not extend the conclusions further.

No, that isn’t what I’m doing. I’m banking on detailed molecular convergence being so rare as to be unlikely in any particular case. Rapid evolution under selection would tend to result in convergence only when there is only one path to adaptation, which a priori sounds unlikely to me. And what do those silent changes have to say?

For MHC, it appears we see convergence multiple times over in human populations. Convergence in MHC does not appear rare.

But is the convergence within human populations of sufficient magnitude to explain the similarities of alleles between humans and other apes? Has anyone analyzed the cross-species data looking for that? And again, what about silent changes?

@John_Harshman this has to be the 3rd (if not 4th) time we’ve circled this same question. These analyses have not been done. They also are confounded. Introns overcome those confounders. An intron study might get to the bottom of this. Until it is done, we just can’t say with certainty what this evidence tells us.

Please try to remember this going forward. We always circle back to this same conclusion. Nothing has changed here since the last time we circled. I urge you to reread our large exchange about this and find some language that we both agree to, so we can avoid rehashing it over and over: John Harshman: Bottlenecks and Trans-Species Variation.

I’m not sure of that. It should be fairly simple to look at old data to see whether convergence is a plausible explanation, and again to examine especially the silent differences.