In school, probability is often introduced with examples of dice, coin tosses, marble urns and the like. The sample sizes are kept small for illustrative purposes, and the space of possible outcomes is well within a range one can comfortably count. Some weeks ago we saw one consequence of this, where Bill Cole would define probability only in terms of the counting measure, and completely overlook the possibility of non-uniform distributions, let alone non-finite sample spaces. What I’m spotting here with Gilbert is a similar situation.
“Randomness” is something many who do not work with it associate with a form of completely uncontrollable chaos. A random system, to them, is a system that can pretty much do whatever is within its reach at any point, and there is no characterizing or predicting it. Bluntly put, this is just straight-up false. At sufficiently large sample sizes - which, depending on the concrete problem, may be trivially easy to achieve, as in the case of the thousands to billions of base pairs in a genome - “noise” is a grossly insufficient descriptor of such a process, not because the process is something other than random, but because of how easy it becomes to tell apart even two only slightly different distributions.
Say you are given two dice, one of which is weighted such that one side is, say, 3% more likely than all the others. You will be hard-pressed to tell accurately and reliably which is which after only a dozen or so throws of each, even if you repeat the experiment another time or two. After some 12k throws of each, however, the likelihood of making an earnest mistake in assessing the difference between the outcome distributions is vanishingly small. You know what a fair die looks like in the high-sample-size limit. Will every side have an equal share of all the throws? No. But if on the other die one side dominates the others by as much as 3%, can you honestly argue that this is likely the fair one, and the other weighted?
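To make that concrete, here is a minimal sketch of the dice experiment - my own, not anything Gilbert or I ran - assuming the “3%” is read as one face getting a 3-percentage-point boost at the expense of the other five, and using numpy/scipy purely as a convenience. It throws each die 12 times and then 12,000 times and runs a chi-square goodness-of-fit test against a fair die:

```python
# A rough sketch, not anything from the original discussion: one die is fair,
# the other has one face boosted by 3 percentage points (my reading of "3%"),
# taken evenly from the remaining five faces.
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(0)

fair = np.full(6, 1 / 6)
loaded = np.full(6, 1 / 6)
loaded[5] += 0.03        # one face 3 percentage points more likely
loaded[:5] -= 0.03 / 5   # the excess taken evenly from the other five

def gof_pvalue(probs, n_throws):
    """p-value of a chi-square goodness-of-fit test against a fair die."""
    throws = rng.choice(6, size=n_throws, p=probs)
    counts = np.bincount(throws, minlength=6)
    return chisquare(counts).pvalue  # expected frequencies default to uniform

for n in (12, 12_000):
    print(f"{n:6d} throws: fair p ~ {gof_pvalue(fair, n):.3f}, "
          f"loaded p ~ {gof_pvalue(loaded, n):.2e}")
```

At a dozen throws, both p-values are typically unremarkable and you learn next to nothing. At 12,000 throws, the loaded die’s p-value typically collapses to essentially zero while the fair one stays ordinary - the high-sample-size limit doing exactly what the paragraph above describes.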
Much the same argument applies to genome mutations. Surely, something that changes wildly within a handful of generations is likely not survival-critical. We can tell a non-viable organism by how unambiguously dead it is, after all. And this scales down from survival-critical to merely heavily or even slightly impactful. So, if some part of a genome mutates at a rate much different from another, then the more conserved portion is likely to contain the more important functions. The most mutable portions, on the other hand, must be the ones that either serve functions with at most a minimal impact on the organism’s fitness, or serve no function at all. Crucially, we can perform experiments to measure and distinguish those regions reliably. Lastly, we call evolution that has no impact on an organism’s fitness “neutral”. The prediction is that there is (to a first approximation) a positive correlation between how strongly a genome region is conserved - that is, how slowly it changes - and the impact mutations in that region have on the carrier organism’s fitness.
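As a cartoon of that prediction - my own toy model, with made-up numbers, not anything from the comparative-genomics literature - suppose each region has some fraction of sites where a mutation is harmful enough that selection purges it, while mutations at the remaining sites drift to fixation. Counting the substitutions that survive then recovers the constraint:

```python
# Toy caricature of purifying selection, purely for illustration: mutations at
# "constrained" (fitness-relevant) sites never fix, mutations elsewhere always do.
# Region count, region length, mutation rate and generation count are arbitrary.
import numpy as np

rng = np.random.default_rng(1)

n_regions, region_len = 200, 1_000
mu = 1e-3          # per-site, per-generation mutation rate (toy value)
generations = 500

# Fraction of fitness-relevant sites per region, i.e. how "important" it is.
constrained = rng.uniform(0.0, 1.0, size=n_regions)

# Substitutions that actually accumulate: only hits on unconstrained sites fix.
substitutions = rng.poisson(mu * region_len * generations * (1 - constrained))

corr = np.corrcoef(constrained, substitutions)[0, 1]
print(f"correlation(importance, substitution count) ~ {corr:.2f}")
```

The printed correlation comes out strongly negative: the regions that change fastest are the ones where the fewest fitness-relevant sites can be broken, which is the conserved-means-important relationship stated above, just read from the other side.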
It’s not a case of “implies” anything. Correlations are symmetric. If high fitness impact of a region correlates with that region’s conservation, then conservation likewise correlates with fitness impact, and, therefore, low fitness impact correlates with low conservation. Whether there is a causal link, or any other asymmetric sort of entailment, or which direction it points, is not something any amount of data can ever reveal to us. We may conjecture such things in the theory, if that helps us understand the model and its implications, but at the risk of introducing confusion, such as about whether the converse follows. The data remains what it is, regardless of how well or poorly our models aid us in predicting it.
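For what it’s worth, the symmetry is baked into the very definition of the (Pearson) correlation coefficient - a textbook fact, not anything specific to this argument:

$$
\rho(X,Y) \;=\; \frac{\operatorname{Cov}(X,Y)}{\sigma_X\,\sigma_Y} \;=\; \frac{\operatorname{Cov}(Y,X)}{\sigma_Y\,\sigma_X} \;=\; \rho(Y,X),
$$

since $\operatorname{Cov}(X,Y) = \mathbb{E}\big[(X-\mathbb{E}X)(Y-\mathbb{E}Y)\big]$ plainly does not care which factor we call $X$ and which $Y$.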