Now, in answer to Bill Cole’s recurrent invocation of “statistics”, as if that somehow magically lets us extrapolate into very dissimilar and untested regions of sequence space, here’s a paper where they tried to do exactly that.
They compared statistical approaches (extrapolations from various sampling methods) against a quasi-empirical one (a simulation of empirical tests) to assess the functionality of sequence space:
du Plessis L, Leventhal GE, Bonhoeffer S. How Good Are Statistical Models at Approximating Complex Fitness Landscapes? Mol Biol Evol. 2016 Sep;33(9):2454-68.
DOI: 10.1093/molbev/msw097
Abstract
Fitness landscapes determine the course of adaptation by constraining and shaping evolutionary trajectories. Knowledge of the structure of a fitness landscape can thus predict evolutionary outcomes. Empirical fitness landscapes, however, have so far only offered limited insight into real-world questions, as the high dimensionality of sequence spaces makes it impossible to exhaustively measure the fitness of all variants of biologically meaningful sequences. We must therefore revert to statistical descriptions of fitness landscapes that are based on a sparse sample of fitness measurements. It remains unclear, however, how much data are required for such statistical descriptions to be useful. Here, we assess the ability of regression models accounting for single and pairwise mutations to correctly approximate a complex quasi-empirical fitness landscape. We compare approximations based on various sampling regimes of an RNA landscape and find that the sampling regime strongly influences the quality of the regression. On the one hand it is generally impossible to generate sufficient samples to achieve a good approximation of the complete fitness landscape, and on the other hand systematic sampling schemes can only provide a good description of the immediate neighborhood of a sequence of interest. Nevertheless, we obtain a remarkably good and unbiased fit to the local landscape when using sequences from a population that has evolved under strong selection. Thus, current statistical methods can provide a good approximation to the landscape of naturally evolving populations.
The abstract gives it away already (my bolds). But let’s look at some of what they write anyway. The introduction is good, so please go read it (read the whole thing, actually); I won’t copy-paste it all here since the paper is freely available.
In the methods they describe the properties of the model they use to generate their quasi-empirical fitness landscape. Long story short, they simulate the function of a noncoding RNA molecule that folds, and whose function depends on its simulated secondary structure, which in turn depends on its sequence:
We use this idea to compute quasi-empirical RNA fitness landscapes, where the fitness of a sequence is based on the similarity of its secondary structure to an ideal target structure, which is assumed to fulfill a hypothetical function that is highly dependent on its structural conformation.
We use the minimum free energy (MFE) structure of a real, functional RNA sequence as the target structure. The human U3 snoRNA (Marz and Stadler 2009; Marz et al. 2011), downloaded from Rfam (Griffiths-Jones et al. 2005), is used as a focal genotype to generate the fitness landscape used in the following sections. This is a noncoding box C/D RNA of 217 nt, making it long enough to form nontrivial structures and for its fitness landscape to be both complex and biologically relevant. We use a real sequence to ensure that the fitness landscape is generated around a sequence with a biological function. The fitness of a candidate sequence is the average selective value of all the structures in the suboptimal ensemble of a sequence (containing all structures with free energies within a bounded distance from the minimal free energy structure). The fitness function is detailed in the Materials and Methods section and is similar to that used in Cowperthwaite and Meyers (2007) and Cowperthwaite et al. (2005, 2006). The resulting fitness landscape is continuous, exhibits a high degree of semineutrality, and places sequences under a strong selective pressure to have similar structures as the target structure while maintaining high stabilities.
Okay, so they generate the fitness landscape based on the physical basis of function for a real functional RNA molecule.
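To make the idea concrete, here is a toy sketch of a structure-based fitness function in the same spirit. This is not the authors’ actual fitness function (which folds sequences with an RNA folding algorithm and averages selective values over a whole suboptimal ensemble of structures); it just illustrates the core idea that fitness is defined by similarity of a structure to a target structure. All names and the decay formula are my own illustrative assumptions.

```python
# Toy sketch: fitness as similarity between a candidate's dot-bracket
# secondary structure and an ideal target structure. The real paper uses
# folded structures from an RNA sequence; here we compare structures directly.

def structure_distance(s1: str, s2: str) -> int:
    """Number of positions at which two dot-bracket strings differ."""
    assert len(s1) == len(s2)
    return sum(a != b for a, b in zip(s1, s2))

def fitness(structure: str, target: str, sigma: float = 2.0) -> float:
    """Map structural distance to a fitness in (0, 1]: an identical
    structure gets fitness 1.0, and fitness decays as the candidate's
    conformation diverges from the target."""
    d = structure_distance(structure, target)
    return 1.0 / (1.0 + d / sigma)

target = "((((....))))"
print(fitness("((((....))))", target))  # identical structure -> 1.0
print(fitness("((.(....).))", target))  # two positions differ -> 0.5
```

The point of a setup like this is that the mapping from sequence to structure (done by a folding algorithm in the paper) is what makes the landscape complex and epistatic, even though the structure-to-fitness step is simple.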
Then they apply several different sampling techniques and “train” their statistical models on each resulting sample, to see how well the models predict the true fitness landscape from limited data. I’m going to skip over most of that and jump straight to the conclusions relevant to the arguments I have been making here:
We first verified our prediction that it is generally impossible for either model to explain any of the variation when trained on randomly sampled sequences (fig. 3A, Random). Next, we investigated the local neighborhood around a sequence of interest and found that within the restricted sequence space it is possible to sample densely enough to allow a quadratic model (and even a linear model) to reconstruct a fairly good approximation of the local landscape. The fit is much better for Complete Subset than for Random Neighborhood, although no sequence in either data set is more than eight mutations from the focal genotype.
The models do okay at predicting the fitness of sequences LOCALLY around the sampled ones, but fail miserably once sequences diverge beyond roughly 10% from the training set, regardless of which technique was used to sample the space.
Within a LOCAL area, the sampling technique does matter for predicting the function of highly similar sequences. It turns out that if you sample by evolving sequences, a statistical model trained on those evolved sequences is among the best, because evolution retains information about which residues correlate with higher fitness.
But once you move away from local sequences, predictive power drops off into complete failure.
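The kind of regression model the paper evaluates can be sketched in a few lines: encode each (position, nucleotide) pair as a one-hot feature and fit main effects by least squares. The toy landscape, the sampling scheme, and every name below are my own assumptions, not the paper’s code; the point is just to show why a model trained locally breaks down far from its training set when the true landscape has epistasis.

```python
import numpy as np

rng = np.random.default_rng(0)
L, A = 12, "ACGU"
idx = {c: i for i, c in enumerate(A)}

# Toy landscape: main effects plus pairwise epistasis. The epistatic terms
# are what a main-effects regression cannot capture once sequences leave
# the neighborhood it was trained on.
main = rng.normal(size=(L, 4))
pair = rng.normal(size=(L, L, 4, 4))

def fitness(seq):
    f = sum(main[p, idx[c]] for p, c in enumerate(seq))
    for p in range(L):
        for q in range(p + 1, L):
            f += pair[p, q, idx[seq[p]], idx[seq[q]]]
    return f

def one_hot(seq):
    x = np.zeros(L * 4)
    for p, c in enumerate(seq):
        x[p * 4 + idx[c]] = 1.0
    return x

def mutants(seq, n, count):
    """Generate `count` sequences with exactly n random mutations."""
    out = []
    for _ in range(count):
        s = list(seq)
        for p in rng.choice(L, size=n, replace=False):
            s[p] = rng.choice([c for c in A if c != s[p]])
        out.append("".join(s))
    return out

focal = "".join(rng.choice(list(A), size=L))

# Train a main-effects regression on sequences at most 2 mutations away.
train = mutants(focal, 1, 100) + mutants(focal, 2, 100)
X = np.array([one_hot(s) for s in train])
y = np.array([fitness(s) for s in train])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

def mse(seqs):
    return float(np.mean([(one_hot(s) @ beta - fitness(s)) ** 2 for s in seqs]))

local_err = mse(mutants(focal, 2, 50))     # near the training neighborhood
distant_err = mse(mutants(focal, 10, 50))  # far outside it
print(f"local MSE:   {local_err:.2f}")
print(f"distant MSE: {distant_err:.2f}")
```

Locally, the epistatic contributions of unmutated positions are nearly constant and get silently absorbed into the fitted main effects, so the fit looks fine. Move the background far enough and those absorbed terms are all wrong, which is exactly the pattern the paper reports.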
The predictive power is strongly correlated to the landscape size and decreases rapidly as the number of allowed mutations are increased. Similarly, we see that as the predictive power decreases the correlation between the true fitness and residual size increases. For randomly sampled sequences it is only possible to attain a high predictive power if the median Hamming distance between sequences in the training set is less than 20 mutations, equal to roughly 90% sequence conservation in the data set. Note that the median Hamming distance between sequences is only slightly smaller than the maximum Hamming distance between sequences. This is because most of the randomly sampled sequences contain the maximum number of allowed mutations, since landscape size grows exponentially with the number of allowed mutations. We further see that at median Hamming distances greater than 80 no prediction is possible. Thus, only on the smallest landscapes used here is it possible for a random sampling of 60,000 sequences to attain a dense enough sampling to produce a good fit.
So basically, by the time you get down to roughly 60–70% sequence similarity (a median Hamming distance above 80 on a 217 nt sequence), the models collapse into complete failure at predicting the functionality of unsampled sequences.
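The combinatorics behind “landscape size grows exponentially with the number of allowed mutations” are easy to check directly. The number of sequences within m mutations of a 217 nt focal sequence is the sum over k ≤ m of C(217, k)·3^k: choose which k positions mutate, times 3 alternative nucleotides at each. A quick sketch (my own arithmetic, not from the paper):

```python
from math import comb

L = 217  # length of the U3 snoRNA used as the focal genotype

def landscape_size(m: int) -> int:
    """Number of sequences of length L within m mutations of a focal
    sequence: sum over k <= m of C(L, k) * 3**k."""
    return sum(comb(L, k) * 3**k for k in range(m + 1))

for m in (1, 2, 3, 8, 20):
    print(f"m = {m:2d}: {landscape_size(m):.3e} sequences")
```

Already at three allowed mutations the neighborhood holds tens of millions of sequences, dwarfing the 60,000-sequence training sets used in the paper; at twenty mutations (the ~90% conservation threshold quoted above) the count is astronomically beyond any conceivable sampling effort.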
Discussion
We assess the usefulness of regression models accounting for main effects and pairwise interactions between loci to approximate complex fitness landscapes, here represented by quasi-empirical RNA fitness landscapes. Our results show that achieving a high enough sampling density is crucial in order to obtain a good description of a fitness landscape. The curse of dimensionality ensures that it is not possible to sample densely enough to allow a simple model to accurately predict the fitness of any sequence in realistic fitness landscapes. However, while it is impossible to provide a good approximation of a complete fitness landscape in all but the simplest cases it is still possible to obtain an accurate representation of local regions of the fitness landscape by restricting the space sampled for the training set. But, since no model can predict the fitness of sequences that fall too far outside of the variation in its training set, the composition of the training set is extremely important. Care should be taken to select sequences that restrict the sequence space to those sequences we are interested in and also to select sequences that elucidate epistatic interactions between loci.