Now, in answer to Bill Cole’s recurrent invocation of “statistics”, as if that somehow magically lets us extrapolate into very dissimilar and untested regions of sequence space, here’s a paper where they tried to do exactly that.
They compared statistical approaches (extrapolations from various sampling methods) against a quasi-empirical one (a simulation of empirical tests) to assess the functionality of sequence space:
du Plessis L, Leventhal GE, Bonhoeffer S. How Good Are Statistical Models at Approximating Complex Fitness Landscapes? Mol Biol Evol. 2016 Sep;33(9):2454-68.
DOI: 10.1093/molbev/msw097
Abstract
Fitness landscapes determine the course of adaptation by constraining and shaping evolutionary trajectories. Knowledge of the structure of a fitness landscape can thus predict evolutionary outcomes. Empirical fitness landscapes, however, have so far only offered limited insight into real-world questions, as the high dimensionality of sequence spaces makes it impossible to exhaustively measure the fitness of all variants of biologically meaningful sequences. We must therefore revert to statistical descriptions of fitness landscapes that are based on a sparse sample of fitness measurements. It remains unclear, however, how much data are required for such statistical descriptions to be useful. Here, we assess the ability of regression models accounting for single and pairwise mutations to correctly approximate a complex quasi-empirical fitness landscape. We compare approximations based on various sampling regimes of an RNA landscape and find that the sampling regime strongly influences the quality of the regression. On the one hand it is generally impossible to generate sufficient samples to achieve a good approximation of the complete fitness landscape, and on the other hand systematic sampling schemes can only provide a good description of the immediate neighborhood of a sequence of interest. Nevertheless, we obtain a remarkably good and unbiased fit to the local landscape when using sequences from a population that has evolved under strong selection. Thus, current statistical methods can provide a good approximation to the landscape of naturally evolving populations.
The abstract gives it away already (my bolds). But let’s look at some of what they write anyway. The introduction is good, so please go read it (read the whole thing, actually); I won’t copy-paste it all here since the paper is freely available.
In the methods they describe the properties of the model they use to generate their quasi-empirical fitness landscape. Long story short, they simulate the function of a noncoding RNA molecule that folds, and whose function depends on its simulated secondary structure, which in turn depends on its sequence:
We use this idea to compute quasi-empirical RNA fitness landscapes, where the fitness of a sequence is based on the similarity of its secondary structure to an ideal target structure, which is assumed to fulfill a hypothetical function that is highly dependent on its structural conformation.
We use the minimum free energy (MFE) structure of a real, functional RNA sequence as the target structure. The human U3 snoRNA (Marz and Stadler 2009; Marz et al. 2011), downloaded from Rfam (Griffiths-Jones et al. 2005), is used as a focal genotype to generate the fitness landscape used in the following sections. This is a noncoding box C/D RNA of 217 nt, making it long enough to form nontrivial structures and for its fitness landscape to be both complex and biologically relevant. We use a real sequence to ensure that the fitness landscape is generated around a sequence with a biological function. The fitness of a candidate sequence is the average selective value of all the structures in the suboptimal ensemble of a sequence (containing all structures with free energies within a bounded distance from the minimal free energy structure). The fitness function is detailed in the Materials and Methods section and is similar to that used in Cowperthwaite and Meyers (2007) and Cowperthwaite et al. (2005, 2006). The resulting fitness landscape is continuous, exhibits a high degree of semineutrality, and places sequences under a strong selective pressure to have similar structures as the target structure while maintaining high stabilities.
Okay, so they generate the fitness landscape based on the physical basis of function for a real functional RNA molecule.
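To make the idea concrete, here is a toy sketch of a structure-based fitness function in the same spirit. This is not the authors’ actual fitness function (which folds sequences with an RNA folding algorithm and averages selective values over a whole suboptimal ensemble of structures); it just illustrates the core idea that fitness is defined by similarity of a structure to a target structure. All names and the decay formula are my own illustrative assumptions.

```python
# Toy sketch: fitness as similarity between a candidate's dot-bracket
# secondary structure and an ideal target structure. The real paper uses
# folded structures from an RNA sequence; here we compare structures directly.

def structure_distance(s1: str, s2: str) -> int:
    """Number of positions at which two dot-bracket strings differ."""
    assert len(s1) == len(s2)
    return sum(a != b for a, b in zip(s1, s2))

def fitness(structure: str, target: str, sigma: float = 2.0) -> float:
    """Map structural distance to a fitness in (0, 1]: an identical
    structure gets fitness 1.0, and fitness decays as the candidate's
    conformation diverges from the target."""
    d = structure_distance(structure, target)
    return 1.0 / (1.0 + d / sigma)

target = "((((....))))"
print(fitness("((((....))))", target))  # identical structure -> 1.0
print(fitness("((.(....).))", target))  # two positions differ -> 0.5
```

The point of a setup like this is that the mapping from sequence to structure (done by a folding algorithm in the paper) is what makes the landscape complex and epistatic, even though the structure-to-fitness step is simple.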
Then they apply several different sampling techniques and “train” their statistical models on each resulting sample, to see how well the models predict the true fitness landscape from limited data. I’m going to skip over most of that and jump straight to the conclusions relevant to the arguments I have been making here:
We first verified our prediction that it is generally impossible for either model to explain any of the variation when trained on randomly sampled sequences (fig. 3A, Random). Next, we investigated the local neighborhood around a sequence of interest and found that within the restricted sequence space it is possible to sample densely enough to allow a quadratic model (and even a linear model) to reconstruct a fairly good approximation of the local landscape. The fit is much better for Complete Subset than for Random Neighborhood, although no sequence in either data set is more than eight mutations from the focal genotype.
The models do okay at predicting the fitness of sequences LOCALLY around the sampled ones, but fail miserably once sequences diverge beyond roughly 10% from the training set, regardless of which technique was used to sample the space.
Within a LOCAL area, the sampling technique does matter for predicting the function of highly similar sequences. It turns out that if you sample by evolving sequences, a statistical model trained on those evolved sequences is among the best, because evolution retains information about which residues correlate with higher fitness.
But once you move away from local sequences, predictive power drops off into complete failure.
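The kind of regression model the paper evaluates can be sketched in a few lines: encode each (position, nucleotide) pair as a one-hot feature and fit main effects by least squares. The toy landscape, the sampling scheme, and every name below are my own assumptions, not the paper’s code; the point is just to show why a model trained locally breaks down far from its training set when the true landscape has epistasis.

```python
import numpy as np

rng = np.random.default_rng(0)
L, A = 12, "ACGU"
idx = {c: i for i, c in enumerate(A)}

# Toy landscape: main effects plus pairwise epistasis. The epistatic terms
# are what a main-effects regression cannot capture once sequences leave
# the neighborhood it was trained on.
main = rng.normal(size=(L, 4))
pair = rng.normal(size=(L, L, 4, 4))

def fitness(seq):
    f = sum(main[p, idx[c]] for p, c in enumerate(seq))
    for p in range(L):
        for q in range(p + 1, L):
            f += pair[p, q, idx[seq[p]], idx[seq[q]]]
    return f

def one_hot(seq):
    x = np.zeros(L * 4)
    for p, c in enumerate(seq):
        x[p * 4 + idx[c]] = 1.0
    return x

def mutants(seq, n, count):
    """Generate `count` sequences with exactly n random mutations."""
    out = []
    for _ in range(count):
        s = list(seq)
        for p in rng.choice(L, size=n, replace=False):
            s[p] = rng.choice([c for c in A if c != s[p]])
        out.append("".join(s))
    return out

focal = "".join(rng.choice(list(A), size=L))

# Train a main-effects regression on sequences at most 2 mutations away.
train = mutants(focal, 1, 100) + mutants(focal, 2, 100)
X = np.array([one_hot(s) for s in train])
y = np.array([fitness(s) for s in train])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

def mse(seqs):
    return float(np.mean([(one_hot(s) @ beta - fitness(s)) ** 2 for s in seqs]))

local_err = mse(mutants(focal, 2, 50))     # near the training neighborhood
distant_err = mse(mutants(focal, 10, 50))  # far outside it
print(f"local MSE:   {local_err:.2f}")
print(f"distant MSE: {distant_err:.2f}")
```

Locally, the epistatic contributions of unmutated positions are nearly constant and get silently absorbed into the fitted main effects, so the fit looks fine. Move the background far enough and those absorbed terms are all wrong, which is exactly the pattern the paper reports.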
The predictive power is strongly correlated to the landscape size and decreases rapidly as the number of allowed mutations are increased. Similarly, we see that as the predictive power decreases the correlation between the true fitness and residual size increases. For randomly sampled sequences it is only possible to attain a high predictive power if the median Hamming distance between sequences in the training set is less than 20 mutations, equal to roughly 90% sequence conservation in the data set. Note that the median Hamming distance between sequences is only slightly smaller than the maximum Hamming distance between sequences. This is because most of the randomly sampled sequences contain the maximum number of allowed mutations, since landscape size grows exponentially with the number of allowed mutations. We further see that at median Hamming distances greater than 80 no prediction is possible. Thus, only on the smallest landscapes used here is it possible for a random sampling of 60,000 sequences to attain a dense enough sampling to produce a good fit.
So basically, by the time you get down to roughly 60–70% sequence similarity (a median Hamming distance above 80 on a 217 nt sequence), the models collapse into complete failure at predicting the functionality of unsampled sequences.
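The combinatorics behind “landscape size grows exponentially with the number of allowed mutations” are easy to check directly. The number of sequences within m mutations of a 217 nt focal sequence is the sum over k ≤ m of C(217, k)·3^k: choose which k positions mutate, times 3 alternative nucleotides at each. A quick sketch (my own arithmetic, not from the paper):

```python
from math import comb

L = 217  # length of the U3 snoRNA used as the focal genotype

def landscape_size(m: int) -> int:
    """Number of sequences of length L within m mutations of a focal
    sequence: sum over k <= m of C(L, k) * 3**k."""
    return sum(comb(L, k) * 3**k for k in range(m + 1))

for m in (1, 2, 3, 8, 20):
    print(f"m = {m:2d}: {landscape_size(m):.3e} sequences")
```

Already at three allowed mutations the neighborhood holds tens of millions of sequences, dwarfing the 60,000-sequence training sets used in the paper; at twenty mutations (the ~90% conservation threshold quoted above) the count is astronomically beyond any conceivable sampling effort.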
Discussion
We assess the usefulness of regression models accounting for main effects and pairwise interactions between loci to approximate complex fitness landscapes, here represented by quasi-empirical RNA fitness landscapes. Our results show that achieving a high enough sampling density is crucial in order to obtain a good description of a fitness landscape. The curse of dimensionality ensures that it is not possible to sample densely enough to allow a simple model to accurately predict the fitness of any sequence in realistic fitness landscapes. However, while it is impossible to provide a good approximation of a complete fitness landscape in all but the simplest cases it is still possible to obtain an accurate representation of local regions of the fitness landscape by restricting the space sampled for the training set. But, since no model can predict the fitness of sequences that fall too far outside of the variation in its training set, the composition of the training set is extremely important. Care should be taken to select sequences that restrict the sequence space to those sequences we are interested in and also to select sequences that elucidate epistatic interactions between loci.