Functions are not so rare at all, and definitely not isolated, in sequence space of biopolymers

Found this paper on arXiv:

At odds with a traditional view of molecular evolution that seeks a descent-with-modification relationship between functional sequences, new functions can emerge de novo with relative ease. At early times of molecular evolution, random polymers could have sufficed for the appearance of incipient chemical activity, while the cellular environment harbors a myriad of proto-functional molecules. The emergence of function is facilitated by several mechanisms intrinsic to molecular organization, such as redundant mapping of sequences into structures, phenotypic plasticity, modularity, or cooperative associations between genomic sequences. It is the availability of niches in the molecular ecology that filters new potentially functional proposals. New phenotypes and subsequent levels of molecular complexity could be attained through combinatorial explorations of currently available molecular variants. Natural selection does the rest.

It’s hard to pick out something specific to quote, as the entire thing is worth a read. It says basically everything I’ve been saying on this subject for a long time, and more.


Here’s another very relevant one:

This one is particularly interesting because here they test the capacity of Darwinian evolution to create a function de novo in a nonfunctional, minimally complex biopolymer.

My bold:


The spontaneous emergence of function from pools of random sequence RNA is widely considered an important transition in the origin of life. However, the plausibility of this hypothetical process and the number of productive evolutionary trajectories in sequence space are unknown. Here we demonstrate that function can arise starting from a single RNA sequence by an iterative process of mutation and selection. Specifically, we describe the discovery of both specific ATP or GTP aptamers - with micromolar affinity for their nucleotide ligand - starting each from a single, homopolymeric poly-A sequence flanked by conserved primer binding sites. Our results indicate that the ab initio presence of large, diverse random sequence pools is not a prerequisite for the emergence of functional RNAs and that the process of Darwinian evolution has the capacity to generate function even from single, largely unstructured RNA sequences with minimal molecular and informational complexity.

A couple of highlights from the discussion:

This canonical ATP aptamer fold appears to be a privileged molecular solution for ATP binding, as it has been isolated multiple times by in vitro selection on ATP-agarose in independent selections against ATP containing cofactors such as NAD [18] and SAM [19, 35] and was even identified in several bacterial and eukaryotic genomes [21, 24]. This parallels the case of the Hammerhead ribozyme motif [36, 37] and suggests that both of these motifs may represent minimal optimal solutions in RNA sequence space. The isolation of the canonical ATP aptamer motif in the T5 selection may also indicate its resilience to high mutation rates.

The isolation of a GTP binding aptamer (T8R16 / 409) (Fig. 4) from one of the ATP selection experiments may seem fortuitous, but is less surprising considering the previous discovery of numerous GTP binding aptamers at a relatively short mutational distance from the canonical ATP aptamer motif [38]. Furthermore, GTP aptamers in general appear to be much more structural and functional diverse and may therefore also be more abundant in sequence space. Indeed, “CA” repeats as short as 3 nucleotides [39] or G-quadruplex motifs [40] have been reported to bind GTP.

So, not isolated, nor rare.

How structure and function (the phenotype) of a biopolymer relates to its sequence (the genotype) is often discussed in terms of a fitness landscape of peaks and valleys with evolution viewed as a random walk in the direction of adaptive peaks. The shape e.g. degree of “ruggedness” of such landscapes is an active area of research [8, 25, 41, 42] and has clear implications for both the likelihood of reaching optimal fitness peaks and the best adaptive strategies. For example, comprehensive studies of both a GTP aptamer and a kinase ribozyme fitness landscape found them to be substantially rugged with few neutral networks (i.e. mutational paths) connecting adaptive peaks [8, 25]. However, the discovery of nucleotide binding aptamers from single starting sequences as described here suggests that there must be permissive mutational trajectories connecting adaptive peaks even to single, unstructured, non-functional sequences.


Yes! There are a bunch of ways to do A Thing. And there are a bunch of ways to get from doing A Thing to doing Another Thing.

I’m sure we’ll stop hearing otherwise from creationists any day now.


They basically never respond to threads that show evidence contrary to their unsupported axiomatic beliefs. It’s almost like they deliberately ignore them.



Yes, but there are immensely more ways not to do A Thing.

Without intelligence, very unlikely.

If the probability of A Thing is one in a million, and there are a million opportunities for That Thing, then the probability A Thing will occur at least once is about 63%. Further, ~63% of All Other Things will also occur.

Replace “a million” with any large number, and this still holds true.
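
For anyone who wants to check that figure, here is a minimal sketch of the arithmetic. It assumes the opportunities are independent and each has the same probability, which is my simplification and not something the post above spells out. The function name `prob_at_least_once` is made up for this illustration.

```python
import math

def prob_at_least_once(p: float, trials: int) -> float:
    """Chance that an event with per-trial probability p happens at least
    once in `trials` independent trials, i.e. 1 - (1 - p)**trials."""
    # log1p/expm1 keep the result accurate when p is tiny.
    return -math.expm1(trials * math.log1p(-p))

# One-in-N odds with N opportunities, for a few values of N.
for n in (10**3, 10**6, 10**9):
    print(f"N = {n:>13,}: {prob_at_least_once(1.0 / n, n):.4f}")

# The value these approach as N grows: 1 - 1/e.
print(f"limit 1 - 1/e  : {1 - math.exp(-1):.4f}")
```

As N grows, the result settles toward 1 - 1/e, which is where the "about 63%" comes from.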


I really don’t see what you could possibly base that view on. It is at odds with so much of biochemistry.

Basically all proteins have alternative or related potential functions they can perform, be it enzymes or DNA binding proteins like transcription factors, and these can be significantly improved with relatively little adaptation.



All of life has some intelligence – even bacteria.

That’s what the ID people are missing. They want to insist on an external source of intelligence. But the intelligence that matters is right there within the biosphere – hidden in plain sight.

Quantify, please.


But how many of those do Another thing? Answer: A lot.

That’s not what @T_aquaticus requested.

Yes, but there are still immensely more ways not to do anything at all.

It’s… the topic of this thread. Did you not read the subject of the thread before you started posting in it?

The provided article says the opposite. It shows that ‘function’ is easy, optimized function is difficult.

If it can vary from a number smaller than the effective population size of bacteria, then it is safe to say the low end does not represent a problem.

The authors unfortunately ignored an important aspect of the D2 domain’s functionality, namely its association with two other proteins. The coevolution of these proteins would reduce the difficulty of optimization at any point in the evolution. Regardless, it is a non-issue for any function, as that requires far fewer searches.

Your own article shows that function is common in sequence space when function is defined!


Define ‘immensely’. 5-10 orders of magnitude? Sure. The often suggested >100 orders of magnitude? Certainly not. Having a one in a billion chance of getting a functional protein from de novo translation is entirely sufficient.



That has to be getting close to saying that function doesn’t exist at all. What do we get if we multiply the number of known functions by 100 orders of magnitude?

I don’t follow your question in the context of my previous comment. Sorry.

This is an important point with respect to how evolution finds new functional proteins. We have to be mindful that what determines success in evolution is effect on fitness. Not HOW it has an effect on fitness. Just that it has an effect on fitness. There are innumerable ways something can positively contribute to fitness.

A novel protein might be an enzyme that speeds up a useful chemical reaction (of which there are many millions), it might assist another protein in folding (of which all organisms encode thousands, and which can all be potentially assisted and stabilized in billions of possible ways), it might simply buffer against misfolding or enhance overall temperature stability, it might block or reduce expression of a useless gene, or enhance expression of another. It might insert into the membrane and interact with other molecules there, it might act as a chelating agent against toxic metals or other charged ions. The possibilities are virtually endless.

That means that even if each particular protein function (chelating some toxic alloy of nickel, say) is relatively rare in sequence space, there are so many other possible functions that could contribute to fitness that it might not be all that rare to discover one that is useful, even if that specific function is rare.

You could have individual functions be (say) as rare as 10^-40, but if there are 10^32 different useful functions that could positively affect fitness, then naively you might say the probability of discovering some useful function by one randomly chosen sample from sequence space is only 10^-8.

Now add to this the phenomenon of constant exposure of organisms to novel environmental challenges, and the fact that there is an ongoing pervasive low-level transcription of most of the genome for basically all organisms. With many millions of different transcripts being produced at low levels from both coding and non-coding regions, both in and out of frame, and from opposite strands of DNA, the opportunity for discovering something useful massively increases.
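
To make the naive arithmetic concrete, here is a rough sketch using the illustrative figures from this post (10^-40 per function, 10^32 useful functions). The assumptions are mine: that the functions do not overlap in sequence space, so their probabilities simply add, and that samples are independent. The sample counts in the loop are purely hypothetical, and both function names are made up for the example.

```python
import math

def prob_any_function(p_single: float, n_functions: float) -> float:
    """Naive chance that one random sequence has SOME useful function,
    assuming the functions are mutually exclusive (probabilities add)."""
    return min(1.0, p_single * n_functions)

def prob_hit_in_samples(p_any: float, samples: int) -> float:
    """Chance that at least one of `samples` independent random sequences
    turns out to be useful: 1 - (1 - p_any)**samples."""
    return -math.expm1(samples * math.log1p(-p_any))

# Illustrative figures from the post, not measurements.
p_any = prob_any_function(1e-40, 1e32)
print(f"per-sample chance of any useful function: {p_any:.1e}")

# Hypothetical numbers of sampled sequences.
for samples in (10**6, 10**8, 10**10):
    print(f"{samples:.0e} samples -> {prob_hit_in_samples(p_any, samples):.3f}")
```

The point is only that the per-function rarity gets multiplied twice over, once by the number of ways to be useful and once by the number of sequences tried.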


Funny, there’s an entire article dedicated to discussing that statement:

Some key statements:

It is now well accepted that most — and probably all — extant enzymes are, in fact, promiscuous [5, 6].

Recent large-scale studies, both computational and experimental, have opened our eyes to the enormous functional diversity among existing enzyme superfamilies, the vastness of ‘promiscuity space,’ and therefore the seemingly limitless potential for future evolutionary innovation. Baier et al. surveyed the functional diversity, as represented by Enzyme Commission (EC) numbers, in five common superfamilies [7•]. Each superfamily contained enzymes from all six of the EC classes (Figure 1a). Furnham et al. went further and used a phylogenetic approach [8] to reconstruct the evolutionary histories of 379 superfamilies from the Class, Architecture, Topology, Homology (CATH) database, and to ask how often a change in EC number was observed over the course of their evolution [9•]. While 81% of the functional changes were within an EC class, every possible change between EC classes was also observed (Figure 1b), with the exception of a change from a ligase (EC class 6) to an isomerase (EC class 5). These bioinformatics studies emphasize that there is little, if anything, that constrains particular catalytic chemistries to particular folds.

Four high-throughput experimental studies (reviewed in detail elsewhere [7•]) have reached a similar conclusion. Dozens of enzymes from within the cytosolic glutathione transferase [10], β-keto acid cleavage enzyme [11], metallo-β-lactamase [12], and haloalkanoate dehalogenase [13••] superfamilies were each tested for activity towards a range of different substrates. In each case, many enzymes were found to have multiple functions in vitro. In the most comprehensive study, 217 members of the haloalkanoate dehalogenase superfamily were expressed, purified, and screened for phosphatase or phosphonatase activity towards 167 substrates (most of which were naturally occurring metabolites). The authors discovered breathtakingly broad substrate specificities. A median of 15.5 substrates were recognized by each enzyme, 50 of the enzymes could utilize 40 or more substrates, and remarkably, one enzyme could utilize 143 [13••].