Review article on the likelihood of functionality in amino acid sequence space?


I have been very interested in the very long and often noisy conversations on the likelihood of functionality in amino acid sequence space. I would like to read a good review article that summarizes those 5000 papers that have been alluded to. Failing that, a detailed blog post from a biologist who has closely followed the literature would be nice to read.

Suggestions, anyone?


This is a well-meaning but poorly defined question, but perhaps @mercer or @art has a good suggestion.


Thanks, Joshua.

I’ll try to clarify. One of the big arguments going on in multiple threads here revolves around the density of protein function in the amino acid sequence space. Proponents of ID cite Axe’s paper and say it is infinitesimally rare. Proponents of mainstream biology such as @mercer and @art cite a large corpus of ~5000 papers and say functionality is vastly denser in the sequence space.

I would like to have a good summary of what mainstream biology has learned about the density of function in the amino acid space. I can’t read the 5000 papers, and the discussion threads here on the forum can be rather noisy. The din of intellectual battle is not bad, per se; it’s just that the cut-'n-thrust of arguments between scientists can seem a little baffling to onlookers like myself.

Thus I am looking for a good review article or blog post that I can learn from and recommend to others. Ideally, the article would explain both why Axe’s paper does not generalize well and what has been learned from the work of other researchers.

Is that clearer? If so, can anyone point me to such a resource?




@Chris_Falter this is a hard question to answer because “function” is poorly defined. Sequences with one particular function might be rare, but there is a combinatorially large number of possible functions too. Functional sequences are also closely connected in sequence space.

So with regard to one specific function, the answer is "rare," but landing on a sequence with any function at all might be easy.
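
To put rough numbers on the "specific function versus any function" distinction, here is a minimal back-of-the-envelope sketch. Everything in it is an assumption for illustration only: the per-function density `p_single`, the function counts, and the idea that functions are independent and equally rare are all hypothetical, not measured values from any paper discussed in this thread.

```python
import math

def p_any_function(p_single: float, n_functions: float) -> float:
    """Chance that one random sequence has at least one of n_functions.

    Assumes (hypothetically) that every function is equally rare and
    that functions are independent: P = 1 - (1 - p)^n.
    """
    # log1p/expm1 keep the result accurate when p_single is tiny.
    return -math.expm1(n_functions * math.log1p(-p_single))

# Hypothetical density of any ONE specific function among random sequences.
p_single = 1e-11

for n in (1, 1e6, 1e9, 1e11):
    print(f"{n:.0e} possible functions -> P(any) = {p_any_function(p_single, n):.3g}")
```

Under these made-up inputs, a single function is hit about once in 10^11 tries, while the chance of hitting some function grows roughly in proportion to the number of possible functions until it saturates. The real difficulty, of course, is that nobody knows the true values to plug in.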


That is the tough part. There are millions and perhaps billions of possible functions, so it is pragmatically impossible to say that a given amino acid sequence has no binding or enzymatic function.

An easier question to ask is if the sequence improves fitness.

There are issues with the experimental design, such as the influence of overexpression, but on its face the paper does suggest that function is common in random sequences.


Hi Chris,

Unfortunately, I haven’t anything immediately at hand that would fit the bill. I suspect that a comprehensive review article may not exist (although I would be delighted to be wrong). Our discussions here draw from several disparate fields - abzymes, random combinatorial studies, gene and protein evolution, structure/function (mutagenesis), and probably more - and it would be pretty hard to draw everything into one review that could find a home somewhere. (Hard to believe that everyone does not share our own obsession with this subject, eh?).

Have you read through my old blog post that tries to tie together Axe’s work with some other aspects of the field? If so, if you have questions or recommendations that I could use to add on, then maybe I can find some time to expand (and expound on) things.


Thanks, @Art, @T_aquaticus, and @swamidass! I will take a look at the two links suggested in this thread.


@art, what do you think about spearheading a review with some of us, including @mercer? This could be fun.


The plot thickens as the heart of a teacher collides with the intricate nuances of specialized knowledge. Calling it, potentially, “fun” is what we all hope will ring true in the latest chapter of getting at the central enigmas. Go for it!!


Older paper (2002), but a good place to start into the relevant literature:

Available as pdf here:

If you use Google Scholar, or just Google, you can follow the citation threads forward in time, and you’ll hit a lot of useful studies.


I found this section to be of interest:

There have been discussions on how many possible folds there are, and the distribution of those folds among different proteins. While this study isn’t gospel, it does give us a ballpark figure to work from.

You can track down papers that cite Axe here.

Or here.


Heh, many of them are from various creationists, including Axe, Gauger, Leisola, Abel, and so on.

It is probably impossible to get a “true” average for the density of functional proteins in amino acid sequence space. One problem is that the question depends on environmental context. A protein that is a nonfunctional and misfolding piece of junk to a hyperthermophilic bacterium living in a hydrothermal vent will work just fine under more moderate physiological conditions in a human skin cell.

That means you have to average over all possible physical environments, which would include all possible genetic contexts, including all possible chaperone proteins. That complicates the picture almost beyond comprehension. It is possible that most proteins which depend on a particular fold to function, would fail to fold on their own, but could be folded with the help of chaperones. Are they then truly nonfunctional?
How would you score the functionality of such a protein in some assay? You measure whether it rescues growth in some deletion strain, it seems to fail to fold on its own and doesn’t rescue growth, and then you conclude it’s a nonfunctional protein. But are there some conditions under which it would work? There’s no hope of ever giving a complete picture of the true functional landscape of protein sequence space for these reasons, so Axe’s study is in effect a fool’s errand. That’s why you’re likely to find a poverty of articles really attempting to give a serious estimate for the “likelihood of functionality in amino acid sequence space”. It just can’t be done.

A much more interesting question, in part because it is somewhat amenable to empirical analysis, is the interconnectedness of known protein functions. Could a sequence with function A be turned into a sequence with function B through mutations without becoming nonfunctional along the way, and did such a transition actually happen in the history of life on Earth? There are articles that attempt to answer questions of a similar nature, concerning how the different known proteins of life are distributed in protein sequence space, and the frequency with which functional shifts have occurred.
For example:
Furnham N, Sillitoe I, Holliday GL, Cuff AL, Laskowski RA, Orengo CA, et al. (2012) Exploring the Evolution of Novel Enzyme Functions within Structurally Defined Protein Superfamilies. PLoS Comput Biol 8(3): e1002403.


In order to understand the evolution of enzyme reactions and to gain an overview of biological catalysis we have combined sequence and structural data to generate phylogenetic trees in an analysis of 276 structurally defined enzyme superfamilies, and used these to study how enzyme functions have evolved. We describe in detail the analysis of two superfamilies to illustrate different paradigms of enzyme evolution. Gathering together data from all the superfamilies supports and develops the observation that they have all evolved to act on a diverse set of substrates, whilst the evolution of new chemistry is much less common. Despite that, by bringing together so much data, we can provide a comprehensive overview of the most common and rare types of changes in function. Our analysis demonstrates on a larger scale than previously studied, that modifications in overall chemistry still occur, with all possible changes at the primary level of the Enzyme Commission (E.C.) classification observed to a greater or lesser extent. The phylogenetic trees map out the evolutionary route taken within a superfamily, as well as all the possible changes within a superfamily. This has been used to generate a matrix of observed exchanges from one enzyme function to another, revealing the scale and nature of enzyme evolution and that some types of exchanges between and within E.C. classes are more prevalent than others. Surprisingly a large proportion (71%) of all known enzyme functions are performed by this relatively small set of 276 superfamilies. This reinforces the hypothesis that relatively few ancient enzymatic domain superfamilies were progenitors for most of the chemistry required for life.
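
For anyone who wants a concrete picture of what "turning function A into function B without becoming nonfunctional" means, here is a toy sketch of the idea as a graph search. To be clear, this is not the method used in the paper above: the four-letter "sequences", the `functional` set, and both helper functions are hypothetical and invented purely for illustration.

```python
from collections import deque

def one_mutation_apart(a: str, b: str) -> bool:
    """True if two equal-length sequences differ at exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1

def mutational_path(start, goal, functional):
    """Breadth-first search for a path of single substitutions from start
    to goal in which every intermediate sequence is in `functional`."""
    if start not in functional or goal not in functional:
        return None
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for seq in functional:
            if seq not in seen and one_mutation_apart(path[-1], seq):
                seen.add(seq)
                queue.append(path + [seq])
    return None  # goal is functional but isolated from start

# Hypothetical toy "functional" sequences (not real proteins).
functional = {"MKVL", "MKVI", "MRVI", "MRAI", "GGGG"}

print(mutational_path("MKVL", "MRAI", functional))  # connected by functional steps
print(mutational_path("MKVL", "GGGG", functional))  # isolated island -> None
```

In the toy set, the first pair is linked by a chain of functional single-substitution neighbors, while "GGGG" sits on an island no such chain can reach. The empirical question is how often real proteins look like the first case rather than the second.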


And I would add that I don’t see a scientific need for one.

Well, let me see if I can stir up the scientific need for one. :wink:

If anyone wants to see how counterintuitive – from an evolutionary perspective – the problem of the origin of protein fold superfamilies (FSF) can become, watch this talk, to the Royal Swedish Academy of Sciences, by Charles Kurland of Lund University in Sweden. I’ve timestamped the relevant starting point:

Kurland and his co-author Ajith Harish have been arguing vehemently in a series of publications for a FSF-rich LUCA – not a minimal cell, but rather one packed with “three fourths of the unique protein domain-superfamilies encoded by extant genomes.” (As an aside, Kurland can be amusingly blunt in this talk; e.g., “Rational thought is not a tool among phylogenists, actually” – ouch.) This inference of a domain-rich LUCA, of course, raises the question of the origin, or mode of origin, of the starting FSF abundance.

In a 2015 paper, Kurland and Harish are characteristically blunt. What they call the “seductive” model of continuous movement through protein sequence space, promoted by (for instance) John Maynard Smith, during the dominance of gradualistic or classical neo-Darwinism, won’t work. Cells do not tolerate the transitional states required to span the sequence and functional distances between discrete FSFs, but use sophisticated housecleaning machinery to sweep them out:

Nevertheless, a problem not solved by the modular assembly of natural proteins remains. This is the issue raised by Maynard Smith [73] when he asked, “Are all existing proteins parts of the same continuous network, and if so, have they all been reached from a single starting point?” The answer that might have been attractive to molecular biologists in 1970 was that indeed there might have been one or a few ancestral proteins. But, that answer is not now so obvious given the lack of homology between different SFs.

Of course, the epistatic editing system is made up of proteins and that system or pathway may not have evolved before the ancestral cohort of circa 1500 SFs belonging to MRUCA’s SF repertoire had evolved. In that eventuality, three fourths of all the extant SFs in modern genomes may have evolved under conditions in which SFs with sequence homology to other SF-coding sequences were tolerated. Presumably, this tolerance could be maintained until a minimum diversity of functions had evolved in the SF repertoire [52]. After that minimum diversity of SFs was attained, the fitness of the ancestral cells might have been improved by the implementation of the epistatic editing system. That is to say, the epistatic editing system may not have laundered the proteins evolving during an earlier period of cellular life.

During this putative earlier epoch the rates of evolution of novel proteins may have been much more rapid than in modern times precisely because the editing of misfolded proteins was minimal. Of course the cost of such facile structural evolution is more frequent cellular accidents due to aggregation of misfolded protein.

(From here: The phylogenomics of protein structures: The backstory - PubMed)

Notice that Kurland and Harish have to suspend the normal cellular rules to derive the rich FSF diversity needed in the LUCA starting set: “The evolution of novel proteins may have been much more rapid than in modern times.” In terms of abductive logic, this parallels directly the move made by many workers who study the Cambrian Explosion: the pace of developmental evolution, modifying phenotypes, was dramatically different back then, to such a degree that we can no longer expect to observe such evolution today.


Well that sure sounds dramatic. Is it true though?

It’s a joke. Are jokes true?

Or do they point to an underlying reality, which gives them their punch?

You must decide. (I thought it was funny, anyway.)


There are far more scientific ways to stir than by frantically moving the goalposts by conflating functions with superfamilies of folds. :wink:


Sorry I’m late to this party. One other resource that comes to mind is Andreas Wagner’s book Arrival of the Fittest. It has been a while since I read it, but as I recall there was substantive discussion of the density and connectedness of functionality in gene, protein, and metabolism space. It might be useful for a lay reader looking for a place to start.

I should add the caveat that some of the text and certainly some of the promotional material about the book is, shall we say, flowery in positioning the book as the missing piece that Darwin never knew about and so on. One has to sell copies, after all. Still, I believe there is some substance underneath, and the positive reviews in Nature and elsewhere lead me to think I’m not completely misremembering.