Review article on the likelihood of functionality in amino acid sequence space?

Chris_Falter · March 12, 2019, 6:23pm

Hi,

I have been very interested in the very long and often noisy conversations on the likelihood of functionality in amino acid sequence space. I would like to read a good review article that summarizes those 5000 papers that have been alluded to. Failing that, a detailed blog post from a biologist who has closely followed the literature would be nice to read.

Suggestions, anyone?

swamidass · March 12, 2019, 7:16pm

This is a well-meaning but poorly defined question. Perhaps @mercer or @art has a good suggestion.

Chris_Falter · March 14, 2019, 5:00pm

Thanks, Joshua.

I’ll try to clarify. One of the big arguments going on in multiple threads here revolves around the density of protein function in the amino acid sequence space. Proponents of ID cite Axe’s paper and say it is infinitesimally rare. Proponents of mainstream biology such as @mercer and @art cite a large corpus of ~5000 papers and say functionality is vastly denser in the sequence space.

I would like to have a good summary of what mainstream biology has learned about the density of function in the amino acid space. I can’t read the 5000 papers, and the discussion threads here on the forum can be rather noisy. The din of intellectual battle is not bad, per se; it’s just that the cut-'n-thrust of arguments between scientists can seem a little baffling to onlookers like myself.

Thus I am looking for a good review article or blog post that I can learn from and recommend to others. Ideally, the article would explain both why Axe’s paper does not generalize well and what has been learned from the work of other researchers.

Is that clearer? If so, can anyone point me to such a resource?

Thanks!

Chris

swamidass · March 14, 2019, 5:11pm

@Chris_Falter this is a hard question to answer because “function” is poorly defined. Density of a particular function might be rare, but there is a combinatorially large number of possible functions too. They are also closely connected in sequence space.

So with regard to one specific function, it may be rare, but landing on some function, any function at all, might be easy.
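
A back-of-the-envelope sketch of that arithmetic, assuming (purely for illustration) that a random sequence hits each possible function independently and with the same small probability. The values of `p` and the function counts below are invented, not measured, and real functions are neither independent nor equally likely:

```python
import math

def prob_any_function(p_single: float, n_functions: int) -> float:
    """P(at least one function) = 1 - (1 - p)^N, assuming independence.

    p_single must satisfy 0 <= p_single < 1.
    """
    # log1p/expm1 keep precision when p_single is tiny.
    return -math.expm1(n_functions * math.log1p(-p_single))

# Hypothetical numbers: any single function is very rare...
p = 1e-10
# ...but the chance of hitting *some* function grows with the number of functions.
for n in (1, 10**6, 10**9, 10**10):
    print(n, prob_any_function(p, n))
```

Under these made-up assumptions the probability of landing on some function climbs from negligible toward order one as the number of candidate functions approaches 1/p, which is the shape of the argument rather than an estimate of anything.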

T_aquaticus · March 14, 2019, 5:23pm

That is the tough part. There are millions and perhaps billions of possible functions, so it is practically impossible to say that a given amino acid sequence has no binding or enzymatic function.

An easier question to ask is if the sequence improves fitness.

It is generally assumed that new genes arise through duplication and/or recombination of existing genes. The probability that a new functional gene could arise out of random non-coding DNA is so far considered to be negligible, since it seems unlikely that such a RNA or protein sequence could have an initial function that influences the fitness of an organism. We have here tested this question systematically, by expressing clones with random sequences in E. coli and subjecting them to competitive growth. Contrary to expectations, we find that random sequences with bioactivity are not rare. In our experiments we find that up to 25% of the evaluated clones enhance the growth rate of their cells and up to 52% inhibit growth. Testing of individual clones in competition assays confirms their activity and provides an indication that their activity could be exerted either by the transcribed RNA or the translated peptide. This suggests that transcribed and translated random parts of the genome could indeed have a high potential to become functional. The results also suggest that random sequences may become an effective new source of molecules for studying cellular functions, as well as for pharmacological activity screening.
Random sequences are an abundant source of bioactive RNAs or peptides - PMC

There are issues with the experimental design, such as the influence of overexpression, but on its face the paper does suggest that function is common in random sequences.

Art · March 14, 2019, 5:23pm

Hi Chris,

Unfortunately, I haven’t anything immediately at hand that would fit the bill. I suspect that a comprehensive review article may not exist (although I would be delighted to be wrong). Our discussions here draw from several disparate fields - abzymes, random combinatorial studies, gene and protein evolution, structure/function (mutagenesis), and probably more - and it would be pretty hard to draw everything into one review that could find a home somewhere. (Hard to believe that everyone does not share our own obsession with this subject, eh?).

Have you read through my old blog post that tries to tie together Axe’s work with some other aspects of the field? If so, if you have questions or recommendations that I could use to add on, then maybe I can find some time to expand (and expound on) things.

Chris_Falter · March 14, 2019, 5:27pm

Thanks, @Art, @T_aquaticus, and @swamidass! I will take a look at the two links suggested in this thread.

Yours,
Chris

swamidass · March 14, 2019, 5:53pm

@art, what do you think about spearheading a review with some of us, including @mercer? This could be fun.

Guy_Coe · March 14, 2019, 6:15pm

The plot thickens as the heart of a teacher collides with the intricate nuances of specialized knowledge. Calling it, potentially, “fun” is what we all hope will ring true in the latest chapter of getting at the central enigmas. Go for it!!

pnelson · March 14, 2019, 9:16pm

Older paper (2002), but a good place to start into the relevant literature:

ncbi.nlm.nih.gov

Did evolution leap to create the protein universe?

B Rost, Current opinion in structural biology, Jun 2002

The genomes of over 60 organisms from all three kingdoms of life are now entirely sequenced. In many respects, the inventory of proteins used in different kingdoms appears surprisingly similar. However, eukaryotes differ from other kingdoms in that they use many long proteins, and have more proteins with coiled-coil helices and with regions abundant in regular secondary structure. Particular structural domains are used in many pathways. Nevertheless, one domain tends to occur only once in one particular pathway. Many proteins do not have close homologues in different species (orphans) and there could even be folds that are specific to one species. This view implies that protein fold space is discrete. An alternative model suggests that structure space is continuous and that modern proteins evolved by aggregating fragments of ancient proteins. Either way, after having harvested proteomes by applying standard tools, the challenge now seems to be to develop better methods for comparative proteomics.

Available as pdf here:

If you use Google Scholar, or just Google, you can follow the citation threads forward in time, and you’ll hit a lot of useful studies.

T_aquaticus · March 14, 2019, 10:02pm

I found this section to be of interest:

There have been discussions on how many possible folds there are, and the distribution of those folds among different proteins. While this study isn’t gospel, it does give us a ballpark figure to work from.

Art · March 14, 2019, 10:19pm

You can track down papers that cite Axe here.

Or here.

Rumraket · March 16, 2019, 12:12am

Heh, many of them are from various creationists, including Axe, Gauger, Leisola, Abel, and so on.

Rumraket · March 16, 2019, 12:29am

It is probably impossible to get a “true” average for the density of functional proteins in amino acid sequence space. One problem is that the question depends on environmental context. A protein that is a nonfunctional and misfolding piece of junk to a hyperthermophilic bacterium living in a hydrothermal vent will work just fine under more moderate physiological conditions in a human skin cell.

That means you have to average over all possible physical environments, which would include all possible genetic contexts, including all possible chaperone proteins. That complicates the picture almost beyond comprehension. It is possible that most proteins which depend on a particular fold to function would fail to fold on their own, but could be folded with the help of chaperones. Are they then truly nonfunctional?

How would you score the functionality of such a protein in some assay? You measure whether it rescues growth in some deletion strain, it seems to fail to fold on its own, it doesn’t rescue growth, and then you conclude it’s a nonfunctional protein. But are there some conditions under which it would work? There’s no hope of ever giving a complete picture of the true functional landscape of protein sequence space for these reasons, so Axe’s study is in effect a fool’s errand. That’s why you’re likely to find a poverty of articles really attempting to give a serious estimate for the “likelihood of functionality in amino acid sequence space”. It just can’t be done.

A much more interesting question, in part because it is somewhat amenable to empirical analysis, is the interconnectedness of known protein functions. Could a sequence with function A be turned into a sequence with function B through mutations without becoming nonfunctional along the way, and did such a transition actually happen in the history of life on Earth? There are articles that attempt to answer questions of a similar nature, concerning how the different known proteins of life are distributed in protein sequence space, and the frequency with which functional shifts have occurred.
For example:
Furnham N, Sillitoe I, Holliday GL, Cuff AL, Laskowski RA, Orengo CA, et al. (2012) Exploring the Evolution of Novel Enzyme Functions within Structurally Defined Protein Superfamilies. PLoS Comput Biol 8(3): e1002403.

Abstract

In order to understand the evolution of enzyme reactions and to gain an overview of biological catalysis we have combined sequence and structural data to generate phylogenetic trees in an analysis of 276 structurally defined enzyme superfamilies, and used these to study how enzyme functions have evolved. We describe in detail the analysis of two superfamilies to illustrate different paradigms of enzyme evolution. Gathering together data from all the superfamilies supports and develops the observation that they have all evolved to act on a diverse set of substrates, whilst the evolution of new chemistry is much less common. Despite that, by bringing together so much data, we can provide a comprehensive overview of the most common and rare types of changes in function. Our analysis demonstrates on a larger scale than previously studied, that modifications in overall chemistry still occur, with all possible changes at the primary level of the Enzyme Commission (E.C.) classification observed to a greater or lesser extent. The phylogenetic trees map out the evolutionary route taken within a superfamily, as well as all the possible changes within a superfamily. This has been used to generate a matrix of observed exchanges from one enzyme function to another, revealing the scale and nature of enzyme evolution and that some types of exchanges between and within E.C. classes are more prevalent than others. Surprisingly a large proportion (71%) of all known enzyme functions are performed by this relatively small set of 276 superfamilies. This reinforces the hypothesis that relatively few ancient enzymatic domain superfamilies were progenitors for most of the chemistry required for life.
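
The connectivity question posed above (can function A reach function B by single mutations without passing through a nonfunctional sequence?) can be made concrete with a toy graph search. Everything in this sketch is hypothetical: the four-letter alphabet, the three-residue sequences and the hand-picked "functional" set are invented for illustration and are not data from Furnham et al.

```python
from collections import deque

def neighbors(seq, alphabet):
    """Yield every sequence one substitution away from seq."""
    for i, old in enumerate(seq):
        for new in alphabet:
            if new != old:
                yield seq[:i] + new + seq[i + 1:]

def mutational_path(start, goal, functional, alphabet):
    """Breadth-first search for a shortest path of single substitutions
    that never leaves the functional set. Returns None if no path exists."""
    if start not in functional or goal not in functional:
        return None
    parent = {start: None}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        if cur == goal:
            path = []
            while cur is not None:
                path.append(cur)
                cur = parent[cur]
            return path[::-1]
        for nxt in neighbors(cur, alphabet):
            if nxt in functional and nxt not in parent:
                parent[nxt] = cur
                queue.append(nxt)
    return None

# Hypothetical toy landscape: "EEE" is functional but isolated.
ALPHABET = "ACDE"
FUNCTIONAL = {"AAA", "AAC", "ACC", "CCC", "EEE"}
print(mutational_path("AAA", "CCC", FUNCTIONAL, ALPHABET))
print(mutational_path("AAA", "EEE", FUNCTIONAL, ALPHABET))
```

In this toy landscape the first pair is connected through functional intermediates and the second is not. The hard part in practice, as noted above, is that membership in the "functional" set depends on the environment and the assay.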

Mercer · March 16, 2019, 3:56am

And I would add that I don’t see a scientific need for one.

pnelson · March 16, 2019, 1:08pm

Well, let me see if I can stir up the scientific need for one.

If anyone wants to see how counterintuitive – from an evolutionary perspective – the problem of the origin of protein fold superfamilies (FSF) can become, watch this talk, to the Royal Swedish Academy of Sciences, by Charles Kurland of Lund University in Sweden. I’ve timestamped the relevant starting point:

Kurland and his co-author Ajith Harish have been arguing vehemently in a series of publications for an FSF-rich LUCA – not a minimal cell, but rather one packed with “three fourths of the unique protein domain-superfamilies encoded by extant genomes.” (As an aside, Kurland can be amusingly blunt in this talk; e.g., “Rational thought is not a tool among phylogenists, actually” – ouch.) This inference of a domain-rich LUCA, of course, raises the question of the origin, or mode of origin, of the starting FSF abundance.

In a 2015 paper, Kurland and Harish are characteristically blunt. What they call the “seductive” model of continuous movement through protein sequence space, promoted by (for instance) John Maynard Smith, during the dominance of gradualistic or classical neo-Darwinism, won’t work. Cells do not tolerate the transitional states required to span the sequence and functional distances between discrete FSFs, but use sophisticated housecleaning machinery to sweep them out:

Nevertheless, a problem not solved by the modular assembly of natural proteins remains. This is the issue raised by Maynard Smith [73] when he asked, “Are all existing proteins parts of the same continuous network, and if so, have they all been reached from a single starting point?” The answer that might have been attractive to molecular biologists in 1970 was that indeed there might have been one or a few ancestral proteins. But, that answer is not now so obvious given the lack of homology between different SFs.

Of course, the epistatic editing system is made up of proteins and that system or pathway may not have evolved before the ancestral cohort of circa 1500 SFs belonging to MRUCA’s SF repertoire had evolved. In that eventuality, three fourths of all the extant SFs in modern genomes may have evolved under conditions in which SFs with sequence homology to other SF-coding sequences were tolerated. Presumably, this tolerance could be maintained until a minimum diversity of functions had evolved in the SF repertoire [52]. After that minimum diversity of SFs was attained, the fitness of the ancestral cells might have been improved by the implementation of the epistatic editing system. That is to say, the epistatic editing system may not have laundered the proteins evolving during an earlier period of cellular life.

During this putative earlier epoch the rates of evolution of novel proteins may have been much more rapid than in modern times precisely because the editing of misfolded proteins was minimal. Of course the cost of such facile structural evolution is more frequent cellular accidents due to aggregation of misfolded protein.

(From here: The phylogenomics of protein structures: The backstory - PubMed)

Notice that Kurland and Harish have to suspend the normal cellular rules to derive the rich FSF diversity needed in the LUCA starting set: “The evolution of novel proteins may have been much more rapid than in modern times.” In terms of abductive logic, this parallels directly the move made by many workers who study the Cambrian Explosion: the pace of developmental evolution, modifying phenotypes, was dramatically different back then, to such a degree that we can no longer expect to observe such evolution today.

Rumraket · March 16, 2019, 1:21pm

Well that sure sounds dramatic. Is it true though?

pnelson · March 16, 2019, 1:23pm

It’s a joke. Are jokes true?

Or do they point to an underlying reality, which gives them their punch?

You must decide. (I thought it was funny, anyway.)

Mercer · March 16, 2019, 8:27pm

There are far more scientific ways to stir than by frantically moving the goalposts by conflating functions with superfamilies of folds.

AndyWalsh · March 21, 2019, 6:31pm

Sorry I’m late to this party. One other resource that comes to mind is Andreas Wagner’s book Arrival of the Fittest. It has been a while since I read it, but as I recall there was substantive discussion of the density and connectedness of functionality in gene, protein, and metabolism space. It might be useful for a lay reader looking for a place to start.

I should add the caveat that some of the text, and certainly some of the promotional material about the book, is, shall we say, flowery in positioning the book as the missing piece that Darwin never knew about and so on. One has to sell copies, after all. Still, I believe there is some substance underneath, and the positive reviews in Nature and elsewhere lead me to think I’m not completely misremembering.
