Recent advances in the study of the protein universe and its evolution

This thread covers recent findings on the protein universe, specifically the occurrence (frequency, likelihood) and evolution of protein structure and function, including the origin and occurrence of protein folds.

The thread is not about the work of Douglas Axe, whose 2004 paper fails to address the questions of interest. This failure is due to a fatally flawed experimental approach which, even if it had been better designed, was never adequate to answer questions of evolution and structure of the protein universe. For discussion of Axe’s failed paper, see this ongoing thread, especially @Art Hunt’s comments there, and also his comments here at PS several years ago, which expand on an analysis he first posted online almost 20 years ago.

It’s also not about gods or the nature of reason.

We’re in the Side Conversations section, which allows rapid responses (no wait for moderation) and unlimited posts. I will be attempting to moderate the discussion by flagging and/or moving posts that are off-topic or inappropriate (repetitive, dishonest, misleading, AI-generated, and especially quotemined). I hope there isn’t much slop, but when I see it, I’ll send it to either the ongoing Axe thread or the Argument Clinic.

Here are the first four papers we can discuss. I’ll post each one separately with my comments, and then I hope we’ll hear from others.

  1. In silico evolution of globular protein folds from random sequences
    PNAS, open access, June 2025.

  2. Genetics, energetics, and allostery in proteins with randomized cores and surfaces
    Science, not open access, July 2025. I will provide links to the preprint version and the PubMed Central version, and can send the PDF on request.

  3. The Emergence of Novel Versus Known Three-Dimensional Structures from Random Sequences
    bioRxiv preprint, open access, December 2025

  4. Uncovering new families and folds in the natural protein universe
    Nature, open access, Sept 2023

3 Likes

A few recent papers (last 2-3 years) have begun to explore the protein universe with tools that didn’t exist (and, I think, could only be dreamed about) ten years ago: 1) AlphaFold and its descendants, which enable the detailed visualization and initial characterization of proteins that “exist” only as a sequence; and 2) AI/machine learning (ML) tools such as LLMs, which enable the exploration of sequence space at vast and increasing speeds and capacities. This fact is noted by the authors of our paper.

This open access paper is IMO the most important one to read in this thread, not because it provides a final definitive answer to a particular question but because the authors clearly delineate the questions and consider how to address them. The work itself is pretty technical but the text and especially the Discussion should be approachable to non-specialists. I provide some excerpts below but again, the paper is fun to read and I encourage those who are genuinely curious about protein fold evolution to simply read it (and maybe have a look at the references they cite).

You can get the gist from the Significance paragraph:

Origin of protein folds is an essential early step in the evolution of life that is not well understood. We address this problem by developing a computational framework approach for protein fold evolution simulation (PFES) that traces protein fold evolution in silico at the level of atomistic details. Using PFES, we show that stable, globular protein folds could evolve from random amino acid sequences with relative ease, resulting from selection acting on a realistic number of amino acid replacements. About half of the in silico evolved proteins resemble simple folds found in nature, whereas the rest are unique. These findings shed light on the enigma of the rapid evolution of diverse protein folds at the earliest stages of life evolution.

The Abstract is blunt about the difficulty of the question:

The origin and evolution of protein folds are among the most challenging, long-standing problems in biology. We developed Protein Fold Evolution Simulator (PFES), a computational approach that simulates evolution of globular folds from random amino acid sequences with atomistic details. PFES introduces random mutations in a population of protein sequences, evaluates the effect of mutations on protein structure, and selects a new set of proteins for further evolution. Iteration of this process allows tracking the evolutionary trajectory of a changing protein fold that evolves under selective pressure for protein fold stability, interaction with other proteins, or other features shaping the fitness landscape. We employed PFES to show how stable, globular protein folds could evolve from random amino acid sequences as monomers or in complexes with other proteins. The simulations reproduce the evolution of many simple folds of natural proteins as well as emergence of distinct folds not known to exist in nature. We show that evolution of small globular protein folds from random sequences, on average, takes 1.15 to 3 amino acid replacements per site, depending on the population size, with some simulations yielding stable folds after as few as 0.2 replacements per site. These values are lower than the characteristic numbers of replacements in conserved proteins during the time since the Last Universal Common Ancestor, suggesting that simple protein folds can evolve from random sequences relatively easily and quickly. PFES tracks the complete evolutionary history from simulations and can be used to test hypotheses on protein fold evolution.
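The mutate, evaluate, select loop described in the abstract can be illustrated with a toy sketch. This is not PFES: the real tool scores predicted 3D structures, whereas `toy_fitness` below is a made-up stand-in that simply rewards an arbitrary hydrophobic/polar pattern. The population size, sequence length, hydrophobic set, and truncation selection scheme are all assumptions chosen for illustration. The last return value counts replacements along the winning lineage divided by sequence length, which is loosely analogous to the "replacements per site" figure quoted above.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
HYDROPHOBIC = set("AFILMVW")           # arbitrary set used only by the toy score


def toy_fitness(seq):
    """Hypothetical stand-in for a stability score (NOT a structure predictor).

    Rewards hydrophobic residues at every third site and polar ones elsewhere.
    Returns a value between 0 and 1.
    """
    hits = sum((i % 3 == 0) == (aa in HYDROPHOBIC) for i, aa in enumerate(seq))
    return hits / len(seq)


def mutate(seq, rng):
    """Replace one randomly chosen site with a different amino acid."""
    pos = rng.randrange(len(seq))
    new_aa = rng.choice([a for a in AMINO_ACIDS if a != seq[pos]])
    return seq[:pos] + new_aa + seq[pos + 1:]


def evolve(length=30, pop_size=20, generations=200, seed=0):
    """Run the loop; return (starting best, final best, replacements per site)."""
    rng = random.Random(seed)
    # Each individual is (sequence, replacements accumulated in its lineage).
    population = [
        ("".join(rng.choice(AMINO_ACIDS) for _ in range(length)), 0)
        for _ in range(pop_size)
    ]
    start_best = max(toy_fitness(s) for s, _ in population)
    for _ in range(generations):
        # One mutant offspring per parent.
        offspring = [(mutate(s, rng), n + 1) for s, n in population]
        # Truncation selection: keep the fittest pop_size of parents + offspring.
        pool = sorted(population + offspring,
                      key=lambda ind: toy_fitness(ind[0]), reverse=True)
        population = pool[:pop_size]
    best_seq, n_replacements = population[0]
    return start_best, toy_fitness(best_seq), n_replacements / length


if __name__ == "__main__":
    start, end, per_site = evolve()
    print(f"best fitness {start:.2f} -> {end:.2f}, "
          f"{per_site:.2f} replacements per site in the winning lineage")
```

Because parents stay in the selection pool, the best score can never decrease between generations. Swapping `toy_fitness` for a score derived from a structure predictor is where all of the real difficulty (and the real science) lies.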

The first paragraph of the Discussion:

The emergence of stable, globular protein folds from random sequences is arguably the principal problem of protein evolution and one of the major challenges in the study of the origin of life. Once diverse globular folds evolved, the subsequent evolution of proteins throughout the 4 billion year history of life on Earth was a relatively straightforward and fairly well understood process. Successful attempts to computationally imitate protein evolution have been made previously, but these studies either focused on a particular part of the protein with amino acid substitutions that did not change the protein fold (49–51) or were limited to the exploration of simplified models in lattice space and reduced amino acid alphabets (7, 18). Here, in contrast, we simulated protein evolution with all-atom models, allowing us to observe how protein folds nucleated from random sequences and how the accumulation of amino acid substitutions led to large-scale conformational rearrangements, changing the general architecture of the fold. This critical difference that was made possible by the advent of powerful tools for protein structure prediction, such as AlphaFold and large language model based ESMfold, creates the opportunity to recapitulate protein fold evolution in detail not accessible previously.

In silico evolution of globular protein folds from random sequences
PNAS, open access, June 2025.

See also a PS thread from last year, just after the paper was published: Maybe evolution of protein folds is.... easy

3 Likes

Thanks for digesting that paper. Can we define “protein fold” so as to be able to tell whether two proteins display the same or different folds?

2 Likes

Yes but it’s going to be arbitrary. Loops can grow, helices can bend, sheets can extend, and so on. Protein fold space is ultimately fluid.

That said, you could decide on some sort of limit (I don’t even know how they do this, but I suppose with some elaborate set of rules you can just state that “if the sheet grows by more than X residues, if the loop turns by more than X degrees, etc.”), beyond which it becomes a different fold.

1 Like

I’ve heard somewhere that there is only a small number (how small not clear) of protein folds found in nature. It would be good to link this claim with the definition under which it’s true.

2 Likes

According to this paper:

CLASSIFICATIONS OF FOLDS
Early Work on Classification
Early on, scholars concluded that one can catalog and classify all natural protein folds (71). This idea was supported by initial data: The structures solved in the early days of structure determination (1970s and 1980s) included many examples of the same folds, such as lysozyme-like folds, NAD(P)-binding Rossmann folds, globin-like folds, trypsin-like serine proteases, and immunoglobulin-like β-sandwich folds. In theory, there is a general consensus on the level of similarity required for two proteins to share the same fold: The proteins must share (a) the same secondary structures with similar three-dimensional arrangement (denoted architecture) and (b) the same path through the structure taken by the polypeptide chain (denoted topology). Thus, in the 1990s, two teams headed by Murzin and Orengo, respectively, embarked on the heroic effort of building the SCOP (structural classification of proteins) (81) and CATH (class, architecture, topology, homology) (84) catalogs of all protein folds. For historical accuracy, we note that around that time FSSP (families of structurally similar proteins) (53), a database of protein structural similarities found automatically, was also created. These classifications provide an ordered view of structure space, with the goal of facilitating a better understanding of its characteristics and evolution.

It goes on from there to discuss many reasons why it’s actually not that straightforward. It’s rather extensive.

2 Likes

Do they mention how many folds there are under the definition they provide? Even though it’s not that simple.

1 Like

2 Likes

My understanding is that this claim/conclusion is under debate. From the 2023 paper linked below:

…in recent years, novel protein folds have rarely been discovered, suggesting that nearly all folds existing in nature have been found. This does not necessarily indicate that all folds accessible to the polypeptide chain have been uncovered. Although debated, it has been suggested that nature may have sampled only a small fraction of the possible fold space during evolution. We investigated this hypothesis through de novo protein design for the folds that have not been sampled by natural evolution.

Below is the abstract; clearly this question (how many folds have been found by evolution?) is open and interesting. I think, BTW, that the paper describes some clear criteria for what a “new fold” is, which they would need in order to do their analysis.

A fundamental question in protein evolution is whether nature has exhaustively sampled nearly all possible protein folds throughout evolution, or whether a large fraction of the possible folds remains unexplored. To address this question, we defined a set of rules for β-sheet topology to predict novel αβ-folds and carried out a systematic de novo protein design exploration of the novel αβ-folds predicted by the rules. The designs for all eight of the predicted novel αβ-folds with a four-stranded β-sheet, including a knot-forming one, folded into structures close to the design models. Further, the rules predicted more than 10,000 novel αβ-folds with five- to eight-stranded β-sheets; this number far exceeds the number of αβ-folds observed in nature so far. This result suggests that a vast number of αβ-folds are possible, but have not emerged or have become extinct due to evolutionary bias.

Exploration of novel αβ-protein folds through de novo design

2 Likes

Very similar to estimates of the number of body plans.

How do you mean? In the sense that folds and body plans involve subjective judgments? Or that the estimates are debated?

Both, and really the subjectivity results in the disagreement.

1 Like

This next paper is a dense tour de force of molecular biology and biophysics. Here is a paragraph from the Introduction in which the authors explain the problem and the challenge (I removed references, numbering in the dozens, for readability):

Quantifying the effects of individual mutations or pairs of mutations provides little information about the genetic and energetic architecture of cores and surfaces because it only explores very local sequence space, revealing the outcome when one or two side chains are changed. Rather, what is needed are experiments at scale, in which the side chains of many positions are simultaneously changed in many different combinations — an approach referred to as combinatorial mutagenesis or core/surface randomization. However, to date, the number of alternative core and surface genotypes that have been experimentally characterized is extremely small for any protein. The lack of experimental data limits our understanding of genetic and energetic architecture and our ability to predict sequence evolution over large evolutionary distances.

Link at Science:

Genetics, energetics, and allostery in proteins with randomized cores and surfaces

Preprint version is dated May 2024, the day after the manuscript was submitted to Science, so this is the manuscript before peer review and revision:
https://www.biorxiv.org/content/10.1101/2024.05.11.593672v1.full

DM me for the PDF.

Structured abstract:

INTRODUCTION
Proteins typically contain hydrophobic amino acids buried in their cores and polar amino acids on their solvent-exposed surfaces. The rules governing which combinations of the 20 possible amino acids constitute stable and functional protein cores and surfaces are not well understood. This is partly because of the combinatorial explosion of possibilities when considering more than a few residues — experimental characterization of all combinations quickly becomes daunting.

RATIONALE
To better understand the genetic architecture of protein cores and surfaces, we designed experiments in which we quantified the stability of tens of thousands of proteins with randomized cores and surfaces, using reduced amino acid alphabets to bias toward stable combinations. For proteins with randomized cores, we also quantified their ability to bind to a ligand through a surface binding interface.

RESULTS
We found that very large numbers of proteins with randomized core or surface sequences are stable. However, we also observed that stable proteins with alternative core sequences quite frequently have impaired binding to a ligand; i.e., they are functionally impaired. We used our data to train energy models to accurately predict the stability and binding of proteins with randomized sequences. These models are simple and interpretable, with mutations having fixed additive energetic effects and a small contribution from energetic interactions between specific pairs of mutations. These energy models successfully identify the combinations of amino acids present in natural proteins that have evolved over more than a billion years, with only rare energetic interactions that we experimentally identify that prevent the transplantation of cores between highly diverged proteins.

CONCLUSION
Our results show that vast numbers of amino acid combinations can replace the core or surface of a small protein and that both the stability and binding of these proteins with randomized sequences can be predicted with simple energy models. These models also identify amino acid combinations present in natural proteins. However, changing the core of a small protein frequently disrupts its ability to bind a ligand, presumably through changes in surface conformation or altered dynamics. Indirect “allosteric” effects of mutations may thus be an important influence on the evolution of protein sequences.


I’ll follow up tonight or tomorrow with an outline of the paper and some of my comments.

1 Like

Haven’t forgotten. Hope to post outline and comments on the Escobedo et al. paper this weekend.

In 2011 I heard a talk at the Evolution meetings by Tobias Sikosek, who was then a postdoc in Toronto, I think in Larry Moran’s department. He was emphasizing that proteins do not just have a single folded structure but can bounce around among alternative minima in the energy landscape for folds. That can help them make a more gradual transition among folds, by changing the frequency with which they are in particular local minima. Some papers of his on this:

Tobias Sikosek and Hue Sun Chan. 2014.
Biophysics of protein evolution and evolutionary protein biophysics
J R Soc Interface (2014) 11 (100): 20140419
(https://royalsocietypublishing.org/rsif/collection/145/Journal-of-the-Royal-Society-Interface-reviews)
https://doi.org/10.1098/rsif.2014.0419

Sikosek T, Krobath H, Chan HS (2016) Theoretical Insights into the Biophysics of Protein Bi-stability and Evolutionary Switches. PLoS Comput Biol 12(6): e1004960.

2 Likes

Of course! They transition between METAstable structures. That’s obvious to those of us who study proteins from the perspective of interesting/important functions. The more stable structures are far easier to determine. The troponin complex is a fine example of this.

1 Like

Finally following up on Escobedo et al. first discussed here. Below is an excellent outline of the paper (from Gemini):


This outline summarizes the research article “Genetics, energetics, and allostery in proteins with randomized cores and surfaces,” which investigates the sequence-function landscape of protein hydrophobic cores and surfaces using high-throughput combinatorial mutagenesis and thermodynamic modeling.

I. Introduction and Rationale

  • Problem Statement: Understanding the rules governing stable and functional protein cores is limited by the “combinatorial explosion” of potential sequence combinations, making experimental characterization of more than a few residues difficult.

  • Core vs. Surface Paradigms: While core residues are typically conserved and sensitive to mutation, surface residues evolve faster but are critical for functions like ligand binding.

  • Competing Hypotheses:

      1. Complex Mapping: Dense side-chain packing creates complex genotype-phenotype maps with high-order epistatic interactions.

      2. Simple/Malleable Mapping: Core packing is malleable, allowing many independent or additive energetic contributions to stability.

  • Experimental Objective: Quantify the stability and binding of tens of thousands of proteins with randomized cores and surfaces to define their genetic and energetic architecture.

II. Experimental Design and Scale

  • Model Systems: Three structurally diverse small proteins: FYN-SH3 (human tyrosine-protein kinase), CI-2A (barley protease inhibitor), and CspA (E. coli cold shock protein).

  • Combinatorial Mutagenesis: Seven core residues in each protein were randomized using a reduced hydrophobic alphabet (F, I, L, M, V), creating libraries of 78,125 variants per protein.

  • Phenotypic Assays:

      1. AbundancePCA: A validated cellular assay quantifying folded protein concentrations over three orders of magnitude as a proxy for stability.

      2. BindingPCA: Quantified binding affinity between core-randomized FYN-SH3 variants and a high-affinity peptide ligand.

  • In Vitro Validation: Representative variants were characterized via Circular Dichroism (CD) and NMR to confirm cooperative two-state folding and conserved secondary structure despite altered side-chain packing.
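The library size quoted in the outline follows directly from the design: five allowed residues at each of seven core positions gives 5^7 = 78,125 combinations. A minimal sketch that checks the arithmetic and enumerates the sequence space (the function names are illustrative, not taken from the paper):

```python
from itertools import product

CORE_ALPHABET = "FILMV"  # reduced hydrophobic alphabet named in the outline
N_CORE_SITES = 7         # number of randomized core positions per protein

# Every site is chosen independently, so the count is alphabet_size ** sites.
library_size = len(CORE_ALPHABET) ** N_CORE_SITES


def core_variants(alphabet=CORE_ALPHABET, n_sites=N_CORE_SITES):
    """Lazily yield every core sequence in the combinatorial library."""
    for combo in product(alphabet, repeat=n_sites):
        yield "".join(combo)


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    print(library_size)
    print(sum(1 for _ in core_variants()))
```

For comparison, the unrestricted 20-letter alphabet at the same seven sites would give 20^7 = 1.28 billion combinations, which is why a reduced alphabet makes the experiment tractable.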

III. Genetic Architecture of Protein Cores

  • Vast Solution Space: Although ~93% of core randomizations significantly reduced abundance, thousands of alternative core sequences remained stable (e.g., >12,000 for FYN-SH3).

  • Mutational Robustness: A surprising number of variants with 5, 6, or even 7 core substitutions maintained high stability, indicating that core packing is highly degenerate and malleable.

  • Simple Energetic Architecture:

      1. Additive Trait Models: Linear models captured ~63% of the variance in stability.

      2. Thermodynamic Models: Using a sigmoidal Boltzmann partition function, second-order models (including pairwise couplings) captured up to 90% of the variance.

  • Sparsity: Most energetic couplings are effectively zero; stability is primarily driven by the additive effects of individual mutations.
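To make the "additive energies plus sparse pairwise couplings, passed through a sigmoid" idea concrete, here is a minimal sketch of a generic two-state folding model. It is not the authors' fitted model or their software: the energy values, the 303 K temperature, and the function names are all assumptions for illustration. The only physics used is the standard two-state relation, fraction folded = 1 / (1 + exp(ΔG/RT)), with ΔG the folding free energy (negative means stable).

```python
import math

R_KCAL = 0.001987  # gas constant in kcal/(mol*K)


def folding_energy(variant, wild_type, dg_wt, additive, pairwise=None):
    """Folding free energy (kcal/mol) of a variant.

    additive: {(position, residue): energy change} for single substitutions.
    pairwise: {((pos_i, res_i), (pos_j, res_j)): coupling energy}, pos_i < pos_j.
    Terms missing from either dict are treated as zero (the sparse case).
    """
    muts = [(i, aa) for i, (aa, wt) in enumerate(zip(variant, wild_type))
            if aa != wt]
    dg = dg_wt + sum(additive.get(m, 0.0) for m in muts)
    if pairwise:
        for a in range(len(muts)):
            for b in range(a + 1, len(muts)):
                dg += pairwise.get((muts[a], muts[b]), 0.0)
    return dg


def fraction_folded(dg, temperature=303.0):
    """Two-state Boltzmann fraction folded for folding free energy dg."""
    return 1.0 / (1.0 + math.exp(dg / (R_KCAL * temperature)))


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to show the shape of the calculation.
    wild_type = "IIIIIII"
    variant = "LIIFIII"
    additive = {(0, "L"): 0.5, (3, "F"): 1.0}
    pairwise = {((0, "L"), (3, "F")): -0.3}
    dg_add = folding_energy(variant, wild_type, -2.0, additive)
    dg_pair = folding_energy(variant, wild_type, -2.0, additive, pairwise)
    print(fraction_folded(dg_add), fraction_folded(dg_pair))
```

The sigmoid is what makes an additive energy model look non-additive at the phenotype level: two mildly destabilizing substitutions can each leave a protein mostly folded, yet together push it past the midpoint, with no energetic coupling between them at all.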

IV. Allostery and Functional Constraints

  • Functional Impairment: Many variants with stable alternative cores showed significantly reduced ligand binding.

  • Core-Surface Coupling: Binding surfaces are often allosterically coupled to the core. As the number of core mutations increased, the negative impact on binding affinity outpaced the impact on stability.

  • Predictive Modeling: Simple additive energy models can also predict binding affinity, though including pairwise energetic couplings (epistasis) improves performance, highlighting the role of allosteric interactions.

V. Evolutionary Implications

  • Predicting Natural Evolution: Energy models trained on a single protein (FYN-SH3) successfully identified core combinations found in natural homologs across a billion years of evolution.

  • Core Transplantation: The models identify rare energetic couplings that prevent “transplanting” a core from a distant homolog into the FYN-SH3 scaffold, explaining why some seemingly stable cores fail in different contexts.

  • Allostery as a Constraint: The study suggests that sequence evolution is constrained not just by stability, but by the indirect allosteric effects core mutations have on surface functions.

2 Likes