Simulating 500 million years of evolution with a language model

Two things I note about this new paper:

First, the tool (an LLM called ESM3) looks fun and interesting. The authors describe it thus:

ESM3 achieves a scalable generative model of the three fundamental properties of proteins, sequence, structure, and function, through language modeling.

They can prompt it with parameters/info far more detailed than amino acid sequence. They claim that it differs from previous models in these ways:

Previous generative modeling efforts for proteins have focused primarily on individual modalities, leveraging complex architectures and training objectives for structure that represent proteins as three-dimensional objects. To date, the only language models that have been scaled are for protein sequences. In ESM3 sequence, structure, and function are represented through alphabets of discrete tokens.

Second, they use ESM3 to make a new/different version of GFP. That’s my interpretation of the protein they generate/discover, because their process (which they call “chain-of-thought”) did start with constraints (parameters, inputs) based on known anchors of function in GFP. (For aficionados, they prompted the model with 6 key residues, plus the structure of a 14-residue section, of GFP.) The new protein is thus (my interpretation here) a GFP variant, albeit one that is very different in primary sequence. They calculate the difference between this new variant and its closest “relative” in the GFP family to be equivalent to 500 million years of evolution (hence the title).

In other words, they didn’t prompt the tool with “make me a protein that fluoresces in X range, and make it smaller than GFP, please and thank you” but it does seem the tool is far more powerful than previous ones, and maybe someday the prompts can be a lot more open ended.

The paper is a prepub version and is not OA but it was preprinted less than 3 weeks ago. I suspect the final publication will be OA but am happy to share the PDF on request. The tool and related toys are here.

Very interesting paper, though the result with a functional GFP variant with only 58% sequence identity isn’t all that surprising, even in the context of the stringent requirements for the fluorescence function of GFP. Many natural proteins are known that have diverged, sometimes to extreme degrees (below the ~5% sequence identity expected from just random chance), yet remain functional, and have done so over the history of life.
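For anyone who wants to play with the numbers, here is a minimal sketch of how percent identity and the chance baseline can be computed. It assumes the two sequences are already aligned (gaps written as "-") and that identity is counted over gap-free columns only; other definitions use a different denominator. The function names and toy sequences are made up for illustration and are not from the paper.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip gap columns (one of several possible conventions)
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def chance_identity(frequencies: list[float]) -> float:
    """Expected percent identity at one aligned position for two random
    sequences drawn independently from the same residue frequencies."""
    return 100.0 * sum(p * p for p in frequencies)


# Assumption: all 20 amino acids equally frequent, which gives the 5% figure.
uniform = [1 / 20] * 20

if __name__ == "__main__":
    print(percent_identity("ACDE", "ACDF"))  # hypothetical toy sequences
    print(chance_identity(uniform))
```

With real, unequal amino acid frequencies the chance baseline comes out somewhat above 5%, which is why the figure is usually quoted as approximate.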

But in any case, AI is certainly poised to revolutionize our understanding of protein evolution, especially the ability to test many long-standing conjectures about both how novel proteins evolve from non-coding DNA and how the first proteins might have evolved. I recently read this:
https://www.biorxiv.org/content/10.1101/2024.11.10.622830v1.full

In silico evolution of globular protein folds from random sequences

Harutyun Sahakyan, Sanasar G. Babajanyan, Yuri I. Wolf, Eugene V. Koonin
doi: https://doi.org/10.1101/2024.11.10.622830

Abstract

The origin and evolution of protein folds are among the most challenging, long-standing problems in biology 1,2. Although many plausible scenarios of early protein evolution leading to fold nucleation have been proposed 3-8, realistic simulation of this process was not feasible because of the lack of efficient approaches for protein structure prediction, a situation that changed with the advent of powerful tools for fast and robust protein structure prediction, such as AlphaFold 9,10 and ESMFold 11. We developed a computational approach for protein fold evolution simulator (PFES) with atomistic details that provide insights into the mechanisms of evolution of globular folds from random amino acid sequences. PFES introduces random mutations in a population of protein sequences, evaluates the effect of mutations on protein structure, and selects a new set of proteins for further evolution. Repeating this process iteratively allows tracking the evolutionary trajectory of a changing protein fold that evolves under selective pressure for protein fold stability, interaction with other proteins, or other features shaping the fitness landscape. We employed PFES to show how globular protein folds could evolve from random amino acid sequences as monomers or in complexes with other proteins. The simulations reproduce the evolution of many simple folds of natural proteins as well as the evolution of distinct folds not known to exist in nature. We show that evolution of a stable fold from random sequences, on average takes 3 to 8 amino acid replacements per site, suggesting that simple but stable protein folds can evolve relatively easily. These findings could shed light on the enigma of the rapid evolution of protein fold diversity at the earliest stages of life evolution. PFES tracks the complete evolutionary history from simulations that describes intermediate states at the sequence and structure levels and can be used to test versatile hypotheses on protein fold evolution.
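To make the mutate, evaluate, select cycle from the abstract concrete, here is a toy sketch of that kind of loop. It is not PFES: the fitness function is a deliberately crude placeholder (it rewards an alternating hydrophobic/polar pattern) standing in for a real structure-prediction step, and the selection scheme (keep the top-scoring half of parents plus offspring) is my own assumption, since the abstract does not specify one. All names and parameters are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HYDROPHOBIC = set("AVILMFWC")


def toy_fitness(seq: str) -> float:
    """Placeholder score in [0, 1]; NOT a structure predictor."""
    hits = 0
    for i, aa in enumerate(seq):
        want_hydrophobic = i % 2 == 0
        if (aa in HYDROPHOBIC) == want_hydrophobic:
            hits += 1
    return hits / len(seq)


def mutate(seq: str, rng: random.Random) -> str:
    """Replace one randomly chosen position with a random amino acid."""
    pos = rng.randrange(len(seq))
    return seq[:pos] + rng.choice(AMINO_ACIDS) + seq[pos + 1:]


def evolve(pop_size=20, length=30, generations=50, seed=0):
    rng = random.Random(seed)
    # Start from random sequences, as in the scenario the abstract describes.
    population = [
        "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
        for _ in range(pop_size)
    ]
    history = []  # best fitness per generation, to track the trajectory
    for _ in range(generations):
        offspring = [mutate(s, rng) for s in population]
        # Truncation selection over parents + offspring (an assumption).
        pool = sorted(population + offspring, key=toy_fitness, reverse=True)
        population = pool[:pop_size]
        history.append(toy_fitness(population[0]))
    return population, history


if __name__ == "__main__":
    final_population, best_per_generation = evolve()
    print(best_per_generation[0], best_per_generation[-1])
```

Because parents stay in the selection pool, the best score can never drop from one generation to the next. In a real run the expensive part would be the evaluation step, where each candidate sequence goes through a structure predictor.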

How cool, thanks!

That’s some very cool math.

WARNING: OLD GUY’S MEMORIES FOLLOW

All that protein folding brings back memories of working at the computer center of a state university back in the late 1970s. I used to go into my office in the middle of the night because the response time on the CDC 6600 TELEX system was much better, and I ran some of my most resource-intensive programs around 3am. While waiting for a VERSATEC plotter output, I found myself standing there watching a long series of protein-folding diagrams using up much of the roll of special thermal paper. That stimulated my interest in the topic and I got a chance to meet the biology professor behind the folding simulation diagrams.

It’s amazing to think how much time has been spent working on the protein folding problem. And it’s still not really fully solved; we have just found shortcut solutions that seem able to give decently accurate guesses when they have sufficiently good training data.

This topic brings to mind my first exposures to Intelligent Design arguments, some of the lamest of which were appeals to Levinthal’s paradox. (I also recall some Young Earth Creationist diatribes trying to exploit the protein folding topic. As always, they had virtually no grasp of the math and concepts like Rubisco-binding protein and other chaperone proteins.)

Considering my theological positions regarding God and the universe, both then and now, I was somewhat surprised by my almost immediate negative response to I.D. when it first became a topic of discussion among Christians. (Honestly, I really wanted to like it but didn’t. I couldn’t if I considered the evidence or lack thereof. I continue to consider I.D. poorly constructed philosophy which happens to involve scientific topics but doesn’t engage/employ valid scientific methodologies.) I was struck by the logical fallacies, the lazy preaching to the choir (i.e., easy propaganda), and what I would call an amateurish grasp of the topics. Decades later I have essentially the same impressions of I.D. arguments even though I have many friends and former colleagues in the I.D. camp.

Oops, I guess I wandered a bit off on a tangent. My main point is that the topic of protein-folding helped me identify early on that I was not impressed by I.D. arguments.

I also would guess that this is a case of historical contingency. The program took a very different path and ended up in a very different spot in the vast sequence space, which yielded the same function nonetheless.

Additionally, it is also possible that the evolution of GFP in nature was very different. Perhaps the natural GFP evolved from a non-fluorescent precursor first, with a very different function. If this is the case then the structure of the ancestral GFP was constrained to serve a different task. In contrast, the program that was tasked to find the GFP function probably wasn’t constrained by a specific starting point (correct me if I am wrong on this).

As it turns out, I came across a paper which points out that the GFP domain is similar to a domain of Nidogens, which - quote - “are neither colored nor fluorescent, but instead they serve as protein-binding modules that participate in control of the extracellular matrix formation during development”. Although, as far as I can tell, it doesn’t state whether this was the ancestral function or not.

The program was definitely constrained (see the OP) but not by a “specific starting point.”

Right, but they basically started with the GFP “active site”. So it is very much contingent on the starting point.

In an effort to generate new GFP sequences, we directly prompt the base pretrained 7B parameter ESM3 to generate a 229 residue protein conditioned on the positions Thr62, Thr65, Tyr66, Gly67, Arg96, Glu222, which are critical residues for forming and catalyzing the chromophore reaction (Fig. 4A). We additionally condition on the structure of residues 58 through 71 from the experimental structure in 1QY3, which are known to be structurally important for the energetic favorability of chromophore formation (53). Specifically, sequence tokens, structure tokens, and atomic coordinates of the backbone are provided at the input, and generation begins from a nearly completely masked array of tokens corresponding to 229 residues, except for the token positions used for conditioning.
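To picture what "a nearly completely masked array of tokens" looks like, here is a rough sketch of such a prompt in plain Python. It does not call ESM3 or any of its tooling. The "_" mask character, the helper name, and the idea that the quoted residue numbers map directly onto 1-based positions of the 229-residue array are all my assumptions for illustration.

```python
LENGTH = 229
MASK = "_"  # assumed placeholder for a masked sequence token

# The six conditioned residues from the quoted passage (1-based positions),
# written as one-letter codes: Thr=T, Tyr=Y, Gly=G, Arg=R, Glu=E.
FIXED_RESIDUES = {62: "T", 65: "T", 66: "Y", 67: "G", 96: "R", 222: "E"}

# Residues 58 through 71 inclusive are the structure-conditioned segment.
STRUCTURE_RANGE = range(58, 72)


def build_sequence_prompt(length: int, fixed: dict[int, str], mask: str = MASK) -> str:
    """Return a fully masked sequence with only the fixed positions filled in."""
    tokens = [mask] * length
    for pos, aa in fixed.items():
        if not 1 <= pos <= length:
            raise ValueError(f"position {pos} is outside 1..{length}")
        tokens[pos - 1] = aa  # convert 1-based residue number to 0-based index
    return "".join(tokens)


sequence_prompt = build_sequence_prompt(LENGTH, FIXED_RESIDUES)

# True where structure information would be supplied, False where it is masked.
structure_mask = [(i + 1) in STRUCTURE_RANGE for i in range(LENGTH)]

if __name__ == "__main__":
    print(sequence_prompt[55:72])  # the neighborhood of residues 56..72
    print(sum(structure_mask), "structure-conditioned positions")
```

So only 6 of the 229 sequence positions and 14 structure positions are given; the model has to fill in everything else.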

Still impressive to me that this kind of simulation is possible and results in functional protein sequences. They even synthesize the generated proteins and test them for function in E. coli.
