Simulating 500 million years of evolution with a language model

Two things I note about this new paper:

First, the tool (an LLM called ESM3) looks fun and interesting. The authors describe it thus:

ESM3 achieves a scalable generative model of the three fundamental properties of proteins, sequence, structure, and function, through language modeling.

They can prompt it with parameters and information far more detailed than an amino acid sequence alone. They claim that it differs from previous models in these ways:

Previous generative modeling efforts for proteins have focused primarily on individual modalities, leveraging complex architectures and training objectives for structure that represent proteins as three-dimensional objects. To date, the only language models that have been scaled are for protein sequences. In ESM3 sequence, structure, and function are represented through alphabets of discrete tokens.
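To make the "alphabets of discrete tokens" idea concrete, here is a toy sketch (hypothetical, not the actual ESM3 tokenizers): each modality gets its own small vocabulary, so sequence, structure, and function all become streams of integer tokens that a single language model can consume. All vocabularies and names below are made up for illustration.

```python
# Toy illustration of representing three protein modalities as
# discrete token alphabets. None of this is the real ESM3 code;
# the vocabularies are deliberately tiny and hypothetical.

# Sequence alphabet: the 20 standard amino acids.
SEQ_VOCAB = {aa: i for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY")}

# Hypothetical "structure" alphabet: coarse local-backbone states,
# standing in for the learned structure tokens the paper describes.
STRUCT_VOCAB = {s: i for i, s in enumerate(["helix", "sheet", "loop"])}

# Hypothetical "function" alphabet: coarse functional keywords.
FUNC_VOCAB = {f: i for i, f in enumerate(["fluorescent", "binding", "catalytic"])}

def tokenize(track, vocab):
    """Map a list of symbols from one modality to integer token ids."""
    return [vocab[sym] for sym in track]

sequence  = tokenize(list("MKGE"), SEQ_VOCAB)
structure = tokenize(["helix", "helix", "loop", "sheet"], STRUCT_VOCAB)
function  = tokenize(["fluorescent"], FUNC_VOCAB)

print(sequence)   # [10, 8, 5, 3]
print(structure)  # [0, 0, 2, 1]
```

Once every modality is a token stream over its own alphabet, the same transformer machinery that works on text can, in principle, condition on and generate any mix of them.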

Second, they use ESM3 to make a new/different version of GFP. That’s my interpretation of the protein they generate/discover, because their process (which they call “chain-of-thought”) started from constraints (parameters, inputs) based on known anchors of function in GFP. (For aficionados, they prompted the model with 6 key residues, plus the structure of a 14-residue section, of GFP.) The new protein is thus (my interpretation here) a GFP variant, albeit one that is very different in primary sequence. They calculate the difference between this new variant and its closest “relative” in the GFP family to be equivalent to 500 million years of evolution (hence the title).
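The prompting style described above can be sketched as follows. This is a hedged illustration, not the authors' code: a handful of key residue positions are pinned and every other position is left as a mask token for the model to fill in. The mask symbol, positions, and residue identities are all hypothetical.

```python
# Hypothetical sketch of a partially specified sequence prompt:
# pin a few functionally important residues, mask everything else.

MASK = "_"  # hypothetical mask symbol

def make_prompt(length, pinned):
    """Build a prompt of `length` positions where only the `pinned`
    {position: amino_acid} entries are specified; all other
    positions are left as mask tokens for the model to complete."""
    prompt = [MASK] * length
    for pos, aa in pinned.items():
        prompt[pos] = aa
    return "".join(prompt)

# Pin three made-up residues in a made-up 20-residue prompt.
prompt = make_prompt(20, {2: "T", 7: "Y", 12: "G"})
print(prompt)  # __T____Y____G_______
```

In the paper's actual workflow the prompt also carries structure tokens for a short section, but the principle is the same: the model is constrained at a few anchors and free everywhere else.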

In other words, they didn’t prompt the tool with “make me a protein that fluoresces in X range, and make it smaller than GFP, please and thank you” — but the tool does seem far more powerful than previous ones, and maybe someday the prompts can be a lot more open-ended.

The paper is a pre-publication version and is not open access (OA), but it was preprinted less than three weeks ago. I suspect the final publication will be OA, but I am happy to share the PDF on request. The tool and related toys are here.
