Language Models Generate Functional Proteins With Completely Novel Sequences

A recurring theme in creationism, and especially in ID, is that proteins are brittle: the coding sequence must be exact for the protein to function, and the odds of randomly generating the required order are one in a million gazillion raised to a bazillion. This argument, which may be genuinely persuasive to the lay person, has never honestly conveyed the full biochemical understanding. For one, the same folding, binding, releasing, or catalyzing function could be served by variant proteins, or even by essentially different ones. These recent papers, which apply AI techniques developed for natural language processing to generating proteins, are a case in point: they indicate that nature does not implement all possible solutions.

Language models generalize beyond natural proteins

Here we demonstrate that language models generalize beyond natural proteins to generate de novo proteins, different in sequence and structure from natural proteins. We experimentally validate a large number of designs spanning diverse topologies and sequences. We find that although language models are trained only on the sequences of proteins, they are capable of designing protein structure, including structures of artificially engineered de novo proteins that are distinct from those of natural proteins. Given the backbone of a de novo protein structure as a target, the language model generates sequences that are predicted to fold to the specified structure. When the sequence and structure are both free, language models produce designs that span a wide range of fold topologies and secondary structure compositions, creating proteins which overlap the natural sequence distribution as well as extend beyond it. Designs succeed experimentally across the space of sampled proteins, including many designs that are distant in sequence from natural proteins.

AI technology generates original proteins from scratch

Scientists have created an AI system capable of generating artificial enzymes from scratch. In laboratory tests, some of these enzymes worked as well as those found in nature, even when their artificially generated amino acid sequences diverged significantly from any known natural protein.


Interestingly, AI protein language models (pLMs) are now starting to be used to find homologous relationships between proteins that previously could not be shown to be related using sequence-based alignments alone. In those cases structural information was necessary, which usually required crystal structures from proteins already known to be homologous, and that could be a problem because structures of some proteins could not be obtained:

Recently, pLMs were also leveraged for establishing homologous relationships between sequences. While this is achievable with standard alignment tools [14], whenever the comparison falls into the so-called twilight zone [16], the pairwise signal gets blurry. This is where pLMs shine by capturing relationships way beyond simple sequence comparisons, uncovering otherwise undetected evolutionary relationships that can guide, for example, protein annotation or structure prediction efforts.
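
To make the "twilight zone" concrete, here is a minimal Python sketch of the kind of simple sequence comparison the quote says stops being informative. It computes percent identity between two sequences that are assumed to be already aligned (gaps written as "-"). The function names, the toy sequences, and the 20-35% band are illustrative assumptions: that band is the range commonly cited for the twilight zone, not a figure taken from the quoted paper.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip gap columns; they carry no residue-to-residue comparison
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Assumed illustrative band for the twilight zone (commonly cited, not from the quote).
TWILIGHT_LOW = 20.0
TWILIGHT_HIGH = 35.0


def in_twilight_zone(identity: float) -> bool:
    """True when pairwise identity falls in the assumed 20-35% band."""
    return TWILIGHT_LOW <= identity <= TWILIGHT_HIGH


if __name__ == "__main__":
    # Toy, made-up sequences purely for illustration.
    identity = percent_identity("MKTAYIAKQR", "MKSGYLAEQH")
    print(identity, in_twilight_zone(identity))
```

Two proteins scoring inside that band may or may not be homologous, and identity alone cannot tell which. That ambiguity is the gap the quote says pLMs help to close.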