Gauger and Mercer: Bifunctional Proteins and Protein Sequence Space


(Ann Gauger) #27

@Mercer @Art

How does your experiment demonstrate this? It's possible for many different sequences to encode proteins that fold into nearly identical structures and have the same catalytic activity; beta-lactamases, for example, can differ substantially in sequence. Changing one amino acid in myosin is nothing compared to that. Measured against the whole sea of possible sequences, that particular beta-lactamase activity was very rare, even though there are probably tens of thousands (or more) of sequences that can carry out the function. What matters is the proportion between the total possible sequences of a given length and the number of them that perform a particular function. That's what Doug was measuring.

I’d like to recommend a paper by Doug Axe.
Axe DD (2010) The case against a Darwinian origin of protein folds. BIO-Complexity 2010(1):1-12. doi:10.5048/BIO-C.2010.1

I’ll quote here the section I think is relevant.

The proportion of protein sequences that perform specified functions.
One study focused on the AroQ-type chorismate mutase,
which is formed by the symmetrical association of two identical
93-residue chains [24]. These relatively small chains form a very
simple folded structure (Figure 5A). The other study examined
a 153-residue section of a 263-residue beta-lactamase [25]. That
section forms a compact structural component known as a domain
within the folded structure of the whole beta-lactamase (Figure
5B). Compared to the chorismate mutase, this beta-lactamase domain
has both larger size and a more complex fold structure.
In both studies, large sets of extensively mutated genes were
produced and tested. By placing suitable restrictions on the allowed
mutations and counting the proportion of working genes
that result, it was possible to estimate the expected prevalence of
working sequences for the hypothetical case where those restrictions
are lifted. In that way, prevalence values far too low to be
measured directly were estimated with reasonable confidence.
The results allow the average fraction of sampled amino acid
substitutions that are functionally acceptable at a single amino
acid position to be calculated. By raising this fraction to the power
ℓ, it is possible to estimate the overall fraction of working sequences
expected when ℓ positions are simultaneously substituted
(see reference 25 for details). Applying this approach to the data
from the chorismate mutase and the beta-lactamase experiments
gives a range of values (bracketed by the two cases) for the prevalence
of protein sequences that perform a specified function. The
reported range [25] is one in 10^77 (based on data from the more
complex beta-lactamase fold; ℓ = 153) to one in 10^53 (based on
the data from the simpler chorismate mutase fold, adjusted to the
same length: ℓ = 153). As remarkable as these figures are, particularly
when interpreted as probabilities, they were not without
precedent when reported [21, 22]. Rather, they strengthened an
existing case for thinking that even very simple protein folds can
place very severe constraints on sequence.
Rescaling the figures to reflect a more typical chain length of
300 residues gives a prevalence range of one in 10^151 to one in
10^104. On the one hand, this range confirms the very highly many-to-
one mapping of sequences to functions. The corresponding
range of m values is 10^239 (=20^300/10^151) to 10^286 (=20^300/10^104),
meaning that vast numbers of viable sequence possibilities exist
for each protein function. But on the other hand it appears that
these functional sequences are nowhere near as common as they
would have to be in order for the sampling problem to be dismissed.
The shortfall is itself a staggering figure—some 80 to 127
orders of magnitude (comparing the above prevalence range to
the cutoff value of 1 in 5×10^23). So it appears that even when m
is taken into account, protein sequences that perform particular
functions are far too rare to be found by random sampling.

Sorry, you’ll have to look at the original to get the exponents and references right.
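For readers who want to check the arithmetic in the quoted passage, the extrapolation can be sketched in a few lines of Python. These are my own round numbers, worked in log10 to avoid overflow; this is a check on the reported figures, not Axe's actual calculation:

```python
import math

# Reported prevalence for the beta-lactamase domain: 1 in 10^77 at length l = 153.
ell = 153
log10_prev_153 = -77.0

# Implied per-position fraction of acceptable substitutions f, where f**ell = 10**-77.
f = 10 ** (log10_prev_153 / ell)
print(f"per-residue fraction f = {f:.3f}")                 # about 0.31

# Rescale to a more typical chain length of 300 residues: prevalence = f**300.
log10_prev_300 = 300 * math.log10(f)
print(f"prevalence at length 300 = 1 in 10^{-log10_prev_300:.0f}")   # about 10^151

# m = working sequences per function = 20^300 / 10^151, computed in log10.
log10_m = 300 * math.log10(20) + log10_prev_300
print(f"m = about 10^{log10_m:.0f}")                       # about 10^239
```

The same three lines with log10_prev_153 = -53 reproduce the chorismate-mutase end of the reported range.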

When I spoke of promiscuous enzymes I was not referring to myosin, but to the many cases in the literature where promiscuous enzymes have been pushed to favor one substrate over another. Sometimes this can be done by directed evolution, for example.

In your example, you made a single change to the binding pocket that allowed a modified substrate to bind. That’s got nothing to do with promiscuity. I guess I was trying to be general. I agree, my failure to read carefully was in part due to my prejudices, as you call them. Not a good thing, but I have already apologized, and I am certainly not the first scientist to do so.

(Arthur Hunt) #28

Ann, you know my response to Axe’s work. But, just for my own curiosity, how many of Axe’s beta-lactamase variants (the original crippled reference sequence and the various positive and negative variants) did he actually assay for activity? How active were these? What was the range of activities seen? What was the activity cut-off for the plating assay he used?

Besides the flaw in the strategy (this issue awaits a longer response from someone somewhere else in this forum), these issues weigh heavily on the conclusions you want to draw here. And they affect tremendously the numbers you and Axe toss around.

(The Honest Skeptic) #29

Link to Axe paper cited by @Agauger above:

(Arthur Hunt) #30

Another thought experiment, submitted to provoke discussion:

In a random DNA sequence, the “distance” between an ATG triplet and one of the three stop codons will be 21 codons, on average. (It’s a thought experiment, so I will take the liberty of keeping this simple. I am fully aware that this value is better represented as a range and will be affected by many, many variables, but 21 is a convenient ball-park value for this exercise.) In other words, a typical orphan gene arising from a random sequence is going to encode a 20 amino acid polypeptide.

Using Axe’s per-residue estimate for the “probability” of function, it is easy to calculate that, in round numbers, about 1 in 10^10 such peptides will possess enzymatic function. Moreover, if we grant that, say, 1 in 10^10 of all the cells on earth will, in a generation, “yield” a single new orphan (a pretty safe, in fact extremely conservative, estimate) owing to the spontaneous appearance of a transcription start site, then it is apparent that the probability of new function arising in the biosphere is NOT 1 in 10^10, but essentially 1. (Again, I am keeping things simple. Some of the mathematicians reading this can revise my estimates, but the revisions won’t much affect the bottom line here.)

If we generalize Axe’s estimate to all possible enzyme activities (such as ID proponents are wont to do), then we see that any conceivable new function likely appears on almost a daily basis in the biosphere (that may have as many as 10^30 organisms at any one time).

(S. Joshua Swamidass) #31

@art it would be helpful to observers to work out that math step by step.

(Ann Gauger) #32

I think I understand the first part of your calculation. Because of the way that the genetic code is structured, on average a stop codon will occur once every 20 amino acids or so in a random sequence. That has in fact been the reason for excluding short peptide sequences as functional proteins in the past. It’s been assumed that anything 20 amino acids long is not going to be a true enzyme. Attitudes on this have changed recently, with many peptides having been shown to be functional.

Now for the second part. Axe’s calculation of the per residue probability had to do with conversion between protein sequences that varied in length and their measured probabilities. He wasn’t working with an estimate of short peptides. I don’t think your calculation can be extrapolated as you have.

(Arthur Hunt) #33

Sure. Remember, this is a thought experiment, not a comprehensive theoretical treatise…

Since 3 of the 64 codons are stop codons, a round number for the spacing of stop codons in a random sequence will be once every 21 codons or so. (64 divided by 3; throw away the remainder, since I don’t want to clutter things up…)

For Axe’s per-residue “probability”, he assumes a uniform average probability and raises it to the power reflecting the length of the protein. Thus, for his number of 10^-77, this comes out to somewhere in the 0.33 or so range per residue. (Again, this is rough, off the top of my head. Precision here doesn’t change the final point.)

Thus, for a 20-mer (the expected size of a newly occurring random polypeptide), the “probability” of function is 0.33 raised to the 20th power. Again, using round numbers, about 1 in 10^10.

I figure on about 10^30 cells in the biosphere. I forget where I got this from, but it includes all the bacteria, the volumes of the oceans, etc., etc. I figure that at least 1 in 10^10 of these will see a mutation that creates a new transcription start site. I am too lazy to provide a citation for this, but I am pretty sure that, given the known rates of mutation and the relatively uncomplicated nature of promoters, such a mutation will probably occur at least once with every round of replication, even in a bacterial genome. Certainly, using a value of 1 in 10^10 for such mutations is pretty conservative here. (As has been seen elsewhere in this discussion, a new gene arises basically via “creation” of a promoter.)

Thus, if a function arises once in every 10^10 new proteins, and we have 10^20 new proteins with each passing generation, the probability of getting a new function is essentially 1. (I won’t show my work here - hopefully, readers can at least grasp this.)
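The steps above can be written out explicitly. Here is a small Python sketch using the same round numbers; one assumption on my part is that the per-residue value is derived from the 1-in-10^77 figure at length 153, rather than taking 0.33 exactly:

```python
import math

# 1. Stops are 3 of the 64 codons, so random ORFs average about 64/3 ~ 21 codons,
#    i.e. a newly arising random polypeptide is roughly 20 amino acids long.
mean_spacing = 64 // 3                      # 21 codons
peptide_len = 20

# 2. Per-residue "probability" implied by the 1-in-10^77 figure at length 153.
p_residue = 10 ** (-77 / 153)               # about 0.31

# 3. "Probability" that a random 20-mer is functional.
p_20mer = p_residue ** peptide_len          # about 1e-10

# 4. New random peptides per generation: 10^30 cells, 1 in 10^10 of which
#    gains a new transcription start site.
new_peptides = 1e30 * 1e-10                 # 1e20

# 5. Expected number of new functional peptides per generation. This is far
#    above 1, so the chance of at least one arising is essentially 1.
expected = new_peptides * p_20mer
print(f"expected new functions per generation: about 10^{math.log10(expected):.0f}")
```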

Again - THIS IS JUST A THOUGHT EXPERIMENT!!! I want to provoke some general discussion about the approach that ID proponents take when they talk about function, accessibility, sequence space, and the like. Do not take these rough estimates any more seriously than needed to see where I am trying to lead things (or be led, as the case may be). Please.

(Ann Gauger) #34


I’d like to begin by saying that Doug’s work in the 2004 paper is independent of anything he and I did together. Our conclusions did not depend on the previous work. We were asking a different question.
However I will try to explain what I know about how the work was done, based on my own use of the technique for another project.

The lactamase variants that Doug used in his experiment were as follows. First, a plasmid that had a deletion in the active site of beta-lactamase and was not capable of carrying out the enzymatic reaction; call it delta. Second, a plasmid carrying a beta-lactamase gene that had been heavily mutated, almost to the point of no activity; call it basal. Third, a plasmid carrying a wild-type beta-lactamase gene (call it wt). The main negative control was the host bacterial strain lacking any plasmid at all. This established the level of antibiotic resistance that the strain itself had. That was then compared to cells carrying delta, which had a MIC (minimum inhibitory concentration) of 5 µg per ml; to cells with basal, which had a MIC of 10 µg per ml; and to cells with wt, with a MIC of 6000 µg per ml.

These numbers had to be recalibrated any time a new batch of naive cells was used, because the baseline changed. But these are the numbers I observed in the lab using Doug’s plasmids, and they are in pretty good agreement with his. Other controls: only freshly poured plates with fresh ampicillin, since its potency changes quickly, and exactly the same pouring technique and drying protocol. Our numbers were reproducible, and the difference between wild-type and heavily mutated lactamase activity was about three orders of magnitude.
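As a quick check on the last sentence, the MIC ratios can be converted to orders of magnitude directly. This treats MIC as a rough proxy for activity differences, which is itself an assumption:

```python
import math

# MIC values reported above, in micrograms per ml.
mic = {"delta": 5, "basal": 10, "wt": 6000}

# Fold-differences of wild type over the heavily mutated (basal) enzyme and over delta.
wt_vs_basal = mic["wt"] / mic["basal"]    # 600-fold
wt_vs_delta = mic["wt"] / mic["delta"]    # 1200-fold

print(f"wt vs basal: {wt_vs_basal:.0f}-fold "
      f"(~{math.log10(wt_vs_basal):.1f} orders of magnitude)")
print(f"wt vs delta: {wt_vs_delta:.0f}-fold "
      f"(~{math.log10(wt_vs_delta):.1f} orders of magnitude)")
```

Both ratios land near 10^3, consistent with "about three orders of magnitude."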

(Arthur Hunt) #36

Hi Ann,

I expect your last sentence is correct. But the way Axe’s work is cited by ID proponents indicates otherwise. Are there any rules or even guidelines that tell us when Axe’s estimates apply and when they do not? Put another way, why are you comfortable using the per-residue estimate for polypeptides of 90-300 amino acids or more, but not, say, 50 or 20 amino acids? I don’t see why there should be some sort of distinction, given that function can be realized with very small peptide motifs.

(Arthur Hunt) #39

What is the experimentally determined relationship between the activity of different beta-lactamase variants and MIC? I think this sort of calibration is called for, because MIC will reflect not just the inherent catalytic properties of the enzyme, but also expression levels and the factors that go along with these.

This may seem to be a bit tendentious, but I believe it possible that conclusions drawn from plating assays that are at the very limits of sensitivity stand a fair chance of being incorrect, absent some sort of independent confirmation. If I recall correctly (and I admit I may be wrong), you may have encountered some curious results that reflect the different ways in which plasmid-borne characters may influence cell growth.

(Ann Gauger) #40

Those estimates are based on measures of enzymatic activity. Enzymes are rarely less than 50 amino acids in length. As I understand it, peptides are not enzymes. They act more like binding factors or signal peptides, and do not have a true enzymatic activity.

Doug’s estimate, and those he cites like chorismate mutase, are for true enzymes. The function he speaks of is enzymatic function. Simple binding can occur much more easily and with much shorter sequence. It is enzymatic activity in all its complexity and glory that needs explaining.

People who work on the emergence/evolution of proteins often propose that proteins were much shorter in the beginning, and composed of fewer amino acids, in order to get around the problem of the search through sequence space for function. The trouble is that there is a limited range of things that can be accomplished with those restrictions, and certainly not the range of activities we see now.

There is an interesting article published in ASBMB a few years ago on the origin of proteins. A leading researcher in the field referred to it as “something like close to a miracle.”

(Ann Gauger) #41

To control for the inherent variability, we always used competent cells from the same batch, and plasmid from the same prep, with all controls, and ampicillin from the same batch. That gave us reproducible MIC measurements. MIC is actually a standard measure of antibiotic activity, or resistance. Assays with pure enzyme or plasmid do not duplicate in vivo experience.

To do Doug’s experiment it was necessary to use MIC assays, in order to screen hundreds of millions of variants at once. You can’t do that with enzyme assays. And it’s actually a pretty sensitive assay. Cells carrying the mutant library are plated at a range of concentrations of ampicillin, and their growth is compared to negative and positive controls side by side. In three out of the four libraries, Doug recovered clones that grew at the threshold concentration, but in reduced numbers compared to those plated without ampicillin. The fourth library yielded nothing, and must have been extremely sensitive to substitution. This tells me that the assay is working as it should, because a range of responses were seen.

(Arthur Hunt) #42

Ann, this does not agree with what you and Axe published in your 2015 BIO-Complexity paper (Axe DD, Gauger AK (2015) Model and laboratory demonstrations that evolutionary optimization works well only if preceded by invention—Selection itself is not inventive. BIO-Complexity 2015(2):1–13). In that paper, you report that a plasmid expressing a “junk protein” derived from the TEM beta-lactamase, one that cannot have any enzyme activity, nonetheless confers a very modest level of resistance to ampicillin. (This protein is, I believe, the protein you call delta above.) Cells with delta showed an MIC of 5-10 micrograms/ml, while cells with no plasmid showed an MIC of 3-5 micrograms/ml. In other words, cells carrying a plasmid incapable of expressing an enzyme with any sort of activity nonetheless yielded a higher MIC towards ampicillin. To me, this sounds like a false positive, in that your assay yields a positive indication of beta-lactamase activity even though no enzyme is present in the cell.

This is why I picked up on this matter. Maybe you can clarify this.

(Arthur Hunt) #43

I believe this distinction is artificial. Any peptide that can bind a transition state can be an enzyme. (After all, enzyme catalysis is, at its heart, about binding and stabilizing the transition state.) This is important if one is talking about the origins of enzyme activity: one doesn’t need huge, rigorously structured polypeptides. Any small domain or peptide (separate or embedded in a larger protein) that acquires such a capability can, in principle, become catalytically active. Since it is well established that small peptides (as small as 7 amino acids) can form binding sites for ligands, it is quite reasonable to talk about 20-mers as one would 100-mers.

So, what are we discussing? The origins of enzyme activity? If so, then the relevant measure here is enzyme activity, not MIC.

I realize that we probably won’t come to any sort of agreement about this, but I think it may help readers here to know that there are valid concerns about the ways Axe’s experiments are done. And how his work is interpreted and applied, such as to short vs. long polypeptides.

(Arthur Hunt) #44

Which is why Tawfik gave up, right?

Oh, wait, he hasn’t given up.

I don’t lend any credence to out-of-context snippets like this. I don’t expect many here will either.

(Ann Gauger) #45

A MIC is a minimum inhibitory concentration. Cells without plasmid were killed by a lower dose of ampicillin; that’s what that means. The delta plasmid, as Doug said in his paper, appeared to have some sort of mass effect: its presence amplified the rapid breakdown of ampicillin by hydrolysis. Lots of things do this to ampicillin: air, water, sunlight, non-specific protein. Doug verified that the delta plasmid had no enzymatic activity by mutating a key amino acid in what remained of the active site; there was no change in function. But this sensitivity is precisely why we used such careful controls. And it should be an argument in favor of Doug’s results, because it shows it doesn’t take much to boost apparent enzyme activity.

(Ann Gauger) #46

This is entirely hypothetical, unless you can provide an example where a peptide does indeed carry out a true enzymatic reaction, not just binding.

MIC is a measure of enzyme activity.

Straining at gnats and swallowing camels, in my opinion, the gnats being the critique and the camels the assumptions about peptides.

(Arthur Hunt) #47

Hi Ann,

At the risk of trying your patience, I am still not following your logic here.

As I am understanding things, differential growth on ampicillin (higher MICs) is supposed to reflect enzyme activity, in Axe’s 2004 paper and in the 2015 paper. However, if a plasmid that cannot encode an active enzyme also yields a higher MIC, then I do not see how this assay can be a very reliable measure of beta-lactamase activity, especially when we are talking about the low levels of activity that are expected in studies such as the 2004 paper.

If you cannot rule out or filter false positives, then the plating assay loses much or most of its utility. Calling false positives some sort of support for some other fanciful scenario doesn’t really help here.

(Arthur Hunt) #48

Like this?

A very indirect measure, and one that does not necessarily reflect actual enzyme activity. Your own work shows this, Ann.

In my opinion, Ann, an assay that cannot differentiate zero activity from low activity (which is precisely what your assay is) is pretty unreliable for the sorts of experiments you and Axe wish to conduct.

Peptides can be enzymes. MICs do not measure enzyme activity.

(Ann Gauger) #49


Then how do you account for the higher level of MIC for the basal and wt enzymes, if it’s not a measure of enzyme activity?

But our assay does distinguish between zero and low enzymatic activity. Compare delta, basal and wt. They are distinguishable. Doug’s experiment was conducted at the basal level, which is consistently higher than the delta level. What is responsible for the difference, if not reproducible, reliable enzymatic activity?

I looked at the peptide paper abstract and some of their other work. It seems that decarboxylation of oxaloacetate can be accomplished by a number of means. The double disulfide bond explains the peptide’s stability, and it does accelerate the rate, but what I could read did not say how. They engineered three lysines into the peptide, which apparently caused the formation of the disulfide bonds. Eighteen residues with four cysteines and three lysines is an unusual and highly reactive peptide. I’ll bet it works on ampicillin too. :grinning:

I’m reserving judgment on its status as an enzyme. We’ll have to disagree.