A Ubiquitin Response to Gpuccio

Ah yes! Thanks for reminding me, Bill. Didn’t TSZ have a thread discussing an ENV article by Dr Durston? I recall he joined the conversation at TSZ at some point.

As far as I can parse this, I’d agree. “Needles and haystacks” is a bad analogy.

If you are a population of organisms, you are not searching sequence space. You are tripping over new sequences as you stumble over them. You can pick one up and see if it is an improvement on the one you already have or discard it. Variation and selection, put simply.

We can make statistical estimates and that is all that is necessary here to make a design inference. It appears that functional information is a subset of mutual information.

Simply because you have more binding sites. Using Szostak’s estimate of 1 in 10^11 odds of binding ATP, the odds of a single protein binding two small molecules would be 1 in 10^22. It is simply the probability of binding A and B.

This statement is true but not relevant.

A correction or point of clarification is called for here, I believe. The “odds” cited here are the fractional population of functional sequences (in this case, ATP-binding polypeptides) in sequence space. They are not really “odds” or probabilities. To arrive at these, one needs to factor in the total population of polypeptides. Thus, for example, the “odds” of finding one such ATP-binding polypeptide in 10,000 liters of a 1 femtomolar solution of random polypeptides are 1, not 10^-11. The “odds” of finding ten different ATP-binding polypeptides in such a solution would also be 1, not 10^-110 (if I am getting Bill’s math correct here).
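For anyone who wants to check the solution example, here is a rough sketch of the arithmetic. It assumes every molecule in the solution is an independent random sequence and uses a Poisson approximation for the counts; both are my simplifications, not anything from Szostak’s paper, and the variable names are just for illustration.

```python
import math

AVOGADRO = 6.02214076e23  # molecules per mole

def prob_at_least(k, mean):
    """Poisson probability of at least k hits, given the expected number of hits."""
    below_k = sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(k))
    return 1.0 - below_k

# Figures taken from the post above.
volume_l = 10_000             # liters of solution
concentration_molar = 1e-15   # 1 femtomolar
functional_fraction = 1e-11   # fraction of sequences that bind ATP

# Assumption: each molecule is a distinct, independently drawn random sequence.
n_molecules = volume_l * concentration_molar * AVOGADRO
expected_binders = n_molecules * functional_fraction

print(f"molecules in the sample: {n_molecules:.3e}")
print(f"expected ATP binders:    {expected_binders:.1f}")
print(f"P(at least 1 binder):    {prob_at_least(1, expected_binders):.6f}")
print(f"P(at least 10 binders):  {prob_at_least(10, expected_binders):.6f}")
```

With about 6 x 10^12 molecules in the sample, the expected number of binders is around 60, so the chance of finding at least one (or at least ten) is 1 for all practical purposes.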

Putting this in a different context, it is likely that a biosphere with 10^30 bacteria would generate, with each generation of bacterial growth, 10^20 or more new polypeptides through the usual processes of mutation, recombination, rearrangement, and the like. In other words, in the Earth’s biosphere today, the “odds” of generating new ATP-binding motifs such as described by Szostak are essentially 1. Most definitely not 1 in 10^11. The same goes for most any function that occurs in 1 in 10^11 to 1 in 10^20 or so different polypeptides.
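The same bookkeeping can be run for the biosphere figure. This is only a sketch: it takes the 10^20 new polypeptides per generation at face value, treats each as an independent draw from sequence space, and uses a Poisson approximation. The rarity values in the loop are illustrative points within the 1 in 10^11 to 1 in 10^20 range mentioned above.

```python
import math

NEW_POLYPEPTIDES_PER_GENERATION = 1e20  # figure quoted in the post

def expected_hits(n_new, functional_fraction):
    """Expected number of new functional polypeptides among n_new random ones."""
    return n_new * functional_fraction

def prob_any(mean):
    """Poisson approximation: P(at least one hit) = 1 - exp(-mean)."""
    return -math.expm1(-mean)

# Illustrative rarities spanning the range discussed (assumed sample points).
for exponent in (11, 14, 17, 20):
    fraction = 10.0 ** -exponent
    mean = expected_hits(NEW_POLYPEPTIDES_PER_GENERATION, fraction)
    print(f"1 in 10^{exponent}: expected hits per generation = {mean:.3g}, "
          f"P(at least one) = {prob_any(mean):.3f}")
```

Even at the rare end of the range (1 in 10^20), the expected count is about one new functional polypeptide per generation, so over a handful of generations the event is close to certain.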

In case my last point is not clear, I am saying that it is very likely that most any new function arises by chance every day in the biosphere. (In case anyone is wondering, the fates of all of these are governed by well-established rules of population genetics - drift and natural selection.)


Here is a post by gpuccio at UD referencing comparison with Durston’s methods.

George Castillo:

Here are your answers:

  1. Durston’s paper lists 35 protein families. My method cannot be applied to many of them because they are not proteins present in the human proteome. I use human proteins as probes to measure functional complexity, therefore I can only do that with proteins present in the human proteome. Moreover, my database is restricted to human verified proteins in Uniprot, that is, the roughly 20000 reliable reference sequences identified in humans. So, proteins like Vif (Virion infectivity factor), Viral helicase1, Bac luciferase, SecY, DctM and many others in the list have no clear homologues in the human proteome.

  2. The relationship with length is very strong in my data as in Durston’s, as expected. I am adding two scatterplots for deuterostomia – not vertebrates and for cartilaginous fish, the two groups that are important for the computation of the jump in vertebrates, at the end of the OP. Consider that my values are given for about 20000 proteins.

  3. My evolutionary history plots represent the individual value for individual proteins. So, I do not understand what error bars you are referring to. If you want the distribution of the reference values for organism groups, I can give you the standard deviation values, even if I don’t understand what is their utility in this context. However, here they are:

Cnidaria: mean 0.5432765 baa; sd 0.4024939 baa

Cephalopoda: mean 0.5302676 baa; sd 0.3949502 baa

Deuterostomia (not vertebrates): mean 0.6705278 baa; sd 0.4280898 baa

Cartilaginous fish: mean 0.9491001 baa; sd 0.5180335 baa

Bony fish: mean 1.06373 baa; sd 0.4992876 baa

Amphibians: mean 1.106878 baa; sd 0.509575 baa

Crocodiles: mean 1.2175 baa; sd 0.5166932 baa

Marsupialia: mean 1.354032 baa; sd 0.5016414 baa

Afrotheria: mean 1.628872 baa; sd 0.43412 baa

However, as explained, these are just standard deviations of the values for the whole human proteome as compared to each group of organisms. In no way are they “error bars”. Moreover, as you can certainly understand from the values of the standard deviations, the distributions here are certainly not normal.

When comparing values for different groups of proteins, indeed, I always use nonparametric methods, such as the Wilcoxon test for independent groups. For example, I have identified a group of 144 human proteins which are involved, according to GO functions, in neuronal differentiation. You may wonder if the jump from prevertebrate to vertebrate human conserved information is significantly higher in this group, as compared to all other human proteins.

And it is. The median value in the neuronal differentiation group is 0.4534413 baa, as compared to the median value of 0.2629808 baa in the rest of human proteins. The difference is highly significant. p value, as computed by the Wilcoxon test, is 1.202e-12. I am adding the boxplot for that comparison at the end of the OP.

This is just an example of how a correct analysis can be done using my values as applied to different protein groups.

  4. Not sure what you mean. I have already given the plots by size at point 2.

While I agree this is a valid point you make in your paper for bacteria, it does not address new function appearing in animals with much smaller populations.


I don’t see why ATP binding would apply across the board. You have millions of B-cell lines that each have their own randomly generated antibody, and your immune system doesn’t have any trouble finding antibodies that bind to a whole host of different pathogens and toxins. That is just with a few million proteins.

One method is very clear in the Durston paper you cited and I posted a discussion gpuccio had regarding his differences with Durston’s methods.

I am not claiming anything across the board. I was showing how proteins that bind multiple proteins are rarer in sequence space than proteins that bind a single molecule. Multi-protein binding is rampant in the cell nucleus where many of gpuccio’s analyzed proteins come from.


Solid argument :kissing_heart:

Well, I do my best. You, on the other hand, don’t appear to have an argument, a position, any evidence, anything to offer. It becomes, well, fruitless to engage. What’s the point, Bill?


I have made the argument based on data and simple statistical application. Szostak’s paper showed binding of a single ATP for a 70 AA protein. Now what if the protein has to bind ATP and another molecule? I estimate this is 1 in 10^22 based on Szostak’s numbers. If you read through the discussion, I already made this argument. Again, Alan, you are making false claims. I don’t think you are dishonest as I have known you for many years.

I think you are responding without reading and following the post.

Durston’s method does not even attempt to find all combinations of amino acids that would produce a specific function, so that can’t be used. If gpuccio is using a different method that actually determines function then now would be a good time to discuss it. I think the entire scientific community would be very interested in this method.

Again, antibodies are a perfect example. For a single surface protein on a bacterium you can have multiple antibodies that bind to different portions of that bacterial protein. It is rather easy to get a protein that binds to multiple other proteins. With antibodies, we are talking about millions of proteins, in the range of 1x10^7.

Hi Bill, you keep making the same mistake. 10^-11 is not the probability, and one certainly would not multiply this as you do to estimate the likelihood of getting a bifunctional protein.

To illustrate, start again with a scenario (the current biosphere) in which 10^20 new proteins arise each day. The number of ATP-binding proteins in this collection would be about 10^9, and the probability that any of these would be bifunctional would be calculated from the ratio obtained by dividing 10^9 by 10^11 (NOT 10^11 squared). Thus, for example, in a year, the probability of getting a bifunctional protein would be essentially 1, not some unseemly low number.
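Spelled out as a quick calculation, the scenario looks like the sketch below. It assumes the second function is as rare as ATP binding (1 in 10^11) and independent of it, and it uses a Poisson approximation for "at least one"; those assumptions and the constant names are mine, added only to make the arithmetic explicit.

```python
import math

NEW_PROTEINS_PER_DAY = 1e20      # figure from the scenario above
ATP_BINDING_FRACTION = 1e-11     # Szostak-style estimate quoted in the thread
SECOND_FUNCTION_FRACTION = 1e-11 # assumption: second function equally rare, independent

# Step 1: how many of today's new proteins bind ATP?
atp_binders_per_day = NEW_PROTEINS_PER_DAY * ATP_BINDING_FRACTION

# Step 2: how many of those ATP binders also carry the second function?
bifunctional_per_day = atp_binders_per_day * SECOND_FUNCTION_FRACTION

# Step 3: accumulate over a year and convert the expected count to a probability.
bifunctional_per_year = bifunctional_per_day * 365
p_within_year = -math.expm1(-bifunctional_per_year)

print(f"ATP binders per day:         {atp_binders_per_day:.3g}")
print(f"bifunctional per day:        {bifunctional_per_day:.3g}")
print(f"bifunctional per year:       {bifunctional_per_year:.3g}")
print(f"P(at least one in one year): {p_within_year:.3f}")
```

Under these assumptions the expected number of bifunctional proteins is about 0.01 per day and a few per year, which puts the one-year probability at roughly 0.97 and makes it effectively 1 over a few years. The rarity per sequence only becomes a probability once the number of new sequences is factored in.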

I am curious, why do you cling to a scenario wherein one and only one protein comes about? This makes no biological or chemical sense, and leads to lots of irrelevant calculations.


In some cases his wider group of proteins may point to function better than gpuccio’s, as gpuccio is working at the sub-function level. In reality, cellular function is quite complex and involves many proteins. An illustration of this is studying the cell cycle, which is part of the cell division process. Durston is measuring bits per AA, as is gpuccio, so both are very relevant to the discussion.

I believe antibodies are just a niche discussion and poorly represent other nuclear proteins, as they have limited substrate size, bind a single substrate, and use hypermutation.

This seems to be a distraction from the central point: there is no current method for finding all combinations of amino acids that will produce a specific function.

I would be interested in any science you can find to back those beliefs. It is also interesting that hypermutation can produce protein binding.

This is not a valid argument, as both are using statistical estimates, which is a valid scientific method. If you want to argue your point, then attack the accuracy of their measurement, but not with an “it could be anything” argument.

Why do you believe I am wrong? I agree hypermutation is an interesting application, but it is also very different than other nuclear proteins.

BTW: On my way to LA to see my kids and grandson. Will be off line the rest of the day.

Those are not valid estimates of all possible combinations of amino acids that will produce a specific function. Those are just estimates of the amino acids vital for function when you start with a given protein. It doesn’t tell you how many different sequences you can start with.

Because you keep deflecting.

It estimates the substitutability of AAs in the proteins in their current functional role. Why do you think there are more sequences that can do the job? Remember that these proteins are interdependent with other proteins.

I am avoiding this because I have had a long conversation with Dennis V at BioLogos and have concluded it tells us almost nothing about protein origins.

There is no reason to think that there aren’t.

I don’t see how that changes anything. Let’s look at the following scenario.

Our arbitrary model protein is called ENZ. This protein cleaves a small molecule called BIO, and this activity is beneficial to the cell, which is why the sequence is conserved over time. Along comes a protein that binds to ENZ called BIND1. When BIND1 binds to ENZ it causes ENZ to attach the products from the cleavage of BIO to another molecule, and this reaction is beneficial to the cell. Since it is beneficial, the new binding site that evolved in BIND1 is kept, as is the binding site in ENZ. Along comes BIND2 which also binds to a portion of ENZ and produces a beneficial modification of how ENZ works. Again, the sequence for binding ENZ is conserved in BIND2, and the binding site is now conserved in ENZ.

You seem to think that it requires evolution within ENZ to bind another protein, but this is backwards. You are ignoring the evolution of binding sites in other proteins, and how interactions between these evolving proteins can produce beneficial interactions which will lead to the conservation of sequence in both proteins.
