Origin of Proteins

I’m tired now. I hope I can answer you tomorrow.

When did I make this claim?

I actually agree, but I would make the same claim for a 400 million year old protein. All we get is an artifact of the diversity of the clade in which it happened to originate, the strength of purifying selection operating on it over this period of time, and the rate of molecular evolution in the lineages that make up that clade.
As such, we never really get an actual estimate of its FI, because we will never be able to determine whether there are strictly more functional sequences out there that are selected against because they have lower fitness, and the clade in which it evolved has limited diversity.

That’s been my whole damn criticism from the beginning. I am glad you’re starting to see, even if only for this limited example, why this FI nonsense is useless.

But 400 million years later, it would then become possible.

Why? Why 400 million years? How do you tell the difference between the number of different sequences that are possible, and the number of sequences that evolved but were constrained by purifying selection? How do you know the variants you’re not seeing aren’t actually functional, but just have lower fitness, which is why they don’t appear in the population (because they were selected against)?

How do you know that there isn’t a similarly functional sequence much much further away in sequence space?

This is how gpuccio’s method works. And it makes perfect sense.

No, it doesn’t. Because you’d STILL not be anywhere near having sampled the possible diversity out there. You seem to be saying something oddly fantastic, that 400 million years of evolution has the capacity to thoroughly sample close to all of sequence space for a long sequence. How do you figure that?

But isn’t your claim, essentially, that sequence space is so vast evolution never could sample even a tiny fraction of it for large sequences? You’re trying to have your cake and eat it too.

The standard of falsification could use a little improvement :slight_smile:

Just to be clear, measuring changes in living populations in the lab is not a valid way of testing the claim that evolution can’t produce proteins with 500+ bits of FI. Is this correct?

You’ve just made one yourself. What we are seeing is a constraint imposed on us by time and clade diversity. And sequence space is so much more unfathomably vast, particularly for a long protein sequence, than all of evolutionary history has had the capacity to sample.

And yet you want to claim this almost infinitesimal sampling done is enough for us to conclude there’s probably very little more functional diversity out there.

It depends on the conditions in the lab. If you were to simulate a program like Weasel and keep the mutation rate low enough you could generate 500 bits of FI but you need the original FI in the experiment to do this.

If you are looking at a protein with a 600 million year evolutionary history, how do you factor in the “original FI”?

Good question. Will think about it. I am out for the rest of the day.


The thief and 100 safes example was just plain wrong. Gpuccio wasn’t doing the math right, nor does it accurately represent how proteins and function actually work. Much was written about it; here, here, here.
How the math should be done is here.

No worries. I didn’t get home from the airport last night until 2 AM, and was up before 7 AM. Not sure how I am still able to function. :crazy_face:

What is the answer, Bill?

Please explain WHY it makes perfect sense that a calculation with no time variable would return a completely different value depending on when the calculation was done, instead of just asserting it. Bill sure can’t.

The number of trials required divided by the mutations per unit time you are measuring.

Example: 10 trials required at one mutation per year takes 10 years. 10 trials required at two mutations per year takes 5 years.

What constitutes a “trial”? You keep forgetting to say.

That answer is so wrong it’s not even wrong.

Here is the answer:

10^12 new cells. 0.003 mutations per cell.

0.003 x 10^12 = 3 x 10^9 new mutations in a single generation.
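
The same arithmetic as a minimal Python sketch, using the two figures quoted above. The variable names are only illustrative, and exact fractions are used so that no floating-point rounding creeps in.

```python
from fractions import Fraction

# Figures quoted in the post above; the names are illustrative only.
new_cells = 10**12                       # new cells in one generation
mutations_per_cell = Fraction(3, 1000)   # 0.003 mutations per cell

# Expected number of new mutations in a single generation.
new_mutations = new_cells * mutations_per_cell

print(f"{int(new_mutations):,} new mutations per generation")
```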


I suspect the Birthday Paradox may be in play.

How many people does it take for there to be a 50/50 chance that two of them will have the same birthday? How many people does it take for there to be a 99.9% chance that two of them have the same birthday? Assuming birthdays are uniformly distributed across the year, the answers are 23 and 70, respectively.

These probabilities are counterintuitive, but they are correct. The mistake you seem to be making is waiting for a match, and then calculating the probability of getting that match. If you did this, you would probably say there need to be 365/2 people to have a 50/50 match, but this is wrong. If you try to understand the Birthday Paradox, you may have a better idea of where we are coming from.
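
For anyone who wants to check the Birthday Paradox numbers, here is a small Python sketch. It assumes 365 equally likely, independent birthdays (no leap years), and the function names are mine. Under that assumption the smallest group sizes come out as 23 for a 50% chance and 70 for a 99.9% chance; a group of 75 sits at roughly 99.97%.

```python
def p_shared_birthday(n, days=365):
    """Probability that at least two of n people share a birthday,
    assuming birthdays are independent and uniform over `days` days."""
    p_all_distinct = 1.0
    for k in range(n):
        # Person k+1 must avoid the k birthdays already taken.
        p_all_distinct *= (days - k) / days
    return 1.0 - p_all_distinct


def smallest_group(target, days=365):
    """Smallest group size whose match probability reaches `target`."""
    n = 1
    while p_shared_birthday(n, days) < target:
        n += 1
    return n


for target in (0.5, 0.999):
    n = smallest_group(target)
    print(f"{target:.1%} chance needs {n} people "
          f"(p = {p_shared_birthday(n):.5f})")
```

The point of the exercise is that the calculation counts a match between any pair, not a match with one birthday picked in advance, which is why the numbers are so much smaller than 365/2.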

The 400 million year number is pulled right out of thin air. Why 400 million years? Why not 45 mya, or 200 mya, or 900 mya? What is the relationship between time and the degree to which we can confidently estimate FI? How does this relationship take purifying selection into account? What distinguishes diversity absent because it just has lower fitness, vs diversity absent because it’s truly nonfunctional?

So all the members of some species originally have a functional protein with a specific sequence. Say a decent protein of roughly 150 amino acids. What happens over time after the origin of this protein (whether it originated by design or evolution) is that this sequence mutates in different individuals as they reproduce.

Over some generations the area immediately surrounding the protein in sequence space will be sampled by point mutations and some insertions and deletions occurring in different members of the population. Some of these mutations are discarded by natural selection because they are either lethal or deleterious compared to the original sequence. Others are allowed because they’re either nearly neutral, or perhaps even beneficial. Suppose 100 million such sequences, nearby in sequence space, are sampled over 10 million years.

More generations pass and the same thing happens. Perhaps one mutation has become fixed over this 10 million year period. In the vicinity of this new sequence with a fixed mutation, more of the immediate neighborhood in sequence space is sampled. Some mutations are discarded, others are allowed. Another 100 million sequences are sampled over the next 10 million years. You get the picture.

Okay, so fast forward 400 million years, and 40 times 100 million sequences have been sampled. That’s 4 billion sequences. Let’s suppose there are twenty million such populations with this protein, having evolved in parallel, all sampling into the neighborhood of this protein. So we have 2x10^7 times 4x10^9 sequences sampled over those 400 million years. Realistically there will have been a lot of parallel sampling of the same space, it’s not all going to be unique sequences never before tested, but let’s just be extremely generous and pretend it is. So that’s 8x10^16 unique sequences sampled.

How big is the total sequence space for a 150 aa protein? 20^150, or ~1.4 x 10^195

That still leaves $\sim1.3999...9992\times10^{195}$ sequences yet to be sampled. And we are to believe we can conclude with good confidence that no other sequence with the function exists in that space. Because we have sampled approximately $5.6\times10^{-177}$% of it. Written out, that is a decimal point followed by 176 zeros before the first nonzero digit.

No no, Gpuccio says we can do it in 400 million years, and @Giltil hasn’t seen any good reasons to think otherwise. Sampling $\sim5.6\times10^{-177}$% of sequence space is enough.

Now let’s imagine I’m off by a factor of one trillion. That in fact there’s been one trillion times more sequences sampled. That’s $\sim5.6\times10^{-165}$% of sequence space sampled. And that’s all we need to know there are no more functional sequences in that space. Gpuccio says so, and it makes sense to @Giltil. He just can’t see any good rebuttals to that.

That’s like giving @Giltil one of the entire observable universe’s atoms, it happens to be a hydrogen atom, and now Gilbert is confident that there are no iron atoms in the universe. He’s seen enough atoms to know. Except, it’s much worse.
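
The sampling estimate can be recomputed from its stated inputs with a short Python sketch: 100 million sequences per 10 million years, 400 million years, twenty million parallel populations, and 20 amino acids at each of 150 positions. The variable names are mine, and every sampled sequence is treated as unique, which is the deliberately generous assumption made above.

```python
from fractions import Fraction

# Inputs as stated in the scenario above (names are illustrative).
SEQS_PER_INTERVAL = 100_000_000       # sequences sampled per 10 million years
INTERVAL_YEARS = 10_000_000
TOTAL_YEARS = 400_000_000
POPULATIONS = 20_000_000              # parallel populations carrying the protein
ALPHABET = 20                         # amino acids
LENGTH = 150                          # residues in the protein

intervals = TOTAL_YEARS // INTERVAL_YEARS
# Generous assumption: no two sampled sequences are ever the same.
sampled = SEQS_PER_INTERVAL * intervals * POPULATIONS

# Python integers are exact, so the full sequence space can be computed directly.
sequence_space = ALPHABET ** LENGTH

percent_sampled = float(Fraction(sampled, sequence_space) * 100)
percent_if_trillion_more = float(Fraction(sampled * 10**12, sequence_space) * 100)

print(f"sequences sampled:      {sampled:.1e}")
print(f"sequence space:         {sequence_space:.2e}")
print(f"percent sampled:        {percent_sampled:.2e} %")
print(f"with 10^12 times more:  {percent_if_trillion_more:.2e} %")
```

Changing any of the inputs by a few orders of magnitude barely moves the result, because the exponent of the sequence space dominates everything else.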

Edit: Thanks to Roy I have corrected an exponent error above.


If that post doesn’t make it clear what the problem with Gpuccio’s method is to our resident IDcreationists, then I dare say this conversation has become pointless.


Is this even possibly an informative conversation? Is anyone finding value in it?