That may be what Gpuccio intends (as in, that’s what he is aiming at in his head), but it’s not how the method actually works. It tries to extrapolate functional variation from homologous sequences, so it isn’t *really* based on function but on sequence similarity. That’s why.

So if there were a very dissimilar sequence that could perform the function, it would not be detected as homologous. Gpuccio would not be able to find it in some database when using a similarity-based search, it would not be classified as belonging to the clade of sequences with a similar name (ubiquitin, alpha actin, ATP synthase subunit beta, or w/e), and so Gpuccio would miss it in his calculation.

Hence, his method is not based *on functions*, but on sequence similarity and annotation classifications. Which means he’s only ever going to be looking within particular sequence-based families judged to be homologous by annotators.

You contest it because it doesn’t take into account the possibility that other dissimilar solutions may exist in the sequence space. Although this is true, it doesn’t affect the soundness of his analysis.

Of course it does. We can show that with a simple hypothetical example. Keefe & Szostak evolved multiple different ATP-binding proteins from a library of about 10^12 different random sequences 80 amino acids in length. The total sequence space for that length is about 1.2×10^104 (20^80). They found that the ATP-binding function exists at a frequency of about 10^-11. (Btw, that experiment has been repeated by another lab, and they found basically the same thing.)

Now let’s suppose we use Gpuccio’s method to try to derive the FI for the function from one of those sequences the Szostak lab found, imagining its lineage had been diverging for 400 million years. We find that over that time it has diversified quite a lot, so much so that every position in the sequence has at least 3 other amino acids in some variant. From that we calculate that there are 4^80 = ~1.5×10^48 possible sequences that can implement the ATP-binding function in the sequence space for proteins 80 amino acids long.

So now we try to derive the fraction of sequence space 80 amino acids long that can bind ATP:

(1.5×10^48)/(1.2×10^104) = 1.25×10^-56

But the real frequency of the function is in the 10^-11 to 10^-12 range, as revealed in multiple empirical experiments. Yet basing our calculation on homologous sequences, even if an *enormous* amount of variation has been generated such that every single position is known to have 3 possible alternative amino acids (for a total of 4), we still end up with **an underestimate of about 45 orders of magnitude**.
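If anyone wants to check that arithmetic, here is a rough Python sketch of it. The function name `log10_fraction` and the constant `EMPIRICAL_LOG10` are my own labels, and the sketch assumes, as in the hypothetical above, that each of the 80 positions independently tolerates exactly 4 of the 20 amino acids. It works in log10 so the numbers stay readable.

```python
import math

def log10_fraction(allowed_per_site: int, length: int, alphabet: int = 20) -> float:
    """log10 of (allowed_per_site^length / alphabet^length).

    Assumes every site independently tolerates the same number of residues,
    which is the simplification used in the hypothetical example.
    """
    return length * (math.log10(allowed_per_site) - math.log10(alphabet))

# Homology-based estimate: 4 residues tolerated at each of 80 positions.
estimated_log10 = log10_fraction(allowed_per_site=4, length=80)

# Empirical frequency of ATP binding, taken as 10^-11 from the text above.
EMPIRICAL_LOG10 = -11.0

# How many orders of magnitude the homology-based estimate falls short.
gap_orders = EMPIRICAL_LOG10 - estimated_log10

print(f"estimated fraction ~ 10^{estimated_log10:.1f}")
print(f"underestimate      ~ {gap_orders:.0f} orders of magnitude")
```

Working in logs also avoids the small rounding difference you get from dividing the rounded figures by hand.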

FI is useless for establishing the true fraction of sequence space able to perform some function of interest, because it isn’t physically possible for evolution, even over the entire duration of life’s existence on Earth, to generate all the variation that would be needed for us to be able to extrapolate the true number for FI. All of which I explained here.

To see this, let’s take again the ATP synthase example. Gpuccio has calculated that it has a FI of 1297 bits.

Let’s be clear that you’re talking about the beta subunit of ATP synthase. You can’t really calculate the FI for the whole machine, as it is made of multiple independent proteins, and it would be meaningless to even attempt to do so, since it evolved from protein subunits that had other functions on their own. And the beta subunit is homologous to the alpha subunit in the catalytic hexamer, which is basically just the same protein repeated six times into a hexagonal oligomer.

Now imagine that 1000 other dissimilar solutions exist in the sequence space that can perform the same function. In that case, the FI associated with the function of ATP synthase would be reduced by 10 bits.

If 1 billion other dissimilar solutions existed, the reduction would be of 30 bits.

If 10^100 other dissimilar solutions existed, the reduction would be of 332 bits.

If 10^250 other dissimilar solutions existed, the reduction would be of 830 bits, leading to a FI of 467 bits, which is just below the threshold that warrants a design inference.

So you see, in order to dismiss the design inference for ATP synthase, you have to imagine that there exist about 10^250 dissimilar solutions in the sequence space!!! Given that no evidence whatsoever exists that a single alternative dissimilar solution exists for implementing the function of ATP synthase, your case is, say, weak, to say the least.
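For reference, the bit reductions quoted above are just log2 of the number of alternative solutions. Here is a small Python sketch that reproduces them. The function name `fi_after_alternatives` is hypothetical, and the sketch assumes what the quoted arithmetic implicitly assumes: that each alternative solution contributes a target of the same size as the known family, so N solutions shrink the FI by log2(N) bits.

```python
import math

def fi_after_alternatives(fi_bits: float, n_solutions: int) -> float:
    """FI left after multiplying the functional target size by n_solutions.

    Assumption: every alternative solution covers as many sequences as the
    known family, so the reduction is simply log2(n_solutions) bits.
    """
    return fi_bits - math.log2(n_solutions)

FI_BETA_SUBUNIT = 1297  # bits, the figure quoted for ATP synthase subunit beta

for n in (10**3, 10**9, 10**100, 10**250):
    reduction = math.log2(n)  # math.log2 accepts arbitrarily large Python ints
    remaining = fi_after_alternatives(FI_BETA_SUBUNIT, n)
    print(f"{len(str(n)) - 1:>3} orders of magnitude: "
          f"-{reduction:.0f} bits, {remaining:.0f} bits left")
```

Note that this only reproduces the quoted numbers. It says nothing about how many such alternatives actually exist, which is the point in dispute.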

The problem with all this is twofold. First of all, as just stated, even supposing there were that many different functional proteins for the molecule, they couldn’t all physically exist or be generated by the evolutionary process, so any estimate of FI based on similarity would be unable to capture them, since it’s merely an extrapolation from known extant variation. Thanks for making that point for me.

Second: ATP synthase subunit beta evolved from simpler precursors able to perform similar functions (which, ironically, are related to the ATP-binding function evolved by Keefe & Szostak 2001). We’ve been over this.

That would also make FI useless for giving any hints about whether some protein function is evolvable, because even if that function is now very rare in sequence space, it is possible that it can evolve incrementally from a simpler precursor that is much more frequent in that space, as all evidence shows is the case for the relationship between the extant and ancestral functions of ATP synthase subunit beta. These matters are further complicated by the fact that proteins can have multiple functions, so one function that is highly abundant in sequence space can give rise to another that is very rare but happens to overlap with it in some cluster of sequences.

For these reasons you simply can’t establish the relationship between FI and sequence space based on homologous sequences generated over evolutionary history. And even if you could, you can’t derive from that relationship that protein X could not evolve, because you’re still only considering a sort of *de novo* evolution where the function has to emerge as-is, instead of deriving from a simpler and more frequent function, or an entirely different one.