I’m not aware of any protein where this imaginary scenario is true. Short of very specific active site residues exactly where a protein binds a substrate (a cysteine residue coordinating a metal cofactor, for example), most residues exhibit intramolecular epistasis. That means that for a constrained residue which, if changed, would negatively affect the functional performance of the protein, there is some mutation elsewhere in the protein that can open the way for the constrained residue to change neutrally. So they’re very rarely “absolutely required” for function. Another factor to consider is that other residues in place of those “absolutely required” ones are oftentimes just significantly deleterious. They still work; the protein still functions, strictly speaking. But its functional performance is reduced such that carriers of this protein have significantly lower fitness. Such mutations are unlikely to reach fixation if they occur, hence we generally don’t see them at high frequencies in any population. But if we were to correctly estimate the FI for some particular protein’s function, we’d need to include those unknown sequences, and we’d need to know how many of them there are in order to plug in the correct number of sequences that meet the minimal threshold for function.
Now, I’m just pointing this out to correct what appears to be a misapprehension of yours about how mutations constrain protein function. This is actually not all that important to Gpuccio’s claim that mutations at synonymous sites reach saturation at 400 million years (though it does have some effect on what we should take “constrained” to mean).
How would this protein have looked, say, 10 million years later in the different vertebrate species of that time? Obviously, the 70% required for its function would have remained unchanged. As for the 30% neutral part, only a small fraction of the positions would have changed through the action of neutral drift. IOW, 10 million years later, the homology signal of the neutral part of the protein would have remained very strong.
Now, what about this same protein 100 million years later in the different vertebrate species of the time? Well, whereas the constrained part (70%) would have remained unchanged, a large fraction of the neutral positions would have changed. However, a sufficient number of neutral positions would still have remained unaffected by drift, so that a homology signal for the neutral part of the protein would still be detectable.
But 400 million years later, whereas the constrained part would still be there unchanged, the homology signal of the neutral part would no longer be detectable.
The bottom line here is that the homology signal corresponding to the neutral part of a protein decreases with time until a point P where it becomes undetectable, whereas the homology signal corresponding to the constrained part of the protein remains constant with time. This is why it is only when P is reached that one can safely conclude that all the homology signal observed for a given protein necessarily belongs to the constrained, functional part of it. IOW, far from being useless, P is a crucial element of the reasoning for determining FI.
I’m sorry, but I really don’t see how this is relevant to estimating FI. You are speaking about detecting homology, not calculating FI. FI is calculated for FUNCTIONS. It doesn’t matter whether the sequences are homologous or not. If two completely unrelated proteins can do the same function, they both need to be counted in the number of sequences that meet the minimal threshold for function.
Also, you are making one long argument for why evolution, starting from a functional sequence, is limited in the degree to which it can explore sequence space for the function. In other words, you’re really just stating reasons why we can’t make any claims about whether there are other functional but evolutionarily unrelated sequences out there in that space. That is what I have been saying all along: starting from some functional protein sequence, evolution will explore the space immediately surrounding it with point mutations, perhaps with a few jumps if we include larger deletions or insertions, but this exploration is always constrained to the immediate surroundings of some already functional sequence. So we simply can’t claim that no other functional sequences are out there, and so we can’t say we really know the FI for any protein’s function.
Suppose you have two groups of apparently unrelated proteins of equal length, both of which can do the same job. The group 1 proteins are all homologous to each other, but not to group 2. The group 2 proteins are likewise homologous only to each other, not to group 1. They are able to perform the same function despite having no detectable homology. What’s the FI? Well, technically both of those groups are sequences in the same sequence space, and they can both do the same function. So they’re both groups of proteins that meet the minimal threshold for function, yet with no detectable homology (you wouldn’t find group 1 proteins when doing BLAST searches for group 2, or the other way around).
Both of those groups of sequences would occupy some area of sequence space, isolated from each other. Even with the total number of all the sequences added together, for even a relatively short protein they would only occupy an infinitesimal fraction of that sequence space. Still, restricting ourselves to only looking at homologous sequences risks underestimating how many sequences are out there that meet the minimal threshold, and so overestimating the FI.
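To put rough numbers on the two-group scenario, here is a minimal sketch. It assumes the usual definition of FI as -log2 of the fraction of all sequences of a given length that meet the minimal threshold for function. The protein length and the sizes of the two groups are made-up illustrative values, not estimates for any real protein, and the function name is just something I picked for the example.

```python
import math

def functional_information(n_functional, length, alphabet_size=20):
    """FI in bits, assuming FI = -log2(n_functional / alphabet_size**length)."""
    # Work in log space so the huge total (20**length) never becomes a float.
    log2_total = length * math.log2(alphabet_size)
    return log2_total - math.log2(n_functional)

# Hypothetical numbers, chosen only for illustration.
length = 100        # residues
group1 = 10**20     # functional sequences homologous to the known protein
group2 = 10**22     # functional sequences with no detectable homology to group 1

fi_group1_only = functional_information(group1, length)
fi_both_groups = functional_information(group1 + group2, length)

print(f"FI counting only group 1: {fi_group1_only:.1f} bits")
print(f"FI counting both groups:  {fi_both_groups:.1f} bits")
```

With these made-up counts, leaving out group 2 gives a higher FI than counting both groups, even though both together are still a vanishingly small fraction of the space. How big the gap is depends entirely on how many non-homologous functional sequences exist, which is exactly the number we don’t know.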
So why is homology important for calculating FI again? It obviously can’t be.