The 400 million year number is pulled right out of thin air. Why 400 million years? Why not 45 mya, or 200 mya, or 900 mya? What is the relationship between time and the degree to which we can confidently estimate FI? How does this relationship take purifying selection into account? What distinguishes diversity absent because it just has lower fitness, vs diversity absent because it’s truly nonfunctional?
So all the members of some species originally have a functional protein with a specific sequence. Say a decent protein of roughly 150 amino acids. What happens after the origin of this protein (whether it originated by design or evolution) is that, over time, the sequence mutates in different individuals as they reproduce.
Over some generations the area immediately surrounding the protein in sequence space will be sampled by point mutations and some insertions and deletions occurring in different members of the population. Some of these mutations are discarded by natural selection because they are either lethal or deleterious compared to the original sequence. Others are allowed because they’re either nearly neutral, or perhaps even beneficial. Suppose 100 million such sequences, nearby in sequence space, are sampled over 10 million years.
More generations pass and the same thing happens. Perhaps one mutation has become fixed over this 10 million year period. The immediate vicinity of this new sequence, with its fixed mutation, now gets sampled in turn. Some mutations are discarded, others are allowed. Another 100 million sequences are sampled over the next 10 million years. You get the picture.
Okay, so fast forward 400 million years: 40 times 100 million sequences have been sampled. That’s 4 billion sequences. Let’s suppose there are twenty million such populations with this protein, having evolved in parallel, all sampling the neighborhood of this protein. So we have $2\times10^{7}$ times $4\times10^{9}$ sequences sampled over those 400 million years. Realistically there will have been a lot of parallel sampling of the same space, so it’s not all going to be unique sequences never before tested, but let’s just be extremely generous and pretend it is. So that’s $8\times10^{16}$ unique sequences sampled.
How big is the total sequence space for a 150 aa protein? $20^{150}\approx1.4\times10^{195}$
That still leaves $\sim1.3999...9992\times10^{195}$ sequences yet to be sampled. And we are to believe we can conclude with good confidence that no other sequence with the function exists in that space. Because we have sampled a percentage of it that starts with a decimal point followed by 176 zeros: approximately $5.6\times10^{-177}$%.
No no, Gpuccio says we can do it in 400 million years, and @Giltil hasn’t seen any good reasons to think otherwise. Sampling $\sim5.6\times10^{-177}$% of sequence space is enough.
Now let’s imagine I’m off by a factor of one trillion. That in fact there have been one trillion times more sequences sampled. That’s $\sim5.6\times10^{-165}$% of sequence space sampled. And that’s all we need to know there are no more functional sequences in that space. Gpuccio says so, and it makes sense to @Giltil. He just can’t see any good rebuttals to that.
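For anyone who wants to check the arithmetic themselves, here is a minimal Python sketch. The constant and function names are my own and purely illustrative. It assumes 20 possible amino acids at each of the 150 positions, and it takes the generous premise above (every sampled sequence is unique) at face value.

```python
from decimal import Decimal, getcontext

getcontext().prec = 50  # plenty of significant digits for these magnitudes

# Figures taken from the scenario above; names are illustrative only.
AMINO_ACIDS = 20            # assumed alphabet size per position
PROTEIN_LENGTH = 150        # residues in the example protein
SEQS_PER_EPOCH = 10**8      # 100 million sequences per 10-million-year epoch
EPOCHS = 40                 # 400 million years / 10 million years
POPULATIONS = 20 * 10**6    # twenty million populations sampling in parallel


def percent_sampled(extra_factor: int = 1) -> Decimal:
    """Percentage of sequence space sampled, treating every sample as unique.

    extra_factor scales the number of sampled sequences, e.g. 10**12 for the
    "off by a factor of one trillion" case.
    """
    space = AMINO_ACIDS ** PROTEIN_LENGTH  # exact integer arithmetic
    sampled = SEQS_PER_EPOCH * EPOCHS * POPULATIONS * extra_factor
    return Decimal(sampled) / Decimal(space) * 100


if __name__ == "__main__":
    space = Decimal(AMINO_ACIDS ** PROTEIN_LENGTH)
    print(f"sequence space:             {space:.2E}")
    print(f"percent sampled:            {percent_sampled():.2E}")
    print(f"with a trillion-fold more:  {percent_sampled(10**12):.2E}")
```

Python integers are arbitrary precision, so $20^{150}$ is computed exactly. `Decimal` is used only for the final division, because a percentage this small is far too close to zero to eyeball any other way.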
That’s like giving @Giltil a single atom from the entire observable universe. It happens to be a hydrogen atom, and now Gilbert is confident that there are no iron atoms in the universe. He’s seen enough atoms to know. Except, it’s much worse.
Edit: Thanks to Roy I have corrected an exponent error above.