Hi Art, thank you for your contribution, which allows me to clarify some important points.
You are right of course, the bitscore of a BLAST search and the value of FI as -log2 of the ratio between target space and search space are not the same thing. But the point is: the first is a very good estimator of the second, provided that some conditions are satisfied.
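To make the definition concrete, here is a minimal sketch of FI as -log2 of the ratio between target space and search space. The function name, the 20-letter alphabet default and the example numbers are assumptions chosen for illustration only. Nobody knows the real size of the target space for a real protein, which is exactly why an estimator is needed.

```python
import math

def functional_information_bits(log2_target_size: float, length: int,
                                alphabet: int = 20) -> float:
    """FI = -log2(target space / search space), computed in log space.

    The search space is taken as alphabet**length (all sequences of that
    length); log2_target_size is the log2 of the number of functional ones.
    """
    log2_search_size = length * math.log2(alphabet)
    return log2_search_size - log2_target_size

# Hypothetical example: a 100-residue protein for which we *assume*
# that 2**100 different sequences would perform the function.
fi = functional_information_bits(log2_target_size=100.0, length=100)
print(f"FI = {fi:.1f} bits")
```

Working with logarithms avoids handling numbers like 20^100 directly.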
The idea of using conserved sequence similarity to estimate FI is not mine. I owe it completely to Durston, and probably others have pointed to that concept before. It is, indeed, a direct consequence of some basic ideas of evolutionary theory. I have just developed a simple method to apply that concept, so as to give a quantitative foundation to the design inference in appropriate contexts.
The condition that essentially has to be satisfied is: sequence conservation for long evolutionary periods. I have always tried to emphasize that it is not simply sequence conservation, but that long evolutionary periods are absolutely needed. But sometimes that aspect is not well understood, so I am happy that I can emphasize it here.
I will be clearer. A strong sequence similarity between, say, a human protein and the chimp homologue is of course not a good estimator of FI. The reason for that should be clear enough: the split between chimps and humans is very recent. Any sequence configuration that was present in the common ancestor, be it functionally constrained or not, will probably still be there in humans and chimps, easily detectable by BLAST, just because there has not been enough evolutionary time after the split for the sequences to diverge because of neutral variation. IOWs, we cannot distinguish between similarity due to functional constraint and passive similarity, if the time after the split is too short.
But what if the time after split is 400+ million years, like in the case of the transition to vertebrates, or maybe a couple billion years, like in the case of ATP synthase beta chain in E. coli and humans?
According to what we know about the divergence of synonymous sites, I would say that time windows longer than 200 million years begin to be interesting, and probably 400+ million years are more than enough to guarantee that most of the sequence similarity can be attributed to strong functional constraint. For 2 billion years, I would say that there can be no possible doubt.
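The reasoning about time windows can be illustrated with a rough sketch of how identity at unconstrained sites is expected to decay after a split. The sketch assumes the simple Jukes-Cantor model for nucleotide sites, and both the substitution rate and the 6 million year figure for the chimp-human split are placeholder values picked only for illustration, not measurements. It is not the method used for the estimates discussed here.

```python
import math

def expected_neutral_identity(years_since_split: float,
                              rate_per_site_per_year: float) -> float:
    """Expected identity at unconstrained nucleotide sites (Jukes-Cantor)."""
    # Both lineages accumulate substitutions independently after the split.
    d = 2.0 * rate_per_site_per_year * years_since_split
    # Identity decays from 1.0 toward 0.25, the chance level for 4 nucleotides.
    return 0.25 + 0.75 * math.exp(-4.0 * d / 3.0)

ASSUMED_RATE = 2.2e-9  # substitutions per site per year: illustrative placeholder

for label, years in [("recent split (assumed 6 My)", 6e6),
                     ("200 My", 2e8),
                     ("400 My", 4e8),
                     ("2 By", 2e9)]:
    identity = expected_neutral_identity(years, ASSUMED_RATE)
    print(f"{label:>28}: expected neutral identity = {identity:.3f}")
```

With these assumptions, neutral sites are still almost identical after a recent split and approach chance level for the longest windows. That is the sense in which similarity surviving a long separation points to constraint rather than to lack of time.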
So, in this particular case of long conservation, the degree of similarity becomes a good estimator of functional constraint, and therefore of FI. The unit is the same (bits). The meaning is the same, in this special context.
Technically, the bitscore measures the improbability of finding that similarity by chance in the specific protein database we are using. FI measures the improbability of finding that specific sequence by a random walk from some unrelated starting point. If the sequence similarity can be attributed only to functional constraint, because of the long evolutionary separation, then the two measures are strongly connected.
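For reference, here is a sketch of how a bit score relates to the raw alignment score and to the expected number of chance hits, using the standard Karlin-Altschul relations S' = (lambda*S - ln K) / ln 2 and E = m * n * 2^(-S'). The lambda and K values, the query length and the database size below are illustrative placeholders: the real parameters depend on the scoring matrix and gap costs and are printed in the BLAST output. Note that in these relations the size of the database enters through the E-value, while the bit score itself is normalized.

```python
import math

def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Normalized bit score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def log10_expected_chance_hits(bits: float, query_len: float, db_len: float) -> float:
    """log10 of E = m * n * 2**(-S'), kept in log space to avoid underflow."""
    return math.log10(query_len) + math.log10(db_len) - bits * math.log10(2)

# Placeholder statistical parameters and sizes (assumptions, not real output).
LAMBDA, K = 0.267, 0.041
bits = bit_score(raw_score=1000, lam=LAMBDA, k=K)
log10_e = log10_expected_chance_hits(bits, query_len=500, db_len=1e8)
print(f"bit score = {bits:.0f}, log10(E) = {log10_e:.0f}")
```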
Of course, there are differences and technical problems. We can discuss them, if you want. The general idea is that the BLAST bitscore is a biased estimator, because it always underestimates the true FI.
But that is not the important point, because we are not trying to measure FI with great precision. We just need some reliable approximation and order of magnitude. Why?
Because in the biological world, a lot of objects (in this case, proteins) exhibit FI well beyond the threshold of 500 bits, which can conventionally be assumed as safe for inferring design in any physical system. So, when I get a result of 1250 bits of new FI added to CARD11 at the start of vertebrate evolution, I don’t really need absolute precision. The true FI is certainly much more than that, but who cares? 1250 bits are more than enough to infer design.
To all those who have expressed doubts about the value of long conserved sequence similarity to estimate FI, I would simply ask the following simple question.
Let’s take again the beta chain of ATP synthase. Let’s BLAST again the E. coli sequence against the human sequence. And, for a moment, let’s forget the bitscore, and just look at identities. P06576 vs WP_106631526.
We get 335 identities. 72%. Conserved for, say, a couple billion years.
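As a quick check of the arithmetic, the two figures quoted above can be related as follows. The aligned length is not stated here, so the sketch infers it from the identities and the rounded percentage; treat it as an approximation, not as the value reported by BLAST.

```python
identities = 335          # identical positions, as quoted above
reported_fraction = 0.72  # 72%, as quoted above (rounded)

# Aligned length implied by the two figures (inferred, not taken from BLAST).
implied_aligned_length = round(identities / reported_fraction)
percent_identity = 100.0 * identities / implied_aligned_length

print(f"implied aligned length: about {implied_aligned_length} positions")
print(f"percent identity: {percent_identity:.1f}%")
```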
My simple question is: if we are not measuring FI, what are we measuring here?
IOWs, how can you explain that amazing conserved sequence similarity, if not as an estimate of functional specificity?
Just to know.