Reading through @Josh’s comments, it appears to me that you (@Josh) might be confusing classical information with functional information. That is why your approach gives such an obviously wrong answer when applied to cases where there is a loss of functional information, such as in cancer.
Testing your approach: It should be clear that your K-L approach does not work when it comes to quantifying functional information. By your lights, if I had a flash drive containing the digital information required to build a next generation laptop and I accidentally reformatted the flash drive, such that it contained only randomized gibberish, your calculations would show that we actually gained a large amount of functional information!!! This is just one of three reasons I suspect you are confused between classical information and functional information. I have not yet addressed the other two mistakes you are making. In classical information theory even randomized gibberish is “information”.
There are two issues here: (1) how can we estimate the functional information required to code for a protein family, and (2) how well does the Pfam data represent functional sequence space? I have had discussions with a variety of biologists. Their only concern or criticism is issue (2): whether the sampling in Pfam is an adequate representation of functional sequence space. None of them have a problem with the method; you are unique. When I initially submitted the paper for review, one of the reviewers was very critical of the paper, explicitly stating that he/she had reservations about applying classical information theory to protein sequences. After a few tweaks and some clarification, the editor sent the revision, with comments, back to that critical reviewer and he/she was completely satisfied. So was every other biologist I have discussed this with.
No. It is statements like the above that really leave me bewildered as to why you insist on misrepresenting me and what I am saying. I can respond to the p53 example either way and have done so (either using section A or section B in my paper), but let me try again.
As you insist, let’s set the ground state back to the naked, lifeless planet Earth, assuming that all possible options of amino acids or sequences (whichever you prefer) are equiprobable, so that we have the special case where the ground state I describe in my paper is “the null state”.
H (ground) = H (null)
We know that without any functional constraints to produce an active p53 protein, almost any sequence or aa distribution will do. This has become abundantly obvious for every biological protein we have ever looked at; it is a piece of cake to string together aa’s to get non-functional sequences. Not so the other way around. Given the above, we get
H (ground) = H (null) ≈ H (non-functional-cancerous).
So using the method described in section A, the amount of information I would have to hand you, in order for you to build a non-functional p53 protein, would be
H (null) – H (non-functional-cancerous) ≈ 0 bits.
Essentially, we don’t need any information to produce non-functional p53; all we have to do is simply let the system produce non-constrained aa sequences.
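To make the arithmetic concrete, here is a minimal sketch in Python for a single site. The 20-letter alphabet, the uniform null state, and the slightly uneven "unconstrained" weights are illustrative assumptions, not values taken from the paper or from any real p53 data.

```python
import math

def shannon_entropy(probs):
    """Shannon uncertainty in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

N_AMINO_ACIDS = 20

# Null state: every amino acid equally probable at this site.
h_null = shannon_entropy([1 / N_AMINO_ACIDS] * N_AMINO_ACIDS)

# Hypothetical unconstrained (non-functional) site: close to uniform,
# with small made-up deviations to mimic an unconstrained process.
weights = [1.0 + 0.05 * ((i % 3) - 1) for i in range(N_AMINO_ACIDS)]
total = sum(weights)
h_nonfunctional = shannon_entropy([w / total for w in weights])

# Information needed to go from the null state to the non-functional state.
info_needed = h_null - h_nonfunctional
print(f"H(null) = {h_null:.4f} bits")
print(f"H(non-functional) = {h_nonfunctional:.4f} bits")
print(f"H(null) - H(non-functional) = {info_needed:.4f} bits")
```

Under these toy assumptions the difference comes out at a tiny fraction of a bit, which is the "≈ 0 bits" case above.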
I can demonstrate the same outcome using Hazen’s approach, but I focus here only on mine.
This works for any probability distribution you care to invoke, except for the one probability distribution that defines functional p53 sequences or aa distributions.
Your approach, on the other hand, gives us an answer that is right by classical information theory, but clearly wrong when it comes to functional information.
You have agreed that the difference in Shannon uncertainty between two states can be either positive or negative, so the solution to your problem of always getting a positive increase in functional information, even when functional information is lost, is staring us all in the face.
To put it in words: The reduction in Shannon uncertainty to go from a non-functional ground state to a functional state is equal to the amount of functional information required to reduce that uncertainty. Conversely, the increase in Shannon uncertainty in going from a functional state to a non-functional state is equal to the amount of functional information lost. I did not directly address the topic of functional information loss in my paper, but Eq. 2 in my paper works just fine.
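The sign convention can be illustrated with a short Python sketch. It assumes that the uncertainty of a state is estimated by summing per-site Shannon entropies over a set of aligned sequences; the four-letter alphabet, the toy sequences, and the function names are all hypothetical and stand in for real alignment data.

```python
import math
from collections import Counter

def site_entropy(column):
    """Shannon uncertainty (bits) of one aligned column, from observed frequencies."""
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())

def total_entropy(sequences):
    """Sum of per-site uncertainties over equal-length aligned sequences."""
    return sum(site_entropy(col) for col in zip(*sequences))

def functional_information_change(before, after):
    """H(before) - H(after): positive means gained, negative means lost."""
    return total_entropy(before) - total_entropy(after)

# Toy data on a 4-letter alphabet (made up for illustration only).
unconstrained = ["ACGT", "CGTA", "GTAC", "TACG"]  # every site fully variable
constrained = ["ACGT", "ACGT", "ACGA", "ACGT"]    # three fixed sites, one variable

gain = functional_information_change(unconstrained, constrained)
loss = functional_information_change(constrained, unconstrained)
print(f"non-functional -> functional: {gain:+.4f} bits")
print(f"functional -> non-functional: {loss:+.4f} bits")
```

Going from the unconstrained set to the constrained set gives a positive number, and the reverse direction gives the same magnitude with a negative sign, which is the loss case described above.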
My apologies for not yet addressing this, but I would like to stay on the current topic, at least for this post. I do need to answer your question, though: it is also an objection that Josh and other biologists in general have raised, and it needs to be addressed. Teaser: I think the Pfam data leans heavily in my favour.