Post 15: The original, very simple example, where we considered only 1 site (not the entire protein sequence), seems to have led to some unnecessary confusion. We will look at that one more time, and then move on to the more realistic example, where I think there will be less confusion.
Example #1: A single site
Given: H(MaxEnt) occurs when all 20 amino acids are equi-probable and, for this example only, H(Func) requires one, specific amino acid (which is highly unrealistic in almost all real-life cases).
H(MaxEnt) = log2(20) = 4.32 bits
H(Func) = log2(1) = 0 bits
FI = FSC = ∆H = H(MaxEnt) – H(Func) = 4.32 bits (using Eq. 5 in my paper).
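The single-site arithmetic above can be checked with a minimal sketch. It assumes every tolerated amino acid is equally probable, so the entropy reduces to log2 of the number of tolerated residues; the function name `uniform_entropy_bits` is only illustrative and is not taken from the paper.

```python
import math


def uniform_entropy_bits(n_states: int) -> float:
    """Shannon entropy in bits of n_states equally probable states."""
    return math.log2(n_states)


# Ground state: all 20 amino acids equally probable at the site.
h_maxent = uniform_entropy_bits(20)
# Functional state in this toy example: exactly one amino acid tolerated.
h_func = uniform_entropy_bits(1)
# Functional information as the drop in entropy from ground to functional state.
fi = h_maxent - h_func

print(f"H(MaxEnt) = {h_maxent:.2f} bits")
print(f"H(Func)   = {h_func:.2f} bits")
print(f"FI        = {fi:.2f} bits")
```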
For H(cancer) we need to know how many amino acids are tolerated at that site in a tumour. Without that information, we cannot estimate H(cancer). Just for the sake of this toy example, however, let us imagine that we sample a very large number of tumours and find that 7 different amino acids are now tolerated with equal probability at that site.
The ground state in going from a normal, functional gene to this cancerous gene is not H(MaxEnt); it is H(Func). To clarify, we are presumably going from a physical state defined by a functional gene to a physical state defined by a mutated gene, subject to whatever constraints the cancerous state imposes. This requires that we use the more general Eq. 2 in my paper.
Therefore,
H(cancer) = log2(7) = 2.81 bits
FI = FSC = ∆H = H(Func) – H(cancer) = 0 – 2.81 = -2.81 bits, indicating a loss of functional information from the original physical state.
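As a rough check on the ground-state-to-final-state form of the calculation, here is a small sketch. It again assumes equiprobable tolerated residues in both states, and `delta_h_bits` is a hypothetical helper name, not something defined in the paper.

```python
import math


def delta_h_bits(n_ground: int, n_final: int) -> float:
    """Entropy change H(ground) - H(final) in bits, assuming both states
    are uniform over n_ground and n_final tolerated residues."""
    return math.log2(n_ground) - math.log2(n_final)


# MaxEnt (20 residues) -> functional (1 residue): a gain of functional information.
fi_functional = delta_h_bits(20, 1)
# Functional (1 residue) -> cancerous (7 residues): a negative value means a loss.
fi_cancer = delta_h_bits(1, 7)

print(f"MaxEnt -> functional: {fi_functional:+.2f} bits")
print(f"Functional -> cancer: {fi_cancer:+.2f} bits")
```

The sign depends only on which state tolerates more residues, which is why the choice of ground state matters for this example.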
I don’t understand where you get an H(MaxEnt ground state) of 1,686 bits for a single-site problem, but I expect you are considering the entire sequence, in which case we should look at the second, more realistic problem.
Example #2: A more realistic problem using p53
(We can use other databases later. Let’s continue with the Pfam MSA for the sake of this semi-toy, but more realistic, example.)
Given: A core sequence length of 187 aa’s after insertions are stripped out
H(MaxEnt) = log2(20^187) = 808 bits
H(p53) = 401 bits (estimated by my program)
FI = FSC = ∆H = 808 – 401 = 407 bits.
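The sequence-level numbers can be reproduced with a short sketch. The core length of 187 and the 401-bit estimate for H(p53) are taken directly from this post; the 401 bits is an input here, not something this code computes.

```python
import math

CORE_LENGTH = 187     # core p53 domain length after insertions are stripped
H_P53_BITS = 401.0    # functional-state entropy as quoted above (an input)

# log2(20^187) = 187 * log2(20): every site independently at maximum entropy.
h_maxent = CORE_LENGTH * math.log2(20)
fi = h_maxent - H_P53_BITS

print(f"H(MaxEnt) = {h_maxent:.0f} bits")
print(f"FI        = {fi:.0f} bits")
```

Because H(MaxEnt) grows linearly with sequence length, adding sites to the alignment raises the ceiling on FI by about 4.32 bits per site.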
Your number of 1,686 bits is not possible for a core sequence length of only 187 aa’s, but I expect you are using a much longer sequence, which requires great care. Including insertions and tandem repeats will artificially inflate FI. We must stick with the core sequence for the p53 domain.
Estimating H(cP53) is more challenging. We do not need to know what new function it is serving in a cancerous tumour, but we do need an idea of which amino acids the new function (if there is one) tolerates at which sites along the sequence for this hypothetical cancerous function. To do that we will need a large MSA consisting only of non-redundant cP53 sequences, and that MSA must test positive for adequate sampling of sequence space.
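To illustrate the kind of estimate such an MSA would feed, here is a minimal sketch that sums the per-site Shannon entropy over the columns of an alignment. It is an assumption-laden toy, not the program used for the 401-bit figure: it treats sites as independent, assumes gap-free sequences of equal length, and applies no correction for redundancy or undersampling. The function names and the four-sequence alignment are made up for illustration.

```python
import math
from collections import Counter


def column_entropy_bits(column) -> float:
    """Shannon entropy in bits of the residue frequencies in one MSA column."""
    counts = Counter(column)
    total = sum(counts.values())
    # Sum of p * log2(1/p) over the residues observed in this column.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def alignment_entropy_bits(sequences) -> float:
    """Sum of per-column entropies, treating sites as independent."""
    # zip(*sequences) walks the alignment column by column.
    return sum(column_entropy_bits(col) for col in zip(*sequences))


# Hypothetical toy alignment: sites 1 and 2 are fully conserved,
# site 3 tolerates two residues with equal frequency.
toy_msa = ["ACD", "ACE", "ACD", "ACE"]
h_toy = alignment_entropy_bits(toy_msa)
print(f"H(toy MSA) = {h_toy:.2f} bits")
```

A real estimate would need far more sequences than sites to keep the frequency estimates from being biased low, which is the sampling concern raised above.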
So we are in a poor position to estimate H(cP53), but we can make a very rough estimate based on the knowledge that TP53 is the most frequently mutated gene (>50%) in human cancer (Surget et al., 2013). That strongly suggests it is no longer providing a required function(s) within the tumour or, at the very least, that the number of aa’s tolerated at normally highly conserved sites has increased. So what we can say is that the data suggest,
H(p53) < H(cP53).
Therefore, FI = ∆H = H(p53) - H(cP53) < 0,
indicating a loss of functional information in the degradation of the p53 gene to cP53.