Durston: Functional Information


(Kirk Durston) #64

Post 15: The original, very simple example where we considered only 1 site (not the entire protein sequence) seems to have lead to some unnecessary confusion. We will look at that one more time, and then move on to the more realistic example where I think there will be less confusion.

Example #1: A single site

Given: H(MaxEnt) occurs when all 20 amino acids are equi-probable and, for this example only, H(Func) requires one, specific amino acid (which is highly unrealistic in almost all real-life cases).

H(MaxEnt) = log2(20) = 4.32 bits

H(Func) = log2(1) = 0 bits

FI = FSC = ∆H = H(MaxEnt) – H(Func) = 4.32 bits (using Eq. 5 in my paper).

For H(cancer) we need to know how many amino acids are tolerated at that site in a tumour. Without that information, we cannot estimate H(cancer). Just for the sake of this toy example, however, let us imagine that we sample a very large number of tumours and find that 7 different amino acids are now tolerated with equal probability at that site.

The ground state in going from a normal, functional gene to this cancerous gene is not H(MaxEnt); it is H(Func). To clarify, presumably we are going from a physical state defined by a functional gene to a mutated gene physical state defined by whatever restraints the cancerous state imposes. This requires that we use the more general Eq. 2 in my paper.


H(cancer) = log2(7) = 2.80 bits

FI = FSC = ∆H = H(Func) – H(cancer) = - 1.52 bits, indicating a loss of functional information from the original physical state.

I don’t understand where you get a H(maxent ground state) of 1,686 bits for a single site problem, but I expect you are considering the entire sequence, in which case we should look at the second, more realistic problem.

Example #2: A more realistic problem using p53

(we can use other databases later. Let’s continue with the Pfam MSA for the sake of this semi-toy, but more realistic example).

Given: A core sequence length of 187 aa’s after insertions are stripped out

H(MaxEnt) = log2(20^187) = 808 bits

H(p53) = 401 Bits. (estimated by my program)

FI = FSC = ∆H = 808 – 401 = 407 bits.

Your number of 1,686 bits is not possible for a core sequence length of only 187 aa’s, but I expect you are using a much longer sequence, which requires great care. Including insertions and tandem repeats will artificially inflate FI. We must stick with the core sequence for the p53 domain.

Estimating H(cP53) is more challenging. We do not need to know what new function it is serving in a cancerous tumour, but we do need an idea of what amino acids the new function (if there is one) tolerates at what sites along the sequence for this hypothetical cancerous function. To do that we will need a large MSA consisting of only of non-redundant cP53 that tests positive for adequate sequence space sampling.

So we are in a poor position to estimate H(cP53) but we can make a very rough estimate based on the knowledge that the TP53 gene is the most frequently mutated gene (>50%) in human cancer(Surget et al., 2013). That strongly suggests it is no longer providing a required function(s) within the tumour or, at the very least, the number of aa’s is increased in normally highly conserved sites. So what we can say is that the data suggests,

H(p53) < H(cP53).

Therefore, FI = ∆H = H(p53) - H(cP53) < 0,

indicating a loss of functional information in the degradation of the p53 gene to cP53.

(S. Joshua Swamidass) #65

What is Negative FI?

So now we are crystal clear that FSC can be negative. Great. How do you interpret negative FSC? Recall, you write:

FI = FSC = delta H

FI is defined as:

-log(N / M)

Where N is the number of functional sequences, and M is the total of number sequences. That means that a negative FSC means there is a negative FI, so…

-log(N / M) < 0

Working out the algebra, we arrive at:

N > M

How is it possible for the number of functional sequences to be greater than the total number of sequences? With some similar algebra, we can also derive N = M 2^{-FSC}. This means that if FSC is, say, -10, there are 2^10 times more functional sequences than sequences possible.

How do you make sense of this contradiction?

Most of us would take it as a clear indication that there is an error in your derivation. I can even point out the error too: you are using delta H instead of KL. As soon as you use a different base state than maxent, moreover, the relationship to FI is no longer valid. Why are you unconcerned by this nonsensical result?

Still Haven’t Answered the Question

I just do not understand why you can’t give me straight answer to this question, based on the toy example that you yourself proposed. Can you please try? I do not understand why you can’t just give me a straight answer here. Can you please just humor me here?

What do you compute as these numbers? It is not sufficient hat you have only discussed the single site where there is a mutation. I am waiting for you to consider the whole protein, using the formula that you proposed. Using this formula, I get these numbers:

Do you agree? If not, why not? If you do agree, how do you make sense of the prior computations you made, that are not consistent with these numbers?

(S. Joshua Swamidass) #66

@Kirk, I am confused by this. At first I thought you had caught an arithmetic error (which I would have gladly retracted).

P53, however, has 393 amino acids, not 187. Wheere are you coming from?

This also is not what I am suggesting. We are NOT in a poor position ot estimate H(cP53). It is very easy to do this. I can give you thousands of sequences of cP53. Your estimate, we will see, is very poor. That is one anomaly you have to make sense of in this analysis.

Still, I agree that it will likely be negative, so I agree that:

You still have not explained how this could possibly the right way to compute FI. How do you interpret a negative FI? As you have put down in the math,

This appears to be a clear contradiction in your math, that I can resolve…

Can you help ne out on these points?

Also, I"m still waiting for your H(cP53), H(P53) and H(maxent) for the toy example you gave.

(Kirk Durston) #67

What is negative FI?

Given: FI = ∆H = Ha – Hb

∆H is a measure of the change in the physical system in going from some initial state (a) to some functional state (b). If ∆H is positive, FI is required to go from state (a) to state (b). If ∆H is negative, less FI is required for state (b) than state (a). FI is therefore lost.

If FI = -log(N/M), can N>M?

Yes. For example, if state (a) represents P53 in a healthy cell, and we assume the sequence is more highly conserved than cP53 in state (b), and the total number of permissible sequences in state (a) is M, and the total number of permissible sequences in state (b) is N, and we have a case where N>M and FI will be negative.

The key point here is that for cancer, the ground state (state (a)) is not MaxEnt. Instead, it is H (P53). If one wishes to estimate how much functional information is required to code for a cP53 family from scratch then one must use H(maxent). But in cancer, we are going from H(P53) to H(cP53).

So in this case, the answer to the question, “does cancer create FI or lose FI?”, the negative ∆H reveals that FI is lost in going from the healthy to the cancerous state.

Straight answer for the whole sequece?

I am puzzled by your statement, “I am waiting for you to consider the whole protein”. That is what I did in the previous post, unless by “whole protein” you mean the full sequence of 393 aa’s rather than the 187 aa’s. I literally went to Pfam and got my sequence data from there.

Pfam only provides the MSA for the P53 DNA binding domain. When I strip out the insertions to get the smallest possible core sequence, I end up with only 187 aa’s (which is close to the consensus length for the DNA binding domain).

Perhaps by the “whole protein” you are extending the original toy, one-site, single aa example to a sequence of 393 sites, where only one aa is tolerated at each site. That is not what I originally was talking about, but if that is the case, then I agree with your numbers for H(P53) and H(MaxEnt), but I don’t see how you would get H(cP53) = 4 for the whole sequence. I thought we were looking at 7 aa’s tolerated at one site. Extending this to 393 sites, that should give us H(cP53) = 595. In real life, of course, H(P53) is not even remotely close to 0, nor would be H(cP53).

Using ExAc or CTAG

I think you said that ExAC has a hundred thousand sequences for P53. I have not used it before, but for P53, it says it only contains a total of 567 variants. I get 487 variants from Pfam, but that is after stripping out the insertions, which would have given a large number. So these hundred thousand sequences on ExAC must include an enormous number of redundant/identical sequences which, after removing the redundant ones, reduce to only 567 variants. I’m happy to use whatever database you prefer, but I would need to take some time to familiarize myself with them. If, however, Pfam is doing a much better job of filtering out all redundant/identical entries, then it Pfam would be more efficient to use.