Durston: Functional Information


(Kirk Durston) #47

Reading through @Josh’s comments, it appears to me that you (@Josh) might be confusing classical information with functional information. That is why your approach gives such an obviously wrong answer when applied to cases where there is a loss of functional information, such as in cancer.

Testing your approach: It should be clear that your K-L approach does not work when it comes to quantifying functional information. By your lights, if I had a flash drive containing the digital information required to build a next generation laptop and I accidentally reformatted the flash drive, such that it contained only randomized gibberish, your calculations would show that we actually gained a large amount of functional information!!! This is just one of three reasons I suspect you are confused between classical information and functional information. I have not yet addressed the other two mistakes you are making. In classical information theory even randomized gibberish is “information”.

There are two issues here: 1) how can we estimate the functional information required to code for a protein family and, 2) how well does the Pfam data represent functional sequence space? I have had discussions with a variety of biologists. Their only concern/criticism is issue (2) … whether the sampling in Pfam is an adequate representation of functional sequence space. None of them have a problem with the method; you are unique. When I initially submitted the paper for review, one of the reviewers was very critical of the paper, explicitly stating that he/she had reservations about applying classical information theory to protein sequences. After a few tweaks and some clarification, the editor sent the revision, with comments, back to that critical reviewer and he/she was completely satisfied. So has every other biologist I have discussed this with.

No. It is statements like the above that really leave me bewildered as to why you insist on misrepresenting me and what I am saying. I can respond to the p53 example either way and have done so (either using section A or section B in my paper), but let me try again.

As you insist, let’s set the ground state back to the naked, lifeless planet earth, assuming that all possible options of amino acids or sequences (whichever you prefer) are equi-probable so that we have the special case where the ground state I describe in my paper is “the null state”.

H (ground) = H (null)

We know that without any functional constraints to produce an active p53 protein, almost any sequence or aa distribution will do. This has become abundantly obvious for every biological protein we have ever looked at; it is a piece of cake to string together aa’s to get non-functional sequences. Not so the other way around. Given the above, we get

H (ground) = H (null) ≈ H (non-functional-cancerous).

So using the method described in section A, the amount of information I would have to hand you, in order for you to build a non-functional p53 protein, would be

H (null) – H (non-functional-cancerous) ≈ 0 bits.

Essentially, we don’t need any information to produce non-functional p53; all we have to do is simply let the system produce non-constrained aa sequences.

I can demonstrate the same outcome using Hazen’s approach, but I focus here only on mine.

This works for any probability distribution you care to invoke, except for the one probability distribution that defines functional p53 sequences or aa distributions.

Your approach, on the other hand, gives us an answer that is right by classical information theory, but clearly wrong when it comes to functional information.

You have agreed that the difference in Shannon uncertainty between two states can be either positive or negative, so the solution to your problem of always getting a positive increase in functional information, even when functional information is lost, is staring us all in the face.

To put it in words: The reduction in Shannon uncertainty to go from a non-functional ground state to a functional state is equal to the amount of functional information required to reduce that uncertainty. Conversely, the increase in Shannon uncertainty in going from a functional state to a non-functional state is equal to the amount of functional information lost. I did not directly address the topic of functional information loss in my paper, but Eq. 2 in my paper works just fine.

My apologies for not yet addressing this, but I would like to stay on the current topic at least for this post. I need to answer your question though, as it is also an objection that Josh and other biologists in general have raised, and it needs to be addressed. Teaser: I think the Pfam data leans heavily in my favour.

(John Mercer) #48

Hi Kirk,

First, I hope that you are not trying to portray an analogy as a test of an approach or a model.

Second, your analogy makes no sense to me, because the genomes of cancer cells are in no way randomized gibberish.

If I may make a suggestion, instead of thinking of them as cancer, think of taking some cells from your body and growing them in culture. Most of them won’t last long, but if you culture enough of them, you will select for variants that are immortalized and will continue to divide.

If you view it this way, it is easier to see that you are simply subjecting the cells to radically different selective environments that cause evolution and an increase in information, removing value judgments about cancer “(non-functional-cancerous)” that seem to be clouding your understanding of this as biology.

It’s not a big leap when you consider that many cells immortalized in culture will grow as tumors when you inject them in whole animals.

Here’s another scenario:

I have a mouse with a murine leukemia virus (MuLV) provirus (retroviral genome integrated into the host chromosome). That provirus has a single-base mutation that prevents it from being infectious. If that inactivating mutation is reverted, the virus activates, replicates, and starts carcinogenesis.

How is that reversion not an increase in information?

(John Mercer) #49

I’m confused. Are you saying that “the Pfam data” represents “all the proteins examined thus far”?

(S. Joshua Swamidass) #50

Thanks for the concern. I’m not confusing the two. I’m just asking you to be consistent.

This is not my approach. This is your approach. I’m just asking for us to use FSC to compute the FI of cancer using the same approach you published on pfam. You seem unwilling to do this, but that is all I’m asking for at this time.

This is not true. It is entertaining that you’d think so. The calculation would show something quite different. It seems like you don’t know how KL would be computed. Very interesting.

Regardless, your inferences are incorrect because your guess at what the KL calculation of FI would give is wrong. The rest of your logic is predicated on that erroneous calculation.

You can expect this to change going forward.

It appears I’m the first computational biologist with training in information theory to have looked over your math. It is clearly in error, but in a way that would be hard for most biologists to recognize. With this thread, and the thread on Cancer information (Computing the Functional Information in Cancer), and observers like @evograd, @rjdownard, @Art, and @mercer, the methodological errors will be more widely known.

The difference is that I am a computational biologist that has been applying information theory to biology for decades now. Very few people have the ability to competently review your work. I am one of them.

I’m sorry if I’m misrepresenting you. In your paper, FSC is computed using a maxent ground state. I think we agree that this is what you did. Right?

I’m saying we should see how that approach computes the information of cancer. I think it is in error on cancer, and it is also in error on protein families. You can’t use it for protein families if it can’t pass the basic control of cancer genomes. For you to compute the correct FI of a protein family, among other things, you have to use the equivalent of “normal”, which is the “ancestral” state in protein families. You didn’t do that for pfam, so the calculation is similarly invalid for pfam.

This is a helpful image. Your current approach starts from that ground state. However, it is not how evolution says things arose. So it is starting from a strawman of a ground state. You have to start from the ancestral state, not a maxent state.

Nope. I am sorry, but that is not the right calculation. Applying the formula from your paper, the FSC paper, produces about 6 billion bits of FSC. Would you like to step through the calculation? That should be fun.

Also “cancer” is not “non-functional.” As @Mercer has already noted. It is a precisely defined functional state that requires precise control of gene expression and new protein functions.

After letting go of the idiosyncratic use of the term “Shannon uncertainty” (which I do not agree with), I do agree that delta H can be negative. However, it does not mean what you think it does. Also, it is not clear that cancer has reduced H. You have to demonstrate this, and I am not sure it is true.

Except you haven’t even produced a single correct calculation. Clearly I am talking about information in a different way than you imagine I am.

The information needed to erase a hard drive is not much FI at all, perhaps as low as one bit. It also increases the function of the hard drive by increasing its capacity (i.e., its function). Stepping through this example might be instructive. It seems you didn’t realize that the FI using KL would be trivially low in this case. Also, if we want to use FSC or Delta H, we don’t know if it goes up, down, or stays the same. There are not enough details to tell. For KL, however, we can know quite easily.

Likewise, the information of cancer using a base state of maxent is not 0; it is, rather, about 6 billion bits of FSC. That also is an example worth stepping through.

That turns out to be false. KL divergence is the amount of information required to change between states; delta H is not. To make a simple analogy: let us say I have two documents, D1 and D2, that say totally different things and accomplish different functions. Let us say that the complexity is the same, so H(D1) = H(D2). So the delta H is zero. Now let’s say I have D1, but want to change it into D2. How much information do I need to do this?

Turns out that it is definitively not 0, even though delta H = 0. Instead, it will take KL(D2 || D1) bits of information to create D2 starting from D1. What you need is KL. This is a foundational concept in information theory. There really is no way around this.
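A minimal numerical sketch of the D1/D2 point, using two made-up four-symbol distributions. The names `d1` and `d2` and their values are illustrative assumptions, not data from the thread or from either paper. Because `d2` is just a permutation of `d1`, the two have the same Shannon entropy, yet the KL divergence between them is well above zero.

```python
import math

def entropy_bits(p):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def kl_bits(p, q):
    """KL(p || q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(x * math.log2(x / y) for x, y in zip(p, q) if x > 0)

# Hypothetical symbol distributions for two documents (illustrative only).
d1 = [0.7, 0.1, 0.1, 0.1]
d2 = [0.1, 0.1, 0.1, 0.7]  # same probabilities, assigned to different symbols

delta_h = entropy_bits(d1) - entropy_bits(d2)
kl = kl_bits(d2, d1)

print(f"H(D1)        = {entropy_bits(d1):.3f} bits")
print(f"H(D2)        = {entropy_bits(d2):.3f} bits")
print(f"|delta H|    = {abs(delta_h):.3f} bits")
print(f"KL(D2 || D1) = {kl:.3f} bits")
```

In this sketch delta H is zero up to floating-point error, while KL(D2 || D1) equals 0.6 * log2(7), roughly 1.68 bits.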

It seems that what is going on here is that you haven’t yet appreciated the error you made in thinking KL = delta H. Yes, that is true if and only if the base state is uniformly random. However, as soon as you deviate from this, it is no longer true that FI = -log (W), and the formulas you are using are no longer correct. You just realized this recently, earlier in this thread.
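For reference, the uniform-base-state case can be written out in a few lines. The notation is introduced only for this sketch: p is any distribution over N outcomes and u is the uniform distribution over the same outcomes.

```latex
\begin{align*}
D_{\mathrm{KL}}(p \,\|\, u)
  &= \sum_{i=1}^{N} p(i) \log_2 \frac{p(i)}{1/N} \\
  &= \log_2 N + \sum_{i=1}^{N} p(i) \log_2 p(i) \\
  &= H(u) - H(p).
\end{align*}
```

With a non-uniform base state q, the cross term -Σ p(i) log2 q(i) no longer reduces to H(q), so KL(p || q) and H(q) - H(p) generally differ.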

Of course, you have to argue that delta H is correct for there to be any coherence to your argument, but it is not. The example I just gave you is an example of that.

At this point, you have incorrectly inferred how to compute KL on cancer and in the erased hard drive. Based on that incorrect inference, you made incorrect computations on the FSC of cancer. That seems to be where we should start, taking a large number of cancer genomes, and computing their FSC by your published method.

It will produce an FSC computation of 6 billion bits. From there we can discuss the next steps forward. What do you think?

(S. Joshua Swamidass) #51

That is reasonable. I encourage you to keep the two directions separate in your posts. If this develops, we will put it in a separate thread.

I agree with @Mercer that you seem to be misunderstanding how cancer works. It certainly is not non-functional.

And here:

I would agree with @Mercer here. It seems you have a very narrow, and incorrect, view of how cancer develops. For example, a p53 inactivation mutation is not nearly enough to create cancer. Most tumors are benign, even though they do not control their growth. What is the difference between a benign tumor and cancer? The cancer has acquired several new functions that the benign tumor does not have.

(S. Joshua Swamidass) #52

2 posts were merged into an existing topic: Side Comments on Durston

(John Mercer) #54

Here’s another one that’s even more intuitive (I hate the math).

The goal is to ID human oncogenes.

We transfect mouse cells with tumor DNA and control nontumor DNA from a patient, literally adding information. Some cells form colonies and grow as tumors in test animals. The human DNA causing the transformation is identified (by methods that aren’t important for this concept). This is how human oncogenes were identified by Mariano Barbacid first, and by many others later.

Kirk, can you think of any way at all you can interpret that oncogene as a loss of information, functional or otherwise?

(Kirk Durston) #55

I thought I had made this clear some number of posts ago. I am not using KL. My initial reference to KL, starting in post #2, was due to misreading the particular definition dealing with Shannon information, as you pointed out, where I had mistakenly “seen” it equated to delta-H. To keep saying I am using K-L is misrepresenting me and my paper. I mentioned this before … neither my paper nor Hazen’s makes any reference whatsoever to K-L, so to keep insisting that my paper is using K-L is to misrepresent my paper. May I direct your attention, once again, to general Eqs. (2) and (6). You should be able to see that that is not K-L; it is delta-H. Delta-H should not be confused with K-L.

Please see page 11 of Shannon’s original paper, where he refers to H as a measure of “information, choice and uncertainty” after discussing Shannon entropy as uncertainty on the previous page. A reduction in uncertainty is a very useful concept when it comes to how much information was involved in that reduction.

We were talking about p53. There is no earthly way p53 can contain or requires 6 billion bits of functional information to encode.

First: Although K-L can be reduced to my Eq. (4), it only takes a glance at my paper to see that the general case from which the special case of Eq. 4 is derived, is not K-L, but Eq. 2, which embraces non-uniform probability distributions. It is essential that you do not confuse delta-H with K-L when discussing my paper.

Second: I thought I made this clear several times already. If the ground state cannot be described by a uniform probability distribution, then we use the more general Eq. (2) that I supply in my paper. It is easy. I do it all the time with the non-uniform probability distributions of the genetic code to estimate FI assuming the existence of translation. If you are looking at universal proteins that would have had to exist prior to translation, then you might want to use W. If not, then use the genetic code probability distributions.

Key Idea to Grasp: My use of delta-H is not idiosyncratic. It goes right back to Brillouin’s paper in 1951. Let me quote:

“This example is just a special instance of a general rule: any additional information about the system under consideration corresponds to a decrease in the total number of complexions P. We have to reject all the complexions that would not satisfy the additional conditions contained in our information. Hence, the total entropy decreases whenever we happen to have some special information about the structure of our physical system. We choose this entropy decrease as the physical measure of information.

I = -∆S = ∆N (S, entropy; N, negentropy). (53)”

He also discussed the effect of “additional conditions”/constraints on the physical system as reducing S (constrained) required to carry a message. He uses different wording, but the entire approach is there, which I use, as does Hazen.

The flash drive example is not an analogy. A good method should not just work in one particular area, but work across many areas. For example, in the Hazen paper, they discuss functional information not just for biology, but for letter sequences and digital output of the AVIDA program. In each case, we are working with two things, function and digital information. So the test of a good method is that it works in all cases dealing with function and digital information. Josh’s KL approach fails the test since it yields nonsensical results when functional information is lost.

When I referred to protein families examined, it was to the families in Pfam that I have examined.

I was speaking in reference to a p53 protein sequence that no longer codes for the normal function of p53. I know it is not a defective p53 gene that initiates a cancerous cell in the first place; the defective p53 simply cannot terminate the cell. So since the p53 gene is now not working, it is free to evolve, meaning almost any aa at any site will be fine, since the p53 gene is no longer constrained to carry out its original task. To use the Hazen formula, M ≈ N. The DNA is the storage “flash drive”, and the digital information within that gene is the functional information to be analyzed.

No worries, I am fine with working with anything. All I need to know is how “function” is being defined.

It is an increase. I don’t know the effect of the other 19 aa’s, but assuming that only that particular aa produces the effect, we are looking at 4.32 bits of functional information added, the maximum possible. That is easily within reach of an unconstrained physical system.

If I understand correctly what you described, that would be the transfer of functional information from one source to another, but not the production of novel functional information (so far as I can see). In that case, the physical system has been expanded to include the source, so there is a net zero gain in functional information (ignoring any functional information the scientists may be introducing through imposed constraints). If we ignore everything but the mouse, it is a gain in FI.

@Mercer you are asking good, productive questions. The more specific they are related to digital information within the cell, the better I can answer them, but I can estimate for more complex effects if need be.

(S. Joshua Swamidass) #56

We are talking past each other here. I agree that you are using delta H. I think this is in error, and you should be using KL. You, however, are using delta H. It seems we agree on this. Once we finish looking at delta H, we can turn to KL to see how this is a much better measure of FI (i.e. the amount of information needed to acquire a new function).

It is idiosyncratic to you because that is not the only way that Shannon uses it. We can let that go for now. I’m just trying to move forward without confusing people to think I agree with your use of that term. I do not.

Agreed. If we are talking about p53 it will be on the order of 100s of bits. If we apply it to cancer genomes, it comes out to be about 6 billion bits. Sorry that wasn’t clear.

Just looked at your paper again. It does not account for non-uniform probability distributions. There is no mention of non-IID or non-uniform probability distributions. The only context in which this math is valid is the case of a maxent base state. Otherwise it becomes invalid.

I suppose we just have to disagree here. Your use of the delta H is very idiosyncratic.

And there is a reference to Hazen, who explains that using this on extant sequences rather than uniformly random sequences is an error. If you are going to appeal to Hazen, you have to deal with what he has said about this.

You miscalculated the KL divergences. I’m happy to work you through the calculation. But it can be as low as a couple bits of information. This is a sensible result. It always takes information to move between states, in this case it only takes a small amount of information.

Clarify This?

@Kirk you’ve made a string of claims I can’t make sense of. First, what I am writing here is confined to your delta H approach.

At first you were trying to compute delta H between normal and carcinogenic p53, presuming that a specific amino acid needed to be mutated to one of the 19 other amino acids. That would mean:

delta H(cancer p53 | normal p53) = -4 bits

That is an approximate number. Whatever it is, it is just a few bits. Then you guessed the delta H of cancer relative to maxent, and arrived at this:

delta H(cancer p53 | maxent) = 0 bits

So that would seem to mean you think that:

H(normal p53) = 4 bits.

However I’m pretty sure you think that it is actually:

H(normal p53) > 100s bits.

So it seems we have a contradiction. It doesn’t seem that you can simultaneously argue that:

delta H(cancer p53 | normal p53) = -4 bits
delta H(cancer p53 | maxent) = 0 bits
H(normal p53 | maxent) > 100s bits.

Something has to give. What gives? What do you really think is going on? In my view, if we took a look at the data from actual cancer sequencing experiments, we would see (approximately):

H(normal p53 | maxent) > 100s bits.
H(cancer p53 | maxent ) > 100s bits.
delta H(cancer p53 | normal p53) = -30 bits

Those numbers are approximate. The actual numbers are going to adjust based on the precise data we use. Basically, we would see high FSC in both cancer and normal p53. The amount of information will be proportional to the length of the sequence. These are the numbers I think are reasonable for your delta H approach, and we can even compute them directly here too.

Your numbers, however, are not internally consistent. Can you explain which of those three things is to change?

The Bigger Picture

Taking a bigger picture view of this, I can see where confusion is arising. You are measuring the amount of functional entropy/information of different states, without accounting for the type of functional information. This seems to be pretty important.

In the case of normal vs. cancer, you are measuring the total amount of “normal” function and comparing that to the total amount of “cancer” information. Do you agree with that?

(S. Joshua Swamidass) #57

Clarifying this toy example some more. Of course, this is just a simplified model and better data would improve this. P53 has 390 amino acids. Let us imagine (this is not a fact) that all these amino acids need to be precisely correct for it to function normally. Carcinogenic TP53 (which we will abbreviate cP53) requires mutating a specific amino acid to some other amino acid; any one will do.

We can obtain a large number of normal P53 sequences, and a large number of cP53 sequences. What will their H be?

H(maxent ground state)

In this toy example, it seems that we will agree on this:

H(maxent ground state) = 390 * log2(20) ≈ 1686 bits
H(P53) = 0

This means that the FSC is going to be 1686 bits for normal p53. The divergence appears to be in how we compute H(cP53). What number do you compute? What do you think we will arrive at for the H of the cancer-function in P53 sequences?

H(cP53) = ?

I compute it at 1682 bits. This is merely applying your formula to the extant sequences we have, which (in this toy example) will all have the normal P53 sequence except in one location. Any other number would require either (1) changing your formula for FSC, or (2) smuggling in sequences for cP53 that were not actually part of the data set. What do you compute?
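The arithmetic of this toy example can be sketched as follows. The sketch makes one extra assumption not stated in the post: that the 19 alternative residues at the mutated site are equally likely. Variable names are hypothetical. Under that assumption H(cP53) is log2(19), about 4.25 bits, and the figure of about 1682 bits appears to correspond to H(ground) - H(cP53), with the single-site term rounded to 4 bits.

```python
import math

N_SITES = 390        # p53 length used in the toy example
N_AMINO_ACIDS = 20

# Maxent ground state: every site uniform over the 20 amino acids.
h_ground = N_SITES * math.log2(N_AMINO_ACIDS)

# Normal P53 in the toy model: exactly one allowed sequence.
h_p53 = 0.0

# cP53 in the toy model: one site takes any of the 19 other residues
# (assumed equally likely here); every other site is fixed.
h_cp53 = math.log2(N_AMINO_ACIDS - 1)

fsc_p53 = h_ground - h_p53
fsc_cp53 = h_ground - h_cp53

print(f"H(ground) = {h_ground:.2f} bits")
print(f"H(P53)    = {h_p53:.2f} bits, FSC = {fsc_p53:.2f} bits")
print(f"H(cP53)   = {h_cp53:.2f} bits, FSC = {fsc_cp53:.2f} bits")
```

The two FSC values differ only by the entropy of the single variable site, which is the few-bit gap discussed earlier in the thread.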

Currently, you are arguing two mutually exclusive positions at the same time:

(John Mercer) #58

Understood, but that only functions in the context of a whole cell. I am simply trying to point out that cancer is a gain of function.

OK. Glad that we have that settled! :disappointed_relieved:

I think you missed the experimental aspect and concentrated on the method; apparently I didn’t get my point across.

My point wasn’t about the transfer of information, but that the experiment showed that the tumor DNA had information that the normal control DNA from the patient did not have.

Does that clarify? I’m trying to keep it simple and avoid the math.

(Kirk Durston) #59

Yes, I agree; it seems we have been talking past each other. Given that we cannot deny that ∆H clearly means something related to a change in physical state, I want to give a final explanation of the elegance and utility of my approach that is clear enough for any onlookers to understand where I am coming from. Then I think we should move on to discuss specific applications as a test and demonstration of the approach.

I wonder if your objections are arising from the assumption that this is an information-only problem. My approach is that this is, first and foremost, a physical system problem, which can be quantified in terms of functional information (FI). The physical states must be defined and understood before we can attempt to quantify them in terms of bits of information.

I present this final summary, not for the purpose of arguing any further about it, but simply as an explanation. In science, we do not always have to converge on just one approach but we all desire clarity on whatever the approach may be, so here is my final summary before we move on to application.

Final summary of the theory behind my approach to FI:

A clear distinction must be made between a physical system and information. Physical functions may require a change in the physical system in order to achieve that function. This physical change can be quantified in terms of FI.

Given any physical system ( phys ), the Shannon entropy of that system can be described as

H(phys) = -K Σ_i p(i) log p(i) = x bits (1)

where K defines the units of measure in terms of bits (in this case) and the summation is performed over all i from 1 to N , and N is the total number of possible configurations of the physical system, and p ( i ) represents the probability distribution over the options, not necessarily uniform.

If physical constraints must be imposed on a physical system in order to satisfy a physical function within a larger system, the Shannon entropy of the resulting constrained physical system ( phys + functional constraints ) of M functional options can be described as

H(phys + functional constraints) = -K Σ_j p(j) log p(j) = y bits (2)

where the summation is performed over all j from 1 to M .

The change in the physical system to achieve the required physical functionality is, first and foremost, physical, but it can be quantified in terms of Shannon information as

∆H = H(phys) - H(phys + functional constraints) (3)

= x bits – y bits = z bits (4)

Is this change meaningful?

The physical requirements to satisfy some physical function in a physical system can be quantified in terms of z bits of information. Since z bits quantifies functional requirements in terms of information it is meaningful and important. I call this functional information to distinguish it from classical information, where we are not concerned whether the message is functional/meaningful or not. Therefore, FI is defined under my approach as

FI = ∆H = H(phys) - H(phys + functional constraints) (3)
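A minimal sketch of this definition for a single site, assuming base-2 logarithms with K = 1 so the result is in bits. The function names and both probability distributions are hypothetical illustrations, not values taken from the paper.

```python
import math

def shannon_entropy(probs, k=1.0):
    """Return H = -k * sum(p * log2(p)); in bits when k == 1."""
    if not math.isclose(sum(probs), 1.0, rel_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    return -k * sum(p * math.log2(p) for p in probs if p > 0)

def functional_information(ground, constrained):
    """FI = H(phys) - H(phys + functional constraints), in bits."""
    return shannon_entropy(ground) - shannon_entropy(constrained)

# Hypothetical ground state: 20 equiprobable amino acids at one site.
ground = [1 / 20] * 20
# Hypothetical functional state: only three residues tolerated.
constrained = [0.5, 0.25, 0.25]

fi_gain = functional_information(ground, constrained)
# Going the other way (functional -> ground) flips the sign of delta H.
fi_loss = functional_information(constrained, ground)

print(f"FI gained: {fi_gain:.3f} bits")
print(f"delta H in the reverse direction: {fi_loss:.3f} bits")
```

Nothing in `shannon_entropy` requires the ground distribution to be uniform, so a skewed ground state can be passed in the same way.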


This method can be used to provide an objective quantification of anomalies across the natural world, with applications in forensic science, biology, SETI, archaeology, and chemistry, from which an investigator can draw meaningful inferences, especially if such anomalies appear to have a function.

Moving On:

As I said above, I post the above summary not for the purpose of arguing about it any further, but so everyone is clear on what I am talking about. Given that we agree that my approach is not the same as how @Josh would approach this problem, let us move on to how I apply this. I think I should start with one or two very simple examples for the sake of all onlookers, and then look at applying it to protein families and cancer. I’ll start that in my next post. I’ll try to post more often than once per week, time permitting.

(S. Joshua Swamidass) #60

Okay, we can take that direction.

I’m still asking you to clarify now in the case of your toy problem what the entropy (H) of carcinogenic p53 is…

I’d like to know what you think H(cancer p53 | maxent) is in this toy (counter-factual) example.

(Kirk Durston) #61

Ok, let’s take a look at p53. Just to make it a little more interesting and realistic, I downloaded an 872-sequence MSA from Pfam for the p53 domain and ran it through a program I wrote. You can see a portion of the results in the linked file, but I’ll highlight the main items of interest here.

Initial Info:

Number unique sequences: 487

No. columns in MSA before stripping out insertions: 669

No. Columns in MSA after reducing the sequences to their essential core: 187

Average no. of different aa’s tolerated per site: 14

Estimated minimum no. of mutational events per site: 23


The Pfam MSA does not have adequate data for the p53 domain to obtain an acceptable estimate of functional information. I prefer to have a minimum of 30 estimated mutational events per site and at least a few thousand unique sequences. The data here demonstrates only a minimum of 23. Nevertheless, let’s just work with what we have as a “toy” problem.

Functional information required to code for a p53 domain:

  • starting from a physical system in the null state = 409 Bits (Avg. density = 2.17 bits/site)

  • starting from a physical system skewed by the genetic code = 374 Bits (Avg. density = 1.99 bits/site).

  • extreme lower limit = 123 bits (Avg. density = 0.65 bits/site).
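The program used for these numbers is not shown in the post. As a rough sketch of how a null-state estimate of this kind could be computed from an alignment, the code below sums, over columns, the difference between log2(20) and the observed column entropy. It assumes independent sites, a uniform null over 20 amino acids, and no gaps; the function names and the four-sequence alignment are hypothetical.

```python
import math
from collections import Counter

def column_entropy(column):
    """Plug-in Shannon entropy (bits) of one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def fi_from_msa(msa, n_states=20):
    """Sum over sites of H(null site) - H(observed site), in bits."""
    h_null_site = math.log2(n_states)
    return sum(h_null_site - column_entropy(col) for col in zip(*msa))

# Hypothetical toy alignment: 4 sequences, 3 sites, no gaps.
toy_msa = ["MKR", "MKR", "MRR", "MQR"]

total_fi = fi_from_msa(toy_msa)
density = total_fi / len(toy_msa[0])

print(f"FI = {total_fi:.2f} bits, density = {density:.2f} bits/site")
```

With few sequences, the plug-in column entropy tends to be underestimated, which pushes an estimate like this upward; that is one reason sampling depth matters for the result.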

Functional information for non-functional p53 domain:

Here, it is critical to clearly define what the function is that determines whether a sequence is functional or not. There are three perspectives we will take: a) relative to an intelligent observer, b) relative to the normal, physical p53 function, c) relative to a cancerous cell. Looking at the attached data, we observe that site 180 permits only one specific amino acid, arginine. Let us suppose that it is mutated to something else, thereby inactivating p53 so that it can no longer perform its function. The mutated sequence falls into the set of sequences that are non-functional relative to the normal function. Let us use the extreme lower limit of 123 bits for a functional p53 domain.

a) relative to an intelligent observer: Even though the sequence is now non-functional, we can see from the data I linked to that only 4.32 bits of functional information has been lost, but the function has shifted from fulfilling its cellular duties to becoming “meaningful” to an intelligent observer, relative to the cellular function the intelligent observer “knows” it should have. The intelligent observer “knows” what the desired function is and “sees” that this sequence is quite “close”. The physical system does not work that way; it does not know or see anything, only whether it is functional or not.

b) relative to the normal p53 function: the mutated sequence is now a member of the set of non-functional sequences. The cell cannot “see” how close the sequence is; the sequence either satisfies the function or it does not. In this case it does not. A non-functional sequence requires approximately 0 bits to encode. Relative to the physical system, 123 bits of functional information has been lost. The gene is now inactive. This, however, is the simplest case. There are more complex cases that we shall leave aside for now.

Let’s pause here to underscore a critical point, and a frequent source of confusion. “Meaningful” is a special case of the more general “functional”. For intelligent observers/engineers, a sequence can be meaningful to the mind, while at the same time be non-functional to the physical system. For example, an intelligent observer attaches a certain degree of “meaning” to a p53 sequence that is “close” to being functional. The cell may not. In discussions of functional information, therefore, the function that is being discussed must always be clearly defined, as distinguished in the above example between (a) and (b). A common mistake is to conflate functions, illustrated by overlapping Venn diagrams, when the two functions are actually completely independent, not even existing in the same reference frame.

c) relative to a cancerous cell: There are three possibilities relative to the physical system of a cancerous cell, setting subjective “meaning” aside.

  1. If p53 is still functional relative to its normal physical function, then the FI = 123 bits.

  2. If p53 is non-functional relative to its normal physical function, then FI is 0 bits.

  3. If p53 is non-functional relative to its normal function but has a new function to do with cancer, then we need to have some way of determining what sequences satisfy that new function, before we can estimate FI. If non-functional p53 is simply a “bystander” in a cancerous cell (i.e., doesn’t do anything), then without a function, by definition, a system has no FI. FI is contingent on a specified function. If we wish to arbitrarily say that being a “bystander” is a function, then we must realize we are no longer talking about any function required by the physical system … the system does not require “bystanders”. We are now talking about an arbitrary, subjective function that we have invented in our minds. We can still estimate FI in that context, but it no longer has anything to do with biology.

(S. Joshua Swamidass) #62


First, I am not interested in “non-functional” p53; we are interested in carcinogenic p53.

Still waiting…

Second, before we get to actual data, I want to work out this toy example. I note that you have not yet answered the question. What do you compute for H(cP53 | maxent) based on the observed sequences? It cannot be two or three numbers. It can only be a single number. The closest to a straight answer is here:

This seems to be an admission that the H(cP53) computed from sequences is 4 bits less than H(P53), even though it should be zero (in your mind). Strangely, you also wrote contradictory things to this point. It is a straightforward question. Looking at the extant sequences, what do you compute as the H(cP53)? What do you compute as H(P53)? What do you compute as H(ground)? What is required here are the numbers computed from the extant sequences, because that is all you use to compute FSC.

Note also that this is not relative to anything. It is merely the number computed from the extant sequences. We should be able to compute the delta H between any two states once the H of all states are established. Why is it so difficult to give us this number?

Where to find cancer data

Third, when we do actually work with data, there are over 100,000 examples of normal p53 in ExAC, so try using that. There are over 10,000 examples of carcinogenic p53 in CTAG, so try using that. We have more than enough data to make sense of this. The numbers calculated from this data are very different from yours.

Why so difficult?

@Kirk, I’m not sure why this is taking so long to establish H(cP53) or why you are writing so much. I am just asking you to apply your formula to cancer in a toy example to produce a straightforward answer. It is a well-defined problem.

Show me how you computed this number from the sequences in this example. It appears that you are stating what FSC should be, rather than what it actually is computed to be from extant sequences. We need to know what FSC is computed as, not what you think it should be if it is a valid measure of FI. So I should repeat, for our toy example (not jumping ahead), using extant sequences and the FSC method you published, what are these quantities:

  1. H(P53)
  2. H(cP53)
  3. H(ground)

I’ve asked this several times now:

So, based on extant sequences as have been described in this example, I compute it at:

H(P53) = 0
H(cP53) = 4
H(maxent ground state) = 1686

This means the FSC computations are:

FSC(P53) = 1686
FSC(cP53) = 1682
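To make the arithmetic explicit, here is a minimal sketch of how those two FSC values follow from the three entropies. It assumes FSC is simply the entropy of the ground state minus the entropy of the observed state; the function name `fsc` is made up for illustration, and the inputs are just the toy numbers quoted above.

```python
# Toy-example entropies in bits, copied from the figures quoted above.
H_P53 = 0
H_CP53 = 4
H_GROUND = 1686  # maxent ground state, as stated in this post


def fsc(h_ground: float, h_state: float) -> float:
    """Hypothetical helper: entropy drop from the ground state to a state."""
    return h_ground - h_state


fsc_p53 = fsc(H_GROUND, H_P53)
fsc_cp53 = fsc(H_GROUND, H_CP53)
print(fsc_p53, fsc_cp53)
```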

What numbers do you compute from the extant sequences in this example? Yes, I know you think FSC should equal zero to be valid. For your interpretation to hold, I agree. I am asking here instead what the actual FSC computation gives us, which is NOT zero, which means that FSC is not valid. If we can’t get a straight answer on this, it seems that this conversation is coming to a close.

(S. Joshua Swamidass) #63

@Kirk, I am still really hoping to get these numbers from you. It would help make sense of what you are even proposing. From those numbers we should be able to compute the FSC and compare that number with what you think the FI is there. It will be enlightening! Are you having difficulty with this computation?

This is what I came up with:

Of course, that is just in our example. To compute the actual numbers for P53, we would use databases like ExAC and CTAG.

We can also compute the difficulty of evolving cP53 using delta H (your method) versus KL (what I think is the correct method). I’m very curious to make sense of a negative FSC. That should be fun.

We can also generalize this to DNA easily, and compute it across whole genomes! This should be fun too. The ~6 billion bits of FI calculation is in sight right now.

Also, for reference, mutant P53 has a large range of complex functions, including gain of function. It is not merely obliterating activity. (see here: https://p53.fr/tp53-database/mutload)

(Kirk Durston) #64

Post 15: The original, very simple example where we considered only 1 site (not the entire protein sequence) seems to have led to some unnecessary confusion. We will look at that one more time, and then move on to the more realistic example where I think there will be less confusion.

Example #1: A single site

Given: H(MaxEnt) occurs when all 20 amino acids are equi-probable and, for this example only, H(Func) requires one, specific amino acid (which is highly unrealistic in almost all real-life cases).

H(MaxEnt) = log2(20) = 4.32 bits

H(Func) = log2(1) = 0 bits

FI = FSC = ∆H = H(MaxEnt) – H(Func) = 4.32 bits (using Eq. 5 in my paper).

For H(cancer) we need to know how many amino acids are tolerated at that site in a tumour. Without that information, we cannot estimate H(cancer). Just for the sake of this toy example, however, let us imagine that we sample a very large number of tumours and find that 7 different amino acids are now tolerated with equal probability at that site.

The ground state in going from a normal, functional gene to this cancerous gene is not H(MaxEnt); it is H(Func). To clarify, presumably we are going from a physical state defined by a functional gene to a mutated gene physical state defined by whatever restraints the cancerous state imposes. This requires that we use the more general Eq. 2 in my paper.


H(cancer) = log2(7) = 2.80 bits

FI = FSC = ∆H = H(Func) – H(cancer) = - 1.52 bits, indicating a loss of functional information from the original physical state.
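As a quick check of the single-site arithmetic, the entropies above can be evaluated directly. This is a minimal sketch; the variable names are illustrative, and the figure of 7 tolerated amino acids is the hypothetical value from this toy example, not measured data.

```python
import math

# Single-site entropies in bits, assuming equiprobable amino acids in each state.
h_maxent = math.log2(20)  # all 20 amino acids equally likely
h_func = math.log2(1)     # exactly one amino acid tolerated
h_cancer = math.log2(7)   # hypothetical: 7 amino acids equally tolerated

# Differences discussed in the example.
fi_func_from_maxent = h_maxent - h_func      # ground state is MaxEnt
delta_func_to_cancer = h_func - h_cancer     # ground state is H(Func)
fi_cancer_from_maxent = h_maxent - h_cancer  # cancer state measured from MaxEnt

print(round(fi_func_from_maxent, 2))
print(round(delta_func_to_cancer, 2))
print(round(fi_cancer_from_maxent, 2))
```

Evaluated as written, H(Func) – H(cancer) comes to about -2.81 bits, while H(MaxEnt) – H(cancer) comes to about 1.51 bits. The change in going from the functional state to the cancerous state is negative in either reading.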

I don’t understand where you get a H(maxent ground state) of 1,686 bits for a single site problem, but I expect you are considering the entire sequence, in which case we should look at the second, more realistic problem.

Example #2: A more realistic problem using p53

(we can use other databases later. Let’s continue with the Pfam MSA for the sake of this semi-toy, but more realistic example).

Given: A core sequence length of 187 aa’s after insertions are stripped out

H(MaxEnt) = log2(20^187) = 808 bits

H(p53) = 401 bits (estimated by my program)

FI = FSC = ∆H = 808 – 401 = 407 bits.
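For readers who want to see what a per-site entropy estimate over an alignment looks like, here is a minimal sketch. It assumes H is approximated as the sum of per-column Shannon entropies over an ungapped alignment, using observed residue frequencies. The function names and the three-residue toy alignment are hypothetical, and this is not the program that produced the 401-bit figure.

```python
import math
from collections import Counter


def column_entropy(column: str) -> float:
    """Shannon entropy (bits) of one alignment column from observed frequencies."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def alignment_entropy(msa: list[str]) -> float:
    """Sum of per-column entropies; assumes equal-length, ungapped sequences."""
    return sum(column_entropy("".join(col)) for col in zip(*msa))


def fsc_from_maxent(msa: list[str], alphabet_size: int = 20) -> float:
    """Entropy drop from a uniform (MaxEnt) ground state to the observed alignment."""
    h_maxent = len(msa[0]) * math.log2(alphabet_size)
    return h_maxent - alignment_entropy(msa)


# Hypothetical toy alignment: two fully conserved sites, one site split 50/50.
toy_msa = ["ARN", "ARD", "ARN", "ARD"]
print(alignment_entropy(toy_msa))
print(fsc_from_maxent(toy_msa))

# MaxEnt entropy for a 187-residue core sequence, as in the example above.
h_maxent_187 = 187 * math.log2(20)
print(round(h_maxent_187))
```

A frequency-based estimate like this depends heavily on how well the alignment samples the tolerated sequences, which is the sampling question raised earlier in the thread.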

Your number of 1,686 bits is not possible for a core sequence length of only 187 aa’s, but I expect you are using a much longer sequence, which requires great care. Including insertions and tandem repeats will artificially inflate FI. We must stick with the core sequence for the p53 domain.

Estimating H(cP53) is more challenging. We do not need to know what new function it is serving in a cancerous tumour, but we do need an idea of what amino acids the new function (if there is one) tolerates at what sites along the sequence for this hypothetical cancerous function. To do that we will need a large MSA consisting only of non-redundant cP53 that tests positive for adequate sequence space sampling.

So we are in a poor position to estimate H(cP53) but we can make a very rough estimate based on the knowledge that the TP53 gene is the most frequently mutated gene (>50%) in human cancer (Surget et al., 2013). That strongly suggests it is no longer providing a required function(s) within the tumour or, at the very least, the number of tolerated aa’s is increased in normally highly conserved sites. So what we can say is that the data suggests,

H(p53) < H(cP53).

Therefore, FI = ∆H = H(p53) - H(cP53) < 0,

indicating a loss of functional information in the degradation of the p53 gene to cP53.

(S. Joshua Swamidass) #65

What is Negative FI?

So now we are crystal clear that FSC can be negative. Great. How do you interpret negative FSC? Recall, you write:

FI = FSC = delta H

FI is defined as:

-log(N / M)

Where N is the number of functional sequences, and M is the total number of sequences. That means that a negative FSC means there is a negative FI, so…

-log(N / M) < 0

Working out the algebra, we arrive at:

N > M

How is it possible for the number of functional sequences to be greater than the total number of sequences? With some similar algebra, we can also derive N = M 2^{-FSC}. This means that if FSC is, say, -10, there are 2^10 times more functional sequences than sequences possible.
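The algebra can be checked numerically. This is a minimal sketch assuming base-2 logarithms (bits); the function names are made up for illustration.

```python
import math


def functional_information(n_functional: float, m_total: float) -> float:
    """FI = -log2(N / M), in bits."""
    return -math.log2(n_functional / m_total)


def functional_fraction(fsc_bits: float) -> float:
    """Invert the definition: N / M = 2 ** (-FSC)."""
    return 2.0 ** (-fsc_bits)


# One functional sequence out of 20 possibilities gives a positive FI.
print(round(functional_information(1, 20), 2))

# An FSC of -10 bits implies N / M = 2 ** 10.
ratio = functional_fraction(-10)
print(ratio)
```

Any negative value passed to `functional_fraction` returns a ratio above 1, which is the N > M condition derived above.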

How do you make sense of this contradiction?

Most of us would take it as a clear indication that there is an error in your derivation. I can even point out the error too: you are using delta H instead of KL. As soon as you use a different base state than maxent, moreover, the relationship to FI is no longer valid. Why are you unconcerned by this nonsensical result?

Still Haven’t Answered the Question

I just do not understand why you can’t give me a straight answer to this question, based on the toy example that you yourself proposed. Can you please try? Can you please just humor me here?

What do you compute as these numbers? It is not sufficient that you have only discussed the single site where there is a mutation. I am waiting for you to consider the whole protein, using the formula that you proposed. Using this formula, I get these numbers:

Do you agree? If not, why not? If you do agree, how do you make sense of the prior computations you made, that are not consistent with these numbers?

(S. Joshua Swamidass) #66

@Kirk, I am confused by this. At first I thought you had caught an arithmetic error (which I would have gladly retracted).

P53, however, has 393 amino acids, not 187. Where are you getting 187 from?

This also is not what I am suggesting. We are NOT in a poor position to estimate H(cP53). It is very easy to do this. I can give you thousands of sequences of cP53. Your estimate, we will see, is very poor. That is one anomaly you have to make sense of in this analysis.

Still, I agree that it will likely be negative, so I agree that:

You still have not explained how this could possibly be the right way to compute FI. How do you interpret a negative FI? As you have put down in the math,

This appears to be a clear contradiction in your math, that I can resolve…

Can you help me out on these points?

Also, I’m still waiting for your H(cP53), H(P53) and H(maxent) for the toy example you gave.

(Kirk Durston) #67

What is negative FI?

Given: FI = ∆H = Ha – Hb

∆H is a measure of the change in the physical system in going from some initial state (a) to some functional state (b). If ∆H is positive, FI is required to go from state (a) to state (b). If ∆H is negative, less FI is required for state (b) than state (a). FI is therefore lost.

If FI = -log(N/M), can N>M?

Yes. For example, if state (a) represents P53 in a healthy cell, and we assume the sequence is more highly conserved than cP53 in state (b), and the total number of permissible sequences in state (a) is M, and the total number of permissible sequences in state (b) is N, then we have a case where N>M, and FI will be negative.

The key point here is that for cancer, the ground state (state (a)) is not MaxEnt. Instead, it is H(P53). If one wishes to estimate how much functional information is required to code for a cP53 family from scratch then one must use H(maxent). But in cancer, we are going from H(P53) to H(cP53).

So in this case, the answer to the question, “does cancer create FI or lose FI?”, the negative ∆H reveals that FI is lost in going from the healthy to the cancerous state.

Straight answer for the whole sequence?

I am puzzled by your statement, “I am waiting for you to consider the whole protein”. That is what I did in the previous post, unless by “whole protein” you mean the full sequence of 393 aa’s rather than the 187 aa’s. I literally went to Pfam and got my sequence data from there.

Pfam only provides the MSA for the P53 DNA binding domain. When I strip out the insertions to get the smallest possible core sequence, I end up with only 187 aa’s (which is close to the consensus length for the DNA binding domain).

Perhaps by the “whole protein” you are extending the original toy, one-site, single aa example to a sequence of 393 sites, where only one aa is tolerated at each site. That is not what I originally was talking about, but if that is the case, then I agree with your numbers for H(P53) and H(MaxEnt), but I don’t see how you would get H(cP53) = 4 for the whole sequence. I thought we were looking at 7 aa’s tolerated at one site. Extending this to 393 sites, that should give us H(cP53) = 595. In real life, of course, H(P53) is not even remotely close to 0, nor would H(cP53) be.

Using ExAC or CTAG

I think you said that ExAC has a hundred thousand sequences for P53. I have not used it before, but for P53, it says it only contains a total of 567 variants. I get 487 variants from Pfam, but that is after stripping out the insertions, which would have given a large number. So these hundred thousand sequences on ExAC must include an enormous number of redundant/identical sequences which, after removing the redundant ones, reduce to only 567 variants. I’m happy to use whatever database you prefer, but I would need to take some time to familiarize myself with them. If, however, Pfam is doing a much better job of filtering out all redundant/identical entries, then Pfam would be more efficient to use.
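To illustrate what removing redundant/identical entries means in practice, here is a minimal sketch that collapses exact duplicates while keeping the first occurrence of each sequence. The function name and the short example sequences are hypothetical; real databases may define redundancy differently (for example, by a similarity threshold rather than exact identity).

```python
def unique_sequences(sequences: list[str]) -> list[str]:
    """Drop exact duplicates, preserving the order of first appearance."""
    return list(dict.fromkeys(sequences))


# Hypothetical sample: five records, three distinct sequences.
records = ["MEEPQ", "MEEPQ", "MEDPQ", "MEEPQ", "MKEPQ"]
distinct = unique_sequences(records)
print(len(records), len(distinct))
```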