Durston: Functional Information

swamidass · October 16, 2018, 10:53pm

To help those following along, it appears that @Kirk is arguing for two contradictory things. He appears to want to compute the functional information of cancer differently than he computes the functional information of a protein family.

In the Case of Protein Families

Consistent with our computations cancer, I’m arguing we should compute the amount of information to move from an ancestor to a functional state. This is is not easy to compute from the data. Regardless, this is not what @Kirk is computing.

In this case, @Kirk is computing:

In the Case of P53

Here, @kirk is arguing to use the delta H, or the difference in areas between two highly ordered states, cancer and normal p53. Cancer is a slightly larger area, so this gives us a negative H (which has no useful meaning here. This is a very new and different approach that is not how he computes the functional information of protein families.

I am instead saying we should use KL divergence, the amount of information required to move from normal to cancer.

If he is unwilling to do this (the correct computation, I’m arguing he should at least be consistent and use the same method he is using for computing the FI of protein families.

What Should be Done?

The answer seems obvious to me. We need to see how his published method works on cancer p53 rather than using a different method first. The two methods are different and should be treated differently.

swamidass · October 16, 2018, 10:55pm

Note also that the way this was originally understood by you was as KL divergence.

swamidass · October 17, 2018, 6:02pm

@Art and @Mercer , my conversation with @Kirk hit one rabbit trail here, that I’m not pursing. Still, I’m hopefully get your most concise and salient response to it.

Kirk:

Your response is that evolution can’t optimize proteins. I disagree. Your work certainly does not demonstrate this.

Your recollection of my response is incorrect. What I have and continue to say is that the fitness gradients at the boundaries of functional sequence space are likely pretty steep (i.e., very selectable). You have not seen any of my work on this, so I’m unclear as to why you would believe that my work does not demonstrate this. The real-world data reveals that for most proteins (all the ones examined thus far), the belief that, within functional sequence space, there is a global maximum upon which the sequences converge, is falsified. The fitness gradients for certain species in a certain environment might be sufficiently steep to enable selection, but for the entire protein family, it appears there are a large number of local maxima and they are widely distributed in sequence space. Whether Hazen or you believe this or not is interesting, but says nothing about whether it is true or false. The data says it is false (and we have not yet looked at it, so I understand your reluctance to agree).

In the context of this larger conversation, my assessment is:

There isn’t a reason to belabor the point. @Kirk doesn’t have to respond you necessarily. It might also turn into a different thread. I’m still curious how you understand this.

Mercer · October 17, 2018, 8:38pm

Kirk:

What I have and continue to say is that the fitness gradients at the boundaries of functional sequence space are likely pretty steep (i.e., very selectable). You have not seen any of my work on this, so I’m unclear as to why you would believe that my work does not demonstrate this. The real-world data reveals that for most proteins (all the ones examined thus far), the belief that, within functional sequence space, there is a global maximum upon which the sequences converge, is falsified. The fitness gradients for certain species in a certain environment might be sufficiently steep to enable selection, but for the entire protein family, it appears there are a large number of local maxima and they are widely distributed in sequence space. Whether Hazen or you believe this or not is interesting, but says nothing about whether it is true or false. The data says it is false (and we have not yet looked at it, so I understand your reluctance to agree).

Hi Kirk,

I would appreciate it if you would elaborate a bit on this conclusion and point me to the real-world data to which you refer. I’m not quite sure how one would define the boundaries of functional sequence space, either.

Just so you know the level at which to throw it at me, I have a rich background in protein engineering and biochemical analysis of clinically-relevant mutant proteins. I also have a lot of exposure to world-class biophysics, but not hands-on experience.

Thanks!

Kirk · October 18, 2018, 9:57pm

Reading through @Josh’s comments, it appears to me that you (@Josh) might be confusing classical information with functional information. That is why your approach gives such an obviously wrong answer when applied to cases where there is a loss of functional information, such as in cancer.

Testing your approach: It should be clear that your K-L approach does not work when it comes to quantifying functional information. By your lights, if I had a flash drive containing the digital information required to build a next generation laptop and I accidentally reformatted the flash drive, such that it contained only randomized gibberish, your calculations would show that we actually gained a large amount of functional information!!! This is just one of three reasons I suspect you are confused between classical information and functional information. I have not yet addressed the other two mistakes you are making. In classical information theory even randomized gibberish is “information”.

There are two issues here: 1) how can we estimate the functional information required to code for a protein family and, 2) how well does the Pfam data represent functional sequence space? I have had discussions with a variety of biologists. Their only concern/criticism is issue (2) … whether the sampling in Pfam is an adequate representation of functional sequence space. None of them have a problem with the method; you are unique. When I initially submitted the paper for review, one of the reviewers was very critical of the paper, explicitly stating that he/she had reservations about applying classical information theory to protein sequences. After a few tweaks and some clarification, the editor sent the revision, with comments, back to that critical reviewer and he/she was completely satisfied. So has every other biologist I have discussed this with.

No. It is statements like the above that really leave me bewildered as to why you insist on misrepresenting me and what I am saying. I can respond to the p53 example either way and have done so (either using section A or section B in my paper), but let me try again.

As you insist, let’s set the ground state back to the naked, lifeless planet earth, assuming that all possible options of amino acids or sequences (whichever you prefer) are equi-probable so that we have the special case where the ground state I describe in my paper is “the null state”.

H (ground) = H (null)

We know that without any functional constraints to produce an active p53 protein, almost any sequence or aa distribution will do. This has become abundantly obvious for every biological protein we have ever looked at; it is a piece of cake to string together aa’s to get non-functional sequences. Not so the other way around. Given the above, we get

H (ground) = H (null) ≈ H (non-functional-cancerous).

So using the method described in section A, the amount of information I would have to hand you, in order for you to build a non-functional p53 protein, would be

H (null) – H (non-functional-cancerous) ≈ 0 bits.

Essentially, we don’t need any information to produce non-functional p53; all we have to do is simply let the system produce non-constrained aa sequences.

I can demonstrate the same outcome using Hazen’s approach, but I focus here only on mine.

This works for any probability distribution you care to invoke, except for the one probability distribution that defines functional p53 sequences or aa distributions.

Your approach, on the other hand, gives us an answer that is right by classical information theory, but clearly wrong when it comes to functional information.

You have agreed that the difference in Shannon uncertainty between two states can be either positive or negative, so the solution to your problem of always getting a positive, increase in functional information, even when functional information is lost, is staring us all in the face.

To put it in words: The reduction in Shannon uncertainty to go from a non-functional ground state to a functional state is equal to the amount of functional information required to reduce that uncertainty. Conversely, the increase in Shannon uncertainty in going from a functional state to a non-functional state is equal to the amount of functional information lost. I did not directly address the topic of functional information loss in my paper, but Eq. 2 in my paper works just fine.

My apologies for not yet addressing this but I would like to stay on current topic at least for this post. I need to answer your question though, as it is also an objection Josh has raised, and other biologists in general, and needs to be addressed. Teaser: I think the Pfam data leans heavily in my favour.

Mercer · October 18, 2018, 10:41pm

Hi Kirk,

First, I hope that you are not trying to portray an analogy as a test of an approach or a model.

Second, your analogy makes no sense to me, because the genomes of cancer cells are in no way randomized gibberish.

If I may make a suggestion, instead of thinking of them as cancer, think of taking some cells from your body and growing them in culture. Most of them won’t last long, but if you culture enough of them, you will select for variants that are immortalized and will continue to divide.

If you view it this way, it is easier to see that you are simply subjecting the cells to radically different selective environments that cause evolution and an increase in information, removing value judgments about cancer “(non-functional-cancerous)” that seem to be clouding your understanding of this as biology.

It’s not a big leap when you consider that many cells immortalized in culture will grow as tumors when you inject them in whole animals.

Here’s another scenario:

I have a mouse with a murine leukemia virus (MuLV) provirus (retroviral genome integrated into the host chromosome). That provirus has a single-base mutation that prevents it from being infectious. If that inactivating mutation is reverted, the virus activates, replicates, and starts carcinogenesis.

How is that reversion not an increase in information?

Mercer · October 18, 2018, 10:46pm

I’m confused. Are you saying that “the Pfam data” represents “all the proteins examined thus far”?

swamidass · October 19, 2018, 1:58am

Thanks for the concern. I’m not confusing the two. I’m just asking you to be consistent.

This is not my approach. This is your approach. I’m just asking for us to use FSC to compute the FI of cancer using the same approach you published on pfam. You seem unwilling to do this, but that is all I’m asking for at this time.

This is not true. It is entertaining that you’d think so. The calculation would show something quite different. It seems like you don’t know how KL would be computed. Very interesting.

Regardless, your inferences are incorrect because your guess at what KL of FI is wrong. The rest of your logic is predicated in that erroneous calculation.

You can expect this to change going forward.

It appears I’m the first computational biologist with training in information theory to have looked over your math. It is clearly in error, but in a way that would be hard for most biologists to recognize. With this thread, and the thread on Cancer information (Computing the Functional Information in Cancer), and observers like @evograd, @rjdownard, @Art, and @mercer, the methodological errors will be more widely known.

The difference is that I am a computational biologist that has been applying information theory to biology for decades now. Very few people have the ability to competently review your work. I am one of them.

I’m sorry if I’m misrepresenting you. In your paper, FSC is computed using a maxent ground state. I think we agree that this is what you did. Right?

I’m saying we should see how that approach computes the information of cancer. I think it is in error on cancer, and it is also in error on protein families. You can’t use it for protein families if it can’t pass the basic control of cancer genomes. For you compute the correct FI of a protein family, among other things, you have to use the equivalent of “normal” which i the “ancestral” state in protein families. You didn’t do that for pfam, so the calculation is similarly invalid for pfam.

This is an helpful image. Your current approach starts from that ground state. However, it is not how evolution says things arose. So it is starting from a a strawman of a ground state. You have to start from the ancestral state, not a maxent state.

Nope. I am sorry, but that is not the right calculation. Applying the formula from your paper, the FSC paper, produces about 6 billion bits of FSC. Would you like step through the calculation? That should be fun.

Also “cancer” is not “non-functional.” As @Mercer has already noted. It is a precisely defined functional state that requires precise control of gene expression and new protein functions.

After letting go of the idiosyncratic use of the term “Shannon uncertainty” (which I do not agree with), I do agree that delta H can be negative. However, it does not mean what you think it does. Also, it is not clear that cancer has reduced H. You have to demonstrate this, and I am not sure it is true.

Except you haven’t even produced a single correct calculation. Clearly I am talking about information in a different way than you imagine I am.

The information to erase a hard drive is much FI at all, perhaps as low as one bit. It also increases the function of the hard drive by increasing it’s capacity (I.e.) its function. Stepping through this example might be instructive. It seems you didn’t realize that the FI using KL would be trivially low in this case. Also, if we want to use FSC or Delta H, we don’t know if it goes up, down, or stays the same. There is not enough details to tell. For KL, however we can know quite easily.

Likewise, the information of cancer using a base state of maxent is not 0; it is, rather, about 6 billion bits of FSC. That also is an example worth stepping through.

That is turns out to be false. KL divergence is the amount of information required to change between states, delta H is not. To make a simple analogy. Let us say I have two documents, that say totally different things, and accomplish different functions, D1 and D2. Let us say that the complexity is the same, so H(D1) = H(D2). So the delta H is zero. Now let’s say I have D1, but want to change into D2. How much information do I need to do this?

Turns out that it is definitively not 0, even though delta H = 0. Instead, it will take KL(D2 || D1) bits of information to create D2 starting from D1. What you need is KL. This is a foundational concept in information theory. There really is no way around this.

It seems that what is going on here is that you haven’t yet appreciate the error you made in thinking KL = delta H. Yes, that is true if and only if the base state is uniformly random. However, as soon as you deviate form this, it is no longer true that FI = -log (W), and the formula you are using are no longer correct. You just realized this recently, earlier in this thread.

Of course, you have to argue that detla H is correct for there to be any coherence to your argument, but it is not. The example I just gave you is an example of that.

At this point, you have incorrectly inferred how to compute KL on cancer and in the erased hard drive. Based on that incorrect inference, you made incorrect computations on the FSC of cancer. That seems to be where we should start, taking a large number of cancer genomes, and computing their FSC by your published method.

It will produce a FSC computation of 6 billion bits. From their we can discuss the next steps forward. What do you think?

swamidass · October 19, 2018, 2:01am

That is reasonable. I encourage you to keep the two directions separate in your posts. If this develops, we will put it in a separate thread.

I agree with @Mercer that you seem to be misunderstanding how cancer works. It certainly is not non-functional.

Mercer:

Second, your analogy makes no sense to me, because the genomes of cancer cells are in no way randomized gibberish.

If I may make a suggestion, instead of thinking of them as cancer, think of taking some cells from your body and growing them in culture. Most of them won’t last long, but if you culture enough of them, you will select for variants that are immortalized and will continue to divide.

If you view it this way, it is easier to see that you are simply subjecting the cells to radically different selective environments that cause evolution and an increase in information, removing value judgments about cancer “(non-functional-cancerous)” that seem to be clouding your understanding of this as biology.

It’s not a big leap when you consider that many cells immortalized in culture will grow as tumors when you inject them in whole animals.

And here:

I would agree with @Mercer here. It seems you have a very narrow, and incorrect, view of how cancer develops. For example, a p53 inactivation mutation is not nearly enough to create cancer. Most tumors are benign, even though they do not control their growth. What is the difference between a benign tumor and cancer? The cancer has acquired several new functions that the benign tumor does not have.

swamidass · October 19, 2018, 5:25pm

2 posts were merged into an existing topic: Side Comments on Durston

Mercer · October 19, 2018, 5:23pm

Here’s another one that’s even more intuitive (I hate the math).

The goal is to ID human oncogenes.

We transfect mouse cells with tumor DNA and control nontumor DNA from a patient, literally adding information. Some cells form colonies and grow as tumors in test animals. The human DNA causing the transformation is identified (by methods that aren’t important for this concept). This is how human oncogenes were identified by Mariano Barbacid first, and by many others later.

Kirk, can you think of any way at all you can interpret that oncogene as a loss of information, functional or otherwise?

Kirk · October 19, 2018, 9:28pm

I thought I had made this clear some number of posts ago. I am not using KL. My initial reference to KL starting in post #2, was due to misreading the particular definition dealing with Shannon information, as you pointed out, where I had mistakenly “saw” it equated to delta-H. To keep saying I am using K-L is misrepresenting me and my paper. I mentioned this before … neither my paper nor Hazen’s makes any reference whatsoever to K-L, so to keep insisting that my paper is using K-L is to misrepresent my paper. May I direct your attention, once again, to general Eqs. (2) and (6). You should be able to see that that is not K-L; it is delta-H. Delta-H should not be confused with K-L.

Please see page 11 of Shannon’s original paper where he refers to H as measure of “information, choice and uncertainty” after discussing Shannon entropy as uncertainty in the previous page. A reduction in uncertainty is a very useful concept when it comes to how much information was involved in that reduction.

We were talking about p53. There is no earthly way p53 can contain or requires 6 billion bits of functional information to encode.

First: Although K-L can be reduced to my Eq. (4), it only takes a glance at my paper to see that the general case from which the special case of Eq. 4 is derived, is not K-L, but Eq. 2, which embraces non-uniform probability distributions. It is essential that you do not confuse delta-H with K-L when discussing my paper.

Second: I thought I made this clear several times already. If the ground state cannot be described by a uniform probability distribution, then we use the more general Eq. (2) that I supply in my paper. It is easy. I do it all the time with the non-uniform probability distributions of the genetic code to estimate FI assuming the existence of translation. If you are looking at universal proteins that would have had to exist prior to translation, then you might want to use W. If not, then use the genetic code probability distributions.

Key Idea to Grasp: My use of delta-H is not idiosyncratic. It goes right back to Brillouin’s paper in 1951. Let me quote:

“This example is just a special instance of a general rule: any additional information about the system under consideration corresponds to a decrease in the total number of complexions P. We have to reject all the complexions that would not satisfy the additional conditions contained in our information. Hence, the total entropy decreases whenever we happen to have some special information about the structure of our physical system. We choose this entropy decrease as the physical measure of information.

I =-∆ S = ∆ N , S, entropy; N, negentropy. (53)”

He also discussed the effect of “additional conditions”/constraints on the physical system as reducing S (constrained) required to carry a message. He uses different wording, but the entire approach is there, which I use, as does Hazen.

The flash drive example is not an analogy. A good method should not just work in on particular area, but work across many areas. For example, in the Hazen paper, they discuss functional information not just for biology, but for letter sequences and digital output of the AVIDA program. In each case, we are working with two things, function and digital information. So the test of a good method is that it works in all cases dealing with function and digital information. Josh’s KL approach fails the test since it yields nonsensical results when functional information is lost.

When I referred to protein families examined, it was to the families in Pfam that I have examined.

I was speaking in reference to a p53 protein sequence that no longer codes for the normal function of p53. I know it is not a defective p53 gene that initiates a cancerous cell in the first place; the defective p53 simply cannot terminate the cell. So since the p53 gene is now not working, it is free to evolve, meaning almost any aa at any site will be fine, since the p53 gene is no longer constrained to carry out it’s original task. To use the Hazen formula, M ≈ N. The DNA is the storage “flash drive”, and the digital information within that gene is the functional information to be analyzed.

No worries, I am fine with working with anything. All I need to know is how “function” is being defined.

It is an increase. I don’t know the effect of the other 19 aa’s, but assuming that only that particular aa produces the effect, we are looking at 4.32 bits of functional information added, the maximum possible. That is easily within reach of an unconstrained physical system.

If I understand correctly what you described, that would be the transfer of functional information from one source to another, but not the production of novel functional information (so far as I can see). In that case, the physical system has been expanded to include the source, so there is a net zero gain in functional information (ignoring any functional information the scientists may be introducing through imposed constraints). If we ignore everything but the mouse, it is a gain in FI.

@Mercer you are asking good, productive questions. The more specific they are related to digital information within the cell, the better I can answer them, but I can estimate for more complex effects if need be.

swamidass · October 19, 2018, 10:14pm

Kirk:

Regardless, your inferences are incorrect because your guess at what KL of FI is wrong.

I thought I had made this clear some number of posts ago. I am not using KL. My initial reference to KL starting in post #2, was due to misreading the particular definition dealing with Shannon information, as you pointed out, where I had mistakenly “saw” it equated to delta-H. To keep saying I am using K-L is misrepresenting me and my paper. I mentioned this before … neither my paper nor Hazen’s makes any reference whatsoever to K-L, so to keep insisting that my paper is using K-L is to misrepresent my paper. May I direct your attention, once again, to general Eqs. (2) and (6). You should be able to see that that is not K-L; it is delta-H. Delta-H should not be confused with K-L.

We are tasking past each other here. I agree that you are using delta H. I think this is in error, and you should be using KL. You, however, are using delta H. It seems we agree on this. Once we finish looking at delta H, we can turn to KL to see how this is a much better measure of FI (i.e. the amount of information needed to acquire a new function).

It is idiosyncratic to you because that is not the only way that Shannon uses it. We can let that go for now. I’m just trying to move forward without confusing people to think I agree with your use of that term. I do not.

Agreed. If we are talking about p53 it will be on the order of 100s of bits. If we apply it to cancer genomes, it comes out to be about 6 billion bits. Sorry that wasn’t clear.

Just looked at your paper again. It does not account for non-uniform probability distributions. There is no mention of non-IID or non-uniform probably distributions. The only context this math is valid is in the case of a maxent base state. Otherwise it becomes invalid.

I suppose we just have to disagree here. Your use of the delta H is very idiosyncratic.

And there is a reference to Hazen, who explains that using this on extant sequences rather than uniformly random sequences is an error. If you are going to appeal to Hazen, you have to deal with what he has said about this.

You miscalculated the KL divergences. I’m happy to work you through the calculation. But it can be as low as a couple bits of information. This is a sensible result. It always takes information to move between states, in this case it only takes a small amount of information.

Clarify This?

@Kirk you’ve made a string of claims I can’t make sense of. First, what I am writing here is confined to your delta H approach.

At first you were trying to compute delta H between normal and carcinogenic p53, presuming that a specific amino acid needed to mutated to on of the 19 other amino acids. That would mean:

delta H(cancer p53 | normal p53) = -4 bits

That is an approximate number. What ever it is, it is just a few bits. Then you guessed the the delta H of cancer to maxent, and arrive at this:

delta H(cancer p53 | maxent) = 0 bits

So that would seem to mean you think that:

H(normal p53) = 4 bits.

However I’m pretty sure you think that it is actually:

H(normal p53) > 100s bits.

So it seems we have a contradiction. It doesn’t seem that you can simultanuously argue that:

delta H(cancer p53 | normal p53) = -4 bits
delta H(cancer p53 | maxent) = 0 bits
H(normal p53 | maxent) > 100s bits.

Something has to give. What gives? What do you really think is going on? In my view, I think if we too look at the data from actually cancer sequence experiments, we would see (approximately):

H(normal p53 | maxent) > 100s bits.
H(cancer p53 | maxent ) > 100s bits.
delta H(cancer p53 | normal p53) = -30 bits

Those numbers are approximate. The actual numbers are going to adjust based on the precise data we use. Basically, we would see high FSC in both cancer and normal p53. The amount of information will be proportional to the length of the sequence. These are the numbers I think are reasonable for your delta H appraoch, and we can even compute them directly here too.

Your numbers, however, are not internally consistent. Can you explain which of those three things are to change?

The Bigger Picture

Taking a bigger picture view of this, I can see where confusion is arising. You are measuring the amount of functional entropy/information of different states, without accounting for the type of functional information. This seems to be pretty important.

In the case of normal vs. cancer, you are measuring the total amount of “normal” function and comparing that to the total amount of “cancer” information. Do you agree with that?

swamidass · October 19, 2018, 10:40pm

Clarifying this toy example some more. Of course, this is just a simplified model and better data would improve this. P53 has 390 amino acids. Let us imagine (this is not a fact) that all these amino acids need to be precisely correct for it function normally. Carcinogenic TP53 (which we will abbreviate cP53) requires mutating a specific amino acid to some other amino acid, any one will do.

We can obtain a large number of normal P53 sequences, and a large number of cP53 sequences. What will their H be?

H(P53)
H(cP53)
H(maxent ground state)

In this toy example, it seems that that we will agree on this:

H(maxent ground state) = - 390 * log 20 = 1686 bits
H(P53) = 0

This means that the FSC is going to be 1686 bits for normal p53. The divergence appears to be in how we compute H(cP53). What number do you compute? What do you think we will arrive at for the H of the cancer-function in P53 sequences?

H(cP53) = ?

I compute it at 1682 bits. This is merely applying your formula to the extant sequences we have, which (in this toy example) will all have the normal P53 mutation except in one location. Any other number would require either (1) changing your formula for FSC, or (2) smuggling in sequences for cP53 that were not actually part of the data set. What do you compute?

Currently, you are arguing to mutually exclusive positions at the same time:

Mercer · October 21, 2018, 5:28am

Understood, but that only functions in the context of a whole cell. I am simply trying to point out that cancer is a gain of function.

OK. Glad that we have that settled!

I think you missed the experimental aspect and concentrated on the method; apparently I didn’t get my point across.

My point wasn’t about the transfer of information, but that the experiment showed that the tumor DNA had information that the normal control DNA from the patient did not have.

Does that clarify? I’m trying to keep it simple and avoid the math.

Kirk · October 26, 2018, 1:35am

Yes, I agree; it seems we have been talking past each other. Given that we cannot deny that ∆H clearly means something related to a change in physical state, I want to give a final explanation of the elegance and utility of my approach that is clear enough for any onlookers to understand where I am coming from. Then I think we should move on to discuss specific applications as a test and demonstration of the approach.

I wonder if your objections are arising from the assumption that this is an information-only problem. My approach is that this is, first and foremost, a physical system problem, which can be quantified in terms of functional information (FI). The physical states must be defined and understood before we can attempt to quantify them in terms of bits of information.

I present this final summary, not for the purpose of arguing any further about it, but simply as an explanation. In science, we do not always have to converge on just one approach but we all desire clarity on whatever the approach may be, so here is my final summary before we move on to application.

Final summary of the theory behind my approach to FI:

A clear distinction must be made between a physical system and information. Physical functions may require a change in the physical system in order to achieve that function. This physical change can be quantified in terms of FI.

Given any physical system ( phys ), the Shannon entropy of that system can be described as

H ( phys ) = - K ∑ p ( i ) log p ( i ) = x bits (1)

where K defines the units of measure in terms of bits (in this case) and the summation is performed over all i from 1 to N , and N is the total number of possible configurations of the physical system, and p ( i ) represents the probability distribution over the options, not necessarily uniform.

If physical constraints must be imposed on a physical system in order to satisfy a physical function within a larger system, the Shannon entropy of the resulting constrained physical system ( phys + functional constraints ) of M functional options can be described as

H ( phys + functional constraints ) = - K ∑ p ( j ) log p ( j ) = y bits (2)

where the summation is performed over all j from 1 to M .

The change in the physical system to achieve the required physical functionality is, first and foremost, physical, but it can be quantified in terms of Shannon information as

∆ H = H ( phys ) - H ( phys + functional constraints ) (3)

= x bits – y bits = z bits (4)

Is this change meaningful?

The physical requirements to satisfy some physical function in a physical system can be quantified in terms of z bits of information. Since z bits quantifies functional requirements in terms of information it is meaningful and important. I call this functional information to distinguish it from classical information, where we are not concerned whether the message is functional/meaningful or not. Therefore, FI is defined under my approach as

FI = ∆ H = H ( phys ) - H ( phys + functional constraints ) (3)

Applications:

This method can be used to provide an objective quantification of anomalies across the natural world, with applications in forensic science, biology, SETI, archaeology, and chemistry, from which an investigator can draw meaningful inferences, especially if such anomalies appear to have a function.

Moving On:

As I said above, I post the above summary not for the purpose of arguing about it any further, but so everyone is clear on what I am talking about. Given that we agree that my approach is not the same as how @Josh would approach this problem, let us move on to how I apply this. I think I should start with one or two very simple examples for the sake of all onlookers, and then look at applying it to protein families and cancer. I’ll start that in my next post. I’ll try to post more often than once per week, time permitting.

swamidass · October 26, 2018, 1:43am

Okay, we can take that direction.

I’m still asking you to clarify now in the case of your toy problem what the entropy (H) of carcinogenic p53 is…

swamidass:

Clarify This?

@Kirk you’ve made a string of claims I can’t make sense of. First, what I am writing here is confined to your delta H approach.

Clarifying this toy example some more. Of course, this is just a simplified model and better data would improve this. P53 has 390 amino acids. Let us imagine (this is not a fact) that all these amino acids need to be precisely correct for it function normally. Carcinogenic TP53 (which we will abbreviate cP53) requires mutating a specific amino acid to some other amino acid, any one will do.

We can obtain a large number of normal P53 sequences, and a large number of cP53 sequences. What will their H be?

H(P53)
H(cP53)
H(maxent ground state)

In this toy example, it seems that that we will agree on this:

H(maxent ground state) = - 390 * log 20 = 1686 bits
H(P53) = 0

This means that the FSC is going to be 1686 bits for normal p53. The divergence appears to be in how we compute H(cP53). What number do you compute? What do you think we will arrive at for the H of the cancer-function in P53 sequences?

H(cP53) = ?

I compute it at 1682 bits. This is merely applying your formula to the extant sequences we have, which (in this toy example) will all have the normal P53 mutation except in one location. Any other number would require either (1) changing your formula for FSC, or (2) smuggling in sequences for cP53 that were not actually part of the data set. What do you compute?

Currently, you are arguing to mutually exclusive positions at the same time:

swamidass:

So it seems we have a contradiction. It doesn’t seem that you can simultaneously argue that:

delta H(cancer p53 | normal p53) = -4 bits
delta H(cancer p53 | maxent) = 0 bits
H(normal p53 | maxent) > 100s bits.

Something has to give. What gives? What do you really think is going on?

I’d like to know what you think H(cancer p53 | maxen) is in this toy (counter-factual) example.

Kirk · November 2, 2018, 8:30pm

Ok, let’s take a look at p53. Just to make it a little more interesting and realistic, I downloaded an 872-sequence MSA from Pfam for the p53 domain and ran it through a program I wrote. You can see a portion of the results in the linked file, but I’ll highlight the main items of interest here.

Initial Info:

Number unique sequences: 487

No. columns in MSA before stripping out insertions: 669

No. Columns in MSA after reducing the sequences to their essential core: 187

Average no. of different aa’s tolerated per site: 14

Estimated minimum no. of mutational events per site: 23

Disclaimer:

The Pfam MSA does not have adequate data for the p53 domain to obtain an acceptable estimate of functional information. I prefer to have a minimum of 30 estimated mutational events per site and at least a few thousand unique sequences. The data here demonstrates only a minimum of 23. Nevertheless, let’s just work with what we have as a “toy” problem.

Functional information required to code for a p53 domain:

starting from a physical system in the null state = 409 Bits (Avg. density = 2.17 bits/site)
starting from a physical system skewed by the genetic code = 374 Bits (Avg. density = 1.99 bits/site).
extreme lower limit = 123 bits (Avg density = 0.65 bits).

Functional information for non-functional p53 domain:

Here, it is critical to clearly define what the function is that determines whether a sequence is functional or not. There are three perspectives we will take: a) relative to an intelligent observer, b) relative to the normal, physical p53 function c) relative to a cancerous cell. Looking at the attached data, we observe that site 180 permits only one, specific amino acid, arginine. Let us suppose that it is mutated to something else, thereby inactivating p53 so that it can no longer perform it’s function. The mutated sequence falls into the set of sequences that are non-functional relative to the normal function. Let us use the extreme lower limit of 123 bits for a functional p53 domain.

a) relative to an intelligent observer: Even though the sequence is now non-functional we can see from the data I linked to, that only 4.32 bits of functional information has been lost, but the function has shifted from fulfilling it’s cellular duties to becoming “meaningful” to an intelligent observer, relative to the cellular function the intelligent observer “knows” it should have. The intelligent observer “knows” what the desired function is and “sees” that this sequence is quite “close”. The physical system does not work that way; it does not know or see anything, only whether it is functional or not.

b) relative to the normal p53 function: the mutated sequence is now a member of the set of non-functional sequences. The cell cannot “see” how close the sequence is; the sequence either satisfies the function or it does not. In this case it does not. A non-functional sequence requires approximately 0 bits to encode. Relative to the physical system, 123 bits of functional information has been lost. The gene is now inactive. This, however, is the simplest case. There are more complex cases that we shall leave aside for now.

Let’s pause here to underscore a critical point, and a frequent source of confusion. “Meaningful” is a special case of the more general “functional”. For intelligent observers/engineers, a sequence can be meaningful to the mind, while at the same time be non-functional to the physical system. For example, an intelligent observer attaches a certain degree of “meaning” to a p53 sequence that is “close” to being functional. The cell may not. In discussions of functional information, therefore, the function that is being discussed must always be clearly defined, as distinguished in the above example between (a) and (b). A common mistake is to conflate functions, illustrated by overlapping Venn diagrams, when the two functions are actually completely independent, not even existing in the same reference frame.

c) relative to a cancerous cell: There are three possibilities relative to the physical system of a cancerous cell, setting subjective “meaning” aside.

If p53 is still functional relative to it’s normal physical function, then the FI = 123 bits.
If p53 is non-functional relative to it’s normal physical function, then FI is 0 bits.
If p53 is non-functional relative to it’s normal function but has a new function to do with cancer, then we need to have some way of determining what sequences satisfy that new function, before we can estimate FI. If non-functional p53 is simply a “bystander” in a cancerous cell (i.e., doesn’t do anything), then without a function, by definition, a system has no FI. FI is contingent on a specified function. If we wish to arbitrarily say that being a “bystander” is a function, then we must realize we are no longer talking about any function required by the physical system … the system does not require “bystanders”. We are now talking about an arbitrary, subjective function that we have invented in our minds. We can still estimate FI in that context, but it no longer has anything to do with biology.

swamidass · November 4, 2018, 2:59pm

@Kirk.

First, I am not interested in “non-functional” p53, we areinterested in carcinogenic p53.

Still waiting…

Second, before we get to actual data, I want to work, out this toy example. I note that you have not yet answered the question. What do you compute for H(cP53 | maxent) based on the observed sequences? It cannot be two or three numbers. It can only be a single number. The closest to a straight answere is here:

This seems to be an admission that the H(cP53 ) computed from sequences is 4 bits less than H(P53 ), even though it should be zero (in your mind). Strangely, you also wrote contradictory things to this point. It is a straightforward question. Looking at the extant sequences, what do you compute as the H(cP53)? What do you compute as H(P53)? What do you compute as H(ground)? What is required here is the numbers computed from the extant sequences because that is all you use to compute FSC.

Note also that this is not relative to anything. It is merely the number computed from the extant sequences. We should be able to compute the delta H between any two states once the H of all states are established. Why is it so difficult to give us this number?

Where to find cancer data

Third, when we do actually work with data, there are over 100,000 examples of normal p53 in ExAc, so try using that. There are over 10,000 examples of carcinogenic p53 in CTAG, so try using that. We have more than enough data to make sense of this. The numbers calculated from this data are very different than yours.

Why so difficult?

@Kirk, I’m not sure why this is taking so long to establish H(cP53) or why you are writing to so much. I am just asking you to apply your formula to cancer in a toy example ot produce a straightforward answer. It is a well defined problem.

Show me how you computed this number from the sequences in this example? It appears that you are stating what FSC should be, rather than what it actually is computed to be from extant sequences. We need to know what FSC is computed as, not what you think it should be if it is a valid measure of FI. So I should repeate, for our toy example (not jumping ahead), using extant sequences and the FSC method you published, what are theses quantities:

H(P53)
H(cP53)
H(ground)

I’ve asked this several times now:

swamidass:

Clarifying this toy example some more. Of course, this is just a simplified model and better data would improve this. P53 has 390 amino acids. Let us imagine (this is not a fact) that all these amino acids need to be precisely correct for it function normally. Carcinogenic TP53 (which we will abbreviate cP53) requires mutating a specific amino acid to some other amino acid, any one will do.

We can obtain a large number of normal P53 sequences, and a large number of cP53 sequences. What will their H be?

H(P53)
H(cP53)
H(maxent ground state)

In this toy example, it seems that that we will agree on this:

H(maxent ground state) = - 390 * log 20 = 1686 bits
H(P53) = 0

This means that the FSC is going to be 1686 bits for normal p53. The divergence appears to be in how we compute H(cP53). What number do you compute? What do you think we will arrive at for the H of the cancer-function in P53 sequences?

H(cP53) = ?

I compute it at 1682 bits. This is merely applying your formula to the extant sequences we have, which (in this toy example) will all have the normal P53 mutation except in one location. Any other number would require either (1) changing your formula for FSC, or (2) smuggling in sequences for cP53 that were not actually part of the data set. What do you compute?

So, based on extant sequences as have been described in this example, I compute it at:

H(P53) = 0
H(cP53) = 4
H(maxent ground state) = 1686

This means the FSC computations are:

FSC(P53) = 1686
FSC(cP53) = 1682

What numbers do you compute from the extant sequences in this example? Yes, I know you think FSC should equal zero to be valid. For your interpretation to hold, I agree. I am asking here instead what the actual FSC computation gives us, which is NOT zero, which means that FSC is not valid. If we can’t get a straight answer on this, it seems that that this conversation is coming to a close.

swamidass · November 8, 2018, 1:04am

@Kirk, I am still really hoping to get these numbers from you. It would help make sense of what you are even proposing. From those numbers we should be able to compute the FSC and compare that number with what you think the FI is there. It will be enlightening! Are you having difficulty with this computation?

This is what I came up with:

Of course, that is just in our example. To compute the actual numbers for P53, we would use a databases like ExAc and CTAG.

We can also compute the difficulty of evolving cP53 using delta H (your method) versus KL (what I think is the correct method). I’m very curious to make sense of a negative FSC. That should be fun.

We can also generalize this to DNA easily, and compute it across whole genomes! This also should be fun too. The ~6 billion bits of FI calculation is in sight right now.

Also, for reference, mutant P53 has a large range of complex functions, including gain of function. It is not merely obliterating activity. (see here: The TP53 Website - Mutant Loss Of Activity Database)

Topic		Replies	Views
Computing the Functional Information in Cancer Conversation Design	41	5429	July 6, 2020
Gpuccio: Functional Information Methodology Conversation Science , Design	183	13298	September 1, 2019
Explaining the Cancer Information Calculation Conversation	85	6740	September 28, 2020
Looking for sources on the information argument Conversation Design	127	2763	September 10, 2021
Shannon information and COVID-19 Conversation Science , Article	93	1955	October 7, 2022

Durston: Functional Information

In the Case of Protein Families

In the Case of P53

What Should be Done?

Clarify This?

The Bigger Picture

Still waiting…

Where to find cancer data

Why so difficult?

Related topics