Durston: Functional Information

Hello all. I appreciate the thoughtful and collegial comments, for the most part. Since my approach to estimating functional information, outlined in my TBMM paper, has been mentioned, I thought I should join the discussion. Before I explain/defend how I estimate the functional information required to code for a protein family, I want to ensure we are all on the same page regarding the basic principles of functional information. From the comments, I think we might be, but I’ll outline seven basics just to make sure we can all sign off on them before moving on.

  1. In general, Shannon information is the difference in Shannon entropy between two states. Shannon entropy is defined in Claude Shannon’s 1948 paper.

  2. The equation for functional information presented by Hazen et al. is merely a special case of Shannon information when the co-variable of function is included. It represents the difference in Shannon entropy between a non-functional ground state and a functional state.

  3. You will note that Hazen’s equation looks quite a bit simpler than the normal equation for Shannon information (i.e., no summation signs, and no variable probabilities). This is because Hazen’s equation assumes that all sequences are equally probable. When that is done, then the normal equation for Shannon information simplifies to the form Hazen presents.

  4. In reality, not all sequences are equally probable. For example, when we estimate the functional information required to code for protein families, the genetic code ensures that not all amino acids are equally probable, hence, not all sequences are equally probable. Even ignoring that, not all sequences are equally probable across all taxa due to environmental and phenotypic constraints on functionality, etc. Nevertheless, it simplifies things to grant Hazen’s assumption and I’m happy to work with it so far as we can.

  5. There is a difference between the functional information required to perform a function, and creating that information.

  6. Duplicating an existing sequence that carries functional information does not increase the amount of functional information created unless the two identical sequences in combination can perform a new function that a single sequence could not. Even then, the amount of information required to produce the duplicate sequence will not be 2X the amount to produce a single sequence, since the new ground state for producing the duplicate sequence already includes the original sequence. If the ground state includes the original sequence as well as a mechanism for duplicating a sequence, then the amount of functional information required to produce the duplicate sequence may be trivial or even zero.

  7. The problem with using Hazen’s equation to estimate the functional information required to code for a particular protein family is that it is a single equation with two unknowns. We haven’t the faintest idea what M(Ex) is, therefore, we cannot solve for I(Ex).


Welcome @kirk. This thread will have special rules to ensure you are appropriately respected and engaged.

  1. Everyone here should treat @kirk with kindness. No name calling, rudeness, or taunting is allowed.

  2. Only participate in this thread if you have relevant expertise and can add or clarify something substantive.

  3. Questions from observers will be addressed in a side comments thread. Request that the @moderators start one if you would like to make use of it.

There are several people here with relevant expertise (@nwrickert, @art, @evograd, @T_aquaticus, @glipsnort, @dga471 etc) who should feel free to contribute constructively and judiciously.

Good plan. Looking these over, I do not think I will be able to sign off.

This needs to be more precise. Do you mean the conditional entropy? Or mutual information? Or conditional mutual information? Perhaps write the equations, and refer to the specific equations in Shannon’s paper. http://math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf

That might be true if you have defined Shannon information correctly. You seem to be using an idiosyncratic/non-technical definition of Shannon information, so we can’t be sure.

I agree. Also his equation assumes we have perfect knowledge of all functional and non-functional sequences.

I am not happy to work with it, as this is one major way your method fails. You even recognize this in the following points. So no, I do not accept that this is an acceptable simplification.

Not clear. There is no way to assess this as true or false.

I dispute this on several levels.

  1. Duplicating a sequence can add the function of redundancy and error control, that is a new function.
  2. This directly contradicts point #4, demonstrating that sequences are not equiprobable.
  3. The method you use cannot correct for this effect.
  4. FI here is not well defined, to the point of creating errors in analysis.

This last point is important. We can understand FI in several ways:

A. The bits of information required to modify a system without a function to produce a new function.
B. The total number of bits in a system that performs a function.
C. The amount of “function” we see (measured somehow).
mA: The measured version of A by some specific method.
mB: The measured version of B by some specific method.

Several unjustified claims are often made by lumping these five concepts together. All are false.

  1. High B does not mean A is high.
  2. Low A does not mean C is high, because a small amount of A can produce a large amount of C.
  3. mA and mB are not necessarily good estimates of A and B, and can be wildly off if the process that generates sequences is not modeled.

I could go on, but the key point is that FI has to be carefully qualified every time the term is used. That is not done here. So I cannot agree with most of what you have written until it is clarified.

I disagree with this. We have very good estimates of M(Ex) for many functions. Regardless, this is all beside the point, because this tells us nothing about how difficult it is to evolve a new function. The type of FI we need to compute is the conditional information from the sequences we already have to get to any sequence that works for any function, even if we have never seen that sequence before or never expected that function. @Kirk’s method does not compute this quantity.

I, however, did compute FI in cancer, as the mutual conditional information: Computing the Functional Information in Cancer. This is an important negative control going forward. Any method proposed needs to be able to show that cancer is possible without design. That does not appear possible without demonstrating Durston’s argument wrong.


For reference, this is the key paper:

Kirk, welcome, and thank you for starting at the definition of Shannon information and going from there. I think I might be able to follow you as you progress in applying information theory to biological systems.

You’re in my lane now. I know this paper by heart. Yes, if you can start here and go from there to biological systems I should be able to follow and comment.

Joshua’s request for clarity on basic point (1) is a good idea.

At the top of page 11 of Shannon’s 1948 paper, he provides an equation for H as follows:

H = -K ∑ p ( i ) log p ( i ) (1)

where the expression is summed over i = 1 to n and “the constant K merely amounts to a choice of a unit of measure”. If we wish to convert to units of “bits”, we can set K = 1/log2, but for the moment we will avoid complicating things and set K = 1. (I have changed the notation slightly to account for the limitations of this medium for writing equations with superscripts and subscripts.)

We therefore can simplify Eq. (1) to

H = -∑ p ( i ) log p ( i ) (2)

where, according to Shannon, H fulfils the requirements as a measure of “information, choice and uncertainty”. He calls H “entropy” of a set of probabilities, but I wish he had gone with “uncertainty” as it is more intuitively related to the concept of information. Be that as it may, if you hear me saying “Shannon uncertainty”, it is equivalent to saying “Shannon entropy”.

The change in uncertainty/entropy/information from state A to state B, is simply the change in Shannon entropy from A to B,

∆ H (A,B) = H (A) – H (B). (3)

On the bottom of page 11, he states two more things that I think are important to point out …

  1. H = 0 if and only if all the p( i ) but one are zero, this one having the value unity. Thus only when we are certain of the outcome does H vanish. Otherwise H is positive.

  2. For a given n , H is a maximum and equal to log n when all the p( i ) are equal (i.e., 1/ n ). This is also intuitively the most uncertain situation.

This underscores why I wish he had used “uncertainty” rather than “entropy”. I can expand on this if necessary, but for now let me make one point that I mentioned in my previous post. If all the options n ( i ) are equally probable, then Eq. (2) simplifies to

H = log n .
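For anyone who wants to check these properties numerically, here is a minimal sketch in Python. The function name `shannon_entropy` and the three example distributions are my own illustrative assumptions, not anything taken from Shannon's paper. The natural log is the default, which corresponds to setting K = 1.

```python
import math

def shannon_entropy(probs, base=math.e):
    """Return H = -sum p(i) log p(i); terms with p(i) = 0 contribute 0."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(-p * math.log(p, base) for p in probs if p > 0)

n = 8
uniform = [1.0 / n] * n                                  # all options equally probable
certain = [1.0] + [0.0] * (n - 1)                        # one outcome is certain
skewed = [0.5, 0.2, 0.1, 0.1, 0.05, 0.03, 0.01, 0.01]    # hypothetical non-uniform case

h_uniform = shannon_entropy(uniform)   # should equal log n
h_certain = shannon_entropy(certain)   # should vanish
h_skewed = shannon_entropy(skewed)     # should lie strictly between the two
```

Passing `base=2` gives the same quantities in bits; for the uniform case with n = 8 that is 3 bits.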

The beauty of Shannon’s approach is that it can be used to measure information for a wide variety of systems, including the storage capacity of your flash drive. Now let us apply Shannon’s approach to the paper by Hazen et al. to see how it works.

In their paper, E ( x ) is a measure of the “degree of function x ”. This idea is described in an illustration in Szostak’s earlier short article in Nature, 2003. The idea is that there may be a set of configurations that can perform some function x , but only a subset that can perform it at a sufficient level for the system being examined (see the figure in the 2003 article).

Back to the Hazen paper, they then postulate “a system with N possible configurations.” Some of these might satisfy function x and others may fall short of the “degree of function” E ( x ). The number of configurations that satisfy E ( x ) they designate as M ( E ( x )).

So the set of all possible configurations can be represented by

A = { n (1), n (2), … n (N)}

and the set of all functional configurations satisfying E ( x ) is

B = { m (1), m (2), … m ( M )}.

Now let us apply Shannon’s Eq. (2) to this situation.

The Shannon entropy of the total number of possible configurations is

H (A) = -∑ p ( n ( i )) log p ( n ( i )) where i = {1, 2, … N},

which, written out term by term, is

H (A) = -[ p ( n (1)) log p ( n (1)) + p ( n (2)) log p ( n (2)) + … + p ( n ( N )) log p ( n ( N ))]. (4)

If we assume (rightly or wrongly) that all configurations are equally probable, then as Shannon points out, Eq. (4) reduces to

H (A) = log [ N ] (5).

If we follow the same procedure for the set of functional configurations (B), then

H (B) = log [ M (E( x ))] (6)

Hazen et al. designate functional information as I ( E ( x )) . Using Shannon’s measure of information,

I ( E ( x )) = H (A) – H (B)


I ( E ( x )) = log [ N ] – (log [ M (E( x ))])

which simplifies to

I ( E ( x )) = log [ N / M (E( x ))]. (7)

This is equivalent to

I ( E ( x )) = -log [ M (E( x )) / N ].

If we wish to have I ( E ( x )) in units of bits of information, then we must use log2 instead of log10 to get Hazen’s equation for functional information

I ( E ( x )) = -log2 [ M (E( x )) / N ].
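To make the final formula concrete, here is a small Python sketch. The function name `functional_information_bits` and the toy counts are hypothetical, chosen only for illustration, and the calculation rests on the same equal-probability assumption used in the derivation.

```python
import math

def functional_information_bits(n_total, m_functional):
    """I(E(x)) = -log2(M(E(x)) / N), assuming equiprobable configurations."""
    if not 0 < m_functional <= n_total:
        raise ValueError("need 0 < M <= N")
    # Subtracting logs avoids float underflow of M/N when N is astronomically large.
    return math.log2(n_total) - math.log2(m_functional)

# Hypothetical toy system: sequences of length 10 over a 4-letter alphabet,
# of which we pretend 1,024 meet the required degree of function.
N = 4 ** 10
M = 1024
bits = functional_information_bits(N, M)
```

With these made-up counts N = 2^20 and M = 2^10, so the result is 10 bits. If every configuration were functional (M = N), the result would be 0 bits.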

So what I have done is demonstrate the derivation of Hazen’s equation from Shannon’s concept of entropy/uncertainty/information in Eq. (1). I think I will pause here in case there are any clarifying questions, before I move on to address concerns raised over my basic points 3-7. I’ve got to run, so I don’t have time to check for typos/math errors; I hope I’ve avoided them. If everyone is happy with this, I will move on to address Joshua’s other concerns in my initial points 3-7.

@Kirk thanks for this post. How do you want to do this? I can see some points I disagree with here. I can qualify and correct them here. However, this is just the first of 7 points. Do you want to work through all of them first, or do you want me to respond to this?

(Ooops! I see I made two sign errors; so much for a hasty post. I’ve corrected both.) I see this as foundational to not only Szostak’s and Hazen’s approach, but mine as well, so I think it is important to address any concerns with the above, first, before we move on to my initial points 3-7. I agree with your comment to my point (4) that not all sequences (or configurations, as they put them) are equally probable, but my objective above was merely to derive Hazen’s equation for functional information from Shannon’s measure of uncertainty/entropy/information. As far as any other disagreements you have with the above derivation, I think we should deal with them here, before moving on to your concerns for my points 3-7.


It is not clear that this is correct. I think what you want is the conditional information, which tells you the amount of information required to move from state A to state B:

H( B | A)

If A is a uniform distribution, this produces the same final equations that Hazen uses, but in general will not do this. This becomes important as soon as there are correlations between B and A, as we see in the real world. This ends up being a consequential error down the line when you start fixing the erroneous assumption of a uniform distribution.

This is important to point out. You are using a fundamentally different definition. We will get into this later, but it ends up being consequential.

This is an invalid assumption in your case. I agree that is what is needed for your work.

Using this equation, you have now introduced a striking contradiction into your claims. By this formula, duplicating an existing sequence does increase the functional information created. There are several ways we can understand duplication in this context, and most of them will increase functional information. Some of them will double functional information, as you have defined it above.
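One way to see the doubling numerically is sketched below. This is an illustration under a stated assumption, not the only way to model duplication: treat the duplicated system as an ordered pair of positions, each of which must independently hold a functional sequence. Under that assumption the configuration space grows from N to N² and the functional set from M to M². The helper name `fi_bits` and the toy counts are hypothetical.

```python
import math

def fi_bits(n_total, m_functional):
    """Hazen-style functional information in bits, assuming equiprobable configurations."""
    return math.log2(n_total) - math.log2(m_functional)

# Hypothetical single-copy system.
N, M = 4 ** 10, 1024
single = fi_bits(N, M)

# Assumed model of duplication: two independent copies, both required to be functional.
paired = fi_bits(N ** 2, M ** 2)
```

Because log(N²/M²) = 2 log(N/M), `paired` is exactly twice `single` under this model. Other models of duplication would give different numbers.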


You have a valid concern when it comes to what type of information they are measuring (and, later, I am measuring). At this point, it appears they are only interested in the change in Shannon entropy/uncertainty/information (gain in information) between the non-functional state and the functional state (more technically known as the Kullback-Leibler divergence). This is simply the difference between the Shannon entropy of the two states, as I have stated.

Conditional information and mutual information will be relevant to our discussion when it comes to my own approach, but here I am only looking at information gain between two states.

Regarding Hazen’s E ( x ) degree of function and how it comes into play in my own estimates, there is no difference. For them, E ( x ) is simply a theoretical cutoff between configurations that have insufficient functionality and those that have sufficient functionality. In the real world, natural selection provides that cutoff. To clarify, the sequences in Pfam are real sequences that, in virtue of being found in biological life, have survived the cutoff that natural selection provides; they all satisfy the degree of function E ( x ). Admittedly, there may be non-functional sequences that sneak in there, as well as HMM sorting error sequences that are not even part of the family, but those will serve to reduce the estimated functional information to code for that family.

In my paper, I do not assume a uniform distribution for functional amino acids at each site, but I do assume a uniform distribution, and state that I do, in what I called the “null state”. Currently, I work only with non-uniform distributions.

I will pause here for now. I have not addressed your objection regarding duplicating a functional sequence, but I am a little pressed for time (I had an unpleasant tooth extraction this morning that put a major hole in my day and has really put me behind in what I need to get done). I will respond to the duplicate sequence concern on Monday.


Glancing through a couple of other parallel discussions in this forum (‘IDist Disbelieves …’ and ‘Do I Fudge My Math?’) I am reminded of the merit of very carefully, and even pedantically, laying out the grounds for the approach I take to define, measure, and estimate functional information, in an effort to avoid confusion. I am pretty sure that the method I adopt does not result in the conclusion that cancer generates a jaw-dropping 6 billion bits of functional information. If it does, then I will immediately abandon this approach. I am also highly sceptical that cancer can produce even 300 bits of functional information. We shall see.

One more thing … I appreciate the kind of discussion we can have on this forum under the rules that Joshua has laid out. It permits us to rigorously, carefully, and even pedantically examine my approach to functional information. If one is allowed to gloss over certain things, flaws can be “hand-waved” over, and discussion of the theory becomes like trying to nail jello to the wall. None of us want that, and my initial impression is that this is the kind of forum where I can carefully defend my approach like I would in a thesis defence. Science can be a bit pedantic at times, so I hope you will be patient. Joshua has suggested that his paper on cancer will be an opportunity to test my approach, and I very much look forward to doing that once we understand my methodology and the theory behind it.

Here is a brief history of the approach I take, and I ask that you would do me the favour of reading through the entire account in an effort to better understand the basis for my approach.

I cannot claim that this is “my” approach: I saw a comment in one of the other parallel discussions, speculating that I “took up” some ideas presented by a Robert Marks and/or some others. That is not true. Although I’ve seen some of their papers, I have not taken the time to read them carefully and cannot claim to be familiar with their work. Thus, I am in no position to attempt to build on what Marks has done.

I began to take an interest in the application of information theory to biology long before I ever heard of “intelligent design”, Dembski, or Marks. What got me started was a paper published by Leon Brillouin in the Journal of Applied Physics in 1951, that I came across in the late 80’s, titled ‘Physical Entropy and Information. II’. Brillouin, as you are probably well aware, became the primary contributor to information theory immediately after the publication of Shannon’s famous paper in 1948. Much of Brillouin’s 1951 paper deals with the relationship between physical entropy and information and he ends up defining the relationship in terms of ‘negentropy’,

I = -∆ S = ∆ N , where S is entropy and N is negentropy (which he denotes as Eq. 53).

I was always uncomfortable with trying to relate physical entropy to information, although the mathematical descriptions are similar. Consequently, I merely regarded his concept of negentropy as interesting, but that is all. He then went on in his paper to discuss information and, specifically, Shannon information in sections VI and VII, but he states something in the final section IX that got me started in the area of functional information, particularly because he discussed the idea of constraints (which later figure heavily in discussions of functional information). He wrote,

"The real physical entropy of the system is very much larger than the physical entropy with constraints. The negentropy,

N ( A ) = ( S ( r phys) – S (phys)), (57)

represents the price we have to pay in order to obtain a ‘readable’ message, which our receiving station can interpret correctly."

His N ( A ) = I = -∆ S = ∆ N = - ( S ( r phys) – S (phys)), where S ( r phys), represents the entropy with constraints. Ignoring the concept of negentropy N we get,

I = -∆ S = - ( S ( r phys) – S (phys))


I = ( S (phys) - S ( r phys)) (I’ll denote this as Eq. 1a, because it is fundamental to everything else I discuss and I want to distinguish it from Eq. 1 in my earlier post)

This is known as Kullback-Leibler divergence (K-L ) or “information gain”. Thus it is not to be confused with mutual information, conditional information, or joint information. What was important here was his mention that to get a “readable” message, certain constraints would be required in the physical system. To my mind back in the late ‘80’s, there was a direct application to the problem of the biopolymeric sequences digitally encoded in the genomes of life, and the constraints on the digital information imposed by, ultimately, the laws of physics coupled with the desired function, if we desired to get stable, functional 3D structures in proteins.

Before I ever heard of Marks, Dembski, Johnson, or “intelligent design” I began to give seminars at various universities, and possibly a few in the US, beginning in the 1980’s on the application of what I called at the time ‘Brillouin’s equation’ to functional protein sequences. My terminology evolved into discussing it in terms of “functional entropy” or “functional uncertainty”, with a nod to Shannon’s terminology. I submitted a paper to the Journal of Theoretical Biology around 1992, and it was sent out for review. One of the reviewers responded with a wrath-filled, 8-page, 10-font rant. It appears he/she was vehemently opposed to the idea that DNA encodes digital information, supposing that if it actually did there would be theological implications. His/her rock-solid belief, therefore, was that information theory applied to biology is nothing more than “theology”, and he/she seethingly recommended to me and the editor that I submit the paper to a theology journal, despite the fact it had absolutely zero mention of, nor any reference to, anything related to theology and it was filled with nothing but math and genetics. This was before the days of Google and intelligent design, so it showed that, back then, there were scientists highly hostile to the idea of biological information and who were astute enough to sense there might be theological implications. I’ve no idea who the reviewer was, but I was sufficiently demoralized to not bother submitting it anywhere else.

Fast forward to 2003 and Jack Szostak’s short article in Nature . When I read the article I was mildly shocked. The content of that article was virtually identical to the content of the lectures I had been giving for the previous ten years. I considered sending him an email asking if he had ever heard one of my lectures, but decided against it. I wanted to avoid any suggested accusation of plagiarism, realizing that what both he and I were talking about seemed so easily arrived at, each on our own. The problem was, using Hazen’s later (2007) M ( E ( x )), the actual value for M ( E ( x )) is an unknown for protein families. However, with the advent of online databases, specifically Pfam, I began to have the kind of data that might enable me to estimate the functional information required for protein families, so I published the above linked-to paper, which came out the same year as Hazen’s (though I had no knowledge of it when I first submitted my paper in January of that year). In this forum, we will test out the method I have adopted, and use Joshua’s cancer paper in that process, though I must first address some of the initial remaining concerns he has.

My day is filled with meetings tomorrow and I have some lecture prep I’m a bit behind in, so it may be Wednesday or Thursday before I can take the time to post again. If there is a delay in my responding, I have not disappeared; I will return! :slightly_smiling_face:


I agree. This is why I’ve found the forum is a powerful way of getting to the bottom of things.

@Kirk I wanted to know what we can expect here. Are you willing to acknowledge if your approach fails or is in error? What would you do then?

Glad to hear it. To be pedantically clear, I’m not saying it is precisely your scripts. It is rather the same mathematical approach, computing the same quantity (sequence mutual information), across the whole genome. I can elaborate when we get there.

Joshua, in science if an approach is in error or fails, then it needs, at the very least, to be corrected or, if it cannot be corrected, abandoned. I am totally in favour of this and will happily jettison this approach if it cannot be satisfactorily corrected. As I already stated in my above post, if cancer generates 6 billion bits of functional information, then I will abandon this approach (or at least correct it). I have tested this approach in a lot of different ways outside of biology in the late 80’s and into the 90’s, and it works beautifully in other disciplines, so at this point, I expect it will work here, although I cannot say in advance that you will not find a flaw that needs to be satisfactorily corrected. I see this discussion as informal peer review.


Computing 6 billion bits of information is just one way it fails. It does this because it does not take common descent into account.

Correcting the error of not taking common descent into account, the approach still generates over 300 bits of information, well above your cutoff, and over and over again in human observable time. This is not merely a matter of raising the threshold upwards. It would indicate that the rationale used to pick the cutoff you’ve used is in error.

Do you agree that this, if it were true, would also be a problem for your approach?

Yes, this is where the error is.

Except you are defining it differently than required for your purposes. “Insufficient” functionality means unselectable. Once selection kicks in, all bets are off. That is not, however, how your sequences are selected. So the definition here is material.

I disagree. There is no such thing as a “cutoff” that natural selection provides for new functions.

Remember, natural selection optimizes sequences of low function to those of high function. What we observe is the end result of optimizing sequences, not the full range of sequences with unoptimized, but nonetheless selectable, function.

This is guaranteed to be true if all distributions are IID and uniform (as they are in your model). If they are not IID or not uniform, this is no longer the case. This is a fairly important math error you seem to be making. You are saying that the change (delta) in information content (H) is the same as the KL distance (D). This is false in general and only works if you’ve picked an IID/uniform statistical model, which is in error in this case.

The easiest way to prove this is that the delta H that you suggest above:

Can be positive or negative. In fact, we know that:

∆ H (A,B) = -∆ H (B,A)

However, this is not true for KL (D), which is guaranteed to be non-negative: D (X|Y) ≥ 0. Therefore we know for a fact that, in general:

D (X|Y) ≠ -D (Y|X)

The two concepts are totally different. They are not the same thing. What you really need is KL anyway, which is the amount of information required to specify a change from one state to another.
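A small numerical sketch in Python may help make the distinction concrete. The function names `entropy_bits` and `kl_bits` and the three toy distributions are hypothetical, chosen only for illustration. The example shows that the entropy difference flips sign when the states are swapped, that KL is non-negative in both directions and not symmetric, and that the two quantities coincide when the reference distribution is uniform.

```python
import math

def entropy_bits(p):
    """Shannon entropy in bits; zero-probability terms contribute 0."""
    return sum(-x * math.log2(x) for x in p if x > 0)

def kl_bits(p, q):
    """D(p || q) in bits; requires q(i) > 0 wherever p(i) > 0."""
    return sum(x * math.log2(x / y) for x, y in zip(p, q) if x > 0)

uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]
shifted = [0.1, 0.7, 0.1, 0.1]   # same entropy as `skewed`, different distribution

delta_h_ab = entropy_bits(uniform) - entropy_bits(skewed)   # H(A) - H(B)
delta_h_ba = entropy_bits(skewed) - entropy_bits(uniform)   # H(B) - H(A)
d_su = kl_bits(skewed, uniform)    # equals delta_h_ab because the reference is uniform
d_us = kl_bits(uniform, skewed)    # non-negative, but a different number
d_shift = kl_bits(skewed, shifted) # large, although the entropy difference is zero
```

The last case is the one that matters once the uniform assumption is dropped: `skewed` and `shifted` have identical entropy, so the delta H between them is zero, yet the KL divergence between them is well over one bit.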


Thank you for the extended story of how you got to this point of view. This is very helpful.

My apologies. Perhaps we have it the wrong way around. They certainly have taken up your work, and claim it as a specific case of ASC. I agree. That is exactly what it is, even if it is not what you intended. I’ll be sure to keep that clear.

Well that is absurd. Information theory is immensely helpful to biology. We use it all the time. It just doesn’t work the way you are doing it in FSC.

Something seems strange about this. At that time, there was very high interest in information theory in biology. Genbank was growing, BLAST was invented, and the human genome project was kicking off. Are you sure it wasn’t something about the paper’s wording that triggered her? Information theory, at that time, was very hot in biology.

Take your time. No need to publicly state this every time. Just private message me if you need to.

Feel free to move this to the “side comments” thread if you think that’s more appropriate, I just wanted to get a quick clarification. I certainly can’t claim to be following along with all the technical details here, but I think I’m roughly following the broader conceptual points.

Kirk says he’s assuming a uniform distribution for the “null state” across different proteins (i.e. versions of proteins from different species), so is your objection here that common descent needs to be taken into account since it biases the sequences to be more similar between species? Do I have that right? Because that seems like a huge oversight indeed.