Gpuccio: Functional Information Methodology

Thanks @gpuccio, but you are jumping the gun here. I’m asking whether any of these cases would constitute a good negative control for you, not what you would compute their FI to be. If none of these cases works as a negative control, what would?

Yes, those cases are not examples of FI. We can use them as negative controls. But also any case where a protein is passed from one species to another one without any relevant modifications can be a negative control. There is no variation of FI there (for that protein). There are a lot of cases like that.

Indeed, any case of variation in a lab where no designed information is added to the system can be used as a negative control. For example, Lenski’s experiment can be considered a good negative control, a non-design biological system, because the design intervention there is rather limited (setting up the lab system and the general rules), and no other specific information is added. By contrast, any experiment where intelligent selection takes place would not be a good negative control.


I may have more later, but for now it should be noted that the bit score from a BLAST search and the informational number of bits used to identify design may be two rather different things. The usage adopted by gpuccio and most ID proponents is just a -log2 transformation of the ratio of functional to all possible sequences. I don’t believe the bit score from a BLAST search is the same thing.

If I am mistaken about this, a correction would be most welcome.


My understanding is aligned with yours, @Art. The BLAST bitscore is worth explaining sometime, but it does not appear to be equivalent to what most IDists mean by FI.

Great. That is really helpful. You are agreeing then that:

  1. Human-to-human genetic variation is a negative control.
  2. Viral evolution is a negative control.
  3. Cancer evolution is a negative control.

You add the appropriate caveat that the horizontal transfer of genes is not a design-based infusion of information. You also suggest as a negative control:

  1. Experiments that do not include “intelligent” selection (for example, Lenski’s experiment)

“Intelligent selection” is poorly defined, but I think I get what you mean. It seems also that, correctly, this would include both in silico simulations and in vitro experiments. Both approaches are valid “experiments” if conducted correctly.

That is great news. From here, there are two ways forward I see.

First, I want to hear your response to what we have written already about your analysis. That seems the place to start. This should clarify a great deal about your methodology and what you are precisely claiming.

Second, after your methodology is refined and clarified, perhaps we will circle back to looking at some of these other cases. To have a preview of what this might look like, see here: Computing the Functional Information in Cancer. However, let’s not distract from the first point too soon.

Looking forward to it.

OK, now I will answer the main points raised in the comments here, and then we can discuss your arguments about cancer. It will be a good way to explain better what FI is, how it should be used, and its role in design inference.

I am looking forward to it, too! :)

Swamidass et al. :

I see that many comments here are about the relationship of my analysis with a possible phylogenetic analysis. I am trying to understand better what you mean and why you suggest that. Maybe some further feedback from you would help.

I will try to clarify a few points about my analysis that, apparently, are not well understood.

  1. My analysis focuses on the vertebrate transition only because it is very easy to study. A number of circumstances are particularly favorable, as I have tried to explain. In particular: the time pattern of the pertinent splits, the presence of sufficient protein sequences in the NCBI database to make the comparisons, and, of course, the very good data about the human proteome.

But in no way am I trying to affirm that there is something special in the vertebrate transition. There is a lot of functional information added at that time, and we can easily check the sequence conservation of that information up to humans. However, the same thing probably happens at many other transitions.

So, why do I find a lot of FI at the vertebrate transition? It’s because I am looking for FI specific to the vertebrate branch. Indeed, I am using human proteins as a probe, and humans are of course vertebrates. My analysis shows that a big part of the specific FI found in vertebrates was added at the initial transition. It is not comparing that to what happens in other branches of natural history.

Just to be clear, I could analyze in a similar way the transition to hymenoptera. In that case, I would take as probes the protein sequences in some bee, for example, and blast them against pre-hymenoptera and some common ancestor of the main branches in that tree. I have not done that, but it can be done, and it would have the same meaning as my vertebrate analysis: to quantify how much specific FI was added at the beginning of that evolutionary branch.

I am not saying that vertebrates are in any way special. I am not saying that humans are in any way special (well, they are, but for different reasons).

  2. It should be clear that my methodology is not measuring the absolute FI present in a protein. It is only measuring the FI conserved up to humans, and specific to the vertebrate branch.

So, let’s say that protein A has 800 bits of human conserved sequence similarity (conserved for 400+ million years). My methodology affirms that those 800 bits are a good estimator of specific FI. But let’s say that the same protein A, in bees, has only 400 bits of sequence similarity with the human form. Does it mean that the bee protein has less FI?

Absolutely not. It probably just means that the bee protein has less vertebrate specific FI. But it can well have a lot of Hymenoptera specific FI. That can be verified by measuring the sequence similarity conserved in that branch for a few hundred million years, in that protein.
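As a rough illustration of the comparison described above (one human probe protein scored against groups on either side of a split), here is a minimal sketch. The hit names and bit scores are hypothetical, and taking the best hit per group is an assumption made for the illustration, not a detail stated in this thread.

```python
def best_bitscore(hits: dict[str, float]) -> float:
    """Highest bit score among the hits in one taxonomic group (0.0 if no hit)."""
    return max(hits.values(), default=0.0)


def bitscore_jump(older_group: dict[str, float], newer_group: dict[str, float]) -> float:
    """Difference between the best bit scores on the two sides of a split."""
    return best_bitscore(newer_group) - best_bitscore(older_group)


# Hypothetical bit scores for a single human probe protein (illustrative numbers only).
pre_vertebrate_hits = {"tunicate_homolog": 310.0, "lancelet_homolog": 395.0}
early_vertebrate_hits = {"shark_homolog": 1180.0, "ray_homolog": 1150.0}

jump = bitscore_jump(pre_vertebrate_hits, early_vertebrate_hits)
print(f"bit score jump at the transition: {jump:.0f} bits")
```

The same function could be pointed at a bee probe and Hymenoptera groups; only the input hits change.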

OK, time has expired. More in next post.


Whether some protein grew in size during the evolution of some clade or along some branch, or how conserved that protein is in that clade, seems to me to have little to nothing to do with whether it was designed. What am I missing?


Hi Art, thank you for your contribution, which allows me to clarify some important points.

You are right of course, the bitscore of a BLAST search and the value of FI as -log2 of the ratio between target space and search space are not the same thing. But the point is: the first is a very good estimator of the second, provided that some conditions are satisfied.

The idea of using conserved sequence similarity to estimate FI is not mine. I owe it completely to Durston, and probably others have pointed to that concept before. It is, indeed, a direct consequence of some basic ideas of evolutionary theory. I have just developed a simple method to apply that concept to get a quantitative foundation for the design inference in appropriate contexts.

The condition that essentially has to be satisfied is: sequence conservation for long evolutionary periods. I have always tried to emphasize that it is not simply sequence conservation, but that long evolutionary periods are absolutely needed. But sometimes that aspect is not well understood, so I am happy that I can emphasize it here.

I will be more clear. A strong sequence similarity between, say, a human protein and the chimp homologue is of course not a good estimator of FI. The reason for that should be clear enough: the split between chimps and humans is very recent. Any sequence configuration that was present in the common ancestor, be it functionally constrained or not, will probably still be there in humans and chimps, well detectable by BLAST, just because there has not been enough evolutionary time after the split for the sequences to diverge because of neutral variation. IOWs, we cannot distinguish between similarity due to functional constraint and passive similarity, if the time after split is too short.

But what if the time after split is 400+ million years, like in the case of the transition to vertebrates, or maybe a couple billion years, like in the case of ATP synthase beta chain in E. coli and humans?

According to what we know about the divergence of synonymous sites, I would say that time windows longer than 200 million years begin to be interesting, and 400+ million years are probably more than enough to guarantee that most of the sequence similarity can be attributed to strong functional constraint. For 2 billion years, I would say that there can be no possible doubt.

So, in this particular case of long conservation, the degree of similarity becomes a good estimator of functional constraint, and therefore of FI. The unit is the same (bits). The meaning is the same, in this special context.

Technically, the bitscore measures the improbability of finding that similarity by chance in the specific protein database we are using. FI measures the improbability of finding that specific sequence by a random walk from some unrelated starting point. If the sequence similarity can be attributed only to functional constraint, because of the long evolutionary separation, then the two measures are strongly connected.
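For reference, the BLAST bit score is a normalized form of the raw alignment score under Karlin-Altschul statistics: S' = (λS − ln K) / ln 2, and the expected number of chance hits is E ≈ m·n·2^(−S'). Below is a minimal sketch of that arithmetic. The λ and K values are assumptions (values commonly reported for gapped BLOSUM62 searches; BLAST prints the ones it actually used), and the sketch ignores BLAST's effective-length corrections.

```python
import math

# Assumed Karlin-Altschul parameters; check the values printed in your own BLAST output.
ASSUMED_LAMBDA = 0.267
ASSUMED_K = 0.041


def bit_score(raw_score: float, lam: float = ASSUMED_LAMBDA, k: float = ASSUMED_K) -> float:
    """Normalize a raw alignment score S into bits: (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits: float, query_len: int, db_len: int) -> float:
    """Expected number of chance alignments with at least this bit score: m * n * 2**(-bits)."""
    return query_len * db_len * 2.0 ** (-bits)


bits = bit_score(100)
print(f"raw score 100 -> {bits:.1f} bits")
print(f"E-value for a 500-residue query against 1e8 residues: {expect_value(bits, 500, 10**8):.3g}")
```

Note that the database size enters only through the E-value, not through the bit score itself.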

Of course, there are differences and technical problems. We can discuss them, if you want. The general idea is that the BLAST bitscore is a biased estimator, because it always underestimates the true FI.

But that is not the important point, because we are not trying to measure FI with great precision. We just need some reliable approximation and order of magnitude. Why?

Because in the biological world, a lot of objects (in this case, proteins) exhibit FI well beyond the threshold of 500 bits, which can conventionally be assumed as safe for inferring design in any physical system. So, when I get a result of 1250 bits of new FI added to CARD11 at the start of vertebrate evolution, I don’t really need absolute precision. The true FI is certainly much more than that, but who cares? 1250 bits are more than enough to infer design.

To all those who have expressed doubts about the value of long conserved sequence similarity to estimate FI, I would simply ask the following simple question.

Let’s take again the beta chain of ATP synthase. Let’s BLAST again the E. coli sequence against the human sequence. And, for a moment, let’s forget the bitscore, and just look at identities. P06576 vs WP_106631526.

We get 335 identities. 72%. Conserved for, say, a couple billion years.

My simple question is: if we are not measuring FI, what are we measuring here?

IOWs, how can you explain that amazing conserved sequence similarity, if not as an estimate of functional specificity?

Just to know.

You are missing the central core of ID theory, which connects complex FI to the design inference. I will get to that soon enough.

Fair enough, will be looking forward to that.

I will start from this statement to deal with what I call the central core of ID theory.

You must understand that when I was requested to write a summary of my methodology to measure FI in proteins, I did exactly that. I did not include a complete description of ID theory.

Of course, being a very convinced supporter of ID theory, it was very natural for me to conflate the measurement of very high values of FI with an inference to design, because that’s exactly what ID theory warrants. But now, having discussed in some detail the rationale of my measurement of FI in proteins, the focus can shift to ID theory itself. In brief, what is the connection between complex FI and design? And how does that connection apply to biological objects?

An important premise is that my personal approach to ID theory is completely empirical. It requires no special philosophy or worldview, except some good epistemology and philosophy of science. It is, I believe, completely scientific. And it has no connections with any theology. It has always been my personal choice to avoid any reference to theological arguments in all my scientific discussions about ID theory. And I will stick to that choice here too.

More in next post.


You can feel free to refine your case, as long as you are clear about any shifts you are making.

Three questions:

  1. What empirical evidence do you have that demonstrates FI increases are unique to design?

  2. How have you accounted for the similarity between proteins due to common descent?

  3. How have you accounted for differences between proteins caused by neutral drift and neutral coevolution?



Please, let me go on with some linear explanation of ID theory and my approach to it. Then I will answer your three questions.

You may have noticed that I have proposed two different questions about what ID theory is about:

  1. What is the connection between complex FI and the design inference?

  2. How does that apply to biological objects?

Now, if we want to understand each other, we have to focus first on the first question. To do that, we must for the moment forget biological objects. After all, they are the objects we are discussing: are they designed or do they arise by other mechanisms? So, we will for the moment consider the origin of biological objects undecided, and try to understand ID theory without any reference to biology.

To do that, we need an explicit definition of design and of functional information. I have offered a link to my two OPs about those two definitions. So, I will just remind here that:

  1. Design is any process where some conscious intelligent and purposeful agent imprints some specific configuration to a material object deriving it from subjective representations in his consciousness. The key point here is that the subjective representation must precede its output to the material object.

  2. FI is the number of bits required to implement some explicitly defined function. Any function can be used. FI is always defined in relation to the defined function, whatever it is. An object exhibits the level of FI linked to the function if it can be used to implement the explicitly defined function at the explicitly defined level.

  3. In general, an explicitly defined function generates a binary partition in a well defined system and set of possible objects: those that can implement it, and those that cannot. FI, in general, is computed as -log2 of the ratio of the target space (the number of objects that can implement the function) to the search space (the number of possible objects) in the defined system.
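To make the formula in point 3 concrete, here is a minimal sketch of FI = -log2(target space / search space). The toy system (peptides of length 10) and the count of functional sequences are hypothetical numbers chosen only to show the arithmetic.

```python
import math


def functional_information(n_functional: int, n_possible: int) -> float:
    """FI in bits: -log2(target space / search space)."""
    if not 0 < n_functional <= n_possible:
        raise ValueError("need 0 < n_functional <= n_possible")
    # Subtract logs instead of dividing, so very large integer counts stay manageable.
    return math.log2(n_possible) - math.log2(n_functional)


# Toy search space: every peptide of length 10 over the 20 amino acids.
search_space = 20 ** 10
# Hypothetical target space: suppose 1,000 of those peptides implement the defined function.
target_space = 1_000

fi = functional_information(target_space, search_space)
print(f"FI = {fi:.2f} bits")
```

In practice the hard part is estimating the target space, which is exactly what the rest of this discussion is about; the function above only performs the final conversion to bits.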

More in next post.

Finally, a definition of “design”.

It looks like an entirely reasonable definition.

With this definition, it should be trivially obvious that biological organisms are not the result of design.



I am happy that you appreciate the definition. I love definitions.

For the conclusions, we will see…

Speaking only for myself, I don’t think there is any reason to discuss design, or “ID theory,” in this context until the very basic questions asked by @swamidass are addressed. And I would reiterate that asking this question outside of even a basic phylogenetic analysis/approach is futile.

It is not possible to talk meaningfully about “functional information” without these basic foundational tasks being done.


I agree. Rather than a primer on ID generalities, let’s focus on what specifically you @gpuccio are doing.

For example, with your definition of design, there must be a pre-existing design. You are empirically based. So what evidence can you produce for a pre-existing specification? As @nwrickert notes, it seems obvious that this does not exist, at least not in a human-accessible form.


To all:

So, the central core of ID theory is the following:

Leaving aside biological objects (for the moment), there is not one single example in the whole known universe where FI higher than 500 bits arises without any intervention of design.

On the contrary, FI higher than 500 bits (often much higher than that) abounds in designed objects. I mean human artifacts here.

Therefore, if we observe FI higher than 500 bits in any object (leaving aside for the moment biological objects) we can safely infer a design origin for that object.

That procedure will generate no false positives. Of course, it will generate a lot of false negatives. The threshold of 500 bits has been chosen exactly to get that type of result.

If those points are not clear, we are not really discussing ID theory, but something else.

This strong connection between high FI levels and a design origin has, of course, a rationale. But its foundation is completely empirical. We can observe that connection practically everywhere.

The rationale could be expressed as follows: there is no known necessity law that generates those levels of FI without any design intervention. Therefore, FI in non-design systems can arise only by chance. But a threshold of 500 bits is so far beyond the probabilistic resources of the known universe that we can be sure that such an event is empirically impossible. The probabilistic barriers to getting 500 bits of FI are simply too high to be overcome.
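To put the threshold in numbers: 500 bits corresponds to a probability of 2^-500 per random trial, which is roughly 10^-150.5. Here is a small sketch of that arithmetic. The number of trials is a free parameter, and the value used below is a placeholder, not a figure taken from this discussion.

```python
import math


def log10_probability(bits: float) -> float:
    """log10 of the chance that a single random trial hits a target of the given FI."""
    return -bits * math.log10(2)


def log10_expected_hits(bits: float, log10_trials: float) -> float:
    """log10 of the expected number of hits after 10**log10_trials independent trials."""
    return log10_trials + log10_probability(bits)


THRESHOLD_BITS = 500
# Placeholder number of trials (10**100); an assumption for illustration only.
LOG10_TRIALS = 100.0

print(f"log10 P(single trial) = {log10_probability(THRESHOLD_BITS):.1f}")
print(f"log10 E[hits]         = {log10_expected_hits(THRESHOLD_BITS, LOG10_TRIALS):.1f}")
```

The calculation assumes independent, uniformly random trials, which is the "by chance" scenario described in the paragraph above.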

Well, that’s ID theory in a nutshell. I will come to the application to biology later. But I am confident that this simple summary will be enough for the moment to generate some answers. :)

You present a bare assertion as a premise of your analysis.

Why should we agree with this? It seems obviously false. What evidence can you present to support this assumption?

There are examples of non-designed processes that we can directly observe producing FI. We can observe high amounts of FI in cancer evolution too, which you agree is not designed. We also see high amounts of FI in viruses, which you also agree are not designed. All these, and more, are counterexamples to your assumption.

As a technical point, without clarifying precisely how FI is defined, this is not at all clearly the case.