Gpuccio: Functional Information Methodology

OK, I have been kindly invited by Joshua Swamidass to open an account here, and here I am.

I will go on this way: I will answer as usual at UD, and then paste the answer here. I hope that works! :slight_smile:


Thanks for joining the conversation @gpuccio. I’ve transferred ownership of @colewd’s reposts of your responses to you. Please contact the @moderators by private message with any concerns that arise.


Well, maybe there can be some overlap here, but just to be sure that everything is clear, I will paste here, as said, my earlier answers at UD, as they are, and then go on.

I invite you to be patient if I need some time to answer: there is a lot to say, and my time is limited. Those who are interested in more details can certainly look at my OPs at UD, as I quote them.

Thanks to all.


Good news is that all those posts are already here! If you scan up the thread you will see them, now assigned to you.

I will start with these easy questions, as I have not much time right now. Here are my answers:

  1. Yes, definitely.

  2. I affirm common descent of all living organisms, at least as the best current theory. Including, of course, humans. But, as explained many times, I believe that design acts on common descent to input the new functional information any time it is required.

  3. I am arguing that there is evidence of design any time that we observe new complex functional information, higher than 500 bits for example, arise. Of course, also in the origin of phyla.

  4. Yes, definitely.

Well, that was easy.


Very helpful @gpuccio. I want to ask one last clarifying question here, regarding this:

Do you have any taxa you would agree are negative controls? That their evolution by common descent did not require designed infusion of information? For example, what about viruses? Subspecies of rhinos? What are some examples of groups of organisms you think could have arisen by common descent without an “infusion of design”?

How did you determine these negative controls?

If you can’t give us these negative controls, are you arguing that every change in organisms is a designed infusion, no matter how small?

It all depends on the complexity of molecular changes in FI. My analysis is absolutely quantitative.

So, if two taxa have no relevant differences in FI, they could well have arisen from non-design mechanisms.

But I believe that in most cases it would be difficult to prove a negative. What do you think?

It seems that, measuring FI the way some people measure it, there is more than 500 bits of FI differing between individual humans. We see more than this amount of FI develop in the evolution of cancer. Also in the evolution of viruses.

This, it would seem, would mean that an infusion of design is required to explain all these things. That raises some difficult theological questions. Why would God be intentionally intervening to cause cancer, for example?

On the other hand, if they are agreed negative controls, it gives us a way to assess metrics of FI that you are proposing.


To clarify one point, I think God governs all things and I have no problem with the notion of him giving input into evolutionary processes. I don’t think evolutionary science demonstrates God was not involved, so I’m not asking you to prove a negative.

Rather, I want to know the scenarios we do not think a priori God was giving any design input to enable genetic change. Cancer evolution, viral evolution, and human diversity are three possible domains I am offering. Others are possible too. It isn’t about proving a negative here, but establishing a negative control.

Hopefully whatever calculations we use to infer design, when applied to the negative controls, would not produce a positive. If a false positive arises, we should doubt that strategy for detecting design.


There are many things to say, and many interesting issues to be answered in the comments that have been offered here. I am really not sure where to begin.

So, let’s begin at this last statement of yours, hoping that it can help me clarify a few things about FI and its measurement.

First of all, it should be clear that all the information we are discussing here is in digital form. That really helps, because it is much easier to measure FI in digital form. However, we need to know the real molecular basis of our functions. That’s why I rarely discuss fossils, morphology and similar issues, and stick rather to protein sequences. It’s the only way to begin to be quantitative about FI.

Even so, it is not an easy task.

I will just remind here that FI is defined as the minimum number of specific bits that, to the best of our knowledge, are necessary to implement some explicitly defined function.

You will find more details about my definitions of design and FI here:

Defining Design

and here:

Functional Information Defined

These are, indeed, the first two OPs I wrote for UD. I like to have my definitions explicit, before going on.

Now, very briefly, FI is usually measured directly as -log2 of the ratio between the target space and the search space, as defined by the function. The measure is completely empirical, and it must be referred to some definite system and time window. The purpose, of course, is to possibly infer design for some object we are observing in the system.

We infer design if we observe that some object can implement a function, explicitly defined, which implies at least 500 bits of specific FI. This is a very simplified definition, and we may need to clarify many aspects later. For the moment, it will be a starting point.

But, of course, those 500+ bits of FI must arise in the system at some time, and must not be present before. IOWs, we need the appearance of new complex FI in the system, to infer a design intervention.
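As a minimal sketch of the definition above, FI can be computed as -log2 of the ratio between target space and search space, and then compared with the 500-bit threshold. The function names, the 150-residue length and the assumed target-space size are hypothetical, chosen only for illustration; in practice the target space is the hard quantity to estimate.

```python
import math

def functional_information_bits(log2_target_size: float, log2_search_size: float) -> float:
    """FI = -log2(target / search), computed in log space to avoid huge numbers."""
    return log2_search_size - log2_target_size

def exceeds_threshold(fi_bits: float, threshold_bits: float = 500.0) -> bool:
    """True if the FI value reaches the conventional 500-bit threshold."""
    return fi_bits >= threshold_bits

# Hypothetical example: a 150-residue protein, so the search space is 20**150.
length = 150
log2_search = length * math.log2(20)

# Assumption for illustration only: 2**100 sequences implement the function.
log2_target = 100.0

fi = functional_information_bits(log2_target, log2_search)
print(f"FI = {fi:.1f} bits, above threshold: {exceeds_threshold(fi)}")
```

Working in log space keeps the arithmetic exact enough for order-of-magnitude comparisons, which is all the threshold test needs.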

So, just to be brief, I believe that none of the three examples you offer is an example of new complex FI arising in a system. I will briefly discuss the first two, avoiding for the moment the example of viruses (I am not really expert about that, and I may need some better clarifications about what you mean).

So, the first point. You say: " There is more than 500 bits of FI differing between individual humans".

Well, the point is not if there is such a difference. The point is: what is the origin of such a difference?

Let’s see. The basic reference genome and proteome are rather similar in all human beings. The FI there is more or less the same, and we can wonder how it came into existence. Much of it comes, of course, from “lower” animals, but some of it is human specific. In all cases, according to my theory, complex FI arose as the result of design, in the course of natural history.

Then there are the differences. Of course humans are different one from the other. There are many reasons for that.

First of all, the greatest part of that difference is generated in the course of human procreation. We know how the combination of two different genomes (father and mother) into one generates a new individual genome, with the further contribution of recombination. That is a wonderful process, but essentially it is a way to remix FI that is already there, in a new configuration. The process is combinatorially rather simple. I don’t see how it should generate new FI.

I will be more clear. We would observe new FI if some individual, for some strange reason, received a new complex protein capable of implementing a new function, a protein which did not exist at all before in all other humans. Let’s say a new enzyme, 500 AAs long, that implements a completely new biochemical reaction and has no sequence similarity with any other previous protein. That would be new FI. Or the addition of a new function to an existing protein by at least 500 new specific bits, as some new partial sequence configuration which did not exist before.

These are the things that happened a lot of times in the course of natural history. The information jumps. But there is nothing like that in the differences between humans.

There are also differences due to variation. Mainly neutral or quasi-neutral variation, which generates known polymorphisms, or simply individual differences. The online ExAC browser is a good repository of them.

And there are the negative mutations, genetic diseases.

Nothing of that qualifies as new complex FI.

Let’s go for the moment to the second point. You say:

“We see more than this amount of FI develop in the evolution of cancer”.

I don’t think so. Could you give examples, please?

What we see in the evolution of cancer is a series of random mutations, most of them very deleterious, that in some cases confer reproductive advantage to cancer cells in the host environment. But those mutations are combinatorially simple. They are usually SNPs, or deletions, duplications, inversions, translocations and so on. Simple events. Many of them, but still simple events.

We are exactly in the scenario described and analyzed by Behe in his very good last book. Simple mutations affect already existing complex structures, altering their previous functions in such a way that, sometimes, a relative advantage is gained. For example, a cell can escape control, and start reproducing beyond its assigned limits.

I will just give an example. Burkitt’s lymphoma is caused, among other things, by a translocation involving the c-myc gene, a very important TF. The most common event is the t(8;14) translocation. The event is very simple, but the consequences are complex. However, the change in FI is trivial.

A single frameshift mutation can easily cancel a whole gene and its functions. Still, the molecular event is very simple.

FI arises when more than 500 specific bits for a new function appear. That is about 116 specific AAs. Do you know of any cancer cell where a completely new and functional enzyme appears? That did not exist before?
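The figure of about 116 AAs can be checked with a short calculation. The sketch below assumes that each specific position is fully constrained to one of the 20 standard amino acids, so that it contributes log2(20) bits; partially constrained positions would contribute less, and more of them would be needed. The function name is hypothetical.

```python
import math

# Assumption: a fully specified position (1 of 20 amino acids) carries log2(20) bits.
BITS_PER_SPECIFIC_RESIDUE = math.log2(20)  # about 4.32 bits

def residues_needed(bits: float) -> int:
    """Smallest number of fully specified residues that reaches the given bit count."""
    return math.ceil(bits / BITS_PER_SPECIFIC_RESIDUE)

print(residues_needed(500))  # 116
```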

Well, that’s enough for the moment.

Thanks @gpuccio, but you are jumping the gun here. I’m asking if any of these cases would constitute a good negative control for you, not what you would compute their FI to be. If none of these cases work as negative controls, what would?

Yes, those cases are not examples of FI. We can use them as negative controls. But also any case where a protein is passed from one species to another one without any relevant modifications can be a negative control. There is no variation of FI there (for that protein). There are a lot of cases like that.

Indeed, any case of variation in a lab where no designed information is added to the system can be used as a negative control. For example, Lenski’s experiment can be considered as a good negative control, a non design biological system, because the design intervention there is rather limited (setting the lab system and the general rules), but no other specific information is added. Instead, any experiment where intelligent selection takes place would not be a good negative.


I may have more later, but for now it should be noted that the bit score from a BLAST search and the informational number of bits used to identify design may be two rather different things. The usage adopted by gpuccio and most ID proponents is just a -log2 transformation of the ratio of functional to all possible sequences. I don’t believe the bit score from a BLAST search is the same thing.

If I am mistaken about this, a correction would be most welcome.


My understanding is aligned with you @Art. The BLAST bitscore is worth explaining sometime, but it does not appear to be equivalent to what most IDists mean by FI.

Great. That is really helpful. You are agreeing then that:

  1. Human-to-human genetic variation is a negative control.
  2. Viral evolution is a negative control.
  3. Cancer evolution is a negative control.

You add the appropriate caveat that the horizontal transfer of genes is not a design-based infusion of information. You also suggest as a negative control:

  1. Experiments that do not include “intelligent” selection (for example, Lenski’s experiment)

“Intelligent selection” is poorly defined but I think I get what you mean. It seems also that, correctly, this would include both in silico simulations and in vitro experiments. Both approaches are valid “experiments” if conducted correctly.

That is great news. From here, there are two ways forward I see.

First, I want to hear your response to what we have written already about your analysis. That seems the place to start. This should clarify a great deal about your methodology and what you are precisely claiming.

Second, after your methodology is refined and clarified, perhaps we will circle back to looking at some of these other cases. To have a preview of what this might look like, see here: Computing the Functional Information in Cancer. However, let’s not distract from the first point too soon.

Looking forward to it.

OK, now I will answer the main points raised in the comments here, and then we can discuss your arguments about cancer. It will be a good way to explain better what FI is, how it should be used, and its role in design inference.

I am looking forward to it, too! :slight_smile:

Swamidass et al. :

I see that many comments here are about the relationship of my analysis with a possible phylogenetic analysis. I am trying to understand better what you mean and why you suggest that. Maybe some further feedback from you would help.

I will try to clarify a few points about my analysis that, apparently, are not well understood.

  1. My analysis is focusing on the vertebrate transition only because it is very easy to study it. A number of circumstances are particularly favorable, as I have tried to explain. In particular, the time pattern of the pertinent splits, and the presence of sufficient protein sequences in the NCBI database to make the comparisons, and of course, the very good data about the human proteome.

But in no way am I trying to affirm that there is something special in the vertebrate transition. There is a lot of functional information added at that time, and we can easily check the sequence conservation of that information up to humans. However, the same thing probably happens at many other transitions.

So, why do I find a lot of FI at the vertebrate transition? It’s because I am looking for FI specific to the vertebrate branch. Indeed, I am using human proteins as a probe, and humans are of course vertebrates. My analysis shows that a big part of the specific FI found in vertebrates was added at the initial transition. It is not comparing that to what happens in other branches of natural history.

Just to be clear, I could analyze in a similar way the transition to hymenoptera. In that case, I would take as probes the protein sequences in some bee, for example, and blast them against pre-hymenoptera and some common ancestor of the main branches in that tree. I have not done that, but it can be done, and it would have the same meaning as my vertebrate analysis: to quantify how much specific FI was added at the beginning of that evolutionary branch.

I am not saying that vertebrates are in any way special. I am not saying that humans are in any way special (well, they are, but for different reasons).

  2. It should be clear that my methodology is not measuring the absolute FI present in a protein. It is only measuring the FI conserved up to humans, and specific to the vertebrate branch.

So, let’s say that protein A has 800 bits of human conserved sequence similarity (conserved for 400+ million years). My methodology affirms that those 800 bits are a good estimator of specific FI. But let’s say that the same protein A, in bees, has only 400 bits of sequence similarity with the human form. Does it mean that the bee protein has less FI?

Absolutely not. It probably just means that the bee protein has less vertebrate specific FI. But it can well have a lot of Hymenoptera specific FI. That can be verified by measuring the sequence similarity conserved in that branch for a few hundred million years, in that protein.
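One way to read the comparison described above is as a difference between bit scores: the best score of the probe protein against taxa that split before the transition, subtracted from the best score against taxa inside the branch. The sketch below works under that assumption and is not a statement of the exact procedure used in the analysis; the function name and all numbers are hypothetical placeholders.

```python
def branch_specific_jump(pre_split_bits: list[float], post_split_bits: list[float]) -> float:
    """Best bit score inside the branch minus best bit score outside it.

    Both lists hold BLAST bit scores of the same probe protein against
    homologues in different taxa. A large positive value suggests similarity
    that appears at the transition and is then conserved within the branch.
    """
    return max(post_split_bits) - max(pre_split_bits)

# Placeholder values, invented for illustration only (not real BLAST results).
pre_vertebrate_hits = [310.0, 295.5, 402.0]   # taxa that split before the transition
vertebrate_hits = [1180.0, 1210.5]            # taxa inside the branch

jump = branch_specific_jump(pre_vertebrate_hits, vertebrate_hits)
print(f"Branch-specific jump: {jump} bits")
```

The same function would apply unchanged to a bee probe against pre-hymenoptera and hymenoptera taxa; only the probe and the two groups of hits change.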

OK, time has expired. More in next post.


Whether some protein grew in size during the evolution of some clade or along some branch, or how conserved that protein is in that clade, seems to me to have little to nothing to do with whether it was designed. What am I missing?


Hi Art, thank you for your contribution, that allows me to clarify some important points.

You are right of course, the bitscore of a BLAST search and the value of FI as -log2 of the ratio between target space and search space are not the same thing. But the point is: the first is a very good estimator of the second, provided that some conditions are satisfied.

The idea of using conserved sequence similarity to estimate FI is not mine. I owe it completely to Durston, and probably others have pointed to that concept before. It is, indeed, a direct consequence of some basic ideas of evolutionary theory. I have just developed a simple method to apply that concept to get a quantitative foundation to the design inference in appropriate contexts.

The condition that essentially has to be satisfied is: sequence conservation for long evolutionary periods. I have always tried to emphasize that it is not simply sequence conservation, but that long evolutionary periods are absolutely needed. But sometimes that aspect is not understood well, so I am happy that I can emphasize it here.

I will be more clear. A strong sequence similarity between, say, a human protein and the chimp homologue of course is not a good estimator of FI. The reason for that should be clear enough: the split between chimps and humans is very recent. Any sequence configuration that was present in the common ancestor, be it functionally constrained or not, will probably still be there in humans and chimps, well detectable by BLAST, just because there has not been enough evolutionary time after the split for the sequences to diverge because of neutral variation. IOWs, we cannot distinguish between similarity due to functional constraint and passive similarity, if the time after split is too short.

But what if the time after split is 400+ million years, like in the case of the transition to vertebrates, or maybe a couple billion years, like in the case of ATP synthase beta chain in E. coli and humans?

According to what we know about divergence of synonymous sites, I would say that time windows longer than 200 million years begin to be interesting, and probably 400+ million years are more than enough to guarantee that most or all of the sequence similarity can be attributed to strong functional constraint. For 2 billion years, I would say that there can be no possible doubt.

So, in this particular case of long conservation, the degree of similarity becomes a good estimator of functional constraint, and therefore of FI. The unit is the same (bits). The meaning is the same, in this special context.

Technically, the bitscore measures the improbability of finding that similarity by chance in the specific protein database we are using. FI measures the improbability of finding that specific sequence by a random walk from some unrelated starting point. If the sequence similarity can be attributed only to functional constraint, because of the long evolutionary separation, then the two measures are strongly connected.
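For reference, the sketch below shows how a BLAST bit score is conventionally obtained from a raw alignment score under Karlin-Altschul statistics, and how it relates to the expected number of chance hits. The parameters `lam` and `k` depend on the scoring matrix and gap penalties; the values in the usage lines are assumptions, roughly in the range commonly quoted for gapped BLOSUM62 searches, and the raw score and lengths are invented placeholders.

```python
import math

def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Normalized bit score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def expected_chance_hits(bits: float, query_len: int, db_len: int) -> float:
    """E-value: E = m * n * 2**(-S'), for query length m and database length n."""
    return query_len * db_len * 2.0 ** (-bits)

# Assumed parameters and placeholder inputs, for illustration only.
lam, k = 0.267, 0.041
s_bits = bit_score(raw_score=1500, lam=lam, k=k)
e_value = expected_chance_hits(s_bits, query_len=500, db_len=10**9)
print(f"bit score = {s_bits:.1f}, E-value = {e_value:.3g}")
```

The bit score is thus a statement about chance similarity in a database search, while -log2 of the target-to-search ratio is a statement about the function; treating one as an estimator of the other relies on the conservation argument, not on the formulas themselves.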

Of course, there are differences and technical problems. We can discuss them, if you want. The general idea is that the BLAST bitscore is a biased estimator, because it always underestimates the true FI.

But that is not the important point, because we are not trying to measure FI with great precision. We just need some reliable approximation and order of magnitude. Why?

Because in the biological world, a lot of objects (in this case, proteins) exhibit FI well beyond the threshold of 500 bits, which can conventionally be assumed as safe for any physical system to infer design. So, when I get a result of 1250 bits of new FI added to CARD11 at the start of vertebrate evolution, I don’t really need absolute precision. The true FI is certainly much more than that, but who cares? 1250 bits are more than enough to infer design.

To all those who have expressed doubts about the value of long conserved sequence similarity to estimate FI, I would simply ask the following simple question.

Let’s take again the beta chain of ATP synthase. Let’s BLAST again the E. coli sequence against the human sequence. And, for a moment, let’s forget the bitscore, and just look at identities. P06576 vs WP_106631526.

We get 335 identities. 72%. Conserved for, say, a couple billion years.

My simple question is: if we are not measuring FI, what are we measuring here?

IOWs, how can you explain that amazing conserved sequence similarity, if not as an estimate of functional specificity?

Just to know.

You are missing the central core of ID theory, which connects complex FI to the design inference. I will get to that soon enough.
