Gpuccio: Functional Information Methodology

A few thoughts. First, it seems to me that a simple pairwise comparison is completely inadequate for what he claims to be doing, and using the actual phylogenetic tree, with lots of chordate sequences, would better enable him to do what he seems to be trying to do here very clumsily, which is to estimate the sequences at internal nodes. And also to estimate sequence conservation. More work than clicking on a few buttons in BLAST, but a much better approach. Oh, I see @davecarlson has already mentioned this.

Second, I don’t understand why similarity to the human sequence should estimate functional information. If you started with a tunicate sequence, would you not see something similar, and wouldn’t that show a big information jump on the way to Ciona? I do not, in general, see an argument for relating bitscore to functional information.

As for the confusion of homology with similarity, that seems common among molecular biologists, not just Gpuccio.

Bad idea. Taxonomic ranks are arbitrary and taxa of the same rank can’t be considered equivalent. Even as a starting point.


I do not find that logic to be reasonable. If I am claiming you have committed a crime, I have not demonstrated that claim if I show that some other person didn’t commit it.

Of course, my case is even weaker if it turns out the evidence points conclusively to the fact that this other person did commit the crime, as is the case here. :slight_smile:

First of all I would like to thank Joshua @Swamidass and all the kind interlocutors at Peaceful Science for taking the time to consider my writing. I have read the first few comments, and they are very interesting and stimulating.

As agreed with @colewd, who is the stimulator of this discussion and whom I thank sincerely, I will answer the main arguments raised at PS here, because I want my answers to be clearly visible to the people here at UD. I hope this “parallel” discussion (already tried in the past with TSZ) will be comfortable for the guys at PS too. Bill can of course reference my comments at PS, if he wants to.

So, let’s start.

Thank you! Seriously, this is one of the biggest acknowledgments I have ever received from the other side.

Yes. Indeed, in many proteins. CARD11 is just one example.

That’s exactly what I have tried to do. My comment for your blog is, of course, only a brief summary with a couple of examples.

I have described in detail my results for vertebrates in this OP:

In brief, I have tested the whole human reference proteome against 9 groups of organisms, chosen, with some practical compromise, to represent the natural history of metazoans. For each human protein, a BLAST comparison was made versus all the protein sequences present in the NCBI database for that group of organisms, and the best hit chosen. I used the downloaded version of BLAST to perform the comparisons automatically.

So, my database has the best hit of each human protein with each class of organisms, in terms of bitscore, bits per aminoacid, and difference with the previous class. I use that database for all my analyses, and the R software to analyze results and draw graphics.
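To make the structure of that table concrete, here is a minimal sketch of the per-protein calculation, not the actual pipeline (the original analysis used R; Python is used here only for illustration). It assumes that "bits per aminoacid" means the best-hit bitscore divided by the length of the human protein, and that the groups are ordered from the most distant to the closest. The function names and the example numbers are invented.

```python
from typing import List


def bits_per_aa(bitscore: float, protein_length: int) -> float:
    """Best-hit bitscore normalized by the human protein length (assumed definition)."""
    return bitscore / protein_length


def jumps_between_groups(best_hit_bitscores: List[float], protein_length: int) -> List[float]:
    """Difference in bits per aminoacid between each group and the previous one.

    best_hit_bitscores must be ordered from the most distant group to the closest.
    """
    baa = [bits_per_aa(score, protein_length) for score in best_hit_bitscores]
    return [later - earlier for earlier, later in zip(baa, baa[1:])]


# Invented numbers for one hypothetical 1000-residue human protein.
example_jumps = jumps_between_groups([200.0, 250.0, 900.0], 1000)
print([round(j, 2) for j in example_jumps])
```

A large value in the returned list marks the step between two groups where the best-hit similarity to the human protein rises sharply.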

You can read in the OP above mentioned that I have found about 1.7 million bits of new FI appearing in vertebrates, just at the start of their natural history.

You can find the general graph with the mean values, in baa (bits per aminoacid), for each class, in the above mentioned OP (Fig. 1). You can see the general jump in vertebrates, which is better analyzed in the following figures.

However, the important point is not only the generic jump, but the fact that the significant jumps can be found in specific classes of proteins, especially those involved in immune response and brain development. That is good confirmation that my methodology is really measuring the relevant information novelties.

Your next statement deserves a rather more detailed answer, so I will have to postpone it to next post (as soon as possible).

Well, thank you for the correction. I am a medical doctor, and not a biologist, so some imprecision in terminology on my part can be reasonably expected. I will be happy to acknowledge any well explained correction.

Yes, I have been using the word “homology” to mean the degree of sequence similarity between protein sequences, as measured for example by the Blast bitscore. I was not aware that the word should be reserved to the binary condition of being or not a homologue.

So I will, in the future, use the form “conserved sequence similarity” instead of “conserved homology”. I would not use “sequence identity”, because identities are not the only component of the bitscore, even if certainly the major component.

However, I can see that in a later comment John_Harshman comes to my (partial) rescue, stating:

Thank you. Being human, I can find some small consolation in not being alone in my errors!


@Gpuccio, it is great to see a starting point of a response from you. We aim to keep the conversation substantive and professional. It doesn’t always go that way perfectly, but that is our goal. I hope that we would all understand each other, and science, better through exchanges like this.

A few thoughts to offer as we await a longer response from you.

Interesting Hypothesis

It is an interesting hypothesis, and it deserves to be received with seriousness and rigor. Good ideas come from all over. The beauty of how science works is that we can, if we are humble to the data, come to common understanding through engaging hypotheses like this. Even failed hypotheses have value for this reason.

Whether or not this hypothesis pans out, our success/failure is determined by the rigor with which we test the hypothesis, and our ability to come to agreement (especially across camps) over evidential claims concerning the hypothesis.

Terminology Flubs are Human!

As far as I am concerned this is a non issue now. Many biologists make this mistake, and you are adjusting. Thanks for hearing @davecarlson and @John_Harshman out. It is a fairly common mistake, even among biologists, and it does not affect the substance of your argument.

The Central Methodological Issue

It seems that the central methodological issue is:

As I explained, there is a mapping between bitscores and the phylogenetic analysis. I hope my explanation made sense. It seems like it would be far more methodologically grounded to move into the phylogenomic analysis, and translate back to BLAST if you want to compare results.

How familiar are you with BLAST bitscores? Do you know how they are generated? Do you understand the assumptions involved in how they are computed? The gap between reality and these assumptions, I believe, undermines your case. If you switch to a better methodology, this objection would go away, and you would still be able to cross check with your prior results.
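For reference, the bit score BLAST reports is a rescaling of the raw alignment score, S' = (λS − ln K) / ln 2, and the expected number of chance hits is roughly E = m·n·2^(−S'). A minimal sketch of that relation follows. The λ and K values in the example are the ones commonly reported for gapped BLOSUM62 searches and are used here as an assumption; BLAST prints the actual values in its own output.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Karlin-Altschul normalization of a raw alignment score into bits."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits: float, query_len: int, db_len: int) -> float:
    """Approximate number of chance alignments scoring at least `bits`."""
    return query_len * db_len * 2.0 ** (-bits)


# Assumed parameters (commonly reported for gapped BLOSUM62); check your own BLAST output.
LAMBDA, K = 0.267, 0.041
print(round(bit_score(100, LAMBDA, K), 1))
```

Nothing in the formula refers to function: the bit score is a statistic about how surprising an alignment is under a substitution-matrix model.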

Interpreting What We Expect

This paper recently posted by @sfmatheson has a figure that tells us what we expect to see with the new analysis.

Here, branch length is going to be linearly correlated with “FI gain,” as you call it. If you look at vertebrates here, the length of the branch is not remarkably or unusually long. If your results look much like this, it seems that it would count against your analysis. If the data did end up like this, do you agree it would count against your hypothesis?

Your Data is a Good Starting Point

I really love that you are looking at the data yourself. You clearly have the ability to run many of these analyses yourself. Great!

Let’s talk about how to do analysis with phylogenies. I recommend following @Jordan’s contributions to this thread: John Harshman: The Phylogeny of Crocodiles. There are several programs you can freely download to use on your sequence database. It would be really interesting to look at your results in comparison to the tree I just posted.

Next, it seems your analysis would benefit immensely by augmenting it with more organisms, at least one from each of the terminal clades in the tree I posted above. It would be valuable, again, to see how your results compare with that tree. It would be interesting to see how including more genomes affects your conclusions. It would be interesting to see what sorts of deviations from the consensus tree you can find.

Some Parallel Questions

From hearing about your work from others, and reading your articles, I wanted to clarify your position. It sounds like you:

  1. Believe in an old earth.
  2. Affirm common descent of humans with the great apes.
  3. Are arguing that there is evidence for design in the origin of phyla.
  4. Conceive of design as information infused into the common descent process.

Is this about right?

I’m looking forward to seeing the conversation continue.


Looking over that figure one more time, it does not look like the branch lengths are drawn to scale. That means we do not exactly expect this tree. Nonetheless, this question still stands:

We would want to see the actual tree, with branches drawn to scale, to make a better assessment.

A Scaled Graph

This scaled tree from the same paper is more helpful, even though it does not have vertebrates. Focus on Chordata (dark blue):

The branch length or divergence width of Chordata does not seem particularly notable compared to other groups. The branch length and spread for vertebrates will be smaller than that of Chordata, because vertebrates are fully contained within Chordata. So the vertebrates would be less remarkable than Chordata.

As negative controls, look at Arthropods (bottom), Nematodes (pink), Acoela (red), and Platyhelminthes (light green). There seems to be far more information gain and/or spread in these taxa.

Consider the Myzostomida

The long thin (red) clade Myzostomida deserves some attention. These are a bunch of worms that are very phenotypically diverse, very different from other clades, and very genetically similar to one another.


To reiterate:

  1. Very genetically different from other clades.
  2. Very genetically similar to one another.
  3. Very phenotypically different from one another.

How do we reconcile the divergence between genetics and phenotype (2 vs. 3)? If we understand its difference from other clades as a measure of FI, then we should not expect to see much phenotypic diversity in the clade, but we do. Likewise, if phenotypic diversity has much to do with FI (measured this way), we expect there to be a high spread in genetic diversity for a phenotypically diverse group such as this. We really need to reconcile this divergence between genetics and phenotype, as it demonstrates that this approach to measuring FI lacks validity.

A key finding from the neutral theory is that, on bulk, genetic changes are more a marker of history than functional changes. That means we do not expect there to be a strong link between FI and genetic divergence. That solves the riddle, but in a way that undermines this analysis as a measure of FI.

How would @Gpuccio solve this riddle?


My comments regarding gpuccio’s work:

As far as I can tell, the goal is to estimate what gpuccio calls “human conserved FI”, the idea being (as far as I can tell) that large amounts of “human conserved FI” will be suggestive of design. What confuses me is the method that is used to arrive at “human conserved FI”, and how this relates to any parameter that may suggest design.

To illustrate – gpuccio estimates the “human conserved FI” for a given protein by subtracting the bit scores from BLAST comparisons of human, shark, and Saccoglossus (H:S – H:Sa). The problem with this is that one can obtain very different values if one changes the organisms that are plugged into the analysis. For example, replace sharks with chimpanzees (Ch) and one gets much, much larger values. However, also replace Saccoglossus with the mouse (M) and the value for “human conserved FI” (H:Ch – H:M) will be much, much smaller.

To show why this doesn’t make much sense to me, consider instead the Saccharomyces species (cerevisiae and fragilis), and, as an “outgroup” to represent some unicellular predecessor, Plasmodium. Run gpuccio’s calculation (Sc:Sf - Sc:P) and one gets a result that would call for design in the origination of yeasts (probably, for the comparison I present here, the amount of “conserved FI” would be much greater for yeast than for humans when that latter is calculated using chimpanzees and mice).

Thus, as best I can tell, “conserved FI” is little more than a measure of evolutionary relatedness. One can rig the calculation to obtain pretty much any value one wants, and the value would reflect relationships between the three organisms used for the analysis. Nothing more, IMO.
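The dependence on the choice of organisms can be sketched in a few lines. The bitscores below are invented purely for illustration (they are not BLAST results), and `conserved_fi` is a hypothetical name for the subtraction described above.

```python
def conserved_fi(bits_to_close: float, bits_to_distant: float) -> float:
    """The subtraction described above: human-vs-closer hit minus human-vs-more-distant hit."""
    return bits_to_close - bits_to_distant


# Invented bitscores of one human protein against four organisms (illustration only).
human_vs = {"chimp": 2400.0, "mouse": 2200.0, "shark": 1500.0, "saccoglossus": 600.0}

shark_minus_sacco = conserved_fi(human_vs["shark"], human_vs["saccoglossus"])
chimp_minus_sacco = conserved_fi(human_vs["chimp"], human_vs["saccoglossus"])
chimp_minus_mouse = conserved_fi(human_vs["chimp"], human_vs["mouse"])

print(shark_minus_sacco, chimp_minus_sacco, chimp_minus_mouse)
```

With the same protein and the same formula, the result tracks how closely related the two comparison organisms are to humans, which is the concern raised here.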

Beyond this, it is not clear to me what the connection is between “conserved FI” and design. I suspect (but would welcome correction) that gpuccio is drawing on the work of Dembski, Axe, Behe, et al. who argue that information, defined as the frequency of occurrence of a functional sequence in sequence space, is suggestive of design when it is high. However, as many, many discussions here on PS have shown, the ID vanguard is wrong when it comes to their ideas about protein functionality and information. This calls into question gpuccio’s use of the term, and the conclusions drawn.

However, I will grant that I am not familiar with all of gpuccio’s posts on UD, and will grant the possibility that gpuccio is aware of these considerations, and has developed a more correct formulation. If so, I would be interested in a summary, and more so in the ways that the metric has been calibrated and/or validated.

My 2c.


I agree, which is why this is important:

OK, I have been kindly invited by Joshua Swamidass to open an account here, and here I am.

I will go on this way: I will answer as usual at UD, and then paste the answer here. I hope that works! :slight_smile:


Thanks for joining the conversation @gpuccio. I’ve transferred ownership of @colewd’s reposts of your responses to you. Please contact the @moderators with private message with any concerns that arise.


Well, maybe there can be some overlap here, but just to be sure that everything is clear, I will paste here, as said, my earlier answers at UD, as they are, and then go on.

I invite you to be patient if I need some time to answer: there is a lot to say, and my time is limited. Those who are interested in more details can certainly look at my OPs at UD, as I quote them.

Thanks to all.


Good news is that all those posts are already here! If you scan up the thread you will see them, now assigned to you.

I will start with these easy questions, as I have not much time right now. Here are my answers:

  1. Yes, definitely.

  2. I affirm common descent of all living organisms, at least as the best current theory. Including, of course, humans. But, as explained many times, I believe that design acts on common descent to input the new functional information any time it is required.

  3. I am arguing that there is evidence of design any time that we observe new complex functional information, higher than 500 bits for example, arise. Of course, also in the origin of phyla.

  4. Yes, definitely.

Well, that was easy.


Very helpful @gpuccio. I want to ask one last clarifying question here, regarding this:

Do you have any taxa you would agree are negative controls? That their evolution by common descent did not require designed infusion of information? For example, what about viruses? Subspecies of rhinos? What are some examples of groups of organisms you think could have arisen by common descent without an “infusion of design”?

How did you determine these negative controls?

If you can’t give us these negative controls, are you arguing that every change in organisms is a designed infusion, no matter how small?

It all depends on the complexity of molecular changes in FI. My analysis is absolutely quantitative.

So, if two taxa have no relevant differences in FI, they could well have arisen from non-design mechanisms.

But I believe that in most cases it would be difficult to prove a negative. What do you think?

It seems that, measuring FI the way some people measure it, there is more than 500 bits of FI differing between individual humans. We see more than this amount of FI develop in the evolution of cancer. Also in the evolution of viruses.

This, it would seem, would mean that an infusion of design is required to explain all these things. That raises some difficult theological questions. Why would God be intentionally intervening to cause cancer, for example?

On the other hand, if they are agreed negative controls, it gives us a way to assess metrics of FI that you are proposing.


To clarify one point, I think God governs all things and I have no problem with the notion of him giving input into evolutionary processes. I don’t think evolutionary science demonstrates God was not involved, so I’m not asking you to prove a negative.

Rather, I want to know the scenarios we do not think a priori God was giving any design input to enable genetic change. Cancer evolution, viral evolution, and human diversity are three possible domains I am offering. Others are possible too. It isn’t about proving a negative here, but establishing a negative control.

Hopefully whatever calculations we use to infer design, when applied to the negative controls, would not produce a positive. If a false positive arises, we should doubt that strategy for detecting design.


There are many things to say, and many interesting issues to be answered in the comments that have been offered here. I am really not sure where to begin.

So, let’s begin at this last statement of yours, hoping that it can help me clarify a few things about FI and its measurement.

First of all, it should be clear that all the information we are discussing here is in digital form. That really helps, because it is much easier to measure FI in digital form. However, we need to know the real molecular basis of our functions. That’s why I rarely discuss fossils, morphology and similar issues, and stick rather to protein sequences. It’s the only way to begin to be quantitative about FI.

Even so, it is not an easy task.

I will just remind here that FI is defined as the minimum number of specific bits that, to the best of our knowledge, are necessary to implement some explicitly defined function.

You will find more details about my definitions of design and FI here:

Defining Design

and here:

Functional Information Defined

These are, indeed, the first two OPs I wrote for UD. I like to have my definitions explicit, before going on.

Now, very briefly, FI is usually measured directly as the ratio between the target space and the search space, as defined by the function. The measure is completely empirical, and it must be referred to some definite system and time window. The purpose, of course, is to possibly infer design for some object we are observing in the system.
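As a minimal sketch of that measure, assuming FI is expressed in bits as −log2(target space / search space); the function name and the toy numbers are illustrative only:

```python
import math


def functional_information(target_size: int, search_size: int) -> float:
    """FI in bits: -log2(target / search), computed as a difference of logs
    so that very large sequence spaces do not overflow."""
    if not 0 < target_size <= search_size:
        raise ValueError("need 0 < target_size <= search_size")
    return math.log2(search_size) - math.log2(target_size)


# Toy example: 2**10 functional sequences in a space of 2**510 sequences.
fi_bits = functional_information(2**10, 2**510)
print(fi_bits, fi_bits >= 500)
```

The hard empirical part is estimating the size of the target space for a real protein function; the arithmetic itself is simple.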

We infer design if we observe that some object can implement a function, explicitly defined, which implies at least 500 bits of specific FI. This is a very simplified definition, and we may need to clarify many aspects later. For the moment, it will be a starting point.

But, of course, those 500+ bits of FI must arise in the system at some time, and must not be present before. IOWs, we need the appearance of new complex FI in the system, to infer a design intervention.

So, just to be brief, I believe that none of the three examples you offer is an example of new complex FI arising in a system. I will briefly discuss the first two, avoiding for the moment the example of viruses (I am not really an expert on that, and I may need some better clarifications about what you mean).

So, the first point. You say: " There is more than 500 bits of FI differing between individual humans".

Well, the point is not if there is such a difference. The point is: what is the origin of such a difference?

Let’s see. The basic reference genome and proteome are rather similar in all human beings. The FI there is more or less the same, and we can wonder how it came into existence. Much of it comes, of course, from “lower” animals, but some of it is human specific. In all cases, according to my theory, complex FI arose as the result of design, in the course of natural history.

Then there are the differences. Of course humans are different one from the other. There are many reasons for that.

First of all, the greatest part of that difference is generated in the course of human procreation. We know how the combination of two different genomes (father and mother) into one generates a new individual genome, with the further contribution of recombination. That is a wonderful process, but essentially it is a way to remix FI that is already there, in a new configuration. The process is combinatorially rather simple. I don’t see how it should generate new FI.

I will be more clear. We would observe new FI if some individual, for some strange reason, received a new complex protein capable of implementing a new function, a protein which did not exist at all before in all other humans. Let’s say a new enzyme, 500 AAs long, that implements a completely new biochemical reaction and has no sequence similarity with any other previous protein. That would be new FI. Or the addition of a new function to an existing protein by at least 500 new specific bits, as some new partial sequence configuration which did not exist before.

These are the things that happened a lot of times in the course of natural history. The information jumps. But there is nothing like that in the differences between humans.

There are also differences due to variation. Mainly neutral or quasi neutral variation, which generates known polymorphisms, or simply individual differences. The online ExAC browser is a good repository of them.

And there are the negative mutations, genetic diseases.

Nothing of that qualifies as new complex FI.

Let’s go for the moment to the second point. You say:

“We see more than this amount of FI develop in the evolution of cancer”.

I don’t think so. Could you give examples, please?

What we see in the evolution of cancer is a series of random mutations, most of them very deleterious, that in some cases confer reproductive advantage to cancer cells in the host environment. But those mutations are combinatorially simple. They are usually SNPs, or deletions, duplications, inversions, translocations and so on. Simple events. Many of them, but still simple events.

We are exactly in the scenario described and analyzed by Behe in his very good last book. Simple mutations affect already existing complex structures, altering their previous functions in such a way that, sometimes, a relative advantage is gained. For example, a cell can escape control, and start reproducing beyond its assigned limits.

I will just give an example. Burkitt’s lymphoma is caused, among other things, by a translocation involving the c-myc gene, a very important TF. The most common event is an 8;14 translocation. The event is very simple, but the consequences are complex. However, the change in FI is trivial.

A single frameshift mutation can easily cancel a whole gene and its functions. Still, the molecular event is very simple.

FI arises when more than 500 specific bits for a new function appear. That is about 116 specific AAs. Do you know of any cancer cell where a completely new and functional enzyme appears? That did not exist before?
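The figure of about 116 AAs follows if each position is taken to be fully specified, so that it contributes log2(20) ≈ 4.32 bits (one of 20 amino acids). That is an assumption of this sketch; positions that tolerate substitutions contribute less, so more residues would be needed to reach the same total.

```python
import math

# Bits contributed by one position that admits exactly one of the 20 amino acids.
BITS_PER_FULLY_SPECIFIED_AA = math.log2(20)  # about 4.32 bits


def min_residues_for_bits(bits: float) -> int:
    """Smallest number of fully specified positions whose total reaches `bits`."""
    return math.ceil(bits / BITS_PER_FULLY_SPECIFIED_AA)


print(min_residues_for_bits(500))  # 116
```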

Well, that’s enough for the moment.
