Gpuccio: Functional Information Methodology

We could just look at taxonomic level I suppose. That should work as a starting point.

It is worth emphasizing that BLAST bitscore is itself an approximation, not any meaningful gold standard. It would be good to show that pairwise branch length is negatively related to bitscore, to demonstrate the relationship to gpuccio’s analysis. It is not a problem, however, if there is a spread in that relationship.
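
As a rough illustration of that check, here is a minimal Python sketch that computes a rank (Spearman) correlation between pairwise branch lengths and bitscores. The function names are hypothetical, the inputs are assumed to be two equal-length lists with one entry per sequence pair, and rank correlation is just one reasonable choice for a relationship that need not be linear.

```python
def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Plain Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(branch_lengths, bitscores):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(branch_lengths), ranks(bitscores))
```

A clearly negative value would support the mapping between the two measures; scatter around the trend is expected and does not by itself undermine it.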

It is not that hard to see his logic. He thinks that FI is a telltale sign of engineering, because only design can cause large increases in FI.

Of course, we know there are other ways that FI can accrue and that common descent with neutral drift can inflate FI estimates. Gpuccio, it seems, does not know about these other mechanisms and complications to his analysis. Not knowing these things, he does not have a way to explain FI increases without resorting to design.

So his logic is reasonable, because he does not know of other ways of explaining FI. To make his case, however, he has to demonstrate that the other mechanisms do not account for any FI we measure. He has not done that here. My guess is because he is unaware of these mechanisms.


A few thoughts. First, it seems to me that a simple pairwise comparison is completely inadequate for what he claims to be doing, and using the actual phylogenetic tree, with lots of chordate sequences, would better enable him to do what he seems to be trying to do here very clumsily, which is to estimate the sequences at internal nodes. And also to estimate sequence conservation. More work than clicking on a few buttons in BLAST, but a much better approach. Oh, I see @davecarlson has already mentioned this.

Second, I don’t understand why similarity to the human sequence should estimate functional information. If you started with a tunicate sequence, would you not see something similar, and wouldn’t that show a big information jump on the way to Ciona? I do not, in general, see an argument for relating bitscore to functional information.

As for the confusion of homology with similarity, that seems common among molecular biologists, not just Gpuccio.

Bad idea. Taxonomic ranks are arbitrary and taxa of the same rank can’t be considered equivalent. Even as a starting point.


I do not find that logic to be reasonable. If I am claiming you have committed a crime, I have not demonstrated that claim merely by showing that some other person didn’t commit it.

Of course, my case is even weaker if it turns out the evidence points conclusively to the fact that this other person did commit the crime, as is the case here. :slight_smile:

First of all I would like to thank Joshua @Swamidass and all the kind interlocutors at Peaceful Science for taking the time to consider my writing. I have read the first few comments, and they are very interesting and stimulating.

As agreed with @colewd , who is the stimulator of this discussion and whom I thank sincerely, I will answer here to the main arguments raised at PS, because I do want my answers to be clearly visible to the people here at UD. I hope this “parallel” discussion (already realized in the past with TSZ) may be comfortable for the guys at PS too. Bill can of course reference my comments at PS, if he wants to.

So, let’s start.

Thank you! Seriously, this is one of the biggest acknowledgments I have ever received from the other side.

Yes. Indeed, in many proteins. CARD11 is just one example.

That’s exactly what I have tried to do. My comment for your blog is, of course, only a brief summary with a couple of examples.

I have described in detail my results for vertebrates in this OP:

In brief, I have tested the whole human reference proteome against 9 groups of organisms, chosen, with some practical compromise, to represent the natural history of metazoa. For each human protein, a BLAST comparison was made versus all the protein sequences present in the NCBI database for that group of organisms, and the best hit chosen. I used the downloaded version of BLAST to perform the comparisons automatically.

So, my database has the best hit of each human protein with each class of organisms, in terms of bitscore, bits per aminoacid, and difference with the previous class. I use that database for all my analyses, and the R software to analyze results and draw graphics.
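
For readers who want to follow the bookkeeping described above, here is a minimal Python sketch of how such a table could be built. It is not gpuccio's actual code (he reports using R). The group names are placeholders, the input is assumed to be already-parsed `(protein, group, bitscore)` tuples, and the "difference with the previous class" is assumed here to be taken in bits.

```python
from collections import defaultdict

# Placeholder group names in assumed chronological order. The post says
# nine groups were used but does not list them, so these are hypothetical.
GROUP_ORDER = ["group_1", "group_2", "group_3"]


def best_hits(hits):
    """Keep the highest bitscore seen for each (protein, group) pair.

    `hits` is an iterable of (protein_id, group, bitscore) tuples,
    for example parsed from tabular BLAST output.
    """
    best = defaultdict(float)
    for protein, group, bitscore in hits:
        key = (protein, group)
        if bitscore > best[key]:
            best[key] = bitscore
    return dict(best)


def summarize(hits, lengths, group_order=GROUP_ORDER):
    """Return rows of (protein, group, bitscore, baa, jump).

    baa  = best bitscore divided by the human protein length
           (bits per amino acid)
    jump = best bitscore minus the best bitscore in the previous group
           (None for the first group). Assumption: a group with no hit
           is scored as 0 bits.
    """
    best = best_hits(hits)
    rows = []
    for protein, length in lengths.items():
        previous = None
        for group in group_order:
            score = best.get((protein, group), 0.0)
            jump = None if previous is None else score - previous
            rows.append((protein, group, score, score / length, jump))
            previous = score
    return rows
```

Summing the positive `jump` values for one group across all proteins would give a total of the kind quoted in the next paragraph, under the assumptions stated above.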

You can read in the OP above mentioned that I have found about 1.7 million bits of new FI appearing in vertebrates, just at the start of their natural history.

You can find the general graph with the mean values, in baa, for each class, in the above mentioned OP (Fig. 1). You can see the general jump in vertebrates, which is better analyzed in the following figures.

However, the important point is not only the generic jump, but the fact that the significant jumps can be found in specific classes of proteins, especially those involved in immune response and brain development. That is good confirmation that my methodology is really measuring the relevant information novelties.

Your next statement deserves a rather more detailed answer, so I will have to postpone it to the next post (as soon as possible).

Well, thank you for the correction. I am a medical doctor, and not a biologist, so some imprecision in terminology on my part can be reasonably expected. I will be happy to acknowledge any well explained correction.

Yes, I have been using the word “homology” to mean the degree of sequence similarity between protein sequences, as measured for example by the Blast bitscore. I was not aware that the word should be reserved to the binary condition of being or not a homologue.

So I will, in the future, use the form “conserved sequence similarity” instead of “conserved homology”. I would not use “sequence identity”, because identities are not the only component of the bitscore, even if certainly the major component.

However, I can see that in a later comment John_Harshman comes to my (partial) rescue, stating:

Thank you. Being human, I can find some small consolation in not being alone in my errors!


@Gpuccio, it is great to see a starting point of a response from you. We aim to keep the conversation substantive and professional. It doesn’t always go that way perfectly, but that is our goal. I hope that we would all understand each other, and science, better through exchanges like this.

A few thoughts to offer as we await a longer response from you.

Interesting Hypothesis

It is an interesting hypothesis, and it deserves to be received with seriousness and rigor. Good ideas come from all over. The beauty of how science works is that we can, if we are humble to the data, come to common understanding through engaging hypotheses like this. Even failed hypotheses have value for this reason.

Whether or not this hypothesis pans out, our success/failure is determined by the rigor with which we test the hypothesis, and our ability to come to agreement (especially across camps) over evidential claims concerning the hypothesis.

Terminology Flubs are Human!

As far as I am concerned this is a non-issue now, and you are adjusting. Thanks for hearing @davecarlson and @John_Harshman out. It is a fairly common mistake, even among biologists, and it does not affect the substance of your argument.

The Central Methodological Issue

It seems that the central methodological issue is:

As I explained, there is a mapping between bitscores and the phylogenetic analysis. I hope my explanation made sense. It seems like it would be far more methodologically grounded to move into the phylogenomic analysis, and translate back to BLAST if you want to compare results.

How familiar are you with BLAST bitscores? Do you know how they are generated? Do you understand the assumptions involved in how they are computed? The gap between reality and these assumptions, I believe, undermines your case. If you switch to a better methodology, this objection would go away, and you would still be able to cross check with your prior results.

Interpreting What We Expect

This paper recently posted by @sfmatheson has a figure that tells us what we expect to see with the new analysis.

Here, branch length is going to be linearly correlated with “FI gain,” as you call it. If you look at vertebrates here, the length of the branch is not remarkably or unusually long. If your results look much like this, it seems that it would count against your analysis. If the data did end up like this, do you agree it would count against your hypothesis?

Your Data is a Good Starting Point

I really love that you are looking at the data yourself. You clearly have the ability to run many of these analyses yourself. Great!

Let’s talk about how to do the analysis with phylogenies. I recommend following @Jordan’s contributions to this thread: John Harshman: The Phylogeny of Crocodiles. There are several programs you can freely download to use on your sequence database. It would be really interesting to look at your results in comparison to the tree I just posted.

Next, it seems your analysis would benefit immensely by augmenting it with more organisms, at least one from each of the terminal clades in the tree I posted above. It would be valuable, again, to see how your results compare with that tree. It would be interesting to see how including more genomes affects your conclusions. It would be interesting to see what sorts of deviations from the consensus tree you can find.

Some Parallel Questions

From hearing about your work from others, and reading your articles, I wanted to clarify your position. It sounds like you:

  1. Believe in an old earth.
  2. Affirm common descent of humans with the great apes.
  3. Are arguing that there is evidence for design in the origin of phyla.
  4. Conceive of design as information infused into the common descent process.

Is this about right?

I’m looking forward to seeing the conversation continue.


Looking over that figure one more time, it does not look like the branch lengths are drawn to scale. That means we do not exactly expect this tree. Nonetheless, this question still stands:

We would want to see the actual tree, with branches drawn to scale, to make a better assessment.

A Scaled Graph

This scaled tree from the same paper is more helpful, even though it does not have vertebrates. Focus on Chordata (dark blue):

The branch length or divergence width of Chordata does not seem particularly notable compared to other groups. The branch length and spread for vertebrates will be smaller than that of Chordata, because vertebrates are fully contained within Chordata. So the vertebrates would be less remarkable than Chordata.

As negative controls, look at Arthropods (bottom), Nematodes (pink), Acoela (red), and Platyhelminthes (light green). There seems to be far more information gain and/or spread in these taxa.

Consider the Myzostomida

The long thin (red) clade Myzostomida deserves some attention. These worms are very phenotypically diverse, very genetically different from other clades, and very genetically similar to one another.


To reiterate:

  1. Very genetically different from other clades.
  2. Very genetically similar to one another.
  3. Very phenotypically different from one another.

How do we reconcile the divergence between genetics and phenotype (2 vs. 3)? If we understand its difference from other clades as a measure of FI, then we should not expect to see much phenotypic diversity in the clade, but we do. Likewise, if phenotypic diversity has much to do with FI (measured this way), we expect there to be a high spread in genetic diversity for a phenotypically diverse group such as this. We really need to reconcile this divergence between genetics and phenotype, as it suggests that this approach to measuring FI lacks validity.

A key finding from the neutral theory is that, in bulk, genetic changes are more a marker of history than of functional change. That means we do not expect there to be a strong link between FI and genetic divergence. That solves the riddle, but in a way that undermines this analysis as a measure of FI.

How would @Gpuccio solve this riddle?


My comments regarding gpuccio’s work:

As far as I can tell, the goal is to estimate what gpuccio calls “human conserved FI”, the idea being (as far as I can tell) that large amounts of “human conserved FI” will be suggestive of design. What confuses me is the method that is used to arrive at “human conserved FI”, and how this relates to any parameter that may suggest design.

To illustrate – gpuccio estimates the “human conserved FI” for a given protein by subtracting the bit scores from BLAST comparisons of human, shark, and Saccoglossus (H:S – H:Sa). The problem with this is that one can obtain very different values if one changes the organisms that are plugged into the analysis. For example, replace sharks with chimpanzees (Ch) and one gets much, much larger values. However, also replace Saccoglossus with the mouse (M) and the value for “human conserved FI” (H:Ch – H:M) will be much, much smaller.

To show why this doesn’t make much sense to me, consider instead the Saccharomyces species (cerevisiae and fragilis), and, as an “outgroup” to represent some unicellular predecessor, Plasmodium. Run gpuccio’s calculation (Sc:Sf - Sc:P) and one gets a result that would call for design in the origination of yeasts (probably, for the comparison I present here, the amount of “conserved FI” would be much greater for yeast than for humans when the latter is calculated using chimpanzees and mice).

Thus, as best I can tell, “conserved FI” is little more than a measure of evolutionary relatedness. One can rig the calculation to obtain pretty much any value one wants, and the value would reflect relationships between the three organisms used for the analysis. Nothing more, IMO.
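
To make the dependence on the choice of organisms concrete, here is a small Python sketch of the subtraction described above. The bitscores are made-up toy values for a single hypothetical protein, not real BLAST results; only their ordering (closer relatives score higher against the human query) is meant to be realistic.

```python
def conserved_fi(bitscore_near, bitscore_far):
    """Difference of two bitscores against the same query protein.

    This mirrors the H:S - H:Sa subtraction: the score against the
    nearer group minus the score against the farther group.
    """
    return bitscore_near - bitscore_far


# Toy values only (assumption): bitscores of one human protein against
# its best hit in each organism. Not measured data.
toy_bitscores = {
    "chimp": 2000.0,
    "mouse": 1800.0,
    "shark": 1200.0,
    "saccoglossus": 400.0,
}

shark_vs_sacco = conserved_fi(toy_bitscores["shark"], toy_bitscores["saccoglossus"])
chimp_vs_sacco = conserved_fi(toy_bitscores["chimp"], toy_bitscores["saccoglossus"])
chimp_vs_mouse = conserved_fi(toy_bitscores["chimp"], toy_bitscores["mouse"])

print(shark_vs_sacco, chimp_vs_sacco, chimp_vs_mouse)
```

With these toy numbers the same protein yields 800, 1600, or 200 bits depending only on which pair of organisms is plugged in, which is the pattern the paragraphs above describe.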

Beyond this, it is not clear to me what the connection is between “conserved FI” and design. I suspect (but would welcome correction) that gpuccio is drawing on the work of Dembski, Axe, Behe, et al. who argue that information, defined as the frequency of occurrence of a functional sequence in sequence space, is suggestive of design when it is high. However, as many, many discussions here on PS have shown, the ID vanguard is wrong when it comes to their ideas about protein functionality and information. This calls into question gpuccio’s use of the term, and the conclusions drawn.

However, I will grant that I am not familiar with all of gpuccio’s posts on UD, and will grant the possibility that gpuccio is aware of these considerations, and has developed a more correct formulation. If so, I would be interested in a summary, and more so in the ways that the metric has been calibrated and/or validated.

My 2c.


I agree, which is why this is important:

OK, I have been kindly invited by Joshua Swamidass to open an account here, and here I am.

I will go on this way: I will answer as usual at UD, and then paste the answer here. I hope that works! :slight_smile:


Thanks for joining the conversation @gpuccio. I’ve transferred ownership of @colewd’s reposts of your responses to you. Please contact the @moderators by private message with any concerns that arise.


Well, maybe there can be some overlap here, but just to be sure that everything is clear, I will paste here, as said, my earlier answers at UD, as they are, and then go on.

I invite you to be patient if I need some time to answer: there is a lot to say, and my time is limited. Those who are interested in more details can certainly look at my OPs at UD, as I quote them.

Thanks to all.


Good news is that all those posts are already here! If you scan up the thread you will see them, now assigned to you.

I will start with these easy questions, as I have not much time right now. Here are my answers:

  1. Yes, definitely.

  2. I affirm common descent of all living organisms, at least as the best current theory. Including, of course, humans. But, as explained many times, I believe that design acts on common descent to input the new functional information any time it is required.

  3. I am arguing that there is evidence of design any time that we observe new complex functional information, higher than 500 bits for example, arise. Of course, also in the origin of phyla.

  4. Yes, definitely.

Well, that was easy.


Very helpful @gpuccio. I want to ask one last clarifying question here, regarding this:

Do you have any taxa you would agree are negative controls? That their evolution by common descent did not require designed infusion of information? For example, what about viruses? Subspecies of rhinos? What are some examples of groups of organisms you think could have arisen by common descent without an “infusion of design”?

How did you determine these negative controls?

If you can’t give us these negative controls, are you arguing that every change in organisms is a designed infusion, no matter how small?

It all depends on the complexity of molecular changes in FI. My analysis is absolutely quantitative.

So, if two taxa have no relevant differences in FI, they could well have arisen from non-design mechanisms.

But I believe that in most cases it would be difficult to prove a negative. What do you think?

It seems that, measuring FI the way some people measure it, there is more than 500 bits of FI differing between individual humans. We see more than this amount of FI develop in the evolution of cancer. Also in the evolution of viruses.

This, it would seem, would mean that an infusion of design is required to explain all these things. That raises some difficult theological questions. Why would God be intentionally intervening to cause cancer, for example?

On the other hand, if they are agreed negative controls, it gives us a way to assess metrics of FI that you are proposing.