Gpuccio: Functional Information Methodology

Gpuccio is a poster at Uncommon Descent who published a blog article at UD arguing that proteins have high functional information (FI), and must have been designed. His work is commonly referenced by ID proponents.

This is his post:

We asked him to explain his methodology further. @colewd asked him (comment 343), and gpuccio gave a response (comment 356). He has agreed to respond to reasonable critique.

OK, so here is a brief primer about my methodology to measure Functional Information in proteins:

a) I use Blast to measure sequence homology between proteins, in bits. I take the bitscore from the Blast algorithm as it is, with some consideration of the number of identities and similarities, too.

b) I am interested in homologies that are conserved throughout long evolutionary periods. I consider that kind of homology as a very good estimator of FI. The reason is very simple: a specific sequence can be conserved for those long time windows only if it is under very strong functional constraint, and is therefore preserved by negative (purifying) selection. In all other cases, sequence homologies are practically cancelled after a long evolutionary time because of the constant effect of neutral variation.

c) How long must the time window be so that sequence homology may be considered a good estimator of FI? I would say at least 200 – 400 million years. Better if 400. Why? Because that’s more or less the time window that is usually associated with “saturation” of synonymous sites, IOWs with the more or less complete loss of any detectable homology in neutral sequences.

d) In particular, I am interested in “information jumps” in proteins at specific evolutionary times, especially at the transition to vertebrates.

e) That transition is supposed to have happened more than 400 million years ago, providing therefore a good time window for our reasonings. Moreover, the split between pre-vertebrates and the first vertebrates, and the following split between cartilaginous fishes and bony fishes, happened reasonably in a relatively short time more than 400 million years ago. As we will see, this is a very good context to measure information jumps in proteins.

f) So, let’s be practical. I take some protein in the human form. IOWs, I use the human sequence as my initial “probe”.

g) Then I measure, for the specific task of studying the transition to vertebrates, the sequence homology between the human protein and all pre-vertebrates, in particular non-vertebrate deuterostomes and chordates. I take the bitscore value of the best hit. This value represents the best assessment of sequence homologies existing before the appearance of vertebrates. The value can be very low, or medium, or high. Whatever it is, that sequence information was already there.

h) Then I measure the sequence homology between the human protein and the proteins in cartilaginous fishes. I take the bitscore value of the best hit. Again, it can be low, medium or high. This is the sequence homology that is present at the beginning of vertebrate history, before the split between cartilaginous fishes and bony fishes. That, again, is supposed to have happened 400+ million years ago. This value is important, because humans derive from bony fishes. Therefore, any homology found between cartilaginous fishes and humans predates the split between cartilaginous fishes and bony fishes. IOWs, any such homology was already present in the common ancestor of fishes, and therefore it has been conserved for 400+ million years.

i) Finally, I take the difference, in bits, between the bitscore from h) and the bitscore from g). This is the “information jump”, IOWs the sequence homology (to the human form) that has been “added” in the transition to vertebrates. And it is also a very good estimator of the functional information jump (more precisely, of the jump in human conserved FI), IOWs of the human conserved FI that was added to that protein in the transition to vertebrates, because both the homology measured from h) and the homology measured from g) are sequence homologies that have been conserved for 400+ million years.

This is the general idea.

Now, an example.

Let’s consider for a moment an old friend, the beta chain in ATP synthase. We know it is a very conserved protein, with a very high homology between the human sequence and the sequence in bacteria. Certainly a lot of FI there.

Now, this is a 529 AA long protein. Let’s say that we want to apply our methodology to see what happens to that protein at the vertebrate transition.

OK, blasting the human sequence (P06576) against non vertebrate Deuterostomia and Chordates, the best hit is 866 bits (Acanthaster planci), a starfish.

Now, let’s blast it against cartilaginous fishes. The best hit is 929 bits (Callorhinchus milii).

So, the information jump at the transition to vertebrates, for that protein, is 929 – 866 = 63 bits. 0.12 bits per AA (baa). Very low indeed.

That simply means that the protein was already almost identical to the human form in pre-vertebrates (87% identities). IOWs, the FI was already there, and no big information jump takes place at the vertebrate transition.

Now, let’s do that again with the protein CARMA1/CARD11, which we have discussed in this thread.

This is a 1554 AA long sequence, in the human form.

Again, let’s blast it against Deuterostomia and Chordates that are not vertebrates. The best hit is 234 bits (Saccoglossus kowalevskii). 0.15 baa. A very low score, for such a long protein. That means that the specific sequence found in humans was almost completely absent in pre-vertebrates.

Now, let’s blast it against cartilaginous fishes. The best hit is 1514 bits (Callorhinchus milii). That means that, even if the shark protein is still different from the human form, about half of the potential sequence information is already there. More than 400 million years ago.

So, how big is the jump in human conserved FI? Easy. 1514 – 234 = 1280. IOWs, about 1280 bits of FI have been added “de novo” in vertebrates, and then conserved for more than 400 million years. That’s a very big information jump.

IOWs, this protein was highly and specifically engineered during the transition to vertebrates, and that precious FI has then been preserved up to now.
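
For readers who want to check the arithmetic, the two worked examples above reduce to a subtraction and a division. The following is only a minimal Python sketch: the function name `information_jump` is hypothetical, and the bitscores and lengths are simply the figures quoted in the post, not values re-run through BLAST.

```python
def information_jump(bits_cartilaginous, bits_prevertebrate, length_aa):
    """Return (jump in bits, jump in bits per amino acid)."""
    jump = bits_cartilaginous - bits_prevertebrate
    return jump, jump / length_aa

# Best-hit bitscores and human protein lengths exactly as quoted in the post.
examples = {
    "ATP synthase beta (P06576)": (929, 866, 529),
    "CARMA1/CARD11": (1514, 234, 1554),
}

for name, (cartilaginous, prevertebrate, length) in examples.items():
    jump, baa = information_jump(cartilaginous, prevertebrate, length)
    print(f"{name}: jump = {jump} bits, {baa:.2f} baa")
```

Note that the result depends entirely on which two best hits are plugged in, a point taken up later in the thread.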

Obviously this is not among the most important critiques, but Gpuccio seems to not understand what homology means. It’s a binary state. Two sequences (or characters) are either homologous or they are not. It’s nonsensical to assess the “degree” (e.g., high or low) of homology. I presume he means sequence identity, which is really not at all the same thing.

The hypothesis he is testing is interesting:

Essentially, he is arguing there was a large jump in FI in proteins at the vertebrate transition. Maybe he is right. I am not sure yet.

It seems that he needs to run some systematic controls to be sure. It would be interesting to see the conservation (in bits) between homologs of the protein at different taxa levels. If we looked at that graph, would we see a discontinuity at vertebrates? Maybe, but it does not seem he has collected enough information to be sure. I would like to see the graph.

Let’s imagine his hypothesis bears out. What does that mean?

Well that inference is not warranted. As we know, there are several mechanisms that produce FI in biological sequences. He would have to rule all of these out to make this inference.

Given the multiple lineages involved, this sort of question also seems more conducive to an explicitly phylogenetic analysis instead of a series of pairwise blast searches. But some criterion other than bitscore would have to be used, and I’m not sure what that would be.

I’m pretty sure that bitscores can be mathematically interconverted with phylogeny branch length. A phylogenetic analysis would be the right way to do this.

Interesting. I’m sure I could find a use for this. Can you elaborate more (a citation is fine!)?

That conclusion just comes out of nowhere. I fail to see how he has supported his belief that this protein was “engineered”.

From branch length, you can compute the approximate number of substitutions between any two sequences. Knowing the sequence length, that gives you a way to compute the percent identity. Percent identity is linearly correlated with bitscore. With some assumptions, you can compute an analytic relationship.

The analytic relationship will not be exact, because it will have to make a lot of assumptions. A better analysis would pay attention to homoplasies along each pathway, and also would take into account different AA frequencies.
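
To make the analytic route concrete, here is a rough Python sketch of the first step, going from branch length to expected percent identity. It is not the exact relationship, only one simple model: it assumes 20 equally frequent amino acids and equal rates at every site (a Jukes-Cantor-style correction extended to 20 states), so it ignores exactly the amino acid frequency differences mentioned above. The function name `expected_identity` is hypothetical.

```python
import math

N_STATES = 20  # assumption: 20 amino acids, all equally frequent

def expected_identity(branch_length):
    """Expected fraction of identical sites between two sequences.

    branch_length is the expected number of substitutions per site along
    the path joining the two sequences. Repeated hits at the same site are
    accounted for; unequal amino acid frequencies and rate variation
    among sites are not.
    """
    equilibrium = 1.0 / N_STATES
    decay = math.exp(-branch_length * N_STATES / (N_STATES - 1))
    return equilibrium + (1.0 - equilibrium) * decay

for d in (0.0, 0.1, 0.5, 1.0, 3.0):
    print(f"branch length {d:.1f} -> expected identity {expected_identity(d):.3f}")
```

Under this model identity falls from 1 toward a floor of 1/20 as branch length grows, so the relationship is only approximately linear over short to moderate distances.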

Instead, I recommend we just fit a linear equation using the empirical data. The empirical approach should implicitly account for these concerns by scaling branch length to BLAST bitscores. Either way, the relationship between bitscore and branch length is going to be more-or-less linear (the longer the branch length, the smaller the bitscore).
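
The empirical fit could be as simple as ordinary least squares on pairs of (pairwise branch length, bitscore). The sketch below is an illustration only: `fit_line` is a hypothetical helper, and the four data points are invented placeholders chosen to lie on a straight line, not measurements from any tree or BLAST search.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Placeholder values for illustration only (not real data).
branch_lengths = [0.1, 0.4, 0.8, 1.2]      # substitutions per site
bitscores = [950.0, 800.0, 600.0, 400.0]   # bits

slope, intercept = fit_line(branch_lengths, bitscores)
print(f"bitscore ~ {slope:.1f} * branch_length + {intercept:.1f}")
```

With real data one would expect a negative slope with scatter around the line; the scatter itself would show how noisy bitscore is as a proxy for branch length.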

From this, we can see a way to better test gpuccio’s hypothesis. He is trying to test it with hand picked blast searches, examining the top hits. This has him looking at the distance between extant sequences in different clades, but that is not really directly addressing his hypothesis. A better strategy would be to look at the branch length for the vertebrate clade, and compare it to other clades at a similar level. We would want to see the vertebrate clade has a longer branch length than others for his hypothesis to be supported.

This is straightforward enough to do that it might be worth actually doing. There should be some servers online to do this sort of analysis, right?

Yes, that makes sense other than that at some level it would be hard to decide what clades are at an “equivalent” level to the vertebrate clade. In any case, I have too much on my plate at the moment, but if nobody takes a look at this in the meantime, I could probably conduct the analysis several months from now.

We could just look at taxonomic level I suppose. That should work as a starting point.

It is worth emphasizing that BLAST bitscore is an approximation of its own, and not any meaningful gold standard. It would be good to show pairwise branch length is negatively related to bitscore, to demonstrate the relationship to gpuccio’s analysis. It is not a problem, however, if there is a spread in that relationship.

It is not that hard to see his logic. He thinks that FI is a tell tale sign of engineering, because only design can cause large increases in FI.

Of course, we know there are other ways that FI can accrue and that common descent with neutral drift can inflate FI estimates. Gpuccio, it seems, does not know about these other mechanisms and complications to his analysis. Not knowing these things, he does not have a way to explain FI increases without resorting to design.

So his logic is reasonable, because he does not know of other ways of explaining FI. To make his case, however, he has to demonstrate that the other mechanisms do not account for any FI we measure. He has not done that here. My guess is because he is unaware of these mechanisms.

A few thoughts. First, it seems to me that a simple pairwise comparison is completely inadequate for what he claims to be doing, and using the actual phylogenetic tree, with lots of chordate sequences, would better enable him to do what he seems to be trying to do here very clumsily, which is to estimate the sequences at internal nodes. And also to estimate sequence conservation. More work than clicking on a few buttons in BLAST, but a much better approach. Oh, I see @davecarlson has already mentioned this.

Second, I don’t understand why similarity to the human sequence should estimate functional information. If you started with a tunicate sequence, would you not see something similar, and wouldn’t that show a big information jump on the way to Ciona? I do not, in general, see an argument for relating bitscore to functional information.

As for the confusion of homology with similarity, that seems common among molecular biologists, not just Gpuccio.

Bad idea. Taxonomic ranks are arbitrary and taxa of the same rank can’t be considered equivalent. Even as a starting point.

I do not find that logic to be reasonable. If I am claiming you have committed a crime, I have not demonstrated that claim merely by showing that some other person didn’t commit it.

Of course, my case is even weaker if it turns out the evidence points conclusively to the fact that this other person did commit the crime, as is the case here. :)

First of all I would like to thank Joshua @Swamidass and all the kind interlocutors at Peaceful Science for taking the time to consider my writing. I have read the first few comments, and they are very interesting and stimulating.

As agreed with @colewd, who is the stimulator of this discussion and whom I thank sincerely, I will answer the main arguments raised at PS here, because I want my answers to be clearly visible to the people here at UD. I hope this “parallel” discussion (already realized in the past with TSZ) may be comfortable for the guys at PS too. Bill can of course reference my comments at PS, if he wants to.

So, let’s start.

Thank you! Seriously, this is one of the biggest acknowledgments I have ever received from the other side.

Yes. Indeed, in many proteins. CARD11 is just one example.

That’s exactly what I have tried to do. My comment for your blog is, of course, only a brief summary with a couple of examples.

I have described in detail my results for vertebrates in this OP:

In brief, I have tested the whole human reference proteome against 9 groups of organisms, chosen, with some practical compromise, to represent the natural history of metazoans. For each human protein, a blast comparison was made versus all the protein sequences present in the NCBI database for that group of organisms, and the best hit chosen. I used the downloaded version of Blast to perform the comparisons automatically.

So, my database has the best hit of each human protein with each class of organisms, in terms of bitscore, bits per aminoacid, and difference with the previous class. I use that database for all my analyses, and the R software to analyze results and draw graphics.
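
The best-hit step described above can be sketched in a few lines. This is not gpuccio's actual pipeline (he reports using R for the analysis); it is a minimal Python illustration that assumes the searches were saved in BLAST's default 12-column tabular layout, where the first field is the query ID, the second is the subject ID, and the twelfth is the bitscore. The function name `best_hits` and the sample rows are invented for the example.

```python
import csv
import io

BITSCORE_COLUMN = 11  # twelfth field of the default tabular layout (assumed)

def best_hits(tabular_text):
    """Map each query ID to (subject ID, bitscore) of its highest-scoring hit."""
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        bitscore = float(row[BITSCORE_COLUMN])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best

# Invented rows for illustration only; they are not real BLAST output.
sample = (
    "protA\tsubj1\t45.0\t500\t270\t5\t1\t500\t1\t498\t1e-60\t234\n"
    "protA\tsubj2\t40.0\t480\t280\t8\t3\t482\t2\t470\t1e-40\t180\n"
    "protB\tsubj3\t87.0\t529\t68\t1\t1\t529\t1\t528\t0.0\t866\n"
)
print(best_hits(sample))
```

Running this once per group of organisms and subtracting consecutive best-hit bitscores would reproduce the "difference with the previous class" column described in the post.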

You can read in the OP above mentioned that I have found about 1.7 million bits of new FI appearing in vertebrates, just at the start of their natural history.

You can find the general graph with the mean values, in baa, for each class, in the above mentioned OP (Fig. 1). You can see the general jump in vertebrates, which is better analyzed in the following figures.

However, the important point is not only the generic jump, but the fact that the significant jumps can be found in specific classes of proteins, especially those involved in immune response and brain development. That is good confirmation that my methodology is really measuring the relevant information novelties.

Your next statement deserves a rather more detailed answer, so I will have to postpone it to next post (as soon as possible).

Well, thank you for the correction. I am a medical doctor, and not a biologist, so some imprecision in terminology on my part can be reasonably expected. I will be happy to acknowledge any well explained correction.

Yes, I have been using the word “homology” to mean the degree of sequence similarity between protein sequences, as measured for example by the Blast bitscore. I was not aware that the word should be reserved to the binary condition of being or not a homologue.

So I will, in the future, use the form “conserved sequence similarity” instead of “conserved homology”. I would not use “sequence identity”, because identities are not the only component of the bitscore, even if certainly the major component.

However, I can see that in a later comment John_Harshman comes to my (partial) rescue, stating:

Thank you. Being human, I can find some small consolation in not being alone in my errors!

@Gpuccio, it is great to see a starting point of a response from you. We aim to keep the conversation substantive and professional. It doesn’t always go that way perfectly, but that is our goal. I hope that we would all understand each other, and science, better through exchanges like this.

A few thoughts to offer as we await a longer response from you.

Interesting Hypothesis

It is an interesting hypothesis, and it deserves to be received with seriousness and rigor. Good ideas come from all over. The beauty of how science works is that we can, if we are humble to the data, come to common understanding through engaging hypotheses like this. Even failed hypotheses have value for this reason.

Whether or not this hypothesis pans out, our success/failure is determined by the rigor with which we test the hypothesis, and our ability to come to agreement (especially across camps) over evidential claims concerning the hypothesis.

Terminology Flubs are Human!

As far as I am concerned this is a non issue now. Many biologists make this mistake, and you are adjusting. Thanks for hearing @davecarlson and @John_Harshman out. It is a fairly common mistake, even among biologists, and it does not affect the substance of your argument.

The Central Methodological Issue

It seems that the central methodological issue is:

As I explained, there is a mapping between bitscores and the phylogenetic analysis. I hope my explanation made sense. It seems like it would be far more methodologically grounded to move into the phylogenomic analysis, and translate back to BLAST if you want to compare results.

How familiar are you with BLAST bitscores? Do you know how they are generated? Do you understand the assumptions involved in how they are computed? The gap between reality and these assumptions, I believe, undermines your case. If you switch to a better methodology, this objection would go away, and you would still be able to cross check with your prior results.
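
For reference, a bit score is a rescaled raw alignment score, not a direct measurement of function. In the standard Karlin-Altschul statistics used by BLAST, the bit score is S' = (λS − ln K) / ln 2 and the expected number of chance hits is E = m · n · 2^(−S'), where S is the raw score and m · n is the search space. The Python sketch below just encodes those two formulas. The λ and K values in it are the ones commonly reported for BLOSUM62 with gap costs 11/1 and should be treated as assumptions here; real BLAST also applies effective-length corrections and composition adjustments that this sketch leaves out.

```python
import math

def bit_score(raw_score, lam, k):
    """Normalized bit score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def expected_chance_hits(bits, query_length, database_length):
    """E-value E = m * n * 2**(-S') for a search space of size m * n."""
    return query_length * database_length * 2.0 ** (-bits)

# Assumed parameters (commonly quoted for gapped BLOSUM62, gap costs 11/1).
LAMBDA, K = 0.267, 0.041

for raw in (500, 1000, 2000):
    print(f"raw score {raw} -> {bit_score(raw, LAMBDA, K):.1f} bits")
```

The point relevant to this thread is that the bitscore inherits every assumption of the scoring matrix and gap model, so a difference between two bitscores is a difference in alignment scores under that model.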

Interpreting What We Expect

This paper recently posted by @sfmatheson has a figure that tells us what we expect to see with the new analysis.

https://www.cell.com/current-biology/fulltext/S0960-9822(15)00928-8

Here, branch length is going to be linearly correlated with “FI gain,” as you call it. If you look at vertebrates here, the length of the branch is not remarkably or unusually long. If your results look much like this, it seems that it would count against your analysis. If the data did end up like this, do you agree it would count against your hypothesis?

Your Data is a Good Starting Point

I really love that you are looking at the data yourself. You clearly have the ability to run much of these analyses yourself. Great!

Let’s talk about how to do the analysis with phylogenies. I recommend following @Jordan’s contributions to this thread: John Harshman: The Phylogeny of Crocodiles. There are several programs you can freely download to use on your sequence database. It would be really interesting to look at your results in comparison to the tree I just posted.

Next, it seems your analysis would benefit immensely by augmenting it with more organisms, at least one from each of the terminal clades in the tree I posted above. It would be valuable, again, to see how your results compare with that tree. It would be interesting to see how including more genomes affects your conclusions. It would be interesting to see what sorts of deviations from the consensus tree you can find.

Some Parallel Questions

From hearing about your work from others, and reading your articles, I wanted to clarify your position. It sounds like you:

  1. Believe in an old earth.
  2. Affirm common descent of humans with the great apes.
  3. Are arguing that there is evidence for design in the origin of phyla.
  4. Conceive of design as information infused into the common descent process.

Is this about right?


I’m looking forward to seeing the conversation continue.

Looking over that figure one more time, it does not look like the branch lengths are drawn to scale. That means we do not exactly expect this tree. Nonetheless, this question still stands:

We would want to see the actual tree, with branches drawn to scale, to make a better assessment.

A Scaled Graph

This scaled tree from the same paper is more helpful, even though it does not have vertebrates. Focus on Chordata (dark blue):

The branch length or divergence width of Chordata does not seem particularly notable compared to other groups. The branch length and spread for vertebrates will be smaller than that of Chordata, because vertebrates are fully contained within Chordata. So the vertebrates would be less remarkable than Chordata.

As negative controls, look at Arthropods (bottom), Nematodes (pink), Acoela (red), and Platyhelminthes (light green). There seems to be far more information gain and/or spread in these taxa.

Consider the Myzostomida

The long thin (red) clade Myzostomida deserves some attention. These are worms that are very phenotypically diverse, very genetically different from other clades, and very genetically similar to one another.

To reiterate:

  1. Very genetically different from other clades.
  2. Very genetically similar to one another.
  3. Very phenotypically different from one another.

How do we reconcile the divergence between genetics and phenotype (2 vs. 3)? If we understand its difference from other clades as a measure of FI, then we should not expect to see much phenotypic diversity in the clade, but we do. Likewise, if phenotypic diversity has much to do with FI (measured this way), we expect there to be a high spread in genetic diversity for a phenotypically diverse group such as this. We really need to reconcile this divergence between genetics and phenotype, as it demonstrates that this approach to measuring FI lacks validity.

A key finding from the neutral theory is that, in bulk, genetic changes are more a marker of history than of functional change. That means we do not expect there to be a strong link between FI and genetic divergence. That solves the riddle, but in a way that undermines this analysis as a measure of FI.

How would @Gpuccio solve this riddle?

My comments regarding gpuccio’s work:

As far as I can tell, the goal is to estimate what gpuccio calls “human conserved FI”, the idea being (as far as I can tell) that large amounts of “human conserved FI” will be suggestive of design. What confuses me is the method that is used to arrive at “human conserved FI”, and how this relates to any parameter that may suggest design.

To illustrate – gpuccio estimates the “human conserved FI” for a given protein by subtracting the bit scores from BLAST comparisons of human, shark, and Saccoglossus (H:S – H:Sa). The problem with this is that one can obtain very different values if one changes the organisms that are plugged into the analysis. For example, replace sharks with chimpanzees (Ch) and one gets much, much larger values. However, also replace Saccoglossus with the mouse (M) and the value for “human conserved FI” (H:Ch – H:M) will be much, much smaller.

To show why this doesn’t make much sense to me, consider instead the Saccharomyces species (cerevisiae and fragilis), and, as an “outgroup” to represent some unicellular predecessor, Plasmodium. Run gpuccio’s calculation (Sc:Sf - Sc:P) and one gets a result that would call for design in the origination of yeasts (probably, for the comparison I present here, the amount of “conserved FI” would be much greater for yeast than for humans when the latter is calculated using chimpanzees and mice).

Thus, as best I can tell, “conserved FI” is little more than a measure of evolutionary relatedness. One can rig the calculation to obtain pretty much any value one wants, and the value would reflect relationships between the three organisms used for the analysis. Nothing more, IMO.
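
The dependence on the choice of organisms can be illustrated with a toy model. Everything in the Python sketch below is an assumption made for illustration: the exponential decay of bitscore with divergence time, the decay rate, the maximum score, and the rounded divergence times are placeholders, not measurements. The only point is that, if similarity simply decays with time since divergence, the "jump" is fixed by which three organisms are compared.

```python
import math

def toy_bitscore(divergence_myr, max_bits=1000.0, decay_per_myr=0.003):
    """Toy model: similarity to the human sequence decays with divergence time."""
    return max_bits * math.exp(-decay_per_myr * divergence_myr)

def jump(near_myr, far_myr):
    """Difference in toy bitscore between a nearer and a farther relative."""
    return toy_bitscore(near_myr) - toy_bitscore(far_myr)

# Placeholder divergence times from human, in millions of years (illustrative).
DIVERGENCE = {"chimpanzee": 7, "mouse": 90, "shark": 450, "Saccoglossus": 600}

for near, far in [("shark", "Saccoglossus"),
                  ("chimpanzee", "Saccoglossus"),
                  ("chimpanzee", "mouse")]:
    value = jump(DIVERGENCE[near], DIVERGENCE[far])
    print(f"H:{near} - H:{far} = {value:.0f} toy bits")
```

In this toy, no function or design enters the calculation at all, yet the three triples give three very different "jumps". A real analysis would need to show that the observed values differ from what such a relatedness-only baseline predicts.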

Beyond this, it is not clear to me what the connection is between “conserved FI” and design. I suspect (but would welcome correction) that gpuccio is drawing on the work of Dembski, Axe, Behe, et al. who argue that information, defined as the frequency of occurrence of a functional sequence in sequence space, is suggestive of design when it is high. However, as many, many discussions here on PS have shown, the ID vanguard is wrong when it comes to their ideas about protein functionality and information. This calls into question gpuccio’s use of the term, and the conclusions drawn.

However, I will grant that I am not familiar with all of gpuccio’s posts on UD, and will grant the possibility that gpuccio is aware of these considerations, and has developed a more correct formulation. If so, I would be interested in a summary, and more so in the ways that the metric has been calibrated and/or validated.

My 2c.

I agree, which is why this is important: