Gpuccio: Functional Information Methodology

I said the information is the positions of visible stars in the sky. The function of this information, for many thousands of years, was navigation (latitude and direction), time-telling (seasons), and storytelling (constellations). Any change to the configuration that created a visible difference would impact one or all of these things.

There are about 9,000 visible stars in the sky (low estimate). Keeping things like visual acuity in mind (Naked eye - Wikipedia), we can compute the information. However, even if there are just two possible locations in the sky for every star (absurd) and only half the stars are important (absurd), we are still at 4,500 bits of information in the position of stars in the sky. That does not even tell us the region of sky we are looking at (determined by season and latitude), but we can neglect this for now.
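The low-ball arithmetic above can be written out directly. Every number here is the post's own deliberately conservative assumption, not a measurement.

```python
import math

# Back-of-envelope bit count from the argument above. All values are the
# post's own deliberately low-balled assumptions.
N_STARS = 9000            # rough count of naked-eye-visible stars
POSITIONS_PER_STAR = 2    # "absurd" low-ball: two possible sky positions each
FRACTION_IMPORTANT = 0.5  # "absurd" low-ball: only half the stars matter

bits = N_STARS * FRACTION_IMPORTANT * math.log2(POSITIONS_PER_STAR)
print(bits)  # 4500.0
```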

I will briefly answer this, and then for the moment I must go.

What makes the current configuration of the stars specific to help navigation, time telling or story telling? If the configuration were a different random configuration, wouldn’t it be equally precious for navigation, time telling and storytelling?

There is no specific functional information in the configuration we observe. Most other configurations generated by cosmic events would satisfy the same functions you have defined.

Yes, you did say this. We dispute this claim. @sfmatheson, @glipsnort, and I have all explained our objections.

This is the crux (or at least one crux) of the issue. We are convinced that neutral evolution will be mistaken as FI gains. You have not put forward any negative controls to quell our objections. See what has already been said:

That last paragraph is key. Your estimate of FI seems to be, actually, FI + NE (neutral evolution), where NE is expected to be a very large number. So the real FI is some number much lower than what you calculated.


I really don’t understand.

Can you please explain why neutral evolution would be part of the FI I measure? This is a complete mystery to me.

Neutral evolution explains the conservation of sequences? Why? I really don’t understand.

A new configuration would not be equally precious for telling the stories we have now. We would have different constellations, and therefore different myths about those constellations. My function is to tell these specific (specified!) stories, not any old stories you might want to come up with in place of them. So no, a new configuration would break the storytelling function.

Remember also, that some configurations (e.g. a regular grid or a repeating pattern) are useless for navigation or time-telling. Very quickly, we would get over 500 bits with a careful treatment, well into the thousands if not millions of bits.

But you are doing exactly what I cautioned about. You are defining the function as a consequence of an already observed configuration.

If the configuration were different, we would be telling different stories.

Are you really so confused about the meaning of FI?

The function must be defined independently. You can define the function as “telling stories about the stars”. You cannot define the function as “telling stories about the stars in this specific configuration”.

How can you not understand that this is conceptually wrong?

No, I’m just using a particular definition of function, which parallels yours in biology. If you don’t want me to use my definition, I am not sure you can use yours.

It seems you are defining function by the already observed configuration of proteins in extant biology. This does not take into account the configurations that would produce the same high level functions, but we just don’t see because it is not what happened.

If these are the rules, you are breaking them. Right?

It is subjective how we define function. I chose a definition of function that parallels yours in biology, so I am not sure how you can object to me “breaking the rule” while you break it yourself with your own definition!

Yes, this highlights the problems with using FI as a way of determining if something is designed or not.


When using conservation it’s possible, even likely, to underestimate the total FI present while overestimating the change in FI. When the mouse genome was sequenced, for example, one of the immediate outcomes was a lower bound on the fraction of the genome that is functional (not quite the same thing as FI, but in the same conceptual neighborhood), a bound of 6% based on the fraction of the genome that is conserved. That was a valid conclusion (with various caveats). If we were to repeat the same analysis across primates to humans, we would get a larger fraction, say 8%. That would also be a valid conclusion. What is not a valid conclusion is that the functional fraction of the genome increased by 2% in primates. Some functional sequence is likely to have changed on the branches between rodents and primates without losing function, while other functional sequence has been lost and gained in each branch.
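The distinction above can be made concrete with a toy bookkeeping model. All the fractions below are illustrative assumptions chosen only to mirror the 6%/8% example, not genomic data.

```python
# Toy model of the point above: the truly functional fraction is held
# constant, yet the shallower (primate) comparison recovers more of it,
# so the 2% difference is sequence turnover, not a gain in function.
# Every number is an illustrative assumption.
truly_functional = 0.10              # assumed constant functional fraction
turned_over_since_rodents = 0.04     # functional sequence that drifted without losing function
turned_over_within_primates = 0.02   # same, over the shallower comparison

conserved_vs_mouse = truly_functional - turned_over_since_rodents       # recovered: ~6%
conserved_in_primates = truly_functional - turned_over_within_primates  # recovered: ~8%
apparent_gain = conserved_in_primates - conserved_vs_mouse              # ~2%, yet nothing was gained
print(conserved_vs_mouse, conserved_in_primates, apparent_gain)
```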


Which is, once again, why it is important to do this analysis with a phylogeny. As @John_Harshman, an expert in this area, comments:

Note that the paper he links to does a great job at the analysis you are attempting. It would be valuable to look it over and determine where you disagree, agree, or could learn from it, @gpuccio.

Furthermore, I reiterate my question from early on:



Gentlemen, I see my name has been mentioned here. I’m a bit overwhelmed by work, so cannot participate in this discussion, but I do want to make a few clarifying points regarding my own thinking on this problem.

First, as I have stated previously, the quality of any estimation depends upon sufficient sampling, and I greatly doubt that we have sufficient data to estimate the FI required for any protein in a specific species. The same probably goes for genus, and maybe even for family and order. My interest has always been, and continues to be, the FI required for the origin of novel protein families, rather than for a protein in a single species or genus.

For this reason, I have focussed, and continue to focus, on protein families that have had the benefit of sufficient sampling produced by thousands of independently evolving populations across a wide range of taxa (preferably across many phyla). In discussions with various people in the field, there seem to be two research questions to answer (which appear to have come up here, though I’ve not read anything other than gpuccio’s one reply included in the email sent to me):

  1. How can we test for sufficient sampling, given common descent and,

  2. How can we test for sufficient sampling, given the possibility of a global maximum fitness in sequence space, clustering our data in only a subsection of sequence space?

For the past 8 months or so I have been working on a method to provide answers to both questions. The method itself was not difficult, but the testing of that method has been time consuming. I cannot discuss anything related to this here, as I am submitting my findings to a journal for peer review and publication. I will say this, however … I wouldn’t even think about estimating the FI for a protein from the data for an individual species (e.g., human), for obvious reasons when one scans the sampling available at present. But my initial assumptions several years ago regarding sampling broadly across phyla or kingdoms are being verified to produce reasonably accurate estimates of the FI required for the origin of many protein families.

I can’t say any more until the paper passes review and is published. The input and critiques I’ve had thus far from a few non-ID scientists sceptical of ID have been especially valuable, but I cannot widen the circle of discussion any further until after the paper is out. As my former supervisor urged me … “stop leaking your research and focus on submitting more papers for publication.”

As I indicated at the outset, I cannot participate in this discussion, although it does look to be interesting.


Thanks for chiming in @Kirk. Great to see you, even if it is for a moment.


Again, my purpose is not to measure the full content of FI in a protein. I agree that it is possible to underestimate the full content of FI, but we can have a good idea of the component revealed by sequence conservation from that point on. That’s what I do, that’s what I get as a result, and that’s what I use for my inferences. I have never made reference to the full content of FI. My purpose is to demonstrate the presence, at a certain point in evolutionary history, of a definite quantity of FI that has a sequence configuration that will be conserved up to humans. This should be clear by now. Either you agree that my methodology does that, or you don’t. Feel free to decide, and if you want, to explain why. But there is no sense in requiring from my methodology what it has never tried to measure.

As for overestimating the change in FI, again I have never tried to estimate the absolute change in the full content of FI. That should be clear from what I have written. I quote myself:

That should be clear enough, but still you insist on the danger of overestimating the change in FI, when I have never tried to do that.

Maybe you are confused by the fact that I speak of information jumps. But, you see, my term has always been “information jumps in human conserved sequence similarity”. It’s not a jump in the full content of FI, as I clearly explain in the above quote.

IOWs, when I say that CARD11 shows an information jump of 1250 bits at the transition to vertebrates, I simply mean that 1250 bits of new FI, similar to the form observed today in humans, appear at that transition. It is a jump because new specific sequence information arises that was not there before. But I have never said that the total FI was lower before. I simply don’t measure it, because my methodology cannot do that.
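Read this way, the “jump” reduces to a difference of human-similarity scores across the transition. A minimal sketch, with two hypothetical placeholder scores (not real alignment results for CARD11):

```python
# Sketch of an "information jump in human-conserved sequence similarity":
# the jump at a transition is the best alignment bit-score against the
# human protein after the transition minus the best score before it.
# Both scores below are hypothetical placeholders, not real BLAST output.
best_score_before_vertebrates = 350.0    # hypothetical best pre-vertebrate hit (bits)
best_score_in_early_vertebrates = 1600.0  # hypothetical best cartilaginous-fish hit (bits)

jump_bits = best_score_in_early_vertebrates - best_score_before_vertebrates
print(jump_bits)  # 1250.0
```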

And this is it. Think as you like, but at least try to understand what I say and what I am doing. Or just don’t try, if you prefer.

@gpuccio did you misread @glipsnort?

Overestimate is the opposite of underestimate. We are saying you are wildly overestimating FI, not underestimating it.

OK, I will try to be simple and clear.

I am here to discuss a methodology that can, I believe, give important indications about a certain type of FI in proteins (the one that can be revealed by long sequence conservation), its appearance at certain evolutionary times, and its different behaviour in different proteins. And it can give a good idea, by establishing a reliable lower threshold of new FI appearing at certain steps, of how big the functional content of many proteins is. These data are very interesting, in my opinion, to support a design inference in many cases. This is my purpose, and nothing else.

Now, maybe a phylogenetic analysis could do that better. Or maybe not. I don’t know, and I certainly cannot perform a phylogenetic analysis now. I am not aware of phylogenetic analyses that are centered on the concept of FI as formulated in ID, least of all on design detection. So, I have my doubts.

However, I am not here to perform a phylogenetic analysis, I am here only to explain and defend my ideas, and I try to do exactly that.

So, again, I am convinced that my methodology is a good estimator of that part of FI which is connected to sequence conservation, for example from cartilaginous fish to humans, and that appears in vertebrates. The same procedure can also be applied to other contexts, of course.

I have received, from you and others, a few recurring criticisms that are simply not true or not pertinent. Here are a couple of examples:

Wrong. My estimate of FI is, rather, Total FI - functional divergence (FI not conserved up to humans). Therefore, as stated also by glipsnort, my estimate is underestimating FI, not overestimating it. Moreover, NE has nothing to do with this.

Why? And what do you mean by NE? Do you mean NV and ND? Why should that “be mistaken” as FI gain? By my procedure? There is absolutely no reason for that. Neutral variation is the cause of divergence in non-functional sequences. Why should it be mistaken as FI gain by a procedure based on sequence conservation? I really don’t understand what you mean.

And so on. How can we discuss with such basic misunderstandings repeated so many times, and without any real explanation of what is meant?


Did you misread @glipsnort?

He is saying that it is possible, even likely, to underestimate the total FI present (true) while overestimating the change in FI.

And have you read my answer to him? I quote myself:

My procedure cannot overestimate FI, only underestimate it. My estimate of the change is only an estimate of the change (jump) in human conserved similarity. It is not, and never has been said to be, a measure of the change in total FI.

OK, I was answering your very disappointing post about the starry sky, but for some strange reason I have lost everything I had written. Maybe it is better this way. Now I am tired. Tomorrow I will see if I really want to say those things.

This is bad. Frankly, I would probably not even answer this kind of argument, if it did not come from you.

I don’t know if you really believe that the stars in the sky exhibit high values of FI, or if you are only provoking (without any good reason, IMO).

If you really believe that, there is probably no purpose in continuing any discussion about FI.

If you are provoking, it’s not a good sign just the same.

However, here is a brief answer.

The simple rule I have described (and which is rather obvious in any serious discussion about FI) is that we cannot use the observed bits to define the function. The function must be defined independently of the knowledge of the observed bits.

So: “a configuration of stars that favors storytelling” is valid. But probably almost all possible configurations would do that.

While “a configuration of stars where the first has these celestial coordinates, the second these other ones”, and so on for all 9,000 visible stars, is not valid.

So, “a binary number of 100 digits” is a good definition. And, of course, it has no relevant FI.

A binary number that is 00110100… is not a valid definition. It can be used only as a pre-specification.

This is the rule, and I have never broken it. I have never defined a protein function that says: “a protein with the following sequence: …” I have always used for proteins the function described in Uniprot for the observed protein, or something like that. IOWs, a protein which can do this and that. Never: “a protein with this sequence”.
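For reference, the quantity both sides are arguing over is usually written (following the standard functional-information measure of Hazen and Szostak) as the ratio of target space to search space. Defining the function by the observed bits collapses the target space to a single configuration, which trivially maximizes FI:

```latex
% Functional information of a function f, with search space S and
% target space T_f = {configurations in S that perform f}:
FI(f) = -\log_2 \frac{\lvert T_f \rvert}{\lvert S \rvert}
% Defining f by the observed configuration itself forces |T_f| = 1,
% so FI(f) = \log_2 |S|, the maximum possible value.
```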

But you say: no, I want the stars to have exactly the positions that we know. That is breaking the rules. You are using the bits. I have never done that.

You say:

"It seems you are defining function by the already observed configuration of proteins in extant biology. "

Not at all. That statement is unfair, wrong and confounding.

I am always defining function as what a protein can do. I am using observed configurations, in a precisely described way and according to well-explained assumptions, only to estimate FI in proteins, not to define function. You are equivocating, and rather badly.

You raise the problem of other sequences that could implement the function. But my methodology is aimed exactly at that: having an estimate of the target space. If you do the math, you will see that the estimates of the target space in my results are very big.

Of course, there is always the problem of possible alternative solutions, similarly complex, but completely different. Those cannot easily be anticipated. They certainly exist, in some measure.

That is a completely different problem. It has nothing to do with the definition of the function, but rather with the estimate.

I have discussed that problem in detail in the past. You will find a long discussion about that in this OP and in the following thread:

Defending Intelligent Design Theory: Why Targets Are Real Targets, Probabilities Real Probabilities, And The Texas Sharp Shooter Fallacy Does Not Apply At All.

Look at the part about clocks.

Okay, can you clarify how you implemented this rule in your analysis?

In my discussion about the relationship between FI and the design inference (that has nothing to do with my methodology to estimate FI) I have given a clear example of FI in language and how to measure it. Please, refer to that. The Shakespeare sonnet. You will find many possible functional definitions for the sonnet, each of them implying different levels of FI. The bits in the sonnet (the sequence of letters) are of course never used in any definition.

In my procedure to estimate FI in proteins, the function is not defined (it is supposed to be the one described in Uniprot, if and when known). The estimate is based on conservation, which is an indicator of functional constraint, but does not tell us what the function is.

Those are two different things.

I am going to sleep. Tomorrow we can go on. :slight_smile:


OK, now after some rest, let me go back to the starry sky example. I will show how it should be treated in terms of FI.

We have a system where 9,000 stars can each have an independent position in the sky. They can also have different brightness.

We define a function: that the stars can help orientation and navigation.

Let’s assume that the position of the stars is a random configuration. We have a system with a very big search space. A lot of possible configurations, considering both position and brightness.

The right question, from the point of view of FI, is: how many of the possible configurations would satisfy the independently defined function? How big is the target space? Is it an infinitesimal fraction of the search space (high FI), or rather a big part of it?

The answer here, while difficult to compute in detail, is easy enough in principle: the target space is almost as big as the search space. Therefore, FI is really trivial, almost zero.


Because, of course, almost all the possible random configurations of position and brightness can help orientation and navigation.

Not all of them, however.

Very ordered configurations, those where all the stars are more or less equally distributed in the sky and brightness is more or less the same for all of them, would not help orientation and navigation. We would just see a sky that is the same everywhere, and that does not allow us to get information about earth’s rotation and our position.

But of course, those highly ordered configurations are really, really rare in the search space.
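A quick Monte Carlo sketch makes the point quantitative. The star count, bin count, and uniformity cutoff below are all illustrative assumptions, not a calibrated model of naked-eye navigation; the check simply rejects skies that are "grid-like" in the sense described above.

```python
import math
import random

# Monte Carlo sketch of the target-space argument: draw random skies and
# reject only those "too uniform" to aid navigation, proxied here by an
# implausibly small chi-square statistic over coarse sky bins. All the
# constants are illustrative assumptions.
random.seed(0)
N_STARS = 900   # scaled down from ~9000 for speed
N_BINS = 30     # coarse partition of the sky
TRIALS = 2000

def too_uniform(positions, n_bins=N_BINS):
    """True if star counts per bin are nearly equal, i.e. the chi-square
    statistic sits far below the ~n_bins value typical of a random sky."""
    counts = [0] * n_bins
    for p in positions:
        counts[int(p * n_bins)] += 1
    expected = len(positions) / n_bins
    chi2 = sum((c - expected) ** 2 / expected for c in counts)
    return chi2 < 5.0  # hypothetical cutoff for "grid-like" skies

useful = sum(
    not too_uniform([random.random() for _ in range(N_STARS)])
    for _ in range(TRIALS)
)
fraction_useful = useful / TRIALS
fi_bits = -math.log2(fraction_useful)
print(fraction_useful, fi_bits)  # essentially all random skies qualify; FI is ~0 bits
```

Under these assumptions virtually every random configuration passes, so the estimated FI of "a sky that helps navigation" is negligible, which is the claim being made above.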

This is an interesting example, and I thank you for providing it, because it is a case where order does not satisfy the function, while randomness does. Unfortunately (for your argument), the FI linked to such a system is almost zero.

I think this is a good answer for your other examples too, but if you believe that they have different merits, please explain why.