Gpuccio: Functional Information Methodology

Art · August 26, 2019, 7:32pm

In my opinion, this is needlessly redundant. Moreover, it strays from the usages first put forward by Dembski (who only spoke of specifications). For me, when it comes to proteins (the usual focus of biological ID), a specification is the same as the biochemical activity possessed by the protein - ATPase, RNA binding, etc., etc. I have never found it necessary to introduce more layers of confusion and terminology on this.

glipsnort · August 26, 2019, 7:34pm

Be that as it may, this thread is about @gpuccio’s argument, and that argument relies on BLAST bit scores being a good proxy for the defined sense of FI. But they’re not.

Art · August 26, 2019, 7:35pm

Understood and agreed. I just thought I would toss out my own approach to keeping track of definitions and the like, and making sense of things.

Back to our regularly-scheduled program.

swamidass · August 26, 2019, 8:26pm

I see some answers to my questions…

gpuccio:

I will be more clear. A strong sequence similarity between, say, a human protein and the chimp homologue of course is not a good estimator of FI. The reason for that should be clear enough: the split between chimps and humans is very recent. Any sequence configuration that was present in the common ancestor, be it functionally constrained or not, will probably still be there in humans and chimps, well detectable by BLAST, just because there has not been enough evolutionary time after he split for the sequences to diverge because of neutral variation. IOWs, we cannot distinguish between similarity due to functional constraint and passive similarity, if the time after split is too short.

But what if the time after split is 400+ million years, like in the case of the transition to vertebrates, or maybe a couple billion years, like in the case of ATP synthase beta chain in E. coli and humans?

This is helpful and distinguishes you from @Kirk. We agree that common descent can produce similarity, and that this would look like FI in your computation. We have had a hard time establishing this point with other ID luminaries.

The way you account for this is by only looking at ancient proteins, where you hope that millions of years would be sufficient to erase this effect. How do you know 400,000 million years is long enough to erase this effect?

Moreover, you are computing two numbers. One inside the clade, and another outside. It seems you have explain how this works for both numbers.

I have not seen the answer to this yet, and it seems critical. Neutral co-evolution, it seems, will produce the illusion of FI with your methodology, especially at long time-scales.

gpuccio · August 26, 2019, 8:32pm

I thank Swamidass and the others for the attention to my “risk of being overwhelmed”. I would like to answer everyone, or at least all those who offer interesting contributions, but the risk os real. I am one, and my resources are limited. So, I will profit of this new “anti overwhelming” policy, but I will also try to have a look at waht others say, and if possible answer them.

I will answer different points that have been raised as it comes. There is so much to say. I don’t want to convince anyone, but I would like very much to clarify what I believe as well as possible.

So, let’s start.

gpuccio · August 26, 2019, 8:56pm

Swamidass:

I will start with an easy point: your question about neutral drift. I think that will be valid for coevolution also.

First of all, I am well aware of nutralism and of drift as important actors in the game. Indeed, you can see that my whole reasoning to measure FI is based on the effects of neutral variation and neutral drift. What else erases non functional similarities given enough evolutionary time?

So, I have no intention at all to deny the role of neutral variation, of neutral drift, and of anything else that is neutral. Or quasi neutral.

My simple point is: all this neutral events, including drift, are irrelevant to ID and to FI.

The reason is really very simple. FI is a measure of the probability to get one state from the target space by a random walk. High FI means an extremely low probability if reaching the target space.

Well, neutral drift does not change anything. The number of states that is tested (the probabilistic resources of the system) remains the same. The ratio of the targte space to the search space remains the same. IOWs, neutral drift has no influence at all on the probabilistic barriers.

Why? Because it is a random event, of course. Each neutral event that is fixed is a random variation. There is no reason to believe that the mutations that are fixed are better than those that are not fixed, in the persepctive of getting to the target. Nothing changes.

Look, again I am trying to answer briefly. But I can deepen the discussion, if you let me know what you think.

Just a hint. To compute FI in a well defined system, we havt to compute, durectly or indirectly, three different things:

The search space. That is usually easy enough, with some practical approximations.
The target space. This is usually the difficult part, and it usually requires indirect approximations.
The probabilistic resources of the system. FI (-log2 the ratio of the target space to the search space) is a measure of the improbability of finding the target space in one random event. But of course the system can try many random events. So, we have to analyze the probabilistic resources of the system, in the defined time window. This can usually be donne by considering the number of reproductions in the population, and the number of mutations. IOWs, the total number of genetic states that will be available in the system in the time window.

I have discussed many aspects of these things in this OP:

What Are The Limits Of Random Variation? A Simple Evaluation Of The Probabilistic Resources Of Our Biological World

I give here also the link to my OP about the limits of NS:

What Are The Limits Of Natural Selection? An Interesting Open Discussion With Gordon Davisson

In those two OPs, and in the following discussions, I have discussed many aspects of the questions that are being raised here. Of course, I will try to make again the important points. But please help me. When what I say seems too brief or not well argumented, consider that I am trying to give the general scenario first. Ask, and I will try to answer.

gpuccio · August 26, 2019, 9:06pm

Usually, when you look at synonimous sites, it is very difficult to detect any passive sequence similarity after such a time. IOWs, we reach “saturation” of the Ks.

The R function that I use to compute Ks usually gives me a saturation message for synonimous sites in proteins that have such an evolutionary distance.

Moreover, the fact is rather obvious when we look at the very different behaviour of proteins in the transition to vertebrates. The example of CARD11 is really an extreme case. Many proteins have very low sequence similarity between fishes and humans. The human configuration, in those proteins, begins to grow at later steps. There are proteins that have big neutral components, and the neutral part in not conserved at all thorughout that evolutionary window.

So, I have all reasons to believe that 400 million years are enough to make conserved information a good estimator of FI. Remember, we are not looking for perfection here. Just a good estimate.

Art · August 26, 2019, 10:08pm

I would note that the bits calculated in this cited essay cannot be equated with bits derived from either BLAST analyses or the equation for FI that has been given above. They are quite completely different, and the calculation in the essay simply does not provide a bit-based estimate of probabilistic resources.

@gpuccio gpuccio, have you ever considered trying to derive expressions that would formalize the relationships between the different information calculations you have presented here? I believe this is possible for the BLAST-FI usages, and a clever mathematical biologist could probably do something similar for the metric you describe in the cited essay. I believe this could be a valuable addition to this field, as it were, and would address a lot of the concerns and confusion that you may see here, or have encountered before.

gpuccio · August 26, 2019, 10:27pm

I don’t follow your reasoning.

A sequence of 100 bp where each bp must be specific for the function to be present has, of course, a FI of 200 bits:

Target space = 1

Search space = 4^100

Target space/Search space = 6,22302E-61

FI = 200 bits

The FI expresses the probability of finding the target space from an unrelated state in one attempt. In this case, it is 1:4^100

I don’t follow your reasoning.

Of course, the perfect conservation of that sequence would inform us that the sequence has 200 bits of FI. Indeed, the bitscore of a 100 bp sequence against itself is 185 bits. Which seems good enough.

gpuccio · August 26, 2019, 10:36pm

Why not?

I have considered the probabilistic resources of our planet as a higher threshold of the number of possible states visited by a super population of bacteria inhabiting our planet for 5 billion years and reproducing at a very high rate.

This is of course an exaggeration, and a big one, but the idea is correct, I believe. The probabilistic resources of a systerm are the number of states that can be randomly reached. It is similar to the number of times that i can toss a coin. They can be expressed as bits, just taking the positive log2 of the total number of states.

So, if I have a sequence that has a FI of 500 bits, it means that there is a probability of 1:2^500 to get it in one random attempt. If my system has probabilistic resources of 120 bits (IOWs, 2^120 states can be reached), the probability of reaching the target using the whole probabilistic resources is still 1:2^380.

What’s wrong with that?

Of course, as I have said, the Blast bitscore is not the FI. But, provided that the conditions I have listed are satisfied, it is a good estimator of it. Look also at my answer to glipsnort, that I have just published.

Please, let me know what you think. Thank you.

sfmatheson · August 27, 2019, 12:50am

Your answer, I think, missed the point. The issue is not, as you seem to think (and I could be wrong) that neutral drift precludes design. (Edit for clarity: what I mean is that the challenge posed by @swamidass is not an assertion that drift cannot lead to design.) The issue is that drift (the key concept being neutral drift) will create sequence divergence that is uncoupled from function. In situations in which you have neutral sequence divergence, you have a nice negative control, which is one theme that @swamidass has emphasized without success. Any metric that supposes itself to measure “functional information” must be able to distinguish random drift from functional difference. Until you can show your method can do this, it seems to me that you have nothing more than a calculation of divergence (or of its converse, conservation) that is neither interesting nor informative.

And, again, I don’t see (but I could have missed it) an understanding of the need for a phylogenetic framework in which to interpret this or any other quantification of sequence divergence over evolutionary time. This could potentially help you find interesting signals of conservation/divergence, but without it, I think you lack any context to understand what your metric means. The exceptions would be proteins or genes about which we know the functional implications of sequence differences; but in those cases, we don’t need a new metric to calculate “FI” or anything similar.

swamidass · August 27, 2019, 2:00am

@glipsnort, I do not think this is correct. It would, rather, be log(1/4) * 100, which would be about 200 bits.

Which, of note, is very close to 200 bits - log 3e9. So this is ends up being very close to the estimate in this particular case.

This is what @gpuccio says, and I think he is correct:

So I think I agree with @gpuccio here, unless I am missing something @glipsnort. I think he is right here.

In contrast, I agree with @sfmatheson:

sfmatheson:

The issue is not, as you seem to think (and I could be wrong) that neutral drift precludes design. (Edit for clarity: what I mean is that the challenge posed by @swamidass is not an assertion that drift cannot lead to design.) The issue is that drift (the key concept being neutral drift) will create sequence divergence that is uncoupled from function. In situations in which you have neutral sequence divergence, you have a nice negative control, which is one theme that @swamidass has emphasized without success. Any metric that supposes itself to measure “functional information” must be able to distinguish random drift from functional difference. Until you can show your method can do this, it seems to me that you have nothing more than a calculation of divergence (or of its converse, conservation) that is neither interesting nor informative.

And:

I will be watching your response closely @gpuccio, and doing what I can to acknowledge when you make a solid point, and to simplify/distill the conversation down to the core issues at hand.

glipsnort · August 27, 2019, 2:26am

You are missing something. FI is not defined in terms of accessibility of functional states from this sequence. It’s defined in terms of target and search space. The search space is the space of all 100 bp sequences. The target space is the set of all 100 bp sequences that perform the same function, regardless of whether their sequences resemble the one that we happen to be looking at (presumably because it happens to be the one used by some lineage of organisms). By my statement of the scenario, the functional sequences constitute 0.1% of the search space. This specifies the FI of this functional class to be 10 bits, whether or not the sequence in question can reach those other states by single mutations.

Whether this particular sequence can reach any of those other functional sequences by allowed mutations has only the most tenuous relationship to FI, in fact. I didn’t state it explicitly, but I meant that functional sequences are randomly distributed throughout the search space. Given that, there is a reasonable probability that one or two functionally equivalent sequences are single-mutation neighbors of the one we’re examining, but the most probable case is that there are zero. Regardless, the sequence as I have described it will be highly conserved, even though FI is low.

swamidass · August 27, 2019, 2:29am

Okay, that makes sense, but it is not what you wrote before. It was not just @gpuccio that was confused.

@gpuccio, the point is that each individual 100 bp sequence is fragile, and needs to be 100% conserved, but there are so many distinct functional 100 bp sequences that the FI is lower than we would compute if we just considered the one 100 bp target we found (and passed on by common descent). This is an example where the general approach you are taking would fail. FI would be far less than sequence conservation would indicate.

There are several other failure cases visible.

gpuccio · August 27, 2019, 8:32am

Maybe I was not clear enough.

My point about neutral drift is not that it precludes design, or not. Or that it leads to design, or not. My point is that neutral drift is irrelevant for FI and the design inference.

I have also specified the reson for that. Neutral drift does not change any of the factors that influence FI and the design inference:

a) It does not change the target space

b) It does not change the search space

c) It does not change the probabilistic resources of the system.

IOWs, neutral drift neither precludes design nor leads to it, and it makes the generation of FI neither easier nor more difficult. I hope that is clear.

But, of course, neutral variation (which is the result of neutral drift) is instead an important part of my methodology to measure FI in proteins. Indeed, as explained many times, my whole procedure is based on the two pillars of neutral variation and negative (purifying) selection.

You say:

And you are right, of course. I have clarified those things myself, in comment #36. I quote myself:

gpuccio:

It should be clear that my methodology is not measuring the absolute FI present in a protein. It is only measuring the FI conserved up to humans, and specific to the vertebrate branch.

So, let’s say that protein A has 800 bits of human conserved sequence similarity (conserved for 400+ million years). My methodology affirms that those 800 bits are a good estimator of specific FI. But let’s say that the same protein A, in bees, has only 400 bits of sequence similarity with the human form. Does it mean that the bee protein has less FI?

Absolutely not. It probably just means that the bee protein has less vertebrate specific FI. But it can well have a lot of Hymenoptera specific FI. That can be verified by measuring the sequence similarity conserved in that branch for a few hundred million years, in that protein.

So, I will try to be even more clear.

When we have a bitscore from Blast, we are measuring of course both conservation and divergence. We can divide that into 4 different components:

a) Sequence conservation due to passive common descent. That is the component that we cancel, or minimize, by guaranteeing that we are blastin sequences separated by a very long evolutionary time, because practically all passive similarity will have been erased by neutral variation, if that part of the sequence is not functional.

b) Sequence similarity preserved by negative (purifying) selection. This is what we measure by the bitscor, of the condition above mebtioned is satisfied. This is what I call a good estimator of FI.

c) Divergence due to neutral variation and drift.

d) Divergence due to different functional specificities in the two organisms. Let’s call this “functional divergence”.

Now your point seems to be that my method cannot distinguish between c) and d). And I perfectly agree. But I have never claimed that it could.

My method is aimed to measure that part of FI that is linked to long sequence conservation. Nothing more. That is of course only a part of the total FI. Let’s say that it is the part that can be detected by long sequence conservation.

I have always made that very clear. My method underestimates FI, and this is one of the resons for that. So, we can consider my methodology as a good estimator of a lower threshold of FI in a protein.

That’s perfectly fine for the purpose of making a design inference. When I find 1250 bits of detectable FI in CARD11, I can well infer design according to ID theory. If the FI is more than that, OK. But that value is much more than enough.

As stated many times, ID is a procedure to infer design with no false positives, and many false negatives. False negatives are part of the game, and the threshold is conceived to guarantee the practical absence of false positives.

Of course, functional divergence is a very interesting issue too. But it requires a different apporach to be detected. I will discuss that briefly in next post.

gpuccio · August 27, 2019, 9:13am

Just to have some breath, I quote here this interesting comment by Giltil on the other thread:

Comments on Gpuccio: Functional Information Methodology

Dr. Swamidass: What empirical evidence do you have that demonstrates FI increases are unique to design?

I missed a good chunk of the exchanges. Did Gpuccio or anyone else from Team ID ever answer this rather critical question on this or the main thread?

I think you will find gpuccio answer in the main thread.
But maybe I can give you the answer.
The question is whether the claim « high FI signifies design » is warranted.
The answer is yes, both theoretically and empirically. Theoretically because there is no sufficient probabilistic ressources in the universe to find a target with a FI > 500 bits.
Empirically because there is not a single example that an object exhibiting such high FI has ever been produced without design; each time the cause of an object with high FI is known, this cause is invariably an intelligent cause.
In fact, the statement that high FI signifies design may probability be seen as a scientific law, in par with the second law of thermodynamics!
In the context of @gpuccio main thread, this leads us to the only real question, I. E., does the FI associated with some well defined transitions in life history is high enough to warrant a design inference? For the answer, stay tune with the ongoing dialogue.

I will come back to some of these concepts as soon as I can.

gpuccio · August 27, 2019, 9:28am

sfmatheson and Swamidass:

I have always been interested in the issue of functional divergence.

One way to test for functional divergence could be to use my methodology in two separate branches of evolutionary history. The sequence configuration that is not shared in the two branches but is highly conserved in each branch would be a good candidate for functional divergence. I have tried something on that line in this OP of mine:

Information Jumps Again: Some More Facts, And Thoughts, About Prickle 1 And Taxonomically Restricted Genes.

Another way, of course, is to prove the function of the non conserved sequence directly. Transcription Factors are a very good example of that. They have one or more DNA binding domains that are usually highly conserved. The rest of the molecule (often half of the sequence or more) is not very conserved. However, there are many indications that important functions, different from DNA binding, are implemented by that part of the protein.

I have discussed a recent, very interesting paper about RelA, an important TF, which demonstrates how much of the function is linked to the non DBD part of the molecule. If you are interested, you will find my comment about that at #29 in the thread linked at the start of this thread:

Controlling The Waves Of Dynamic, Far From Equilibrium States: The NF-KB System Of Transcription Regulation.

gpuccio · August 27, 2019, 11:32am

In the past few days I was away from home, and I could not access my database.

Now I was able to draw a graph of the two examples I made in the initial summary about my methodology. Here it is:

gpuccio · August 27, 2019, 12:32pm

Swamidass et al. :

So, here is your question #1:

I have explained that the connection is empirical, even if with a good rationale. I quote myself:

This is the empirical connection. Based on observed facts.

Of course, you are not convinced. You ask for more, and you raise objections and promises of counter-examples. That’s very good.

So, let’s go to my two statements. I will try to support them both. But in reverse order.

My second statement is:

“FI higher than 500 bits (often much higher than that) abunds in designed objects. I mean human artifacts here.”

Your objection:

But I have given a very precise definition of FI. What is the problem here?

To be more clear, I will describe here the three main classes of human artifacts, designed objects, where “FI higher than 500 bits (often much higher than that) abunds”. They are:

a) Language

b) Software

c) Machines

The first two are in digital form, so I will use one of them as an example, in particular language.

I have shown in detail how FI can be indirectly computed, as a lower threshold, for a piece of language. I link here my OP about that:

An Attempt At Computing DFSCI For English Language

A clarification: dFSCI is the acronym I used for some time in the past to point to the specific type of information I was discussing. It means digital Functionally Specified Complex Information. It was probably too complicated, so later I started to use just Funtional Information, specifying when it is in digital form.

The piece of language I analyze in the OP is a Shakespeare Sonnet (one of my favourite, I must say).

My simple conclusion is that a reliable lower threshold of FI for such a sonnet is more than 800 bits. The true FI is certainly much more than that.

There has been a lot of discussion about that OP, but nobody, even on the other side, has really questioned my procedure.

So, this is a good example of how to compute FI in language, and of one object that has much more than 500 bits of FI. And is designed.

Of course, Hamlet or any other Shakespeare drama have certainly a much higher FI than that.

The same point can be easily made for software, and for machines (which are usually analogic, so in that case the procedure is less simple).

So, I think that I have explained and supported my second point.

If you still do not have a clear understanding of my definition of FI, and how to apply it to that kind of artifacts, please explain why.

So, let’s go to my first statement.

“Leaving aside biological objects (for the moment), there is not one single example in the whole known universe where FI higher than 500 bits arises without any intervention of design.”

I maintain that. Absolutely. Your objection:

OK, I invite you and anybody else to present and defend one single counter-example. Please do it.

You mention two things that you have offered before.

a) Cancer

b) Viruses.

I have already declared that cancer is not an example of a design system, and I maintain it. Technically, it is a biological example, but as I have agreed that it is not a design system, I am ready to discuss it to show that you are wrong in this case. I want, however, to clarify that I stick to my declared principle to avoid any theological reference in my discussions. I absolutely agree that cancer is not designed, but the reason has nothing to do with the idea that “God would not do it”. It is not designed because facts show no indication of design there. I am ready to discuss that, referring to your posts here about the issue.

For viruses, I was, if you remember, more cautious. And I still am. The reason is that I do not understand well your point. Are you referring to the existence of viruses, or to their ability to quickly adapt?

For the second point, I would think that it is usually a non design scenario, fully in the range of what RV + NS can do. I must state again, however, that I am not very confident in that field, so I could be wrong in what I say.

For the first point, I am rather confident that viruses have high levels of FI in their small genomes, and proteins. They are not extremely complex, but still the genes and proteins, IMO, are certainly designed.

So, are viruses designed? Probably. My only doubt is that I don’t understand well what are the current theories about the origin of viruses. My impression is that there is still great uncertainty about that issue. I would be happy to hear what you think. In a sense, viruses could be derived from bacteria or other organisms. Their FI could originate elsewhere. But again, I have not dealt in depth with those issues, and I am ready to accept any ideas or suggestions.

Again, I have no problem with the idea that viruses may be designed. If they are, they are.

So, my support of my first statement is very simple. I maintain that empirically there is no known example of non biological objects exhibiting more than 500 bits of FI that are not designed human artifacts, I invite everyone, including you, to present one counter-example and defend it.

I am also ready to discuss your biological example of cancer. That requires, of course, a separate discussion in a later comment.

For viruses, please explain better what is your point. The information in their genes and proteins is of course complex, and designed. Their adaptations, instead, as far as I can understand, do not generate any complex FI.

sfmatheson · August 27, 2019, 1:21pm

You are in a conversation with knowledgeable experts on this topic, including computational biologists (@swamidass and @glipsnort) with depth of expertise among the world’s best. I am not in their league, and yet I can see at a glance what your metric is doing. You are using BLAST to estimate sequence conservation in a few lineages. That’s all you’re doing. I know you are also calculating some kind of probability, but that probability is completely meaningless without very important additional information. This is what my questions are getting at, and I am not seeing any meaningful answers.

I might be missing something about the metric, but I doubt it. I know I’m not missing anything about the probability calculations, because those can’t be meaningful by themselves. (That’s old, old hat in “design” debates.)

The point about neutral evolution is about applying a method to a negative control. Every method of every kind in science has to do this in order to claim efficacy or even meaning. When you read phylogenetics papers, you should notice this. (Outgroups are frequently serving as negative controls.) When you read any paper about any analytical method in any good journal, you will see negative controls, and in many cases you will see that analysis of the negative control is a substantial undertaking that accounts for much of the work. In the case of “FI,” the scientists here would be negligent if they didn’t immediately ask you and themselves about negative controls. A method is advanced to detect X. It should register something close to zero when X is not present. That’s what I’m asking you, and without a substantive answer I can’t even consider whether the method is interesting.

And that’s not even addressing the phylogenetic problem.

These questions are actually really basic. They are not new, and the answers are not simple, and some of the experts here (not me) devote significant portions of their professional effort to asking questions quite like this. Without good answers, you can only say that you used BLAST to “measure” sequence conservation in a few hand-picked evolutionary trajectories. And that, my friend, is not informative. My opinion is that it isn’t even a good start.

If I have misunderstood the metric, please let me know. Otherwise, I will leave you to discuss ways to improve your analysis with two of the world’s most skilled computational biologists.

Topic		Replies	Views
Gpuccio on Common Descent Conversation Science	1	751	August 26, 2019
Miller: Axe Decisively Confirmed? Conversation Science , Design	31	4568	February 23, 2019
Gauger and Mercer: Bifunctional Proteins and Protein Sequence Space Office Hours Design	188	7478	November 15, 2018
Mercer's Work on Protein Function and Sequence Space Office Hours Design	5	810	June 19, 2021
Simulating 500 million years of evolution with a language model Conversation Science , Artificial-Intelligence	9	183	February 2, 2025

Gpuccio: Functional Information Methodology

Related topics