Jonathan Bartlett: Measuring Active Information in Biological Systems

This is meant to be a thread on my paper, Measuring Active Information in Biological Systems.

Sorry I have not been participating in the initial conversation, but the conversation seemed like it was already generating more heat than light when I looked in on it. Additionally, I have spent the last month in a combination of finalizing my work at New Medio / ITX (where I have worked for almost 20 years) to start a new position on Monday, finalizing a new book on electronics, helping various organizations get connected digitally during this time, and it is the end of the school year for our homeschool co-op, so my time has been limited.

Anyway, I have been answering questions about the paper at The Skeptical Zone, but thought I would also stop by here to answer questions. I would like this thread to be dedicated to the content of the paper (and obviously related items). If there are questions that are outside of this, I would like those to be handled in a different thread. Note that the paper does not specifically mention Intelligent Design, or the paper’s relevance to Intelligent Design. While I think the paper does have relevance to Intelligent Design, I would like to separate that from the discussion of the paper itself. So, if people want to discuss that, we can open up a thread for that. However, I think the paper stands alone as a useful contribution, whether or not it can be linked to ID.

So, let me start with an overview of the paper and the background that it comes from. I have been arguing for a long time that mutations are indeed not random with respect to the needs of the organism. You can see a short, short video I did on this a long time ago, or a more in-depth one here, or an old BIO-Complexity paper about the subject here.

The problem I found was that, even though many of the facts of directed mutations were widely known, people did not regard those mutations as directed. See for instance here for a conversation with Larry Moran which is pretty typical of the conversations I have. Despite agreeing that mutations are only occurring in the gene that needs to be modified, they don’t view this as being “directed”. Essentially, everyone has a preconceived notion in their mind about what “direction” should look like. If the mode of directionalization doesn’t match their particular preconceived notion, they simply state that the mutation is not directed.

Additionally, I found that there was no quantitative evidence that anyone could point to for mutations being random. This was simply being stated in the literature without justification. It might be true, but, lacking a mechanism for quantification, it would be impossible to know. On that note, I should point out that it is theoretically possible to agree with my paper almost wholeheartedly and not believe in directed mutations. That is, I could perhaps have come up with a correct way of quantifying it, but, when we actually apply that quantification to nature itself, it always comes up zero. This would mean both that my paper is correct, and that random mutations are the norm (note - I do have some examples of positive active information in my paper, but, in theory, these could wind up being total anomalies to the norm). Additionally, as mentioned earlier, one could also agree with my paper wholeheartedly, think that directed mutations occur, and not think it has anything at all to do with Intelligent Design.

Therefore, my goal with using active information was to find a way to quantify “directedness” which was independent of the actual mechanism used for directing. Active information measures what an actually randomized process would achieve and compares that to what is actually happening. So, we can compare and see if the process that is actually happening in biology is better than, equal to, or worse than random with respect to the fitness of the organism. Obviously there will be chemical bias in the mutation system. But, if that bias reliably points more in the direction of fitness than a non-biased process, then that is an effect that requires explanation. If the specific base pairs that are more likely to be mutated are more likely to be beneficial than other base pairs chosen at random, then that is an effect that requires explanation.
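A minimal sketch of that comparison, assuming the log-ratio form of active information from Dembski and Marks's work on search (the base-2 log of the observed success rate over the baseline success rate). The function name and the example rates are hypothetical and are not taken from the paper, whose exact calculation may differ:

```python
import math


def active_information(p_actual: float, p_baseline: float) -> float:
    """Bits by which the observed process beats (positive) or trails (negative) the baseline.

    p_actual:   fraction of observed mutations that turn out beneficial (assumed input)
    p_baseline: fraction of blindly chosen mutations that turn out beneficial (assumed input)
    """
    if not (0.0 < p_actual <= 1.0 and 0.0 < p_baseline <= 1.0):
        raise ValueError("success rates must lie in (0, 1]")
    return math.log2(p_actual / p_baseline)


# Hypothetical rates chosen only to make the arithmetic easy to follow.
print(active_information(1 / 8, 1 / 64))  # observed process is 8x better than baseline
print(active_information(1 / 8, 1 / 4))   # observed process is 2x worse than baseline
```

A result of zero means the observed process does no better than the baseline, which is the "always comes up zero" outcome mentioned earlier.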

The way I actually envision active information being used, however, is the opposite. My hope is that people will use active information in order to tell when a mechanism for mutation should be searched for. It takes a lot of time, effort, and money to search for a biological mechanism for mutation (Zhang and Saier have done a lot of work on this, and, if you read their papers, it takes a lot of experiments to track this stuff down). Therefore, it would be good if, going into the process, you knew there was something to find. My hope is that active information will be used like a metal detector. It is a simple way to find out whether or not there is something happening worth investigating. That way, time, effort, and funds can be directed to elucidating the systems most likely to be biologically interesting.

I also fully expect that, if people wind up agreeing with active information, they will eventually find it so obvious as to not be worth speaking about. I actually think that’s largely true, except for the fact that, as of this moment, people do have a mental block when thinking about directed mutations. My hope is that active information will remove this mental block. If that happens generally, then my guess is that shortly afterward, everyone will wonder what the hubbub was originally about, and why my paper was needed to begin with, since the concept was so obvious.

Additionally, my paper also notes that actually performing these experiments leaves a lot of questions, because there will always be intermixing between induced random mutations, and the mutations that the organism is already doing (whether they are random or not). The paper offers a statistical mechanism for separating these out in certain circumstances.

So, if you can take the time to read the paper, please do so, but I think that is enough to start a conversation. I don’t have time to just sit on the thread all day, but I will try to answer questions at least by the end of each day (though probably not Monday). If you would like a side topic addressed (i.e., something not relevant to this thread), just make a new topic and link to it here so it is findable.

4 Likes

Thanks for the comments @johnnyb. I’m moving this to a scholars thread.

We invite scientists to engage with you, and @moderators will be enforcing tighter rules on this thread. Explain disagreements but refrain from ad hominems. Be respectful. It is a privilege to have @johnnyb here to discuss his work.

We ask readers to assist moderators by flagging any inappropriate posts.

@Joe_Felsenstein, @glipsnort, @sygarte, @Tom_English, and @dga471 may be important contributors to discuss with, if they choose to participate.

2 Likes

I have trouble with the most basic statements in that paper. I find your writing confusing, which I hope is not intentional. The answers to several questions might help. Could you define “active information”? Why have you chosen that as the term? Is this information in any technical sense? Do you think that transition/transversion bias demonstrates active information? How do you choose what distribution to consider “random”? Does changing the rate of mutations while keeping the same distribution constitute active information?

Do you agree that much (perhaps 90%) of the human genome is junk? If so, would the mutation spectrum in junk DNA be a good one to choose as a random distribution?

6 Likes

@johnnyb, I’ve read the paper, and I also remember talking to you in person about these ideas.

As I understand it, you see any deviation from a uniform distribution as evidence of active information. Is that right?

So…

I recall specifically asking @johnnyb this, and I think he would say that it is in fact active information, because it is a deviation from uniform sampling.

I believe he always compares to a uniform distribution.

I think this is accurate. Transitions and transversions are an example of active information (deviations from uniformity) for which I think @johnnyb agrees we have a mechanism, so this deviation is not caused by direction.

So, the thing is, we already do exactly this. However, we don’t use deviations from uniformity, but deviations from more complex models. The reason why is that using uniformity as the default model would trigger the detector every time. What we really want is patterns not well captured by our current knowledge, so we use statistical models that capture our current knowledge. This way we avoid constantly having our detector triggered by the transition/transversion bias, for example.

One example of this in the literature is how the mechanism of recombination-associated mutational clusters was discovered. We found a pattern of mutation clusters. We found a molecular mechanism for it later. If there is interest, I can provide references and explain the story. Of note right now, however, is that they didn’t use deviations from a uniform distribution, but deviations from a more complex model that better captured our current understanding. Still, the basic approach is the same.

@johnnyb, did I accurately engage your views here?

2 Likes

One clarifying question here. What do you mean by “random”?

If you mean uniformly distributed, I agree. There is an immense amount of evidence mutations do not follow a uniform distribution.

If you mean ontologically random, I agree. That is a metaphysical question beyond science’s purview.

That isn’t what scientists mean by random though. So I’m a bit confused here.

2 Likes

Hi there Jonathan. It’s great that you are open to discussing the paper.

I will start with a few problems with the paper, offered as constructive criticism, then end with what I think is good and interesting.

First, the sparseness of citation of the relevant (and active) literature is problematic. It becomes a truly big concern when considered alongside the pattern of works you do cite. I will leave it at that, but it has to be pointed out that the paper could not be considered at any reputable scientific journal, in its current form, with a references list like that one.

Second, the framing of the questions about mutations is too simplistic. Below I will argue that this probably doesn’t matter when evaluating the utility of your approach, but it hurts the paper a lot. This concern is related to the first one, and I think the two concerns might have the same root, which is a failure to engage with and understand the literature on mutation patterns.

Those are the big big problems with the paper, offered from the viewpoint of an editor who makes decisions on when to send papers for review and how to gauge the reviews when they come in. Scant or highly selective/peculiar citation of previous work can sink a paper. It happens regularly.

But again from an editor’s perspective, there is the central question about this or any paper, which is whether it reports an advance worth discussing. Your paper is more of an approach, so instead of advance, we will consider utility. In short: does the paper describe a useful approach or tool? Will other scientists use it?

Here, I actually liked reading your paper, because I thought you did a pretty good job of explaining what the approach is meant to do and how and why it could fail. I liked how you explain that if your approach works, it can point to questions and potential explanations beyond “simple random mutation.” Now, as I hope others have noted elsewhere, the initial hypothesis of “random mutation” is much too simplistic, and so there is a sense that the whole paper is about a strawman. But I am willing to put that aside and focus instead on what the approach is trying to do, which is to find and perhaps quantify examples of genomic/genetic signatures that suggest a process that moved directionally. I think that’s a fair characterization of your approach, adding that you nicely describe how this could involve resources intrinsic to the evolving organisms. But if you think I’m missing something, I’ll be eager to explore that with you.

So the one thing that saves the paper is the approach itself, which is scientific and is described adequately. There are two reasons it is probably trivial (by this I mean that it’s not new or particularly useful). The first is that it is likely too susceptible to GIGO, because the inputs into the approach are (as you note) poorly understood, wildly variable, controlled by myriad other variables, etc. ANY model of ANYTHING, to be a solid contribution to the literature, has to show that it does something other than quantify all the things we know we don’t know. The second is that it is simply not clear when or how the method, if it even works generally, could tell us something new. The example of SHM, very nicely used in your paper, actually says to me that you are providing a number to describe something we already know that we don’t know: how does SHM target specific genomic regions and/or genes and/or segments of those genes? We published a paper just a few months ago on this topic, and it seems to me that you have both illustrated why your question is interesting and relevant, and illustrated why putting a number on “active information” just isn’t useful.

To sum up, I think you are right when you say above that this might be “so obvious as to not be worth speaking about,” and that those who work in these areas would simply add: “and that’s now.”

6 Likes

Even more relevant, maybe, given the context of adaptation and evolution here, is the ongoing struggle to refine methods for detecting positive selection. I wonder if one could run a quick find-and-replace that substitutes “active information” for “positive selection,” in a paper on the topic, and get something that is roughly coherent. At least logically/linguistically speaking.

4 Likes

That’s a problem if he’s conflating uniform with random. One must ask what if anything a measure of the departure from a uniform distribution means and whether the uniform distribution can possibly make any biological sense. And this leads into the question of just what “active information” is really supposed to mean; I don’t refer to the definition but to the interpretation.

As an addendum to my original questions, I’ll add this: if we accept that some DNA is nonfunctional, i.e. junk, would we not expect that mutations in junk would be “random” by the standard meaning, i.e. with respect to fitness? And wouldn’t the distribution of mutations in junk make for a better comparison than starting with a uniform distribution?

8 Likes

I did not choose the term. The term was originally developed by William Dembski and Robert Marks in their analysis of evolutionary algorithms. Active information is the amount of information an algorithm has about the space it is searching.

Quoting from the paper:

The point of this is that it would have to be tested to know. The point of doing this is to establish a way of measuring it. It certainly would be interesting to find out.

Any sufficiently blindly chosen set of mutations should be workable, but I think the optimal way to do so would be a binomially distributed set of mutations at the per-base-pair mutation rate of the organism.
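One possible reading of that baseline, sketched under stated assumptions: every base mutates independently with the same probability, so the number of mutations per sequence is binomially distributed, and each hit is replaced by one of the other three bases with equal weight. The function name, the rate, and the toy sequence are hypothetical and are not taken from the paper:

```python
import random

BASES = "ACGT"


def blind_mutate(sequence: str, per_base_rate: float, rng: random.Random) -> tuple[str, int]:
    """Mutate each base independently with probability per_base_rate.

    Returns the mutated sequence and the number of substitutions made.
    The substitution count follows Binomial(len(sequence), per_base_rate).
    """
    mutated = []
    hits = 0
    for base in sequence:
        if rng.random() < per_base_rate:
            # Assumption: the replacement base is drawn uniformly from the other bases.
            mutated.append(rng.choice([b for b in BASES if b != base]))
            hits += 1
        else:
            mutated.append(base)
    return "".join(mutated), hits


rng = random.Random(0)   # fixed seed so the run is repeatable
genome = "ACGT" * 250    # toy 1,000-base sequence
mutant, count = blind_mutate(genome, 0.01, rng)
print(count, "substitutions in", len(genome), "bases")
```

Scoring the fitness of many such blind mutants would give the baseline success rate against which the organism's actual mutations are compared.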

It depends on what you mean. If the per-organism mutation rate increased, but the actual mutations per organism remained the same, that would not be considered active information on this calculation (we are basically only considering organisms that do have mutations). If you mean that the per-base-pair mutation rate changed, then whether or not it was active information would have to be determined experimentally.

No, but I don’t think it is relevant to the measurement. Nothing in the proposed experiment assumes a given amount of junk DNA.

Again, the point of this is to minimize assumptions - to find a measurement which is largely independent of what we think about it. If you knew unquestionably that junk DNA was junk, and you also knew unquestionably that the mutations here were haphazard, then sure. However, I do not think we know these things. Even many things which were supposedly unequivocally junk DNA have come into question. For instance, there are organisms for which pseudogenes play a large functional role, sometimes simply as repositories of alternate configurations of genes.

1 Like

Note: This is what I mean by a uniform distribution. So it seems like I did not misread you.

The thing is that we know that mutations don’t follow this pattern at all, e.g. with the transition/transversion ratio. So what other measurements do we need?

3 Likes

I don’t understand the emphasis on the conditional in your response. What exactly is preventing or inhibiting you or anyone else from measuring it?

2 Likes

Not quite. It is true that any deviation from the successfulness of a uniform distribution would be evidence of either positive or negative active information. That is, it could deviate in a negative way, and be worse than a uniform distribution. The uniform distribution is simply the measuring stick - it is the “average” strategy that we are measuring against, and allows us to say that what occurs in biology is “better than average” or “worse than average”.

So I think this is where you are misunderstanding me. Only being significantly more successful than uniformity triggers the detector. Being equally successful or less successful does not. Uniformity simply establishes an “average” baseline.
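To make "significantly more successful" concrete, here is a rough sketch using a conventional one-sided exact binomial test. This is an illustrative choice and not necessarily the statistical procedure used in the paper. The function names, the 0.05 threshold, and the counts in the example are all hypothetical:

```python
import math


def upper_tail_p_value(successes: int, trials: int, baseline_rate: float) -> float:
    """P(X >= successes) for X ~ Binomial(trials, baseline_rate)."""
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    if not 0.0 <= baseline_rate <= 1.0:
        raise ValueError("baseline_rate must lie in [0, 1]")
    return sum(
        math.comb(trials, k) * baseline_rate**k * (1 - baseline_rate) ** (trials - k)
        for k in range(successes, trials + 1)
    )


def triggers_detector(successes: int, trials: int, baseline_rate: float, alpha: float = 0.05) -> bool:
    """True only when the observed success count is significantly ABOVE the baseline."""
    return upper_tail_p_value(successes, trials, baseline_rate) < alpha


# Hypothetical counts: 10 observed mutations, baseline success rate of 10%.
print(triggers_detector(9, 10, 0.1))  # far more successes than the baseline predicts
print(triggers_detector(1, 10, 0.1))  # in line with the baseline
print(triggers_detector(0, 10, 0.1))  # worse than the baseline, so no trigger
```

Because the test is one-sided, a process that does worse than the baseline never triggers it, which matches the distinction drawn above between positive and negative active information.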

I am familiar with many cases where people have found mechanisms of mutation, and cited several. I even know of several where the case was made that the mutation was directed towards a purpose. The problem is that there is not currently a standardized way of asking the question of whether or not a mutational mechanism is directed towards a purpose. Again, I’ll point you to the conversation I had with Larry Moran where he agreed 100% with the biology, but refused to say that the mechanism was aligned towards function. This provides a way of answering that question.

You were incorrect, but it certainly seemed to be a good faith error.

I think I answered this above, but let me reiterate. The purpose of the random distribution is as a measuring stick, not because I think biologists currently think this. They think that mutations are random with respect to fitness (again, see my conversation with Moran above). This is a claim that can be tested, and the point of my paper is to make a method/criteria of testing that claim.

1 Like

Okay this makes more sense.

You are saying it is not enough to show that transitions/transversions deviate from uniform, we also have to show that this is positive, or beneficial, leading to more successful mutations than would take place with a uniform distribution.

In the case of transitions/transversions, actually, we know that this is the case. Transitions are more common, and are more likely to be beneficial than transversions. That means that the transition/transversion imbalance skews mutations to beneficial ones, much more than we would expect from a uniform distribution.
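For reference, the uniform expectation for substitution types can be enumerated directly. Of the twelve possible single-base substitutions, four are transitions and eight are transversions, so equal weighting predicts a transition/transversion ratio of 1:2, and any observed excess of transitions is a deviation from uniformity. A short sketch (the helper names are only illustrative):

```python
from itertools import permutations

PURINES = {"A", "G"}  # C and T are the pyrimidines


def is_transition(ref: str, alt: str) -> bool:
    """A transition stays within a base class: purine to purine, or pyrimidine to pyrimidine."""
    return (ref in PURINES) == (alt in PURINES)


def uniform_ts_tv_counts() -> tuple[int, int]:
    """Count transitions and transversions among all 12 ordered substitutions."""
    substitutions = list(permutations("ACGT", 2))
    transitions = sum(is_transition(ref, alt) for ref, alt in substitutions)
    return transitions, len(substitutions) - transitions


ts, tv = uniform_ts_tv_counts()
print(f"uniform baseline: {ts} transitions, {tv} transversions, Ts/Tv = {ts / tv}")
```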

That makes them “active” in your terminology, right?

Well, as I think I’m showing here, transitions/transversions would be more successful than uniformity, so they would trigger the detector. Is that a good or bad thing from your point of view? I don’t know. From my point of view that’s a bad thing.

Yes, I know what you are talking about, and if that is your point I agree with your intended meaning (though I do cringe at the use of “random” in this way). Strictly speaking, mutations are not independent of fitness in important ways (but they are still technically random), and are often biased towards fitness. One such example is the transition/transversion imbalance, which skews mutations towards beneficial (or at least less harmful) changes to proteins.

I don’t think that Larry Moran would disagree with my intended meaning here. If I am understanding @johnnyb correctly this has to do with a real wart or imprecision in how biologists often talk about “random.”

1 Like

Possibly. The paper, however, is largely a mathematics paper with some examples thrown in for context. Mathematics papers usually have a much shorter works cited list than, say, review papers. For instance, in the current Journal of Theoretical Biology, the paper “Predicting protein-peptide binding sites with a Deep Convolutional Neural Network” has a reference list only a little longer than mine, and many of them are to things like URLs of programs (they have citations for general stats textbooks, GitHub repositories, generally-available software like PyTorch, etc.). So, I’m not sure I’m actually out of line for the type of paper I’ve written.

However, it’s entirely possible that I missed literature which would be directly relevant to the question (quantitatively measuring the directedness of mutations). If you are familiar with other literature which addresses this topic which I’ve overlooked and would have materially impacted my paper, please send me links so I can further check them out. It might be worthwhile for future research.

Again, I would love a few references to get me started on this.

As I’ve pointed out to others, the purpose of “random mutation” is to establish a baseline. That’s what most people are missing. Random mutation gives us an expected average value for directedness. This gives us a way of measuring surprise prior to knowing mechanism.

Yes, and not just directionally, but a specific type of direction - towards fitness.

You have dealt with an example that we know about already, SHM, but what about those that we don’t? I showed in the paper (“Relative Active Information”) that the work of Hofwegen actually demonstrates that there is active information within E. coli for generating Cit+ mutations when under selection. Was that known? Most of the people I talked to (and, from the paper, I think Hofwegen et al. themselves) believed that Cit+ mutations were random with respect to fitness. Therefore, the fact that my measurement shows active information would mean that there is probably a mechanism there worth investigating.

I wasn’t talking about the number of citations. I am concerned about both the quality and quantity of citations about the biological topic you are discussing. The math is simple and uninteresting by itself. The key points of the paper are about mutation patterns.

To write a convincing paper about mutation patterns, you need more than papers that “quantitatively measure directedness of mutations.” I’ll be glad to list papers of clear relevance here but I will not further pursue my very significant concern about your citations, which is this: both the number and the focus of the references are a problem.

I think you are mistaken here. I know I am not “missing” anything. I and others in this conversation are saying, correctly, that the framing of your question and your approach is so disconnected from what we know that it hurts the argument. I think it renders the work, sadly, a strawman argument. Jonathan, we understand your argument all too well.

What you showed is that the mutation pattern is different from some expectation. (I can’t use vague phrases like “active information” in good scientific conscience.) At best, you have a measure that suggests that those mutations evolve in interesting ways that could involve “intrinsic” biological factors (physiology, metabolism, etc). Perhaps more likely, you have a measure that suggests the involvement of more complex genetic processes (epistasis, clonal interference, hitchhiking, and so on) that must be accounted for in order to explain mutation trajectories in that experiment.

6 Likes

I think that it is difficult to have a scientific discussion without agreement on what is known about the very examples you chose: the hard, indisputable facts.

Here’s an example.

In your paper, you wrote:
“As mentioned previously, the actual mutations are limited to a single half of a single gene where mutations are likely to be beneficial.”

Yet there’s no citation of the literature to support that claim. Do you have any evidence that supports your claim that actual mutations are limited to a single half of a single gene? There’s an awful lot of emphasis in that claim.

On UD, you wrote:
“Now, first of all, you should notice that the mutations only happen in the correct gene – the antibody gene.”

Where should I notice that “only,” exactly? What am I missing when I notice instead the many papers that describe many mutations in ~275 non-antibody genes and their often deadly consequences?

Do you think that this may be an example of @sfmatheson’s point regarding the sparseness of your citation of relevant literature?

4 Likes

He already linked to a paper with more than a few.

Here’s another:

If you’re pressed for time, I would suggest the section with the heading “AID targets are recurrently mutated in human lymphomas” as most worthy of your attention.

2 Likes

Yes, it’s true that the immune system does not perfectly target mutations to the right part of the genome, and so it can essentially backfire, causing cancer by making mutations in the wrong place.

That’s the exception to the rule though. It is undoubtedly also true that the immune-related mutations are largely targeted to a place which increases the chances of a beneficial mutation.

I did not have Jonathan’s characterization of SHM in mind when I warned about the citation pattern in the paper. I had in mind the extensive and currently active literature on mutation patterns in general.

This could be an opportunity to open a side conversation about SHM, which is SUPER INCREDIBLY INTERESTING and is indeed an example of “directed mutation.” @Mercer is right that SHM is not the surgical laser that some comments seem to suggest. But @johnnyb is right to use it as an example of “directed mutation.” How it is “aimed” at Ig loci is fairly well understood, but open questions remain, and readers can explore the latest questions and answers in the paper I linked to above. SHM should not be described as “limited to a single half of a single gene” without important equivocation: add the word “largely” and I think we’re all set.

7 Likes

Then I’m not sure how you use your method here to compute it. We have no algorithm, and we have no search, and it isn’t clear that we even have a space. What is your operational definition of active information in DNA sequences?

Not sure what that means. What is it that is supposed to be binomially distributed? The number of mutations in the sequence? What would be the mean? Why?

I can’t make any sense of that. It seems self-contradictory. I do in fact mean that the per-base-pair mutation rate changed, which means the per-organism rate would also change in the same way. How would a change in mutation rate change the active information, even theoretically?

That’s quite problematic. You can’t study a system if you have a seriously wrong idea of how it works.

I’m not sure why you think the assumption of a uniform distribution is not an assumption.

I know of no organisms in which this is true. What are you talking about? There are a few cases in which some pseudogenes have evolved new functions, but it certainly isn’t true for the vast majority of pseudogenes. And I don’t know what these “repositories” are even supposed to be.

Are you sure? I thought you were referring to a uniform distribution of point mutation types. That doesn’t seem to be at all what JohnnyB is talking about.

3 Likes