Jonathan Bartlett: Measuring Active Information in Biological Systems

This is meant to be a thread on my paper, Measuring Active Information in Biological Systems.

Sorry I have not been participating in the initial conversation, but the conversation seemed like it was already generating more heat than light when I looked in on it. Additionally, I have spent the last month in a combination of finalizing my work at New Medio / ITX (where I have worked for almost 20 years) to start a new position on Monday, finalizing a new book on electronics, helping various organizations get connected digitally during this time, and it is the end of the school year for our homeschool co-op, so my time has been limited.

Anyway, I have been answering questions about the paper at The Skeptical Zone, but thought I would also stop by here to answer questions. I would like this thread to be dedicated to the content of the paper (and obviously related items). If there are questions that are outside of this, I would like those to be handled in a different thread. Note that the paper does not specifically mention Intelligent Design, or the paper’s relevance to Intelligent Design. While I think the paper does have relevance to Intelligent Design, I would like to separate that from the discussion of the paper itself. So, if people want to discuss that, we can open up a thread for that. However, I think the paper stands alone as a useful contribution, whether or not it can be linked to ID.

So, let me start with an overview of the paper and the background that it comes from. I have been arguing for a long time that mutations are indeed not random with respect to the needs of the organism. You can see a short, short video I did on this a long time ago, or a more in-depth one here, or an old BIO-Complexity paper about the subject here.

The problem I found was that, even though many of the facts of directed mutations were widely known, they were not being recognized as directed. See for instance here for a conversation with Larry Moran, which is pretty typical of the conversations I have. Despite agreeing that mutations are only occurring in the gene that needs to be modified, they don’t view this as being “directed”. Essentially, everyone has a preconceived notion in their mind about what “direction” should look like. If the mode of directionalization doesn’t match their particular preconceived notion, they simply state that the mutation is not directed.

Additionally, I found that there was no quantitative evidence that anyone could point to for mutations being random. This was simply being stated in the literature without justification. It might be true, but, lacking a mechanism for quantification, it would be impossible to know. On that note, I should point out that it is theoretically possible to agree with my paper almost wholeheartedly and not believe in directed mutations. That is, I could perhaps have come up with a correct way of quantifying it, but, when we actually apply that quantification to nature itself, it always comes up zero. This would mean both that my paper is correct, and that random mutations are the norm (note - I do have some examples of positive active information in my paper, but, in theory, these could wind up being total anomalies to the norm). Additionally, as mentioned earlier, one could also agree with my paper wholeheartedly, think that directed mutations occur, and not think it has anything at all to do with Intelligent Design.

Therefore, my goal with using active information was to find a way to quantify “directedness” which was independent of the actual mechanism used for directing. What active information does is that it measures what the effects of an actual randomized process will do, and it compares it to what is actually happening. So, we can compare and see if the process that is actually happening in biology is better than, equal to, or worse than random with respect to the fitness of the organism. Obviously there will be chemical bias in the mutation system. But, if that bias reliably points more in the direction of fitness than a non-biased process, then that is an effect that requires explanation. If the specific base pairs that are more likely to be mutated are more likely to be beneficial than other base pairs chosen at random, then that is an effect that requires explanation.
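For readers who want something concrete, the comparison described here can be sketched in a few lines of Python. This is only a minimal sketch, assuming the log-ratio form of active information used by Dembski and Marks (log2 of the success probability of the actual process over that of the blind baseline). The function name and the example probabilities are made up for illustration and are not taken from the paper.

```python
import math


def active_information(p_baseline: float, q_observed: float) -> float:
    """Active information in bits, assuming the log-ratio form log2(q / p).

    p_baseline: chance that a blind (baseline) mutation is beneficial.
    q_observed: chance that a mutation from the actual process is beneficial.
    Positive means better than baseline, zero means equal, negative means worse.
    """
    if not (0 < p_baseline <= 1 and 0 < q_observed <= 1):
        raise ValueError("probabilities must be in (0, 1]")
    return math.log2(q_observed / p_baseline)


# Hypothetical numbers, chosen only to show the three possible outcomes.
better = active_information(p_baseline=0.01, q_observed=0.04)
equal = active_information(p_baseline=0.01, q_observed=0.01)
worse = active_information(p_baseline=0.01, q_observed=0.005)
print(better, equal, worse)
```

The sign of the result carries the "better than, equal to, or worse than random" reading; the hard part in practice is estimating the two probabilities, which this sketch simply takes as given.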

The way I actually envision active information being used, however, is the opposite. My hope is that people will use active information in order to tell when a mechanism for mutation should be searched for. It takes a lot of time, effort, and money to search for a biological mechanism for mutation (Zhang and Saier have done a lot of work on this, and, if you read their papers, it takes a lot of experiments to track this stuff down). Therefore, it would be good if, going into the process, you knew there was something to find. My hope is that active information will be used like a metal detector. It is a simple way to find out whether or not there is something happening worth investigating. That way, time, effort, and funds can be directed to elucidating the systems most likely to be biologically interesting.

I also fully expect that, if people wind up agreeing with active information, they will eventually find it so obvious as to not be worth speaking about. I actually think that’s largely true, except for the fact that, as of this moment, people do have a mental block when thinking about directed mutations. My hope is that active information will remove this mental block. If that happens generally, then my guess is that shortly afterward, everyone will wonder what the hubbub was originally about, and why my paper was needed to begin with, since the concept was so obvious.

Additionally, my paper also notes that actually performing these experiments leaves a lot of questions, because there will always be intermixing between induced random mutations, and the mutations that the organism is already doing (whether they are random or not). The paper offers a statistical mechanism for separating these out in certain circumstances.

So, if you can take the time to read the paper, please do so, but I think that is enough to start a conversation. I don’t have time to just sit on the thread all day, but I will try to answer questions at least by the end of each day (though probably not Monday). If you would like a side topic addressed (i.e., something not relevant to this thread), just make a new topic and link to it here so it is findable.


Thanks for the comments @johnnyb. I’m moving this to a scholars thread.

We invite scientists to engage with you, and @moderators will be enforcing tighter rules on this thread. Explain disagreements but refrain from ad hominems. Be respectful. It is a privilege to have @johnnyb here to discuss his work.

We ask readers to assist moderators by flagging any inappropriate posts.

@Joe_Felsenstein, @glipsnort, @sygarte, @Tom_English, and @dga471 may be important contributors to the discussion if they choose to participate.


I have trouble with the most basic statements in that paper. I find your writing confusing, which I hope is not intentional. The answers to several questions might help. Could you define “active information”? Why have you chosen that as the term? Is this information in any technical sense? Do you think that transition/transversion bias demonstrates active information? How do you choose what distribution to consider “random”? Does changing the rate of mutations while keeping the same distribution constitute active information?

Do you agree that much (perhaps 90%) of the human genome is junk? If so, would the mutation spectrum in junk DNA be a good one to choose as a random distribution?


@johnnyb, I’ve read the paper, and I also remember talking to you in person about these ideas.

As I understand it, you see any deviation from a uniform distribution as evidence of active information. Is that right?


I recall specifically asking @johnnyb this, and he would say, I think, that it is in fact active information, because it is a deviation from uniform sampling.

I believe he always compares to a uniform distribution.

I think this is accurate. Transitions and transversions are an example of active information (a deviation from uniformity) for which I think @johnnyb agrees we have a mechanism, so this deviation is not caused by direction.

So, the thing is, we already do exactly this. However, we don’t use deviations from uniformity, but deviations from more complex models. The reason why is that using uniformity as the default model would trigger the detector every time. What we really want is patterns not well captured by our current knowledge, so we use statistical models that capture our current knowledge. This way we avoid constantly having our detector triggered by the transition/transversion bias, for example.
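To illustrate why the choice of null model matters, here is a rough Python sketch comparing an observed transition fraction against two baselines: a uniform one and one with a built-in transition bias. It assumes equal base frequencies and a single rate ratio `kappa` (transition rate over per-transversion rate), in the spirit of simple substitution models; the `kappa` value and the substitution list are hypothetical, not data from any study.

```python
PURINES = {"A", "G"}  # C and T are the pyrimidines


def is_transition(ref: str, alt: str) -> bool:
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    return ref != alt and ((ref in PURINES) == (alt in PURINES))


def transition_fraction(substitutions: list[tuple[str, str]]) -> float:
    """Fraction of observed substitutions that are transitions."""
    hits = sum(is_transition(ref, alt) for ref, alt in substitutions)
    return hits / len(substitutions)


def expected_transition_fraction(kappa: float = 1.0) -> float:
    """Expected transition fraction under the null model.

    Each base has one transition partner and two transversion partners,
    so kappa = 1 (uniform) gives 1/3, and larger kappa gives more transitions.
    """
    return kappa / (kappa + 2.0)


# Hypothetical observations: (reference base, mutated base).
observed = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "C"), ("G", "T")]

print("observed:", transition_fraction(observed))
print("uniform null:", expected_transition_fraction(kappa=1.0))
print("biased null (assumed kappa=4):", expected_transition_fraction(kappa=4.0))
```

With these made-up numbers the observed fraction sits far from the uniform expectation but exactly on the biased one, which is the sense in which a richer null model stops the detector from firing on bias we already understand.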

One example of this in the literature is how the mechanism of recombination-associated mutational clusters was discovered. We found a pattern of mutation clusters. We found a molecular mechanism for it later. If there is interest, I can provide references and explain the story. Of note right now, however, is that they didn’t use deviations from a uniform distribution, but deviations from a more complex model that better captured our current understanding. Still, the basic approach is the same.

@johnnyb, did I accurately engage your views here?


One clarifying question here. What do you mean by “random”?

If you mean uniformly distributed, I agree. There is an immense amount of evidence mutations do not follow a uniform distribution.

If you mean ontologically random, I agree. That is a metaphysical question beyond science’s purview.

That isn’t what scientists mean by random though. So I’m a bit confused here.


Hi there Jonathan. It’s great that you are open to discussing the paper.

I will start with a few problems with the paper, offered as constructive criticism, then end with what I think is good and interesting.

First, the sparseness of citation of the relevant (and active) literature is problematic. It becomes a truly big concern when considered alongside the pattern of works you do cite. I will leave it at that, but it has to be pointed out that the paper could not be considered at any reputable scientific journal, in its current form, with a references list like that one.

Second, the framing of the questions about mutations is too simplistic. Below I will argue that this probably doesn’t matter when evaluating the utility of your approach, but it hurts the paper a lot. This concern is related to the first one, and I think the two concerns might have the same root, which is a failure to engage with and understand the literature on mutation patterns.

Those are the big big problems with the paper, offered from the viewpoint of an editor who makes decisions on when to send papers for review and how to gauge the reviews when they come in. Scant or highly selective/peculiar citation of previous work can sink a paper. It happens regularly.

But again from an editor’s perspective, there is the central question about this or any paper, which is whether it reports an advance worth discussing. Your paper is more of an approach, so instead of advance, we will consider utility. In short: does the paper describe a useful approach or tool? Will other scientists use it?

Here, I actually liked reading your paper, because I thought you did a pretty good job of explaining what the approach is meant to do and how and why it could fail. I liked how you explain that if your approach works, it can point to questions and potential explanations beyond “simple random mutation.” Now, as I hope others have noted elsewhere, the initial hypothesis of “random mutation” is much too simplistic, and so there is a sense that the whole paper is about a strawman. But I am willing to put that aside and focus instead on what the approach is trying to do, which is to find and perhaps quantify examples of genomic/genetic signatures that suggest a process that moved directionally. I think that’s a fair characterization of your approach, adding that you nicely describe how this could involve resources intrinsic to the evolving organisms. But if you think I’m missing something, I’ll be eager to explore that with you.

So the one thing that saves the paper is the approach itself, which is scientific and is described adequately. There are two reasons it is probably trivial (by this I mean that it’s not new or particularly useful). The first is that it is likely too susceptible to GIGO, because the inputs into the approach are (as you note) poorly understood, wildly variable, controlled by myriad other variables, etc. ANY model of ANYTHING, to be a solid contribution to the literature, has to show that it does something other than quantify all the things we know we don’t know. The second is that it is simply not clear when or how the method, if it even works generally, could tell us something new. The example of SHM, very nicely used in your paper, actually says to me that you are providing a number to describe something we already know that we don’t know: how does SHM target specific genomic regions and/or genes and/or segments of those genes? We published a paper just a few months ago on this topic, and it seems to me that you have both illustrated why your question is interesting and relevant, and illustrated why putting a number on “active information” just isn’t useful.

To sum up, I think you are right when you say above that this might be “so obvious as to not be worth speaking about,” and that those who work in these areas would simply add: “and that’s now.”


Even more relevant, maybe, given the context of adaptation and evolution here, is the ongoing struggle to refine methods for detecting positive selection. I wonder if one could run a quick find-and-replace that substitutes “active information” for “positive selection,” in a paper on the topic, and get something that is roughly coherent. At least logically/linguistically speaking.


That’s a problem if he’s conflating uniform with random. One must ask what if anything a measure of the departure from a uniform distribution means and whether the uniform distribution can possibly make any biological sense. And this leads into the question of just what “active information” is really supposed to mean; I don’t refer to the definition but to the interpretation.

As an addendum to my original questions, I’ll add this: if we accept that some DNA is nonfunctional, i.e. junk, would we not expect that mutations in junk would be “random” by the standard meaning, i.e. with respect to fitness? And wouldn’t the distribution of mutations in junk make for a better comparison than starting with a uniform distribution?


I did not choose the term. The term was originally developed by William Dembski and Robert Marks in their analysis of evolutionary algorithms. Active information is the amount of information an algorithm has about the space it is searching.

Quoting from the paper:

The point of this is that it would have to be tested to know. The point of doing this is to establish a way of measuring it. It certainly would be interesting to find out.

Any sufficiently blindly chosen set of mutations should be workable, but I think the optimal way to do so would be a binomially distributed set of mutations at the per-base pair mutation rate of the organism.
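One way to read this baseline, offered only as a sketch and not as the paper's procedure: mutate every site independently with the same per-base-pair probability, so the number of mutations per genome is binomially distributed, and pick the replacement base uniformly among the other three. The function name, the toy sequence, and the rate are hypothetical.

```python
import random

BASES = "ACGT"


def sample_blind_mutations(genome: str, per_bp_rate: float, rng: random.Random) -> str:
    """Return a copy of `genome` with blind point mutations applied.

    Each site mutates independently with probability `per_bp_rate`, so the
    mutation count is Binomial(len(genome), per_bp_rate). A mutated site is
    replaced by one of the three other bases, chosen uniformly.
    Assumes `genome` contains only the characters A, C, G, T.
    """
    mutated = []
    for base in genome:
        if rng.random() < per_bp_rate:
            mutated.append(rng.choice([b for b in BASES if b != base]))
        else:
            mutated.append(base)
    return "".join(mutated)


# Hypothetical usage: a toy sequence and an unrealistically high rate so
# that a few mutations are visible in a short string.
rng = random.Random(42)
original = "ACGTACGTACGTACGTACGT"
mutant = sample_blind_mutations(original, per_bp_rate=0.1, rng=rng)
n_changes = sum(a != b for a, b in zip(original, mutant))
print(mutant, n_changes)
```

The fraction of such blind mutants that turn out to be beneficial would then serve as the baseline success probability against which the organism's actual mutations are compared.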

It depends on what you mean. If the per-organism mutation rate increased, but the actual mutations per organism remained the same, that would not be considered active information on this calculation (we are basically only considering organisms that do have mutations). If you mean that the per-base-pair mutation rate changed, then whether or not it was active information would have to be determined experimentally.

No, but I don’t think it is relevant to the measurement. Nothing in the proposed experiment assumes a given amount of junk DNA.

Again, the point of this is to minimize assumptions - to find a measurement which is largely independent of what we think about it. If you knew unquestionably that junk DNA was junk, and you also knew unquestionably that the mutations here were haphazard, then sure. However, I do not think we know these things. Even many things which were supposedly unequivocally junk DNA have come into question. For instance, there are organisms for which pseudogenes play a large functional role, sometimes simply as repositories of alternate configurations of genes.


Note: This is what I mean by a uniform distribution. So it seems like I did not misread you.

The thing is that we know that mutations don’t follow this pattern at all, e.g. with transitions/transversions ratio. So what other measurements do we need?


I don’t understand the emphasis on the conditional in your response. What exactly is preventing or inhibiting you or anyone else from measuring it?


Not quite. It is true that any deviation from the successfulness of a uniform distribution would be evidence of either positive or negative active information. That is, it could deviate in a negative way, and be worse than a uniform distribution. The uniform distribution is simply the measuring stick - it is the “average” strategy that we are measuring against, and allows us to say that what occurs in biology is “better than average” or “worse than average”.

So I think this is where you are misunderstanding me. Only being significantly more successful than uniformity triggers the detector. Being equally successful or less successful does not. Uniformity simply establishes an “average” baseline.
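As a hedged illustration of "significantly more successful" (this is a standard one-sided exact binomial test swapped in for concreteness, not the statistical method from the paper), the detector could be sketched like this. The baseline probability, the counts, and the 0.05 threshold are all assumptions made for the example.

```python
from math import comb


def upper_tail_p(successes: int, trials: int, p_baseline: float) -> float:
    """P(X >= successes) for X ~ Binomial(trials, p_baseline)."""
    return sum(
        comb(trials, k) * p_baseline**k * (1 - p_baseline) ** (trials - k)
        for k in range(successes, trials + 1)
    )


def detector_triggers(successes: int, trials: int, p_baseline: float,
                      alpha: float = 0.05) -> bool:
    """True only if the observed success count is significantly ABOVE baseline.

    The test is one-sided: doing as well as, or worse than, the baseline
    never triggers the detector.
    """
    return upper_tail_p(successes, trials, p_baseline) < alpha


# Hypothetical experiment: 20 observed mutations, baseline benefit rate 10%.
print(detector_triggers(successes=8, trials=20, p_baseline=0.1))  # well above baseline
print(detector_triggers(successes=2, trials=20, p_baseline=0.1))  # about baseline
print(detector_triggers(successes=0, trials=20, p_baseline=0.1))  # below baseline
```

Because only the upper tail is examined, a process that does worse than the baseline (negative active information) is reported the same way as one that merely matches it.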

I am familiar with many cases where people have found mechanisms of mutation, and cited several. I even know of several where the case was made that the mutation was directed towards a purpose. The problem is that there is not currently a standardized way of asking the question of whether or not a mutational mechanism is directed towards a purpose. Again, I’ll point you to the conversation I had with Larry Moran where he agreed 100% with the biology, but refused to say that the mechanism was aligned towards function. This provides a way of answering that question.

You were incorrect, but it certainly seemed to be a good faith error.

I think I answered this above, but let me reiterate. The purpose of the random distribution is as a measuring stick, not because I think biologists currently think this. They think that mutations are random with respect to fitness (again, see my conversation with Moran above). This is a claim that can be tested, and the point of my paper is to make a method/criteria of testing that claim.


Okay this makes more sense.

You are saying it is not enough to show that transitions/transversions deviate from uniform, we also have to show that this is positive, or beneficial, leading to more successful mutations than would take place with a uniform distribution.

In the case of transitions/transversions, actually, we know that this is the case. Transitions are more common, and are more likely to be beneficial than transversions. That means that the transition/transversion imbalance skews mutations to beneficial ones, much more than we would expect from a uniform distribution.

That makes them “active” in your terminology, right?

Well, as I think I’m showing here, transitions/transversions would be more successful than uniformity, so they would trigger the detector. Is that a good or bad thing from your point of view? I don’t know. From my point of view that’s a bad thing.

Yes, I know what you are talking about, and if that is your point I agree with your intended meaning (though I do cringe at the use of “random” in this way). Strictly speaking, mutations are not independent of fitness in important ways (but they are still technically random), and are often biased towards fitness. One such example is the transition/transversion imbalance, which skews mutations towards beneficial (or at least less harmful) changes to proteins.

I don’t think that Larry Moran would disagree with my intended meaning here. If I am understanding @johnnyb correctly this has to do with a real wart or imprecision in how biologists often talk about “random.”


Possibly. The paper, however, is largely a mathematics paper with some examples thrown in for context. Mathematics papers usually have a much shorter works cited list than, say, review papers. For instance, in the current Journal of Theoretical Biology, the paper “Predicting protein-peptide binding sites with a Deep Convolutional Neural Network” has a reference list only a little longer than mine, and many of them are to things like URLs of programs (they have citations for general stats textbooks, Github repositories, generally-available software like PyTorch, etc.). So, I’m not sure I’m actually out of line for the type of paper I’ve written.

However, it’s entirely possible that I missed literature which would be directly relevant to the question (quantitatively measuring the directedness of mutations). If you are familiar with other literature which addresses this topic which I’ve overlooked and would have materially impacted my paper, please send me links so I can further check them out. It might be worthwhile for future research.

Again, I would love a few references to get me started on this.

As I’ve pointed out to others, the purpose of “random mutation” is to establish a baseline. That’s what most people are missing. Random mutation gives us an expected average value for directedness. This gives us a way of measuring surprise prior to knowing mechanism.

Yes, and not just directionally, but a specific type of direction - towards fitness.

You have dealt with an example that we know about already, SHM, but what about those that we don’t? I showed in the paper (“Relative Active Information”) that the work of Hofwegen actually demonstrates that there is active information within E. coli for generating Cit+ mutations when under selection. Was that known? Most of the people I talked to (and, from the paper, I think Hofwegen et al. themselves) believed that Cit+ mutations were random with respect to fitness. Therefore, the fact that my measurement shows active information would mean that there is probably a mechanism there worth investigating.

I wasn’t talking about the number of citations. I am concerned about both the quality and quantity of citations about the biological topic you are discussing. The math is simple and uninteresting by itself. The key points of the paper are about mutation patterns.

To write a convincing paper about mutation patterns, you need more than papers that “quantitatively measure directedness of mutations.” I’ll be glad to list papers of clear relevance here but I will not further pursue my very significant concern about your citations, which is this: both the number and the focus of the references are a problem.

I think you are mistaken here. I know I am not “missing” anything. I and others in this conversation are saying, correctly, that the framing of your question and your approach is so disconnected from what we know that it hurts the argument. I think it renders the work, sadly, a strawman argument. Jonathan, we understand your argument all too well.

What you showed is that the mutation pattern is different from some expectation. (I can’t use vague phrases like “active information” in good scientific conscience.) At best, you have a measure that suggests that those mutations evolve in interesting ways that could involve “intrinsic” biological factors (physiology, metabolism, etc). Perhaps more likely, you have a measure that suggests the involvement of more complex genetic processes–epistasis, clonal interference, hitchhiking, and so on–that must be accounted for in order to explain mutation trajectories in that experiment.


I think that it is difficult to have a scientific discussion without agreement on what is known about the very examples you chose–the hard, indisputable facts.

Here’s an example.

In your paper, you wrote:
“As mentioned previously, the actual mutations are limited to a single half of a single gene where mutations are likely to be beneficial.”

Yet there’s no citation of the literature to support that claim. Do you have any evidence that supports your claim that actual mutations are limited to a single half of a single gene? There’s an awful lot of emphasis in that claim.

On UD, you wrote:
“Now, first of all, you should notice that the mutations only happen in the correct gene – the antibody gene.”

Where should I notice that “only,” exactly? What am I missing when I notice instead the many papers that describe many mutations in ~275 non-antibody genes and their often deadly consequences?

Do you think that this may be an example of @sfmatheson’s point regarding the sparseness of your citation of relevant literature?


He already linked to a paper with more than a few.

Here’s another:

If you’re pressed for time, I would suggest the section with the heading “AID targets are recurrently mutated in human lymphomas” as most worthy of your attention.


Yes it’s true that the immune system does not perfectly target mutations to the right part of the genome, and so it can essentially backfire, causing cancer by making mutations in the wrong place.

That’s the exception to the rule though. It is undoubtedly also true that the immune-related mutations are largely targeted to a place which increases the chances of a beneficial mutation.

I did not have Jonathan’s characterization of SHM in mind when I warned about the citation pattern in the paper. I had in mind the extensive and currently active literature on mutation patterns in general.

This could be an opportunity to open a side conversation about SHM, which is SUPER INCREDIBLY INTERESTING and is indeed an example of “directed mutation.” @Mercer is right that SHM is not the surgical laser that some comments seem to suggest. But @johnnyb is right to use it as an example of “directed mutation.” How it is “aimed” at Ig loci is fairly well understood, but open questions remain, and readers can explore the latest questions and answers in the paper I linked to above. SHM should not be described as “limited to a single half of a single gene” without important equivocation: add the word “largely” and I think we’re all set.


Then I’m not sure how you use your method here to compute it. We have no algorithm, and we have no search, and it isn’t clear that we even have a space. What is your operational definition of active information in DNA sequences?

Not sure what that means. What is it that is supposed to be binomially distributed? The number of mutations in the sequence? What would be the mean? Why?

I can’t make any sense of that. It seems self-contradictory. I do in fact mean that the per-base-pair mutation rate changed, which means the per-organism rate would also change in the same way. How would a change in mutation rate change the active information, even theoretically?

That’s quite problematic. You can’t study a system if you have a seriously wrong idea of how it works.

I’m not sure why you think the assumption of a uniform distribution is not an assumption.

I know of no organisms in which this is true. What are you talking about? There are a few cases in which some pseudogenes have evolved new functions, but it certainly isn’t true for the vast majority of pseudogenes. And I don’t know what these “repositories” are even supposed to be.

Are you sure? I thought you were referring to a uniform distribution of point mutation types. That doesn’t seem to be at all what JohnnyB is talking about.