Jonathan Bartlett: Measuring Active Information in Biological Systems

I can’t understand how anyone could describe the most fearsome oncogene we know, V12Ras, as resulting from a “loss of information.” Or for that matter, how it works to describe gains of function (horribly common in cancer) as “loss of information.” The claims are not credible.

2 Likes

More to the point though @sfmatheson,

The focus of this thread should be @johnnyb’s paper. We can start a new thread to discuss his other ideas, and it would serve us best if he refrained (in this thread) from arguing for ideas not in his paper.

3 Likes

That’s very interesting. Thanks. I do hope that you aren’t trying to use that one gene in one organism as evidence that all pseudogenes are functional, though.

4 Likes

Thank you for the response, and I want to give the paper another read before I say much more. But first, we have a miscommunication to clear up.

You have essentially set up a likelihood ratio test of heterogeneity versus the null of uniform probability of mutations. The null expectation simply cannot occur in nature, and no one is going to be surprised when you discover that mutations are not uniform (we already know that). I’m suggesting using a null that more accurately reflects what is already known. There are a multitude of alternate hypotheses for how the “AI” can differ from this more realistic null.

ETA: The statistical approach would be to test against a null of the expectation given function (current function). I’ll have to think about how that could be defined - random drift perhaps?

It’s still unclear what you mean by “random with respect to function”. What function is that? I’ll re-read and follow-up.

But you want to test if mutations are random. We already know selection is not random. Selection is going to be a non-random bias in any data you can get, or it may confound AI measurement entirely. I agree the single generation measurement will be the cleanest in this respect.

2 Likes


I would suggest reading the current empirical literature on the subject.

Your citations of the literature are far too old and only consider the products of SHM after selection.

They wrote that 18 or 19 years ago, man. They weren’t looking at the actual mutations before selection. They couldn’t. Now we can, and you are ignoring those data.

That’s why the ages, in addition to the tiny numbers, of the papers you cite give you away.

That’s simply not true.

The mutations occur in selective places, including hundreds of genes where they are not needed. We know that from sequencing before selection, which wasn’t done in 2002. People have since microdissected B-cells from germinal centers, before selection, and done single-cell genome sequencing.

Have you forgotten that you asked for quantitative papers and that I supplied one, free full text available, from 2018?

It appears that you either didn’t read it or didn’t understand it.

Let’s look at Figure 1, panel A:

Just to be as generous and basic as possible, I’ll point out that the numbers and bars around the circular graph represent chromosomes and the red bars represent mutations.

So, Jonathan, what is the relative frequency of Ig variable region mutations relative to other sites in the genome before selection, according to the data graphed above?

I should add that this was predicted given the characteristics of mutations in B-cell lymphomas, but we only knew that because lymphomas are also the products of positive selection. Therefore, the interpretation that followed your grudging admission that lymphomas also occur was not merely sneaking in teleology, but completely wrong.

You simply can’t look at sequences after selection and credibly claim that you are looking at sequences before selection.

4 Likes

Your primary example misrepresents the products of mutation plus selection as the products of mutation alone.

1 Like

We can do that? :astonished:

@johnnyb It seems that it IS possible to observe data that is not conditional on selection. I will need to revise my comments on this. @Mercer Thanks for the info!

3 Likes

Yes. One sections a mouse spleen or human lymph node. The germinal centers are discernible. Then with a dissecting microscope (I suspect that young grad students can do it with the naked eye), one samples cells from the centers.

This paper has some figures that may explain it better:

Before and after microdissection:
https://www.nature.com/articles/2402073/figures/2

This slide show is more computer science-centric and may ring more bells for you and @johnnyb:

All the links above are ONLY for illustration of the technique used in this paper linked below, which has the data:

As shown in the cartoon below, @johnnyb is citing papers from long ago, when we could only look at cell populations (shown in pink on the right) after selection. Today, people can follow and have followed individual clones in germinal centers–in vivo.
https://www.semanticscholar.org/paper/Dynamics-of-B-cells-in-germinal-centres-Silva-Klein/dd9bd7d5159f3009b7e0f3cf10ad8786e499b960/figure/2

Another version is here, which has the cells we could see in 2002 properly outside the follicle at the top, and multiple rounds of selection that Jonathan is missing:
https://onlinelibrary.wiley.com/action/downloadFigures?id=imr12396-fig-0003&doi=10.1111%2Fimr.12396

4 Likes

Is your point that most mutations fall outside the intended target?


@Mercer -

Your point is well-taken. It is true that I haven’t been heavy in the research of this lately. To give a bit of history, this paper that I’m presenting is actually itself almost ten years old. I wrote the paper shortly after doing a poster presentation on the topic. The road to publication has been long. When I first wrote it, I tried to publish it at an ordinary journal, and had mixed reviews (one very positive, one very negative). I wasn’t very familiar with the publication process at the time, so I wasn’t really sure where to go or what to do or who to send it to. Anyway, I first sent it to BIO-Complexity in 2012, where it was rejected because at the time they were only doing experimental papers. A few years later, they had opened it up to more types of papers, and I submitted it again, and it was rejected because I had some mathematics errors. A few years later, I went back through and did a detailed cleanup of the math and got a mathematician friend, Asatur Khurshudyan (who had previously coauthored a mathematics paper with me on changes to the second derivative) to help me out (he’s mentioned in the acknowledgements). I submitted this, and the mathematician had some initial pushback, but finally got it through.

All that to say, there are indeed older references, but that is due to the publication history. Would it be great if I had the time to keep up with everything? Sure, but I also have a day job, a teaching job, and two secondary writing jobs, so it is true that I don’t always have time to keep up with everything. However, a cursory review of the paper you cited seemed to indicate that it was an AID site prediction study, not an experimental study (actually it looked like a combined study, but it was hard to tell from a brief review how much was predicted vs. demonstrated). The problem with prediction techniques is that they assume that the cell doesn’t have compensation machinery as well (in fact, they also assume that SHM is the only valid usage of AID). But nonetheless, I will certainly grant that you have a broader knowledge of the SHM literature than I do.

Anyway, if you are correct, then that would not argue against my main idea (that you can measure active information and this is a good way to measure it), but only the application (that I have correctly applied the idea that I stated). If there winds up being less active information in SHM than I think there is because I measured it too casually, so be it. It’s not the main idea, just a way to understand how simplifications of the concept can work.

Another thing to notice, though, is that “selection” is being used in an equivocal way here. If the organism targets a cell for destruction because it doesn’t meet the standards, that is not “selection” in the Darwinian sense, but more similar to targeted mutation. If the cell simply falls apart because it is broken, then that is indeed selection in the Darwinian sense. But, if the organism is terminating the cell because it detects that the cell is operating outside its boundaries, that’s not Darwinian selection. Is it active information? Probably not, because the organism did indeed “try” at that point. Anyway, it is an interesting discussion, and I’m happy to say that you know more about the details than I do.

Again, my goal, as stated in the paper and in this thread, is to supply a mechanism for testing the question of whether or not organisms contain information about likely targets of evolution. I am not invested in any particular outcome for any particular process, especially as far as the present paper is concerned. I’m only concerned with whether the testing mechanism, if wielded by the appropriate person, would perform as indicated. Matheson, despite thinking the paper isn’t worthwhile, did think that the testing mechanism would in fact perform as indicated. I don’t really care if Matheson likes me or the paper, or thinks that either of us is worth the time of day. I am interested in the fact that he thinks that the paper does what it says it will do.

I will say, however, that, in my (admittedly limited) reading, I have several times found that lack of targeting has occurred due to identifiable mutations which make off-target sites more likely to occur. This is more well-documented with the RAG enzymes during V(D)J recombination, but I have found papers on it. I imagine we will find more, but, as Matheson points out, my suppositions aren’t science.

@Dan_Eastwood -

I think you are still thinking of the paper in the same way that Swamidass started out thinking. You say,

This is a complete misunderstanding of what I am doing. I am not testing whether or not mutations follow a uniform probability. I am testing whether or not mutations, as they occur in nature, are more successful or less successful (or even) than they would be if they did follow a uniform probability. This is the important point. Let’s say there is a strong bias of mutations. But, let’s say that this bias generally doesn’t favor the organism’s results. This would be negative active information, because it would be pointing away from success. The reason why a uniform distribution of mutations is the right thing to test against is because we have no reason for thinking that mutations should or should not be biased towards function. Using the amount of function that a uniform distribution gives tells us where the “zero point” is - the expected value of an arbitrary mutation. Mutational biases may be more helpful or more harmful than arbitrary mutations. That’s what active information seeks to find out.
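
To make the "zero point" idea above concrete, here is a minimal sketch. The thread does not reproduce the paper's equations, so this assumes the commonly used definition of active information as the log ratio of two success probabilities, log2(q / p), where p is the success rate expected under uniform mutation and q is the success rate actually observed. The function name and the numbers are hypothetical.

```python
import math

def active_information(q_observed, p_uniform):
    """Active information in bits, assuming the log-ratio definition log2(q / p).

    q_observed: fraction of real mutations that succeed at the task.
    p_uniform:  fraction expected to succeed if mutations were uniform.
    Both names and the definition itself are assumptions for illustration.
    """
    if not (0 < q_observed <= 1 and 0 < p_uniform <= 1):
        raise ValueError("both probabilities must lie in (0, 1]")
    return math.log2(q_observed / p_uniform)

# Bias that helps: observed success is four times the uniform baseline.
helpful = active_information(0.5, 0.125)   # positive

# Bias that changes nothing about success: the zero point.
neutral = active_information(0.25, 0.25)   # zero

# Bias that hurts: observed success is a quarter of the uniform baseline.
harmful = active_information(0.125, 0.5)   # negative
```

Under this assumed definition, a strong mutational bias can still score zero or negative, which is the distinction the paragraph above draws between "biased" and "biased towards function".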

When someone says that “mutation is random with respect to function”, they are saying (or at least most people are understanding) that the outcome of the mutation is no more or less functional than any other arbitrary transformation on the genome. Therefore, the metric addresses this question and assigns a value to it. Note that the measurement is useful (as mentioned previously) even if this was not the original question intended by the statement “mutation is random with respect to function”. But, in addition to its uses in detecting possible, previously-unknown mutational mechanisms, it also serves the function of forcing people to be more quantitative about statements like this :slight_smile:

Anyway, this has been fun, and I appreciate everyone’s thoughtful contributions. Unless anyone has specific questions for me, I’m happy to leave y’all with the last word.

1 Like

To be clear, I have not stopped thinking this is what you are doing.

Yes, that is what I thought from the beginning. Our point is that this doesn’t make much sense.

I’ve already granted you that mutations are not independent of function. They are skewed to functional mutations that are beneficial in important ways. That is true, but we don’t know that from your work and analysis.

They are still random, in that we cannot fully predict the mutations we will see. That’s why the whole “random with respect to” is a very poor way to put this.

Thanks for joining the conversation. I do have some specific questions.

Do you understand why the focused issues I just raised here are a deal breaker for us? Do you still think your argument is valid?

2 Likes


You haven’t actually said specifically what the problem is, except when misstating what I’m doing. And, Matheson, who seemed to understand what I was doing, wound up agreeing that the procedure did what it said it should do. So, you shouldn’t use “us” as if this represented all present. It just isn’t the case.

That’s literally true of everything, including throwing a baseball to a target. Nobody would say that the baseball player threw randomly with respect to the target.

Except that people do, and they mean something by it. If you don’t, great! It actually seems you are agreeing with me, but disagreeing about what other people mean. Let’s say that no one ever talked about whether mutations are random with respect to function. That doesn’t mean one can’t recognize a bias towards function, and recognize that this indicates that there must be a reason for a significant regularized bias towards function.

But, in truth, I bet that you couldn’t get Larry Moran to say that mutations are biased in a way that leads towards function. It sounds to me like he means every word he says about it, and many others, too. When he and I discussed it, the only thing he would admit to is bias, but specifically not bias towards function.

More so now than before. Every argument against my thesis seems to think that I’m presuming that mutations should be randomly distributed, or that biologists are saying that mutations should be randomly distributed. That is 100% not the case, but the fact that the common link between the people who disagree with me is that they misunderstand my arguments seems to indicate that there is nothing wrong with my thesis, though perhaps it does say that I am poor at communicating it.

Perhaps for a different way of communicating it, let me walk through it in a backwards fashion, and see if there is more clarity.

Let’s presume, counter-factually, that mutations did have a uniform random distribution. In such a case, would it be correct to say that they are “random” with respect to fitness? I think so. In fact, in such a case, it would be correct to say that they are random with respect to just about anything, fitness included. If this wouldn’t count as “random with respect to fitness”, I don’t know what would.

Therefore, we know what the fitness effects of “random with respect to fitness” look like. We can then measure whether or not particular biases are biased towards function, away from function, or biased in a way that does not deviate from randomness towards function. These are all things which are performable, measurable actions that have definite meanings based on the terms being used. We can even measure whether or not the specific biases that actually occur in organisms are equivalent to being random towards fitness or whether they are biased one way or the other.

This line of reasoning only relies on two questions - (1) if mutations were randomly distributed, would they be random with respect to fitness? I believe the answer would be yes. (2) Can we measure the fitness of actual mutations? I believe the answer here would be yes. Since we can answer (1) and (2), we can also compare the measurements. If (2) has greater fitness than (1), then in what sense is “random with respect to fitness” true?
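
The comparison of (1) and (2) can be sketched as two expected values over the same set of possible mutations, one weighted uniformly and one weighted by how often each mutation actually occurs. Everything here is a toy assumption: the fitness effects, the weights, and the function name are invented for illustration and do not come from the paper or from any dataset.

```python
def expected_fitness_effect(effects, weights):
    """Weighted mean fitness effect of a single mutation.

    effects: fitness effect of each possible mutation (hypothetical values).
    weights: relative frequency with which each mutation occurs.
    """
    if len(effects) != len(weights):
        raise ValueError("effects and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(e * w for e, w in zip(effects, weights)) / total

# Hypothetical fitness effects for five possible mutations.
effects = [-1.0, -1.0, 0.0, 0.0, 1.0]

# (1) Counterfactual baseline: every mutation equally likely.
uniform_baseline = expected_fitness_effect(effects, [1, 1, 1, 1, 1])

# (2) Stand-in for a measured spectrum that favours the beneficial mutation.
observed = expected_fitness_effect(effects, [1, 1, 2, 2, 4])

# A positive difference means the observed bias outperforms the baseline.
difference = observed - uniform_baseline
```

Note that the sketch takes the observed weights as given. If those weights are estimated from cells or organisms that have already passed through selection, the difference mixes mutational bias with selection, which is the objection raised elsewhere in this thread.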

2 is greater than 1. I’ve already agreed that “random with respect to fitness” is a horrible way to explain this and it is not true in a mathematically precise way. It is true that this is an incorrect claim, but for a different reason than you’ve laid out.

If your point is “mutations are not random with respect to fitness,” well that is as flawed as saying “mutations are random with respect to fitness.” They are random (not totally predictable) in both cases.

Actually, mathematical modelers would model this as a random variable, which is not independent of the target. It would be “a random event even though it is dependent on the target.” The whole “random with respect to” is not a sensible way of describing this.

If your point is that “mutations are not independent of fitness,” yes that is correct. I think that is what you mean. Once again, we know this from other observations than the argument you’ve made here.

1 Like

Oh dear, that’s potentially misleading though I don’t think you intended it that way. What I wrote is:

I then identified the best-case scenario for what that would mean. I do NOT think that your math is likely to generate insights, because “some expectation” refers to what is rather clearly a strawman.

This is what your paper reports: Some pretty basic math that generates an untested metric that is claimed to represent a difference between a real-world dataset and a benchmark. We don’t know the utility of the metric, nor do we know whether its apparent quantitative nature is related to anything in the real world. What we do know is that the benchmark is a strawman and that the data fed into the process render the process unacceptably prone to GIGO. Given my assessment above and summarized here, I think it’s potentially confusing to readers to repeatedly cite me as having somehow affirmed that the paper “does what it says it will do.”

Well, I’m a humanist, so I think you are worth a lot more than the time of day. I’ve never met you so I have no basis for liking you or not. As for whether you care about my opinion of the paper, that seems inconsistent with your whole post here. Don’t you think?

6 Likes

Yes, after re-reading I see I did have a misunderstanding, but it really doesn’t change that much. First though, a few comments that I offer as a reviewer to help you improve the method, but do not change the overall discussion:

  1. The binomial distribution is OK as you are using it here, but Poisson and negative-binomial models could also be used, I think.
  2. Equation 23 is a probability, and equation 25 is a conditional probability. Given that you have parallel variables for everything else, why not make #23 a conditional probability too?
  3. Equations 21 and 22 are rates of mutant organisms in the population, and not per-organism mutation rates. This will lead to underestimating the numbers of mutations in the population. What you want is simply M_s * G. This is fixable, but will change your derivation of equation #36. ETA: I might have this backwards, and you are estimating based on assumed M_s?
  4. Equation #34 need not be zero. There could be some mysterious external force inserting non-random mutations. :wink:

OK, but this still has the form of a likelihood ratio test comparing two mutation rates.
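
For readers unfamiliar with the term, a likelihood ratio test on a rate can be sketched as follows. This is a generic textbook form, not the paper's derivation: it compares an observed binomial success rate against one fixed reference rate, and the function names, the fixed-reference simplification, and the chi-square approximation with one degree of freedom are all assumptions made for illustration.

```python
import math

def _xlogy(count, prob):
    """Return count * log(prob), treating 0 * log(0) as 0."""
    return 0.0 if count == 0 else count * math.log(prob)

def binomial_lrt(successes, trials, null_rate):
    """Likelihood ratio (G) test of an observed rate against a fixed null rate.

    Returns (G statistic, approximate p-value from chi-square with 1 df).
    """
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials and trials > 0")
    if not 0 < null_rate < 1:
        raise ValueError("null_rate must lie strictly between 0 and 1")
    failures = trials - successes
    observed_rate = successes / trials
    # Log-likelihood at the maximum-likelihood rate and at the null rate;
    # the binomial coefficient cancels in the ratio, so it is omitted.
    ll_alt = _xlogy(successes, observed_rate) + _xlogy(failures, 1 - observed_rate)
    ll_null = _xlogy(successes, null_rate) + _xlogy(failures, 1 - null_rate)
    g_stat = max(0.0, 2 * (ll_alt - ll_null))
    # Survival function of chi-square with 1 df, written with erfc.
    p_value = math.erfc(math.sqrt(g_stat / 2))
    return g_stat, p_value

# Hypothetical counts: 30 successful mutations out of 100, reference rate 0.1.
g_stat, p_value = binomial_lrt(30, 100, 0.1)
```

The choice of `null_rate` carries the whole interpretation: the statistic only says the observed rate differs from whatever reference was supplied.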

Now the troublesome part - it is not stated if the function should be specified before the test or after observing the data/function. The latter suffers from the sharpshooter fallacy, and seems entirely dependent on post-hoc choice of function. The GENERAL METHOD section is written in a way that could be performed as a planned experiment, IF some function can be decided on ahead of time. I should leave that question to others, but as @sfmatheson has noted, AI doesn’t seem to test anything interesting.

If I choose to measure AI as “mutations which preserve the current function (drift)”, then I could have positive information for identical function. In the post-hoc sense, and for some mutation to new function, I could define AI to be positive or negative at my whim (maybe I don’t want E. coli that eats citrate?).
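
The dependence on the choice of function can be shown with a toy calculation. The sketch below assumes the log-ratio form of active information, log2(q / p), and uses invented outcome classes and probabilities; none of the numbers come from the paper or from the citrate experiments.

```python
import math

# Hypothetical outcome classes: (probability under the observed mutation
# spectrum, probability under a uniform spectrum). Each column sums to 1.
OUTCOMES = {
    "keeps_current_function": (0.5, 0.25),
    "gains_citrate_use": (0.125, 0.25),
    "loses_function": (0.375, 0.5),
}

def active_information_for(success_outcomes):
    """log2(q / p) when 'success' means landing in any of the named outcomes."""
    q = sum(OUTCOMES[name][0] for name in success_outcomes)
    p = sum(OUTCOMES[name][1] for name in success_outcomes)
    return math.log2(q / p)

# Same data, three definitions of success chosen by the analyst.
ai_keep = active_information_for(["keeps_current_function"])   # positive
ai_citrate = active_information_for(["gains_citrate_use"])     # negative
ai_either = active_information_for(
    ["keeps_current_function", "gains_citrate_use"]
)                                                              # small positive
```

The sign flips with the definition of success alone, which is why fixing the function before looking at the data matters for a planned experiment.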

You have pieces of a useful method here. I know this because I recognize the bits of statistical theory you have rediscovered. You have a good start towards design of experiments too, and that is the key to answering useful questions in a testable way. You also seem to understand the need to separate multiple sources of variation as in “internal and external” sources.

I don’t want to be overly critical - you really have come a long way - but there are still a few pieces missing. The connection to what you want to test is not complete yet, or at least you aren’t expressing it in a way that I can understand.

3 Likes

Dan - thanks for the comments. I will think more about how these things can be expressed and more directly connected. You are probably right that what seems clear in my mind is not necessarily fully communicated (and therefore not fully defended). Just to be sure I’ve answered your points:

#2 - I believe you are correct. I’ll look at it more to be sure, and if so I’ll see if I can get a correction issued.
#3 - It does mean what you suggest. Perhaps my terminology was muddy. The O_ variables refer to the rates of mutant organisms. By “per organism” I meant “The rate of incidence per organism”.
#4 - cute :slight_smile:

As to the positive/negative issue, I am going based on biological measures of fitness, not just “outcomes I want to see”. So, it is not “can you do X”, it is more of “can you survive”. Survival/fitness is a biological function that is not chosen by the experimenter. The experimenter chooses the survival task, but only measures the organism’s success.

SFMatheson - my reasons for posting here are several. First of all, Josh and I have some history together, and I wanted to be sure to share important parts of what I was doing here as well as other places (I also shared at UD and TSZ). Second, if there were significant problems, I certainly want to know what they are. Asking seems like a good enough way to find out. I don’t need anyone to like me better, or think my paper is significant, or interesting, or anything like that. I need to know if there is something significant that I’ve missed. What I have learned is that there haven’t been disagreements with the basic logic of the paper that I have found convincing. There has been some disagreement with my treatment of SHM, and I acknowledge that I should dig into the more recent studies on it. However, it doesn’t impact my main thesis (i.e., it is a question of whether or not I got the inputs to the process right, not whether or not the process itself works).

“I then identified the best-case scenario for what that would mean”

And what is that best-case scenario?

And what did I say I was trying to do?

I don’t really see a difference between what you said and what I said, except to maybe add “at best” to the beginning of it. Nonetheless, if you are retracting that, no worries, I’ll accept that.

Josh -

That’s good, but you should know that many people (including many, many biologists) do believe that this description represents a precise way to describe mutations.

The point is that my method provides a quantifiable way to describe this, and to test this in specific cases. Certainly, there are other ways to know this qualitatively, when there are very strong examples. The nice thing about the quantitative approach is that we can tell, in places where we are not familiar with the mechanism or where the effect isn’t as pronounced, that there is something worth investigating, as I mentioned with the example from Hofwegen et al.

So, I think I will conclude my participation in this thread, as I don’t think there is any remaining progress to be made. I believe that looking at truly random mutations provides a valid expected value to compare against, and others disagree. I think finding the point of departure of the different viewpoints is sometimes the most useful result of interactions such as this.

Thanks all for your engagement!

2 Likes

Immunologists who study VDJ recombination, such as Dr David Schatz, at Yale, are interested in finding biological explanations for why RAG recombinases target particular sites during SHM. Here is one paper describing such biological explanations for apparent non-random mutation.

Biology is complex. We could imagine various biologic explanations. Perhaps how different sites of chromatin are spatially related to one another in the nucleus. A good friend of mine tested that hypothesis, and found it to be incorrect:

However, there could be other biological explanations, such as more open areas of chromatin, suggested in this paper where they observed mutation of newly integrated DNA

or location of Immunoglobulin enhancer elements:

or location of “recombination signal sequences”

In addition, it is important to note that B cells undergoing SHM are under high selective pressure in the germinal center reactions, forcing survival of cells in which productive mutations have occurred that enhance antigen binding affinity. So only the surviving B cells are quantitated, and those are the B cells that had productive, as opposed to detrimental mutations.

6 Likes