Jonathan Bartlett: Measuring Active Information in Biological Systems

swamidass · April 22, 2020, 2:45am

A post was merged into an existing topic: Comments on Jonathan Bartlett’s Office Hours

johnnyb · April 22, 2020, 3:03am

You haven’t actually said specifically what the problem is, except when misstating what I’m doing. And, Matheson, who seemed to understand what I was doing, wound up agreeing that the procedure did what it said it should do. So, you shouldn’t use “us” as if this represented all present. It just isn’t the case.

That’s literally true of everything, including throwing a baseball to a target. Nobody would say that the baseball player threw randomly with respect to the target.

Except that people do, and they mean something by it. If you don’t, great! It actually seems you are agreeing with me, but disagreeing about what other people mean. Let’s say that no one ever talked about whether mutations are random with respect to function. That doesn’t mean one can’t recognize a bias towards function, and recognize that this indicates that there must be a reason for a significant regularized bias towards function.

But, in truth, I bet that you couldn’t get Larry Moran to say that mutations are biased in a way that leads towards function. It sounds to me like he means every word he says about it, and many others, too. When he and I discussed it, the only thing he would admit to is bias, but specifically not bias towards function.

More so now than before. Every argument against my thesis seems to think that I’m presuming that mutations should be randomly distributed, or that biologists are saying that mutations should be randomly distributed. That is 100% not the case, but the fact that the common link between the people who disagree with me is that they misunderstand my arguments seems to indicate that there is nothing wrong with my thesis, though perhaps it does say that I am poor at communicating it.

Perhaps for a different way of communicating it, let me walk through it in a backwards fashion, and see if there is more clarity.

Let’s presume, counter-factually, that mutations did have a uniform random distribution. In such a case, would it be correct to say that they are “random” with respect to fitness? I think so. In fact, in such a case, it would be correct to say that they are random with respect to just about anything, fitness included. If this wouldn’t count as “random with respect to fitness”, I don’t know what would.

Therefore, we know what the fitness effects of “random with respect to fitness” look like. We can then measure whether particular biases are biased towards function, away from function, or biased in a way that does not deviate from randomness towards function. These are all performable, measurable actions that have definite meanings based on the terms being used. We can even measure whether the specific biases that actually occur in organisms are equivalent to being random towards fitness or whether they are biased one way or the other.

This line of reasoning only relies on two questions - (1) if mutations were randomly distributed, would they be random with respect to fitness? I believe the answer would be yes. (2) Can we measure the fitness of actual mutations? I believe the answer here would be yes. Since we can answer (1) and (2), we can also compare the measurements. If (2) has greater fitness than (1), then in what sense is “random with respect to fitness” true?
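The comparison of (1) and (2) can be sketched numerically. This is only an illustration: the counts are hypothetical, the function names are invented for the example, and the log2 ratio is assumed as the form of the metric rather than taken from the paper's equations.

```python
import math


def success_fraction(successes: int, trials: int) -> float:
    """Fraction of sampled mutations that improved fitness."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    return successes / trials


def log2_ratio(p_observed: float, p_baseline: float) -> float:
    """Bits by which the observed success rate differs from the baseline.

    Positive: biased towards function. Negative: biased away from it.
    Zero: no deviation from the baseline on this measure.
    """
    if p_observed <= 0 or p_baseline <= 0:
        raise ValueError("both probabilities must be positive")
    return math.log2(p_observed / p_baseline)


# Hypothetical counts, chosen only to keep the arithmetic easy to follow.
p_uniform = success_fraction(16, 1024)  # question (1): uniform-random baseline
p_actual = success_fraction(64, 1024)   # question (2): observed mutations
bias_bits = log2_ratio(p_actual, p_uniform)
print(bias_bits)  # 2.0
```

With these made-up numbers the observed mutations succeed four times as often as the uniform baseline, which is 2 bits. Whether the uniform baseline is the right comparison is exactly what the rest of the thread disputes.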

swamidass · April 22, 2020, 3:07am

2 is greater than 1. I’ve already agreed that “random with respect to fitness” is a horrible way to explain this and it is not true in a mathematically precise way. It is true that this is an incorrect claim, but for a different reason than you’ve laid out.

If your point is “mutations are not random with respect to fitness,” well that is as flawed as saying “mutations are random with respect to fitness.” They are random (not totally predictable) in both cases.

Actually, mathematical modelers would model this as a random variable, which is not independent of the target. It would be “random event even though it is dependent on the target.” The whole “random with respect to” is not a sensible way of describing this.
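A small exact example of "random but not independent of the target", assuming a toy model with two sites and a hypothetical joint distribution (none of these numbers come from data):

```python
from fractions import Fraction
from itertools import product

# Hypothetical joint distribution over (beneficial_site, mutated_site).
joint = {
    ("A", "A"): Fraction(3, 8),
    ("A", "B"): Fraction(1, 8),
    ("B", "A"): Fraction(1, 8),
    ("B", "B"): Fraction(3, 8),
}


def marginal(dist, index, value):
    """Sum the joint probabilities where component `index` equals `value`."""
    return sum(p for key, p in dist.items() if key[index] == value)


def is_independent(dist):
    """True if every joint probability equals the product of its marginals."""
    targets = {key[0] for key in dist}
    sites = {key[1] for key in dist}
    return all(
        dist.get((t, s), Fraction(0)) == marginal(dist, 0, t) * marginal(dist, 1, s)
        for t, s in product(targets, sites)
    )


# Probability the mutation lands on site A, given that A is the beneficial site.
p_hit_given_target_a = joint[("A", "A")] / marginal(joint, 0, "A")
print(is_independent(joint), p_hit_given_target_a)  # False 3/4
```

The mutated site is still a random variable (it hits the beneficial site only 3/4 of the time), yet it is not independent of the target.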

If your point is that “mutations are not independent of fitness,” yes that is correct. I think that is what you mean. Once again, we know this from observations other than the argument you’ve made here.

sfmatheson · April 22, 2020, 4:01pm

Oh dear, that’s potentially misleading though I don’t think you intended it that way. What I wrote is:

I then identified the best-case scenario for what that would mean. I do NOT think that your math is likely to generate insights, because “some expectation” refers to what is rather clearly a strawman.

This is what your paper reports: Some pretty basic math that generates an untested metric that is claimed to represent a difference between a real-world dataset and a benchmark. We don’t know the utility of the metric, nor do we know whether its apparent quantitative nature is related to anything in the real world. What we do know is that the benchmark is a strawman and that the data fed into the process render the process unacceptably prone to GIGO. Given my assessment above and summarized here, I think it’s potentially confusing to readers to repeatedly cite me as having somehow affirmed that the paper “does what it says it will do.”

Well, I’m a humanist, so I think you are worth a lot more than the time of day. I’ve never met you so I have no basis for liking you or not. As for whether you care about my opinion of the paper, that seems inconsistent with your whole post here. Don’t you think?

Dan_Eastwood · April 22, 2020, 10:19pm

Yes, after re-reading I see I did have a misunderstanding, but it really doesn’t change that much. First though, a few comments that I offer as a reviewer to help you improve the method, but do not change the overall discussion:

1. The binomial distribution is OK as you are using it here, but Poisson and negative-binomial models could also be used, I think.
2. Equation 23 is a probability, and equation 25 is a conditional probability. Given that you have parallel variables for everything else, why not make #23 a conditional probability too?
3. Equations 21 and 22 are rates of mutant organisms in the population, and not per-organism mutation rates. This will lead to underestimating the numbers of mutations in the population. What you want is simply M_s * G. This is fixable, but will change your derivation of equation #36. ETA: I might have this backwards, and you are estimating based on assumed M_s?
4. Equation #34 need not be zero. There could be some mysterious external force inserting non-random mutations.

OK, but this still has the form of a likelihood ratio test comparing two mutation rates.
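For readers unfamiliar with the form, here is a minimal sketch of a likelihood ratio test comparing two binomial rates. It is a generic textbook construction, not the paper's derivation; the function names and the counts are hypothetical.

```python
import math


def _xlogy(x: float, y: float) -> float:
    """x * log(y), with the convention that 0 * log(0) is 0."""
    return 0.0 if x == 0 else x * math.log(y)


def binom_loglik(k: int, n: int, p: float) -> float:
    """Binomial log-likelihood without the constant binomial coefficient."""
    return _xlogy(k, p) + _xlogy(n - k, 1.0 - p)


def likelihood_ratio_test(k1: int, n1: int, k2: int, n2: int):
    """Test H0: both samples share one rate, against H1: separate rates.

    Returns (G statistic, p-value from a chi-square with 1 degree of freedom).
    """
    p1, p2 = k1 / n1, k2 / n2
    p0 = (k1 + k2) / (n1 + n2)  # pooled rate under the null
    ll_null = binom_loglik(k1, n1, p0) + binom_loglik(k2, n2, p0)
    ll_alt = binom_loglik(k1, n1, p1) + binom_loglik(k2, n2, p2)
    g = max(2.0 * (ll_alt - ll_null), 0.0)
    # Survival function of chi-square(1): P(Z^2 > g) = erfc(sqrt(g / 2)).
    return g, math.erfc(math.sqrt(g / 2.0))


# Hypothetical counts: 40 vs 10 mutants of interest per 10,000 organisms.
g_stat, p_value = likelihood_ratio_test(40, 10_000, 10, 10_000)
```

The chi-square approximation is asymptotic, so with very small counts an exact test would be the safer choice.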

Now the troublesome part - it is not stated if the function should be specified before the test or after observing the data/function. The latter suffers from the sharpshooter fallacy, and seems entirely dependent on post-hoc choice of function. The GENERAL METHOD section is written in a way that could be performed as a planned experiment, IF some function can be decided on ahead of time. I should leave that question to others, but as @sfmatheson has noted, AI doesn’t seem to test anything interesting.

If I choose to measure AI as “mutations which preserve the current function (drift)”, then I could have positive information for identical function. In the post-hoc sense, and for some mutation to new function, I could define AI to be positive or negative at my whim (maybe I don’t want E. coli that eats citrate?).
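A toy illustration of that dependence on the chosen function, assuming AI takes the form of a log2 ratio of observed to baseline probability (an assumption for this sketch) and using made-up outcome categories and probabilities:

```python
import math

# Hypothetical outcome probabilities for one and the same set of mutations.
baseline = {"gains_function": 0.125, "keeps_function": 0.25, "loses_function": 0.625}
observed = {"gains_function": 0.25, "keeps_function": 0.5, "loses_function": 0.25}


def log2_ratio_for(target: str) -> float:
    """Log2 ratio of observed to baseline probability for the chosen target."""
    return math.log2(observed[target] / baseline[target])


# Same data, three different post-hoc choices of "the function".
scores = {target: log2_ratio_for(target) for target in baseline}
for target, bits in scores.items():
    print(f"{target}: {bits:+.2f} bits")
```

Picking "keeps_function" or "gains_function" as the target gives a positive score, while picking "loses_function" gives a negative one, even though the data never changed. Fixing the target before looking at the data removes that freedom.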

You have pieces of a useful method here. I know this because I recognize the bits of statistical theory you have rediscovered. You have a good start towards design of experiments too, and that is the key to answering useful questions in a testable way. You also seem to understand the need to separate multiple sources of variation as in “internal and external” sources.

I don’t want to be overly critical - you really have come a long way - but there are still a few pieces missing. The connection to what you want to test is not complete yet, or at least you aren’t expressing it in a way that I can understand.

johnnyb · April 24, 2020, 1:28am

Dan - thanks for the comments. I will think more about how these things can be expressed and more directly connected. You are probably right that what seems clear in my mind is not necessarily fully communicated (and therefore not fully defended). Just to be sure I’ve answered your points:

#2 - I believe you are correct. I’ll look at it more to be sure, and if so I’ll see if I can get a correction issued.
#3 - It does mean what you suggest. Perhaps my terminology was muddy. The O_ variables refer to the rates of mutant organisms. By “per organism” I meant “The rate of incidence per organism”.
#4 - cute

As to the positive/negative issue, I am going based on biological measures of fitness, not just “outcomes I want to see”. So, it is not “can you do X”, it is more of “can you survive”. Survival/fitness is a biological function that is not chosen by the experimenter. The experimenter chooses the survival task, but only measures the organism’s success.

SFMatheson - my reasons for posting here are several. First of all, Josh and I have some history together, and I wanted to be sure to share important parts of what I was doing here as well as other places (I also shared at UD and TSZ). Second, if there were significant problems, I certainly want to know what they are. Asking seems like a good enough way to find out. I don’t need anyone to like me better, or think my paper is significant, or interesting, or anything like that. I need to know if there is something significant that I’ve missed. What I have learned is that there haven’t been disagreements with the basic logic of the paper that I have found convincing. There has been some disagreement with my treatment of SHM, and I acknowledge that I should dig into the more recent studies on it. However, it doesn’t impact my main thesis (i.e., it is a question of whether or not I got the inputs to the process right, not whether or not the process itself works).

“I then identified the best-case scenario for what that would mean”

And what is that best-case scenario?

And what did I say I was trying to do?

I don’t really see a difference between what you said and what I said, except to maybe add “at best” to the beginning of it. Nonetheless, if you are retracting that, no worries, I’ll accept that.

Josh -

That’s good, but you should know that many people (including many, many biologists) do believe that this description represents a precise way to describe mutations.

The point is that my method provides a quantifiable way to describe this, and to test this in specific cases. Certainly, there are other ways to know this qualitatively, when there are very strong examples. The nice thing about the quantitative approach is that we can tell, in places where we are not familiar with the mechanism or where the effect isn’t as pronounced, that there is something worth investigating, as I mentioned with the example from Hofwegen et al.

So, I think I will conclude my participation in this thread, as I don’t think there is any remaining progress to be made. I believe that looking at truly random mutations provides a valid expected value to compare against, and others disagree. I think finding the point of departure of the different viewpoints is sometimes the most useful result of interactions such as this.

Thanks all for your engagement!

Michelle · April 26, 2020, 12:53pm

Immunologists who study VDJ recombination, such as Dr David Schatz, at Yale, are interested in finding biological explanations for why RAG recombinases target particular sites during SHM. Here is one paper describing such biological explanations for apparent non-random mutation.

ncbi.nlm.nih.gov

RAG Represents a Widespread Threat to the Lymphocyte Genome.

G Teng, Y Maman, W Resch, M Kim, A Yamane, J Qian, KR Kieffer-Kwon, M Mandal, Y Ji, E Meffre, MR Clark, LG Cowell, R Casellas and DG Schatz, Cell, Aug 2015

The RAG1 endonuclease, together with its cofactor RAG2, is essential for V(D)J recombination but is a potent threat to genome stability. The sources of RAG1 mis-targeting and the mechanisms that have evolved to suppress it are poorly understood. Here, we report that RAG1 associates with chromatin at thousands of active promoters and enhancers in the genome of developing lymphocytes. The mouse and human genomes appear to have responded by reducing the abundance of "cryptic" recombination signals near RAG1 binding sites. This depletion operates specifically on the RSS heptamer, whereas nonamers are enriched at RAG1 binding sites. Reversing this RAG-driven depletion of cleavage sites by insertion of strong recombination signals creates an ectopic hub of RAG-mediated V(D)J recombination and chromosomal translocations. Our findings delineate rules governing RAG binding in the genome, identify areas at risk of RAG-mediated damage, and highlight the evolutionary struggle to accommodate programmed DNA damage in developing lymphocytes.

Biology is complex, and we could imagine various biological explanations. One is how different sites of chromatin are spatially related to one another in the nucleus. A good friend of mine tested that hypothesis, and found it to be incorrect:

ncbi.nlm.nih.gov

AID-targeting and hypermutation of non-immunoglobulin genes does not correlate with proximity to immunoglobulin genes in germinal center B cells.

HS Gramlich, T Reisbig and DG Schatz, PloS one, 2012

Upon activation, B cells divide, form a germinal center, and express the activation induced deaminase (AID), an enzyme that triggers somatic hypermutation of the variable regions of immunoglobulin (Ig) loci. Recent evidence indicates that at least 25% of expressed genes in germinal center B cells are mutated or deaminated by AID. One of the most deaminated genes, c-Myc, frequently appears as a translocation partner with the Ig heavy chain gene (Igh) in mouse plasmacytomas and human Burkitt's lymphomas. This indicates that the two genes or their double-strand break ends come into close proximity at a biologically relevant frequency. However, the proximity of c-Myc and Igh has never been measured in germinal center B cells, where many such translocations are thought to occur. We hypothesized that in germinal center B cells, not only is c-Myc near Igh, but other mutating non-Ig genes are deaminated by AID because they are near Ig genes, the primary targets of AID. We tested this "collateral damage" model using 3D-fluorescence in situ hybridization (3D-FISH) to measure the distance from non-Ig genes to Ig genes in germinal center B cells. We also made mice transgenic for human MYC and measured expression and mutation of the transgenes. We found that there is no correlation between proximity to Ig genes and levels of AID targeting or gene mutation, and that c-Myc was not closer to Igh than were other non-Ig genes. In addition, the human MYC transgenes did not accumulate mutations and were not deaminated by AID. We conclude that proximity to Ig loci is unlikely to be a major determinant of AID targeting or mutation of non-Ig genes, and that the MYC transgenes are either missing important regulatory elements that allow mutation or are unable to mutate because their new nuclear position is not conducive to AID deamination.

However, there could be other biological explanations, such as more open areas of chromatin, as suggested in this paper, where they observed mutation of newly integrated DNA:

ncbi.nlm.nih.gov

Activation-induced cytidine deaminase-mediated sequence diversification is transiently targeted to newly integrated DNA substrates.

SY Yang, SD Fugmann, HS Gramlich and DG Schatz, The Journal of biological chemistry, Aug 31 2007

The molecular features that allow activation-induced cytidine deaminase (AID) to target Ig and certain non-Ig genes are not understood, although transcription has been implicated as one important parameter. We explored this issue by testing the mutability of a non-Ig transcription cassette in Ig and non-Ig loci of the chicken B cell line DT40. The cassette did not act as a stable long term mutation target but was able to be mutated in an AID-dependent manner for a limited time post-integration. This indicates that newly integrated DNA has molecular characteristics that render it susceptible to modification by AID, with implications for how targeting and mis-targeting of AID occurs.

or location of Immunoglobulin enhancer elements:

ncbi.nlm.nih.gov

Targeting of somatic hypermutation by immunoglobulin enhancer and enhancer-like sequences.

JM Buerstedde, J Alinikula, H Arakawa, JJ McDonald and DG Schatz, PLoS biology, Apr 2014

Somatic hypermutation (SH) generates point mutations within rearranged immunoglobulin (Ig) genes of activated B cells, providing genetic diversity for the affinity maturation of antibodies. SH requires the activation-induced cytidine deaminase (AID) protein and transcription of the mutation target sequence, but how the Ig gene specificity of mutations is achieved has remained elusive. We show here using a sensitive and carefully controlled assay that the Ig enhancers strongly activate SH in neighboring genes even though their stimulation of transcription is negligible. Mutations in certain E-box, NFκB, MEF2, or Ets family binding sites--known to be important for the transcriptional role of Ig enhancers--impair or abolish the activity. Full activation of SH typically requires a combination of multiple Ig enhancer and enhancer-like elements. The mechanism is evolutionarily conserved, as mammalian Ig lambda and Ig heavy chain intron enhancers efficiently stimulate hypermutation in chicken cells. Our results demonstrate a novel regulatory function for Ig enhancers, indicating that they either recruit AID or alter the accessibility of the nearby transcription units.

or location of “recombination signal sequences”

ncbi.nlm.nih.gov

The in vivo pattern of binding of RAG1 and RAG2 to antigen receptor loci.

Y Ji, W Resch, E Corbett, A Yamane, R Casellas and DG Schatz, Cell, Apr 30 2010

The critical initial step in V(D)J recombination, binding of RAG1 and RAG2 to recombination signal sequences flanking antigen receptor V, D, and J gene segments, has not previously been characterized in vivo. Here, we demonstrate that RAG protein binding occurs in a highly focal manner to a small region of active chromatin encompassing Ig kappa and Tcr alpha J gene segments and Igh and Tcr beta J and J-proximal D gene segments. Formation of these small RAG-bound regions, which we refer to as recombination centers, occurs in a developmental stage- and lineage-specific manner. Each RAG protein is independently capable of specific binding within recombination centers. While RAG1 binding was detected only at regions containing recombination signal sequences, RAG2 binds at thousands of sites in the genome containing histone 3 trimethylated at lysine 4. We propose that recombination centers coordinate V(D)J recombination by providing discrete sites within which gene segments are captured for recombination.

In addition, it is important to note that B cells undergoing SHM are under high selective pressure in the germinal center reactions, forcing survival of cells in which productive mutations have occurred that enhance antigen binding affinity. So only the surviving B cells are quantitated, and those are the B cells that had productive, as opposed to detrimental mutations.
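The effect of counting only survivors can be sketched with Bayes' rule. All numbers and names below are hypothetical and serve only to show the direction and size of the distortion:

```python
from fractions import Fraction


def beneficial_fraction_among_survivors(p_beneficial, survive_if_beneficial, survive_otherwise):
    """Share of surviving cells that carry a beneficial mutation."""
    kept_beneficial = p_beneficial * survive_if_beneficial
    kept_other = (1 - p_beneficial) * survive_otherwise
    return kept_beneficial / (kept_beneficial + kept_other)


# Hypothetical: 1% of mutated B cells bind antigen better before selection;
# those cells survive 90% of the time, all others 10% of the time.
before = Fraction(1, 100)
after = beneficial_fraction_among_survivors(before, Fraction(9, 10), Fraction(1, 10))
print(before, after)  # 1/100 1/12
```

In this toy case, selection alone raises the apparent share of productive mutations from 1% to about 8%, with no change in how the mutations were generated.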
