Genetic Mutations and Information Entropy

The article below reports a linear decrease in information entropy (IE) in SARS-CoV-2. I guess that this observation can be explained by mutational bias. Isn’t it the case that such a linear decrease in IE squares well with genetic entropy (GE)?

If there were a corresponding decrease in fitness, then maybe. There are other possible causes of this decrease, such as a single strain becoming more prevalent over time. I don’t have time to sort it out this morning.

A sample size of 17 is not adequate to establish anything as any sort of “law”. That is hyperbole.

2 Likes

This paper is … bad. The information they are measuring is simply the frequency of GCAT in each window (sub-sequence). Information is minimized when all letters are equally likely. In table 2 where the window size is 103, information would be minimized if there were 25.75 of each letter.

In the first row of the table the new mutation is “G”. There are 15 “G” nucleotides (not counting the mutation itself). Because there are fewer G than would be expected from a fully random distribution, the number of Gs is more likely to increase than decrease, making the distribution of GCAT letters more uniform, and so more predictable, decreasing the information.

This is the case for every row of the tables, and they specifically state that information did not always decrease.

Note that in figure 2 the Y-axis does not stay the same, but that IE decreases with window size, and becomes more random across the samples at the same time. Decreasing average IE is expected as the window sizes get smaller (less info per window), as is increased variability between samples.
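As a rough check of the window-size point, here is a minimal sketch. It assumes a purely random sequence with equal base frequencies (not the real SARS-CoV-2 data) and estimates entropy from base counts in windows of two sizes. The function names and the window sizes 20 and 1000 are illustrative choices, not anything taken from the paper.

```python
import math
import random
import statistics
from collections import Counter


def window_entropy(window):
    """Shannon entropy (bits) of the base frequencies in one window."""
    counts = Counter(window)
    total = len(window)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def entropy_stats(window_size, n_windows=200, seed=0):
    """Mean and standard deviation of entropy over random windows.

    Assumption: every base is drawn independently and uniformly from ACGT.
    """
    rng = random.Random(seed)
    values = [
        window_entropy(rng.choices("ACGT", k=window_size))
        for _ in range(n_windows)
    ]
    return statistics.mean(values), statistics.stdev(values)


small_mean, small_sd = entropy_stats(window_size=20)
large_mean, large_sd = entropy_stats(window_size=1000)
print(f"window=20:   mean={small_mean:.3f} bits, sd={small_sd:.3f}")
print(f"window=1000: mean={large_mean:.3f} bits, sd={large_sd:.3f}")
```

Under these assumptions the small windows should show a lower average entropy and a wider spread, even though the underlying sequence is equally random in both cases.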

So to sum up, the authors have no understanding of information theory.

I left a short comment on the article at bioRxiv.

4 Likes

The article was uploaded in 2022 to bioRxiv. It hasn’t been published in a real journal AFAIK. I explain this observation by the paper being worthless.

Come on, Gil. Only 17 sequences from the thousands we have?

2 Likes

Defenders of ‘genetic entropy’ have consistently stated that the way they use the word ‘entropy’ is not based on Shannon’s information theory. If it were, genetic entropy would have a definable unit (e.g. bits) and would be easily falsifiable.

Just a toy example to illustrate this. Let’s say you have a gene of 50 base pairs, and let’s say that the identity of each nucleotide is fully random: A or T or G or C are equally likely. That would mean each nucleotide has an entropy of log2(4) = 2 bits, for 100 bits of entropy in total. The sequence has no Shannon information: 0 bits.

However, what if you were dealing with a 50 bp gene that has one consistent sequence in a population, except for one single-nucleotide polymorphism (SNP) with equal instances of A, T, G, and C at this site… then the gene carries 98 bits of Shannon information with 2 bits of Shannon entropy.

Now suppose selection (or drift) makes the nucleotides at the SNP unequal, e.g. A becomes the most common one at 50% of the population, while T, G and C each have ~16.7%. The entropy formula has to be calculated for each nucleotide and summed:

H(A) = -0.5 x log2(0.5) = 0.5
H(T) = -1/6 x log2(1/6) = ~0.431
H(G) = -1/6 x log2(1/6) = ~0.431
H(C) = -1/6 x log2(1/6) = ~0.431

Total entropy = ~1.792 bits

In other words, selection (or drift) has decreased entropy from 2.000 bits to ~1.792 bits, with a corresponding increase of information from 98.000 bits to ~98.208 bits. If A is fixed (probability 1.00), then the total entropy becomes:

H(A) = -1 x log2(1) = 0 bits

And information content is maximized: 100 bits
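The arithmetic in this toy example can be checked with a short sketch. It follows the convention used above, where information is the maximum possible entropy (100 bits) minus the observed entropy; the function name and the scenario labels are only illustrative.

```python
import math


def site_entropy(freqs):
    """Shannon entropy (bits) of the nucleotide frequencies at one site."""
    return sum(-p * math.log2(p) for p in freqs if p > 0)


MAX_BITS = 50 * 2  # 50 sites, at most log2(4) = 2 bits each

scenarios = {
    "equal A/T/G/C": [0.25, 0.25, 0.25, 0.25],
    "A at 50%": [0.5, 1 / 6, 1 / 6, 1 / 6],
    "A fixed": [1.0, 0.0, 0.0, 0.0],
}

for label, freqs in scenarios.items():
    # The other 49 sites are fixed, so they contribute 0 bits of entropy.
    h = site_entropy(freqs)
    print(f"{label}: entropy = {h:.3f} bits, information = {MAX_BITS - h:.3f} bits")
```

Summing the unrounded terms gives about 1.7925 bits of entropy for the 50% case.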

So, every time a nucleotide is fixed by natural selection or drift, information increases and entropy decreases. This is not what people like Sanford would agree with, and they would not use the unit of bits to define ‘genetic entropy’ (it does not have a defined unit as of now).

3 Likes

And we have a pretty good reason to believe there was an increase in fitness. Which is consistent with a decrease in information entropy (although you raise valid points about how well that has actually been assessed), since it could indicate an increase in mutual information with the environment. Or becoming more fit to the environment, one might say.

3 Likes

You got it backwards. Shannon entropy (IE) is:

  • Maximized when all outcomes are equally probable — maximum uncertainty, maximum unpredictability
  • Minimized (approaching zero) when one outcome dominates — minimum uncertainty, the result is highly predictable

In the context of the article, a genome window where all four bases (G, C, A, T) appear with equal frequency (25% each) has maximum Shannon entropy. A window dominated by say 90% U’s has low entropy.
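A minimal sketch of that calculation, using base counts for one window. The two example windows of 100 bases are hypothetical and chosen only to match the description above; they are not taken from the article's tables.

```python
import math


def entropy_from_counts(counts):
    """Shannon entropy (bits) from a mapping of base -> count."""
    total = sum(counts.values())
    return sum(
        -(n / total) * math.log2(n / total)
        for n in counts.values()
        if n > 0
    )


# Hypothetical windows of 100 bases, chosen only for illustration.
uniform = {"G": 25, "C": 25, "A": 25, "U": 25}
skewed = {"G": 4, "C": 3, "A": 3, "U": 90}

print(f"uniform window: {entropy_from_counts(uniform):.3f} bits")
print(f"90% U window:   {entropy_from_counts(skewed):.3f} bits")
```

The same function can be pointed at the letter counts of any window to check a reported IE value, provided the counts and the log base (2 here) match what the authors used.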

I stand corrected, but everything I wrote about that article is still true. He shows the IE decreasing as the distribution becomes more uniform. Why is that?

And @Giltil

When I calculate IE based on the letter counts in table 2 I get different values for “IE Mutated”. I used base 2, so something else is wrong. I have:
1.805849332
1.95090
1.977878133
1.922572064
(Gil: what do you get?).

I can’t say if there is an increase or decrease since I don’t know the original letter, but since the new letter is always C or G, and A and T are most frequent, it’s most likely an increase for all four.

I don’t even know how we could relate information to fitness for this. It might correspond to information in the environment related to the host, but that environment would be constantly changing as people gain immunity to older strains.

In any case, the trend shown in this paper, increasing or decreasing, is an artifact related to number of mutations and nucleotide frequency.

@Giltil: An honest question: If GE is a real thing, and assuming we could reliably measure IE in a population, should IE increase or decrease over time?
My answer: I think we should expect a “regression to the mean” GCAT frequencies as more random mutations accumulate in parts of the non-functional genome. That could be an increase or a decrease depending on the initial frequencies in the genome. GE might increase IE in one population and decrease it in another.
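To illustrate how the direction can differ, here is a toy sketch. It assumes a very simple model in which a fixed fraction of sites per generation is replaced by a base drawn from a mutational equilibrium, with no selection or drift. The equilibrium frequencies, starting frequencies, rate, and function names are all hypothetical.

```python
import math


def entropy(freqs):
    """Shannon entropy (bits) of a base-frequency distribution."""
    return sum(-p * math.log2(p) for p in freqs if p > 0)


def mutate(freqs, equilibrium, rate, generations):
    """Move base frequencies toward the mutational equilibrium.

    Assumption: each generation a fraction `rate` of sites is replaced by a
    base drawn from `equilibrium`; selection and drift are ignored.
    """
    for _ in range(generations):
        freqs = [(1 - rate) * f + rate * e for f, e in zip(freqs, equilibrium)]
    return freqs


# Hypothetical numbers, order G, C, A, T; an AT-biased mutational equilibrium.
equilibrium = [0.2, 0.2, 0.3, 0.3]
starts = {
    "starts uniform": [0.25, 0.25, 0.25, 0.25],
    "starts very AT-rich": [0.05, 0.05, 0.45, 0.45],
}

for label, start in starts.items():
    end = mutate(start, equilibrium, rate=0.01, generations=1000)
    print(f"{label}: {entropy(start):.3f} -> {entropy(end):.3f} bits")
```

In this toy model both populations end up at the same frequencies, so the one that starts uniform loses entropy and the one that starts very AT-rich gains it.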

This is about the point where Joe Felsenstein shows up to say that Shannon Information has no relation to fitness. :wink:

Back to …

(although you raise valid points about how well that has actually been assessed)

If it could be assessed, we still need a single population over time, and that population is branching with every new mutation. We would need to measure the distribution of IE over time, not single samples over time. This becomes problematic when part of the virus population mutates into a significantly different strain.

1 Like

[Poof! I appear in a puff of smoke]. Yes, sort-of. In a simple case with (say) 2 alleles, A and a, where A has higher fitness, if we start with A at frequency 0.01, as natural selection raises it to 0.50, the entropy goes up, and so does the fitness. When we continue having the frequency of A increase, entropy goes down as the fitness increases. That is why we look for other measures of information such as functional information.
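The two-allele case can be sketched numerically. This assumes a haploid population and made-up fitness values (1.1 for A, 1.0 for a); the function names are illustrative only.

```python
import math


def allele_entropy(p):
    """Shannon entropy (bits) of a two-allele locus where A has frequency p."""
    return sum(-q * math.log2(q) for q in (p, 1 - p) if q > 0)


def mean_fitness(p, w_A=1.1, w_a=1.0):
    """Mean fitness of a haploid population (fitness values are hypothetical)."""
    return p * w_A + (1 - p) * w_a


for p in (0.01, 0.25, 0.50, 0.75, 0.99):
    print(
        f"freq(A)={p:.2f}  entropy={allele_entropy(p):.3f} bits  "
        f"mean fitness={mean_fitness(p):.3f}"
    )
```

Mean fitness rises steadily with the frequency of A, while entropy peaks at 1 bit when the frequency is 0.50 and falls on either side, so the two do not track each other.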

4 Likes

Actually, you were still right Dan. You stated “(Shannon) information is minimized”, which is equivalent to stating that “(Shannon) entropy is maximized”.

1 Like

First, not in any way related to what I said.

Second, for that observation to be consistent with GE, there would have to be a corresponding decrease in fitness, which has not happened.

Third, this could also be explained by selection.

1 Like

That paper is calculating IE incorrectly, and the trend is an artifact. IOW it’s not even wrong.

@Giltil Please verify that my calculations of IE at #151 are correct (or not). There is no point in arguing if the paper is just plain wrong.

3 Likes

I made a sub-topic for this, because the original thread is cluttered and there are a few points we might follow up.

1 Like

@Giltil a couple of things:

  1. There is a published article that includes part of the unpublished abstract. I think this same article came up a while back during the discussion of Assembly Theory.
  2. I suspect the same errors are being made in the published paper, but I will need to look up those same sequences to verify.
  3. I would still like for you to verify my IE calculations (see above). The error here is Vopson’s, I think.

I figured it might be that (it is a common point of confusion). It didn’t bother me because that pre-print is wrong either way. I might write a letter to the journal pointing out the errors, but I want to get all my ducks in a row before I make a fuss about it.