Mung's Primer on Information Theory

In my mind, and perhaps I need a refresher, the entropy is the uncertainty. The information is the reduction in uncertainty. If this is the case then they are not synonymous.

I hope that you will acknowledge that I have a reason for asking. I’m not just asking to be contentious. If my thinking about this is mistaken, I want to know why.

Given the discussions I’ve been able to find online I don’t think my question is either ignorant or unwarranted. Some examples:

So to recap, entropy measures randomness and is proportional to information which measures uncertainty.

Statistical entropy is a probabilistic measure of uncertainty or ignorance; information is a measure of a reduction in that uncertainty.

Links available on request.

Yes they are ignorant questions. Let me show you the formula for entropy and for information.

H(p) = - \sum p(i) \ln p(i)

I(p) = - \sum p(i) \ln p(i)

Can you detect any difference in these formulas? It really seems like:

I(p) = H(p)

Good news is that our instincts are correct in this case. Information does equal entropy. They are synonyms. What happened with @EricMH is that he explained that he means mutual information whenever he says “information.”

Entropy and information are essentially exchangeable terms. The only differences that arise are in conventions. For example, it is common to use the natural logarithm for entropy and log base 2 for information (but neither is required). Information and entropy are, however, fundamentally equivalent concepts. They are the same thing.
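To make the point about conventions concrete, here is a minimal sketch. The function name `entropy` and the example distribution are made up for illustration. It evaluates the same formula once with the natural logarithm and once with log base 2.

```python
import math

def entropy(probs, base=math.e):
    """Shannon entropy -sum p(i) log p(i); zero-probability outcomes contribute 0."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

p = [0.5, 0.25, 0.25]          # hypothetical distribution, for illustration only
h_nats = entropy(p)            # natural log: units are nats
h_bits = entropy(p, base=2)    # log base 2: units are bits

print(h_nats, h_bits)
# The two values differ only by the constant factor ln(2).
print(math.isclose(h_nats, h_bits * math.log(2)))
```

Changing the base of the logarithm only rescales the number, much like switching between meters and feet.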


That’s always a great way to start off.

The mutual information of a random variable with itself is the entropy of the random variable. This is the reason that entropy is sometimes referred to as self-information.

  • Cover and Thomas p. 21
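The quoted statement can be checked numerically. This is a small sketch, assuming the textbook definition of mutual information over a joint probability table. The helper names and the example distribution are illustrative and not taken from Cover and Thomas.

```python
import math

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(X;Y) in bits from a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        pxy * math.log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

p = [0.5, 0.3, 0.2]  # hypothetical distribution for X
# Joint distribution of the pair (X, X): all probability mass sits on the diagonal.
joint_xx = [[p[i] if i == j else 0.0 for j in range(len(p))] for i in range(len(p))]

print(entropy(p), mutual_information(joint_xx))  # the two values agree
```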

Can you tell me where your two formulas came from? Did you look at the page I linked where they give the formal definition of entropy and the formal definition of information?

What am I to make of entropy as average information or expected value if entropy is the same as information? Or is that too an ignorant question?

I am genuinely trying to understand this. Can you at least try to be civil? Thank you.


Josh, I would like you to know that I have done some reading on information theory. Perhaps even a lot. When I saw it appearing in ID arguments I took an interest in it. I admit I have a difficult time following all the math. So any help is sincerely appreciated.

Last evening I cracked at least two books, one the Cover and Thomas 2nd edition and the other a tutorial introduction that I have. So I am putting some effort into trying to understand where you are getting the idea that entropy is information, that entropy=information, and that the two terms are synonymous. Perhaps I am taking too narrow a view of what you are saying.

One book described the relationship between entropy and information as “two sides of the same coin.” So perhaps you are saying that entropy is to information as heads is to tails. But that makes no better sense to me. :slight_smile:

I’m just hoping for an explanation that I can understand. Showing me two equations and pointing out that they look the same doesn’t bring understanding.

I’ll also note that in your comments to Kirk you refer to information and information content. Is there a reason you are not using the terms entropy and entropy content instead?

This is where I see confusing usage of “information”.

At least as used by Shannon, the information is the message itself, while the entropy is the amount of information. And the term “information content” should be redundant, because the information content is just the same thing as the information. However, this can get tricky. We send out information as a string of syntactic elements. The “content” can refer to that string of syntactic elements, or it can refer to the implied semantics to be derived from those syntactic elements. Because of this, there is some ambiguity in what we mean by “information”. However, the entropy is still distinct from the information.


Would it be more accurate to say that the entropy is equal to the average information?

The entropy H(X) of a random variable X is a functional of the probability distribution p(x) which measures the average amount of information contained in X, or equivalently, the amount of uncertainty removed upon revealing the outcome of X.

  • A First Course in Information Theory. p. 11

The entropy of X is sometimes called the self-information of X.

  • p. 14

…all Shannon’s information measures are special cases of conditional mutual information.

  • p. 16

@mung, in context, you were suggesting your question was appropriate in a conversation between @kirk and me because it was an informed question. I was telling you that, in fact, it was an ignorant question. It would have confused that very important thread to include it. That thread is not a place for you to work out your personal understanding of information theory.

In general, if you don’t know what is going on, stay out of threads between scientists when they are hashing out complex issues. We have side comment threads precisely to give voice to the thoughts and questions you have. If they are good and informed, they might even be promoted into the main thread.

We are happy to help you make sense of it. This thread is being started to help answer your questions about information theory generally speaking. I prefer that at this stage we refrain from bringing in ID arguments. Starting from the ID literature on this will just confuse you. We do not even need to bring in evolutionary science to get this straight. Remember, information theory was invented decades ago and has been in use for a whole host of useful things inside and outside biology, without usually making any reference to origins. There is no reason to discuss ID theory to make sense of it.

I understand the instinct of some to give some general credence to what they have heard about information theory from ID. From that instinct, it can seem rational to try and build up correspondences between what the field is saying and what ID is saying. This is a sure recipe for long term confusion. Learn information theory first, then we can discuss it in relation to ID.

We can try and help you.

No. It would not be more accurate. The entropy of a message is EQUAL to the information content. The entropy of a statistical distribution, on the other hand, has to be defined as the expectation of the information/entropy. The difference is not in information vs. entropy, but in what is being measured, a deterministic or stochastic object. Moreover, all measurements presume some stochastic/random model, even for deterministic objects, so the language becomes very bound to context. You cannot learn much of anything coherent by stringing together sentences from different places.
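One way to see the difference between measuring a single outcome and measuring a distribution is the sketch below. It assumes the standard definition -log2 p for the information content of one outcome, which is not spelled out in the post, and the source distribution is made up.

```python
import math

def surprisal(p):
    """Information content, in bits, of one outcome that has probability p."""
    return -math.log2(p)

def entropy(dist):
    """Entropy of a distribution: the probability-weighted average surprisal."""
    return sum(p * surprisal(p) for p in dist.values() if p > 0)

dist = {"a": 0.5, "b": 0.25, "c": 0.25}   # hypothetical source

for symbol, p in dist.items():
    print(symbol, surprisal(p))           # one number per individual outcome
print("entropy:", entropy(dist))          # one number for the whole distribution
```

The same formula is applied in both cases. What changes is whether it is evaluated on one outcome or averaged over all of them.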

Why is this the case?

I encourage you to read these two threads: Information = Entropy and Chance = Choice and Information = Entropy - Scientific Evidence - The BioLogos Forum.

Now the very interesting thing about information (and entropy) is that it has historically been derived in several different ways.

  1. Thermodynamics (the first derivation of entropy)
  2. Shannon Information (usually computed assuming IID)
  3. Kolmogorov Complexity
  4. Minimum Description Length
  5. Compression (e.g. integer coding)
  6. Machine Learning/Model Building (a type of lossy compression)
  7. Auto-Encoding/Dimensionality Reduction/Embeddings
  8. Dynamical systems and Equation fitting

One of the really interesting things that happened over the last several decades is the discovery that these different types of information (or approaches) are all computing the same thing and relying on the same theory of information. Of course, they often compute information in different ways, and have different strengths, weaknesses, and assumptions. But we can demonstrate that all these things actually map to one another.

It turns out that all the types of information discussed in ID can be reduced to either “entropy” or “mutual information”. Turns out all the theories above have analogues to these two measures (under specific assumptions) also. This coherence is one of the most beautiful things of information theory. If you really understand one domain, and you learn how to map between different domains, you immediately understand something salient about all of them.

A great example of this is the theoretical result (excuse the technical language) that (approximately) the minimum compression size of a message = minimum description length = IID Shannon entropy of the compressed message; moreover, the correct compression will produce a string of bits that is (1) uncorrelated and (2) 50% 1s and 0s, and therefore indistinguishable from random. Another surprising result is that the highest information content entity is noise, exactly the opposite of our intuition of what “information” actually is.
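A rough way to see the "noise has the highest information content" result is to hand a general-purpose compressor a noisy message and a regular one of the same length. This sketch uses Python's `zlib` as a stand-in for an ideal compressor, which it is not, and both test messages are invented for illustration.

```python
import random
import zlib

rng = random.Random(0)                                  # fixed seed so the run is repeatable
n = 100_000
noise = bytes(rng.getrandbits(8) for _ in range(n))     # pseudo-random bytes
pattern = b"ACGT" * (n // 4)                            # highly regular message, same length

def compressed_size(data):
    """Size in bytes after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))

print("noise:  ", len(noise), "->", compressed_size(noise))
print("pattern:", len(pattern), "->", compressed_size(pattern))
```

The regular message shrinks to a tiny fraction of its length, while the noisy one does not shrink at all. Note that these bytes come from a seeded generator, so a cleverer "compressor" that knew the generator and the seed could describe them very briefly, which is exactly the hidden-pattern caveat.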

Another very important result is that the true entropy of a sequence (its true information content independent of assumptions like IID) is uncomputable. The proof is esoteric, but the intuitive explanation is that we can never be sure that we did not miss a hidden pattern in the data that would allow us to “compress” or “understand” it better. We can never be sure that what looks like noise to us actually is noise. Equivalently, we can never be sure that what looks like information to us actually is information.

One might wonder, if all these theories map to one another, why we keep them all. Each theory comes with its own assumptions, ability (or lack thereof) to represent specific types of data, and ultimate goals. Moreover, the uncomputability result pragmatically guarantees that there is no single solution that will work in all domains. So, even though these are all connected by a common information theory, there is reason to maintain and develop each one. There is even value in cross-pollinating between them, as has happened several times in the history of these fields.
Information = Entropy - Scientific Evidence - The BioLogos Forum

The other point is that it is common to slip back and forth between different languages in discussing this. It is a type of “code switching” that becomes second nature when you really understand the field (Code-switching - Wikipedia). So, for example, when I am talking to @kirk and @EricMH, I’m doing the best I can to mirror their language, for the sake of understanding them, but occasionally I do slip and say “entropy” instead of “information,” even though the two are equivalent. It isn’t a mathematical error, and it frankly is helpful because it provokes these conversations, but it can be confusing in a negative way too.


I have an exercise for you. Take a look at this figure from the wikipedia page on mutual information:

It depicts four types of information: Information content, conditional information, joint information, and mutual information. Here are your tasks:

  1. Find the formulas for all these assuming you know the corresponding probability distributions.
  2. Identify the corresponding probability distributions for each.
  3. Identify the corresponding version of entropy, both the names and the formula.
  4. Identify the corresponding types of compression measures, both the names and the formula.
  5. For all the cases above, explain the differences in notation and convention and terminology.

If you can do that, you will have learned quite a bit. You’ll see they are all essentially equivalent.

@Patrick, you will enjoy this exercise too.

I want to make it clear that I am not getting what I know about information theory from ID. What I said was, or what I meant to communicate was, that I became interested in information theory because of ID. I have plenty of books on information theory that predate ID and that have nothing whatsoever to do with ID.

In a similar vein I became interested in topics such as entropy and the second law because of Creationist or ID arguments. But I do not derive my understanding of either entropy or the second law from Creationist or ID writings.

Currently my favorite author on these subjects is Arieh Ben-Naim. I’ve mentioned him a few times.

So could we please drop this entire “you must be misled by ID” schtick? It’s tiresome, and wrong.

Thanks for the thread though!

You’ve already clarified you think the ID information arguments are missing the mark. I agree.

This forum is not private. I’m just setting ground rules for everyone here. With this scoping, you’ll get exactly what you want. Chill out.

I would say it depicts different entropies or different measures of information. I think calling them four “types of information” confuses discussion. [and this is before I even went to the wiki page]

The wikipedia page even refers to them thusly.

“…additive and subtractive relationships various information measures associated with correlated variables X and Y.”

Nowhere does the wiki page refer to them as types of information. It speaks of the “relation to conditional and joint entropy.”

If we see H(…) can we use the term entropy and not the term information in order to keep things clear?

Not looking for a guess or your personal opinion. I’m asking for a well researched answer.

Looks like you have your work cut out for you. Have fun with this exercise.

These are not guesses or personal opinions. They are observations. I was relating facts.

The same diagram can be found on page 15 of A First Course in Information Theory. That book also uses the terms joint entropy and conditional entropy. I am fairly certain that the Cover and Thomas book I mentioned does so as well, something I will of course verify.

Why are you opposed to using the term entropy when that certainly appears to be the convention?

I’m not opposing it. ID proponents flip out when I say that information = entropy. For that reason, I try to use information so we can at least have a conversation.

Dan’s simplification
I’m adding a little bit of notation: Be aware that notation may differ between fields of application!

In Shannon Information theory, Information is the bandwidth needed to send a message from A to B, where A is some probability distribution representing all possible messages that can be sent. Common notation: H(A) denotes the information in A, and H(B) the information in B.

In statistics, the Shannon Information (SI) of a unitless Normal random variable is a function of its Variance alone. If you have ever calculated a standard deviation, then you have made a sample estimate that fixes the Shannon Information in a population, under that Normal assumption. Covariance and correlation play the same role for the Mutual Information between populations A and B, assuming a multivariate Normal distribution…
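For readers who want the exact relationship, the sketch below uses the standard closed forms for the Normal distribution, which are not stated in the post: differential entropy ½ ln(2πe σ²) and, for a bivariate Normal with correlation ρ, mutual information -½ ln(1 - ρ²). The function names are illustrative.

```python
import math

def gaussian_entropy(sigma):
    """Differential entropy (nats) of a Normal with standard deviation sigma."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

def gaussian_mutual_information(rho):
    """Mutual information (nats) of a bivariate Normal with correlation rho, |rho| < 1."""
    return -0.5 * math.log(1 - rho ** 2)

for sigma in (0.5, 1.0, 2.0):
    print("sigma =", sigma, "entropy =", gaussian_entropy(sigma))
for rho in (0.0, 0.5, 0.9):
    print("rho =", rho, "mutual information =", gaussian_mutual_information(rho))
```

Both quantities rise monotonically with the spread or the correlation, so ranking populations by standard deviation or by |ρ| ranks them by information as well, even though the numbers themselves are not equal.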

Kolmogorov Information and Algorithmic Information are the same thing, but different from Shannon Information. KI measures the compressibility of a message. Messages that can be compressed to a few bits have almost no information, while messages that cannot be compressed at all have maximal KI. File and image compression can be used as a proxy measure for KI; you can compare the sizes of compressed and uncompressed files to estimate KI using WINZIP.
Technically you should measure the size of a self-extracting file, which includes the information needed to reverse the compression. Theoretically this is just a constant that usually gets ignored.
For random messages we expect KI and SI to be equal on average (equal expectation).
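The compression proxy can be tried directly. This is a sketch only: it uses Python's `zlib` in place of WINZIP, an invented four-symbol IID source, and it ignores the constant overhead mentioned above. A practical compressor gives an upper bound on KI, so the compressed size per symbol should land somewhat above the Shannon entropy of the source rather than exactly on it.

```python
import math
import random
import zlib

rng = random.Random(1)                 # fixed seed so the run is repeatable
symbols = b"ABCD"
weights = [0.7, 0.1, 0.1, 0.1]         # hypothetical IID source
n = 200_000
message = bytes(rng.choices(symbols, weights=weights, k=n))

# Shannon entropy of the source, in bits per symbol.
shannon_bits = -sum(w * math.log2(w) for w in weights)
# Compression-based estimate, in bits per symbol (each symbol is stored as one byte).
compressed_bits = 8 * len(zlib.compress(message, 9)) / n

print("Shannon entropy:    ", shannon_bits)
print("zlib bits per symbol:", compressed_bits)
```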

KI can measure information in a single message, and with multiple messages can estimate the information in the source population. This makes KI a property of both the source population and messages from that population. Shannon Information is always a property of the population, and cannot measure the information in a single message (single observation).

Joint Information or Joint Entropy between A and B is all the information in both, or the union of the two. Commonly written H(A,B) or H(B,A), the joint information is the same either way.

Mutual Information (MI) in any formulation is the amount of information shared between populations A and B. It doesn’t belong to any single population or message; it’s always a measure between populations (SI), or either populations or messages (KI). Commonly denoted I(A;B). In statistics, assuming a multivariate Normal distribution, MI is a function of the correlation: for two variables with correlation ρ, I(A;B) = -½ ln(1 - ρ²).

Conditional Entropy (or Conditional Information) is the information about B which remains after we remove information already known about A. Commonly written H(B|A). Note: H(A) = H(A|B) + I(A;B) and H(B) = H(B|A) + I(A;B). For conditional probability see Bayes’ rule.
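The two identities in the note can be checked on a small example. The sketch below assumes the textbook definitions of conditional entropy and mutual information over a joint table. The table itself and the helper names are made up for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def conditional_entropy(joint):
    """H(A|B) in bits for joint[a][b], using p(a|b) = p(a,b) / p(b)."""
    p_b = [sum(col) for col in zip(*joint)]
    return -sum(
        p_ab * math.log2(p_ab / p_b[j])
        for row in joint
        for j, p_ab in enumerate(row)
        if p_ab > 0
    )

def mutual_information(joint):
    """I(A;B) in bits for joint[a][b]."""
    p_a = [sum(row) for row in joint]
    p_b = [sum(col) for col in zip(*joint)]
    return sum(
        p_ab * math.log2(p_ab / (p_a[i] * p_b[j]))
        for i, row in enumerate(joint)
        for j, p_ab in enumerate(row)
        if p_ab > 0
    )

# Hypothetical joint distribution p(a, b); rows index A, columns index B.
joint = [[0.4, 0.1],
         [0.1, 0.4]]
transposed = [list(col) for col in zip(*joint)]   # swaps the roles of A and B

h_a = entropy([sum(row) for row in joint])
h_b = entropy([sum(col) for col in zip(*joint)])
h_ab = entropy([p for row in joint for p in row])  # joint entropy H(A,B)
mi = mutual_information(joint)

print(math.isclose(h_a, conditional_entropy(joint) + mi))        # H(A) = H(A|B) + I(A;B)
print(math.isclose(h_b, conditional_entropy(transposed) + mi))   # H(B) = H(B|A) + I(A;B)
print(math.isclose(h_ab, h_a + h_b - mi))                        # H(A,B) = H(A) + H(B) - I(A;B)
```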

A and B are Independent if they are only related by chance. That is, knowing a message/information A doesn’t tell you if that information is shared with B. Independence between A and B implies H(A|B) = H(A) and H(B|A) = H(B).

If A and B share NO information, then they are Mutually Disjoint. Knowing a message is from A tells you that information IS NOT shared with B. In this case I(A;B) = 0.

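Relative Entropy (also Relative Information) measures how far one distribution is from another (ie: Kullback-Leibler divergence). Commonly written as D_{KL}(A||B), it is always non-negative, but it is not symmetric in A and B, so it is not a true distance. Mutual Information is itself a relative entropy: I(A;B) is the divergence between the joint distribution of A and B and the product of their marginals.
A closely related quantity that is a true distance is the Variation of Information, VI(A;B). If A and B share a high level of Mutual Information then the distance between them is small.
At risk of mixing notations:
VI(A;B) = H(A,B) - I(A;B)
VI(A;B) = H(A) + H(B) - 2I(A;B)
VI(A;B) = H(A|B) + H(B|A)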
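For reference, here is a minimal sketch of the Kullback-Leibler divergence between two discrete distributions, using the textbook definition sum p log(p/q), which is not written out in the post. The example distributions are invented. It illustrates non-negativity, shows that the value changes when the arguments are swapped, and computes mutual information as the divergence between a joint table and the product of its marginals.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]      # hypothetical distributions over the same two outcomes
q = [0.9, 0.1]

print(kl_divergence(p, q), kl_divergence(q, p))   # both non-negative, but not equal
print(kl_divergence(p, p))                        # a distribution has zero divergence from itself

# Mutual information as a divergence: joint table versus product of marginals.
joint = [[0.4, 0.1],
         [0.1, 0.4]]
p_a = [sum(row) for row in joint]
p_b = [sum(col) for col in zip(*joint)]
flat_joint = [p_ab for row in joint for p_ab in row]
flat_product = [pa * pb for pa in p_a for pb in p_b]
mi = kl_divergence(flat_joint, flat_product)
print(mi)
```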

None of this has anything to do with the “meaning” of a message, which is common usage for information for most people.

[let me know if you spot bugs so I can fix it]


Can you add functional information to the list?

No. It is not a part of information theory in the usual sense. It would be better for you to figure out how FI maps to information theory. If you can’t do that, you do not know what it is.
