No, it would not be more accurate. The entropy of a message is EQUAL to its information content. The entropy of a statistical distribution, on the other hand, has to be defined as the expectation of the information/entropy. The difference is not information vs. entropy, but what is being measured: a deterministic object or a stochastic one. Moreover, all measurements presume some stochastic/random model, even for deterministic objects, so the language becomes very bound to context. You cannot learn much of anything coherent by stringing together sentences from different places.
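To be concrete, these are the standard textbook definitions, assuming a discrete probability model p (nothing beyond the usual conventions is being added here):

```latex
% Information content (surprisal) of one specific message x:
I(x) = -\log_2 p(x)

% Entropy of the distribution: the expectation of that same quantity,
% averaged over all messages the model can produce:
H(X) = \mathbb{E}\,[I(X)] = -\sum_x p(x) \log_2 p(x)
```

The first quantity is attached to a single, fixed message; the second is attached to the model that generates messages. That is exactly the deterministic vs. stochastic distinction.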
Why is this the case?
I encourage you to read these two threads: Information = Entropy and Chance = Choice and Information = Entropy - Faith & Science Conversation - The BioLogos Forum.
Now, the very interesting thing about information (and entropy) is that it has historically been derived in several different ways:
- Thermodynamics (the first derivation of entropy)
- Shannon Information (usually computed assuming IID)
- Kolmogorov Complexity
- Minimum Description Length
- Compression (e.g. integer coding)
- Machine Learning/Model Building (a type of lossy compression)
- Auto-Encoding/Dimensionality Reduction/Embeddings
- Dynamical Systems and Equation Fitting
…
One of the really interesting things that happened over the last several decades is the discovery that all these different types of information (or approaches) are computing the same thing and relying on the same theory of information. Of course, they often compute information in different ways, and they have different strengths, weaknesses, and assumptions. But we can demonstrate that all of them actually map to one another.
It turns out that all the types of information discussed in ID can be reduced to either “entropy” or “mutual information”. It also turns out that all the theories above have analogues of these two measures (under specific assumptions). This coherence is one of the most beautiful things about information theory. If you really understand one domain, and you learn how to map between domains, you immediately understand something salient about all of them.
A great example of this is the theoretical result (excuse the technical language) that, approximately, the minimum compression size of a message = minimum description length = IID Shannon entropy of the compressed message; moreover, a correct compression will produce a string of bits that is (1) uncorrelated and (2) 50% 1s and 0s, and therefore indistinguishable from random. Another surprising result is that the highest-information-content entity is noise, exactly the opposite of our intuition of what “information” actually is.
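You can see a rough version of this for yourself. Here is a minimal sketch using Python's standard zlib module (the exact numbers depend on the compressor and the message, so treat it as an illustration, not a proof):

```python
import zlib

def bit_fraction(data: bytes) -> float:
    """Fraction of bits set to 1 in a byte string."""
    ones = sum(bin(byte).count("1") for byte in data)
    return ones / (8 * len(data))

# A patterned, very compressible message (just digits and spaces).
message = " ".join(str(i) for i in range(20000)).encode()

compressed = zlib.compress(message, 9)
recompressed = zlib.compress(compressed, 9)

print("original size:      ", len(message))
print("compressed size:    ", len(compressed))
print("recompressed size:  ", len(recompressed))   # barely shrinks, if at all
print("1-bit fraction of compressed bytes:", round(bit_fraction(compressed), 3))
```

On a patterned message like this, the compressed bytes come out close to 50% ones, and running the compressor a second time gains essentially nothing, because its own output already looks like noise to it.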
Another very important result is that the true entropy of a sequence (its true information content, independent of assumptions like IID) is uncomputable. The proof is esoteric, but the intuitive explanation is that we can never be sure we have not missed a hidden pattern in the data that would allow us to “compress” or “understand” it better. We can never be sure that what looks like noise to us actually is noise. Equivalently, we can never be sure that what looks like information to us actually is information.
One might wonder, if all these theories map to one another, why we keep them all. Each theory comes with its own assumptions, its ability (or lack thereof) to represent specific types of data, and its own goals. Moreover, the uncomputability result pragmatically guarantees that there is no single solution that will work in all domains. So, even though these are all connected by a common information theory, there is reason to maintain and develop each one. There is even value in cross-pollinating between them, as has happened several times in the history of these fields.
The other point is that it is common to slip back and forth between different languages in discussing this. It is a type of “code switching” that becomes second nature when you really understand the field (Code-switching - Wikipedia). So, for example, when I am talking to @kirk and @EricMH, I do the best I can to mirror their language, for the sake of understanding them, but occasionally I slip and say “entropy” instead of “information,” even though the two are equivalent. It isn’t a mathematical error, and frankly it is helpful because it provokes these conversations, but it can be confusing in a negative way too.
Great.
I have an exercise for you. Take a look at this figure from the Wikipedia page on mutual information:
It depicts four types of information: information content, conditional information, joint information, and mutual information. Here are your tasks:
- Find the formulas for all these assuming you know the corresponding probability distributions.
- Identify the corresponding probability distributions for each.
- Identify the corresponding version of entropy, both the names and the formula.
- Identify the corresponding types of compression measures, both the names and the formula.
- For all the cases above, explain the differences in notation, convention, and terminology.
If you can do that, you will have learned quite a bit. You’ll see they are all essentially equivalent.
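If you want to check your formulas numerically once you have found them, here is a minimal sketch in Python for a toy joint distribution (the numbers in p_xy are made up for illustration, and the conditional entropies are computed via the chain rule rather than from the conditional distributions directly, so this is a sanity check rather than a full answer to the exercise):

```python
import math

# A toy joint distribution p(x, y) over two binary variables.
# (Purely illustrative numbers; any valid joint distribution works.)
p_xy = {
    (0, 0): 0.40, (0, 1): 0.10,
    (1, 0): 0.15, (1, 1): 0.35,
}

def H(dist):
    """Shannon entropy (in bits) of a distribution given as {outcome: prob}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Marginal distributions p(x) and p(y).
p_x, p_y = {}, {}
for (x, y), p in p_xy.items():
    p_x[x] = p_x.get(x, 0) + p
    p_y[y] = p_y.get(y, 0) + p

H_X = H(p_x)                 # entropy of X
H_Y = H(p_y)                 # entropy of Y
H_XY = H(p_xy)               # joint entropy H(X, Y)
H_X_given_Y = H_XY - H_Y     # conditional entropy H(X | Y), via the chain rule
H_Y_given_X = H_XY - H_X     # conditional entropy H(Y | X), via the chain rule
I_XY = H_X + H_Y - H_XY      # mutual information I(X; Y)

print(f"H(X)   = {H_X:.4f} bits")
print(f"H(Y)   = {H_Y:.4f} bits")
print(f"H(X,Y) = {H_XY:.4f} bits")
print(f"H(X|Y) = {H_X_given_Y:.4f} bits")
print(f"H(Y|X) = {H_Y_given_X:.4f} bits")
print(f"I(X;Y) = {I_XY:.4f} bits")

# The identities pictured in the Venn-diagram figure hold:
assert abs(I_XY - (H_X - H_X_given_Y)) < 1e-12
assert abs(H_XY - (H_X_given_Y + I_XY + H_Y_given_X)) < 1e-12
```

The asserts at the end are the same relationships the figure shows: mutual information is the overlap between the two entropies, and the joint entropy is the union of the two circles.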