Mung's Primer on Information Theory

It’s still an open question. See my post here.

I also don't think that the equation you gave is an equation for information, but that could just be because of the terms we are using. What is it the information for, of, or about?

I hope to post tomorrow and clarify my thinking, moving away from the "uncertainty" and "average information" views.

1 Like

Mathematical Information isn't about the meaning of a message; it's about the mechanics of communication: bandwidth, compression, and error correction. The meaning of a message is tied to common knowledge between the sender and receiver. Ifya Iya erewya itingwrya inya aya anguagelya ouyya ontdya owknya, isthya ouldwya ebya eaninglessmya otya ouyya!

Note to self: shut off autocorrect before attempting to write in Pig-Latin! :slight_smile:

There is a strong overlap between statistics and Shannon Information (SI), which lets me understand SI pretty well. Statistics is well grounded in the sense that we always deal with numbers and counts of things we can measure. Algorithmic Information (AI) is more abstract, because it could represent almost anything: words, languages, sounds, images, and codes. Even so, the basic concepts and theories from statistics and SI are mirrored in AI.

4 Likes

This thread has been split off into many other threads, so it's hard to keep track. I apologize, then, for continuing here something that was raised elsewhere. But I think it is appropriate in a primer on information theory.

What’s in a name? In the case of Shannon’s measure the naming was not accidental. In 1961, one of us (Tribus) asked Shannon what he had thought about when he had finally confirmed his famous measure. Shannon replied: “My greatest concern was what to call it. I thought of calling it “information,” but the word was overly used, so I decided to call it “uncertainty.” When I discussed it with John von Neumann, he had a better idea. Von Neumann told me, ‘You should call it entropy, for two reasons. In the first place, your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, no one knows what entropy really is, so in a debate you will always have the advantage.’”

  • Tribus and McIrvine (1971)
3 Likes

22.1 Origins of the Theory

Firstly, the difficulty of giving credit where credit is due. All major advances in understanding have their precursors, whose full significance is never recognized at the time. Relativity theory had them in the work of Mach, Fitzgerald, Lorentz, and Poincare, to mention only the most obvious examples. Communication theory had many precursors, in the work of Gibbs, Nyquist, Hartley, Szilard, von Neumann, and Wiener. But there is no denying that the work of Shannon (1948) represents the arrival of the main signal, just as did Einstein’s of 1905. In both cases ideas which had long been, so to speak, ‘in the air’ in a vague form, are grasped and put into sharp focus.

  • E.T. Jaynes. Probability Theory: The Logic of Science. p. 627
1 Like

It’s like saying that uncertainty is synonymous with information.

E.T. Jaynes relies heavily on the “amount of uncertainty” interpretation of H (entropy). That could be where I got it from though no doubt he is not alone.

We can regard H as a measure of the ‘amount of uncertainty’ in any probability distribution.

  • p. 358

… if we do accept the expression for entropy, very literally, as the correct expression for the ‘amount of uncertainty’ represented by a probability distribution, this will lead us to a much more unified picture of probability theory in general. It will enable us to see that the principle of indifference, and many frequency connections of probability, are special cases of a single principle, and that statistical mechanics, communication theory, and a mass of other applications are all instances of a single method of reasoning.

  • p. 353

Next I will come to why I have moved away from the "uncertainty" interpretation, for which I rely heavily on Ben-Naim. But first a trip to the store to get ice cream. :slight_smile:

In this introductory chapter, we introduce Shannon’s measure of information (SMI). We start with Shannon’s motivation for searching for such a measure. We emphasize, from the outset, that there is an immense difference between the concept of information and the measure of information. We also note that Shannon erred when he named his measure “entropy.” This has caused huge confusion in both information theory and thermodynamics. Finally, we survey the various interpretations, as well as a few misinterpretations, of SMI.

  • Arieh Ben-Naim. Information Theory (Part 1: An Introduction to the Fundamental Concepts). p. 55.

I have quite a few books by Ben-Naim and enjoy his approach to many topics that just happen to touch on things that come up in the debates over Creation/ID/Evolution, though he does not write on those subjects. It would be hard to think of him as a Creationist or an ID proponent.

1 Like

As this thread is meant as a primer, I thought the following distinction between (Shannon) information and (Shannon) entropy might be helpful. It is something I have seen in some of the introductory tutorials.

"Shannon information" is used with a single sample x from a Random Variable X. In particular, Shannon information is -log Prob(X=x) for some x in the sample space of X. Shannon information is also called "surprisal".

Entropy, on the other hand, is used to refer to the expected value of information, that is E(-log Prob(X=x)) over all x. So information is -log of the probability of a single sample x from a Random Variable X; entropy is the expectation of that quantity over the distribution of the RV.

As an example of this usage, Stone’s Intro says in section 4 that “Entropy is Average Shannon Information”.
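As a concrete sketch of that distinction (the biased-coin distribution and the function names here are my own illustration, not Stone's), surprisal is computed for a single outcome while entropy averages it over the whole distribution:

```python
import math

# A toy distribution (illustrative only): a heavily biased coin.
p = {"heads": 0.9, "tails": 0.1}

def surprisal(prob_x):
    """Shannon information of one outcome x: -log2 Prob(X = x), in bits."""
    return -math.log2(prob_x)

def entropy(dist):
    """Expected surprisal over the whole distribution: E[-log2 Prob(X = x)]."""
    return sum(px * surprisal(px) for px in dist.values())

print(surprisal(p["tails"]))  # ~3.32 bits: the rare outcome is the surprising one
print(surprisal(p["heads"]))  # ~0.15 bits
print(entropy(p))             # ~0.47 bits: the probability-weighted average surprisal
```

The point of the sketch is only that surprisal belongs to a single outcome, while entropy belongs to the distribution as a whole.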

However, I think it is fair to say that in practice this relatively minor difference is often ignored. In fact, the author of the linked tutorial does exactly that here. Scroll down to the answer from Stone to see what I mean.

I don't think the standard Cover and Thomas text ever uses the bare term "information" as a term of art. See the Summary of Chapter 2, page 41 in the second edition of their text, for the terms they do use for the basic information entities.

Shannon does hint at this definition of information on page 1, para 3, of his 1948 paper. “If the number of messages in the set is finite then this number or any monotonic function of this number can be regarded as a measure of the information produced when one message is chosen from the set”. But he does not formalize the term “information” further than that, as far as I can see. In particular, he does not use the term “average information” for entropy that I can see.

3 Likes

Sort of. Except we can also refer to -log p as the entropy of the sample, and E[-log p] as the information of the distribution. The two terms are essentially interchangeable, though at times you will see distinctions being made within certain conventions. It is more important to clarify what is being referred to, knowing that any use of "information" or "entropy" could be flipped under a different convention.

2 Likes

Dan (@Dan_Eastwood):
If I understand your definitions correctly, you are equating statistical variance and Shannon entropy and also you are equating covariance (or correlation) and mutual information. If this is fair, I’d offer the following feedback.

First, I do agree that variance and entropy both relate to the dispersion of a Random Variable (RV). However, they are not the same. For one thing, the units are different: variance will change if the units of measure of the RV change, but information will not. But the main thing is that the formulas differ, so the calculated values differ. This link has details.

I also think covariance and mutual information differ. Both relate to statistical independence, of course. But the formulas differ. This leads to results like covariance being zero in some situations where the two variables have a non-linear relation, whereas mutual information is zero only for independent variables. One way to see this is to recognize that (1) mutual information is the KL divergence between the joint distribution of the variables, Prob(X,Y), and the product of their marginals, P(X)P(Y), and that (2) two variables are independent iff their joint is the product of their marginals. So we have independence iff the KL divergence is zero. See here for details.
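To make that difference concrete, here is a minimal sketch (the toy distribution and the variable names are my own, not from the linked answer): X is symmetric about zero and Y = X², so the dependence is non-linear, the covariance is exactly zero, yet the mutual information is not.

```python
import math

# X uniform on {-1, 0, +1}; Y = X**2 depends on X, but not linearly.
x_vals = [-1, 0, 1]
p_x = {x: 1 / 3 for x in x_vals}

# Joint distribution P(X, Y): all probability mass sits on the pairs (x, x**2).
joint = {(x, x * x): p_x[x] for x in x_vals}

# Marginal distribution of Y.
p_y = {}
for (x, y), p in joint.items():
    p_y[y] = p_y.get(y, 0.0) + p

# Covariance E[XY] - E[X]E[Y] vanishes because the relation is symmetric.
e_x = sum(p_x[x] * x for x in x_vals)
e_y = sum(p_y[y] * y for y in p_y)
e_xy = sum(p * x * y for (x, y), p in joint.items())
print("covariance:", e_xy - e_x * e_y)  # 0.0

# Mutual information: KL divergence between P(X, Y) and P(X)P(Y), in bits.
mi = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in joint.items())
print("mutual information:", mi)        # ~0.918 bits > 0, so X and Y are dependent
```

So a zero covariance misses this dependence entirely, while the mutual information picks it up.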

2 Likes

This doesn’t seem to be an accurate understanding of Shannon. As I posted in another thread:

To be sure this word information in communication theory relates not so much to what you do say, as to what you could say. That is, information is a measure of one's freedom of choice when one selects a message. … Note that it is misleading (although often convenient) to say that one or the other message conveys unit information. The concept of information applies not to the individual messages (as the concept of meaning would), but rather to the situation as a whole, the unit of information indicating that in this situation one has an amount of freedom of choice, in selecting a message, which is convenient to regard as a standard or unit amount.

  • Shannon and Weaver. The Mathematical Theory of Communication. pp. 8-9

The term “information” does not apply at all to the message.

To be somewhat more definite, the amount of information is defined, in the simplest cases, to be measured by the logarithm of the number of available choices.

@mung that is just not true. Do you aim to teach us or to learn?

There is information in a message, and in the simplest cases it is computed as -log p, or log W; beyond those simple cases the computation is more involved.

Remember also that this is a 70-year-old paper. I promise you we've improved our understanding since then.

1 Like

I am quoting directly from Shannon and Weaver. It's interesting that you can quote from Shannon's paper in one post and then ignore what he says in another, dismissing it as old and outdated.

When you say there is information in a message what on earth do you mean?

When you say it [the information in a message] is computed in the simplest cases as -log p or log W, -log p of what or log W of what? The message itself, or the information source? The space of all possible messages?

And the question is what Shannon means by the term. So what he writes about it is relevant.

Further:

Before closing this section on information, it should be noted that the real reason that Level A analysis deals with a concept of information which characterizes the whole statistical nature of the information source, and is not concerned with the individual messages (and not at all directly concerned with the meaning of the individual messages), is that from the point of view of engineering, a communication system must face the problem of handling any message that the source can produce. … This sort of consideration leads at once to the necessity of characterizing the statistical nature of the whole ensemble of messages which a given kind of source can and will produce. And information, as used in communication theory, does just this.

  • Shannon and Weaver. p. 14

The “information content” of any particular message is simply not in view. Shannon explicitly rejects that. So I am teaching.

1 Like

Don't forget the context of that thread. I was told I was in error, and that paper was quoted at me. However, that paper itself demonstrated I was not in error. This is a different context, or at least it should be. As I understood it, you wanted to better understand information theory, not challenge my competence.

In ordinary language, “information” refers to what is meaningful to people. And I agree that is not Shannon’s concern. Shannon’s theory is sometimes described as a theory of syntactic information. The human meaningful information is encoded as symbols (natural language letters or binary codes), and Shannon’s theory is concerned with the transportation of those sequences of symbols.

The theory is very much about the syntactic content of messages, even if it is not concerned with the semantic content. That syntactic content is what is transported by the communication system.

2 Likes

But I directly quoted @nwrickert. And in the other thread I was discussing with @Patrick and @dga471. I don’t know why you took it as a challenge to your competence.

I would, however, since you brought it up, appreciate an answer to my question, or questions.

Take this one:

log W of what? What does the W stand for? The number of possible messages?

But “a communication system must face the problem of handling any message that the source can produce. … This sort of consideration leads at once to the necessity of characterizing the statistical nature of the whole ensemble of messages which a given kind of source can and will produce. And information, as used in communication theory, does just this” (Shannon and Weaver).

That’s Shannon information. The measure H is “a measure of the amount of information contained in or associated with, a given probability distribution” (Ben-Naim). Not a given message.

As we have seen, both the uncertainty and the unlikelihood interpretations of H are derived from the meaning of the probabilities pᵢ. The interpretation of H as a measure of information is a little trickier and less straightforward. It is also more interesting since it conveys a different kind of information on the Shannon measure of information. As we have already emphasized, SMI is not information. Also, it is not a measure of any piece of information, but of a very particular kind of information. The confusion of SMI with information is almost the rule, not the exception, for both scientists and nonscientists.

  • Ben-Naim, Arieh. Information Theory. p. 63

Yes, I agree with that.

However, it is arguably also true that the source is constrained to produce only messages that the transport system can carry.

We see an example of this with email, which was originally designed to carry ASCII text. Images are therefore encoded to ASCII text with MIME encoding so that they can be transported.
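As a small sketch of that kind of re-encoding (this shows the general Base64 idea, not the full MIME machinery of any particular mail system), arbitrary bytes are turned into ASCII text for transport and decoded back at the other end:

```python
import base64

# A few arbitrary bytes standing in for an image file (toy data, not a real image).
image_bytes = bytes([0x89, 0x50, 0x4E, 0x47, 0x00, 0xFF, 0x10, 0x80])

# Encode to ASCII so an ASCII-only channel (like classic email) can carry it.
ascii_text = base64.b64encode(image_bytes).decode("ascii")
print(ascii_text)  # 'iVBORwD/EIA=' -- plain printable characters

# The receiver reverses the encoding and recovers the original bytes exactly.
assert base64.b64decode(ascii_text) == image_bytes
```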

You didn’t challenge my competence.

W is a term used in physics to mean the total number of possible states (here, possible messages). If (and only if) all these states are equally probable, the entropy (information/uncertainty) of the system is log W. If some states are more or less likely than others, then this trick no longer applies.
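A short sketch of that point, with toy distributions of my own: for W equally probable states the entropy collapses to log₂ W, but once the probabilities are unequal the full sum over -p log p is needed.

```python
import math

def entropy_bits(probs):
    # H = -sum(p_i * log2(p_i)), skipping zero-probability states.
    return -sum(p * math.log2(p) for p in probs if p > 0)

W = 4  # total number of possible states (possible messages)

uniform = [1 / W] * W
print(entropy_bits(uniform), math.log2(W))  # 2.0 and 2.0: the log W shortcut works

skewed = [0.7, 0.1, 0.1, 0.1]
print(entropy_bits(skewed), math.log2(W))   # ~1.36 vs 2.0: the shortcut no longer applies
```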

That’s very helpful to understanding your point. I agree that we should be clear on the context before using the terms or saying they mean the same thing.