**Dan’s simplification**

I’m adding a little bit of notation; be aware that notation may differ between fields of application!

**In Shannon Information theory**, Information is the bandwidth (average number of bits) needed to send a message from **A** to **B**, where A is some probability distribution representing all possible messages that can be sent. Common notation: **H(A)** denotes the information in **A**, and **H(B)** the information in **B**.

**In statistics**, the Shannon Information (**SI**) of a unitless Normal random variable is determined entirely by its Variance (the entropy is ½ log(2πeσ²), a monotone function of the variance). If you have ever calculated a standard deviation, then you have made a sample estimate of the Shannon Information in a population. Covariance and correlation are measures of Mutual Information between populations **A** and **B**, assuming a multivariate Normal distribution…
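A minimal sketch of that idea, using made-up sample numbers: the sample standard deviation plugs straight into the Normal entropy formula ½ ln(2πeσ²), so estimating σ *is* estimating the Shannon Information.

```python
import math
import statistics

# Hypothetical sample from some population (made-up numbers).
sample = [4.8, 5.1, 5.0, 4.7, 5.3, 4.9, 5.2, 5.0]

# Sample standard deviation -> plug-in estimate of the population sigma.
sigma_hat = statistics.stdev(sample)

# Differential entropy of a Normal(mu, sigma^2), in nats:
# H = 0.5 * ln(2 * pi * e * sigma^2) -- a monotone function of the variance.
# (Note: differential entropy can be negative for small sigma.)
H_hat = 0.5 * math.log(2 * math.pi * math.e * sigma_hat**2)

print(f"sigma_hat = {sigma_hat:.3f}, estimated entropy = {H_hat:.3f} nats")
```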

**Kolmogorov Information and Algorithmic Information** are the same thing, but different from Shannon Information. **KI** measures the *compressibility* of a message. Messages that can be compressed to a few bits have almost no information, while messages that cannot be compressed at all have maximal **KI**. File and image compression can be used as a proxy measure for **KI**: you can compare the sizes of compressed and uncompressed files to estimate **KI** using WINZIP.
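The same proxy trick works in code: a general-purpose compressor (here `zlib`, standing in for WINZIP) gives a crude upper bound on **KI**. A repetitive message shrinks to almost nothing; a noisy one barely compresses at all.

```python
import random
import zlib

def ki_proxy(msg: bytes) -> int:
    """Compressed size in bytes -- a crude upper bound on the message's KI."""
    return len(zlib.compress(msg, level=9))

random.seed(0)  # fixed seed so the "noisy" message is reproducible
repetitive = b"ab" * 500                                    # low KI: pure pattern
noisy = bytes(random.randrange(256) for _ in range(1000))   # high KI: no pattern

print(ki_proxy(repetitive), "vs", ki_proxy(noisy))
```

The repetitive message compresses to a handful of bytes, while the noisy one stays at roughly its original size.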

Technically you should measure the size of a *self-extracting* file, which includes the information needed to reverse the compression. Theoretically this is just a constant that usually gets ignored.

For random messages we expect **KI** and **SI** to be equal *on average* (equal expectation).

**KI** can measure the information in a single message, and with multiple messages it can estimate the information in the source population. This makes **KI** a property of both the source population and of messages from that population. Shannon Information is always a property of the population, and cannot measure the information in a single message (single observation).

**Joint Information** or **Joint Entropy** between **A** and **B** is all the information in both, or the *union* of the two. Commonly written **H(A,B)** or **H(B,A)**; the joint information is the same either way.

**Mutual Information** (**MI**) in any formulation is the amount of information shared between populations **A** and **B**. It doesn’t belong to any single population or message; it is always a measure between populations (**SI**), or between populations or messages (**KI**). In statistics MI is related to Covariance, assuming a multivariate Normal distribution. Commonly denoted **I(A;B)**.

**Conditional Entropy** (or Conditional Information) is the information about B which remains after we remove the information already known about A. Commonly written H(B|A). Note: H(A) = H(A|B) + I(A;B) and H(B) = H(B|A) + I(A;B). For conditional probability see Bayes’ Rule.

**A** and **B** are **Independent** if they are only related by chance. That is, knowing a message from **A** tells you nothing about whether its information is shared with **B**. Independence between A and B implies H(A|B) = H(A) and H(B|A) = H(B).

If **A** and **B** share NO information, then they are **Mutually Disjoint**. Knowing a message is from **A** tells you that its information IS NOT shared with **B**. In this case I(A;B) = 0 (for Shannon Information, this is exactly the condition of independence).
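A quick check that sharing no information means I(A;B) = 0: when a joint distribution factors into the product of its marginals, the mutual information computed from the entropies vanishes. The marginals below are made up.

```python
import math

def H(p):
    """Shannon entropy (in bits) of a list of probabilities."""
    return -sum(x * math.log2(x) for x in p if x > 0)

# Independent joint distribution: P(a, b) = P(a) * P(b) (made-up marginals).
pA = [0.5, 0.5]
pB = [0.4, 0.6]
joint = [a * b for a in pA for b in pB]

I_AB = H(pA) + H(pB) - H(joint)  # mutual information
print(I_AB)  # ~0 up to floating-point error: no shared information
```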

**Relative Entropy** (also Relative Information) is the Kullback-Leibler divergence between **A** and **B**. Commonly written **D_{KL}(A||B)**, it is always non-negative, but it is not symmetric, so it is not a true distance. A closely related quantity that *is* a distance measure is the **Variation of Information**: if **A** and **B** share a high level of Mutual Information then the distance between them is small.

At risk of mixing notations:

VI(A,B) = H(A,B) - I(A;B)

VI(A,B) = H(A) + H(B) - 2I(A;B)

VI(A,B) = H(A|B) + H(B|A)
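The three expressions above all describe the same quantity, H(A,B) − I(A;B), commonly called the Variation of Information. A quick numerical sanity check, reusing the small made-up 2×2 joint distribution:

```python
import math

def H(p):
    """Shannon entropy (in bits) of a list of probabilities."""
    return -sum(x * math.log2(x) for x in p if x > 0)

# Hypothetical joint distribution P(A, B) over 2x2 outcomes (made-up numbers).
joint = [[0.3, 0.2],
         [0.1, 0.4]]
pA = [sum(row) for row in joint]
pB = [sum(col) for col in zip(*joint)]
flat = [p for row in joint for p in row]

H_A, H_B, H_AB = H(pA), H(pB), H(flat)
I_AB = H_A + H_B - H_AB

vi1 = H_AB - I_AB                   # H(A,B) - I(A;B)
vi2 = H_A + H_B - 2 * I_AB          # H(A) + H(B) - 2 I(A;B)
vi3 = (H_AB - H_B) + (H_AB - H_A)   # H(A|B) + H(B|A)

assert math.isclose(vi1, vi2) and math.isclose(vi2, vi3)
```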

None of this has anything to do with the “meaning” of a message, which is what most people mean by “information” in everyday usage.

[let me know if you spot bugs so I can fix it]