Dan’s simplification
I’m adding a little bit of notation: Be aware that notation may differ between fields of application!
In Shannon Information theory, Information is the bandwidth (the minimum average number of bits) needed to send a message from A to B, where A is some probability distribution representing all possible messages that can be sent. Common notation: H(A) denotes the information in A, and H(B) the information in B.
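Here is a minimal sketch of that calculation in Python (the probabilities are just an illustrative example): H(A) = -sum of p*log2(p) over the distribution, measured in bits.

    from math import log2

    def shannon_entropy(probs):
        # H(A) = -sum(p * log2(p)) in bits; zero-probability outcomes contribute nothing.
        return -sum(p * log2(p) for p in probs if p > 0)

    # A fair coin carries 1 bit per message; a biased coin carries less.
    print(shannon_entropy([0.5, 0.5]))   # 1.0
    print(shannon_entropy([0.9, 0.1]))   # ~0.469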
In statistics Shannon Information (SI) is determined by the Variance of a unitless Normal random variable (strictly, it is a monotone function of the variance). If you have ever calculated a standard deviation, then you have made a sample estimate of the Shannon Information in a population. Covariance and correlation are measures of Mutual Information between populations A and B, assuming a multivariate Normal distribution.
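For the Normal case the exact formulas are simple; here is a small sketch (the sigma and rho values are purely illustrative). The differential entropy of a Normal is 0.5*ln(2*pi*e*sigma^2), so it is a monotone function of the variance, and the Mutual Information of a bivariate Normal with correlation rho is -0.5*ln(1 - rho^2), so it grows with |rho|.

    from math import log, pi, e

    def normal_entropy(sigma2):
        # Differential entropy of N(mu, sigma2) in nats: 0.5 * ln(2*pi*e*sigma2).
        return 0.5 * log(2 * pi * e * sigma2)

    def bivariate_normal_mi(rho):
        # Mutual Information of a bivariate Normal with correlation rho, in nats.
        return -0.5 * log(1 - rho**2)

    print(normal_entropy(1.0), normal_entropy(4.0))            # larger variance -> more information
    print(bivariate_normal_mi(0.1), bivariate_normal_mi(0.9))  # stronger correlation -> more shared information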
Kolmogorov Information and Algorithmic Information are the same thing, but different from Shannon Information. KI measures the compressibility of a message: messages that can be compressed to a few bits have almost no information, while messages that cannot be compressed at all have maximal KI. File and image compression can be used as a proxy measure for KI; you can compare the sizes of compressed and uncompressed files to estimate KI using WINZIP.
Technically you should measure the size of a self-extracting file, which includes the information needed to reverse the compression. Theoretically this is just a constant that usually gets ignored.
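As a rough sketch of the compression-as-proxy idea (using Python's zlib in place of WINZIP, and ignoring the self-extractor constant): a highly repetitive message shrinks to almost nothing (low KI), while uniformly random bytes barely shrink at all (maximal KI, already near the Shannon limit of 8 bits per byte).

    import os
    import zlib

    def compression_ratio(message):
        # Compressed size / original size: a crude proxy for Kolmogorov Information.
        return len(zlib.compress(message, 9)) / len(message)

    repetitive = b"abc" * 10_000      # highly compressible -> low KI
    random_msg = os.urandom(30_000)   # incompressible -> maximal KI

    print(compression_ratio(repetitive))  # close to 0
    print(compression_ratio(random_msg))  # close to 1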
For random messages we expect KI and SI to be equal on average (equal expectation).
KI can measure the information in a single message, and with multiple messages it can estimate the information in the source population. This makes KI a property of both the source population and of individual messages from that population. Shannon Information is always a property of the population, and cannot measure the information in a single message (single observation).
Joint Information or Joint Entropy between A and B is all the information in both, or the union of the two. Commonly written H(A,B) or H(B,A); the joint information is the same either way.
Mutual Information (MI) in any formulation is the amount of information shared between populations A and B. It doesn’t belong to any single population or message; it is always a measure between populations (SI), or between populations or messages (KI). In statistics MI grows with the magnitude of the correlation, assuming a multivariate Normal distribution. Commonly denoted I(A;B).
Conditional Entropy (or Conditional Information) is the information about B which remains after we remove the information already provided by A. Commonly written H(B|A). Note: H(A) = H(A|B) + I(A;B) and H(B) = H(B|A) + I(A;B). For conditional probability see Bayes’ Rule.
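A minimal sketch with a made-up 2x2 joint distribution (the numbers are purely illustrative) shows how H(A), H(B), H(A,B), I(A;B) and the conditional entropies fit together, and checks the identities above numerically.

    from math import log2

    def H(probs):
        # Shannon entropy in bits.
        return -sum(p * log2(p) for p in probs if p > 0)

    # Made-up joint distribution P(A=a, B=b) over two binary variables.
    joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
    pA = {a: sum(p for (x, y), p in joint.items() if x == a) for a in (0, 1)}
    pB = {b: sum(p for (x, y), p in joint.items() if y == b) for b in (0, 1)}

    H_A, H_B = H(pA.values()), H(pB.values())
    H_AB = H(joint.values())          # joint entropy H(A,B)
    I_AB = H_A + H_B - H_AB           # mutual information I(A;B)
    H_A_given_B = H_AB - H_B          # conditional entropy H(A|B)
    H_B_given_A = H_AB - H_A          # conditional entropy H(B|A)

    print(H_A, H_A_given_B + I_AB)    # H(A) = H(A|B) + I(A;B)
    print(H_B, H_B_given_A + I_AB)    # H(B) = H(B|A) + I(A;B)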
A and B are Independent if they are only related by chance. That is, knowing a message (or the information) from A doesn’t tell you whether that information is shared with B. Independence between A and B implies H(A|B) = H(A) and H(B|A) = H(B).
If A and B share NO information, then they are Mutually Disjoint. Knowing a message is from A tells you that its information IS NOT shared with B. In this case I(A;B) = 0.
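A quick sketch of the I(A;B) = 0 case (made-up numbers again): when the joint distribution factors into the product of its marginals, the mutual information computed as H(A) + H(B) - H(A,B) comes out to zero.

    from math import log2

    def H(probs):
        return -sum(p * log2(p) for p in probs if p > 0)

    pA = [0.7, 0.3]
    pB = [0.6, 0.4]
    joint = [pa * pb for pa in pA for pb in pB]   # P(a, b) = P(a) * P(b)

    print(H(pA) + H(pB) - H(joint))   # I(A;B) = 0, up to floating-point rounding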
Relative Entropy (also Relative Information) is a distance measure between A and B; the best-known example is the Kullback-Leibler divergence, commonly written D_KL(A||B), which is always non-negative. If A and B share a high level of Mutual Information then the distance between them is small.
At risk of mixing notations, the symmetric distance that captures this (strictly speaking this is the variation of information rather than the asymmetric KL divergence, but both shrink as shared information grows) can be written in several equivalent ways:
d(A,B) = H(A,B) - I(A;B)
d(A,B) = H(A) + H(B) - 2I(A;B)
d(A,B) = H(A|B) + H(B|A)
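A small numerical sketch (made-up distributions again): the KL divergence is computed straight from its definition, D_KL(p||q) = sum of p*log2(p/q), and the three expressions for the symmetric distance above are checked against each other on a 2x2 joint distribution.

    from math import log2

    def H(probs):
        return -sum(p * log2(p) for p in probs if p > 0)

    def kl_divergence(p, q):
        # D_KL(p||q) in bits: asymmetric, always >= 0, zero only when p == q.
        return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    print(kl_divergence([0.5, 0.5], [0.9, 0.1]))   # > 0
    print(kl_divergence([0.5, 0.5], [0.5, 0.5]))   # 0.0

    # Symmetric distance, three ways, on a made-up 2x2 joint distribution.
    joint = [0.4, 0.1, 0.1, 0.4]        # P(A,B); both marginals are [0.5, 0.5]
    H_A = H([0.5, 0.5])
    H_B = H([0.5, 0.5])
    H_AB = H(joint)
    I_AB = H_A + H_B - H_AB

    print(H_AB - I_AB)                  # H(A,B) - I(A;B)
    print(H_A + H_B - 2 * I_AB)         # H(A) + H(B) - 2I(A;B)
    print((H_AB - H_B) + (H_AB - H_A))  # H(A|B) + H(B|A)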
None of this has anything to do with the “meaning” of a message, which is how most people use the word “information” in everyday speech.
[let me know if you spot bugs so I can fix it]