Eric Holloway: Wrap up Experiment on Mutual Information

Yup, that is LZ compression. The problem is that data looks more like this:


So, the compression is always the same as the input string, which is to say there is no compression. There are several ways to fix this. The best way is to get a better compression algorithm, which is exactly what GZ is. It uses a way of detecting repeats (the BW transform), so:


Can be (essentially) compressed to:


Another one of the mistakes that @EricMH makes is computing the mutual information as:

I_LZ(X) + I_LZ(Y) - I_LZ(X+Y)

This is just an error with LZ compression because it will always be close to zero, except in some boundary cases. LZ can’t pick up on long range repetitions, so it will just fail to compute MI. A revised version that might work in some limited cases is:

I_LZ(X) - I_LZ(X ^ Y)

Here the “^” is the XOR on the bit strings, returning a bit string that is 1 every position the two strings are different. This function would actually be able to compute the mutual information, assuming the two are perfectly aligned. So, for example,

if X and Y are equal to:TAGGCTTAGGCAA
Then X^Y, would be a very easy to compress string.


This is good, making it so that MI(X,Y) is equal to I(X) and I(Y). It is very easy to break this however.

and Y is a rotation: ATAGGCTTAGGCA
then X^Y is the much more difficult to compress:


This means the MI will be computed as I(X) + I(Y) when it should actually be very close to just I(X). To avoid this error, the right way to cover a large range (though not all) of cases is to use GZ compression. The GZ compression detects offsets and a whole range of other changes.

[NOTE: The (DNA xor DNA) to BINARY is not valid. It is just an illustration. A valid approach would convert the ATCG to numbers and use (a-b) mod 4 as the XOR between two DNA strings A and B.]