Eric Holloway: Wrap up Experiment on Mutual Information

swamidass · September 19, 2018, 8:10pm

Yup, that is LZ compression. The problem is that the data looks more like this:

TAGGCTTAGGCAA

So, the compressed output is always the same size as the input string, which is to say there is no compression. There are several ways to fix this. The best way is to use a better compression algorithm, which is exactly what GZ is. It detects repeats by back-referencing earlier occurrences within a sliding window (the DEFLATE algorithm), so:

TAGGCTTAGGCAATAGGCTTAGGCAA

Can be (essentially) compressed to:

2*(TAGGCTTAGGCAA)
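To make the repeat detection concrete, here is a minimal sketch. It assumes Python's standard `zlib` module (DEFLATE, the same algorithm GZ uses) as a stand-in for GZ, and a random 2000-base sequence in place of real data. The helper name `compressed_size` and the sequence length are illustrative choices, not part of the original experiment.

```python
import random
import zlib


def compressed_size(s: str) -> int:
    """Bytes zlib (DEFLATE) needs to store the ASCII encoding of s."""
    return len(zlib.compress(s.encode("ascii"), 9))


random.seed(0)  # arbitrary seed, only so the run is repeatable
x = "".join(random.choice("ACGT") for _ in range(2000))

size_single = compressed_size(x)      # cost of one copy
size_double = compressed_size(x + x)  # cost of two copies back to back

# The second copy should add only a small number of bytes,
# because it is stored as back-references to the first copy.
print(size_single, size_double)
```

A very short string like the 13-base example above would be dominated by the fixed format overhead, which is why the sketch uses a longer sequence.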

Another one of the mistakes that @EricMH makes is computing the mutual information as:

I_LZ(X) + I_LZ(Y) - I_LZ(X+Y)

With LZ compression this is simply an error, because the result will always be close to zero, except in some boundary cases. LZ can’t pick up on long range repetitions, so it will just fail to compute MI. A revised version that might work in some limited cases is:

I_LZ(X) - I_LZ(X ^ Y)

Here the “^” is the XOR on the bit strings, returning a bit string that is 1 at every position where the two strings differ. This function would actually be able to compute the mutual information, assuming the two strings are perfectly aligned. So, for example,

if X and Y are both equal to: TAGGCTTAGGCAA
then X^Y would be a very easy to compress string:

0000000000000

This is good, making it so that MI(X,Y) is equal to I(X) and I(Y). It is very easy to break this, however.

If X is: TAGGCTTAGGCAA
and Y is a rotation: ATAGGCTTAGGCA
then X^Y is the much more difficult to compress:

11101001001001

Because I_LZ(X ^ Y) is now about as large as I_LZ(X), the MI will be computed as close to zero when it should actually be very close to just I(X). To avoid this error, the right way to cover a large range (though not all) of cases is to use GZ compression. GZ compression detects offsets and a whole range of other changes.
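As a rough check of the offset claim, the sketch below estimates MI as C(X) + C(Y) - C(X+Y), where C is the compressed size in bytes. It assumes Python's `zlib` (DEFLATE) as a stand-in for GZ and random 2000-base sequences as data. The names `c` and `mi_estimate` are hypothetical helpers, and this is not necessarily the exact procedure used in the experiment.

```python
import random
import zlib


def c(s: str) -> int:
    """Compressed size in bytes under zlib (DEFLATE), used as a proxy for I(s)."""
    return len(zlib.compress(s.encode("ascii"), 9))


def mi_estimate(x: str, y: str) -> int:
    """Concatenation-based estimate: C(x) + C(y) - C(x + y)."""
    return c(x) + c(y) - c(x + y)


random.seed(1)  # arbitrary seed, only so the run is repeatable
x = "".join(random.choice("ACGT") for _ in range(2000))
y_rot = x[-1:] + x[:-1]  # rotation by one base, as in the example above
y_rand = "".join(random.choice("ACGT") for _ in range(2000))  # unrelated sequence

mi_same = mi_estimate(x, x)
mi_rot = mi_estimate(x, y_rot)
mi_rand = mi_estimate(x, y_rand)

# Identical and rotated pairs should score close to c(x);
# the unrelated pair should score close to zero.
print(c(x), mi_same, mi_rot, mi_rand)
```

The rotation does not hurt here because the compressor matches the shifted copy against the first one regardless of where it starts, as long as both fit within its window.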

[NOTE: The (DNA XOR DNA) to binary conversion shown above is not valid. It is just an illustration. A valid approach would convert the letters A, T, C, G to numbers and use (a-b) mod 4 as the XOR between two DNA strings A and B.]
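The (a-b) mod 4 difference from the note can be sketched as follows. The letter-to-number mapping in `CODE`, the helper names, and the use of `zlib` as the compressor are all assumptions made for illustration. The estimate computed is the aligned form I(X) - I(X ^ Y), with the mod 4 difference playing the role of “^”.

```python
import random
import zlib

# Assumed mapping; any fixed assignment of the four bases to 0..3 works.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(s: str) -> bytes:
    """Convert a DNA string to one byte (0..3) per base."""
    return bytes(CODE[ch] for ch in s)


def diff_mod4(a: str, b: str) -> bytes:
    """Position-wise (a - b) mod 4; zero wherever the two strings agree."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return bytes((CODE[p] - CODE[q]) % 4 for p, q in zip(a, b))


def c(data: bytes) -> int:
    """Compressed size in bytes under zlib (DEFLATE)."""
    return len(zlib.compress(data, 9))


def mi_aligned(a: str, b: str) -> int:
    """Aligned estimate: C(a) - C(a 'xor' b), using the mod 4 difference."""
    return c(encode(a)) - c(diff_mod4(a, b))


random.seed(2)  # arbitrary seed, only so the run is repeatable
x = "".join(random.choice("ACGT") for _ in range(2000))
y_rot = x[-1:] + x[:-1]  # same content, shifted by one base

mi_identical = mi_aligned(x, x)    # difference is all zeros, so MI is near C(x)
mi_rotated = mi_aligned(x, y_rot)  # difference looks random, so MI is near zero
print(c(encode(x)), mi_identical, mi_rotated)
```

This shows the same failure as the illustration: a one-base shift is enough to make the aligned estimate collapse, even though the two sequences carry almost exactly the same information.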
