Eric Holloway: Wrap up Experiment on Mutual Information

Thanks for giving me the closing comment again. This is to your credit.


As an opening note: while this discussion has focused a great deal on my own capabilities (and lack thereof) in translating algorithmic information theory into empirically testable settings, it should be said that people far more qualified than I am have done just this. Li and Vitanyi have written an entire book on practical applications of Algorithmic Information Theory, and famously derived a compression-based clustering metric, the normalized compression distance, from it.
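For those unfamiliar with it, that metric only needs an off-the-shelf compressor. Here is a minimal sketch using Python's gzip module as the compressor; the helper names are mine, not from their book:

```python
import gzip

def clen(data: bytes) -> int:
    """Compressed length of data under gzip, in bytes."""
    return len(gzip.compress(data))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: near 0 for very similar inputs, near 1 for unrelated ones."""
    cx, cy, cxy = clen(x), clen(y), clen(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

# Two closely related strings should score lower (more similar) than two unrelated ones.
t1 = b"the quick brown fox jumps over the lazy dog " * 20
t2 = b"the quick brown fox jumped over the lazy dogs " * 20
t3 = bytes(range(256)) * 4
print("NCD(t1, t2) =", round(ncd(t1, t2), 3))
print("NCD(t1, t3) =", round(ncd(t1, t3), 3))
```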

Now for my responses.

We call each other out on three items, which I’ll address in turn.

1. MI_LZ(Y:Y) > 0 on long enough sequences

My algorithm will give MI_LZ(Y:Y) > 0 on long enough sequences, but not on every sequence, because LZ takes a while to get warmed up.

swamidass:

I increased the length to 10000, and got this result:

Prediction I_LZ(Y:Y)>0
Experiment 5: iid Y
Average success: 0.0

Increasing the string size does not fix this problem. Your instinct is false. Why didn’t you do the simulation yourself instead of putting out an obviously false claim like this? Why would you make these claims without testing them? This is really bad form.

LZ's optimality guarantees are only asymptotic, and only for stationary ergodic sources. For highly random bitstrings, this means the sequence has to get very long before LZ's output is actually shorter than the input.

So it looks like the sequences must become really, really long before MI_LZ(Y:Y) > 0 for iid Y. I have a log-log plot of LZ(Y+Y)/len(Y+Y) for iid Y, and it looks like it won't cross the X axis until Y's length reaches about 10^12.
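To give a sense of the measurement behind that plot, here is a rough sketch using a simple LZ78 coder, where each phrase is charged roughly log2(dictionary size) + 1 bits. This is my own stand-in, not the exact repl.it implementation, so the absolute numbers will differ from the plot; the point is how slowly the ratio falls as the length grows.

```python
import math
import random

def lz78_bits(s: str) -> float:
    """Rough LZ78 code length: each new phrase is (dictionary index, next symbol),
    charged about log2(current dictionary size) + 1 bits."""
    dictionary = {"": 0}
    current = ""
    bits = 0.0
    for ch in s:
        candidate = current + ch
        if candidate in dictionary:
            current = candidate
        else:
            bits += math.log2(len(dictionary)) + 1
            dictionary[candidate] = len(dictionary)
            current = ""
    if current:  # leftover partial phrase: index only, no new symbol
        bits += math.log2(len(dictionary))
    return bits

# Ratio LZ(Y+Y) / len(Y+Y) for iid Y at a few modest lengths.
random.seed(0)
for n in (1_000, 10_000, 100_000):
    y = "".join(random.choice("01") for _ in range(n))
    ratio = lz78_bits(y + y) / len(y + y)
    print(f"len(Y) = {n:>7}: LZ(Y+Y) / len(Y+Y) = {ratio:.3f}")
```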

2. Demonstration of tractability

You have pointed out flaws in the experiments, but I have also demonstrated successes. For instance, the experiment that started off this thread shows that if X and Y are iid, then with high probability MI_LZ(X:Y) = 0, which matches the corresponding property of algorithmic mutual information. I've repeated the code in a new repl.it below, and included a control showing that MI_LZ(X:Y) > 0 for some X and Y, so I didn't just cheat by setting MI_LZ(X:Y) = 0 for everything. It is admittedly a very modest empirical guarantee, but it is a guarantee nonetheless.
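To make the shape of that experiment concrete, here is a sketch in the same spirit, assuming the metric is MI_LZ(X:Y) = LZ(X) + LZ(Y) - LZ(X + Y), clamped at zero, and reusing the rough LZ78 coder from the sketch above (repeated so this snippet runs on its own). The actual repl.it code may estimate code lengths differently, so the exact success rates can differ:

```python
import math
import random

def lz78_bits(s: str) -> float:
    """Same rough LZ78 code-length estimate as in the earlier sketch."""
    dictionary = {"": 0}
    current, bits = "", 0.0
    for ch in s:
        candidate = current + ch
        if candidate in dictionary:
            current = candidate
        else:
            bits += math.log2(len(dictionary)) + 1
            dictionary[candidate] = len(dictionary)
            current = ""
    if current:
        bits += math.log2(len(dictionary))
    return bits

def mi_lz(x: str, y: str) -> float:
    """MI_LZ(X:Y) = LZ(X) + LZ(Y) - LZ(X + Y), clamped at zero."""
    return max(0.0, lz78_bits(x) + lz78_bits(y) - lz78_bits(x + y))

def random_bits(n: int) -> str:
    return "".join(random.choice("01") for _ in range(n))

# Prediction under test: iid X and Y should register no mutual information.
random.seed(0)
trials, n = 20, 10_000
successes = sum(mi_lz(random_bits(n), random_bits(n)) == 0 for _ in range(trials))
print("iid X, Y — fraction of trials with MI_LZ(X:Y) = 0:", successes / trials)

# Control: a highly structured string paired with itself, where even this
# crude metric registers shared information.
x = "01" * (n // 2)
print("structured X paired with itself — MI_LZ(X:X) =", mi_lz(x, x))
```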

You have not yet refuted this particular experiment. What you have done is proposed your own compression metric, which I’ll address next.

3. Issues with the GZip mutual information

You are not a fan of my LZ implementation, so you use the standard LZ-based compressor in GZip instead. This is fine, but you make the same sort of mistake I did with my first experiment, where I failed to account for expansion: the compressor's output coming out longer than its input.

In the following repl.it, I copied your GZip version and ran two experiments. The first shows that GZip consistently expands random bitstrings: the "compressed" output is longer than the original. Because of that expansion, random bitstrings artificially register mutual information where there is none.
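To see the expansion concretely, here is a minimal check, assuming the bitstrings are packed into raw bytes and that the metric has the form C(X) + C(Y) - C(X + Y) (the repl.it code may represent the strings differently). GZip's header and framing overhead is paid twice on the left but only once on the right, so even independent random strings come out with a small positive "mutual information":

```python
import gzip
import os

def gz_len(data: bytes) -> int:
    """Length of the gzip-compressed data, in bytes."""
    return len(gzip.compress(data))

# Incompressible input: gzip's output is longer than the input itself.
x = os.urandom(1_000)
y = os.urandom(1_000)
print("len(x) =", len(x), " gzip length =", gz_len(x))

# The fixed overhead is counted twice in C(X) + C(Y) but once in C(X + Y),
# so two independent random strings register a small positive value.
naive_mi = gz_len(x) + gz_len(y) - gz_len(x + y)
print("naive GZip MI of independent random strings:", naive_mi)
```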

The second experiment shows the consequence of that expansion. I generate a random X from a Bernoulli(0.5) distribution and show that it registers mutual information with Y drawn from Bernoulli distributions with parameters 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0. Information theory says a Bernoulli(0.5) X should have zero mutual information with any independently generated bitstring (barring a few highly improbable exceptions), so GZip's expansion issue causes the metric to violate this property.

That being said, this is not necessarily a ding against the GZip version. It can either be corrected fairly easily, by ensuring it never reports a "compressed" size larger than the original string (as I did), or the issue can be acknowledged as a tradeoff for some other useful property the GZip version preserves.
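Here is a sketch of one such correction, under the same assumptions as above: cap the reported compressed size at the raw length, so the metric never credits expansion, and clamp the mutual information at zero. The sweep over Bernoulli parameters mirrors the second experiment; my actual repl.it correction may differ in detail, and readers can check for themselves how much of the spurious signal this removes:

```python
import gzip
import random

def gz_len_capped(data: bytes) -> int:
    """Compressed size, never reported as larger than the original data."""
    return min(len(data), len(gzip.compress(data)))

def mi_gz(x: bytes, y: bytes) -> int:
    """GZip mutual information with the expansion correction, clamped at zero."""
    return max(0, gz_len_capped(x) + gz_len_capped(y) - gz_len_capped(x + y))

def bernoulli_bytes(p: float, nbits: int) -> bytes:
    """Pack nbits Bernoulli(p) bits into bytes."""
    bits = [1 if random.random() < p else 0 for _ in range(nbits)]
    return bytes(
        sum(bit << k for k, bit in enumerate(bits[i:i + 8]))
        for i in range(0, nbits, 8)
    )

# Mirror of the second experiment: a Bernoulli(0.5) X against Y of varying bias.
random.seed(0)
x = bernoulli_bytes(0.5, 8_000)
for tenths in range(11):
    p = tenths / 10
    y = bernoulli_bytes(p, 8_000)
    print(f"p = {p:.1f}: corrected MI_gz(X:Y) = {mi_gz(x, y)}")
```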

The overall takeaway is that @swamidass is correct in that not all properties of algorithmic mutual information can be preserved when moving to the calculable, empirical realm.

On the other hand, these experiments seem to show we can pick and choose properties, and guarantee empirical preservation of at least some of them.

In particular, if we preserve the property that the empirical MI is never positive when there is no algorithmic mutual information, then a positive empirical value lets us detect that algorithmic mutual information is present, even if we cannot measure the precise amount.

And I think that’s all I’ve got. Take care everyone! I seem to have gotten something out of my interactions here with you all, since I just stayed up until 2am when I should be studying for a major exam instead :stuck_out_tongue:
