Compressibility example

From my algorithms textbook:

The county's annual horse race features three thoroughbreds that have never raced against each other. Excited, you study their past 200 races and summarize each horse's record as a probability distribution over four outcomes: first ("first place"), second, third, and other.

Outcome    Aurora    Whirlwind    Phantasm
first      0.15      0.30         0.20
second     0.10      0.05         0.30
third      0.70      0.25         0.30
other      0.05      0.40         0.20

Which horse is the most predictable? One quantitative approach is to consider compressibility. Write down the history of each horse as a string of 200 values (first, second, third, other). The total number of bits needed to encode these track-record strings can then be computed using the Huffman algorithm. This works out to 290 bits for Aurora, 380 for Whirlwind, and 420 for Phantasm (check it!). Aurora has the shortest encoding, so she is, in a strong sense, the most predictable.

How did they get 420 for Phantasm? I keep getting 400 bytes like this:

Combine first and other = 0.4, then combine second and third = 0.6. That ends up encoding each outcome with 2 bits, for 200 × 2 = 400 in total.
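To double-check, here's a quick Python sketch of my computation (my own code, not from the book; the huffman_bits helper and horse dictionaries are just for illustration):

    import heapq
    from itertools import count

    def huffman_bits(probs, n_races=200):
        """Total bits to encode n_races outcomes with an optimal prefix code
        built by Huffman's algorithm on the given outcome probabilities."""
        tie = count()  # tiebreaker so equal weights never compare the dict payloads
        # Heap entries: (subtree weight, tiebreak, {outcome: codeword length so far})
        heap = [(p, next(tie), {name: 0}) for name, p in probs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            w1, _, d1 = heapq.heappop(heap)  # least likely subtree
            w2, _, d2 = heapq.heappop(heap)  # next least likely subtree
            # Merging two subtrees pushes every leaf in them one level deeper
            merged = {k: depth + 1 for d in (d1, d2) for k, depth in d.items()}
            heapq.heappush(heap, (w1 + w2, next(tie), merged))
        _, _, lengths = heap[0]
        # Expected codeword length per race, times the number of races
        return round(n_races * sum(p * lengths[k] for k, p in probs.items()))

    horses = {
        "Aurora":    {"first": 0.15, "second": 0.10, "third": 0.70, "other": 0.05},
        "Whirlwind": {"first": 0.30, "second": 0.05, "third": 0.25, "other": 0.40},
        "Phantasm":  {"first": 0.20, "second": 0.30, "third": 0.30, "other": 0.20},
    }
    for name, probs in horses.items():
        print(name, huffman_bits(probs))
    # Prints: Aurora 290, Whirlwind 380, Phantasm 400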

Is there something I've misunderstood about the Huffman coding algorithm?

The textbook is available here: http://www.cs.berkeley.edu/~vazirani/algorithms.html (p. 156).

algorithm compression huffman-code information-theory
1 answer

I think you're right: Phantasm's 200 results can be represented using 400 bits (bits, not bytes). The 290 for Aurora and 380 for Whirlwind are correct.

The correct Huffman code is generated as follows:

  • Combine the two least likely outcomes: 0.2 and 0.2. Get 0.4.
  • Combine the next two least likely outcomes: 0.3 and 0.3. Get 0.6.
  • Combine 0.4 and 0.6. Get 1.0.

This gives all four outcomes 2-bit codewords, so the total is 200 × 2 = 400 bits. The only way to get 420 bits is to merge incorrectly:

  • Combine the two least likely outcomes: 0.2 and 0.2. Get 0.4.
  • Combine 0.4 and 0.3. (Wrong! The two least likely weights at this point are 0.3 and 0.3.) Get 0.7.
  • Combine 0.7 and 0.3. Get 1.0.

That tree gives one of the 0.3 outcomes a 1-bit codeword, the other 0.3 a 2-bit codeword, and the two 0.2 outcomes 3-bit codewords: 0.3·1 + 0.3·2 + 0.2·3 + 0.2·3 = 2.1 bits per race, and 200 × 2.1 = 420 bits.
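If it helps, here is the same comparison in a few lines of Python (my own illustration; the codeword lengths are read off the two trees above, and which of the two 0.3 outcomes gets the 1-bit codeword in the faulty tree is arbitrary):

    # Cost of a prefix code: 200 races * expected codeword length
    phantasm = {"first": 0.20, "second": 0.30, "third": 0.30, "other": 0.20}

    correct_tree = {"first": 2, "second": 2, "third": 2, "other": 3 - 1}  # proper Huffman merges
    faulty_tree  = {"first": 3, "second": 2, "third": 1, "other": 3}      # the merges shown above

    for lengths in (correct_tree, faulty_tree):
        print(round(200 * sum(p * lengths[k] for k, p in phantasm.items())))
    # Prints 400, then 420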
