From my algorithms textbook:
The county's annual horse race includes three thoroughbreds that have never raced against each other. Excited, you study their past 200 races and summarize them as probability distributions over four outcomes: first ("first place"), second, third, and other.
    Outcome   Aurora   Whirlwind   Phantasm
    first      0.15      0.30        0.20
    second     0.10      0.05        0.30
    third      0.70      0.25        0.30
    other      0.05      0.40        0.20
Which horse is the most predictable? One quantitative approach to this question is to consider compressibility. Write down the history of each horse as a string of 200 values (first, second, third, other). The total number of bits needed to encode these track-record strings can then be computed using Huffman's algorithm. This works out to 290 bits for Aurora, 380 for Whirlwind, and 420 for Phantasm (check it!). Aurora has the shortest encoding and is therefore, in a strong sense, the most predictable.
How did they get 420 for Phantasm? I keep getting 400 bits, like this:
First combine first and other: 0.2 + 0.2 = 0.4. Then combine second and third: 0.3 + 0.3 = 0.6. That ends up encoding each outcome with 2 bits, giving 200 × 2 = 400 bits.
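Here is a small Python sketch of how I'm computing the cost, using the standard identity that the total Huffman cost equals the sum of the weights of all merged (internal) nodes. The probabilities are the ones from the table above; it reproduces the book's 290 and 380 but gives 400 for Phantasm:

```python
import heapq

def huffman_bits(probs, n_races=200):
    """Total bits to encode n_races outcomes with an optimal Huffman code.

    Each merge of two smallest weights a and b adds (a + b) to the
    expected code length, so summing the merge weights gives the
    expected bits per symbol; multiply by the number of races.
    """
    heap = list(probs)
    heapq.heapify(heap)
    cost = 0.0
    while len(heap) > 1:
        a = heapq.heappop(heap)  # two smallest weights
        b = heapq.heappop(heap)
        cost += a + b            # merge contributes its weight once
        heapq.heappush(heap, a + b)
    return n_races * cost

horses = {
    "Aurora":    [0.15, 0.10, 0.70, 0.05],
    "Whirlwind": [0.30, 0.05, 0.25, 0.40],
    "Phantasm":  [0.20, 0.30, 0.30, 0.20],
}
for name, p in horses.items():
    print(name, round(huffman_bits(p)))
```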
Is there something I'm misunderstanding about the Huffman coding algorithm?
The tutorial is available here: http://www.cs.berkeley.edu/~vazirani/algorithms.html (p. 156).
algorithm compression huffman-code information-theory