Shrink to entropy

Huffman coding is a lossless compression algorithm: it compresses data without losing information. The maximum compression achievable is bounded by the information entropy, a measure of the information contained in the data.

Example (lossless compression): if you get a page full of A, you can compress the page content to something like “a page full of A”, or, if the number of A’s is, say, 500, to “500 A”.

A common case is a file: your thesis, a never-completed astonishing sci-fi novel, a photo of a dog, and so on. In that case you can analyze the content and model it with a random variable: each character occurs with some probability.

Let’s do another example: you get a string composed of A, B and C. If the string is 50% A, 30% B and 20% C, the entropy (uncertainty) is about 1.4855 bits per character. However you compress it, you can’t go below 1.4855 bits per character on average: a string of N characters can’t be squeezed into fewer than about 1.4855·N bits.
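As a sanity check, that number comes straight from Shannon’s formula H = −Σ p·log2(p). A minimal Python sketch with the probabilities from the example:

    import math

    # symbol probabilities from the example: A 50%, B 30%, C 20%
    probs = [0.5, 0.3, 0.2]

    # Shannon entropy: H = -sum(p * log2 p)
    entropy = -sum(p * math.log2(p) for p in probs)
    print(entropy)  # ~1.4855 bits per character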

Huffman coding is based on two key ideas: a character that occurs more often than the others (A in the example) must get a shorter code word, and no code word is a prefix of another (so the encoded stream can be decoded without separators).

Finding the Huffman code is easy if you know the probability distribution of the characters (a code sketch follows the worked example below):

  1. sort the characters by %
  2. pick the two characters (or groups) with the lowest %
  3. prepend 0 and 1 (respectively) to their code words
  4. group the two characters and sum their %
  5. repeat until a single group remains

Example:

  1. sorting: A (50%), B (30%), C (20%)
  2. B and C have the lowest %
  3. B = 1, C = 0
  4. group B and C, sum %: BC (50%)
  5. sorting: BC (50%), A (50%)
  6. A = 0, BC = 1 -> A = 0, B = 11, C = 10 (the 1 is prepended to every code word in the BC group)

One possible binary Huffman code is (you can swap 0 and 1):

  • A = 0
  • B = 11
  • C = 10
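
To make the grouping procedure concrete, here is a small Python sketch (huffman_codes is just a name I picked; as in the hand-made example, the 0/1 choices are arbitrary, so a result with 0 and 1 swapped is equally valid):

    import heapq
    from itertools import count

    def huffman_codes(freqs):
        # freqs: {symbol: probability} -> {symbol: code word}
        tie = count()  # tiebreaker, keeps heap entries comparable
        # each heap entry: (probability, tiebreak, {symbol: partial code word})
        heap = [(p, next(tie), {sym: ""}) for sym, p in freqs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            # pick the two groups with the lowest probability...
            p0, _, group0 = heapq.heappop(heap)
            p1, _, group1 = heapq.heappop(heap)
            # ...prepend 0 and 1 (respectively) to their code words...
            merged = {s: "0" + c for s, c in group0.items()}
            merged.update({s: "1" + c for s, c in group1.items()})
            # ...then merge the two groups and sum their probabilities
            heapq.heappush(heap, (p0 + p1, next(tie), merged))
        return heap[0][2]

    print(huffman_codes({"A": 0.5, "B": 0.3, "C": 0.2}))
    # one valid output: {'A': '0', 'C': '10', 'B': '11'}

The heap replaces the explicit re-sorting of step 1: popping from it always returns the group with the lowest probability.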

Let AAACBCABBAABCABABAAC be the string to encode (it’s too short to be realistic, but this is a short example).

The string becomes 000101110011110011100110110010, 30 bits long. Without Huffman coding, if we use the usual fixed-length encoding of 2 bits per character (say A=00, B=01, C=10), the string becomes 40 bits long (20 characters, 2 bits each).
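
A quick check of the counts, assuming the code words above (the snippet only replays the example):

    codes = {"A": "0", "B": "11", "C": "10"}
    message = "AAACBCABBAABCABABAAC"

    encoded = "".join(codes[ch] for ch in message)
    print(encoded)       # 000101110011110011100110110010
    print(len(encoded))  # 30 bits, versus 20 * 2 = 40 with the fixed 2-bit code

The average code length is 0.5·1 + 0.3·2 + 0.2·2 = 1.5 bits per character, just above the 1.4855-bit entropy bound.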

Huffman coding can be n-ary instead of binary. If you want to experiment with Huffman coding, you can use a Java applet I created (I’m still working on the GUI, don’t be scared): Java 1.5 version (almost everyone), Java 1.4 version (for Mac users), Java Web Start format.

References:

  • Claude E. Shannon, A Mathematical Theory of Communication, 1948
  • David A. Huffman, A Method for the Construction of Minimum-Redundancy Codes, 1952

Further reading:

  • Robert Ash, Information Theory
  • David MacKay, Information Theory, Inference, and Learning Algorithms (also available online)

Italian readers may also look at my Information theory course notes (in PDF, many things are missing).

(This is my first “Guy Montag” post: if you find it obscure, poorly written or anything else, please let me know.)

Edit May 2012: modified link to the app.


