This is an old revision of the document!

Section 4.8 Huffman Codes and Data Compression

This section begins by discussing different ways which one may encode data. One example discussed is the scheme using 1's and 0's which allows one to encode 2^b symbols using b bits. Another scheme is one such as Morse Code, which uses different lengths of strings to encode different characters, based on frequency. Letters that occur more frequently, such as e, t, and a, use one- or two-bit strings while less frequent letters use four or five bit strings. This type of encoding is advantageous because the more frequent letters are quicker to encode, which saves time in the long run.

The problem with such an encoding is that it is sometimes ambiguous. Since e is 0, t is 1, and 01 is a, a string 01 may either be et or a. The solution to this problem is adding a slight pause in between letters, but if the pause isn't long enough it may be missed and additionally the pause becomes its own type of bit.

Such an ambiguity can be overcome by a construction called a prefix code. A prefix code is one in which the encoding for one letter in the alphabet can not be a prefix for another letter. In the Morse code example, e's encoding - 0, was a prefix for the encoding of a - 01. An example of a prefix code for the alphabet {a, b, c} could be a = 11, b = 01, c = 00. Using variable length encoding, it is possible to make an optimal prefix code by setting the most used letters to the shortest characters