Summary & Motivations: Section 4.8 is Huffman Codes and Data Compression. We explore encoding the symbols of an alphabet using bits: converting an alphabet that's readable to humans into sequences of 0s and 1s that are readable to computers. A simple approach is a fixed-length encoding. Say the alphabet we want to encode is 26 characters (a through z, presumably), the space, the comma, the period, the question mark, the exclamation point, and the apostrophe. Since we can form 2^b distinct strings of b bits, 5 bits per symbol suffice for these 32 symbols (2^5 = 32). The
ASCII code does this, just with a larger alphabet and therefore more bits per symbol. A better approach is to make the length of each symbol's encoding depend on the symbol's frequency in the language, and hope to use fewer than b bits per symbol on average (where 2^b = number of symbols). Even low-tech Morse code does this: higher-frequency letters use fewer dots and dashes. Morse code, however, effectively uses a three-letter alphabet, because pauses are required between symbols to avoid ambiguity. The ambiguity arises because some symbols' encodings are prefixes of other symbols' encodings. We can solve this problem by requiring that no symbol's encoding be a prefix of any other symbol's encoding. Under this regime, the encodings are called “prefix codes.” Furthermore, if we have a frequency for each letter x, equal to “the fraction of letters in the text that are equal to x” (p 165), we can compute the average bits per letter (ABL) of an encoding as the sum, over all symbols, of each symbol's frequency times the length of its encoding. Our goal is an algorithm that, given an alphabet and a set of frequencies for its symbols, finds a prefix code minimizing the ABL.
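The algorithm the section builds toward is Huffman's greedy procedure: repeatedly merge the two least-frequent subtrees until one tree remains, then read each symbol's codeword off its root-to-leaf path. A minimal Python sketch of that idea, together with the ABL formula above (the function names `huffman_code` and `abl` are mine, not the book's):

```python
import heapq

def huffman_code(freqs):
    """Build a prefix code from a {symbol: frequency} dict via Huffman's
    greedy merge: repeatedly combine the two least-frequent subtrees."""
    # Heap entries are (frequency, tiebreak, tree); a tree is either a
    # bare symbol (leaf) or a pair of subtrees. The integer tiebreak
    # keeps heapq from ever comparing two trees directly.
    heap = [(f, i, sym) for i, (sym, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, (t1, t2)))
        counter += 1
    _, _, tree = heap[0]

    code = {}
    def walk(node, bits):
        # Internal nodes are tuples; leaves are symbols.
        if isinstance(node, tuple):
            walk(node[0], bits + "0")
            walk(node[1], bits + "1")
        else:
            code[node] = bits or "0"  # edge case: one-symbol alphabet
    walk(tree, "")
    return code

def abl(code, freqs):
    """Average bits per letter: sum of frequency * encoding length."""
    return sum(freqs[s] * len(code[s]) for s in freqs)
```

For example, with frequencies {a: 0.32, b: 0.25, c: 0.20, d: 0.18, e: 0.05}, the resulting code gives a, b, and c two-bit encodings and d and e three-bit encodings, for an ABL of 2.23, versus the 3 bits per symbol a fixed-length encoding of five symbols would need. No codeword in the output is a prefix of another, since symbols sit only at the leaves of the merge tree.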