courses:cs211:winter2018:journals:melkersonr:chapter4 · last modified 2018/03/13 01:40 by melkersonr
===== Section 4.8 =====
  
   * **Summary & Motivations:** Section 4.8 is Huffman Codes and Data Compression. We explore encoding the symbols of an alphabet using bits: converting an alphabet that's readable to humans into sequences of 0's and 1's that are readable to computers. A simple approach is a fixed-length encoding. Say the alphabet we want to encode has 26 letters (a through z), the space, the comma, the period, the question mark, the exclamation point, and the apostrophe. Since we can form 2^b different sequences of b bits, we can use 5 bits per symbol to cover up to 2^5 = 32 symbols. The ASCII code does exactly this, just with a larger alphabet and therefore more bits per symbol. A better approach is to choose the length of each symbol's encoding based on the symbol's frequency in the language, and hope to use fewer than b bits per symbol on average. Even low-tech Morse code does this: higher-frequency letters use fewer dots and dashes. Morse code, however, effectively uses a three-letter alphabet, because pauses are required between symbols to avoid ambiguity. The ambiguity arises because some symbols' encodings are prefixes of other symbols' encodings. We can solve this problem by disallowing any symbol's encoding from being a prefix of any other's; encodings that satisfy this are called "prefix codes." Furthermore, if each letter x has a frequency equal to "the fraction of letters in the text that are equal to x" (p 165), we can define the average bits per letter (ABL) of an encoding as the sum over all symbols of the symbol's frequency times the length of its encoding. Our goal is an algorithm that, given an alphabet and a frequency for each of its symbols, finds the prefix code minimizing the ABL.
   * **About the Algorithm:** We use a binary tree to represent prefix codes, since that's easier to visualize than a list of function mappings. If only the leaves of the binary tree represent symbols of the alphabet, the codes are naturally prefix codes: no leaf lies on the path from the root to any other leaf, so no codeword can be a prefix of another. We say that going left appends a 0 and going right appends a 1. The binary tree that gives an optimal prefix code is full, because if there were a node with only one child, we could simply replace that node with its child, shortening the code. We use the observation that leaves of minimum depth should correspond to letters of maximum frequency, and thus that leaves of maximum depth should correspond to letters of minimum frequency, to formulate an algorithm for the optimal prefix code. We pick the two lowest-frequency letters y* and z* in the alphabet S and combine them into a "meta-letter" with frequency f_{y*} + f_{z*}. Then we "recursively find a prefix code for the smaller alphabet" (p 172), which is now one letter smaller. When the alphabet is down to two letters, we arbitrarily assign one the code 0 and the other the code 1, and we form the full prefix code by "unfold[ing] the result back through the recursive calls" (p 173). If we maintain a heap-based priority queue keyed on each letter's frequency, this algorithm runs in O(n log n) time, where n is the number of letters in the alphabet.
   * **My Questions:** I find the term "prefix code" confusing. To me, it sounds like a "prefix code" //would// be the prefix of another code, rather than one that is strictly never the prefix of another code.
   * **Second Time Around:** Something that actually made less sense the second time around: the algorithm itself. I didn't really understand why the "Define a prefix code for S as follows" block was separated from the rest of the algorithm.
   * **Note to Self:** Approaching a problem top-down vs bottom-up can affect optimality, as in the case of Shannon-Fano codes vs Huffman codes.
   * **Readability:** 8 - I almost gave this section a 9 or 10, but the algorithm made less sense in the reading than it did in class.
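To make the ABL definition from the summary concrete, here's a minimal sketch in Python; the frequencies and code lengths are made-up illustrative values, not taken from the book:

```python
# Average bits per letter (ABL): the sum over all symbols of the
# symbol's frequency times the length of its encoding.
# NOTE: these frequencies and code lengths are made-up for illustration.
freqs = {"a": 0.32, "b": 0.25, "c": 0.20, "d": 0.18, "e": 0.05}
code_lengths = {"a": 2, "b": 2, "c": 2, "d": 3, "e": 3}

abl = sum(freqs[x] * code_lengths[x] for x in freqs)
print(round(abl, 2))  # 2.23 for these values, vs. 3 bits for a fixed-length code
```

With five symbols, a fixed-length code needs 3 bits per symbol (2^3 = 8 >= 5), so even this hand-picked variable-length code beats it on average.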
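The merging procedure described above can be sketched with Python's heapq module as the priority queue. This version computes only each letter's codeword length (its depth in the tree), which is all the ABL needs; the function name and the example frequencies are my own, not the book's:

```python
import heapq
import itertools

def huffman_code_lengths(freqs):
    """Return each letter's depth in the Huffman tree (its codeword length).

    Repeatedly merge the two lowest-frequency (meta-)letters into a new
    meta-letter whose frequency is their sum, as in Section 4.8.
    """
    counter = itertools.count()  # tiebreaker so equal frequencies compare cleanly
    # Heap entries: (frequency, tiebreaker, letters inside this meta-letter)
    heap = [(f, next(counter), [x]) for x, f in freqs.items()]
    heapq.heapify(heap)
    depth = {x: 0 for x in freqs}
    while len(heap) > 1:
        f1, _, s1 = heapq.heappop(heap)  # lowest-frequency meta-letter
        f2, _, s2 = heapq.heappop(heap)  # second lowest
        for x in s1 + s2:
            depth[x] += 1  # every letter under the merged node gains one bit
        heapq.heappush(heap, (f1 + f2, next(counter), s1 + s2))
    return depth

# Example with made-up integer frequencies (think percentages):
print(huffman_code_lengths({"a": 45, "b": 25, "c": 15, "d": 10, "e": 5}))
# {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 4}
```

Each of the n - 1 merges does a constant number of heap operations, each O(log n), matching the O(n log n) bound from the summary. The tiebreaker counter is just a safeguard so the heap never has to compare the letter lists when two frequencies are equal.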