======= 4.8 Huffman Codes =======

==Summary==
We want to map encodings to letters so that no encoding is a prefix of another, which avoids ambiguity. A prefix code that minimizes the average number of bits per letter is optimal; in other words, more frequently used letters should have shorter encodings in order to compress the data. A brute-force algorithm is infeasible because the search space is so large. We represent prefix codes with a full binary tree: every encoded letter is a leaf node, which guarantees that no encoding is a prefix of another, and meta-letters are the interior nodes (parents of letters). The greedy aspect of Huffman's algorithm is that it merges the two lowest-frequency letters at each step. This bottom-up approach is optimal, and the runtime of the algorithm is O(k log k).

==The Problem==
  * A greedy rule is used to shrink the size of the problem instance so that an equivalent smaller problem can then be solved by recursion
  * The global consequences of the initial greedy decision do not become fully apparent until the full recursion is complete
  * Data compression is an area that forms part of the foundation for digital communication
  * The mapping of bit strings to symbols is arbitrary
  * The letters in most human alphabets are not used equally frequently
  * Reducing the average number of bits per letter is a fundamental problem in the area of data compression
  * How might we construct an optimal encoding that takes advantage of the non-uniform frequencies of the letters?
  * Appealing for compressing data: it squeezes all the available gains out of non-uniformities in the frequencies
  * Variable-Length Encoding Schemes
    * Morse code uses such short strings for the letters that the encoding of words becomes ambiguous
    * To deal with this ambiguity, Morse code transmissions involve short pauses between letters
  * Prefix Codes
    * The ambiguity in Morse code arises because there exist pairs of letters where the bit string that encodes one letter is a prefix of the bit string that encodes another
    * It is enough to map the letters to bit strings in such a way that no encoding is a prefix of any other
    * A prefix code for a set S of letters is a function γ that maps each letter x in S to some sequence of zeros and ones, in such a way that for distinct x, y in S, the sequence γ(x) is not a prefix of the sequence γ(y)
    * If we hand an encoded message to a recipient who knows the function γ, they can reconstruct the text with the following rule:
      * Scan the bit sequence from left to right
      * As soon as you've seen enough bits to match the encoding of some letter, output this as the first letter of the text (it must be the correct first letter because no shorter or longer prefix of the bit sequence could encode any other letter)
      * Delete the corresponding set of bits from the front of the message and iterate
  * Optimal Prefix Codes
    * More frequent letters should have shorter encodings
    * With f_x denoting the frequency of letter x, the average number of bits required per letter is
      * ALB(γ) = Σ_{x∈S} f_x · |γ(x)|
    * A prefix code that minimizes the average number of bits per letter, ALB(γ), is optimal
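The decoding rule and the ALB formula above can be sketched in Python. The code table and frequencies below are made-up illustration values, not examples from the text:

```python
# A hypothetical prefix code: no codeword is a prefix of another.
code = {'a': '0', 'b': '10', 'c': '11'}
freq = {'a': 0.5, 'b': 0.25, 'c': 0.25}   # hypothetical letter frequencies
decode_table = {bits: letter for letter, bits in code.items()}

def decode(bit_string):
    """Scan left to right; output a letter as soon as a codeword matches."""
    result, current = [], ''
    for bit in bit_string:
        current += bit
        if current in decode_table:       # unambiguous thanks to the prefix property
            result.append(decode_table[current])
            current = ''
    return ''.join(result)

print(decode('010110'))  # abca

# Average bits per letter: ALB(γ) = Σ_{x∈S} f_x · |γ(x)|
alb = sum(freq[x] * len(code[x]) for x in freq)
print(alb)  # 0.5*1 + 0.25*2 + 0.25*2 = 1.5
```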

==Designing the Algorithm==
  * The search space includes all possible ways of mapping letters to bit strings, subject to the defining property of prefix codes
    * Brute force is infeasible
  * To construct an optimal prefix code: develop a tree-based means of representing prefix codes
  * In a binary tree, follow the path from the root to the leaf labeled x; going from a node to its left child appends a 0, going from a node to its right child appends a 1, and the resulting string of bits is the encoding of x
  * The encoding of S constructed from T is a prefix code
  * Start with a root; all letters x in S whose encodings begin with a 0 will be leaves of the left subtree of the root, and all letters y in S whose encodings begin with a 1 will be leaves of the right subtree; build the two subtrees recursively using this rule
  * The search for an optimal prefix code can be viewed as the search for a binary tree T, together with a labeling of the leaves of T, that minimizes the average number of bits per letter
  * The length of the encoding of a letter x in S is simply the length of the path from the root to the leaf labeled x
  * We seek the labeled tree that minimizes the weighted average of the depths of all leaves, where the average is weighted by the frequencies of the letters that label the leaves: Σ_{x∈S} f_x · depth_T(x)
  * A binary tree is full if each node that is not a leaf has two children
  * The binary tree corresponding to the optimal prefix code is full
  * A natural approach would be to build the tree from the top down, packing the leaves as tightly as possible, but this turns out to be suboptimal
  * Suppose that u and v are leaves of T* such that depth(u) < depth(v). Suppose that in a labeling of T* corresponding to an optimal prefix code, leaf u is labeled with y in S and leaf v is labeled with z in S. Then f_y >= f_z.
  * Go through the leaves in order of increasing depth, assigning letters in order of decreasing frequency
  * Among the labels we assign to a block of leaves all at the same depth, it doesn't matter which label we assign to which leaf
  * Consider a leaf v in T* whose depth is as large as possible. Leaf v has a parent u, and since T* is a full binary tree, u has another child w. We refer to v and w as siblings since they have a common parent.
    * w is a leaf of T*
  * There is an optimal prefix code, with corresponding tree T*, in which the two lowest-frequency letters are assigned to leaves that are siblings in T*
  * Let y* and z* be the two lowest-frequency letters in S
  * Their common parent acts like a "meta-letter" whose frequency is the sum of the frequencies of y* and z*
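A single merge step that creates a meta-letter can be sketched as follows; the letter counts are hypothetical illustration values:

```python
# One greedy step: replace the two lowest-frequency letters with a meta-letter.
freq = {'a': 32, 'b': 25, 'c': 20, 'd': 18, 'e': 5}  # hypothetical counts

y, z = sorted(freq, key=freq.get)[:2]    # two lowest-frequency letters: 'e', 'd'
meta = y + z                             # meta-letter standing in for the pair
freq[meta] = freq.pop(y) + freq.pop(z)   # its frequency is the sum of the two
print(meta, freq[meta])  # ed 23
```

The alphabet now has one less letter, which is exactly the smaller equivalent problem the recursion solves.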

==Huffman's Algorithm==
     To construct a prefix code for an alphabet S, with given frequencies:
     If S has two letters:
          Encode one letter using 0 and the other using 1
     Else:
          Let y* and z* be the two lowest-frequency letters
          Form a new alphabet S' by deleting y* and z* and replacing them with a new letter w of frequency f_y* + f_z*
          Recursively construct a prefix code γ' for S', with tree T'
          Define a prefix code for S as follows:
               Start with T'
               Take the leaf labeled w and add two children below it labeled y* and z*
     Endif

  * The greedy rule underlying Huffman's Algorithm – merging the two lowest-frequency letters – fits into the structure of the algorithm as a whole
  * Simply commit to having them be children of the same parent; this is enough to produce a new, equivalent problem with one less letter
  * Bottom-up approach: it focuses on the leaves representing the two lowest-frequency letters, and then continues by recursion
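The recursive algorithm above can be sketched iteratively in Python, using the heapq module as the priority queue; the letter frequencies are hypothetical illustration values:

```python
import heapq

def huffman(freq):
    """Build a Huffman prefix code for a dict mapping letters to frequencies.

    Greedy rule: repeatedly merge the two lowest-frequency items, then
    assign 0/1 labels while walking back down the merge tree.
    """
    # Heap entries are (frequency, tiebreaker, tree); a tree is either a
    # letter (leaf) or a (left, right) pair representing a meta-letter.
    heap = [(f, i, letter) for i, (letter, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)   # lowest-frequency tree
        f2, _, t2 = heapq.heappop(heap)   # second-lowest-frequency tree
        heapq.heappush(heap, (f1 + f2, counter, (t1, t2)))  # merged meta-letter
        counter += 1
    _, _, tree = heap[0]

    code = {}
    def assign(t, prefix):
        if isinstance(t, tuple):          # interior node (meta-letter)
            assign(t[0], prefix + '0')    # left child appends 0
            assign(t[1], prefix + '1')    # right child appends 1
        else:
            code[t] = prefix or '0'       # leaf: an actual letter
    assign(tree, '')
    return code

freq = {'a': 32, 'b': 25, 'c': 20, 'd': 18, 'e': 5}  # hypothetical counts
code = huffman(freq)
print(code)
```

With these frequencies the two most common letters get 2-bit codewords and the two rarest get 3-bit codewords, and no codeword is a prefix of another.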

==Analyzing the Algorithm==
  * The Huffman code for a given alphabet achieves the minimum average number of bits per letter of any prefix code
  * Implementation and Runtime
    * There are k-1 iterations of recursive calls
    * Using an implementation of priority queues via heaps, we can perform insertion and extraction of the minimum in O(log k) time
    * Each iteration, which performs three of these operations, takes O(log k) time
    * Summing over all k iterations, we get a total runtime of O(k log k)
==Extensions==
  * The challenge here is to define an encoding scheme in which the notion of using fractions of bits is well-defined
  * Another drawback of prefix codes is that they cannot adapt to changes in the text

==Additional Information==
A concept that was troubling me in the beginning stages of understanding Huffman codes was the meta-letter that is the parent of leaf nodes. As I read the section after class, it made more sense. The meta-letter is there because a letter node cannot be: a letter's encoding cannot sit at an interior node, because that would make it the prefix of another encoding, which is exactly what Huffman codes avoid. There can be no encoding that is the prefix of another. Moreover, meta-letters are helpful and make more sense in Huffman's optimal bottom-up approach anyway. The first meta-letter joins the two letters of lowest frequency; that way it acts like a super-letter, and the alphabet now has one less letter. This explanation of the purpose of the meta-letter also makes it clear why we would implement the algorithm with a priority queue.

On a scale of 1 to 10, the readability of this section is a 9. The book explained the notation and the concepts behind Huffman codes really well. Also, the main part of the material involved binary trees, which we've already seen multiple times in this class, so it was really building on knowledge we already had.