Readability: 7 Interesting: 7
  
===== 4.8 Huffman Codes and Data Compression =====

  * Greedy rule to shrink the size of the problem instance so that an equivalent smaller problem can then be solved by recursion
  * Data Compression: an area that forms part of the foundations for digital communication.
  
**The Problem**

  * __Encoding Symbols Using Bits__
    * Use a fixed number of bits to represent each symbol in the alphabet
    * Concatenate the bit strings to form the text
    * It is tremendously wasteful to translate all letters into the same number of bits
      * instead, use a small number of bits for frequent letters and a larger number of bits for less frequent ones
    * Need an optimal way to take advantage of the nonuniform frequencies of the letters.
  * __Variable-Length Encoding Schemes__
    * Morse Code: translates each letter into a sequence of dots and dashes (0s and 1s, for comparison) -> maps more frequent letters to shorter bit strings
      * Dealing with ambiguity: short pauses between letters
  * __Prefix Codes__
    * Ambiguity arises when the bit string that encodes one letter is a prefix of the bit string that encodes another
    * Prefix Code: for a set S of letters, a function γ that maps each letter x in S to some sequence of zeros and ones, in such a way that for distinct x,y in S, the sequence γ(x) is not a prefix of the sequence γ(y).
    * When scanning text, as soon as you've seen enough bits to match the encoding of some letter, output this as the first letter of the text.
  * __Optimal Prefix Codes__
    * For each letter x in S, there is a frequency fx, representing the fraction of letters in the text that are equal to x. Frequencies sum to 1.
      * Total length of the encoding = sum, over all letters x in S, of the number of times x occurs times the length of the bit string γ(x) used to encode x.
        * Average number of bits required per letter: ABL(γ) = sum over all x in S of fx * |γ(x)|

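The left-to-right scanning rule and the ABL formula above can be sketched in Python. This is a minimal illustration; the example code and frequencies below are made up, not taken from the text:

```python
# An assumed example prefix code for S = {a, b, c, d}: no codeword is a
# prefix of another, so a left-to-right scan decodes unambiguously.
code = {"a": "0", "b": "10", "c": "110", "d": "111"}
freq = {"a": 0.5, "b": 0.25, "c": 0.15, "d": 0.1}

def decode(bits, code):
    """Scan left to right; as soon as the buffer matches a codeword, emit that letter."""
    inverse = {w: letter for letter, w in code.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in inverse:   # unambiguous because the code is prefix-free
            out.append(inverse[buf])
            buf = ""
    return "".join(out)

def abl(code, freq):
    """Average bits per letter: sum of f_x * |gamma(x)| over all letters x."""
    return sum(freq[x] * len(code[x]) for x in code)

print(decode("010110111", code))  # -> "abcd"
print(abl(code, freq))            # 0.5*1 + 0.25*2 + 0.15*3 + 0.1*3 = 1.75
```

Note how the high-frequency letter "a" gets the shortest codeword, pulling the average below the 2 bits per letter a fixed-length code would need.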
**Designing the Algorithm**
  * __Representing Prefix Codes Using Binary Trees__
    * Rooted binary tree T: the number of leaves equals the size of the alphabet S; each leaf is labeled with a distinct letter in S
    * For each letter x in S, we follow the path from the root to the leaf labeled x; each time the path goes from a node to its left child, we write down a 0, and each time the path goes from a node to its right child, we write down a 1.
    * //The encoding of S constructed from T is a prefix code.//
      * For the encoding of x to be a prefix of the encoding of y, the path from the root to x would have to be a prefix of the path from the root to y. That is the same as saying x lies on the path from the root to y, which isn't possible if x is a leaf.
    * Given a prefix code γ, we can build the corresponding binary tree recursively
      * Start with the root; all letters x in S whose encodings begin with a 0 become leaves in the left subtree of the root, and those whose encodings begin with a 1 become leaves in the right subtree; recurse within each subtree
    * A binary tree is full if each node that is not a leaf has two children: there are no nodes with exactly one child.
    * //The binary tree corresponding to an optimal prefix code is full//
      * Proof using an exchange argument
      * Let T be the binary tree corresponding to an optimal prefix code
      * Suppose it contains a node u with exactly one child v
      * Convert T into T' by replacing node u with v.
        * if u is the root, delete u and use v as the root
        * otherwise, let w be the parent of u in T, delete u, and make v a child of w in place of u
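The tree-to-code correspondence described above can be sketched as follows. The nested-tuple tree representation and the example tree are my own illustration:

```python
# A rooted binary tree as nested tuples: a leaf is a letter, an internal
# node is a pair (left, right). Walking root-to-leaf, going left writes
# a 0 and going right writes a 1, exactly as in the construction above.

def code_from_tree(tree, prefix=""):
    """Read off the prefix code by following each root-to-leaf path."""
    if isinstance(tree, str):        # a leaf labeled with a letter
        return {tree: prefix or "0"}
    left, right = tree
    code = code_from_tree(left, prefix + "0")
    code.update(code_from_tree(right, prefix + "1"))
    return code

# A full binary tree with leaves a, b, c, d:
tree = ("a", ("b", ("c", "d")))
print(code_from_tree(tree))  # {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
```

Because every codeword ends at a leaf, no codeword can continue into another one, which is the prefix-free property argued above.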
  * __A First Attempt: The Top-Down Approach__
    * Goal: produce a labeled binary tree in which the leaves are as close to the root as possible
    * Pack the leaves as tightly as possible: want a "nearly balanced" split of the alphabet
      * Shannon-Fano Codes: no version of this top-down splitting strategy is guaranteed to always produce an optimal prefix code
  * //What If We Knew the Tree Structure of the Optimal Prefix Code?//
    * Assume we have the binary tree T* corresponding to an optimal prefix code - we still need to figure out the labeling
    * Suppose that u and v are leaves of T*, such that depth(u) < depth(v). Further suppose that in a labeling of T* corresponding to an optimal prefix code, leaf u is labeled with y in S and leaf v is labeled with z in S. Then fy >= fz.
      * Proof using an exchange argument
      * If fy < fz, then the effect of exchanging the labels of u and v is as follows: the multiplier on fy increases [from depth(u) to depth(v)] and the multiplier on fz decreases by the same amount [from depth(v) to depth(u)].
        * The change to the overall sum is a negative number, contradicting the supposed optimality of the prefix code.
      * So an optimal solution never places a lower-frequency letter at a strictly smaller depth than some higher-frequency letter
      * Take all leaves of depth 1, label them with the highest-frequency letters; do the same with each level of increasing depth -> leads to an optimal labeling of T*
    * Let v be a leaf of maximum depth in T*, let u be its parent, and let w be u's other child (which exists since T* is full). Then w is also a leaf of T*
      * If w were not a leaf, there would be some leaf w' in the subtree below it. But then w' would have a depth greater than that of v, contradicting our assumption that v is a leaf of maximum depth in T*.
        * There is an optimal prefix code, with corresponding tree T*, in which the two lowest-frequency letters are assigned to leaves that are siblings in T*.
  * //An Algorithm to Construct an Optimal Prefix Code//
    * Suppose y* and z* are the two lowest-frequency letters in S.
      * we can lock them together because we know they end up as sibling leaves below a common parent
      * Replace y* and z* with a meta-letter, obtaining an alphabet that is one letter smaller. Recursively find a prefix code for the smaller alphabet, and then "open up" the meta-letter back into y* and z* to obtain a prefix code for S:
//Huffman's Algorithm//

{{:courses:cs211:winter2018:journals:patelk:optimalprefixcodesiblings.png?nolink&400|}}

{{:courses:cs211:winter2018:journals:patelk:optimalprefixcodesiblings2.png?nolink&400|}}

  * Bottom-up approach: focuses on the leaves representing the two lowest-frequency letters, and then continues by recursion.

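The merge-and-recurse procedure above can be sketched in Python. This is a sketch under assumed example frequencies; the function name and the tuple used as a meta-letter are my own choices:

```python
# Huffman's algorithm as described above: repeatedly merge the two
# lowest-frequency letters into a meta-letter, recurse on the smaller
# alphabet, then "open up" the meta-letter with one extra 0/1 bit.

def huffman(freq):
    """Return a prefix code (dict: letter -> bit string) for the given frequencies."""
    if len(freq) == 2:
        a, b = freq                # base case: one bit per letter
        return {a: "0", b: "1"}
    # y*, z* = the two lowest-frequency letters
    y, z = sorted(freq, key=freq.get)[:2]
    merged = dict(freq)
    del merged[y], merged[z]
    meta = (y, z)                  # meta-letter standing in for y* and z*
    merged[meta] = freq[y] + freq[z]
    code = huffman(merged)         # recurse on the one-smaller alphabet
    prefix = code.pop(meta)        # open up the meta-letter into siblings
    code[y] = prefix + "0"
    code[z] = prefix + "1"
    return code

freqs = {"a": 0.32, "b": 0.25, "c": 0.20, "d": 0.18, "e": 0.05}
print(huffman(freqs))  # a, b, c get 2-bit codewords; d, e get 3 bits (ABL = 2.23)
```

Each recursive call shrinks the alphabet by one letter, matching the "equivalent smaller problem" greedy rule stated at the top of this section.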
**Analyzing the Algorithm**
  * //The Optimality of the Algorithm//
    * Recursive algorithm: proof by induction
    * Base case: optimal for all two-letter alphabets (one bit per letter)
    * Inductive hypothesis: optimal for all alphabets of size k-1
    * Inductive step: consider an alphabet S of size k
    * The Huffman code for a given alphabet achieves the minimum average number of bits per letter of any prefix code
  * //Implementation and Running Time//
    * The algorithm can be implemented in time polynomial in k, the number of letters in the alphabet
      * the recursive calls unwind into a sequence of k-1 iterations
      * each iteration consists simply of identifying the two lowest-frequency letters -> O(k)
      * combined: O(k^2)
    * Better: use a priority queue to maintain the set of k letters
      * allows for efficient insertion and extraction of the minimum
      * Each iteration: extract the minimum twice; insert a new letter whose key is the sum of these two minimum frequencies
      * Using a priority queue via a heap: each insertion and extraction takes O(log k) -> over k-1 iterations, O(k log k) total

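The priority-queue iteration can be sketched with Python's heapq. The frequencies and the depth-counting bookkeeping are illustrative assumptions; this version computes codeword lengths rather than the codewords themselves:

```python
# Heap-based sketch of the iteration described above: each round extracts
# the two minimum-frequency items and inserts their merged meta-letter,
# so the k-1 rounds cost O(k log k) overall.
import heapq

def huffman_heap(freq):
    """Return codeword lengths (dict: letter -> number of bits) via a priority queue."""
    # Heap entries: (frequency, tie-breaker, set of letters in this subtree).
    heap = [(f, i, {x}) for i, (x, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    depth = {x: 0 for x in freq}
    tie = len(heap)
    while len(heap) > 1:
        f1, _, s1 = heapq.heappop(heap)   # extract the minimum twice
        f2, _, s2 = heapq.heappop(heap)
        for x in s1 | s2:                 # every letter in the merged subtree
            depth[x] += 1                 # ends up one level deeper in the tree
        heapq.heappush(heap, (f1 + f2, tie, s1 | s2))
        tie += 1
    return depth

freqs = {"a": 0.32, "b": 0.25, "c": 0.20, "d": 0.18, "e": 0.05}
print(huffman_heap(freqs))  # {'a': 2, 'b': 2, 'c': 2, 'd': 3, 'e': 3}
```

The integer tie-breaker keeps heap comparisons well-defined when two subtrees have equal frequency.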
**Extensions**
  * Data compression of black-and-white images: a straightforward encoding already uses just one bit per pixel
    * to compress such images further, we would need a way to use a "fraction of a bit" for each white pixel
  * Huffman codes cannot adapt to changes in the text: e.g., the frequencies are stable for half the text, but then change radically
    * Halfway into the sequence, insert some kind of instruction that completely changes the encoding
    * Schemes that invest in such a change when it pays off in the long run are called adaptive compression schemes

==== Personal Thoughts ====

While arguably one of the more interesting topics of this textbook, this section was long and very dense. There was a lot of information that was hard to digest all at once. Our discussion in class was helpful, but not long enough to make this section very clear. I will most likely need to review this chapter section again.
Readability: 5 Interesting: 8
  
courses/cs211/winter2018/journals/patelk/chapter4.1520696070.txt.gz · Last modified: by patelk