Differences

This shows you the differences between two versions of the page.

--- courses:cs211:winter2018:journals:boyese:chapter4 [2018/03/06 00:26] – [Section 4.7: Clustering] boyese
+++ courses:cs211:winter2018:journals:boyese:chapter4 [2018/03/12 18:25] (current) – [Section 4.8: Huffman Codes and Data Compression] boyese
@@ Line 263: / Line 263: @@
 ====Section 4.8: Huffman Codes and Data Compression====
+===Huffman Codes===
+**Problem: Encoding**
+  * Computers use bits: 0s and 1s
+  * Need to represent what humans know to what computers know
+  * Map symbol → unique sequence of 0s and 1s
+  * Process is called encoding
+  * The least number of bits needed to encode 26 letters, space, and 5 punctuation marks (, . ? ! ‘) (32 characters = 25), so 5 bits
+  * For a longer string of characters, we can use shorter encodings for frequently used characters like ETAOINSHRDLU
+      * Goal: optimal encoding that takes advantage of non-uniformity of letter frequencies
+      * Looking to represent data as compactly as possible
+      * Example: Morse code, which is an example of variable-length encoding
+      * We run into a problem when doing this: ambiguity caused by the encoding of one character being a prefix of encoding another
+**Optimal Prefix Codes**
+  * Solution to ambiguity: Prefix codes, which are a map of letters to bit strings such that no encoding is a prefix of any other
+      * Goal: minimize the average number of bits per letter (ABL):
+      * Σ<sub>x∈S</sub> : frequency of x * length of encoding x
+      * fx : frequency that the letter x occurs
+      * |ϒ(x)|: length of encoding of x
+      * Minimize ABL =  Σ<sub>x∈S</sub> f<sub>x</sub>|ϒ(x)|
+  * Binary tree to represent prefix codes
+      * Exposes structure netter than list of mappings
+          * Each leaf node is a letter
+          * Follow path to the letter
+              * Going left: 0, Going right: 1
+  * Recursively generate tree
+      * All letters are in root node
+      * For all letters in node:
+          * If encoding begins with 0, letter belongs in left subtree
+          * Otherwise (encoding begins with 1), letter belongs in right subtree
+          * If last bit of encoding, make the letter a leaf node of that subtree
+          * Shift encoding one bit
+          * Process left and right children
+  * Tree Properties
+      * The length of the path from root to leaf is the depth
+      * The binary tree T corresponding to the optimal prefix code is full, i.e., each internal node has two children
+  * Conclusions
+      * The binary tree corresponding to the optimal prefix code is full, i.e., each internal node has two children
+      * We want to label the leaf nodes of the binary tree corresponding to the optimal prefix code such that nodes with greatest depth have least frequency
+      * The two letters with least frequency are definitely going to be siblings
+**Huffman’s Algorithm**
+  * Data structures needed:
+      * Binary tree for the prefix codes
+      * Priority queue for choosing the node with lowest frequency
+  * Costs
+      * Inserting and extracting node into PQ: O(log n)
+      * Number of insertions and extractions: O(n)
+      * O(n log n)
+<code>
+To construct a prefix code for an alphabet S, with given frequencies:
+    If S has two letters then
+        Encode one letter using 0 and the other letter using 1
+    Else
+        Let y* and z* be the two lowest-frequency letters
+        Form a new alphabet S’ by deleting y* and z* and replacing them with a new letter ω of frequency f<sub>y*</sub> + f<sub>z*</sub>
+        Recursively construct a prefix code γ’ for S’, with tree T’
+        Define a prefix code for S as follows:
+            Start with T’
+        Take the leaf labeled ω and add two children below it labeled y* and z*
+    Endif
+</code>
 ====Section 4.9: Minimum-Cost Arborescences: A Multi-Phase Greedy Algorithm====