Differences
This shows you the differences between two versions of the page.
| Both sides previous revisionPrevious revisionNext revision | Previous revision | ||
| courses:cs211:winter2018:journals:boyese:chapter4 [2018/03/06 00:26] – [Section 4.7: Clustering] boyese | courses:cs211:winter2018:journals:boyese:chapter4 [2018/03/12 18:25] (current) – [Section 4.8: Huffman Codes and Data Compression] boyese | ||
|---|---|---|---|
| Line 263: | Line 263: | ||
| ====Section 4.8: Huffman Codes and Data Compression==== | ====Section 4.8: Huffman Codes and Data Compression==== | ||
| + | ===Huffman Codes=== | ||
| + | **Problem: Encoding** | ||
| + | * Computers use bits: 0s and 1s | ||
| + | * Need to represent what humans know to what computers know | ||
| + | * Map symbol → unique sequence of 0s and 1s | ||
| + | * Process is called encoding | ||
| + | * The least number of bits needed to encode 26 letters, space, and 5 punctuation marks (, . ? ! ‘) (32 characters = 25), so 5 bits | ||
| + | * For a longer string of characters, we can use shorter encodings for frequently used characters like ETAOINSHRDLU | ||
| + | * Goal: optimal encoding that takes advantage of non-uniformity of letter frequencies | ||
| + | * Looking to represent data as compactly as possible | ||
| + | * Example: Morse code, which is an example of variable-length encoding | ||
| + | * We run into a problem when doing this: ambiguity caused by the encoding of one character being a prefix of encoding another | ||
| + | **Optimal Prefix Codes** | ||
| + | * Solution to ambiguity: Prefix codes, which are a map of letters to bit strings such that no encoding is a prefix of any other | ||
| + | * Goal: minimize the average number of bits per letter (ABL): | ||
| + | * Σ< | ||
| + | * fx : frequency that the letter x occurs | ||
| + | * |ϒ(x)|: length of encoding of x | ||
| + | * Minimize ABL = Σ< | ||
| + | * Binary tree to represent prefix codes | ||
| + | * Exposes structure netter than list of mappings | ||
| + | * Each leaf node is a letter | ||
| + | * Follow path to the letter | ||
| + | * Going left: 0, Going right: 1 | ||
| + | * Recursively generate tree | ||
| + | * All letters are in root node | ||
| + | * For all letters in node: | ||
| + | * If encoding begins with 0, letter belongs in left subtree | ||
| + | * Otherwise (encoding begins with 1), letter belongs in right subtree | ||
| + | * If last bit of encoding, make the letter a leaf node of that subtree | ||
| + | * Shift encoding one bit | ||
| + | * Process left and right children | ||
| + | * Tree Properties | ||
| + | * The length of the path from root to leaf is the depth | ||
| + | * The binary tree T corresponding to the optimal prefix code is full, i.e., each internal node has two children | ||
| + | * Conclusions | ||
| + | * The binary tree corresponding to the optimal prefix code is full, i.e., each internal node has two children | ||
| + | * We want to label the leaf nodes of the binary tree corresponding to the optimal prefix code such that nodes with greatest depth have least frequency | ||
| + | * The two letters with least frequency are definitely going to be siblings | ||
| + | |||
| + | **Huffman’s Algorithm** | ||
| + | * Data structures needed: | ||
| + | * Binary tree for the prefix codes | ||
| + | * Priority queue for choosing the node with lowest frequency | ||
| + | * Costs | ||
| + | * Inserting and extracting node into PQ: O(log n) | ||
| + | * Number of insertions and extractions: | ||
| + | * O(n log n) | ||
| + | |||
| + | < | ||
| + | To construct a prefix code for an alphabet S, with given frequencies: | ||
| + | If S has two letters then | ||
| + | Encode one letter using 0 and the other letter using 1 | ||
| + | Else | ||
| + | Let y* and z* be the two lowest-frequency letters | ||
| + | Form a new alphabet S’ by deleting y* and z* and replacing them with a new letter ω of frequency f< | ||
| + | Recursively construct a prefix code γ’ for S’, with tree T’ | ||
| + | Define a prefix code for S as follows: | ||
| + | Start with T’ | ||
| + | Take the leaf labeled ω and add two children below it labeled y* and z* | ||
| + | Endif | ||
| + | </ | ||
| ====Section 4.9: Minimum-Cost Arborescences: | ====Section 4.9: Minimum-Cost Arborescences: | ||
