Differences
This shows you the differences between two versions of the page.
Both sides previous revisionPrevious revisionNext revision | Previous revision | ||
courses:cs211:winter2014:journals:deirdre:chapter4 [2014/03/05 00:36] – [4.7 - Clustering] tobind | courses:cs211:winter2014:journals:deirdre:chapter4 [2014/03/05 04:55] (current) – [4.8 - Huffman Codes and Data Compression] tobind | ||
---|---|---|---|
Line 168: | Line 168: | ||
Suppose we are given a set //U// of //n// objects, labeled //p1, p2, ..., pn//. For each pair, //pi// and //pj//, we have a numerical distance //d(pi, pj)//. We require only that // | Suppose we are given a set //U// of //n// objects, labeled //p1, p2, ..., pn//. For each pair, //pi// and //pj//, we have a numerical distance //d(pi, pj)//. We require only that // | ||
- | *Designing the Algorithm** | + | **Designing the Algorithm** |
Look at that! We're doing exactly what Kruskal' | Look at that! We're doing exactly what Kruskal' | ||
Line 177: | Line 177: | ||
I give this section an 8.5. This made a lot of sense during class so I don't think I got too much more out of reading it. | I give this section an 8.5. This made a lot of sense during class so I don't think I got too much more out of reading it. | ||
====== 4.8 - Huffman Codes and Data Compression ====== | ====== 4.8 - Huffman Codes and Data Compression ====== | ||
+ | **The Problem** Since computes operate on sequences of bits, one needs encoding schemes that take text written in alphabets and converts this text into long strings of bits. If we take the simple solution and say 2^5 = 32 which is more than enough options to cover basic 26 letters and symbols of English. But this is spending an average of five bits per symbol. Since some letters are used much more frequently, if we used less bits for them, we'd have a lower average of bits. | ||
+ | |||
+ | If we think of Morse code, more frequent letters are mapped to shorter bit strings like ' | ||
+ | |||
+ | **Prefix codes** | ||
+ | The ambiguity problem in Morse code arises because there exist pairs of letters where the bit string that encodes one letter is a prefix of the bit string that encodes another. To eliminate this problem, it is necessary to map letters to bit strings in such a way that no encoding is a prefix of another encoding. Concretely, we say that a prefix code for a set //S// of letters is a function γ that maps each letter //x ∈ S// to some sequence of zeros and ones, in such a way that for distinct //x,y ∈ S//, the sequence γ(//x//) is not a prefix of the sequence γ(// | ||
+ | |||
+ | To reconstruct text that uses a prefix code, | ||
+ | * Scan the bit sequence from left to right. | ||
+ | * As soon as you have enough bits to match an encoding of a letter, output that letter. | ||
+ | * Delete the corresponding bits from the front of the message and repeat. | ||
+ | |||
+ | **Optimal Prefix Codes** | ||
+ | Suppose that for each letter //x ∈ S//, there is a frequency //fx//, representing the fraction of letters in the text that are equal to //x//. In other words, assuming there are //n// letters total, //nfx// of these letters are equal to //x//. We notice that the frequencies sum to 1. | ||
+ | |||
+ | We want a prefix code that minimizes the average number of bits per letter. -> optimal prefix code. | ||
+ | |||
+ | **Designing the algorithm** | ||
+ | We want to represent prefix codes using binary trees. For each letter //x ∈ S//, we follow the path from the root to the leaf labelled //x//, each time the path goes from a node to its left child, we write down a 0, and each time the path goes from a node to its right child, we write down a 1. The resulting string of bits is the encoding of //x//. This works the other way too. Given a prefix code γ, we can build a binary tree recursively. We start with a root. All letters whose encodings begin with a 0 will be leaves in the left subtree and all letters whose encodings begin with a 1 will be leaves in the right subtree of the root. | ||
+ | |||
+ | So the search for the optimal prefix code can be viewed as the search for a binary tree //T//, together with a labeling of the leaves of //T//, that minimizes the average number of bits per letter. The length of the encoding of a letter is simply the length of the path from the root to the leaf labeled //x//. We will refer to the length of this path as the depth of teh leaf. So, we are looking for the labeled tree that minimizes the weighted average of the depths of all leaves where the average is weighted by the frequencies of the letters that label the leaves. As a first step, we note that the binary tree corresponding to the optimal prefix code is full. | ||
+ | |||
+ | |||
+ | Suppose that u and v are leaves of T* such that depth(u) < depth(v). Further, suppose that in a labeling of T* corresponding to an optimal prefix code, leaf u is labeled with y and leaf v is labeled wiht z. Then fy >= fz. | ||
+ | | ||
+ | There is an optimal prefix code, with corresponding tree T*, in which the two lowest-frequency letters are assigned to leaves that are siblings in T*. | ||
+ | | ||
+ | Huffman' | ||
+ | If S has two letters then | ||
+ | | ||
+ | Else | ||
+ | Let y* and z* be the two lowest-frequency letters | ||
+ | Form a new alphabet S' by deleting y* and z* and replacing them with a new letter w of frequency fy* + fz* | ||
+ | | ||
+ | | ||
+ | Take the leaf labeled w and add two children below it labeled y* and z* | ||
+ | |||
+ | The Huffman code for a given alphabet achieves the min average number of bits per letter of any prefix code. <--- optimality (proof 174-5) | ||
+ | |||
+ | By maintaining the alphabet //S// in a pq, using each letter' | ||
+ | |||
+ | I thought this section was really interesting - 9. | ||
+ |