===== Section 4.7 =====

  * **Summary & Motivations:** Section 4.7 is Clustering. The problem of clustering arises when we want to classify subsets of a collection into "coherent groups" (p 158). To do so, we need a way of quantifying how similar the entities are, which can be done with a distance function. Distances don't have to be physical distances, though; rather, they're some measure of similarity. In the end, entities in the same cluster should be similar and entities in different clusters should be dissimilar. The flavor of clustering we explore here is Clusterings of Maximum Spacing. Given a set of entities where we can determine the distance between any two entities, we seek to find a k-clustering (a division of the entities into k clusters). The spacing of a k-clustering is "the minimum distance between any pair of points lying in different clusters" (p 158), so maximum spacing means the individual k clusters are as far apart as possible. Since there are exponentially many k-clusterings of a set of entities, is there an efficient algorithm for finding one with maximum spacing?
  * **About the Algorithm:** The algorithm is quite simple: we just add edges in increasing distance order, skipping any edge whose endpoints already lie in the same cluster, until we have k clusters (see the sketch after this list). This algorithm is essentially Kruskal's Algorithm stopped "before it adds its last k - 1 edges" (p 159), which is the same as taking the MST produced by Kruskal's Algorithm and deleting the k - 1 most expensive edges. The text does not provide any further elaboration on this algorithm's implementation or runtime, which I assume is because it is so similar to Kruskal's Algorithm. I believe the runtime of this algorithm is O(m log n).
  * **My Questions:** Am I correct that the runtime is O(m log n)?
  * **Readability:** I was ready to give this section a 9, but then I got to the proof for **4.26**. I didn't process that fully the first time around, so I'm giving this section a 7 instead.
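
Here is a minimal sketch of the procedure from the "About the Algorithm" bullet above, in Python. The function name, the edge-list representation, and the use of path compression alone (no union by rank) are my own assumptions, not from the text; this is meant to illustrate Kruskal's Algorithm stopped k - 1 merges early, not to be a definitive implementation.

<code python>
def max_spacing_k_clustering(n, edges, k):
    # edges: list of (distance, i, j) tuples over points 0..n-1.
    # Sorting dominates the cost: O(m log m) = O(m log n) for m edges,
    # since the union-find operations are nearly constant time.
    parent = list(range(n))

    def find(x):  # union-find "find" with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    clusters, spacing = n, None
    for d, i, j in sorted(edges):  # add edges in increasing distance order
        ri, rj = find(i), find(j)
        if ri == rj:
            continue  # i and j are already in the same cluster
        if clusters == k:
            spacing = d  # stop before the last k - 1 merges: this edge's
            break        # distance is the spacing of the k-clustering
        parent[ri] = rj  # merge the two clusters
        clusters -= 1

    labels = [find(x) for x in range(n)]  # cluster label per point
    return labels, spacing
</code>

Why is the spacing just the next merging edge's distance d? Because edges are considered in increasing order, every distance smaller than d either connected points inside one cluster or caused a merge, so d is the minimum distance between points in different clusters.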
===== Section 4.8 =====

  * **Summary & Motivations:** Section 4.8 is Huffman Codes and Data Compression. We explore encoding the symbols of an alphabet using bits: converting an alphabet that's readable to humans into sequences of 0's and 1's that are readable to computers. A simple approach would be a fixed-length encoding. Say the alphabet we want to encode is 26 characters (a through z, presumably), the space, the comma, the period, the question mark, the exclamation point, and the apostrophe. Since we can form 2^b different sequences of b bits, we can use 5 bits per symbol to cover 2^5 = 32 symbols. The ASCII code does this, just with a larger alphabet, so there are more bits per symbol. A better approach would be to determine the length of each symbol's encoding based on the symbol's frequency in the language, hoping to use fewer than b bits per symbol on average. Even low-tech Morse code does this! In Morse code, higher-frequency letters use fewer dots/dashes. Morse code, however, actually uses a three-letter alphabet, because pauses are required between symbols to avoid ambiguity. Ambiguity would otherwise arise because some symbols have encodings that are prefixes of other symbols' encodings. We can solve this problem by disallowing any individual encoding from being the prefix of any other; under this regime, the encodings are called "prefix codes." Furthermore, if we have a frequency f_x for each letter x, equal to "the fraction of letters in the text that are equal to x" (p 165), we can determine the average number of bits per letter (ABL) of an encoding as the sum over all letters x of f_x times the length of x's encoding. Our goal is to find an algorithm that, given an alphabet and a set of frequencies for its letters, produces a prefix code minimizing the ABL.
  * **About the Algorithm:** We employ a binary tree to represent a prefix code, since that's easier to visualize than a list of function mappings. If we only allow the leaves of this binary tree to represent the symbols in the alphabet, the codes are naturally prefix codes: a leaf has no descendants, so the path to one leaf can never continue on to another, and no code can be a prefix of another. Furthermore, we say that going left equates to a 0 and going right equates to a 1. The binary tree that gives an optimal prefix code is full, because if there were a node with only one child, you could simply replace that node with its one child, thus shortening the code. We use the knowledge that leaves of minimum depth in the tree should correspond to letters of maximum frequency to then conclude that leaves of maximum depth should correspond to letters of minimum frequency. From there, we formulate an algorithm to determine the optimal prefix code: we pick the two lowest-frequency letters y* and z* in the alphabet S and combine them into a "meta-letter" with frequency f_{y*} + f_{z*}, then "recursively find a prefix code for the smaller alphabet" (p 172), which is now one letter smaller. When we get to the point where the alphabet has only two letters, we arbitrarily assign one the code 0 and the other 1. We can form the full prefix code by "unfold[ing] the result back through the recursive calls" (p 173); a sketch appears after this list. If we maintain a heap-based priority queue using each letter's frequency as its key, this algorithm is O(n log n), where n is the number of letters in the alphabet.
  * **My Questions:** I find the term "prefix code" confusing. To me, it sounds like a "prefix code" //would// be the prefix of another code, rather than it strictly not being the prefix of another code.
  * **Second Time Around:** Something that actually made less sense the second time around: the algorithm itself. I didn't really understand why the text separates the "Define a prefix code for S as follows" block from the rest of the procedure.
  * **Note to Self:** Approaching a problem top-down vs bottom-up can affect optimality, as in the case of Shannon-Fano codes vs Huffman codes.
  * **Readability:** 8 - I almost gave this section a 9 or 10, but the algorithm made less sense in the reading than it did in class.
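
To pin down the "About the Algorithm" bullet above, here is a minimal sketch in Python. The function name, the nested-tuple tree representation, and the sample frequencies are my own assumptions; it is meant to mirror the merge-then-unfold structure of Huffman's Algorithm described in the text, not to be a definitive implementation.

<code python>
import heapq
from itertools import count

def huffman_code(freqs):
    # freqs: dict mapping each letter to its frequency f_x.
    # A heap-based priority queue keyed on frequency gives O(n log n)
    # overall for an n-letter alphabet.
    tiebreak = count()  # unique tag so frequency ties never compare nodes
    heap = [(f, next(tiebreak), letter) for letter, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Take the two lowest-frequency (meta-)letters y* and z* and
        # merge them into a meta-letter of frequency f_y* + f_z*.
        f_y, _, y = heapq.heappop(heap)
        f_z, _, z = heapq.heappop(heap)
        heapq.heappush(heap, (f_y + f_z, next(tiebreak), (y, z)))
    root = heap[0][2]

    # "Unfold" the merges back into codewords: left child = 0, right = 1.
    codes = {}
    def unfold(node, code):
        if isinstance(node, tuple):  # internal node: recurse on children
            unfold(node[0], code + "0")
            unfold(node[1], code + "1")
        else:                        # leaf: record this letter's codeword
            codes[node] = code
    unfold(root, "")
    return codes

# Hypothetical frequencies for a five-letter alphabet (they sum to 1):
freqs = {"a": 0.32, "b": 0.25, "c": 0.20, "d": 0.18, "e": 0.05}
codes = huffman_code(freqs)
# ABL = sum over all letters x of f_x * (length of x's encoding):
abl = sum(f * len(codes[x]) for x, f in freqs.items())
</code>

On these sample frequencies the ABL comes to 2.23, beating the 3 bits per letter a fixed-length code would need for a five-letter alphabet.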