===== Section 4.7 - Clustering =====

This section addresses the problem of clustering. Given a distance function on a set of objects, we want to divide the objects into groups so that objects in the same group are close together and objects in different groups are far apart. If we want k groups, we seek the k-clustering of maximum spacing, where the spacing is the minimum distance between any pair of points lying in different clusters.

The book walks through the process of designing such an algorithm and ultimately observes that the procedure is essentially Kruskal's algorithm. To produce k clusters on a graph G, we simply run Kruskal's algorithm but stop before it performs the last (k-1) edge additions. Equivalently, the book states the formal claim that if we take a minimum spanning tree T and delete its (k-1) most expensive edges, the resulting k components form a k-clustering of maximum spacing. The proof of this was fairly straightforward.
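
To make this concrete, here is a minimal sketch of the idea, assuming the n points are numbered 0 through n-1 and the distances are given as a list of (distance, u, v) tuples; the function name and the union-find representation are my own, not the book's:

<code python>
# Sketch: k-clustering by running Kruskal's algorithm and stopping
# before the last k-1 merges.

def k_clustering(n, edges, k):
    parent = list(range(n))          # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    merges = 0
    for d, u, v in sorted(edges):    # cheapest edges first
        if merges == n - k:          # stop before the last k-1 additions
            break
        ru, rv = find(u), find(v)
        if ru != rv:                 # skip edges inside one component
            parent[ru] = rv
            merges += 1

    clusters = {}
    for p in range(n):
        clusters.setdefault(find(p), []).append(p)
    return list(clusters.values())   # the k clusters
</code>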

I would rate the readability of this section at 9/10.

===== Section 4.8 - Huffman Codes and Data Compression =====

In this section, we consider a slightly different kind of greedy algorithm: one that shrinks the problem so that a smaller problem with the same structure can be solved by recursion. An excellent example is the problem of encoding symbols using bits, which is a core issue in the field of data compression. Given the frequencies of the symbols (letters, in our case), we want a code that minimizes the average number of bits per letter, so that strings of those symbols take as little space as possible.

The text dives into the first truly widespread encoding scheme, Morse code. But Morse code has one critical problem that we must address in data compression: it requires a short pause, in effect a 'space', between letters. We need to be able to represent symbols as one continuous string of bits, working with only two characters, 0 and 1.

This introduces the idea of prefix codes, which are exactly what they sound like: we map each symbol to a string of bits such that no symbol's codeword is a prefix of any other's. In doing so, we can encode any string of symbols unambiguously using only 0s and 1s.

The text walks through how prefix codes can be represented using binary trees: leaf nodes represent letters, left branches represent 0s, and right branches represent 1s. Because letters appear only at the leaves, such a tree establishes a prefix code for the set of letters, and the book walks through a straightforward proof of this. I should note that the binary tree corresponding to an optimal prefix code is also full.
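
As a small illustration (my own example, not the book's), here is a sketch of decoding with this tree view, where 0 means 'go left', 1 means 'go right', and reaching a leaf emits a letter:

<code python>
code = {'a': '0', 'b': '10', 'c': '11'}  # no codeword is a prefix of another

# Build the tree as nested dicts; leaves hold the letter.
root = {}
for letter, bits in code.items():
    node = root
    for bit in bits[:-1]:
        node = node.setdefault(bit, {})
    node[bits[-1]] = letter

def decode(bits):
    out, node = [], root
    for bit in bits:
        node = node[bit]
        if isinstance(node, str):   # reached a leaf: emit and restart
            out.append(node)
            node = root
    return ''.join(out)

print(decode('010110'))  # -> 'abca'
</code>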

The book then walks through a few different approaches to constructing the tree, since that is the core of the problem. It first considers the top-down approach, which splits the set of letters into two halves and builds the tree recursively from those halves. However, the book notes that no version of this top-down approach always produces an optimal prefix code.

This leads us to the bottom-up approach, better known as Huffman codes. Huffman developed his algorithm, and proved its optimality, by constructing the prefix tree from the bottom up. Suppose we are given an alphabet S with known frequencies. If S has two letters, we encode one letter as 0 and the other as 1 (the base case). Otherwise, we let y* and z* be the two lowest-frequency letters, and form a new alphabet S' by deleting y* and z* and replacing them with a new letter w of frequency f(y*) + f(z*). We then recursively construct a prefix code for S', with tree T', and define a prefix code for S as follows: start with T', take the leaf labeled w, and add two children below it labeled y* and z*. This is exactly the kind of recursive shrinking the chapter began by describing. The book follows the algorithm by proving its optimality: the Huffman code for a given alphabet achieves the minimum average number of bits per letter of any prefix code.
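
Here is a minimal sketch of that recursive construction, under an assumed representation of my own (not the book's): a tree is either a letter or a (left, right) pair, and the merged letter w is represented directly by the pair it will expand into, so the "add two children below w" step happens implicitly.

<code python>
# Sketch of the recursive construction described above. `freq` maps each
# letter of S to its frequency; a tree is either a letter or a
# (left, right) pair.

def huffman_tree(freq):
    if len(freq) == 2:                     # base case: one letter gets 0,
        a, b = freq                        # the other gets 1
        return (a, b)
    # y*, z*: the two lowest-frequency letters
    y, z = sorted(freq, key=freq.get)[:2]
    rest = {s: f for s, f in freq.items() if s not in (y, z)}
    rest[(y, z)] = freq[y] + freq[z]       # new letter w, f(y*) + f(z*)
    # recursing on S' already places y* and z* below the leaf w,
    # since w is represented by the pair (y*, z*)
    return huffman_tree(rest)

print(huffman_tree({'a': 0.45, 'b': 0.30, 'c': 0.15, 'd': 0.10}))
# -> ('a', (('d', 'c'), 'b')), i.e. a=0, b=11, c=101, d=100
</code>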

The book then walks through the implementation, first noting that a naive version appears to run in O(k^2) time for an alphabet of k letters, since each of the k-1 merges scans for the two lowest-frequency letters. However, if we use a priority queue implemented via a heap, we can reduce the running time to O(k log k).
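
A sketch of that heap-based implementation, reusing the tree representation from the sketch above; heapq and the tie-breaking counter are my choices, not the book's:

<code python>
import heapq
from itertools import count

def huffman_tree_heap(freq):
    tie = count()   # tie-breaker so the heap never compares trees directly
    heap = [(f, next(tie), s) for s, f in freq.items()]
    heapq.heapify(heap)                    # O(k)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)    # the two lowest-frequency trees;
        f2, _, t2 = heapq.heappop(heap)    # each pop costs O(log k)
        heapq.heappush(heap, (f1 + f2, next(tie), (t1, t2)))
    return heap[0][2]                      # the final Huffman tree
</code>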

In the section mentioning some extensions of Huffman codes, the idea of arithmetic encoding really jumped out at me. I would have thought that a series of black and white pixels had to be represented by 0s and 1s, one bit per pixel; I had not considered that certain common patterns of whites and blacks could themselves be represented using prefix codes.

I would rate the readability of this section at 7/10. Although very interesting, it was hard to stay focused over such a long section.