Entry, Ch. 4.7-8, 5.1
Section 4.7 explains the concept of clustering, a situation in which a collection of objects is organized into coherent groups. For our purposes, we generally use a distance function on the objects, and say that those with a larger distance are less similar to each other, and vice versa. The book defines the concept of maximum spacing using a set U of n objects: a k-clustering of U is a partition of U into k nonempty sets, and its spacing is the minimum distance between any pair of points lying in different clusters. Our goal is to efficiently find, among the exponentially many different k-clusterings of U, one whose spacing is maximum. To do this, the book proposes growing a graph on the vertex set U, with the connected components as the clusters. First we draw an edge between the closest pair of points, then another edge between the next closest pair of points, and so on in order of increasing distance. If an edge would connect two points already in the same cluster, we don’t add it, so that we avoid creating a cycle. The book calls this method single-link clustering. It is precisely Kruskal’s minimum spanning tree algorithm, except in this case we stop running the algorithm once we have k connected components. The proof that this algorithm produces a k-clustering of maximum spacing is explained on p. 160.
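To make the single-link idea concrete, here is a minimal Python sketch of Kruskal’s algorithm stopped at k connected components, using a union-find structure over point indices. The function name single_link_clustering, the 1-D example points, and the cluster representation are my own illustration, not the book’s code.

def single_link_clustering(points, dist, k):
    # points: a list of objects; dist: a function (p, q) -> distance;
    # k: the desired number of clusters.
    parent = list(range(len(points)))    # union-find over indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    # Sort all pairs (edges) by increasing distance, as Kruskal does.
    n = len(points)
    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i in range(n) for j in range(i + 1, n)
    )

    components = n
    for d, i, j in edges:
        if components == k:
            break                        # stop at k connected components
        ri, rj = find(i), find(j)
        if ri != rj:                     # skip edges inside a cluster
            parent[ri] = rj              # (they would create a cycle)
            components -= 1

    # Group points by their union-find root to report the clusters.
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(points[i])
    return list(clusters.values())

# Example: three clusters among 1-D points.
print(single_link_clustering([1, 2, 10, 11, 20], lambda a, b: abs(a - b), 3))
# [[1, 2], [10, 11], [20]] (ordering may differ)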
Section 4.8 discusses data compression, using the example of encoding strings of text as bits, where the problem is to reduce the average number of bits per letter. The optimal solution takes advantage of the nonuniform frequencies with which letters and letter combinations occur (remember the Morse code example from class). This leads to prefix codes: a prefix code for a set S of letters is a function γ that maps each letter x ∈ S to some sequence of zeroes and ones, in such a way that for distinct x, y ∈ S, the sequence γ(x) is not a prefix of the sequence γ(y) (see the rule on p. 164 for reconstructing encoded text). Page 165 explains the notation for expressing the frequencies of letters. The greedy algorithm that constructs optimal prefix codes uses a binary tree T whose number of leaves equals the size of the alphabet S, with each leaf labeled by a distinct letter from S. On p. 166-7 the book proves that the encoding of S constructed from T is a prefix code, and it recasts the search for an optimal prefix code as a search for a binary tree, together with a labeling of the leaves of that tree, that minimizes the average number of bits per letter. Equivalently, we seek a labeled tree that minimizes the weighted average of the depths of all leaves, where the average is weighted by the frequencies of the letters that label the leaves. The proof on p. 168-9 shows that the binary tree corresponding to the optimal prefix code is full (every internal node has exactly two children). Page 172 outlines the algorithm to construct an optimal prefix code, called Huffman’s Algorithm, and on p. 174 the book proves that the Huffman code for a given alphabet achieves the minimum average number of bits per letter of any prefix code. Huffman’s Algorithm has a running time of O(k log k) using a priority queue: the recursive calls of the algorithm define a sequence of k-1 iterations over smaller and smaller alphabets, and each iteration except the last identifies the two lowest-frequency letters and merges them into a single letter that has the combined frequency.
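As a concrete illustration of the merging step, here is a short Python sketch of Huffman’s Algorithm using a priority queue (heapq), which is what gives the O(k log k) bound. The function name huffman_code and the tuple representation of trees are my own assumptions, not the book’s presentation.

import heapq

def huffman_code(freq):
    # freq: dict mapping letters to frequencies.
    # Each heap entry is (frequency, tiebreak, tree); a tree is either
    # a letter or a (left, right) pair. The integer tiebreak keeps the
    # heap from ever comparing two trees directly.
    heap = [(f, i, letter) for i, (letter, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)

    # k-1 iterations: merge the two lowest-frequency "letters" into a
    # single letter that has the combined frequency.
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, tiebreak, (t1, t2)))
        tiebreak += 1

    # The path to each leaf (0 = left, 1 = right) is its codeword.
    code = {}
    def walk(tree, path):
        if isinstance(tree, tuple):
            walk(tree[0], path + "0")
            walk(tree[1], path + "1")
        else:
            code[tree] = path or "0"   # degenerate one-letter alphabet
    walk(heap[0][2], "")
    return code

print(huffman_code({"a": 0.5, "b": 0.25, "c": 0.15, "d": 0.10}))
# Higher-frequency letters get shorter codewords, e.g. "a" -> "0".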
Section 5.1 looks at the Mergesort Algorithm, a classic example of divide and conquer. Mergesort sorts a given list by dividing the list into two equal halves, sorting each half recursively, then combining the results. Page 210 derives the recurrence relation T(n) ≤ 2T(n/2) + cn for the algorithm’s running time, which solves to O(n log n). We can approach a recurrence by either “unrolling” the recursion, keeping track of the running time for the first few levels until we see a pattern, or by guessing what the pattern is and checking that it works. The book suggests unrolling the mergesort recurrence by analyzing the first few levels, as we did in class, identifying a pattern, then summing over all levels of recursion (p. 212-3).
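Here is a small Python sketch of Mergesort to go with the recurrence; the helper names mergesort and merge are my own, and the comments trace where the 2T(n/2) and cn terms come from.

def mergesort(a):
    # T(n) <= 2T(n/2) + cn: two recursive calls on halves, plus
    # linear-time merging. Unrolled over ~log2(n) levels, each level
    # does O(n) total merging work, giving O(n log n) overall.
    if len(a) <= 1:
        return a                   # base case: already sorted
    mid = len(a) // 2
    left = mergesort(a[:mid])      # the two T(n/2) terms...
    right = mergesort(a[mid:])
    return merge(left, right)      # ...and the cn term

def merge(left, right):
    # Combine two sorted lists in linear time.
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out

print(mergesort([5, 2, 9, 1, 5, 6]))   # [1, 2, 5, 5, 6, 9]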
I give these sections a 9 for readability. The trickiest parts to read were the examples we had already stepped through in class, so I felt like I kept up with the material. The only hard part for me was analyzing runtimes for recursive problems: I’m not totally confident I could break down a problem and come up with the total runtime on my own.