Table of Contents

Entry Six

Chapter 4.7

Clustering

How do minimum spanning trees play a role in clustering???

Clustering is when you organize a collection of objects into coherent groups. How might we organize these groups? By how similar they are. We use a distance function on the objects with the assumption that those closer together are more similar to each other.

The Problem: Given a set of objects and a distance function on the objects, divide the objects into groups so that objects in the same group are close and ones in different groups are far apart.

Clusterings of Maximum Spacing- Given a set of objects each pair has a numeric distance and that distance is greater than 0 for distinct objects and the distances are symmetric. Given the number of groups k, we want to divide the set of objects into k groups. A k-clustering is a partition of the set into k non-empty sets. The spacing of a clustering is the minimum distance between nay pair of points in different clusters. Our goals is to maximize the minimum spacing. We want points in different clusters to be as far apart as possible from each other.

The Algorithm:

Grow a graph on the vertex set U edge by edge with connected components corresponding to clusters.

The graph is a union of trees. The algorithm is just like Kruskal's but instead of continuing until we have one connected component, we stop at k connected components. We stop before it adds k-1 edges. We are creating a minimum spanning tree and deleting the longest edges.

(4.26) The components C1, C2,…,Ck formed by deleting the k-1 most expensive edges of the minimum spanning tree T constitute a k-clustering of maximum spacing.

Proof:

Let C be the clustering we produce from our algorithm. The spacing of our solution is the length of the k-1th most expensive edge in the MST. Consider another solution C' we will show that the spacing of C' is at most the spacing of C. Since the clusterings are not the same there are points that belong to different clusters in C'. Because they both belong in a different cluster in C' it means that each edge has a length of at most the spacing of solution C. ~*This proof in the book is really wordy and confusing*~

This section was a really short and easy read, other than the proof I thought it was decent. Readability 9/10.

Chapter 4.8

Huffman Codes and Data Compression

In this problem we use greedy rules to “shrink the size of the problem instance, so that an equivalent smaller problem can then be solved by recursion.”

The Problem:

Encoding Symbols Using Bits

Variable-Length Encoding Schemes

Prefix Codes

Optimal Prefix Codes

The Algorithm:

We use a greedy method to construct an optimal prefix code by developing tree representation of the codes.

Representing Prefix Codes Using Binary Trees

(4.27) The encoding of S constructed from T is a prefix code

Proof: The letters can only be leaves, so an encoding of x cannot be a prefix to the encoding of y because x is not a node on the path to y.

We can also build a tree from a prefix code recursively!! So the search for a binary tree is the search for a prefix code. We want the tree that minimizes the weighted average (frequencies of the letters) of the depths of all leaves.

A binary tree is full if each node that is not a leaf has two children.

(4.28) The binary tree corresponding to the optimal prefix code is full. (p.168)

We prove this using an exchange argument.

The Top-Down Approach

If we knew the binary tree corresponding to an optimal prefix code but didn't know which letters the leaves were we can complete the tree with an easy fact. (4.31) There is an optimal prefix code, with corresponding tree T*, in which the two lowest-frequency letters are assigned to leaves that are siblings in T*.

Lock the two lowest frequencies together into a metaletter…recursive strategy:

Huffman's Algorithm:

  To construct a prefix code fo an alphabet S, with given frequencies:
    If S has two letters then
      Encode one letter using 0 and the other letter using 1
    Else
      Let y* and z* be the two lowest-frequency letters
      Form a new alphabet S' by deleting y* and z* and replacing them with a new letter w of frequency f(y)+f(z)
      Recursively construct a prefix code A' for S', with tree T'
      Define a prefix code for S as follows:
        Start with T'
        Take the leaf labeled w and add two children below it labeled y* and z*
    Endif
    

The prefix code this algorithm produces is called Huffman code. This is a bottom-up approach by focusing on leaves with two lowest frequencies and moving up to leaves with highest frequencies.

Analysis:

We prove the optimality of Huffman's by induction. The proof is really long and complicated but follows to prove by contradiction that the tree Huffman's produces is optimal. The total runtime of Huffman's algorithm is O(k log k). If we use a PQ by a heap we can inset and extract the minimum in O(log k) time which we do a total of k times.

This section was realllllly long to read which made it more boring, but the presentation of it in class was much more interesting. The proof was definitely a part of the chapter that I will have to come back to to understand. Four pages of a lot of variables was intimidating. I would rate this readability: 7/10

Chapter 5.1

A First Recurrence: The Mergesort Algorithm

In this section we are beginning divide-and-conquer algorithms!!

Mergesort

Let T(n) be the worst-case running time of the algorithm on a list of size n. To divide the list into two even parts (n/2) takes O(n) time. For each division it takes T(n/2) to solve the sorting. It then spends O(n) to combine the lists back together. SO….

(5.1) For some constant c, T(n) =< 2T(n/2) + cn when n > 2 and T(2) =< c.

T(n) is bounded by an inequality. There is a base case that says T(n) is equal to a constant when n is a constant. We have to solve the recurrence so that T is only on the LHS of the inequality.

How to solve recurrences:

Unrolling the Mergesort Recurrence

Substituting a Solution into the Mergesort Recurrence

I would rate this chapter a readability of 8/10. It was fairly short and easy to follow but if I hadn't read this chapter after it was presented in class I would have been a little lost.