Chapter 4.7: Clustering

Clustering arises whenever one has a collection of objects - say, a set of photographs, documents, or microorganisms - that one is trying to classify or organize into coherent groups. The clustering problem seeks to divide objects into groups so that, intuitively, objects within the same group are "close," and objects in different groups are "far apart."

The spacing of a k-clustering is the minimum distance between any pair of points lying in different clusters. We seek to find the k-clustering with the maximum possible spacing.
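A minimal sketch of computing the spacing of a given k-clustering directly from this definition; the names clusters and dist are illustrative assumptions, not from the text.

<code python>
def spacing(clusters, dist):
    """Return the minimum distance between any pair of points
    lying in two different clusters (the spacing of the clustering).
    `clusters` is assumed to be a list of lists of points and `dist`
    any distance function - both names are hypothetical."""
    best = float("inf")
    for a in range(len(clusters)):
        for b in range(a + 1, len(clusters)):
            for p in clusters[a]:
                for q in clusters[b]:
                    best = min(best, dist(p, q))
    return best
</code>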

Designing the Algorithm

We add edges between pairs of points in order of increasing distance, but we are only interested in the connected components of the resulting graph, not the edges themselves. Each time we add an edge that spans two distinct components, it is as though we have merged the two corresponding clusters. This procedure is known as single-link clustering. It is exactly what Kruskal's algorithm does; the only difference is that we seek a k-clustering, so we stop the procedure once we obtain k connected components.
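A minimal sketch of this procedure, assuming points are given explicitly along with a distance function; the function and variable names (single_link_clustering, points, k, dist) are illustrative, not from the text.

<code python>
from itertools import combinations

def single_link_clustering(points, k, dist):
    """Merge the closest pairs of clusters (single-link clustering),
    i.e., run Kruskal's procedure on all pairwise distances and stop
    once only k connected components remain."""
    parent = list(range(len(points)))   # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    # All pairwise "edges", sorted by increasing distance.
    edges = sorted(combinations(range(len(points)), 2),
                   key=lambda e: dist(points[e[0]], points[e[1]]))

    components = len(points)
    for i, j in edges:
        if components == k:
            break                       # stop at k clusters
        ri, rj = find(i), find(j)
        if ri != rj:                    # edge spans two clusters: merge them
            parent[ri] = rj
            components -= 1

    # Group the points by the root of their component.
    clusters = {}
    for i in range(len(points)):
        clusters.setdefault(find(i), []).append(points[i])
    return list(clusters.values())

# Example use with Euclidean distance in the plane:
pts = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
euclid = lambda p, q: ((p[0]-q[0])**2 + (p[1]-q[1])**2) ** 0.5
print(single_link_clustering(pts, 3, euclid))
</code>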

Analyzing the Algorithm

The components C1, C2, ..., Ck formed by deleting the k-1 most expensive edges of the MST T constitute a k-clustering of maximum spacing. The spacing of this clustering is the length d* of the cheapest deleted edge; any other k-clustering must split some component Ci, so it separates two points connected by a path whose edges all have length at most d*, and its spacing can be no larger.

Interesting section, short, and to the point. 9/10.
