Section 4.7 Clustering
Clustering problems revolve around classifying objects into groups based on their distance from one another. For some collections of items this distance is physical, but for many others it is a measure of similarity for which a distance function must be defined. The idea of a clustering is that items more like each other end up in the same cluster, while items less like each other end up in disparate clusters. The trouble is finding an efficient way to partition these objects into a specified number k of clusters.
It turns out that deleting the k-1 costliest edges from the minimum spanning tree produced by Kruskal's algorithm yields a k-clustering of maximum spacing, where spacing is defined as the minimum distance between any pair of points lying in different clusters. These k-1 edges are deleted because they are the most expensive edges in the tree, and removing k-1 edges from a spanning tree leaves exactly k connected components; keeping any of them would merge clusters together where k separate connected components are desired. The spacing of the resulting clustering is the weight of the cheapest deleted edge, since that is the shortest distance crossing between two different clusters.
Since this clustering is produced by an execution of Kruskal's algorithm that simply terminates before adding the last k-1 edges, it runs in the same O(m log n) time as ordinary Kruskal's algorithm.
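The stopped-Kruskal idea translates almost directly into code. The sketch below is my own illustration in Python, not the book's pseudocode: the function name max_spacing_clustering, the distance-matrix input format, and the simple union-find (path halving, no union by rank) are all my assumptions.

```python
def max_spacing_clustering(dist, k):
    """Partition n points into k clusters of maximum spacing.

    dist: n x n symmetric matrix of pairwise distances (illustrative input
          format; the book works with an abstract distance function).
    k: desired number of clusters.
    Returns (clusters, spacing), where spacing is the smallest distance
    between two points lying in different clusters.
    """
    n = len(dist)
    parent = list(range(n))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Sort all pairwise edges by distance, as Kruskal's algorithm does.
    edges = sorted((dist[i][j], i, j) for i in range(n) for j in range(i + 1, n))

    spacing = float("inf")  # if k == 1 there is no cross-cluster edge
    components = n
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri == rj:
            continue  # edge lies inside an existing cluster; skip it
        if components == k:
            # This is the first edge Kruskal would add after forming k
            # components, i.e. the cheapest "deleted" edge: the spacing.
            spacing = d
            break
        parent[ri] = rj  # merge the two components
        components -= 1

    # Group points by their union-find root to report the clusters.
    clusters = {}
    for p in range(n):
        clusters.setdefault(find(p), []).append(p)
    return list(clusters.values()), spacing


if __name__ == "__main__":
    # Four points on a line at positions 0, 1, 10, 11.
    pts = [0, 1, 10, 11]
    d = [[abs(a - b) for b in pts] for a in pts]
    print(max_spacing_clustering(d, 2))  # ([[0, 1], [2, 3]], 9)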
I found this chapter surprisingly intuitive and stimulating, which made it an easy read.
