4.7 Clustering

The Problem

Clustering arises when one tries to classify a collection of objects into coherent groups. The first question that comes to mind is whether there exists some measure under which each pair of objects is similar or dissimilar to one another. A common approach is to define a distance function on the objects, where objects at a larger distance from one another are less similar to each other. In many applications, however, the notion of distance takes on a rather abstract meaning.

Given a distance function on the objects, the clustering problem seeks to divide them into groups such that objects within the same group are “close”, and objects in different groups are “far apart”. So the main challenge is figuring out what a good set of groups might look like.

Clustering of Maximum Spacing

–> Let's consider a set U of n objects {p1, p2, …, pn}
–> For each pair pi and pj, there's a numerical distance d(pi, pj)
–> For this distance:
–> d(pi, pi) = 0
–> d(pi, pj) ≥ 0 for distinct i and j
–> Distances are symmetric: d(pi, pj) = d(pj, pi)
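To make the three properties concrete, here is a small sketch of my own (not from the book): Euclidean distance on points in the plane, with the points represented as (x, y) tuples, satisfies all of them.

<code python>
# A minimal illustrative sketch: Euclidean distance as a distance function d.
import math

def d(p, q):
    """Euclidean distance between two points given as (x, y) tuples."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

p1, p2 = (0.0, 0.0), (3.0, 4.0)
assert d(p1, p1) == 0            # d(pi, pi) = 0
assert d(p1, p2) >= 0            # d(pi, pj) >= 0 for distinct i and j (here 5.0)
assert d(p1, p2) == d(p2, p1)    # symmetry: d(pi, pj) = d(pj, pi)
</code>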

Now, let's suppose we're dividing the objects in U into k groups, for a given parameter k.

–> A k-clustering of U is a partition of U into k nonempty sets C1, C2, …, Ck
–> Spacing of a k-clustering: the minimum distance between any pair of points lying in different clusters.

–> So, our goal is to find the k-clustering with the maximum possible spacing.
–> Thus, the problem:

Since there are exponentially many different k-clusterings of a set U, how can we efficiently find the one that has maximum spacing?
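To pin down the notion of spacing before tackling that question, here is a small sketch (my own, with a hypothetical helper name) that computes the spacing of a given k-clustering by brute force over all cross-cluster pairs, using a distance function d like the one above.

<code python>
# A minimal sketch: the spacing of a clustering is the smallest distance
# between two points lying in different clusters.
def spacing(clusters, d):
    """clusters: list of lists of points; d: symmetric distance function."""
    best = float("inf")
    for i, ci in enumerate(clusters):
        for cj in clusters[i + 1:]:
            for p in ci:
                for q in cj:
                    best = min(best, d(p, q))
    return best
</code>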

Designing the Algorithm
–> Again, our goal is to find a clustering of maximum spacing
–> To achieve this, we grow a graph on the vertex set U, starting from n singleton clusters (one per point)
–> Our aim is to bring nearby points together into the same cluster as soon as possible

–> We start by drawing an edge between the closest pair of points
–> We then draw an edge between the next closest pair of points
–> We continue adding edges between pairs of points in order of increasing distance d(pi, pj)

–> Thus we're growing a graph H on U edge by edge, and the connected components of H correspond to the clusters.
–> Each time we add an edge that spans two distinct components, we merge the two corresponding clusters: this iterative merging of clusters is called single-link clustering, which basically means that we combine clusters as soon as a single link joins them together.
–> Single-link clustering is a special case of hierarchical agglomerative clustering
–> This algorithm is precisely an application of Kruskal's algorithm, which we saw in the previous section
–> Only difference: we stop once we obtain k connected components. So we're running Kruskal's algorithm until just before it adds its last k-1 edges.

–> So, it's basically taking the full Minimum Spanning Tree T produced by Kruskal's algorithm and deleting the k-1 most expensive edges.
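A minimal sketch of the whole procedure (my own Python, not the book's pseudocode), assuming hashable points and a symmetric distance function d like the one above: sort all pairs by increasing distance, merge components with union-find, and stop as soon as only k components remain.

<code python>
# Single-link k-clustering via Kruskal's algorithm, stopped at k components.
def k_clustering(points, d, k):
    parent = {p: p for p in points}          # union-find: each point starts as its own cluster

    def find(p):                             # representative of p's cluster (path halving)
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    # All pairs of points, considered in order of increasing distance.
    edges = sorted(
        ((d(p, q), p, q) for i, p in enumerate(points) for q in points[i + 1:]),
        key=lambda e: e[0],
    )

    components = len(points)
    for _, p, q in edges:
        if components == k:                  # stop just before the last k-1 MST edges
            break
        rp, rq = find(p), find(q)
        if rp != rq:                         # edge spans two clusters: merge them
            parent[rp] = rq
            components -= 1

    clusters = {}
    for p in points:
        clusters.setdefault(find(p), []).append(p)
    return list(clusters.values())
</code>

Stopping early and deleting the k-1 most expensive edges of the full MST are the same computation; the early stop just never builds the edges that would be deleted anyway.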

Analyzing the Algorithm

–> The components C1, C2, …, Ck formed by deleting the k-1 most expensive edges of the Minimum Spanning Tree T constitute a k-clustering of maximum spacing. Proof: Book, Page 160.
–> The idea behind the claim is that the spacing of any other clustering can be no larger than that of the clustering found by the single-link algorithm: our spacing equals d*, the length of the cheapest deleted MST edge, and any other k-clustering must split some cluster Ci, so some MST edge inside Ci (of length at most d*) connects points that now lie in different clusters.
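To connect this back to the claim, here is a quick sanity check of my own (not from the book), reusing d, spacing, and k_clustering from the sketches above: on a tiny random instance, brute force over all k-clusterings finds no spacing larger than the one our algorithm achieves.

<code python>
# Brute-force check of the maximum-spacing claim on a small instance.
from itertools import product
import random

def all_k_clusterings(points, k):
    """Yield every partition of points into exactly k nonempty groups (brute force)."""
    for labels in product(range(k), repeat=len(points)):
        if len(set(labels)) == k:
            groups = [[] for _ in range(k)]
            for p, lab in zip(points, labels):
                groups[lab].append(p)
            yield groups

random.seed(0)
pts = [(random.random(), random.random()) for _ in range(6)]
k = 3
best = max(spacing(c, d) for c in all_k_clusterings(pts, k))
ours = spacing(k_clustering(pts, d, k), d)
assert abs(best - ours) < 1e-12   # the single-link clustering attains the maximum spacing
</code>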

This section was interesting in that it presented a very important idea in a way as concise as it can get. I give it an 8/10.
