

4.7 Clustering

The Problem

Clustering arises when one tries to classify a collection of objects into coherent groups. The first question that comes to mind is whether there exists some measure under which each pair of objects is similar or dissimilar to one another. A common approach is to define a distance function on the objects, where objects at a larger distance from one another are less similar to each other. In many applications, however, the notion of distance takes on a very abstract meaning.

Given a distance function on the objects, the clustering problem seeks to divide them into groups such that objects within the same group are “close”, and objects in different groups are “far apart”. So the main challenge is figuring out what a good set of groups might look like.

Clustering of Maximum Spacing

–> Let's consider a set U of n objects {p1, p2, …, pn}
–> For each pair pi and pj, there's a numerical distance d(pi,pj)
–> For this distance:
–> d(pi,pi) = 0
–> d(pi,pj) ≥ 0 for distinct i and j
–> Distances are symmetric: d(pi,pj) = d(pj,pi)
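As a concrete (hypothetical) example of such a distance function, the ordinary Euclidean distance on points in the plane satisfies all three properties listed above:

```python
import math

# Hypothetical example: Euclidean distance on 2-D points.
def d(p, q):
    """Euclidean distance between points p and q."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

p1, p2 = (0.0, 0.0), (3.0, 4.0)
assert d(p1, p1) == 0          # d(pi, pi) = 0
assert d(p1, p2) >= 0          # d(pi, pj) >= 0 for distinct points
assert d(p1, p2) == d(p2, p1)  # distances are symmetric
```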

Now, let's suppose we're dividing the objects in U into k groups, for a given parameter k.

–> A k-clustering of U is a partition of U into k nonempty sets C1, C2, …, Ck
–> Spacing of a k-clustering: the minimum distance between any pair of points lying in different clusters.

–> So, our goal is to find the k-clustering with the maximum possible spacing.
–> Thus, the problem:

Since there are exponentially many different k-clusterings of a set U, how can we efficiently find the one that has maximum spacing?

Designing the algorithm
–> Again, our goal is to find a clustering of maximum spacing
–> To achieve this, we grow a graph on the vertex set U; the connected components of this graph correspond to the clusters
–> Our aim is to bring nearby points together into the same cluster as quickly as possible

–> We start by drawing an edge between the closest pair of points
–> We then draw an edge between the next closest pair of points
–> We continue adding edges between pairs of points in order of increasing distance d(pi,pj), skipping any pair that already lies in the same component (such an edge would not change the clustering)

–> Thus we're growing a graph H on U edge by edge, whose connected components correspond to clusters.
–> Each time we add an edge that spans two distinct components, we merge the two corresponding clusters. This iterative merging of clusters is called single-link clustering: we combine clusters as soon as a single link joins them together.
–> Single-link clustering is a special case of hierarchical agglomerative clustering
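The procedure above is Kruskal's minimum-spanning-tree algorithm stopped early: add edges in order of increasing distance, merging components, and stop once k components remain. A minimal sketch using a union-find structure (the function names and the toy data are illustrative assumptions, not from the text):

```python
def single_link_clusters(points, d, k):
    """Single-link clustering: merge components by adding edges in
    order of increasing distance until k clusters remain."""
    parent = list(range(len(points)))

    def find(x):  # union-find root lookup with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # All pairs, sorted by increasing distance d(pi, pj).
    edges = sorted(
        (d(points[i], points[j]), i, j)
        for i in range(len(points))
        for j in range(i + 1, len(points))
    )

    clusters = len(points)  # initially every point is its own cluster
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj and clusters > k:
            parent[ri] = rj  # edge spans two components: merge them
            clusters -= 1

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), set()).add(points[i])
    return list(groups.values())

# Toy 1-D example: three natural groups.
pts = [0, 1, 10, 11, 20]
print(single_link_clusters(pts, lambda p, q: abs(p - q), 3))
```

On this toy input the algorithm first joins 0–1 and 10–11 (the two closest pairs), then stops with the three clusters {0, 1}, {10, 11}, {20}.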

courses/cs211/winter2012/journals/jeanpaul/chapterfour_sectionvii.1331004483.txt.gz · Last modified: 2012/03/06 03:28 by mugabej