

The Front Matter

Summary

This section is an introduction to Greedy Algorithms. Greedy algorithms build a solution by making step-by-step, local decisions to optimize an underlying criterion. There are two types of proofs: Greedy Stays Ahead and the Exchange Argument. Greedy Stays Ahead measures the progress of the algorithm and proves that at each step it does at least as well as any other algorithm, including a simple brute-force search. The exchange argument is when you consider an optimal solution and gradually transform it into the solution yielded by the greedy algorithm without jeopardizing its optimality at any point.

Notes
  • “Greed… is good. Greed is right. Greed works.”
  • An algorithm is greedy if it builds up a solution in small steps, choosing a decision at each step myopically to optimize some underlying criterion.
  • When a greedy algorithm succeeds in solving a nontrivial problem optimally, it typically implies something interesting and useful about the structure of the problem itself; there is a local decision rule that one can use to construct optimal solutions.
  • Types of proofs
    • Greedy Stays Ahead: if you measure the greedy algorithm’s progress in a step-by-step fashion, one sees that it does better than any other algorithm at each step, and thus, produces an optimal solution.
    • Exchange Argument: consider any possible solution to the problem and gradually transform it into the solution found by the greedy algorithm without hurting its quality.
Additional Information

There wasn't anything that I was unclear about or had to reread to understand after class or after my first run through of this section.

In terms of readability, this section was a 10 because it was just an introduction section. It was basically definitions, and they were super easy to understand because they were clearly explained. Also, this section was a page, so it was a super quick read.

4.1 Interval Scheduling

Summary

In this section, there are two problems to solve: Interval Scheduling and Interval Partitioning. In Interval Scheduling, the greedy aspect of the problem is to select intervals based on finish times (the smallest one first); this allows us to maximize the time left to complete other tasks. We prove the algorithm's optimality based on this greedy criterion by using the greedy stays ahead proof. In Interval Partitioning, the greedy aspect is that we select intervals based on the earliest start time. We prove this greedy algorithm's optimality by using a structural argument: the number of resources for the intervals cannot be less than the depth of the intervals.

The Problem
  • A subset of requests is compatible if no two of them overlap in time; our goal is to accept as large a compatible subset as possible
  • Compatible sets of maximum size are optimal
Designing a Greedy Algorithm
  • The basic idea in a greedy algorithm for interval scheduling is to use a simple rule to select a first request i1, then reject all requests not compatible with i1, then select request i2, then reject all requests incompatible with i2, and so on until we run out of requests
  • The challenge in designing a good greedy algorithm is deciding which simple rule to use for the selection
  • There are many natural rules for this problem that do not give good solutions
    • Pick based on earliest start time: not optimal
    • Pick based on smallest interval of time: not optimal
    • For each request, count the number of requests that are not compatible and accept the request that has the fewest number of incompatible requests (pick based on fewest conflicts): surprisingly, not optimal
  • The optimal solution: accept the first request that finishes first, a request for which the finish time is the smallest
    • Allows us to maximize the time left to satisfy other requests

Algorithm

   Initially let R be the set of all requests, and let A be empty
   While R is not yet empty
     Choose a request i in R that has the smallest finishing time
     Add request i to A
     Delete all requests from R that are not compatible with request i
   Endwhile
   Return the set A as the set of accepted requests
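The pseudocode above translates almost directly to Python. Here is a minimal sketch (function and variable names are my own); requests are assumed to be (start, finish) pairs, and intervals that share only an endpoint count as compatible:

```python
def interval_scheduling(requests):
    """Earliest-finish-time rule: sort by finish time and accept a
    request whenever it starts no earlier than the last accepted finish."""
    accepted = []
    last_finish = float("-inf")
    for start, finish in sorted(requests, key=lambda r: r[1]):
        if start >= last_finish:          # compatible with everything accepted so far
            accepted.append((start, finish))
            last_finish = finish
    return accepted
```

The sort dominates the cost, which matches the O(nlogn) runtime discussed in the analysis.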
Analyzing the Algorithm
  • A is a compatible set of requests
  • We need to show that the solution yielded by the algorithm is optimal
    • We will show that the size of A (the set of compatible requests yielded by the algorithm) equals the size of O (an optimal solution); since A has the same number of intervals as O, it is also optimal
    • We will use a greedy stays ahead proof
      • We want to show that each of A’s intervals finishes at least as soon as the corresponding interval in set O
      • For each r>=1, the rth accepted request in the algorithm’s scheduling finishes no later than the rth request in the optimal schedule
  • For all indices r <= k we have that the finish time of the rth i (ir) is less than or equal to the finish time of the rth j (jr)
    • Now let r>1. Assume statement is true for r-1, and we will prove it for r. In order for the greedy algorithm to not be optimal, it would have to fall behind. However, it will not because the greedy always has the option at worst of choosing jr.
  • The greedy algorithm returns an optimal set A
    • If A is not optimal, an optimal set O must have more requests: if A has k requests and O has m, then m>k. Since the finish time of ik is no later than that of jk, there must be a request j(k+1) in O that starts after jk ends, and thus after ik ends; therefore j(k+1) would be compatible with ik and the set A. But the greedy algorithm stops with request ik, and it is only supposed to stop when R is empty – a contradiction.
  • Runtime: O(nlogn), most costly portion of the algorithm is sorting the requests in order of finishing time
  • Extensions: scheduler needs to make decisions about accepting and rejecting before knowing about the full set of requests, requests may time out if the scheduler waits too long to schedule; each request has a different value, want to maximize the value
Scheduling All Intervals
  • Goal is to partition all intervals across multiple resources, Interval Partitioning Problem
    • We have many identical resources available, and we want to schedule all requests using as few resources as possible
  • Define the depth of a set of intervals as the maximum number that pass over any single point on the timeline
  • In any instance of Interval Partitioning, the number of resources needed is at least the depth of the set of intervals
    • A set of intervals has depth d and let requests 1 through d pass over a common point on the timeline. Each of these intervals must be scheduled on a different resource, need at least d resources
  • Can we design an algorithm that schedules all intervals using the minimum possible number of resources? Is there always a schedule using a number of resources that is equal to the depth? Yes and yes!!!!!
  • Design a greedy algorithm that schedules all intervals using a number of resources equal to depth, no solution can have number of resources that is smaller than the depth
  • Structural argument: there is a structural bound asserting that every possible solution must have at least a certain value

Algorithm

   Sort the intervals by their start times, breaking ties arbitrarily
   Let I1, I2, …., In denote the intervals in this order
   For j = 1, 2, 3, …, n:
     For each interval Ii that precedes Ij in sorted order and overlaps it:
                Exclude the label of Ii from consideration for Ij
     Endfor
     If there is any label from {1, 2,…, d} that has not been excluded then:
                Assign a nonexcluded label to Ij
     Else
                Leave Ij unlabeled
     Endif
   Endfor
  • If we use the greedy algorithm above, every interval will be assigned a label, and no two overlapping intervals will receive the same label.
  • The greedy algorithm above schedules every interval on a resource, using a number of resources equal to the depth of the set of intervals. This is the optimal number of resources needed.
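A compact way to sketch this in Python is a heap-based variant of the same idea: instead of explicitly excluding labels, reuse the resource that frees up earliest whenever it is free by the new interval's start time (the function and variable names here are my own):

```python
import heapq

def interval_partitioning(intervals):
    """Label each interval with a resource number; the number of
    resources used equals the depth of the set of intervals."""
    labels = {}
    busy = []                 # heap of (finish_time, label), one entry per resource
    resources = 0
    for start, finish in sorted(intervals):    # sort by start time
        if busy and busy[0][0] <= start:       # earliest-finishing resource is free
            _, label = heapq.heappop(busy)
        else:                                  # all resources overlap: open a new one
            label = resources
            resources += 1
        labels[(start, finish)] = label
        heapq.heappush(busy, (finish, label))
    return labels, resources
```

A new resource is opened only when every existing one overlaps the current interval, i.e. exactly when the depth at that point exceeds the resources so far.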
Additional Information

In class, I understood what the structural argument was in the specific Interval Partitioning problem: the optimal solution cannot be less than the depth of the intervals. However, I was kind of confused by what it was in a broader sense. Luckily, after reading this section, it became much clearer. You just have to prove that there is a structural bound asserting that every possible solution must have at least a certain value. Then, you have to show that the greedy algorithm yields a result that has that minimum value.

In terms of readability, this section is a solid 9. Because I read this section after hearing the lecture in class, it was super easy to comprehend, and the proofs were written very clearly. The only proof that was a little difficult to understand in terms of the way that the book explained it was the greedy stays ahead proof. Luckily, it wasn't too difficult to get because we learned about it in class.

4.2 Scheduling to Minimize Lateness

Summary

We want to be able to schedule all of the tasks, without overlapping, so as to minimize the maximum lateness. Scheduling jobs in order of increasing length or by smallest slack time does not always yield optimal solutions. The greedy criterion for this problem is to sort the jobs in increasing order of deadline and then schedule them in that order, with no idle time. To prove optimality for this greedy algorithm, we will use an exchange argument in which we modify an optimal solution, ensuring it remains optimal at each step, until we transform the old optimal solution into the solution that our greedy algorithm would produce. In this case, we would modify an optimal solution with inversions until it has no inversions to prove the greedy algorithm's optimality.

The Problem
  • The request i has a deadline di, requires a contiguous time interval of length ti, but it is willing to be scheduled at any time before the deadline. Each accepted request must be assigned an interval of time of length ti, and different requests must be assigned to non-overlapping intervals.
  • Request i is late if it misses the deadline
  • Our goal is to schedule all requests using non-overlapping intervals, so as to minimize the maximum lateness
Designing the Algorithm
  • Schedule jobs in order of increasing length: not optimal
  • Scheduling jobs by smallest slack time: not optimal
  • The optimal solution: sort the jobs in increasing order of their deadlines di, and schedule them in order (Earliest Deadline First Rule).
  • s is start time, f is finish time, d is deadline, t is time length of the interval

Algorithm

   Order the jobs in order of their deadline
   Assume that d1<=…<=dn (sorted deadlines)
   Initially, f = s
   Consider jobs i=1,…, n in this order:
     Assign job i to the time interval from s(i) = f to f(i) = f + ti
     Let f = f + ti
   End
   Return the set of scheduled intervals [s(i), f(i)] for i = 1,…, n
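The Earliest Deadline First rule above can be sketched in Python (names are my own); jobs are assumed to be (ti, di) pairs, and lateness of a job is f(i) - di when positive:

```python
def minimize_lateness(jobs):
    """Earliest Deadline First: jobs are (length, deadline) pairs.
    Returns the scheduled intervals and the maximum lateness."""
    schedule = []
    f = 0                                          # next free time; no idle time
    max_lateness = 0
    for t, d in sorted(jobs, key=lambda j: j[1]):  # increasing deadline order
        s, f = f, f + t                            # job runs in [s, f)
        schedule.append((s, f))
        max_lateness = max(max_lateness, f - d)    # lateness is 0 if f <= d
    return schedule, max_lateness
```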
Analyzing the Algorithm
  • The algorithm above produces no gaps, no idle time.
  • There is an optimal solution with no idle time.
  • Exchange argument: gradually modify O, preserving its optimality at each step, but eventually transforming it into a schedule that is identical to schedule A found by the greedy algorithm.
  • All schedules with no inversions and no idle time have the same maximum lateness.
    • They can only differ in the order in which jobs with identical deadlines are scheduled, which cannot affect the maximum lateness.
  • There is an optimal schedule that has no inversions and no idle time.
    • If O has an inversion, then there is a pair of jobs i and j such that j is scheduled immediately after i and has dj<di.
    • After swapping i and j we get a schedule with one less inversion.
    • The new swapped schedule has a maximum lateness no larger than that of O.
      • The finishing time of j before the swap is exactly equal to the finishing time of i after the swap. Thus, all jobs other than i and j finish at the same time in the two schedules. Job j finishes earlier in the new schedule, so the swap does not increase the lateness of job j. Job i now finishes at j’s previous finish time in the pre-swap schedule O. If job i is late in the new schedule, its lateness equals the pre-swap finish time of j minus the deadline of i, which is less than the pre-swap finish time of j minus the deadline of j (since dj<di), which is exactly the pre-swap lateness of j. Therefore, the swap does not increase the maximum lateness of the schedule.
    • The schedule A produced by the greedy algorithm has optimal maximum lateness L.
Additional Information

When first reading through the section, it was hard for me to understand the algorithm and its proof because of the notation that the book invented for it. The book abbreviated full words to letters, so when I first read the section I must not have registered where they defined the terms. However, once I went back and reread the section, the notation became more clear. For instance, s is start time, f is finish time, d is deadline, and t is the time length of the interval. I can't type some of the symbols because they have lines over them. However, the letters with the lines over them were post-swap, and the ones without were pre-swap.

In terms of readability, on a scale from 1 to 10, this section was an 8. The section was explained very well overall. Unfortunately, the proof of optimality of the algorithm for this section had some confusing notation in it. However, once I understood the notation, understanding the proof got exponentially easier.

4.4 Shortest Paths in a Graph

Summary

Our goal is to find the shortest paths in a graph. To solve this, we will use Dijkstra's algorithm. Dijkstra's algorithm is greedy because we always form the shortest new s-v path we can make from a path through the explored set followed by a single edge. However, all of the edges must have nonnegative lengths; the algorithm breaks down if any edge is negative. We will prove the algorithm's correctness using the greedy stays ahead proof. The runtime of the algorithm is O(mlogn) if you use a priority queue to implement it.

The Problem
  • Given nodes u and v, what is the shortest u-v path? Given a starting node s, what is the shortest path from s to each other node?
  • This is for a directed graph. Assume s has a path to every other node in the graph. Each edge has a nonnegative length. The length of a path is the sum of the lengths of all the edges in the path.
Designing the Algorithm
  • For each node v in set V-S, we determine the shortest path that can be constructed by traveling along a path through the explored part S to some u in set S, followed by a single edge (u, v).

Dijkstra's Algorithm

   Dijkstra’s Algorithm (G, l)
        Let S be the set of explored nodes
        For each u in S, store a distance d(u)
        Initially, S = {s} and d(s) = 0
        While S != V:
             Select a node v not in set S with at least one edge from S for
             which d’(v) = min(e=(u, v): u in S) d(u) + le is as small as possible
             Add v to S and define d(v) = d’(v)
        Endwhile
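A heap-based Python sketch of the algorithm (the adjacency-list format and names are my own; it folds the d'(v) = min(e=(u, v): u in S) d(u) + le rule into lazy priority-queue updates rather than explicit ChangeKey operations):

```python
import heapq

def dijkstra(graph, s):
    """graph: {node: [(neighbor, edge_length), ...]} with nonnegative
    lengths. Returns the shortest distance d(u) from s to each reachable u."""
    d = {s: 0}
    pq = [(0, s)]                          # heap of (tentative distance, node)
    while pq:
        dist, u = heapq.heappop(pq)
        if dist > d.get(u, float("inf")):
            continue                       # stale entry; u was already finalized
        for v, le in graph.get(u, []):
            if dist + le < d.get(v, float("inf")):
                d[v] = dist + le           # the d'(v) = d(u) + le improvement
                heapq.heappush(pq, (d[v], v))
    return d
```

Each edge pushes at most one heap entry, giving the O(mlogn) bound noted in the analysis.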
Analyzing the Algorithm
  • Dijkstra’s Algorithm is greedy in the sense that we always form the shortest new s-v path we can make from a path S followed by a single edge.
  • We will prove its correctness by using greedy stays ahead.
  • Consider the set S at any point in the algorithm’s execution. For each u in set S (discovered nodes), the path Pu is a shortest s-u path.
    • Let P be any other s-v path; it must leave the set S somewhere, say via an edge (x, y). P cannot be shorter than Pv because it is already at least as long as Pv by the time it leaves the set S. In iteration k+1, Dijkstra’s Algorithm must have considered adding node y to the set S via the edge (x, y) and rejected this option in favor of adding v. This means that there is no path from s to y through x that is shorter than Pv. The subpath of P up to y is such a path, and so this subpath is at least as long as Pv. Since the edge lengths are nonnegative, the full path P is at least as long as Pv as well.
  • The algorithm does not always find the shortest paths if some of the edges have negative lengths.
  • Dijkstra’s Algorithm is a continuous version of the standard breadth-first search algorithm for traversing a graph.
  • There are n-1 iterations of the while loop.
  • For a graph with m edges, computing all these minima can take O(m) time, so this would lead to an implementation that runs in O(mn) time.
  • However, we can do much better if we use a priority queue.
    • We will need: ChangeKey and ExtractMin operations
    • ExtractMin: selecting a node v to be added to set S
    • ChangeKey: decrease the key of node w appropriately
    • ChangeKey can occur at most once per edge
  • Using a priority queue, Dijkstra’s Algorithm can be implemented on a graph with n nodes and m edges to run in O(m) time, plus the time for n ExtractMin and m ChangeKey operations.
  • The heap-based priority queue implementation results in an overall time for implementation in this case is O(mlogn)
Additional Information

The notation was pretty hard for me to understand on the first run through of the section, specifically d’(v) = min(e=(u, v): u in S) d(u) + le. After reading the section again, the notation became more clear. min(e=(u, v): u in S) d(u) + le means: over all edges e = (u, v) with u in the set of discovered nodes S, take the minimum of d(u) plus the length of the edge. d'(v) is the new tentative distance to v, and le is the length of the edge e. Therefore, now I better understand the algorithm overall.

In terms of readability, on a scale of 1 to 10, this section was a 7. It wasn't necessarily that this section was difficult to understand. It was just that some of the notation was confusing, especially when the algorithm is written out. However, once I got the notation down, the section was not too difficult. I thought that the proofs were explained fairly clearly.

4.5 The Minimum Spanning Tree Problem

Summary
The Problem
  • Set of locations V = {v1, v2, …, vn}, we want to build a communication network on top of them
  • The network should be connected; we want to build it as cheaply as possible
  • For certain pairs (vi, vj), we can build a link between them for a certain cost
  • Let T be a minimum-cost solution to the network design problem defined above. Then (V, T) is a tree.
    • Starting from any optimal solution, we could keep deleting edges on cycles until we had a tree; with nonnegative edges, the cost would not increase during this process
  • This is called the Minimum Spanning Tree Problem.
Designing the Algorithm
  • Many of the first greedy algorithms one tries turn out to be correct and produce optimal solutions.
  • Three greedy algorithms that produce optimal solutions
    • Kruskal’s Algorithm: Starts without any edges, builds spanning tree by successively inserting edges from E in order of increasing cost, insert each edge e as long as it does not create a cycle when added to the edges that are already inserted, if the insertion results in a cycle we discard e and continue with the algorithm
    • Prim’s Algorithm: analogy with Dijkstra’s Algorithm for paths, start with a root node s and try to greedily grow a tree outward, at each step add the node that can be attached as cheaply as possible to the partial tree we already have (in each iteration add the node v that minimizes the attachment cost)
    • Reverse-Delete Algorithm: backwards version of Kruskal’s Algorithm, start with the full graph and begin deleting edges in order of decreasing cost, we delete an edge as long as doing so would not disconnect the graph
Analyzing the Algorithms
  • Useful to have in hand some basic facts saying when it is safe to include or eliminate an edge in the minimum spanning tree
  • Cut Property
    • Assume that all edge costs are distinct. Let S be any subset of nodes that is neither empty nor equal to all of V, and let edge e = (v, w) be the minimum-cost edge with one end in S and the other in V-S. Then every minimum spanning tree contains the edge e.
    • Exchange argument proof
  • Kruskal’s Algorithm produces a minimum spanning tree of G.
    • Consider any edge e = (v, w) added by Kruskal’s Algorithm, and let S be the set of nodes to which v has a path at that moment. Adding e does not create a cycle, so w is not in S. Also, no edge from S to V-S has been encountered yet, since any such edge could have been added without creating a cycle and so would already have been added by the algorithm. Therefore, e is the cheapest edge with one end in S and one in V-S, and by the cut property it belongs to every minimum spanning tree.
  • Prim’s Algorithm produces a minimum spanning tree G.
    • The nature of the algorithm is to add the node that minimizes the quantity min(e=(u, v): u in S) ce. The added edge e is the cheapest edge with one end in S and the other end in V-S, so by the cut property it is in every minimum spanning tree.
  • Cycle Property
    • Assume all edge costs are distinct. Let C be any cycle in G, and let edge e = (v, w) be the most expensive edge belonging to C. Then e does not belong to any minimum spanning tree of G.
    • Exchange argument, swapping e for a cheaper edge in such a way we still have a spanning tree.
  • The Reverse-Delete Algorithm produces a minimum spanning tree.
    • When the algorithm removes an edge e, e lies on a cycle C. Since edges are removed in decreasing order of cost, e is the most expensive edge on C. Therefore, by the cycle property, e does not belong to any minimum spanning tree. The algorithm never removes an edge whose deletion would disconnect the graph, and the most expensive edge on every cycle eventually gets deleted by the nature of the algorithm.
  • Any algorithm that builds a minimum spanning tree by repeatedly including edges when justified by the cut property and deleting edges when justified by the cycle property – in any order at all – will end up with a minimum spanning tree.
  • Thus far, we have assumed that all edge costs are distinct. If they are not, we can perturb the costs by tiny amounts to break ties; if we run any of our minimum spanning tree algorithms using the perturbed costs for comparing edges, we will produce a minimum spanning tree T that is also optimal for the original instance.
Implementing Prim’s Algorithm
  • Prim’s and Kruskal’s Algorithms can be implemented in O(mlogn) time.
  • Using a priority queue, Prim’s Algorithm can be implemented on a graph with n nodes and m edges to run in O(m) time, plus the time for n ExtractMin and m ChangeKey operations.
  • If we use a heap-based priority queue we can implement both ExtractMin and ChangeKey in O(logn) time, and so get an overall running time of O(mlogn).
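A Python sketch of heap-based Prim's Algorithm (again with lazy deletions instead of explicit ChangeKey operations; the adjacency-list format and names are my own assumptions):

```python
import heapq

def prim_mst(graph, root):
    """Grow the tree from root, always attaching the node that can be
    attached as cheaply as possible. graph: {u: [(v, cost), ...]},
    assumed undirected (each edge listed in both adjacency lists)."""
    in_tree = {root}
    pq = [(c, root, v) for v, c in graph[root]]   # (cost, tree node, outside node)
    heapq.heapify(pq)
    mst = []
    while pq and len(in_tree) < len(graph):
        c, u, v = heapq.heappop(pq)
        if v in in_tree:
            continue                       # both ends in S: would create a cycle
        in_tree.add(v)
        mst.append((u, v, c))              # cheapest edge crossing S to V-S
        for w, cw in graph[v]:
            if w not in in_tree:
                heapq.heappush(pq, (cw, v, w))
    return mst
```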
Extensions
  • We could be concerned about point-to-point distances in the spanning tree we build, and be willing to reduce these even if we pay more for the set of edges.
  • We may care more about the congestion on the edges.
  • We could make resilience an explicit goal.
Additional Information

The cycle and cut properties were hard for me to comprehend on the first day that we talked about them in class and the first time I read about them in the section. However, the cut property merely tells you when you can include an edge in a minimum spanning tree. It says that the minimum-cost edge with one end in the explored set and the other in the unexplored set must be included in the minimum spanning tree (with neither the explored set nor the unexplored set being empty). The cycle property tells you when you can delete edges to create the minimum spanning tree. It says that the most expensive edge in a cycle can be deleted and does not belong in any minimum spanning tree. Prim's and Kruskal's Algorithms use the cut property to prove correctness, and the Reverse-Delete Algorithm uses the cycle property to prove correctness.

On a scale of 1 to 10, the readability of this section was a 7. The description of the algorithms were very straightforward. However, the proofs of the algorithms using the cycle and cut properties were a little hard to comprehend at first. However, after going to class and reading over the proofs again, it was easier to understand.

4.6 Implementing Kruskal's Algorithm: Union-Find

Summary

• When an edge is added to the graph, we don’t want to have to recompute the connected components from scratch. We will develop a data structure that we call the Union-Find structure, which will store a representation of the components in a way that supports rapid searching and updating.

The Problem
  • The Union-Find data structure allows us to maintain disjoint sets
  • Find(u): return the name of the set containing u.
  • Union(A, B): take two sets A and B and merge them into a single set
  • If we add an edge (u, v) to the graph, we will first test whether u and v are already in the same connected component. If they are not, Union(Find(u), Find(v)) can be used to merge the two components into one.
  • The Union-Find data structure supports
    • MakeUnionFind(S): for a set S will return a Union-Find data structure on set S where all elements are in separate sets; run in O(n) time
    • Find(u): run in O(logn)
    • Union(A, B): run in O(logn)
A Simple Data Structure Union-Find
  • Array Component that contains the name of the set currently containing each element, size n, where Component[s] is the name of the set containing s
  • To implement MakeUnionFind(S), setup the array Component and initialize Component[s] = s for all s in set S
  • Maintain the list of elements in each set
  • Maintain an additional array Size of length n where Size[A] is the size of set A, and when a Union(A, B) operation is performed, use the name of the larger set for the union
  • Consider an array implementation of the Union-Find data structure for some set S of size n, where unions keep the name of the larger set. The Find operation takes O(1) time, MakeUnionFind(S) takes O(n) time, and any sequence of k Union operations takes at most O(klogk) time.
A Better Data Structure for Union-Find
  • Each node v in set S will be contained in a record with an associated pointer to the name of the set that contains v.
  • To indicate that we took the union of two sets named u and v, and that the name of the union set is v, we update u’s pointer to point to v. We do not update the pointers at the other nodes of the merged set B.
  • Pointer-based data structure implements Union in O(1) time, but Find operation is no longer constant.
  • Find can be as large as O(n)
  • To reduce the time required for a Find operation, we will use the same optimization we used before: keep the name of the larger set as the name of the union.
  • Consider the above pointer-based implementation of the Union-Find data structure for some set S of size n, where unions keep the name of the larger set. A Union operation takes O(1) time, MakeUnionFind(S) takes O(n) time, and a Find operation takes O(logn) time.
Further Improvements
  • For all the applications of the Union-Find data structure that we consider, the O(logn) time per operation is good enough, in the sense that further improvement in the time for the operations would not translate to improvements in the overall running time of the algorithms where we use them.
  • In the improved implementation, we will compress the path we follow after every Find operation by resetting all pointers along the path to point to the current name of the set. This will allow subsequent Find operations to run more quickly.
  • The improved implementation supports Union operations in O(1), MakeUnionFind(S) in O(n), and Find operations take at most O(logn) time.
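The pointer-based structure with both optimizations (union by size and path compression) can be sketched in Python; the class and method names below are my own:

```python
class UnionFind:
    """Pointer-based Union-Find with union by size and path compression,
    a minimal version of the structure described in this section."""
    def __init__(self, elements):              # MakeUnionFind(S): O(n)
        self.parent = {x: x for x in elements}
        self.size = {x: 1 for x in elements}

    def find(self, u):                         # Find(u) with path compression
        root = u
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[u] != root:          # repoint the whole path at the root
            self.parent[u], u = root, self.parent[u]
        return root

    def union(self, a, b):                     # Union: keep the larger set's name
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra                   # smaller set's root points to larger
        self.size[ra] += self.size[rb]
```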
Implementing Kruskal’s Algorithm
  • Sorting the edges in the graph by cost takes O(mlogm) time, which can be simplified to O(mlogn) because m <= n^2.
  • Use the Union-Find data structure to maintain the connected components of (V, T) as edges are added. As each edge e = (v, w) is considered, we compute Find(v) and Find(w) and test whether they are equal to see if v and w belong to different components.
  • At most there are 2m Find operations and n-1 Union operations.
  • This results in a total O(mlogn) runtime.
  • Kruskal’s Algorithm can be implemented on a graph with n nodes and m edges to run in O(mlogn) time.
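Putting the pieces together, a self-contained Python sketch of Kruskal's Algorithm with an inlined Union-Find (using path halving for Find; the names are my own):

```python
def kruskal_mst(nodes, edges):
    """Sort edges by cost and add an edge iff its endpoints lie in
    different components. edges: [(cost, u, v), ...]."""
    parent = {v: v for v in nodes}             # MakeUnionFind(S)

    def find(u):                               # Find with path halving
        while parent[u] != u:
            parent[u] = parent[parent[u]]      # shortcut every other pointer
            u = parent[u]
        return u

    mst = []
    for cost, u, v in sorted(edges):           # O(mlogm) = O(mlogn) sort
        ru, rv = find(u), find(v)
        if ru != rv:                           # different components: no cycle
            parent[ru] = rv                    # Union(Find(u), Find(v))
            mst.append((cost, u, v))
    return mst
```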
Additional Information

The pointer implementation of the Union-Find data structure was a little confusing to me at first. We didn't really go over how the Union-Find structure was implemented in class, so the first time I read the pointer implementation, I did not fully comprehend what exactly that implementation did. Luckily, after reading the implementation again, it became clearer. In the pointer implementation, to represent the union of the set containing u and the set containing v, keeping v's name for the union, we merely update u's pointer to point to v. This makes the Union operation O(1), and the Find operation, when we compress paths by updating pointers to point to the set's name with each Find, takes at most O(logn) time.

On a scale of 1 to 10, the readability of this section was a solid 7. The different Union-Find implementations were a little hard to understand (and they are definitely hard things to explain). Nonetheless, I thought the book did a fair job doing just that. It just took a couple of rereadings to fully comprehend the Union-Find data structure using pointers.

4.7 Clustering

Summary

• Minimum spanning trees play a role in the area of clustering.

The Problem
  • A collection of objects that one is trying to classify or organize into coherent groups.
  • Define a distance function on the objects, with the interpretation that objects at a larger distance from one another are less similar to each other.
  • Given a distance function on the objects, the clustering problem seeks to divide them into groups so that objects within the same group are “close” and objects in different groups are “far apart”.
  • N objects labeled p1, p2, …, pn
  • For each pair pi and pj, we have distance d(pi, pj)
    • d(pi, pi) = 0, d(pi, pj) > 0 for distinct pi and pj, and d(pi, pj) = d(pj, pi)
  • we want to divide the objects in U into k groups for a given parameter k
    • a k-clustering of U is a partition of U into k nonempty sets C1, C2,…Ck
    • spacing of a k-clustering to be the minimum distance between any pair of points lying in different clusters
  • we are looking for the k-clustering with the maximum possible spacing
Designing the Algorithm
  • consider growing a graph on the vertex set U
  • connected components will be clusters
  • drawing an edge between the closest pair of points, draw an edge between the next closest pair of points, and so on; continue adding edges between pairs in order of increasing distance d(pi, pj)
  • if we are about to add edge (pi, pj) and find pi and pj are already in the same cluster, we will not add the edge (we never create a cycle, so the components always form a union of trees)
  • the iterative merging of clusters is termed single-linked clustering, a special case of hierarchical agglomerative clustering
  • our procedure above is Kruskal’s Minimum Spanning Tree Algorithm; the only difference is we stop the procedure once we obtain k connected components
    • run Kruskal’s Algorithm but stop it just before it adds its last k-1 edges
  • The components C1, C2, …, Ck formed by deleting the k-1 most expensive edges of the minimum spanning tree T constitute a k-clustering of maximum spacing.
    • The spacing of the k-clustering is precisely the cost d* of the (k-1)st most expensive edge of the minimum spanning tree, the cheapest edge Kruskal's algorithm would have added next. Consider any other k-clustering C'. Some pair of points p and p' lies in the same cluster of ours but in different clusters of C'. Since our cluster was built by Kruskal's algorithm from edges of length at most d*, the path connecting p and p' uses only edges of length <= d*, and some edge on that path crosses between two clusters of C'. Hence the spacing of C' is at most d*.
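The stop-Kruskal-early procedure can be sketched in Python (a minimal version; edges are assumed to be (distance, p, q) triples, and the names are my own):

```python
def max_spacing_clustering(nodes, edges, k):
    """Single-link clustering: run Kruskal's merging but stop once
    exactly k connected components remain. Returns the clusters."""
    parent = {p: p for p in nodes}

    def find(u):                               # Find with path halving
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    components = len(nodes)
    for d, p, q in sorted(edges):              # merge closest pairs first
        if components == k:
            break                              # stop before the last k-1 merges
        rp, rq = find(p), find(q)
        if rp != rq:                           # skip edges inside a cluster
            parent[rp] = rq
            components -= 1
    clusters = {}
    for p in nodes:                            # group points by component name
        clusters.setdefault(find(p), set()).add(p)
    return list(clusters.values())
```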
Additional Information

The thing that confused me in this section was the proof of the algorithm for correctness. The notation in the proof was super confusing to me.

On a scale of 1 to 10, the readability of this section was a 9. The book did a really great job explaining what a clustering was, and the description of the algorithm was super clear. The only thing that confused me a little was the proof of the algorithm we used. But, still, I feel like I at least got the gist of the proof.
