

Introduction: Greedy Algorithms

An algorithm is said to be greedy if it builds up a solution in small steps, choosing a decision at each step myopically to optimize some underlying criterion. When a greedy algorithm succeeds in solving a nontrivial problem optimally, it typically implies something interesting and useful about the structure of the problem itself. There are two basic methods for proving that a greedy algorithm produces an optimal solution to a problem: greedy stays ahead and the exchange argument. Applications of greedy algorithms include shortest paths in a graph, the Minimum Spanning Tree Problem, the construction of Huffman codes for performing data compression, and the Minimum-Cost Arborescence Problem, a more complex application.

Section 4.1: Interval Scheduling: The Greedy Algorithm Stays Ahead

Set Up: Consider a set of requests {1, 2, …, n}; the ith request corresponds to an interval of time starting at s(i) and finishing at f(i). A subset of the requests is compatible if no two of them overlap in time, and the goal is to accept as large a compatible subset as possible. The compatible sets of maximum size will be called optimal.

Designing a Greedy Algorithm

The basic idea in a greedy algorithm for interval scheduling is to use a simple rule to select a first request i<sub>1</sub>. Once a request i<sub>1</sub> is accepted, we reject all requests that are not compatible with i<sub>1</sub>. We then select the next request i<sub>2</sub> to be accepted, and again reject all requests that are not compatible with i<sub>2</sub>. We continue in this fashion until we run out of requests.

Here are several candidate rules for selecting requests, leading up to the one that is optimal:

1. Select the available request that starts earliest –> Not an optimal solution

2. Accept the request that requires the smallest interval of time, or the request for which f(i) - s(i) is as small as possible –> Not an optimal solution, but somewhat better than 1.

3. For each request, count the number of other requests that are not compatible with it, and accept the request that has the fewest incompatible requests –> Seems promising, but this rule also fails to be optimal in general

4. Accept first the request that finishes first, or the request i for which f(i) is as small as possible –> This leads to an optimal solution. The general algorithm is shown below:

Initially let R be the set of all requests, and let A be empty
While R is not yet empty
    Choose a request i ∈ R that has the smallest finishing time
    Add request i to A
    Delete all requests from R that are not compatible with request i
EndWhile
Return the set A as the set of accepted requests

Analyzing the Algorithm

For the above algorithm we can prove optimality by proving |A| = |O|, where A is the compatible set of requests returned by the algorithm and O is an optimal set of intervals. This shows that A contains the same number of intervals as O and hence is also an optimal solution.

Theorem 4.3 The greedy algorithm returns an optimal set A.

Implementation and Running Time

We can make our algorithm run in time O(n log n) as follows:

  • Begin by sorting the n requests in order of finishing time and labeling them in this order; that is, we will assume that f(i) ≤ f(j) when i < j. This takes time O(n log n).
  • In an additional O(n) time, we construct an array S[1… n] with the property that S[i] contains the value s(i). We now select requests by processing the intervals in order of increasing f(i). We always select the first interval; we then iterate through the intervals in order until reaching the first interval j for which s(j) ≥ f(1); we then select this one as well.
  • So we implement the greedy algorithm analyzed above in one pass through the intervals, spending constant time per interval. Thus this part of the algorithm takes time O(n).
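The one-pass implementation described above can be sketched in Python as follows; this is a minimal sketch, assuming requests are given as (start, finish) pairs (the function name and input format are my own, not from the text):

```python
def interval_schedule(requests):
    """Greedy earliest-finish-time selection.

    requests: list of (start, finish) pairs.
    Returns a maximum-size compatible subset.
    """
    # Sort by finishing time: O(n log n).
    ordered = sorted(requests, key=lambda r: r[1])
    accepted = []
    last_finish = float("-inf")
    # One pass, constant time per interval: O(n).
    for s, f in ordered:
        if s >= last_finish:  # compatible with everything accepted so far
            accepted.append((s, f))
            last_finish = f
    return accepted
```

For example, `interval_schedule([(1, 4), (3, 5), (0, 6), (5, 7), (8, 11), (12, 16)])` accepts the four mutually compatible intervals (1, 4), (5, 7), (8, 11), and (12, 16).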

Extensions

The Interval Scheduling Problem we considered here is a quite simple scheduling problem. There are further complications that could arise in practical settings such as:

  • We assumed that all requests were known to the scheduling algorithm when it was choosing the compatible schedule, but online algorithms must make decisions as time proceeds, without knowledge of future input.
  • Our goal was to maximize the number of satisfied requests, but in some settings each request has a different value or weight to us (satisfying one request may matter more than another).

The Problem A related problem arises if we have many identical resources available and we wish to schedule all the requests using as few resources as possible. This is called the Interval Partitioning Problem because the goal here is to partition all intervals across multiple resources. We will now design a simple greedy algorithm that schedules all intervals using a number of resources equal to the depth of the set of intervals: the maximum number of intervals that pass over any single point on the timeline. Since any schedule needs at least as many resources as the depth, this immediately implies the optimality of the algorithm.
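The depth mentioned above can be computed with a simple sweep over interval endpoints; this is a minimal sketch, assuming half-open [start, finish) intervals so that an interval ending at time t does not conflict with one starting at t (the function name is my own):

```python
def depth(intervals):
    """Maximum number of intervals passing over any single point.

    intervals: list of (start, finish) pairs, treated as half-open
    [start, finish) so an interval ending at t does not clash with
    one starting at t.
    """
    events = []
    for s, f in intervals:
        events.append((s, 1))   # interval opens
        events.append((f, -1))  # interval closes
    # Sort by time; at equal times, process closes (-1) before opens (+1).
    events.sort(key=lambda e: (e[0], e[1]))
    best = cur = 0
    for _, delta in events:
        cur += delta
        best = max(best, cur)
    return best
```

For instance, three intervals that all cover the point 2.5 give a depth of 3, so any schedule for them needs three resources.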

Designing the Algorithm Let d be the depth of the set of intervals. We show how to assign a label to each interval, where the labels come from the set of numbers {1, 2, ..., d}, and the assignment has the property that overlapping intervals are labeled with different numbers. This gives the desired solution, since we can interpret each number as the name of a resource, and the label of each interval as the name of the resource to which it is assigned. The algorithm we use for this (shown below) is a simple one-pass greedy strategy that orders intervals by their starting time.

Sort the intervals by their start times, breaking ties arbitrarily
Let I<sub>1</sub>, I<sub>2</sub>, ..., I<sub>n</sub> denote the intervals in this order
For j = 1, 2, 3, ..., n
    For each interval I<sub>i</sub> that precedes I<sub>j</sub> in sorted order and overlaps it
        Exclude the label of I<sub>i</sub> from consideration for I<sub>j</sub>
    Endfor
    If there is any label from {1, 2, ..., d} that has not been excluded then
        Assign a nonexcluded label to I<sub>j</sub>
    Else
        Leave I<sub>j</sub> unlabeled
    Endif
Endfor
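An equivalent one-pass version of the labeling strategy above can be sketched in Python using a min-heap of resource finish times: reuse the resource that frees up earliest, and open a new label only when every open resource is still busy. This is a sketch under my own naming and input conventions, not the book's code:

```python
import heapq

def partition_intervals(intervals):
    """Assign each interval a resource label 1..d.

    intervals: list of (start, finish) pairs.
    Returns a list of labels, one per interval in input order.
    """
    order = sorted(range(len(intervals)), key=lambda i: intervals[i][0])
    free_at = []      # min-heap of (finish_time, label) for open resources
    next_label = 0
    labels = [None] * len(intervals)
    for i in order:
        s, f = intervals[i]
        if free_at and free_at[0][0] <= s:
            _, lab = heapq.heappop(free_at)  # reuse the earliest-free resource
        else:
            next_label += 1                  # all resources busy: open a new one
            lab = next_label
        labels[i] = lab
        heapq.heappush(free_at, (f, lab))
    return labels
```

The largest label the sketch hands out equals the depth of the instance, matching Theorem 4.6.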

Analyzing the Algorithm If we use the greedy algorithm above, every interval will be assigned a label, and no two overlapping intervals will receive the same label. If you have d labels at your disposal, then as you sweep through the intervals from left to right, assigning an available label to each interval you encounter, you can never reach a point where all the labels are currently in use. Since our algorithm is using d labels, we conclude that it is, in fact, always using the minimum possible number of labels.

Theorem 4.6 The greedy algorithm above schedules every interval on a resource, using a number of resources equal to the depth of the set of intervals. This is the optimal number of resources needed.

I found that there was too much information in this section and it would have been easier to read if the authors did less explaining and more listing of steps. The excess information made the section hard to read, and there doesn't need to be so much explaining for the introduction to a new subject. I would give this section a 5/10 for readability. I think maybe another simple example or two of where a greedy algorithm would be the best solution could have been helpful.

Section 4.2: Scheduling to Minimize Lateness: An Exchange Argument

The Problem

Consider again a situation in which we have a single resource and a set of n requests to use the resource for an interval of time. Assume that the resource is available starting at time s. In contrast to the previous problem, each request is now more flexible. Instead of a start time and finish time, the request i has a deadline di, and it requires a contiguous time interval of length ti, but it is willing to be scheduled at any time before the deadline. Suppose that we plan to satisfy each request, but we are allowed to let certain requests run late. Thus, beginning at our overall start time s, we will assign each request i an interval of time of length ti; let us denote this interval by [s(i), f(i)], with f(i) = s(i) + ti. Unlike the previous problem, then, the algorithm must actually determine a start time (and hence a finish time) for each interval.

Designing the Algorithm

There are several natural greedy approaches in which we look at the data (ti, di) about the jobs and use this to order them according to some simple rule:

1. Schedule the jobs in order of increasing length ti, so as to get the short jobs out of the way quickly –> Ignores the deadlines of the jobs, so it does not work

2. A more natural greedy algorithm would be to sort jobs in order of increasing slack di - ti –> This greedy rule fails as well

3. Sort the jobs in increasing order of their deadlines di, and schedule them in this order. This rule is called Earliest Deadline First and there is an intuitive basis to it: we should make sure that jobs with earlier deadlines get completed earlier. –> This algorithm always produces optimal solutions. It is shown below.

Order the jobs in order of their deadlines
Assume for simplicity of notation that d<sub>1</sub> ≤ ... ≤ d<sub>n</sub>
Initially, f = s
Consider the jobs i = 1, ..., n in this order
    Assign job i to the time interval from s(i) = f to f(i) = f + t<sub>i</sub>
    Let f = f + t<sub>i</sub>
End
Return the set of scheduled intervals [s(i), f(i)] for i = 1, ..., n
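The Earliest Deadline First schedule above can be sketched in Python; this is a minimal sketch, assuming jobs are given as (length, deadline) pairs and lateness is max(0, f(i) - d<sub>i</sub>) (function name and input format are my own):

```python
def edf_schedule(jobs, start):
    """Earliest Deadline First scheduling.

    jobs: list of (t_i, d_i) pairs (length, deadline).
    start: overall start time s of the resource.
    Returns the (s_i, f_i) intervals in input order and the
    maximum lateness of the schedule.
    """
    order = sorted(range(len(jobs)), key=lambda i: jobs[i][1])  # by deadline
    f = start
    intervals = [None] * len(jobs)
    max_lateness = 0
    for i in order:
        t, d = jobs[i]
        intervals[i] = (f, f + t)  # run job i next, with no idle time
        max_lateness = max(max_lateness, f + t - d)
        f += t
    return intervals, max_lateness
```

Note that the jobs are packed back to back starting at s, so the schedule has no idle time, consistent with Theorem 4.7.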

Theorem 4.7 There is an optimal schedule with no idle time.

Observe that the schedule the algorithm produces has no “gaps”–times when the machine is not working yet there are jobs left. The plan here is to gradually modify an optimal schedule O, preserving its optimality at each step, but eventually transform it into a schedule that is identical to the schedule A found by the greedy algorithm. We refer to this type of analysis as an exchange argument. We say a schedule has an inversion if a job i with deadline d<sub>i</sub> is scheduled before another job j with an earlier deadline d<sub>j</sub> < d<sub>i</sub>. The main step in showing the optimality of our algorithm is to establish that there is an optimal schedule that has no inversions and no idle time.

Theorem 4.8 All schedules with no inversions and no idle time have the same maximum lateness.

Theorem 4.9 There is an optimal schedule that has no inversions and no idle time.

Theorem 4.10 The schedule A produced by the greedy algorithm has optimal maximum lateness L.

Extensions A more difficult version of this problem would contain requests i that, in addition to the deadline di and the requested time ti, would also have an earliest possible starting time ri, called the release time. Problems with release times arise naturally in scheduling problems where requests can take the form: Can I reserve the room for a two-hour lecture, sometime between 1 P.M. and 5 P.M.?

It is still somewhat unclear to me what difference, advantages, and disadvantages this method has compared to the greedy stays ahead method described in section 4.1. I think this section was difficult to digest, at a 5/10, so I will probably have to review it alongside section 4.1 again, as well as my notes. I think it would have been helpful for me to read these sections before the lectures in this particular case because I was also pretty lost during the lectures and I think I could have gotten some further clarification after reading the sections.

Section 4.3: Optimal Caching: A More Complex Exchange Argument

Section 4.4: Shortest Paths in a Graph

The Problem

Given a directed graph G = (V, E) with a designated start node s determine the shortest path from s to every other node in the graph. Assume that s has a path to every other node in G. Each edge e has a length le ≥ 0, indicating the time (or distance, or cost) it takes to traverse e. For a path P, the length of P–denoted l(P)–is the sum of the lengths of all edges in P.

Designing the Algorithm

The outline for Dijkstra's Algorithm is shown below:

Dijkstra’s Algorithm (G, l)
Let S be the set of explored nodes
    For each u ∈ S, we store a distance d(u)
Initially S = {s} and d(s) = 0
While S ≠ V
    Select a node v ∉ S with at least one edge from S for which
        d’(v) = min<sub>e = (u,v):u∈S</sub> d(u) + l<sub>e</sub> is as small as possible
    Add v to S and define d(v) = d’(v)
EndWhile
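The algorithm above, combined with the priority-queue implementation discussed later in the section, can be sketched in Python using the standard-library heap; this is a minimal sketch, assuming the graph is given as an adjacency dict of (neighbor, edge_length) lists with nonnegative lengths (the representation is my own choice):

```python
import heapq

def dijkstra(graph, s):
    """Shortest-path distances d(u) from s.

    graph: dict mapping node -> list of (neighbor, edge_length).
    Uses a heap as the priority queue, so each pop/push costs
    O(log n), giving O(m log n) overall.
    """
    d = {s: 0}
    pq = [(0, s)]  # entries are (tentative distance d'(v), v)
    while pq:
        dist, u = heapq.heappop(pq)
        if dist > d.get(u, float("inf")):
            continue  # stale heap entry; u was already finalized
        for v, le in graph.get(u, []):
            nd = dist + le
            if nd < d.get(v, float("inf")):
                d[v] = nd
                heapq.heappush(pq, (nd, v))
    return d
```

Here, instead of a ChangeKey operation, the sketch pushes a fresh heap entry and skips stale ones on extraction, a common substitute when the priority queue does not support decrease-key directly.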

Analyzing the Algorithm

Is it always true that when Dijkstra’s Algorithm adds a node v, we get the true shortest-path distance to v? We now answer this by proving the correctness of the algorithm, showing that the paths P<sub>u</sub> really are shortest paths. Dijkstra’s Algorithm is greedy in the sense that we always form the shortest new s-v path we can make from a path in S followed by a single edge.

Theorem 4.14 Consider the set S at any point in the algorithm’s execution. For each u ∈ S, the path Pu is a shortest s-u path.

Note that this fact immediately establishes the correctness of Dijkstra’s Algorithm, since we can apply it when the algorithm terminates, at which point S includes all nodes.

Dijkstra's Algorithm does not always find shortest paths if some of the edges can have negative lengths. Many shortest-path applications involve negative edge lengths, and a more complex algorithm is required for this case. Another observation is that Dijkstra’s Algorithm is, in a sense, even simpler than it has been described here. Dijkstra’s Algorithm is really just a “continuous” version of the standard breadth-first search algorithm for traversing a graph.

Implementation and Running Time There are n - 1 iterations of the while loop for a graph with n nodes, as each iteration adds a new node v to S. Selecting the correct node v efficiently is a more subtle issue. For a graph with m edges, computing the minima can take O(m) time, so this would lead to an implementation that runs in O(mn) time, but we can do considerably better if we use the right data structures. First, we will explicitly maintain the values of the minima d’(v) = min<sub>e = (u,v):u∈S</sub> d(u) + l<sub>e</sub> for each node v ∈ V - S, rather than recomputing them in each iteration. We can further improve the efficiency by keeping the nodes V - S in a priority queue with d’(u) as their key. Recall that a priority queue can efficiently insert elements, delete elements, change an element’s key, and extract the element with the minimum key.

Theorem 4.15 Using a priority queue, Dijkstra’s Algorithm can be implemented on a graph with n nodes and m edges to run in O(m) time, plus the time for n ExtractMin and m ChangeKey operations.

Using the heap-based priority queue implementation (discussed in Chapter 2), each priority queue operation can be made to run in O(log n) time. Thus the overall time for the implementation is O(m log n).

The biggest question I have after reading this section is how is Dijkstra's Algorithm better than a BFS, and why would someone choose this algorithm as opposed to a BFS? I had a hard time digesting this section too so I would give it a 6/10 for readability.

Section 4.5: The Minimum Spanning Tree Problem

Minimum Spanning Tree (MST)

  • Spanning Tree: spans all nodes in a graph
  • Given a connected graph G = (V, E) with positive edge weights ce, an MST is a subset of the edges T ⊆ E such that T is a spanning tree whose sum of edge weights is minimized

MST Applications

  • Network design
    • Telephone, electrical, hydraulic, TV cable, computer, road
  • Approximation algorithms for NP-hard problems
    • Traveling salesperson problem, Steiner tree
  • Indirect applications
    • Max bottleneck paths
    • Reducing data storage in sequencing amino acids in a protein
  • Cluster analysis

Cayley’s Theorem: There are n<sup>n-2</sup> spanning trees of the complete graph on n labeled nodes

Possible algorithms to find MST:

  • Kruskal’s Algorithm: Start with no edges and build a spanning tree by successively inserting edges from E in order of increasing cost. Insert each edge e as long as it does not create a cycle when added to the edges already inserted. If inserting e would result in a cycle, discard e and continue.
  • Prim’s Algorithm: Start with a root node s and try to greedily grow a tree from s outward. At each step, we simply add the node that can be attached as cheaply as possible to the partial tree we already have.
  • Reverse Delete Algorithm: Start with the full graph (V, E) and begin deleting edges in order of decreasing cost. As we get to each edge e (starting from the most expensive), we delete it as long as doing so would not actually disconnect the graph we currently have.
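Of the three, Kruskal’s Algorithm is the most direct to sketch; the cycle test is the interesting part, and a union-find structure (the subject of Section 4.6) handles it. This is a minimal sketch, assuming nodes are labeled 0..n-1 and edges are (cost, u, v) triples (my own conventions):

```python
def kruskal(n, edges):
    """Kruskal's algorithm with a simple union-find.

    n: number of nodes, labeled 0..n-1.
    edges: list of (cost, u, v) triples.
    Returns the list of edges chosen for the MST.
    """
    parent = list(range(n))

    def find(x):
        # Find the component representative, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []
    for cost, u, v in sorted(edges):  # consider edges in increasing cost
        ru, rv = find(u), find(v)
        if ru != rv:          # u and v are in different components,
            parent[ru] = rv   # so adding (u, v) creates no cycle
            mst.append((cost, u, v))
    return mst
```

Sorting the edges dominates, so with a near-constant-time union-find the running time is O(m log n).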

Properties of the MST

  • Assume all edge costs c<sub>e</sub> are distinct.
  • Cut property: Let S be any subset of nodes, and let e be the minimum-cost edge with exactly one endpoint in S. Then the MST contains e.
    • Cutset: A cut is a subset of nodes S. The corresponding cutset D is the subset of edges with exactly one endpoint in S.
  • Cycle property: Let C be a cycle, and let f be the maximum-cost edge belonging to C. Then the MST does not contain f.
    • Cycle: A set of edges of the form a-b, b-c, c-d, …, y-z, z-a
  • Cycle-Cut Intersection: A cycle and a cutset intersect in an even number of edges

Section 4.6: Implementing Kruskal's Algorithm: The Union-Find Data Structure

Section 4.7: Clustering

Section 4.8: Huffman Codes and Data Compression

Section 4.9: Minimum-Cost Arborescences: A Multi-Phase Greedy Algorithm

courses/cs211/winter2018/journals/boyese/chapter4.1520293620.txt.gz · Last modified: by boyese