Table of Contents

Chapter 4

Section 4.1 (Interval Scheduling)

This section introduces the Interval Scheduling problem and presents a greedy algorithm that solves it.

The Interval Scheduling problem deals with scheduling job requests. Each job has a start time and an end time, and we want to schedule as many jobs as possible; to do so, we must pick only jobs that do not overlap.

The solution to this problem is deceptively simple: we start with the request that has the earliest end time. Then, we delete all requests that conflict with this accepted request. Then we choose the job that has the next earliest end time. We repeat this process until there are no more requests to either delete or accept, and the accepted jobs form the optimal set. It is important to note that this algorithm works because all jobs have the same value, so our only concern is accepting the highest number of jobs.
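
To make this concrete, here is a minimal Python sketch of the earliest-end-time rule, assuming each job is represented as a (start, finish) pair; the function name and representation are illustrative, not the book's pseudocode:

    def interval_schedule(jobs):
        # Greedy rule: repeatedly accept the request with the earliest finish time.
        accepted = []
        last_finish = float("-inf")
        # Consider requests in order of increasing finish time.
        for start, finish in sorted(jobs, key=lambda job: job[1]):
            if start >= last_finish:  # compatible with everything accepted so far
                accepted.append((start, finish))
                last_finish = finish  # conflicting requests are skipped implicitly
        return accepted

    # Three of these four requests are compatible:
    print(interval_schedule([(0, 3), (2, 5), (3, 7), (7, 9)]))  # [(0, 3), (3, 7), (7, 9)]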

This section then goes on to prove that this algorithm is optimal. First, it proves by induction that the greedy algorithm "stays ahead": for every r, the r-th job it accepts finishes no later than the r-th job in any optimal schedule, i.e. f(i_r) ≤ f(j_r). Then, it proves by contradiction that the set the greedy algorithm returns is optimal. With these two points proven, we can now say that the solution returned is correct and optimal. The algorithm runs in O(n log n) time, dominated by sorting the jobs by finish time.

This section also describes the Interval Partitioning problem, which is similar to the Interval Scheduling problem. The greedy approach is much the same as described above, except that when two jobs overlap, the conflicting jobs are assigned to different "resources" (think of scheduling lectures in two different classrooms). The optimal number of "resources" needed equals the depth of the instance: the maximum number of jobs that overlap at any one point in time.
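
One way to sketch this in Python is with a min-heap holding the earliest finish time on each resource: process jobs in order of start time, and reuse a resource whenever one has freed up. This realization is an assumption for illustration, not the book's verbatim algorithm:

    import heapq

    def interval_partition(jobs):
        # jobs are (start, finish) pairs; returns the number of resources used.
        finish_heap = []  # earliest finish time currently on each resource
        resources = 0
        for start, finish in sorted(jobs):  # process in order of start time
            if finish_heap and finish_heap[0] <= start:
                heapq.heapreplace(finish_heap, finish)  # reuse a freed resource
            else:
                heapq.heappush(finish_heap, finish)  # open a new resource
                resources += 1
        return resources

    # Three lectures overlap at time 2, so three classrooms are needed:
    print(interval_partition([(0, 3), (1, 4), (2, 5), (3, 6)]))  # 3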

Section 4.2 (Minimize Lateness)

For this problem, we have a single resource at our disposal (as in the Interval Scheduling problem). Instead of fixed start and end times, each job has a deadline and requires a certain amount of uninterrupted time to complete. Each job must therefore be scheduled in a non-overlapping interval. We try to minimize the maximum lateness: the largest amount of time by which any job finishes past its deadline.

To solve this problem with a greedy algorithm, we simply sort the jobs in order of their deadlines (Earliest Deadline First). The intuition is to make sure jobs with earlier deadlines get completed sooner than jobs with later deadlines.
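
A minimal sketch of this Earliest Deadline First rule, assuming each job is given as a (duration, deadline) pair:

    def schedule_min_lateness(jobs):
        # Sort by deadline, run the jobs back to back (no idle time),
        # and track the maximum lateness seen.
        t, max_lateness, schedule = 0, 0, []
        for duration, deadline in sorted(jobs, key=lambda job: job[1]):
            start, finish = t, t + duration
            schedule.append((start, finish, deadline))
            max_lateness = max(max_lateness, finish - deadline)
            t = finish
        return max_lateness, schedule

    # EDF yields maximum lateness 1 here; the reverse order would yield 3.
    print(schedule_min_lateness([(3, 2), (2, 4)]))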

This section proves that the greedy algorithm is optimal by proving four essential facts about the problem. First, we prove that there is an optimal schedule with no idle time. Next, we prove that all schedules with no inversions (a job with an earlier deadline being scheduled AFTER a job with a later one) and no idle time have the same maximum lateness; this follows directly because such schedules can differ only in the order of jobs that share a deadline, and that order does not affect the maximum lateness. Then, we prove that there is an OPTIMAL schedule with no inversions and no idle time. This is shown by an exchange argument: swapping an adjacent inverted pair of jobs never increases the maximum lateness, so inversions can be removed one by one. To conclude the section, we prove that the schedule produced by our greedy algorithm has the optimal maximum lateness. This proof builds off the proofs above: the greedy schedule has no inversions and no idle time, there is an optimal schedule with those same properties, and all such schedules have the same maximum lateness, so the schedule our algorithm returns MUST be optimal.

Section 4.4 (Shortest Path)

This section uses a greedy algorithm to solve the Shortest Path problem. We are given a starting node s and a graph of nodes and edges. Each edge of this graph carries a value designating the cost of traversing that edge. Our goal is to find the least costly path from s to another given node t.

Edsger Dijkstra proposed a greedy solution to the Shortest Path problem in 1959. The algorithm finds the shortest path from s to every other node in the graph, not just t; from there it is easy to read off the overall least costly route from s to t.

The specific algorithm is as follows: first, we look at the lengths of all edges leaving s. We maintain a priority queue of nodes, ordered from least costly known path to most costly. At each step, we remove the highest-priority node from the queue; its shortest distance is now final, and we explore all edges leaving it, keeping track of the cost to reach each neighboring node. Whenever we find a less costly path to a node, we update that node's cost. After all shortest paths have been found, we can easily return the shortest path from s to any end node t.

By using a heap-based priority queue, this algorithm runs in O(m log n) time, where n is the number of nodes and m is the number of edges.
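
The sketch below uses Python's heapq with lazy deletion (stale queue entries are skipped on extraction) instead of an explicit changeKey operation; this is a common practical variant of the same O(m log n) idea, not the book's exact pseudocode:

    import heapq

    def dijkstra(graph, s):
        # graph: {node: [(neighbor, length), ...]} with nonnegative lengths.
        # Returns the shortest distance from s to every reachable node.
        dist = {s: 0}
        queue = [(0, s)]  # priority queue keyed by tentative distance
        while queue:
            d, u = heapq.heappop(queue)
            if d > dist.get(u, float("inf")):
                continue  # stale entry: a less costly path to u was already found
            for v, length in graph[u]:
                nd = d + length
                if nd < dist.get(v, float("inf")):
                    dist[v] = nd  # found a less costly path to v; update its cost
                    heapq.heappush(queue, (nd, v))
        return dist

    g = {"s": [("a", 1), ("b", 4)], "a": [("b", 2), ("t", 6)],
         "b": [("t", 3)], "t": []}
    print(dijkstra(g, "s"))  # shortest s-to-t cost is 1 + 2 + 3 = 6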

Section 4.5 (Minimum Spanning Tree)

This section deals with minimum spanning trees and the various algorithms that can produce one.

The first algorithm is called Kruskal's Algorithm. This algorithm starts with all of the nodes and no edges. Then, the algorithm adds the least costly edges first, unless adding an edge would create a cycle. In that case, the edge is not added and the next least costly edge is considered. The graph returned is a minimum spanning tree.
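
A small Python sketch of Kruskal's Algorithm; it uses a bare-bones union-find to detect cycles (Section 4.6 covers that structure properly), and the edge representation is an assumption for illustration:

    def kruskal(n, edges):
        # edges: list of (cost, u, v) tuples over nodes 0..n-1.
        parent = list(range(n))

        def find(x):  # walk up to the representative of x's component
            while parent[x] != x:
                x = parent[x]
            return x

        tree = []
        for cost, u, v in sorted(edges):  # least costly edges first
            ru, rv = find(u), find(v)
            if ru != rv:  # different components, so adding (u, v) creates no cycle
                parent[ru] = rv
                tree.append((cost, u, v))
        return tree

    edges = [(1, 0, 1), (2, 1, 2), (3, 0, 2), (4, 2, 3)]
    print(kruskal(4, edges))  # edge (0, 2) would close a cycle and is skipped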

The second algorithm is called Prim's Algorithm. This algorithm is similar to Dijkstra's algorithm, but slightly less complex. We begin with a single starting node, then take the least costly edge coming off of that node, adding this edge (and the node it reaches) to the growing tree. We then repeat this process, each time taking the least costly edge connecting any explored node to an unexplored node. We repeat until there are no more unexplored nodes, and the returned set of edges will be a minimum spanning tree.
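
A heap-based sketch of Prim's Algorithm, assuming an undirected graph given as adjacency lists (the names are illustrative):

    import heapq

    def prim(graph, s):
        # graph: {node: [(neighbor, cost), ...]}. Grows a tree outward from s,
        # always taking the cheapest edge from an explored node to an unexplored one.
        explored = {s}
        queue = [(cost, s, v) for v, cost in graph[s]]
        heapq.heapify(queue)
        tree = []
        while queue:
            cost, u, v = heapq.heappop(queue)
            if v in explored:
                continue  # both endpoints are already in the tree
            explored.add(v)
            tree.append((u, v, cost))
            for w, c in graph[v]:
                if w not in explored:
                    heapq.heappush(queue, (c, v, w))
        return tree

    g = {"a": [("b", 1), ("c", 4)], "b": [("a", 1), ("c", 2)],
         "c": [("a", 4), ("b", 2)]}
    print(prim(g, "a"))  # picks (a, b) and (b, c): total cost 3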

The final algorithm, Reverse-Delete, resembles Kruskal's Algorithm done in reverse. For this algorithm, we are given the complete input graph and delete the most expensive edge. We repeat this process, making sure each time that the deletion does not disconnect the graph. The remaining graph will be a minimum spanning tree.
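
A deliberately simple sketch of this Reverse-Delete idea, using a depth-first search to confirm that each deletion leaves the graph connected (the repeated connectivity check is written for clarity, not speed):

    def connected(n, edge_set):
        # Depth-first search: are all n nodes reachable from node 0?
        adjacency = {i: [] for i in range(n)}
        for _, u, v in edge_set:
            adjacency[u].append(v)
            adjacency[v].append(u)
        seen, stack = {0}, [0]
        while stack:
            for w in adjacency[stack.pop()]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        return len(seen) == n

    def reverse_delete(n, edges):
        kept = set(edges)
        for e in sorted(edges, reverse=True):  # most expensive edge first
            if connected(n, kept - {e}):
                kept.remove(e)  # safe to delete: the graph stays connected
        return kept

    edges = [(1, 0, 1), (2, 1, 2), (3, 0, 2), (4, 2, 3)]
    print(reverse_delete(4, edges))  # the same tree Kruskal's Algorithm finds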

The section then proves correctness and optimality of the algorithms, which can be found on pages 144-149.

The section concludes with the implementation and running time of Kruskal's and Prim's Algorithms. Using a heap-based priority queue to hold the nodes, we can get a running time of O(m log n). The extractMin and changeKey functions work in O(log n) time, and each is performed a maximum of m times, yielding O(m log n).

Section 4.6 (Union-Find Structure)

This section describes the Union-Find structure and its uses. It is well suited to implementing algorithms similar to Kruskal's Algorithm.

The Union-Find structure does exactly what its name implies: union and find operations. It does this by keeping track of disjoint sets (components of a graph, in the case of Kruskal's). In addition to these operations, Union-Find also provides makeUnionFind, which takes a set of nodes as input and returns a Union-Find data structure in which each node is contained within its own set.

An array-based implementation can be used to run the find operation in O(1) time and union in O(n) time. This is efficient enough to use in Kruskal's Algorithm, but it is not the best possible implementation. The link-based implementation runs find in O(log n), which is worse than the array-based implementation, but runs union in O(1). We are able to gain such an improved union runtime because we only have to create a link from one set to the other, as opposed to merging the two sets linearly.
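
A sketch of the array-based variant described above, in which find simply reads a label in O(1) and union relabels the smaller set (the class and method names are assumptions for illustration):

    class UnionFind:
        def __init__(self, nodes):  # makeUnionFind: each node starts in its own set
            self.label = {v: v for v in nodes}
            self.members = {v: [v] for v in nodes}

        def find(self, v):  # O(1): just read the stored set label
            return self.label[v]

        def union(self, a, b):  # O(n) worst case: relabel the smaller set
            ra, rb = self.find(a), self.find(b)
            if ra == rb:
                return
            if len(self.members[ra]) < len(self.members[rb]):
                ra, rb = rb, ra  # make ra the larger set
            for v in self.members[rb]:
                self.label[v] = ra
            self.members[ra].extend(self.members.pop(rb))

    uf = UnionFind(["a", "b", "c", "d"])
    uf.union("a", "b")
    uf.union("c", "d")
    print(uf.find("b") == uf.find("a"))  # True: "a" and "b" now share a set
    print(uf.find("a") == uf.find("c"))  # False: the two sets are still disjoint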

Section 4.7 (Clustering)

Section 4.8 (Huffman Codes)

Huffman codes compress data. Computers operate by producing and reading bits, 0's and 1's, and each sequence of bits is assigned some piece of information that the computer can understand. The problem is therefore centered on picking the most efficient way of assigning sequences to information, so that the most used information gets the shortest sequence. Huffman codes do exactly that.

Huffman codes use trees to organize and locate data. All data are stored in the leaf nodes of the tree, with the internal nodes being empty. As the computer reads through the bits, it traverses the tree until it lands on a leaf node; then it returns that datum and continues reading with the next bit. A traversal to the left or right child node is assigned a 0 or 1 respectively, and because data sit only at the leaves, no encoding is a prefix of another, which guarantees that no two data can be confused during decoding.

What makes this algorithm work efficiently is that the tree need not be balanced: the most common data are placed near the top of the tree, so they are reached first and get the shortest encodings. The book uses commonly used letters as an example. It would not make sense for the letter 'e' to take up the same number of bits as a 'z' or 'q', since 'e' is much more commonly used. The tree can therefore be somewhat skinny in places, but any given input will most likely not traverse its deepest paths.

If we use a heap-based priority queue to implement Huffman's algorithm, we can achieve O(n log n) time, with n representing the number of letters in the given alphabet. Heaps are trees that allow us to make insertions and deletions in O(log n) time. These operations are ideal for Huffman's algorithm, and so priority queues are the ideal data structure to use.
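
A compact sketch of Huffman's algorithm with Python's heapq: repeatedly merge the two least frequent subtrees, then read each codeword off the finished tree (0 for a left branch, 1 for a right branch). The frequencies and names here are made up for illustration:

    import heapq
    from itertools import count

    def huffman(frequencies):
        # frequencies: {letter: frequency}. Returns {letter: bitstring}.
        tiebreak = count()  # keeps the heap from ever comparing two trees
        heap = [(f, next(tiebreak), letter) for letter, f in frequencies.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)   # the two least frequent
            f2, _, right = heapq.heappop(heap)  # subtrees are merged...
            heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))

        codewords = {}
        def assign(node, prefix):  # internal nodes are tuples, leaves are letters
            if isinstance(node, tuple):
                assign(node[0], prefix + "0")
                assign(node[1], prefix + "1")
            else:
                codewords[node] = prefix
        assign(heap[0][2], "")
        return codewords

    # 'e' is frequent, so it gets a shorter codeword than 'z' or 'q':
    print(huffman({"e": 12, "t": 9, "z": 1, "q": 1}))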