Chapter 4: Greedy Algorithms
A greedy algorithm is accurate to its name. It builds up a solution greedily, making the choice at each step that looks best at that point. While greedy algorithms can be designed for many problems, the ones we want to concern ourselves with are the ones whose greedy solutions are optimal. Thus, most of this chapter is concerned with constructing a greedy algorithm and then proving it is optimal, using either the greedy-stays-ahead argument or the exchange argument.
4.1 Interval Scheduling: The Greedy Algorithm Stays Ahead
The Interval Scheduling Problem was introduced in chapter 1.2 and consists of trying to schedule the largest possible subset of requests (each with a start and end time) without overlap. The easiest concrete example is scheduling meetings in a single meeting room. The key to designing an optimal greedy algorithm for this problem is determining which rule to use to choose the requests. Several options, such as choosing by earliest start time or shortest interval, produce valid solutions, but they are not optimal. Instead, we order the requests by finish time. From there, we simply pick the first one, then go down the list choosing the first one that does not overlap, and so on. The algorithm is on page 118.
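Here is a small Python sketch of how I would write the earliest-finish-time rule; it is not the book's exact pseudocode, and it assumes requests are given as (start, finish) pairs and that a request may begin exactly when the previous one ends.

```python
def interval_scheduling(requests):
    """Greedy interval scheduling: accept compatible requests by earliest finish time.

    `requests` is a list of (start, finish) pairs; returns the accepted subset.
    """
    accepted = []
    last_finish = float("-inf")
    # Sort by finish time, then scan, taking each request that starts
    # no earlier than the previously accepted request's finish.
    for start, finish in sorted(requests, key=lambda r: r[1]):
        if start >= last_finish:
            accepted.append((start, finish))
            last_finish = finish
    return accepted
```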
To analyze the algorithm we show that it “stays ahead” of the optimal solution at every point. We can prove that for every r, the r-th accepted request in our algorithm finishes no later than the r-th request in the optimal solution. By ordering the requests by finish time, our algorithm runs in O(n log n) time. There are also several extensions of this problem, such as scheduling weighted requests or scheduling requests without knowing all of them ahead of time.
A similar problem is scheduling all intervals, with the goal of using the fewest possible resources. We prove in this section that the number of resources needed equals the depth of the set of intervals, where the depth is the maximum number of requests that overlap at any point in time. It is impossible to have a solution with fewer resources than the depth, and you cannot have an optimal solution with more resources than the depth. The algorithm on page 124 orders requests by their start time and is optimal.
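The book's algorithm on page 124 sweeps the intervals by start time and assigns labels; the sketch below is my own rough equivalent that tracks each resource's finish time in a heap, assuming intervals are (start, finish) pairs and that an interval may start exactly when another ends on the same resource.

```python
import heapq

def interval_partitioning(intervals):
    """Assign every interval to a resource, using no more resources than the depth.

    Returns a list of (start, finish, resource_label) triples.
    """
    finish_heap = []          # (finish_time, resource_label) for each resource in use
    assignment = []
    next_label = 0
    for start, finish in sorted(intervals):          # process in order of start time
        if finish_heap and finish_heap[0][0] <= start:
            # The resource that frees up earliest is already free: reuse it.
            _, label = heapq.heappop(finish_heap)
        else:
            # Every resource is still busy at `start`, so open a new one.
            label = next_label
            next_label += 1
        heapq.heappush(finish_heap, (finish, label))
        assignment.append((start, finish, label))
    return assignment
```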
4.2 Scheduling to Minimize Lateness: An Exchange Argument
This problem, like interval scheduling, has a set of requests (this time with deadlines) and a single resource. The goal here is to schedule all the requests on the single resource so that the maximum lateness is minimized. Several ordering approaches that don't work are ordering by increasing length or by slack time (the gap between a job's length and its deadline). Instead, we should order the jobs by their deadlines, with the earliest deadline first. This is logical: we schedule first the jobs that would become late the soonest. The algorithm is on pages 127-128.
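A minimal Python sketch of earliest-deadline-first scheduling, assuming each job is given as a (length, deadline) pair and that time starts at 0:

```python
def minimize_lateness(jobs, start_time=0):
    """Earliest-deadline-first scheduling on a single resource.

    `jobs` is a list of (length, deadline) pairs; returns the schedule as
    (start, finish, deadline) triples along with the maximum lateness.
    """
    schedule = []
    t = start_time
    max_lateness = 0
    # Order jobs by deadline and run them back to back with no idle time.
    for length, deadline in sorted(jobs, key=lambda j: j[1]):
        start, finish = t, t + length
        schedule.append((start, finish, deadline))
        max_lateness = max(max_lateness, finish - deadline)
        t = finish
    return schedule, max_lateness
```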
There are several key points to note while analyzing this algorithm, which we can prove using an exchange argument. First of all, the optimal schedule has no idle time; if it did, we could slide the requests earlier and the new lateness would be less than or equal to the previous lateness. There also exists an optimal schedule with no inversions, where an inversion is a job with a later deadline scheduled before a job with an earlier deadline. The optimal schedule with no inversions and no idle time is exactly what the greedy algorithm produces. The proof by exchange argument works by modifying an optimal schedule with exchanges until it matches the greedy one, maintaining optimality the whole time. At first I didn't think it was possible to have an optimal solution with inversions, but it makes sense now. It also clears up the exchange argument.
4.4 Shortest Paths in a Graph
This problem is based on a directed graph and asks how to find the shortest path from node s to node t, or from s to every other node. Each edge has a weight or length associated with it, so the length of a path is the sum of the weights of the edges used to get from s to t. The algorithm, known as Dijkstra's Algorithm, is on page 138. The basic premise is that it keeps track of a set of explored nodes and the distance to each node (initially infinity). At each step we look at every edge from the explored set to an unexplored node and keep updating the minimum distances. The algorithm is greedy because at each step it adds the node reachable by the shortest s-v path that can be formed from the explored set plus a single edge. The distance recorded for each node can only get shorter.
There are two things to note about this algorithm. First, it depends on the fact that all the weights are nonnegative. Second, it is really just an extension of BFS. If we implement the algorithm with a priority queue, it runs in O(m log n) time, where the algorithm itself is O(m) and each priority queue operation is O(log n).
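Here is a sketch of the priority-queue version in Python, assuming the graph is given as a dictionary mapping each node to a list of (neighbor, weight) pairs; it is my own phrasing, not the book's pseudocode.

```python
import heapq

def dijkstra(graph, s):
    """Shortest-path distances from s, assuming nonnegative edge weights.

    `graph` maps each node to a list of (neighbor, weight) pairs.
    Runs in O(m log n) with this heap-based implementation.
    """
    dist = {s: 0}
    heap = [(0, s)]                      # (tentative distance, node)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                     # stale entry; u was already finalized
        for v, w in graph.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist
```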
4.5 The Minimum Spanning Tree Problem
This problem consists of figuring out the cheapest way to connect a network so that all the nodes are connected and the solution is a tree. There are three different greedy algorithms for this problem, all of which work. The first is Kruskal's Algorithm, which builds a spanning tree by successively inserting the edge of least cost, provided it does not create a cycle with the edges already added. Next is Prim's Algorithm, which is similar to Dijkstra's Algorithm: it starts with a root node and repeatedly adds the cheapest edge connecting the partial tree to a node outside it. Lastly, there is the Reverse-Delete Algorithm, which starts with the full graph and repeatedly deletes the most expensive edge, as long as doing so would not disconnect the graph.
The two important properties for analyzing these algorithms are the Cut Property and the Cycle Property (both assume the edge costs are distinct). The Cut Property says that for any subset of nodes S, the minimum-cost edge from S to the rest of the graph must be in the minimum spanning tree. This property helps prove the optimality of Kruskal's and Prim's Algorithms. The Cycle Property states that the most expensive edge in any cycle is not in the minimum spanning tree, which helps prove the optimality of Reverse-Delete. If we implement Prim's Algorithm with a priority queue, we can get an overall running time of O(m log n). For me, Reverse-Delete and Kruskal's Algorithm are the most intuitive.
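A rough sketch of Prim's Algorithm with a priority queue, assuming an undirected graph given as a dictionary mapping each node to a list of (neighbor, cost) pairs, with each edge listed in both adjacency lists:

```python
import heapq

def prim_mst(graph, root):
    """Prim's algorithm with a priority queue, O(m log n) overall.

    `graph` maps each node to a list of (neighbor, cost) pairs.
    Returns the list of MST edges as (u, v, cost) triples.
    """
    in_tree = {root}
    mst_edges = []
    heap = [(cost, root, v) for v, cost in graph[root]]
    heapq.heapify(heap)
    while heap and len(in_tree) < len(graph):
        cost, u, v = heapq.heappop(heap)
        if v in in_tree:
            continue                      # this edge no longer leaves the tree
        in_tree.add(v)
        mst_edges.append((u, v, cost))
        for w, c in graph[v]:
            if w not in in_tree:
                heapq.heappush(heap, (c, v, w))
    return mst_edges
```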
4.6 Implementing Kruskal's Algorithm: The Union-Find Data Structure
To efficiently implement Kruskal's Algorithm, we create a special data structure with two operations: Find(u) returns the name of the set containing the node u, and Union(u,v) merges two sets. The simplest data structure for union-find is an array storing the name of the set containing each element; with this, Find(u) is constant time and Union(u,v) is linear in the worst case. However, we can do better if we use a data structure with pointers. Each node is stored in a record with a pointer toward the node that names its set. We can then implement Union(u,v) in constant time by just updating a pointer, and Find(u) runs in O(log n), which is faster overall.
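A minimal sketch of the pointer-based structure in Python, assuming we merge by size so that Find stays O(log n); further improvements such as path compression are left out.

```python
class UnionFind:
    """Pointer-based union-find with union by size (Find is O(log n))."""

    def __init__(self, elements):
        self.parent = {x: x for x in elements}   # each element starts in its own set
        self.size = {x: 1 for x in elements}

    def find(self, u):
        # Follow parent pointers up to the set's name (its representative).
        while self.parent[u] != u:
            u = self.parent[u]
        return u

    def union(self, u, v):
        # Merge the two sets, pointing the smaller set at the larger one.
        ru, rv = self.find(u), self.find(v)
        if ru == rv:
            return
        if self.size[ru] < self.size[rv]:
            ru, rv = rv, ru
        self.parent[rv] = ru
        self.size[ru] += self.size[rv]
```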
To implement Kruskal's Algorithm with this idea, we first sort the edges by cost, which takes O(m log m) = O(m log n) time. Then we use Union-Find to maintain the connected components of the tree we are building. We consider each edge with endpoints u and v, compute Find(u) and Find(v), and merge the sets if the algorithm says the edge should be in the tree. The algorithm seems very straightforward, but it is a little trickier to implement once you get into the details and try to make it as efficient as possible.
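Putting the pieces together, here is a sketch of Kruskal's Algorithm that reuses the UnionFind class from the sketch above, assuming edges are given as (cost, u, v) triples:

```python
def kruskal_mst(nodes, edges):
    """Kruskal's algorithm using the UnionFind sketch above.

    `edges` is a list of (cost, u, v) triples; sorting them dominates the
    running time at O(m log m) = O(m log n).
    """
    uf = UnionFind(nodes)
    mst_edges = []
    for cost, u, v in sorted(edges):
        # Add the edge only if its endpoints lie in different components,
        # i.e. it does not create a cycle.
        if uf.find(u) != uf.find(v):
            uf.union(u, v)
            mst_edges.append((cost, u, v))
    return mst_edges
```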
4.7 Clustering
The Clustering Problem is essentially trying to classify objects into coherent groups based on some distance function. This distance function is often abstract, rather than a distance in feet as we might think of it. The goal is to divide objects into groups so that objects in the same group are “close together” and objects in different groups are “far apart”. We use minimum spanning trees to find a clustering of maximum spacing that partitions n objects into k clusters.
The idea of the algorithm is to repeatedly add an edge between the two closest nodes/clusters, where clusters are connected components. We start with n clusters and continue this process until we are down to k clusters, as desired. It turns out this can be done using Kruskal's Algorithm, stopping just before it adds the last k-1 edges. This is equivalent to computing the minimum spanning tree and then deleting the k-1 most expensive edges. I would not have thought these were equal, but after going through the proof it makes sense.
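A sketch of this single-link clustering, again reusing the UnionFind class from section 4.6 and assuming pairwise distances are given as (distance, u, v) triples:

```python
def max_spacing_clusters(nodes, edges, k):
    """Run Kruskal's algorithm but stop once only k components remain.

    `edges` is a list of (distance, u, v) triples; returns the clusters as sets.
    """
    uf = UnionFind(nodes)
    components = len(nodes)
    for dist, u, v in sorted(edges):
        if components == k:
            break
        if uf.find(u) != uf.find(v):
            uf.union(u, v)
            components -= 1
    # Group the nodes by their set's representative.
    clusters = {}
    for x in nodes:
        clusters.setdefault(uf.find(x), set()).add(x)
    return list(clusters.values())
```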
4.8 Huffman Codes and Data Compression
The general problem of this section is how to encode symbols using bits. The simplest way, but not the most efficient, is to use codes of the same length for each symbol. For example, if we had 32 symbols to encode, we would need 5 bits per symbol. However, most alphabets have letters that occur more frequently than others, so it would be more efficient to make the codes for those symbols shorter. This idea is called data compression. A code with different numbers of bits for different symbols is a variable-length code (e.g., Morse code). However, it is also important that a code be a prefix code, which means that no code word is a prefix of another. This means you can read the encoded message straight through, stopping each time you reach a code word, without running into any ambiguity.
To evaluate compression, we look at the average number of bits per letter, which is the sum over the letters of each letter's frequency times its number of bits. The goal is to find a prefix code with the minimum average bits per letter. These codes are called Huffman codes and are represented using a binary tree. In this tree, a left child represents a 0 and a right child represents a 1; each leaf is a letter, and you get the bits for that letter by following the path down from the root. An algorithm to build this tree, given frequencies and an alphabet, is on page 172. You can prove Huffman codes are optimal using induction, and the running time of the algorithm is O(k log k) for an alphabet of k symbols.
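A small Python sketch of the tree-building idea, assuming frequencies come in as a {symbol: frequency} dictionary; internal tree nodes are represented as (left, right) tuples, which is my own representation rather than the book's.

```python
import heapq
from itertools import count

def huffman_code(frequencies):
    """Build a prefix code from a {symbol: frequency} mapping.

    Repeatedly merges the two least frequent trees; O(k log k) with a heap.
    Returns a {symbol: bitstring} dictionary.
    """
    tiebreak = count()                   # avoids comparing tree nodes directly
    heap = [(f, next(tiebreak), sym) for sym, f in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    _, _, tree = heap[0]

    codes = {}
    def walk(node, bits):
        if isinstance(node, tuple):      # internal node: 0 to the left, 1 to the right
            walk(node[0], bits + "0")
            walk(node[1], bits + "1")
        else:                            # leaf: record this symbol's code word
            codes[node] = bits or "0"    # edge case: a one-symbol alphabet
        return codes
    return walk(tree, "")
```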
This section was interesting to me because I actually took a class on data compression in Scotland last semester. We learned to think about and visualize Huffman codes without the tree, but we still used the tree to implement them. It was interesting to see a more in-depth analysis of the algorithms I worked with before.