====== Chapter 4: Greedy Algorithms ======
===== Section 1: Interval Scheduling: The Greedy Algorithm Stays Ahead =====
The Interval Scheduling Problem consists of a set of requests {1, 2, ..., n}, where the ith request corresponds to an interval of time starting at s(i) and finishing at f(i). When using a greedy algorithm for this problem, we want a simple rule for selecting the first request. Once the first request is accepted, all requests that are not compatible with it are rejected. Then another compatible request is accepted, all remaining requests incompatible with it are rejected, and the process repeats until no requests are left. The main challenge is deciding what the rule should be for picking the next request. There are four natural possibilities for ordering the requests: by earliest start time, by shortest interval, by fewest conflicts with other requests, and by earliest finish time. Choosing by earliest start time does not yield an optimal solution, because if the earliest request covers a very long interval, we may have to reject many requests with shorter intervals. Choosing by shortest interval does not yield an optimal solution either, because a short request can overlap two longer compatible requests, so accepting it means only one request is chosen instead of two. Choosing by fewest conflicts also fails: the request with the fewest conflicts may not belong to any optimal solution, since accepting it can block requests that every optimal schedule needs. The correct rule is to choose by earliest finish time. We prove this using the "greedy stays ahead" idea: since the greedy algorithm always chooses the remaining request with the earliest finish time, the rth request it accepts finishes at least as early as the rth request of any optimal solution. If an optimal solution had an extra request beyond greedy's last one, that request would start after greedy's last finish time, so greedy could have accepted it too, a contradiction. This algorithm runs in O(n log n), since it takes O(n log n) time to sort the requests by finish time; selecting each request then takes constant time.

A related problem is scheduling all of the requests using as few resources as possible, the Interval Partitioning Problem. "In any instance of the Interval Partitioning Problem, the number of resources needed is at least the depth of the set of intervals," where the depth of a set of intervals is the maximum number that pass over any single point on the timeline. The algorithm for this problem is simple. Sort the requests by start time, then go through them one by one. For each request, look at the resources already allocated: if the request can be added to some resource without conflicting with a request already assigned to it, add it there; if no compatible resource exists, create a new resource and add the request to it. The algorithm schedules every request using a number of resources equal to the depth of the set of intervals, which is the optimal number of resources needed. This section was very easy to read and understand, I'd give it an 8/10.
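To make the earliest-finish-time rule concrete, here is a minimal sketch in Python. The function name and the (start, finish) tuple representation are my own illustration, not from the text.

<code python>
def interval_schedule(requests):
    """requests: list of (start, finish) pairs.
    Returns a maximum-size set of mutually compatible requests."""
    accepted = []
    last_finish = float("-inf")
    # Sorting by finish time is the O(n log n) step that dominates the running time.
    for start, finish in sorted(requests, key=lambda r: r[1]):
        # A request is compatible iff it starts no earlier than the last accepted finish.
        if start >= last_finish:
            accepted.append((start, finish))
            last_finish = finish
    return accepted

print(interval_schedule([(1, 4), (3, 5), (0, 6), (5, 7), (3, 8), (8, 11)]))
# -> [(1, 4), (5, 7), (8, 11)]
</code>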
===== Section 2: Scheduling to Minimize Lateness =====
We now consider a different scheduling problem. As before, each request takes a given interval of time to complete, but now each request has a deadline and can be scheduled at any time before or after it. Each accepted request is assigned a contiguous interval of time equal to its length, and different requests must not have overlapping intervals. A request is late if its finish time is after its deadline; its lateness is its finish time minus its deadline. The goal is to minimize the maximum lateness over all requests. There are several approaches one could try. The first is to schedule jobs in order of increasing length. This does not produce an optimal ordering: a short job might have a very distant deadline, while a longer job might have an earlier deadline and need to be scheduled first. Another approach is to order the jobs by slack time, the deadline minus the length of the job. This also fails to produce an optimal solution: a long job with zero slack would be scheduled before a short job with slightly more slack, even though running the short job first would make the maximum lateness much smaller. The approach that always produces an optimal solution is to sort the jobs in increasing order of their deadlines, a rule termed Earliest Deadline First. This ensures that jobs with earlier deadlines are completed earlier, which minimizes the maximum lateness. The schedule produced this way has no idle time, and we only need to compare it against schedules with no idle time, since idle time in any schedule could be used to do work and finish jobs earlier. Another key fact is that all schedules with no inversions and no idle time have the same maximum lateness. An inversion occurs when a job with a later deadline is scheduled before a job with an earlier deadline; the greedy algorithm would have scheduled those two jobs in the opposite order. If a schedule has an inversion, swapping the two inverted jobs produces a schedule with one fewer inversion and a maximum lateness no larger than before. We can therefore keep removing inversions without ever increasing the maximum lateness, which shows there is an optimal schedule with no inversions or idle time, and hence that the greedy schedule is optimal. This section was easy to understand after having talked about it extensively in class. I'd give it a 7/10.
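A compact sketch of Earliest Deadline First, assuming jobs are given as (length, deadline) pairs and the schedule starts at time 0; the names are illustrative.

<code python>
def edf_schedule(jobs):
    """jobs: list of (length, deadline) pairs.
    Runs jobs back to back from t = 0 in order of increasing deadline;
    returns (ordering, maximum lateness)."""
    order = sorted(jobs, key=lambda job: job[1])  # Earliest Deadline First
    t = 0
    max_lateness = 0
    for length, deadline in order:
        t += length  # this job occupies [t - length, t]; no idle time
        max_lateness = max(max_lateness, t - deadline)
    return order, max_lateness

print(edf_schedule([(10, 10), (1, 2)]))  # -> ([(1, 2), (10, 10)], 1)
</code>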
===== Section 4: Shortest Paths in a Graph =====
Another application of greedy algorithms is finding the shortest paths from a node to every other node in a graph. The setup of the shortest paths problem consists of a directed graph with a designated start node s that has a path to every other node in G. Each edge of the graph is given a length (or weight) denoting the time it takes to traverse it, and the length of a path is the sum of the lengths of the edges it traverses. The goal is to determine the shortest path from s to every other node in the graph. The algorithm for this problem is Dijkstra's algorithm. It maintains a set S of explored nodes whose shortest distance from s is already known, and repeatedly adds the node outside S that can be reached most cheaply through S. There are limitations, however: the algorithm does not always find shortest paths if some edges can have negative lengths. A more complex algorithm, due to Bellman and Ford, is required for cases with negative edge lengths. Dijkstra's algorithm can also be thought of as a "continuous" version of Breadth-First Search for traversing a graph, and the paths it produces form a tree of shortest paths from the start node. A direct implementation runs in O(mn) time: there are n - 1 iterations, each adding a new node to the current set, and each iteration must consider every node not yet in the set and every edge between the set and that node to determine the minimum distance, so all m edges may be scanned per iteration. However, if the nodes outside the set are kept in a priority queue keyed by their current distance estimates, there are O(m) priority queue operations, each taking O(log n), so Dijkstra's algorithm can be implemented to run in O(m log n) time overall. This section was rather difficult to follow, I would give it a 4/10.
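Here is a sketch of the priority-queue version using Python's heapq; the adjacency-list representation and the function name are assumptions for illustration. Since heapq has no decrease-key operation, the sketch pushes new entries and skips stale ones, which gives the same O(m log n) bound.

<code python>
import heapq

def dijkstra(graph, s):
    """graph: dict mapping each node to a list of (neighbor, length) pairs,
    with nonnegative lengths. Returns a dict of shortest distances from s."""
    dist = {s: 0}
    pq = [(0, s)]  # (distance estimate, node)
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue  # stale entry; u was already finalized with a shorter path
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd  # found a cheaper way to reach v through u
                heapq.heappush(pq, (nd, v))
    return dist

g = {"s": [("a", 1), ("b", 4)], "a": [("b", 2)], "b": []}
print(dijkstra(g, "s"))  # -> {'s': 0, 'a': 1, 'b': 3}
</code>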
===== Section 5: The Minimum Spanning Tree Problem =====
Another fundamental problem that can be solved using graphs is the Minimum Spanning Tree Problem. This problem focuses on a set of locations (nodes) and connections between them, where the graph is connected: there is a path between every pair of nodes. Each edge has a certain cost to traverse it. The goal is to find a subset of the edges that keeps the graph connected and whose total cost (the sum of the edge costs) is minimized. A theorem about this problem states: "Let T be a minimum-cost solution to the Minimum Spanning Tree Problem. Then (V, T) is a tree." By definition a solution must be connected, and it cannot contain a cycle: removing one edge of a cycle leaves the graph connected, so if a minimum-cost solution contained a cycle, we could delete the cycle's most expensive edge and obtain a cheaper connected solution, a contradiction. If we allow edges of cost 0, a solution may have extra edges, but there will always be a minimum-cost solution that is a tree.

There are three main greedy algorithms that correctly find the minimum spanning tree of a graph. The first, Kruskal's algorithm, starts with a graph without any edges and builds the minimum spanning tree by successively adding edges in order of increasing cost; you iterate through the edges in this order and only add an edge if it does not create a cycle. The second, Prim's algorithm, uses an approach very similar to Dijkstra's algorithm: you start with a root node and greedily grow a tree outwards, at each step adding the node that can be attached to the tree at the smallest cost. The third, the Reverse-Delete algorithm, is the backwards version of Kruskal's: you start with the full graph and delete edges in order of decreasing cost, deleting an edge as long as doing so does not disconnect the graph. All three of these algorithms produce an optimal solution. The crucial fact about edge insertion is the Cut Property: "Assume that all edge costs are distinct. Let S be any subset of nodes that is neither empty nor equal to all of V, and let edge e = (v, w) be the minimum-cost edge with one end in S and the other in V - S. Then every minimum spanning tree contains the edge e." The Cut Property proves the optimality of both Kruskal's and Prim's algorithms. The crucial fact about edge deletion is the Cycle Property: "Assume that all edge costs are distinct. Let C be any cycle in G, and let edge e = (v, w) be the most expensive edge belonging to C. Then e does not belong to any minimum spanning tree of G." The Cycle Property proves the optimality of the Reverse-Delete algorithm. Both Prim's and Kruskal's algorithms can be implemented to run in O(m log n) time using a priority queue. This section made what we talked about in class much more clear, and was easy to follow. I'd give it an 8/10.
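A minimal sketch of Prim's algorithm with a heap, assuming an undirected adjacency-list graph; the names are my own. (A Kruskal sketch appears with the Union-Find structure in the next section.)

<code python>
import heapq

def prim_mst(graph, root):
    """graph: dict node -> list of (neighbor, cost), undirected.
    Greedily grows a tree outward from root; returns the MST edges."""
    in_tree = {root}
    mst_edges = []
    heap = [(cost, root, v) for v, cost in graph[root]]
    heapq.heapify(heap)
    while heap and len(in_tree) < len(graph):
        cost, u, v = heapq.heappop(heap)
        if v in in_tree:
            continue  # both endpoints already in the tree; adding would make a cycle
        in_tree.add(v)                # v is the cheapest node to attach right now
        mst_edges.append((u, v, cost))
        for w, c in graph[v]:
            if w not in in_tree:
                heapq.heappush(heap, (c, v, w))
    return mst_edges
</code>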
===== Section 6: Implementing Kruskal's Algorithm: The Union-Find Data Structure =====
A basic graph problem that we covered in a previous chapter is finding the set of connected components, which can be achieved using Depth-First Search or Breadth-First Search. When implementing Kruskal's algorithm, we want a data structure that stores a representation of the connected components so we can check whether an edge would create a cycle. When considering an edge between two nodes, we check whether the connected components containing the two nodes are different: if so, we can include the edge; otherwise we should omit it. The Union-Find data structure allows us to maintain disjoint sets for exactly this purpose. Given a node, the FIND operation returns the name of the set containing it, so we can use it to check whether two nodes are in the same set. The structure also implements a UNION operation, which merges two distinct sets into a single set. Using these two operations, we can maintain the connected components of the evolving graph as edges are added; the structure does not support deleting edges. In full, the Union-Find data structure supports three operations. The first, MAKEUNIONFIND(S), returns a Union-Find data structure on the set S in which every element is in its own set, and is used at the very start of the algorithm. The second, FIND(u), returns the name of the set containing u. The third, UNION(A, B), merges the sets A and B into a single set. The goal running time for the first operation is O(n), and for the second and third it is O(log n). The main decision is how to implement the Union-Find data structure. "Consider the array implementation of the Union-Find data structure for some set S of size n, where unions keep the name of the larger set. The FIND operation takes O(1) time, MAKEUNIONFIND(S) takes O(n) time, and any sequence of k UNION operations takes at most O(k log k) time." "Consider the pointer-based implementation of the Union-Find data structure for some set S of size n, where unions keep the name of the larger set. A UNION operation takes O(1) time, MAKEUNIONFIND(S) takes O(n) time, and FIND takes O(log n) time." Using the Union-Find data structure, we can implement Kruskal's algorithm on a graph with n nodes and m edges to run in O(m log n) time. Understanding the setup of the Union-Find data structure was rather easy to follow, although it became more complex when discussing the pointer-based implementation. I would give this section a 6/10.
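A sketch of the pointer-based idea in Python, using a parent dictionary where unions keep the name of the larger set (so FIND follows at most O(log n) pointers), with Kruskal's algorithm built on top; the class and function names are my own.

<code python>
class UnionFind:
    """Pointer-based Union-Find; unions keep the name of the larger set."""
    def __init__(self, elements):          # MAKEUNIONFIND(S): O(n)
        self.parent = {x: x for x in elements}
        self.size = {x: 1 for x in elements}

    def find(self, u):                     # FIND(u): follows O(log n) pointers
        while self.parent[u] != u:
            u = self.parent[u]
        return u

    def union(self, a, b):                 # merge the sets containing a and b
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:  # the larger set keeps its name
            ra, rb = rb, ra
        self.parent[rb] = ra               # O(1) pointer update after the FINDs
        self.size[ra] += self.size[rb]

def kruskal(nodes, edges):
    """edges: list of (cost, u, v). Returns the MST edges in O(m log n) time."""
    uf = UnionFind(nodes)
    mst = []
    for cost, u, v in sorted(edges):       # consider edges by increasing cost
        if uf.find(u) != uf.find(v):       # different components: no cycle created
            uf.union(u, v)
            mst.append((u, v, cost))
    return mst
</code>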
===== Section 7: Clustering =====
Another area where minimum spanning trees can be applied is clustering. Clustering is the problem of trying to classify or organize a collection of objects into coherent groups. "A common approach to this problem is to define a distance function on the objects, with the interpretation that objects at a larger distance from one another are less similar to each other." Given these pairwise distances, the clustering problem asks us to divide the objects into groups so that objects within a group are close and objects in different groups are far apart. "The spacing of a k-clustering is the minimum distance between any pair of points that lie in different clusters." We want to find a k-clustering with the maximum possible spacing. "To find a clustering of maximum spacing, we consider growing a graph on the vertex set U. The connected components will be the clusters, and we will try to bring nearby points together into the same cluster as rapidly as possible." We add edges in order of increasing distance between their endpoints, and only add an edge if it connects nodes that are not already in the same cluster; adding an edge between two distinct connected components merges the two clusters. "In clustering lingo, the iterative merging of clusters is termed single-link clustering, a special case of hierarchical agglomerative clustering. Agglomerative means that we combine clusters; single-link means we do it as soon as a single link joins them together." This procedure is exactly Kruskal's Minimum Spanning Tree algorithm; when we seek a k-clustering, we simply stop the algorithm once we obtain k distinct connected components. "The components C1, C2, ..., Ck formed by deleting the k - 1 most expensive edges of the minimum spanning tree T constitute a k-clustering of maximum spacing." This section was relatively easy to follow, and hopefully will become more clear after discussing it in class. I would give it a 7/10.
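A sketch of single-link k-clustering as Kruskal's algorithm stopped early, reusing the UnionFind class from the Section 6 sketch; the function name and the (points, distance-function) interface are assumptions.

<code python>
def k_clustering(points, k, dist):
    """points: list of hashable objects; dist(p, q): distance between two of them.
    Merges clusters in order of increasing distance until k components remain.
    Returns (UnionFind over the points, spacing of the resulting clustering)."""
    edges = sorted((dist(p, q), p, q)
                   for i, p in enumerate(points)
                   for q in points[i + 1:])
    uf = UnionFind(points)                 # from the Section 6 sketch
    components = len(points)
    for d, p, q in edges:
        if uf.find(p) != uf.find(q):       # this edge crosses two clusters
            if components == k:
                return uf, d               # cheapest cross-cluster edge = max spacing
            uf.union(p, q)
            components -= 1
    return uf, None                        # k == 1: one cluster, spacing undefined
</code>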
===== Section 8: Huffman Codes and Data Compression =====
Now we consider a problem where we use a greedy rule to "shrink the size of the problem instance, so that an equivalent smaller problem can then be solved by recursion." Solving the smaller instance still leads to an optimal solution of the original larger instance, but the final result isn't assembled until the whole recursion is completed. The problem deals with data compression, an important part of digital communication. Computers encode data into sequences of bits (1s and 0s) so that they can store information and perform operations, and the problem of data compression is to find the best way to store the data in the smallest number of bits possible so that it can be transmitted in the least amount of time. The average number of bits per letter is calculated by taking the length of each letter's encoding, multiplying it by the frequency of that letter, and summing over all the letters. The goal of data compression is to minimize this average, so that letters that occur more frequently are encoded with fewer bits than less frequent letters. One example of data compression is Morse code, which was used to transmit messages using sequences of dots and dashes to represent letters. However, this encoding of words was ambiguous: the encoding of one letter could be a prefix of the encoding of another, so there was no way to distinguish where one letter ended and the next began. To deal with this issue, Morse code transmissions had to include short pauses between letters, which is inefficient and can be done in a much better way. To optimize data transmission, we want a prefix code: a way to encode all of the letters such that no letter's encoding is a prefix of another's, removing any ambiguity about what the data represents, while still representing more frequent letters with fewer bits than less frequent ones. To do this, we represent the prefix code using a binary tree, where following an edge to the left represents a 0 and following an edge to the right represents a 1. "The encoding of alphabet S constructed from binary tree T is a prefix code." To ensure that no letter's encoding is a prefix of another's, all letters must be stored on leaf nodes of the tree, with no letters stored on interior nodes. "The binary tree corresponding to the optimal prefix code is full." "There is an optimal prefix code, with corresponding tree T*, in which the two lowest-frequency letters are assigned to leaves that are siblings in T*." The algorithm to build this tree is as follows: if S has two letters, encode one with 0 and the other with 1; otherwise, let y and z be the two lowest-frequency letters, replace them in the alphabet with a combined letter yz, recursively build a code for the smaller alphabet, and then expand yz into a supernode with y and z as its children. "The Huffman code for a given alphabet achieves the minimum average number of bits per letter of any prefix code." The running time of Huffman's algorithm relies on the use of a priority queue in the implementation, whose operations take O(log n) each, so the algorithm runs in O(n log n) time. A major drawback of prefix codes is that they cannot adapt to changes in the text. This section was very easy to read and follow after having learned about it in class. I would give it a 9/10.
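A sketch of Huffman's algorithm with a priority queue, assuming letter frequencies are given as a dict; the function name, the nested-tuple tree representation, and the tie-breaking counter are my own illustration.

<code python>
import heapq
from itertools import count

def huffman(freq):
    """freq: dict letter -> frequency. Returns dict letter -> prefix code."""
    tiebreak = count()  # keeps heap entries comparable when frequencies tie
    pq = [(f, next(tiebreak), letter) for letter, f in freq.items()]
    heapq.heapify(pq)
    while len(pq) > 1:
        f1, _, left = heapq.heappop(pq)    # the two lowest-frequency letters...
        f2, _, right = heapq.heappop(pq)
        # ...are replaced by a combined supernode with them as children.
        heapq.heappush(pq, (f1 + f2, next(tiebreak), (left, right)))
    _, _, tree = pq[0]
    codes = {}
    def walk(node, code):
        if isinstance(node, tuple):        # interior node: left edge 0, right edge 1
            walk(node[0], code + "0")
            walk(node[1], code + "1")
        else:                              # leaf: record this letter's code
            codes[node] = code or "0"
    walk(tree, "")
    return codes

print(huffman({"a": 0.32, "b": 0.25, "c": 0.20, "d": 0.18, "e": 0.05}))
</code>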
  