Chapter 6: Dynamic Programming

Often we face a situation where no natural greedy algorithm exists and divide and conquer is not effective at reducing the running time, such as when we do not know "what is coming next." Then we turn to Dynamic Programming, where we carefully structure the algorithm so that it stays close to brute-force search but systematically reuses work to save running time.

Section 1: Weighted Interval Scheduling

The first dynamic programming problem is the weighted interval scheduling problem. It is a generalization of the one we saw in the greedy algorithm chapter, and it can no longer be tackled with a greedy algorithm. Brute-force search over all subsets of intervals takes exponential time, and of course we want to reduce that. The first step is to order the intervals by their finishing times. Second, set up a function p(j) giving the largest index i < j such that interval i does not overlap interval j; this assures us that if we choose any interval with index at most p(j), it will not overlap with j. Hence, when we analyze whether or not an interval j shall be in the optimal solution for intervals 1…j, we check whether v_j + OPT(p(j)) > OPT(j-1): if so, then j is in; if not, it shall not be included. This comparison only takes constant time, because there are only two possible cases. Alongside the comparisons, obviously, we need to record all the optimal values OPT(j), which takes O(n) time in total. The memoization saves us time and also enables us to find the intervals in linear time later on. When we finally compute OPT(n) for n intervals, all we have at that moment is the value of the optimal solution, not the actual intervals that achieve it. Thanks to the memoization, we can trace back and figure out the intervals, which takes O(n) time. Together, after the initial sort, the algorithm takes O(n) time.
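
Here is a minimal sketch of the whole pipeline in Python, assuming intervals are given as (start, finish, value) tuples; the function and variable names are my own, not the book's.

<code python>
from bisect import bisect_right

def weighted_interval_scheduling(intervals):
    # Step 1: sort by finishing time.
    intervals = sorted(intervals, key=lambda t: t[1])
    n = len(intervals)
    finishes = [f for _, f, _ in intervals]
    # p[j] = largest index i < j whose interval does not overlap interval j
    # (0 means "no such interval"); indices are 1-based to match OPT.
    p = [0] * (n + 1)
    for j in range(1, n + 1):
        s = intervals[j - 1][0]
        p[j] = bisect_right(finishes, s, 0, j - 1)
    # Bottom-up: OPT[j] = max(v_j + OPT[p(j)], OPT[j-1]).
    OPT = [0] * (n + 1)
    for j in range(1, n + 1):
        v = intervals[j - 1][2]
        OPT[j] = max(v + OPT[p[j]], OPT[j - 1])
    # Trace back through the memoized table to recover the intervals, O(n).
    chosen, j = [], n
    while j > 0:
        v = intervals[j - 1][2]
        if v + OPT[p[j]] >= OPT[j - 1]:
            chosen.append(intervals[j - 1])
            j = p[j]
        else:
            j -= 1
    return OPT[n], list(reversed(chosen))
</code>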

Readability is 8. The section is clear and easy to follow.

Section 2: Principles of Dynamic Programming: Memoization or Iteration over Subproblems

The section sums up the basic principles of dynamic programming. The fundamental idea is to iterate through all subproblems rather than computing recursively. While we solve the subproblems, we record all the solutions, which in turn helps us solve the whole problem in polynomial time instead of exponential time. The basic outline/requirements for dynamic programming are listed below (a sketch contrasting the two styles follows the list):

I. There are only a polynomial number of subproblems.
II. The solution to the original problem can be easily (for example, in constant or linear time) computed from the solutions to the subproblems.
III. There is a natural ordering of the subproblems (for example, by finishing time in Weighted Interval Scheduling).
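
A small sketch of the two styles the section title names, using the recurrence from Section 1; it assumes v and p are 1-indexed lists as defined there (these names are mine, for illustration).

<code python>
from functools import lru_cache

def opt_memoized(n, v, p):
    # Top-down: recurse, but cache each subproblem so it is solved only once.
    @lru_cache(maxsize=None)
    def opt(j):
        if j == 0:
            return 0
        return max(v[j] + opt(p[j]), opt(j - 1))
    return opt(n)

def opt_iterative(n, v, p):
    # Bottom-up: iterate over the subproblems in their natural ordering.
    M = [0] * (n + 1)
    for j in range(1, n + 1):
        M[j] = max(v[j] + M[p[j]], M[j - 1])
    return M[n]
</code>

Both run in O(n) time; the only difference is whether the ordering of subproblems is implicit (recursion plus a cache) or explicit (a loop).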

This section concisely summarizes the logic behind dynamic programming. I think it is quite useful to read before looking at more complicated examples.

Readability is 8.

Section 3: Segmented Least Squares: Multi-way Choices

This section talks about another problem to be dealt with through dynamic programming, but it is slightly different from the Weighted Interval Scheduling problem. The book first spends several paragraphs explaining the problem. In short, we want a best fit for some discrete points, where the fit is composed of several line segments; however, every line added incurs a penalty. The objective, OPT, is composed of the squared error plus a penalty C for each line used, and we want to minimize it. The relationship between the OPT of the nth point and the previous ones is:

OPT(n) = min_{1 <= i <= n} ( e_{i,n} + C + OPT(i-1) )

Notice that this time, in order to find OPT(n), we actually need all the values OPT(i) for i before n, because it is entirely possible that a line from the (n-1)th point to the nth is not as good as a line from the (n-i)th point to the nth. This is also where the difference between this problem and the weighted interval scheduling problem comes from. Hence the running time of the algorithm, dominated by computing all the errors e_{i,j}, is O(n^3) (the recurrence itself takes O(n^2)). Tracing back to find the lines takes O(n^2) time, because for each segment we re-examine multiple candidate lines. The readability is 7.
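
A sketch of the whole algorithm, assuming points are (x, y) pairs sorted by x and using the standard closed-form least-squares fit for each e_{i,j}; all names here are my own.

<code python>
def segmented_least_squares(points, C):
    n = len(points)
    # e[i][j]: squared error of the best-fit line through points i..j (1-based).
    e = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        sx = sy = sxx = sxy = 0.0
        for j in range(i, n + 1):
            x, y = points[j - 1]
            sx += x; sy += y; sxx += x * x; sxy += x * y
            m = j - i + 1
            denom = m * sxx - sx * sx
            a = (m * sxy - sx * sy) / denom if denom else 0.0
            b = (sy - a * sx) / m
            e[i][j] = sum((yk - a * xk - b) ** 2 for xk, yk in points[i - 1:j])
    # OPT[j] = min over i of e[i][j] + C + OPT[i-1]; remember the best i.
    OPT = [0.0] * (n + 1)
    seg_start = [0] * (n + 1)
    for j in range(1, n + 1):
        best_i = min(range(1, j + 1), key=lambda i: e[i][j] + C + OPT[i - 1])
        seg_start[j] = best_i
        OPT[j] = e[best_i][j] + C + OPT[best_i - 1]
    # Trace back the segments from the right end.
    segments, j = [], n
    while j > 0:
        i = seg_start[j]
        segments.append((i, j))
        j = i - 1
    return OPT[n], list(reversed(segments))
</code>

The inner error sum is what makes the precomputation O(n^3), matching the bound above.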

Section 4: Subset Sums and Knapsacks: Adding a Variable

This is another variation of the dynamic programming problem. The book first introduces the problem: we have many items, each with a value and a weight, to pick from, subject to a weight limit W. How do we pick so as to maximize the total value? The optimal relationship: if w_i > w, obviously adding item i would make the basket overweight, so the answer is no; otherwise

OPT(i, w) = max( OPT(i-1, w), v_i + OPT(i-1, w - w_i) ).

If not adding the item gives a better result than adding it, surely we choose no; otherwise yes. This is different from the previous two problems because now we look at two variables, both the weight (not to exceed) and the value (to maximize), instead of only one quantity to optimize. On the surface it is a little like the weighted interval problem, because it is a binary decision: either include the item or not. However, looking closer, to decide whether to include item n we need to see if there is a way to make better use of the weight without including it. So we need to calculate all the possible optimal values at each weight w from 1 to W, allowing the 1st item, then the 1st and 2nd items, the 1st through 3rd items, … up to the 1st through nth items. This takes O(nW) time (pseudo-polynomial, since it depends on the magnitude of W). Tracing back to find which items to add takes linear time. Readability is 7.
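
A minimal sketch of the table and the traceback, assuming integer weights; names are my own, not the book's.

<code python>
def knapsack(values, weights, W):
    n = len(values)
    # OPT[i][w] = best value using items 1..i with weight limit w.
    OPT = [[0] * (W + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        vi, wi = values[i - 1], weights[i - 1]
        for w in range(W + 1):
            if wi > w:
                OPT[i][w] = OPT[i - 1][w]                 # item i cannot fit
            else:
                OPT[i][w] = max(OPT[i - 1][w],            # skip item i
                                vi + OPT[i - 1][w - wi])  # take item i
    # Trace back in linear time to recover which items were taken.
    chosen, w = [], W
    for i in range(n, 0, -1):
        if OPT[i][w] != OPT[i - 1][w]:
            chosen.append(i)
            w -= weights[i - 1]
    return OPT[n][W], list(reversed(chosen))
</code>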

Section 5: RNA Secondary Structure: Dynamic Programming over Intervals

This is another, slightly more complicated problem. The problem is RNA secondary structure: a matching must satisfy that the two ends of each pair are separated by at least 4 bases, that no two pairs cross, and that only A-U and C-G pair up. We consider the possible matchings and rule out the ones that are definitely not optimal until only one is left. We try matching any two bases that are more than 4 positions apart; if they cannot pair, we just go on to the next one. The difficulty of this problem is that once a new pair appears, we get further constraints (no crossing, the 4-base separation) because of that pair. In fact, in some cases the original problem is divided into 2 subproblems (not evenly). That is why we call it "dynamic programming over intervals." Of course, as usual, the solution is built up from the smallest intervals, which makes the algorithm polynomial: there are O(n^2) intervals and each takes O(n) work, for O(n^3) overall. Even so, realistically, this is not suitable for RNA matching, in my opinion, because we can't have more than 5 bases consecutively unmatched and we can't have bases hanging unmatched at the end of the strand, which can happen if we apply the algorithm presented in the book. All the constraints from the real world make biological computation more difficult. Readability is 6.
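
A sketch of the interval DP, assuming the input is a string over A, U, C, G and that we maximize the number of valid pairs; names are my own.

<code python>
def rna_secondary_structure(rna):
    n = len(rna)
    pairs = {('A', 'U'), ('U', 'A'), ('C', 'G'), ('G', 'C')}
    # OPT[i][j] = max pairs within rna[i..j]; intervals shorter than 6 hold
    # no pairs because of the 4-base separation rule.
    OPT = [[0] * n for _ in range(n)]
    for length in range(6, n + 1):          # build up from short intervals
        for i in range(0, n - length + 1):
            j = i + length - 1
            best = OPT[i][j - 1]            # case 1: j stays unmatched
            for t in range(i, j - 4):       # case 2: j pairs with t <= j - 5
                if (rna[t], rna[j]) in pairs:
                    # pairing (t, j) splits the interval into two subproblems
                    left = OPT[i][t - 1] if t > i else 0
                    best = max(best, 1 + left + OPT[t + 1][j - 1])
            OPT[i][j] = best
    return OPT[0][n - 1] if n else 0
</code>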

Section 6: Sequence Alignment

This section introduces a fairly commonly seen problem: sequence alignment, which appears in Google search and molecular computational biology. We want to see if two strings are "similar"; thanks to some biologists, there is a numerical definition of similarity: a mismatch cost α_{pq} for aligning symbols p and q, and a gap cost δ for leaving a symbol unmatched. The algorithm given by the book uses the recurrence:

OPT(i, j) = min( α_{x_i y_j} + OPT(i-1, j-1), δ + OPT(i-1, j), δ + OPT(i, j-1) ).

The first possibility is to have x_i and y_j matched with each other, the second is to leave x_i unmatched, and the third is to leave y_j unmatched. To prove the correctness of the dynamic program, we visualize a grid graph with edges from each node to the one above, to the right, and to the upper-right diagonal. It is not hard to show that OPT(i, j) equals the length of the shortest path from (0,0) to (i,j), which essentially means the best alignment.
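
A minimal sketch of the table computation, assuming δ is a number and α is passed in as a two-argument cost function; names are my own.

<code python>
def alignment_cost(x, y, delta, alpha):
    m, n = len(x), len(y)
    OPT = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        OPT[i][0] = i * delta               # align x[1..i] against nothing
    for j in range(1, n + 1):
        OPT[0][j] = j * delta
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            OPT[i][j] = min(alpha(x[i - 1], y[j - 1]) + OPT[i - 1][j - 1],
                            delta + OPT[i - 1][j],    # leave x_i unmatched
                            delta + OPT[i][j - 1])    # leave y_j unmatched
    return OPT[m][n]

# Example costs (illustrative): gaps cost 2, mismatches 1, matches 0.
# alignment_cost("occurence", "occurrence", 2, lambda a, b: 0 if a == b else 1)
</code>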

However, in a real-world scenario, we are not going to be given the target to compare with what we type in. In other words, how can the computer know to compare "occurence" with "occurrence"? I guess this is another interesting problem in searching.

Readability is 7.

Section 7: Sequence Alignment in Linear Space via Divide and Conquer

This section gives an improvement of the algorithm from the previous section. In the real world, especially molecular biology, there are many elements to be aligned, and the algorithm from the last section takes up too much space (O(mn) for the table). Hence we hope to find a more space-efficient algorithm. The first idea that comes to mind (in the book) is to record only the current column and the one right before it, because those contain the only information needed to compute the optimal value; a sketch of this building block follows below. However, a problem appears: we cannot recover the path leading to the optimal value that way. So the book further elaborates the idea using divide and conquer. The algorithm is based on this theorem: let k be in {0, …, n} and let q be an index minimizing f(q, k) + g(q, k), where f(q, k) is the length of the shortest path from (0,0) to (q, k) and g(q, k) is the length of the shortest path from (q, k) to (m, n); then there is a corner-to-corner path of minimum length that passes through (q, k). This, together with running the computation backward from (m,n) to (0,0) as well as forward in the original way, allows us to use divide and conquer. One of the sequences, say the one of length n, we always divide in half; the other, of length m, we divide based on the value of q. Along the way we record each q. It indeed takes only O(m+n) space. The recurrence is T(m,n) <= cmn + T(q, n/2) + T(m-q, n/2) = O(mn), so the running time is roughly the same as before. Hence we achieve our goal: shrink the space and maintain the running time. I really like the idea, and the readability is 9.
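
A sketch of that two-column building block, assuming the same delta and alpha as in the previous section's example; the divide-and-conquer step runs this forward from (0,0) and backward (on the reversed strings) to find q. Names are my own.

<code python>
def last_column_costs(x, y, delta, alpha):
    m = len(x)
    prev = [i * delta for i in range(m + 1)]   # column j - 1; column 0 is gaps
    for j in range(1, len(y) + 1):
        cur = [j * delta] + [0] * m            # column j; row 0 is all gaps
        for i in range(1, m + 1):
            cur[i] = min(alpha(x[i - 1], y[j - 1]) + prev[i - 1],  # diagonal
                         delta + prev[i],       # gap in y
                         delta + cur[i - 1])    # gap in x
        prev = cur                              # drop everything older, O(m) space
    return prev    # prev[i] = f(i, n); g comes from the same run reversed
</code>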

Section 8: Shortest Paths in a Graph

This is a variation of the shortest paths in a graph problem, which we earlier solved only for positive edges. In this extended problem we allow negative-weight edges; however, we still do not allow a negative cycle, because a path could go around such a cycle forever, driving the cost toward negative infinity and becoming meaningless. With negative edges, it is easy to come up with a counterexample showing that our old greedy algorithm (Dijkstra's) no longer works. The book then tries to modify the greedy algorithm by adding a constant M, big enough to make every edge positive, which is easily seen to be wrong too. Hence we need to turn to dynamic programming. The recurrence is:

OPT(i, v) = min( OPT(i-1, v), min_w ( OPT(i-1, w) + c_{vw} ) ),

that is, for each i we compute the optimal solutions using paths of at most i edges. The calculation, as usual, is built up from smaller subproblems. The running time, calculated roughly, is O(n^3), though a closer look refines it to O(mn). This algorithm, like many other dynamic programs, is space-consuming. To save space, we keep a single value per node, and as we go we record the next node leading to the optimal solution; in this way it takes only O(n) space. In addition, to save both time and space, we set a stopping signal: when two consecutive rounds produce the same values, it is time to stop, because any further update will not change anything. The readability is 8.
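
A sketch of the space-efficient version, assuming edges are given as (v, w, cost) triples meaning an edge from v to w, and M[v] is the cost of the best known path from v to the target; names are my own.

<code python>
def shortest_paths_to(target, nodes, edges):
    INF = float('inf')
    M = {v: (0 if v == target else INF) for v in nodes}
    nxt = {v: None for v in nodes}          # first hop on v's current best path
    for _ in range(len(nodes) - 1):         # at most n - 1 rounds are needed
        changed = False
        for v, w, cost in edges:
            if M[w] + cost < M[v]:          # path through w improves v
                M[v] = M[w] + cost
                nxt[v] = w
                changed = True
        if not changed:                     # stopping signal: nothing moved
            break
    return M, nxt
</code>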

Section 9: Shortest Paths and Distance Vector Protocols

This section introduces an algorithm that improves the shortest paths algorithm from Section 8. First we observe a shortcoming of the original algorithm: it is pull-based, and in each iteration every node contacts all its neighbors and "pulls" the new value from each of them; however, if a neighbor's value has not changed, there is no need to pull it again. The new algorithm is push-based: no updates are sent until the value of a node changes, and then the update is pushed to its neighbors. This method is called the distance vector protocol. The book writes down the detailed algorithm, as usual, and then continues to discuss some concerns with the distance vector protocol. In a real-life scenario, if one of the edges is deleted, totally cutting off all connections between a node a and the destination node b, the distance vector protocol will keep updating M[a] and never realize that no path is available. This, of course, can be problematic. The way to solve it is to keep track of the bigger picture: each node stores not just its neighbors' values but the entire path it is using to the destination, ensuring a path actually exists. The extra storage takes more space but is indeed necessary. Readability is 7.
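
A simplified sketch of the push-based idea, assuming nonnegative edge costs and ignoring the asynchrony of a real network: a node is re-examined only when some neighbor's value actually changed. The names (cost, in_neighbors) are illustrative, not the book's.

<code python>
from collections import deque

def distance_vector(target, nodes, cost, in_neighbors):
    INF = float('inf')
    M = {v: (0 if v == target else INF) for v in nodes}
    active = deque([target])                # nodes whose value just changed
    while active:
        w = active.popleft()
        for v in in_neighbors[w]:           # push w's new value to each v
            if M[w] + cost[(v, w)] < M[v]:
                M[v] = M[w] + cost[(v, w)]
                active.append(v)            # v changed, so v pushes next
    return M
</code>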
