This is an old revision of the document!

Eric's Journal

Week 7's Reading Summary:

4.7 Clustering

the problem

* have to classify/org a collection of objects

distance function of similarity or differences
clusterings of max spacing: k-clustering of U is a partition of U into k non empty sets C
how to find a U with max spacing of clusters?

>designing the alg

graph H on U, grown edge by edge and actually a union of trees
single-link clu, special case of hierarchical agglomerative clu
run Kruskal's alg and stop before it adds last edge
>analyzing the alg
4.26: the components C1…Ck formed by deleting the k-1 most expensive edges of the MST T make a k-cl of max space
proof:C is most exp edge (mee); since C and C' are not same, some points in C are not in Cr; then KA added all edges of a path pi-pj before we stopped it; but then there's some p and p' that must belong to different sets, so spacing is at most d(p,p')⇐d* pg 161

4.8 Huffman Codes and Data Compression

the problem

* encoding symbols using bits: say we set four bits for each letter of the alph… but fixed bits isn't eff because not all letters are used frequently; reducing av num bits/letter is a fund prob of data compression, and here's where compression algs come into play

var-len encoding schemes:Morse code was encoding of letters into dots and dashes with a pause in between letters to disambiguate; so the encoding is three parts: 0, 1, and space
prefix codes: no encoding is a prefix of any other, so go left to right, when enough bits match a letter, delete those bits and write down the letter; not ambiguous anymore so no need for space
optimal prefix codes: where the average number of bits per letter is min ABL(y) = ExinS |y(x)|

>des the alg (greedy again)

rep prefix codes using bin trees

* 4.27: the enc of S constructed from T is a prefix code

Proof: path from root to x would have to prefix of path from root to y but since x is a leaf this can't be
Other direction: start with root, all x in S enc beg with 0 is in left, then all y in S enc begin with 1 to right of root, then recurse
len of enc of letter in S is same as leng of path from root to that node in bin tree
4.28: the bin tree corr to opt prefix code is full
proof: T is BT corr to OPC, and has node u with one child v; convert T to T' by replacing u with v; so pref code of T' has smaller av nb/l than for T, contra opt of T

»a first attempt: the top-down approach

split S into S1 and S2 the rec construct prefix codes for S12 ind, then make these two subtrees of the root
Shannon-Fano Codes in info theory: splitting a set into two subsets of equal frequency

»what if we knew the structure of the OPC?

4.29: suppose u and v are leaves of T* and v is on a level under u; u is labeled with y in S and v in z in S; then fy>=fz
proof: using exchange of fy and fz, dep(v)-dep(u) * fy-fz = overall sum; if fy<fz the sum is neg, contradicting OPC we had before the exchange pg 170
so take all leaves of d1, label them highest freq letters, then d2, next highest freq and so on
4.30: w is a leaf of T*
if w weren't then some w' in subtree below it, that has greater depth than v, contradicting that v is a leaf of max dep in T*
4.31: there's OPC with corr tree T* in which the 2 lowest freq letters are assigned to leaves that are siblings in T*

»an alg to construct OPC; Huffman's alg (greedy because it's same frequencies and repps but with as few letters repping as possible; bottom up as opposed to Shannon-Fado top down) to con a pref code for alph S with given freqs: if S has 2 letters: enc one letter with 0, the other with 1 else let y* and z* be lowest FL form new alph S' without y* and z*, replaced w or the sum of their freqs recursively construct pref code y' for S' with tree T' pref code for S is: start with T', take leaf labeled w and add two children y* and z* to it

analyzing the alg
its optimality

* 4.32: ABL(T') = ABL(T) - fw

proof on pg 174, lots of too fancy symbols
4.33: Huffman code for a given alph gets min avB/L of any prefix code
proof by contr: T not opt; some Z exists in which leaves y* and z* are siblings. Del those from z and make w, then Z' defines pref code for S'; then do it to Z' to get Z; but we get ABL(Z')<ALB(T') which contradicts optimality of T' as a prefix code for S'

»impl and RT: k^2; but using pq, it can be klogk

extensions

* BW images where one color overwhelms the other can be compressed using arithmetic coding algs

adaptation can be a problem if frequencies of letters change depending on their location in total encoding; this can be solved by placing halfway into the sequences an instruction that says the encoding changed
investing small amt of space can save big as it reduces avB/L in the long run (adaptive compression schemes)

(CHAPTER 5: DIVIDE AND CONQUER) 5.1 a first recurrence: the mergesort alg

template for mergesort and other common divide and conquer algs: divide input into 2 equal pieces, solve the 2 subproblems by recursion, then combine spending only linear time for initial division and final recombining
5.1 recurrence relation: for some constant c, T(n) ⇐ 2T(n/2)+cn when n>=2 and T(2) ⇐ c; this specifies T(n) in terms of its values or smaller inputs, and we now need to get T only on the left hand side
designing eff DAC algs is heavily intertwined with understanding how recurrence relation determines a RT

>approaches to solving recurrences: 1) unroll using rec by finding RT of first few levels and then finding a pattern then summing it all up to get total RT or 2) plugging-in a guess into the recurrence relation and seeing if it works

unrolling the mergesort recurrence: 5.2: any function T satisfying 5.1 is bounded by nlogn when n>1
subbing a solution into the MR: T(n) = 2c(n/2)log(n/2)+cn=cnlogn - cn_cn = cnlogn pg 213
an approach using partial substitution (weaker of the two):T(n)⇐2k(n/2)log(n/2)+cn = kn(log2-1)+cn = T(n) ⇐ knlogn - kn + cn ⇐ knlogn

Week 6's Reading Summary:

4.2 Sch to Min Lateness: An Exchange Argument

The problem: one resource to meet n requests for an interval of time. Each req has deadline d. Lateness means the deadline was missed. We want to minimize the max lateness of a request

Designing the Alg

* Order the jobs in order of increasing length t to get the short jobs done first; but this completely ignores d, so it's bad

Order the jobs according to increasing slack time di-ti; ineff too
Optimal: sort by inc order of deadlines

consider jobs i=1…n in this order

  assign i to the time int from s(i) = f to f(i) = f+t
  let f=f+t
return the set of sch ints s(i),f(i) for i=1...n

>Analyzing the Alg

4.7 There is an optimal sch with no idle time. Proving by exchange argument, A' has an inversion if dj < di but di is sch first to be done
4.8 All schs with no inversions and no idle time have the same max lateness.
Proof: Two schs can only differ in the order in which jobs with identical deadlines are scheduled d. Jobs with d are sch consec. Among the jobs with d, the last one has the greatest lateness, and this lateness doesn't depend on the order of the jobs
4.9 There is an optimal sch that has no inversions and no idle time.

Proof: a) If O has inv, then there's some job with dj < di b) After swapping i and j we get a sch with one less inversion c) the new swapped sch has a max lateness no larger than O. (i cannot be more late in the sch O than j was in the sch O)

4.10 the sch A produced by GA has optimal max lateness L, by 4.9 and 4.8

>Extensions: say all jobs have same s, or if they have release time r before which they can't be completed

4.4 Shortest Paths in a Graph

problem: given start node s, what is the shortest path from s to each other node?
Designing the Alg / Dijkstra's Alg

Dijstra's Alg(G,l)

let S be the set of explored nodes
  for each u in S, store distance d(u)
S = {s} and d(s) = 0
while S != V
  select node v not in S with at least one edge from S for which d'(v) = min[e(u,v)]d(u) + le is as small as possible
  add v to S and define d(v) = d'(v)

>Analyzing the Alg

4.14 Consider the set S at any pt in the alg's exec. For each u in S, the path Pu is the shortest s-u path

Proof: S=1 is easy cuz both S and d = 0. Induction hypothesis: Pu is shortest su path for each u in S. Let y be the first node on P that is not in S and let x in S be the node before y. P can't be shorter than Pv because it is already at least as long as Pv by the time it has left S *

Ob 1: the alg does not always find shortest paths if some of the edges can have negative lengths
Ob 2: DA is a continuous version of standard BFS

>Imp and RT: it is O(mn) but can be improved by maintaining d'(v) for each node, and then having a pq for the nodes V-S; then have ChangeKey and ExtractMin ops -4.15 Using a pq, DA can be imp on a graph with n nodes and m edges to run in O(m) time, plus the time for n ExtractMin and m ChangeKey ops

4.5 The Min Spanning Tree (MST) Problem

Prob: connect n nodes with the fewest (and cheapest) edges possible

* 4.16 let T be a min cost solution to the network design problem defined above. Then (V,T) is a tree Proof: (V,T) must be connected; if it contained C, then (V,T-{e}) is also a valid solution to the problem and cheaper than taking the long way around the remainder of C; but that's a contradiction, so it has no C and is a tree

We're trying to find the cheapest spanning tree for some graph G, so that's why the name is the MST problem

>Designing Algs

Kruskal's Alg: Start w no edges then build spanning tree by inserting edges from E ioo inc cost as long as it doesn't create a cycle
Prim's Alg: start with root s and add the node that can be attached as cheaply as possible to the partial tree we already have
Reverse-delete Alg: start with full graph (V,E) then delete edges ioo decreasing cost as long as it doesn't disconnect the graph

>Analyzing the Algs

4.17 assume that all edge costs are distinct. let S be any subset of nodes that is neither empty nor equal to V and let e=(v,w) be the min cost edge with one end in S and the other in V-S. Then every MST contains e

Proof: we can identify edge e' in T that is more expensive than e, and then exchanging e for e' results in another spanning tree cheaper than T

4.18 KA produces MST of G

Proof: v in S but w not in S, no oedge from S to V-S has been encountered yet or else KA would have added it; hence e is the cheapest edge with one end in S the other in V-S and by 4.17 belongs to every MST

4.19 PA produces MST of G

Proof: in each iteration of the alg there's v and e that minimize q; 4.17

4.20 assume that all edge costs are distinct. let C be any cycle in G and e=(v,w) the most expensive edge of C. then e does not belong to any MST of G

Proof: delete e from T, partitioning two components S and V-S. follow P from v to w, from S to V-S, and we find some edge e' on P that crosses over; by 4.17 T' is a spanning tree of G, and since e is the most expensive of C and e' belongs to C, e' is cheaper than e and so T' is cheaper than T

4.21 the RDA produces a MST of G

Proof: any time e is removed, it lies in C; it is the most expensive in C be def of RDA so by 4.20 it is not part of a MST

What if edges weren't distinct cost? simple solution is to perturb equal edges by some really small numbers, so that relative to the rest of the graph they will be in the same, correct places

>Implementing Prim's Alg

4.22 using a pq, PA can be impl on a graph with n and m to run in O(m) time, plus time for n ExtractMin and m ChangeKey ops. (This is just like Dijkstra's)
Extensions: point to point distances in spanning trees, or congestion on edges (where we don't want an edge to carry too much traffic), or asking whether the spanning tree is ideal for network design since deleting one edge breaks connectivity

4.6 Impl KA: the Union-Find Data Structure

The problem. The UF ds will support 3 ops: 1) MakeUnionFind(S) will return a UF on S where all elements are in sep sets; the goal for this is O(n) where n = |S|. 2) Find(u) will return the name of the set containing u. Goal is O(logn) although K is possible. 3) Union(A,B) will change the ds by merging A and B into one set; Goal is logn
A simple ds for uf

*maintain array Component

maintain a list of all the elements in each set
choose the name for the union to be the name of one of the sets so that Components[s] will only have to update one and not both the sets
keep additional array size
4.23 consider the array impl of UF for some S of size n, where unions keep the name of the larger set. Find is k, MakeUnionFind n and any sequence of k Union operations takes at most klogk time

>A bs for uf

gonna use pointers; when we take the union of 2 sets, update u's pointer to point to v but not the pointers at the other nodes of set B; old name to new name progress
Union is now k, but Find is no longer constant but can be up to n; have to keep the name of the larger set to avoid this
4.24 consider the above pointer-based impls of UF for S of size n where unions keep the name of the larger set. Union is k, MakeUnionFind n and Find logn time

Proof:every time the name of the set with v nodes changes, size of this set doubles, since the set containing v starts at 1 and is never larger than n, its size can double at most logn times so there is at most logn name changes

Further improvements: compression idea is to make subsequent calls to Find cheaper by bounding the total runtime for a sequence of n Find ops rather than worst case time for any one of them. Inverse Ackermann function
Impls KA: sort edges by cost mlogm, then UF to maintain conn comps of V,T and compute Find(u) and Find(v) over and over for a total of at most 2m Find and n-1 Union ops

*4.25 KA can be impl on a graph with n nodes and m edges to run in O(mlogn) time

Week 5's Reading Summary:

CHAPTER 4: Greedy Algorithms

How to define it? Small steps, local steps that maximize work each time so that the bigger picture/problem is solved in the fastest time or in the smallest number of steps; I guess they're extremely efficient algs
Guaranteed close to optimal, and often times optimal
Approaches: 'stays ahead' and 'exchange'
Apps: shortest path in a graph, Minimum Spanning Tree Problem, Huffman codes for data compression, clustering, Minimum-Cost Arborescence Problem

4.1 Interval Scheduling: GA stays ahead

Interest on 3.4-4.1: 8/10. The book is surprisingly helpful because my first exposure to these concepts is in class, so when these readings are due it's more like reviewing or at least visiting something that has vaguely imprinted on my brain. I put a 7/10 originally, but then realized that GAs would probably be great to know in the future because they're very practical - anytime many tasks need to be done in a crunched time (meaning always haha) they come in handy and automate the scheduling process for us.

We want to max number of compatible subsets; when max size, it will be called optimal
Designing a GA

-Select request i1, reject incompatibles, then i2 and so on until all requests met -Some rules: 1) always select available request that starts earliest (not opt cuz earliest request might be the longest) 2) request that takes shortest time to complete (bad if it lies over the edges of two other reqs) 3) for each req, count the number of reqs not comp and start with the fewest incomp req, go by the earliest finish times Alg: let R be set of reqs and A empty while R not empty choose req i w/ smallest finishing time add i to A del all reqs from R not comp with i

Analyzing the alg

-A is a compatible set of requests (otherwise we could not even start the rest of the alg) -Just show that A contains the same # of ints as Optimal and hence is optimal itself; 'stay ahead' -Fact (that we have to show): for all indices r⇐k, we have f(ir)⇐f(jr). The finish of r-1 ⇐ start of jr, meaning jr is in the set R f available intervals at the time the GA selects jr, and so f(ir) ⇐ f(jr) and always will be like that until all the intervals are checked through -Fact: the GA returns optimal set A -Proof: BWOC. If A is not optimal, then A has more requests m>k. Req j(k+1) in O starts after jk ends, and hence after ik ends too. So if we delete all requests not compatible with reqs i1…ik, R still contains j(k+1), but the greedy alg is set to stop after request ik /when set is empty/. -Implementation and Running Time: nlogn. ordering is logn, iterating through intervals and selecting is n

Extensions. Some issues that can arise in its impls are:

-We might not know all the reqs when we start the alg; online algs have to make decisions at the present as time proceeds, with no knowledge of future input -Prioritized requests change their order as we have it currently

A Related Prob: Scheduling All Intervals: if we have many identical resources available and must schedule all reqs with fewest res possible

-Prop: in any instance of Interval Partitioning, the # of res needed is at least the depth of the set of intervals. -Proof: suppose a set of ints has depth D, and I1 to Id all overlap; then each of I must be on different resource, at least d of them -An alg for scheduling all intervals with fewest resources sort ints by start time, breaking ties arbitrarily for j-1…n for each int i preceding and overlapping j, exclude i from consideration for j if any label has not been excluded then assign a nonexcluded label to j else leave j unlabeled -Fact: using GA above, every int will be assigned a label, and no two overlapping ints will receive the same label -Proof: argue no iterval ends up unlabeled Ij, and t intervals earlier overlap it. So t+1 intervals pass some common pt on a time line and ⇐ d. At least one of the d labels is not excluded by t and can be assigned to Ij. Then claim no two overlapping ints have the same label; if I and I' overlap, I is in the set of intervals whose label is excluded from consideration, so the alg will not assign the label for I to I' -Fact: the GA schedules every interval on a resource, using a number of resources equal to the depth of the set of intervals. This is the optimal number of resources needed

Question: In this section we take the approach of comparing in some way the GA to what I'll call “the Optimal Algorithm.” But practically speaking, if we already have the OA in the first place to compare GA to, why don't we just stick with the OA?

3.4 Testing Bipartiteness: An App of BFS

Bip graph is one where node set V can be partitioned into sets X and Y so that every edge has one end in X and one end in Y
Fact: if graph G is bip, it cannot contain an odd cycle
Designing the alg: assume G is connected, pick any node s in V and color it red, color its neighbors blue, then alternate; this is identical to BFS but just with an extra array Color
Analyzing the alg

-Let G be connected graph with layers L1, L2…. Then exactly one of the following is true: 1) no edge of G joins two nodes of the same layer or 2) G is not bipartite and has odd-length cycle -Proof: first consider case i where no edge joins two nodes of same layer. every edge of G joins nodes either in same or adjacent layers though. so because of i, every edge here joins two nodes in adjacent layers. but our coloring procedure makes it so that every edge has ends of opp colors, so G must be bipartite (and once again, has an odd cycle)

3.5 Connectivity in Directed Graphs

Repping Dir Graphs: adjacency list, each node has 2 lists assoc with it (one to which and one from which edges)
The Graph Search Alg: use BFS and it is m+n runtime. This computes the set of all nodes t with the prop that s has a path to t (although not always vv). DFS is similar impl. To get a graph of all nodes that reach s, simply get a graph that is the reverse of G
Strong Connectivity (if there is a path from u to v and vv for all two nodes; mutually reachable)

-Fact: if u and v are mut reach, and v and w are mut reach, u and w are mut reach -Proof: to get to w, first we go u to v, then from v to w. these paths are guaranteed by mutual reachability of the nodes. Then we just reverse the reasoning to go from w to u -Strong component containing node s in G: the set of all v such that s and v are mut reach -Fact: for any two nodes and t in a dir graph, their str comps are either identical or disjoint -Proof: consider any two mut reach nodes s and t; claim their strong components are identical; then use the 2nd closest to here thm.

3.6 Directed Acyclic Graphs and Topological Ordering

If no cycle, then the graph is actually a tree
Dir graph w/ no cycles is called a directed acyclic graph
Can be used to model precedence relations/dependencies; node for each task, and dir edge to show which task needs to be done first; use topological ordering
Fact: if G has a topo ord, then G is a DAG

-Proof: BWOC G has top ordering and cycle C (this is the contradiction part to prove). But then we can have node vi and vj before vi, even though j > i; so this contradicts topo ordering

Designing and Analyzing the Alg (proving converse of above thm)

-Fact: in every DAG G, there is node v w/ no incoming edges -Proof: (sort of BWOC) let each node in G have at least one incoming edge. How to find a cycle to prove our claim: pick v and go backward from v to u, then x, and so on. Keep going, and after n+1 steps we will have visited some node twice, which means the sequence of nodes we visited makes a cycle -Fact: If G is a DAG, then G has a topo ordering Find node v w/ no inc edges as first Delete v from G Recu compute topo ord of G-{v} and append this order after v -We declare a node to be active if it has not yet been deleted by the alg, and we explicitly maintain two things 1) for each node w the number of inc edges that w has from active nodes 2) the set S of all active nodes in G that have no incoming edges from other active nodes

Week 3's Reading Summary:

Interest: this week's 3 sections were pretty interesting for me. I'd say 8/10 because personally I tend to care more things when I can see where it would be used in real life, and graphs are everywhere in my life. I know I should care about theoretical stuff too though and am getting there.

3.1 Basic Defs and Apps

- Graph: pairwise relations among set of objects of V nodes and E edges - There are symmetric (undirected) and asymmetric graphs (directed) with u as a tail and v as the head - They're abstract and hard to define, so here are some examples of graphs: transportation networks (airports, train stations, taxi stations), communication networks (b/t computers, internet providers, networks and routing), information networks WWW), social networks, dependency networks (classes, bank tellers, food ordering) - Paths (connectivity) is moving from one node to another through connected edges. Simple paths mean no node is visited twice, cycle is path that starts and ends in same place - Connected means every pair of nodes u and v has a path between them; distance between two nodes is the min number of edges connecting them (aka short path) - Tree - an undirected graph that is connected and doesn't have a cycle. Root is “top” of rest of nodes, parent precedes child with edge distance of 1, with ancestors and descendants being distances greater than one; leaf has no descendant → concept of hierarchy captured in rooted trees - Fact: every n-node tree has exactly n-1 edges - Statement: G is an undirected graph of n nodes. any two of the following implies the third - g is connected, does not contain a cycle, has n-1 edges

3.2 Graph Connectivity and Graph Traversal

- Problem of connectivity: is there a path from node s to t in graph G? - Breadth-First Search - explore outward from s in all pos directions, adding nodes one layer at a time (find all nodes joined by one edge to s)

L1 is the neighbor of s, which is L0. For L1 to Lj, Lj+1 consists of all nodes not belonging to an earlier layer that have an edge to a node in Lj
Fact: For each layer j>=1, Lj produced by BFS consists of all nodes distance j from s; there's a path from s to t iff t appears in some layer
Fact: Let T be BFS tree, x and y belonging in layers Li and Lj, and x-y is an edge of G. The i and j differ by at most 1

Proof: suppose by way of contradiction that i and j differed by more than 1; since x belongs to layer Li, the only nodes disc from x belongs to Li+1 and earlier; hence if y is a neighbor of x it should have been discovered by this pt at the latest and hence should belong to Li+1 or earlier - Exploring a Connected Component - set of nodes BFS alg finds reachable from s

Alg: R consists of nodes to which s has a path Initially R = {s} While there is an edge u v and u in R but v not in R: Add v to R
Prop: the set R produced is precisely the connected component of G containing s. Proof: we know for any node v in R there's path sv. consider node w, and that BWOC there is sw path P in G. s is in R but w is not so there must be a node v on P not in R; node v is not s; so there is a u before v on P, and edge uv exists; u is in R; uv is an edge where u is in R but v is not in R, which contradicts the stopping rule

-Depth-First Search - follow first edge until dead end, then backtrack; the “maze algorithm”

Alg: DFS(u): Mark u explored and add u to R For each edge uv incident to u: if v is not explored then recursively invoke DFS(v) > Prop: for a given recursive call DFS(u), all explored nodes b/t invocation and end of recursive call are descendants of u in Tree > Prop: For T, nodes x y in T, and x-y an edge of G that isn't an edge of T, either x or y is an ancestor of the other. Proof: xy is an edge of G not of T, x is reached first by DFS. xy is not added to T because it's marked explored. y was not explored earlier, and must have been discovered b/t invocation and end of the recursive call DFS(s) - so it must be a descendant of x

- The Set of All Connected Components - For any two nodes s and t, their connected components are either identical or disjoint. Proof: path st in G; claim conn comp of s and t are same set; so there's a path from t to v and v to s. If there's not path st, there's no v connected comp of each, because then we could walk from sv and then to t, but there is no path

3.3 Implementing Graph Traversal Using Queues and Stacks

- Repping Graphs - two basic ways are adjacency matrix and adjacency list

Goal run time is O(m+n) where m is the number of edges; note that since m>=n-1, we call O(m+n) linear
Ad matrix: nxm matrix A where A[u,v] is equal to 1 if the graphs contains edge uv and 0 otherwise. It's symmetric for undirected graphs. Two disads are it takes Theta(n^2) space no matter what, and takes O(n) time to check all edges incident to a node no matter what node)
Ad list (good for sparse (fewer than n^2 edges) graphs): actually an array where the index is the node and value is a list of index's adjacent nodes. Takes just O(m+n) space
Ad matrix is constant access, and ad list linear, but ad list is more natural rep of exploring graphs cuz physically it's like learning the neighbors of u when you arrive at u, and once there it is constant time although getting there might be linear
Proof: edge vw contributes twice to the sum of degrees in a graph: once in nv and once in nw. sinc the sum is the total of contributions of each edge, it is 2m

- Queues and Stacks - used when order matters in keeping records

Queue is FIFO and stack is LIFO; both repped as doubly linked list cuz is has explicit First and Last pointers

- Implementing BFS: set only s true, L0 consisting only s, layer counter i=0, current Tree T as empty. while Li not empty, initialize empty list L[i+1] and for each node u in L[i], if edge uv is false then set it to discovered true and add it to tree T and v to list L[i+1]. Then increment layer counter i by one.

Prop: the above impl of BFS runs linear if the graph is given by the ad list rep. n for the for loop n times, and constant considering of each edge of each vertice u times using an additional array Discovered set up outside the for loop

- Implementing DFS: stack S has one element s. while S is not empty take node u from S and if u is false, set to true and add each edge v incident to u to S

Prop: the above alg implements DFS in the sense that it visits the nodes in exactly the same order as the recursive DFS procedure in the previous section (except that the ad list is in reverse order to this)
Prop: the above implementation of DFS runs in O(m+n) time if the graph is given by the ad list rep

-Finding the Set of All Connected Components: start with some node s, use BFS or DFS, then find node v not visited by search from s and use BFS and DFS (v is disjoint from s by def). Linear runtime

Question: how hard is it to set up a tree, graph, stack or queue that has all of the functions specified in these sections? Not the O notation for setting them up but the real-world time and rules for them? Maybe they just seem complicated right now but are not that bad.

Week 2's Reading Summary:

2.3 Implementing the Stable Matching Alg Using Lists and Arrays:

-It's good to use data structures to estimate the run time of algs rather than breaking down every tiny step. We'll use lists and arrays for the GS alg

-Need to rep ranking and who is matched with whom and who is free

Arrays and Lists

-Keep a list of n elements as an array of length n, where the A[i] is the ith element of list. If ordered, then search is logn time

-Linked list for the list of men since ll is more dynamic and flexible than array. Create the list and also a pointer that goes to the current pos of the last operation (has deletion and insertion)

Implementing the Stable Matching Alg

-Have an all men or all women array, but two arrays for the preferences of both genders. The goal is that each iteration of the GS alg runs in constant time so that it runs in time proportional to the upper bound of n^2 and no worse

-What has to be C time: who is a free man (linked list, select first man on that list), who m should propose to next (use extra array Next that indicates for each m the pos of the next woman to propose to), is w engaged (use array Current where w is the index and the value is her man), if w prefers m or m' (trickiest of the four: create nxn array at start ofalg, where Ranking[m,w] contains the rank of man m in sorted order of w's preferences. Creation is linear time for each w, and to decide if w prefs m or m' just compare [w,m] with [w,m']

2.4 A Survey of Common Running Times:

-“Natural space” of a prob: the set of all possible solutions. Nat space and run time are the two bounds to consider when encountering a new problem Linear Time (rt is at most a constant * size of input)

-Computing the Maximum - a linear pass, where constant per element adds up to O(n)

-Merging Two Sorted Lists A and B

(Maintain 2 pointers init to front els While both lists not empty: append smaller of A or B to new list, advance Current pointer of which was smaller If one list is empty, put all the els of the other to the new list) Here's why it's n and not n^2 although the latter is technically correct: count the cost of each iteration of the selected element as constant because once it is first selected (“charged”) it is given to the new list and not seen by the alg again → 2n iterations at most = n.

-nlogn time: the rt of any alg that splits input into 2 even pieces and recurses before combining in linear time. mergesort, or algs whose expensive step is sorting

-Quadratic Time alg for finding closes pair of points using nested loop (For each x,y

For each other x, y
	compute distance formula
	update the min accordingly)

As the class progresses though, we'll find a way to do the above task in nlogn and then… n. say wha??

-Cubic time: 3 nested loops

-n^k time alg for checking if there's an edge between two nodes (For each subset s in k, check is S is independent set and if yes, we're done If no k-node ind set was found, declare failure)

-Beyond Polynomial Time:

2^n is a search space for many problems, so if we encounter it, then don't worry because most likely that actually means there's a better alg out there to improve the op. The “dichotomy” of brute-force algs and their alternatives
n! is the number of ways to match n items with n other items (think of this as perm n from 121)
Sublinear Time: possible when input can be “queried” and not read completely. An example is binary search where we prob entries by group of “active regions.” This happens when an alg does a C work to throw away a constant fraction of the input

2.5 A More Complex Data Structure: Priority Queues:

More complex data structures take nontrivial processing when invoked but afterwards lead to highly efficient algs that would not be using simpler data structures

The Prob (have a changing set S of the free men in the StabMat)

-PQs support addition and deletion and each element has a key that denotes the priority of it. They are used in real life whenever a schedule of processes must be made while taking turns -Why aim for logn time / op? Proof that “a sequence of O(n) pq ops can be used to sort n numbers”: set up H by inserting each number with its value as key. Then simply extract min one by one. Thus n numbers times logn ops per number gives nlogn which is the best we can hope for at this point

A Data Structure for Implementing a PQ

Heap - balanced bin tree with a root, node with up to children, and child. Heap order means child is at least equal to or bigger than parent. We can use pointers or arrays to rep heaps-repping-pqs depending on if we know that total number of elements that will be in it at any one time. left child is 2i and right child is 2i + 1 where i is the index of any node.

-Implementing Heap Ops. Start heap n, insert logn, findmin C, delete logn, extractmin logn. Heapify up (H, i): (if i>1 j is parent, and if child smaller than parent swap up.recursive heapify up(H,j). Swapping is C. We claim that after each swap array H is either a heap of almost a heap with the key of H[j.

-The height of tree is the max number of recursions. Add new elements to the end of the array and then heapify up. To delete, we only allow deleting the root; replace the root with the last element and then heapify up or down

-Heapify down (H,i): (n = length(H). if 2i>n, done. else 2i<n set j to be the smaller of i's left and right child. else 2i=n, set j to 2i. endif if child is smaller than parent, swap them, then call heapify down (H, j)).

-Implementing PQs with Heaps (same as implementing heap ops mentioned above).

-A second category of ops is finding elements by name rather than pos (Does this mean value rather than key?). To do this have additional array Pos that stores the current pos of each el/node in the heap. To delete, it's Delete(H,Pos[v]) and to change key, find where v is in Pos, change the key and then use heapify up or down as appropriate

Week 1's Reading Summary:

Preface:

Algorithms apply not just to cs. They aren't just a matter of solving a problem, but are about how to solve a problem the most efficient way (what the authors are hoping this book will teach me). -Once you find the 'mathematically clean core of a problem', then design the appropriate structure to solve that problem. '[Algs don't] just provide solutions to well-posed problems; they form the language that lets you cleanly express the underlying questions.' So more than as a tool for problem solving, they allow elegant design of algs and then in turn programs.

1.1 Stable Matching Problem:

Interest: 8/10 Gale and Shapley, two mathematicians/economists, wanted to see if any process involving an applicant and acceptor-or-rejector could be automated based on the preference ordering of both parties. It's tricky though: if self-interest allowed the parties to drop their current partner, chaos would ensue. This alg is in use today in hospitals. -Make it the 'bare-bones version', where we can assume things to go right (like for men and women, there are an even number of them); the goal is to have no instabilities, where E and A, or M and W, won't have motivation to ditch the other.

Pseudocode alg: All m w are free (with w having 2 arrays) While there is a free man who hasn't proposed to every woman (choose m choose highest w m hasn't proposed to yet if w is free, engage both else if w prefers someone else m is free, or if if w prefers m to current man they become engaged and her current m is free) return to set s of engaged pairs

Runtime can be n^2 if we set the w up with two arrays that are inverses of each other for the men they prefer. Otherwise it is n^3 (as shown in class). pg 7-12 shows that the solution is logically sound. Also see class notes for proofs (Professor, I'm not sure if saying things like the last two sentences are ok. I'm hoping they are for this one, but if not, then in the second journal entry I will type them out here)

2.1 Computational Tractability:

Interest: 7/10

It's about being efficient, not just solving something. The way they're gonna measure efficiency is by run time, and then in terms of memory space. Efficiency: 'an alg is eff if, when implemented, it runs quickly on real input instances'; but more than that general definition, good algs need to be platform-independent, instance-ind, and of predictive values with respect to increasing input sizes. -In general, we start with looking at worst-case rt. The problem with average-case analysis is that 'random inputs' can not capture real, 'average' inputs. Second def of efficiency: …if it achieves qualitatively better worst-case performance, at an analytical level, than brute-force search. -Polynomial running times are characteristic of eff algs (definition 3 of eff). Pg 34 has a table from O(n) best, which is great rt, to n!, which is onion awful.

In general: n > nlogn > n^2 > n^3 > 1.5^n > 2^n > n! (pg 34)

Questions I have: just the notion of how to determine the O of an alg, because, say we need to do an operation to the xth object of some collection, sometimes we count getting to x as part of the runtime, while other times we just count O to be whatever we do once we're there already. How do I know when to count what?

2.2 Asymptotic Order of Growth:

Interest: 6/10 Analyzing algs is done by analyzing how fast the rt grows with each additional unit of input. The function f(n) gives a line repping the worst-case run time of some alg of input size n. 'Granularity' in order to be able to generalize improvements/solutions to inefficient algs. Pseudo-code is enough, no need to microscope an alg. O(n) reps asymptotic upper bounds, Ω(n) reps lower, and θ(n) reps the tight bounds (which is found if the upper and lower bounds are the same or very close to being the same). Asymp growth rate algs have transitivity and sum properties. -Tips on determining asymp bounds. Polynomial: ignore everything but the highest order power. Log: for b>1 and x>0, logbn = O(n^x). Exponentials are terrible and cannot be bounded by a fixed constant c like the previous two can.

Table of Contents