==== Section 4.5: The Minimum Spanning Tree Problem ====

Here, we seek to connect a group of nodes using the cheapest possible set of edges. Each edge has a cost, and our task is to select edges which link every node to at least one other node so that the whole graph is connected. A minimum-cost solution is a set of edges which minimizes total cost, and any minimum-cost solution must be a tree, since a cheapest connected solution can contain no cycles. We should note that for any given graph there may be many possible spanning trees. Many greedy algorithms work for this problem. For instance, we can build a tree by inserting edges in order of increasing cost, skipping any edge that would create a cycle (Kruskal's Algorithm); we can grow a tree outward from a root node s, always adding the node with the cheapest attachment cost (Prim's Algorithm, which proceeds much like Dijkstra's); or we can start with the full graph and delete edges in decreasing cost order as long as removing an edge does not disconnect the graph (the Reverse-Delete Algorithm). All of these algorithms turn on the question of when it is safe to include or exclude an edge. The Cut Property says that the minimum-cost edge crossing any partition of the nodes into two nonempty sets must belong to every minimum spanning tree. The Cycle Property says that the most expensive edge on any cycle in G may not be contained in any minimum spanning tree. The fundamental reason why there are so many optimal algorithms for this problem is that any algorithm which respects these two properties will always produce a minimum spanning tree. To handle situations in which some costs are equal, we simply perturb the costs beforehand by tiny, distinct amounts so that all edge costs become unique; this simplifies our proofs without changing the answer. Prim's and Kruskal's algorithms may both be implemented to run in O(m log n) time. This section was somewhat hard to understand but made sense overall: 8/10.
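The Prim-style approach above (grow a tree from a root, always attaching the cheapest outside node) can be sketched in Python with a heap, which is how the O(m log n) bound is achieved. This is a minimal illustration, not the book's code; the small graph `g` at the bottom is a made-up example.

```python
import heapq

def prim_mst(graph, root):
    """Prim's algorithm: grow the tree from `root`, repeatedly attaching
    the node whose cheapest edge into the tree is minimal.
    `graph` maps each node to a list of (neighbor, cost) pairs."""
    in_tree = {root}
    mst_edges = []
    # Heap entries: (cost, tree_node, outside_node).
    heap = [(c, root, v) for v, c in graph[root]]
    heapq.heapify(heap)
    while heap and len(in_tree) < len(graph):
        cost, u, v = heapq.heappop(heap)
        if v in in_tree:
            continue  # stale entry: v was already attached by a cheaper edge
        in_tree.add(v)
        mst_edges.append((u, v, cost))
        for w, c in graph[v]:
            if w not in in_tree:
                heapq.heappush(heap, (c, v, w))
    return mst_edges

# Hypothetical 4-node graph; the MST uses the edges of cost 1, 2, and 3.
g = {
    'a': [('b', 1), ('c', 4)],
    'b': [('a', 1), ('c', 2), ('d', 5)],
    'c': [('a', 4), ('b', 2), ('d', 3)],
    'd': [('b', 5), ('c', 3)],
}
print(sum(c for _, _, c in prim_mst(g, 'a')))  # 6
```

Each edge enters the heap at most twice (once per endpoint), so the loop does O(m) heap operations of O(log n) each.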

==== Section 4.6: Implementing Kruskal's Algorithm: The Union Find Data Structure ====

To implement Kruskal's algorithm in optimal time, we must create a new data structure: the union-find data structure. This must support two operations: Find(u) returns the name of the set containing u, and Union(A,B) merges sets A and B into a single set. Note that deleting an edge may split a single set into two, so our union-find structure is not meant to handle edge deletion. To initialize it, we use MakeUnionFind(S), which puts every element of S into its own singleton set. In the array-based implementation, we keep an array recording the name of the set containing each element, which makes Find run in constant O(1) time. We also keep each set's size so that after a Union we relabel the smaller set with the name of the larger one. With this rule, MakeUnionFind takes O(n) time and any sequence of k Union operations takes O(k log k) time in total. This does not mean a single Union is constant; the bound is amortized, since each element's set name can change at most log k times (its set at least doubles with every relabeling). We can instead implement union-find with pointers. With pointers, Union takes only O(1) time and MakeUnionFind still takes O(n) time; the trade-off is that Find now takes O(log n) time, since it must follow a chain of pointers to the set's representative. To implement Kruskal's algorithm, we simply sort edges by cost, use Find to check whether each edge's endpoints lie in distinct components, and take the Union when they do, simulating the edge's addition to the tree. This section was very clear to me and it's interesting to be working with a new type of data structure. Doing so reminds me of CSCI 112 when I discovered linked lists, stacks, queues, etc. 9/10
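The pointer-based structure and its use inside Kruskal's algorithm can be sketched as follows. This is an illustrative Python version (the class and function names are mine, not the book's), and for convenience Union takes two elements and finds their representatives itself rather than taking set names.

```python
class UnionFind:
    """Pointer-based union-find with union by size: Union is O(1) once the
    representatives are known, and Find follows at most O(log n) pointers,
    since a node's set at least doubles each time its chain grows."""
    def __init__(self, elements):            # MakeUnionFind(S): O(n)
        self.parent = {v: v for v in elements}
        self.size = {v: 1 for v in elements}

    def find(self, u):                       # Find(u): O(log n)
        while self.parent[u] != u:
            u = self.parent[u]
        return u

    def union(self, a, b):                   # point the smaller set at the larger
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]

def kruskal(nodes, edges):
    """Kruskal's algorithm: sort edges by cost, keep an edge only when its
    endpoints lie in different components."""
    uf = UnionFind(nodes)
    mst = []
    for cost, u, v in sorted(edges):
        if uf.find(u) != uf.find(v):
            uf.union(u, v)
            mst.append((u, v, cost))
    return mst

# Same hypothetical graph as an edge list: the MST keeps costs 1, 2, 3.
mst = kruskal(['a', 'b', 'c', 'd'],
              [(1, 'a', 'b'), (2, 'b', 'c'), (4, 'a', 'c'),
               (3, 'c', 'd'), (5, 'b', 'd')])
print(sum(c for _, _, c in mst))  # 6
```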

==== Section 4.7: Clustering ====

The clustering problem involves dividing a set of objects into groups so that objects within the same group are similar and objects in different groups are dissimilar. Here, distance is a very abstract concept which could actually mean many things, such as time, not just physical space, so this is overall a very abstract problem. We define the spacing of a clustering as the minimum distance between any two objects lying in separate clusters, and a good clustering into k groups maximizes spacing. To find such a clustering, we can make use of a minimum spanning tree: we grow a graph whose connected components correspond to clusters by adding edges between the closest pairs of points in order of increasing distance, skipping edges whose endpoints already share a component, and stopping once exactly k components remain. Equivalently, we can build the full minimum spanning tree and delete its k-1 most expensive edges. Notice how this is basically a pure implementation of Kruskal's Algorithm, which is part of why minimum spanning trees are so fundamental. Overall, this was a very intuitive and well written section: 9/10.
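A minimal sketch of this "stop Kruskal's early" idea, assuming a made-up distance function and point set (the names `k_clusters` and `dist` are mine). The first edge we would have added between two of the final clusters is exactly the spacing.

```python
def k_clusters(points, dist, k):
    """Single-linkage k-clustering: run Kruskal's on all pairwise distances
    and stop once only k components remain. `dist(p, q)` can be any abstract
    notion of dissimilarity, not just physical distance."""
    parent = {p: p for p in points}

    def find(u):
        while parent[u] != u:
            u = parent[u]
        return u

    edges = sorted((dist(p, q), p, q)
                   for i, p in enumerate(points) for q in points[i + 1:])
    components = len(points)
    spacing = None
    for d, p, q in edges:
        rp, rq = find(p), find(q)
        if rp == rq:
            continue  # endpoints already clustered together
        if components == k:
            spacing = d  # cheapest edge between two of the final clusters
            break
        parent[rp] = rq
        components -= 1
    groups = {}
    for p in points:
        groups.setdefault(find(p), []).append(p)
    return list(groups.values()), spacing

# Toy 1-D example: two obvious clusters separated by a gap of 8.
clusters, spacing = k_clusters([0, 1, 2, 10, 11, 12],
                               lambda a, b: abs(a - b), 2)
print(sorted(map(sorted, clusters)), spacing)
```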

==== Section 4.8: Huffman Codes and Data Compression ====

Here, we focus on how computers store and transfer information using bits. With b bits per character we can represent 2^b characters uniquely, but fixed-length encodings are not always optimal, since bits cost memory to store and transmit. Further, variable-length encodings create ambiguity when codewords are concatenated. For instance, suppose that a=1 and b=11. The word "ab" would be encoded as 111, which could also be read as "aaa"; the problem is that the code for a is a prefix of the code for b. If we added spaces to separate codewords, we would really be using a three-symbol alphabet (0, 1, space) instead of two, which we want to avoid here. Thus, we must balance minimizing the number of bits against avoiding ambiguity. We need an algorithm which generates unambiguous prefix codes using the fewest bits on average: it should assign shorter codewords to characters which occur more frequently and longer codewords to rare ones, minimizing the average number of bits per letter. Such an algorithm generates prefix codes which are "optimal", satisfying both goals. A prefix code corresponds to a binary tree whose leaves are the letters, where the length of a letter's codeword is its depth in the tree. We clarify some basic facts. First, the tree of an optimal code must be full: every node which is not a leaf has exactly two children, since a node with one child could be spliced out to shorten some codewords. Second, more frequent characters should sit at shallower levels of the tree, because a letter's depth is the number of bits its codeword requires. Third, a letter with lowest frequency must appear as a leaf at maximum depth. From these facts we arrive at Huffman's algorithm, a recursive procedure which builds the tree by repeatedly merging the two lowest-frequency letters (or subtrees) into a single "meta-letter". There is a simple equation relating the trees before and after a merge: ABL(T') = ABL(T) - f_w, where T' is the smaller tree, T is the tree after expanding the meta-letter w back into two leaves, and f_w is w's frequency. This recurrence is what drives the induction in the optimality proof, though I do not fully understand the math behind it. The total running time of Huffman's algorithm is O(k log k), where k is the number of letters in the alphabet, using a priority queue. We could perhaps improve compression further by using fractions of bits instead of whole bits, but this holds deep challenges and complexities. Overall, I give this section a 4/10. The concepts were mostly straightforward but I could not really understand Huffman's algorithm or many of the proofs discussed.
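The merging step may be easier to see in code than in the proof. Below is a sketch of Huffman's algorithm using a heap as the priority queue; the frequencies at the bottom are invented for illustration, and `huffman_codes` is my own name for the routine. Note how the most frequent letter ends up with the shortest codeword.

```python
import heapq
from itertools import count

def huffman_codes(freq):
    """Huffman's algorithm: repeatedly merge the two lowest-frequency
    trees into one; O(k log k) for an alphabet of k letters.
    `freq` maps each letter to its frequency."""
    tiebreak = count()  # keeps the heap from ever comparing tree tuples
    heap = [(f, next(tiebreak), letter) for letter, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)   # two rarest subtrees...
        f2, _, t2 = heapq.heappop(heap)
        # ...become one meta-letter whose frequency is their sum
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (t1, t2)))
    _, _, tree = heap[0]

    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):       # internal node: branch on 0/1
            walk(node[0], prefix + '0')
            walk(node[1], prefix + '1')
        else:                             # leaf: record the codeword
            codes[node] = prefix or '0'
        return codes
    return walk(tree, '')

# Hypothetical frequencies; 'a' is most common and gets a 1-bit codeword.
freq = {'a': 0.45, 'b': 0.25, 'c': 0.15, 'd': 0.15}
codes = huffman_codes(freq)
abl = sum(freq[ch] * len(codes[ch]) for ch in freq)  # average bit length
print(codes, round(abl, 2))
```

For these frequencies the code is a=0, b=10, c=110, d=111, with an average bit length of 0.45·1 + 0.25·2 + 0.15·3 + 0.15·3 = 1.85, versus 2.0 for a fixed-length code on four letters.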