courses:cs211:winter2018:journals:lesherr:home:chapter4 [2018/03/11 16:30] (current) – lesherr
===== Section 7: Clustering =====
Another area where minimum spanning trees can be applied is clustering. Clustering is the problem of trying to classify or organize a collection of objects into coherent groups. 'A common approach to this problem is to define a distance function on the objects, with the interpretation that objects at a larger distance from one another are less similar to each other.' Given these pairwise distances, the clustering problem asks us to divide the objects into groups so that objects within a group are close together and objects in different groups are far apart. 'The spacing of a k-clustering is the minimum distance between any pair of points that lie in different clusters.' In this problem, we want to find a clustering with the maximum possible spacing. 'To find a clustering of maximum spacing, we consider growing a graph on the vertex set U. The connected components will be the clusters, and we will try to bring nearby points together into the same cluster as rapidly as possible.' We add edges in order of increasing distance between endpoints, and only add an edge if it connects points that are not already in the same cluster. Adding an edge between two distinct connected components merges the two clusters. 'In clustering lingo, the iterative merging of clusters is termed single-link clustering, a special case of hierarchical agglomerative clustering. Agglomerative means that we combine clusters; single-link means we do it as soon as a single link joins them together.' This procedure is exactly the same as Kruskal's Minimum Spanning Tree Algorithm. When we seek a k-clustering, we stop the algorithm once we obtain k distinct connected components. 'The components C1, C2, ..., Ck formed by deleting the k-1 most expensive edges of the minimum spanning tree T constitute a k-clustering of maximum spacing.' This section was relatively easy to follow, and hopefully will become clearer after discussing it in class. I would give it a 7/10.
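The single-link procedure can be sketched as Kruskal's algorithm with an early stop once k components remain. This is only a minimal illustration; the point names and distances below are made up, and the union-find here is a bare-bones version.

```python
def k_clustering(points, dist, k):
    """Group points into k clusters of maximum spacing.

    dist maps a (point, point) tuple to the distance between them.
    """
    parent = {p: p for p in points}

    def find(p):  # union-find root lookup with path compression
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    components = len(points)
    # Consider edges in order of increasing distance, as in Kruskal's algorithm.
    for (a, b) in sorted(dist, key=dist.get):
        if components == k:   # stop once we have k connected components
            break
        ra, rb = find(a), find(b)
        if ra != rb:          # endpoints lie in different clusters: merge them
            parent[ra] = rb
            components -= 1

    clusters = {}
    for p in points:
        clusters.setdefault(find(p), set()).add(p)
    return list(clusters.values())

# Made-up example: four points with pairwise distances.
points = ["a", "b", "c", "d"]
dist = {("a", "b"): 1, ("c", "d"): 2, ("b", "c"): 3,
        ("a", "c"): 4, ("a", "d"): 5, ("b", "d"): 6}
clusters = k_clustering(points, dist, k=2)
# clusters: {a, b} and {c, d}; the spacing is 3, the cheapest unused edge
```

Note that the edges skipped after stopping are exactly the k-1 most expensive edges of the spanning tree, which is why this matches the deletion view in the quote above.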
===== Section 8: Huffman Codes and Data Compression =====
Now we consider a problem where we use a greedy rule to 'shrink the size of the problem instance, so that an equivalent smaller problem can then be solved by recursion.' Solving the smaller instance still leads to an optimal solution of the original, larger instance, but the final result isn't produced until the whole recursion is completed. The problem we look at is data compression, which is an important part of digital communication. Computers encode data into sequences of bits (1s and 0s) so that they can process information and perform operations. The problem of data compression is finding the best way to store data in the fewest bits possible so that it can be transmitted in the least amount of time. The average number of bits per letter is calculated by taking the number of bits it takes to encode each letter, multiplying it by that letter's frequency, and summing over all of the letters. The goal of data compression is to minimize the average number of bits per letter, so that more frequent letters are encoded with fewer bits than less frequent letters. One example of data compression is Morse Code, which was used to transmit messages using sequences of dots and dashes to represent letters. However, the encoding of words in this manner was ambiguous, as there was no way to distinguish when one letter had ended and a new one had begun. To deal with this issue, Morse Code transmissions had to include short pauses between letters, but this is inefficient and can be done in a much better way. The issue with Morse Code is that there exist instances where the encoding of one letter is a prefix of the encoding of another letter, so you don't know which letter is being represented without the pause in between.
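The average-bits-per-letter formula can be checked with a tiny worked example. The frequencies and code lengths below are made-up illustration values, not from the book.

```python
# Average bits per letter: sum over all letters of frequency * code length.
freq = {"a": 0.5, "b": 0.25, "c": 0.25}       # assumed letter frequencies
code_len = {"a": 1, "b": 2, "c": 2}           # e.g. the prefix codes 0, 10, 11

avg_bits = sum(freq[ch] * code_len[ch] for ch in freq)
print(avg_bits)  # 0.5*1 + 0.25*2 + 0.25*2 = 1.5
```

A fixed-length encoding of three letters would need 2 bits per letter, so giving the most frequent letter the short code already saves half a bit per letter on average.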
To optimize the data transmission, we want to find a way to encode all of the letters as efficiently as possible such that no letter's encoding is a prefix of another's, thus removing any ambiguity about what the data represents. Additionally, we want to continue to represent more frequent letters with fewer bits than less frequent letters. To do this, we represent the prefix codes using a binary tree, where a path from a node to the right represents a 1 and a path from a node to the left represents a 0. 'The encoding of alphabet S constructed from binary tree T is a prefix code.' To ensure that no letter's encoding is a prefix of another's, all letters must be stored on leaf nodes of the tree, with no letters stored on interior nodes. 'The binary tree corresponding to the optimal prefix code is full.' 'There is an optimal prefix code, with corresponding tree T*, in which the two lowest-frequency letters are assigned to leaves that are siblings in T*.' The algorithm to build this tree is as follows: if S has two letters, encode one with a 0 and the other with a 1; otherwise, let y and z be the two lowest-frequency letters, replace them in the alphabet by a combined letter yz, a supernode with y and z as its children, and recursively continue this pattern until there is only one letter left in the alphabet. 'The Huffman code for a given alphabet achieves the minimum average number of bits per letter of any prefix code.' The running time of Huffman's algorithm relies on the use of a priority queue in the implementation, which supports O(log n) operations, so the algorithm runs in O(n log n) time. A major drawback of prefix codes is that they cannot adapt to changes in the text. This section was very easy to read and follow after having learned about it in class. I would give it a 9/10.
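The merge-the-two-lowest-frequencies step above maps directly onto a priority queue. Here is a sketch using Python's heapq as the priority queue; the alphabet and frequencies at the bottom are made up for illustration, and a tiebreaking counter is added only so the heap never has to compare letters with supernodes.

```python
import heapq
from itertools import count

def huffman_codes(freq):
    """Build an optimal prefix code from a {letter: frequency} map.

    Repeatedly merges the two lowest-frequency nodes into a supernode,
    then reads each letter's code off the resulting binary tree.
    """
    tiebreak = count()  # breaks frequency ties so heap tuples stay comparable
    heap = [(f, next(tiebreak), letter) for letter, f in freq.items()]
    heapq.heapify(heap)
    if len(heap) == 1:                       # one-letter alphabet: one bit suffices
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # the two lowest-frequency letters...
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))  # ...become a supernode

    codes = {}
    def walk(node, code):
        if isinstance(node, tuple):          # supernode: descend into both children
            walk(node[0], code + "0")        # left edge contributes a 0
            walk(node[1], code + "1")        # right edge contributes a 1
        else:
            codes[node] = code               # leaf: this letter's full code
    walk(heap[0][2], "")
    return codes

# Made-up frequencies: "a" is most common, so it should get the shortest code.
codes = huffman_codes({"a": 0.45, "b": 0.25, "c": 0.15, "d": 0.15})
```

With these frequencies the tree assigns a 1-bit code to "a", a 2-bit code to "b", and 3-bit codes to "c" and "d", which matches the intuition that frequent letters sit closer to the root.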