====Section 4.6: Implementing Kruskal's Algorithm: The Union-Find Data Structure====
  
===Union-Find Data Structure===
  * Keeps track of the components of a graph as edges are added
      * Cannot handle edge deletions
  * Maintains a collection of disjoint sets
      * E.g., a graph's connected components

**Operations:** (a code sketch follows this list)
  * Find(u): returns the name of the set containing u
      * Two nodes u and v are in the same set iff Find(u) = Find(v)
      * Goal implementation: O(log n)
  * Union(A, B): merge sets A and B into one set
      * Goal implementation: O(log n)
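
A minimal sketch of such a structure in Python, using union by size so that Find stays O(log n); the class and method names are illustrative, not from the text:

<code python>
class UnionFind:
    """Disjoint sets with union by size; find() is O(log n)."""

    def __init__(self, elements):
        # Every element starts in its own singleton set.
        self.parent = {u: u for u in elements}
        self.size = {u: 1 for u in elements}

    def find(self, u):
        # Walk parent pointers up to the set's representative (its "name").
        while self.parent[u] != u:
            u = self.parent[u]
        return u

    def union(self, a, b):
        # Merge the sets named a and b; hang the smaller under the larger
        # so tree height, and hence find(), stays O(log n).
        if self.size[a] < self.size[b]:
            a, b = b, a
        self.parent[b] = a
        self.size[a] += self.size[b]
</code>

Checking whether two nodes u and v are in the same set is then just find(u) == find(v).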

===Implementing Kruskal's Algorithm===
  * Using the union-find data structure
      * Build up the set T of edges in the MST
      * Maintain a set for each connected component
  * Using the best implementation of union-find
      * Sorting: O(m log n)
      * Union-find operations: O(m α(m, n))
      * Total: O(m log n)

**The Algorithm:**
<code>
Sort the edges by weight so that c_1 ≤ c_2 ≤ ... ≤ c_m
T = {}
foreach (u ∈ V) make a set containing singleton u
for i = 1 to m
    (u,v) = e_i
    if (u and v are in different sets)
        T = T ∪ {e_i}
        merge the sets containing u and v
return T
</code>
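
Putting the two together, a hedged Python sketch of Kruskal's algorithm; representing the graph as a list of (weight, u, v) tuples is an assumption made for illustration, and the inline find uses path compression for brevity:

<code python>
def kruskal(nodes, edges):
    """edges: list of (weight, u, v) tuples; returns the MST's edge list."""
    parent = {u: u for u in nodes}    # each node starts as its own set

    def find(u):
        # Find the set's representative, compressing the path as we go.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    T = []
    for w, u, v in sorted(edges):     # O(m log n) sort dominates
        ru, rv = find(u), find(v)
        if ru != rv:                  # u and v are in different sets
            T.append((u, v))
            parent[ru] = rv           # merge the sets containing u and v
    return T
</code>

For example, kruskal(['a', 'b', 'c'], [(1, 'a', 'b'), (2, 'b', 'c'), (3, 'a', 'c')]) returns [('a', 'b'), ('b', 'c')], the two cheapest edges.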
====Section 4.7: Clustering====
  
===Clustering===
Given a set U of n objects (or points) labeled p<sub>1</sub>, …, p<sub>n</sub>, classify them into coherent groups.
  * **Problem:** divide the objects into clusters so that points in different clusters are far apart
      * Requires a quantitative notion of distance
  * Applications
      * Routing in mobile ad-hoc networks
      * Identifying patterns in gene expression
      * Identifying patterns in web application use cases
          * Sets of URLs
      * Similarity searching in medical image databases

**Distance Function**
The numeric value specifying the "closeness" of two objects.
Assume the distance function satisfies several natural properties (illustrated in code below):
  * d(p<sub>i</sub>, p<sub>j</sub>) = 0 iff p<sub>i</sub> = p<sub>j</sub> (identity of indiscernibles)
  * d(p<sub>i</sub>, p<sub>j</sub>) ≥ 0 (non-negativity)
  * d(p<sub>i</sub>, p<sub>j</sub>) = d(p<sub>j</sub>, p<sub>i</sub>) (symmetry)
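
Euclidean distance is one such function; a quick Python illustration (representing points as coordinate tuples is an assumption):

<code python>
import math

def d(p, q):
    # Euclidean distance between two points given as coordinate tuples.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

p, q = (0.0, 0.0), (3.0, 4.0)
assert d(p, p) == 0           # identity of indiscernibles
assert d(p, q) >= 0           # non-negativity
assert d(p, q) == d(q, p)     # symmetry
print(d(p, q))                # 5.0
</code>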

**Problem: k-Clustering of Maximum Spacing**
  * //k-clustering:// Divide the objects into k non-empty groups
  * //Spacing:// The minimum distance between any pair of points in different clusters
  * //k-clustering of maximum spacing:// Given an integer k, find a k-clustering of maximum spacing

**Greedy Clustering Algorithm**
  * Single-link k-clustering algorithm (sketched in code below)
      * Form a graph on the vertex set U, corresponding to n clusters
      * Find the closest pair of objects such that each object is in a different cluster, and add an edge between them
      * Repeat n−k times, until there are exactly k clusters
  * This is the same as Kruskal's algorithm, except that we stop when there are k connected components
  * Equivalent to finding an MST and deleting the k−1 most expensive edges
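
A hedged Python sketch of the single-link procedure, run exactly as Kruskal's algorithm but stopped at k components; the all-pairs edge list and the dist parameter are assumptions for illustration:

<code python>
def single_link_clustering(points, k, dist):
    """Greedily merge the closest pair of clusters until k remain."""
    # Start with n singleton clusters, one per point.
    clusters = [{p} for p in points]
    # All pairwise distances, sorted ascending (Kruskal's edge order).
    edges = sorted((dist(p, q), p, q)
                   for i, p in enumerate(points)
                   for q in points[i + 1:])
    for w, p, q in edges:
        if len(clusters) == k:        # stop at k connected components
            break
        cp = next(c for c in clusters if p in c)
        cq = next(c for c in clusters if q in c)
        if cp is not cq:              # closest pair in different clusters
            clusters.remove(cq)
            cp |= cq                  # merge the two clusters
    return clusters
</code>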

**//Theorem://** Let C denote the clustering C<sub>1</sub>, …, C<sub>k</sub> formed by deleting the k−1 most expensive edges of an MST. Then C is a k-clustering of maximum spacing.
  
====Section 4.8: Huffman Codes and Data Compression====
  
===Huffman Codes===
**Problem: Encoding**
  * Computers use bits: 0s and 1s
  * Need to translate the symbols humans use into the bits computers use
  * Map each symbol → a unique sequence of 0s and 1s
  * This process is called encoding
  * The least number of bits needed to encode 26 letters, the space, and 5 punctuation marks (, . ? ! ') is 5, since that is 32 characters and 32 = 2<sup>5</sup>
  * For a longer string of characters, we can use shorter encodings for frequently used characters, like those in "ETAOIN SHRDLU" (roughly the twelve most common English letters, in order)
      * Goal: an optimal encoding that takes advantage of the non-uniformity of letter frequencies
      * Looking to represent data as compactly as possible
      * Example: Morse code, which is a variable-length encoding
      * We run into a problem when doing this: ambiguity, caused by the encoding of one character being a prefix of the encoding of another

**Optimal Prefix Codes**
  * Solution to the ambiguity: prefix codes, which map letters to bit strings such that no encoding is a prefix of any other
      * Goal: minimize the average number of bits per letter (ABL), where for each letter x in the alphabet S:
      * f<sub>x</sub>: the frequency with which the letter x occurs
      * |γ(x)|: the length of the encoding of x
      * Minimize ABL = Σ<sub>x∈S</sub> f<sub>x</sub>|γ(x)| (a worked example follows this list)
  * Binary tree to represent prefix codes
      * Exposes structure better than a list of mappings
          * Each leaf node is a letter
          * Follow the path from the root to the letter
              * Going left: 0, going right: 1
  * Recursively generate the tree
      * All letters start in the root node
      * For all letters in a node:
          * If the encoding begins with 0, the letter belongs in the left subtree
          * Otherwise (the encoding begins with 1), the letter belongs in the right subtree
          * If this is the last bit of the encoding, make the letter a leaf node of that subtree
          * Shift the encoding by one bit
          * Process the left and right children
  * Tree properties
      * The length of the path from the root to a leaf is the depth of that leaf
      * The binary tree T corresponding to the optimal prefix code is full, i.e., each internal node has two children
  * Conclusions
      * Label the leaf nodes of this binary tree so that the nodes with the greatest depth have the least frequency
      * The two letters with the least frequency are definitely going to be siblings
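
As a worked example of the ABL formula, take a hypothetical 4-letter alphabet with assumed frequencies and an assumed prefix code (neither is from the text):

<code python>
# Assumed frequencies and a prefix code: no codeword is a prefix of another.
freq = {'a': 0.5, 'b': 0.25, 'c': 0.15, 'd': 0.10}
code = {'a': '0', 'b': '10', 'c': '110', 'd': '111'}

# ABL = Σ_{x∈S} f_x * |γ(x)|
abl = sum(freq[x] * len(code[x]) for x in freq)
print(abl)  # 0.5*1 + 0.25*2 + 0.15*3 + 0.10*3 = 1.75 bits per letter
</code>

This beats the 2 bits per letter that a fixed-length code over 4 symbols would need.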

**Huffman's Algorithm**
  * Data structures needed:
      * A binary tree for the prefix codes
      * A priority queue for choosing the node with the lowest frequency
  * Costs
      * Inserting a node into or extracting a node from the PQ: O(log n)
      * Number of insertions and extractions: O(n)
      * Total: O(n log n)
  
<code>
To construct a prefix code for an alphabet S, with given frequencies:
    If S has two letters then
        Encode one letter using 0 and the other letter using 1
    Else
        Let y* and z* be the two lowest-frequency letters
        Form a new alphabet S' by deleting y* and z* and replacing them
            with a new letter ω of frequency f(y*) + f(z*)
        Recursively construct a prefix code γ' for S', with tree T'
        Define a prefix code for S as follows:
            Start with T'
            Take the leaf labeled ω and add two children below it
                labeled y* and z*
    Endif
</code>
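
An iterative Python sketch of the same idea using the standard heapq module; repeatedly merging the two lowest-frequency nodes is equivalent to the recursion above. The tuple-based tree representation is an assumption:

<code python>
import heapq
from itertools import count

def huffman(freq):
    """freq: dict letter -> frequency; returns dict letter -> codeword."""
    # Heap entries are (frequency, tiebreak, tree); a tree is either a
    # letter or a (left, right) pair. The tiebreak counter keeps heapq
    # from ever comparing two trees directly.
    tick = count()
    heap = [(f, next(tick), letter) for letter, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two lowest-frequency nodes, as in the recursion above.
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tick), (t1, t2)))
    _, _, tree = heap[0]

    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):       # internal node: recurse
            walk(node[0], prefix + '0')   # going left appends a 0
            walk(node[1], prefix + '1')   # going right appends a 1
        else:
            codes[node] = prefix or '0'   # leaf: record the codeword
    walk(tree, '')
    return codes

print(huffman({'a': 0.5, 'b': 0.25, 'c': 0.15, 'd': 0.10}))
</code>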
====Section 4.9: Minimum-Cost Arborescences: A Multi-Phase Greedy Algorithm====