====Section 4.6: Implementing Kruskal's Algorithm: The Union-Find Data Structure====
  
===Union-Find Data Structure===
  * Keeps track of the components of a graph as edges are added
      * Cannot handle edge deletions
  * Maintains a collection of disjoint sets
      * E.g., a graph's connected components

**Operations:** (a code sketch follows this list)
  * Find(u): returns the name of the set containing u
      * Two nodes u and v are in the same set iff Find(u) = Find(v)
      * Goal implementation: O(log n)
  * Union(A, B): merge sets A and B into one set
      * Goal implementation: O(log n)
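
A minimal sketch of such a structure in Python, using union by size so that Find stays O(log n); the class and method names are illustrative, not from the text:

<code python>
class UnionFind:
    """Disjoint sets with union by size; find() is O(log n)."""

    def __init__(self, elements):
        # Every element starts in its own singleton set.
        self.parent = {u: u for u in elements}
        self.size = {u: 1 for u in elements}

    def find(self, u):
        # Walk parent pointers up to the set's representative (its "name").
        while self.parent[u] != u:
            u = self.parent[u]
        return u

    def union(self, a, b):
        # Merge the sets named a and b; hang the smaller under the larger
        # so tree height, and hence find(), stays O(log n).
        if self.size[a] < self.size[b]:
            a, b = b, a
        self.parent[b] = a
        self.size[a] += self.size[b]
</code>

Checking whether two nodes u and v are in the same set is then just find(u) == find(v).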

===Implementing Kruskal's Algorithm===
  * Using the union-find data structure
      * Build up the set T of edges in the MST
      * Maintain a set for each connected component
  * Using the best implementation of union-find
      * Sorting: O(m log n)
      * Union-find operations: O(m α(m, n))
      * Total: O(m log n)

**The Algorithm:**
<code>
Sort the edges by weight so that c_1 ≤ c_2 ≤ ... ≤ c_m
T = {}
foreach (u ∈ V) make a set containing singleton u
for i = 1 to m
    (u,v) = e_i
    if (u and v are in different sets)
        T = T ∪ {e_i}
        merge the sets containing u and v
return T
</code>
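
Putting the two together, a hedged Python sketch of Kruskal's algorithm; representing the graph as a list of (weight, u, v) tuples is an assumption made for illustration, and the inline find uses path compression for brevity:

<code python>
def kruskal(nodes, edges):
    """edges: list of (weight, u, v) tuples; returns the MST's edge list."""
    parent = {u: u for u in nodes}    # each node starts as its own set

    def find(u):
        # Find the set's representative, compressing the path as we go.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    T = []
    for w, u, v in sorted(edges):     # O(m log n) sort dominates
        ru, rv = find(u), find(v)
        if ru != rv:                  # u and v are in different sets
            T.append((u, v))
            parent[ru] = rv           # merge the sets containing u and v
    return T
</code>

For example, kruskal(['a', 'b', 'c'], [(1, 'a', 'b'), (2, 'b', 'c'), (3, 'a', 'c')]) returns [('a', 'b'), ('b', 'c')], the two cheapest edges.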
====Section 4.7: Clustering====
  
===Clustering===
Given a set U of n objects (or points) labeled p<sub>1</sub>, …, p<sub>n</sub>, classify them into coherent groups.
  * **Problem:** divide the objects into clusters so that points in different clusters are far apart
      * Requires a quantitative notion of distance
  * Applications
      * Routing in mobile ad-hoc networks
      * Identifying patterns in gene expression
      * Identifying patterns in web application use cases
          * Sets of URLs
      * Similarity searching in medical image databases

**Distance Function**
The numeric value specifying the "closeness" of two objects.
Assume the distance function satisfies several natural properties (illustrated in code below):
  * d(p<sub>i</sub>, p<sub>j</sub>) = 0 iff p<sub>i</sub> = p<sub>j</sub> (identity of indiscernibles)
  * d(p<sub>i</sub>, p<sub>j</sub>) ≥ 0 (non-negativity)
  * d(p<sub>i</sub>, p<sub>j</sub>) = d(p<sub>j</sub>, p<sub>i</sub>) (symmetry)
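
Euclidean distance is one such function; a quick Python illustration (representing points as coordinate tuples is an assumption):

<code python>
import math

def d(p, q):
    # Euclidean distance between two points given as coordinate tuples.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

p, q = (0.0, 0.0), (3.0, 4.0)
assert d(p, p) == 0           # identity of indiscernibles
assert d(p, q) >= 0           # non-negativity
assert d(p, q) == d(q, p)     # symmetry
print(d(p, q))                # 5.0
</code>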

**Problem: k-Clustering of Maximum Spacing**
  * //k-clustering:// Divide the objects into k non-empty groups
  * //Spacing:// The minimum distance between any pair of points in different clusters
  * //k-clustering of maximum spacing:// Given an integer k, find a k-clustering of maximum spacing

**Greedy Clustering Algorithm**
  * Single-link k-clustering algorithm (sketched in code below)
      * Form a graph on the vertex set U, corresponding to n clusters
      * Find the closest pair of objects such that each object is in a different cluster, and add an edge between them
      * Repeat n−k times, until there are exactly k clusters
  * This is the same as Kruskal's algorithm, except that we stop when there are k connected components
  * Equivalent to finding an MST and deleting the k−1 most expensive edges
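
A hedged Python sketch of the single-link procedure, run exactly as Kruskal's algorithm but stopped at k components; the all-pairs edge list and the dist parameter are assumptions for illustration:

<code python>
def single_link_clustering(points, k, dist):
    """Greedily merge the closest pair of clusters until k remain."""
    # Start with n singleton clusters, one per point.
    clusters = [{p} for p in points]
    # All pairwise distances, sorted ascending (Kruskal's edge order).
    edges = sorted((dist(p, q), p, q)
                   for i, p in enumerate(points)
                   for q in points[i + 1:])
    for w, p, q in edges:
        if len(clusters) == k:        # stop at k connected components
            break
        cp = next(c for c in clusters if p in c)
        cq = next(c for c in clusters if q in c)
        if cp is not cq:              # closest pair in different clusters
            clusters.remove(cq)
            cp |= cq                  # merge the two clusters
    return clusters
</code>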

**//Theorem://** Let C denote the clustering C<sub>1</sub>, …, C<sub>k</sub> formed by deleting the k−1 most expensive edges of an MST. Then C is a k-clustering of maximum spacing.
  
====Section 4.8: Huffman Codes and Data Compression====
  
===Huffman Codes===
**Problem: Encoding**
  * Computers use bits: 0s and 1s
  * Need to translate the symbols humans use into the bits computers use
  * Map each symbol → a unique sequence of 0s and 1s
  * This process is called encoding
  * The least number of bits needed to encode 26 letters, the space, and 5 punctuation marks (, . ? ! ') is 5, since that is 32 characters and 32 = 2<sup>5</sup>
  * For a longer string of characters, we can use shorter encodings for frequently used characters, like those in "ETAOIN SHRDLU" (roughly the twelve most common English letters, in order)
      * Goal: an optimal encoding that takes advantage of the non-uniformity of letter frequencies
      * Looking to represent data as compactly as possible
      * Example: Morse code, which is a variable-length encoding
      * We run into a problem when doing this: ambiguity, caused by the encoding of one character being a prefix of the encoding of another

**Optimal Prefix Codes**
  * Solution to the ambiguity: prefix codes, which map letters to bit strings such that no encoding is a prefix of any other
      * Goal: minimize the average number of bits per letter (ABL), where for each letter x in the alphabet S:
      * f<sub>x</sub>: the frequency with which the letter x occurs
      * |γ(x)|: the length of the encoding of x
      * Minimize ABL = Σ<sub>x∈S</sub> f<sub>x</sub>|γ(x)| (a worked example follows this list)
  * Binary tree to represent prefix codes
      * Exposes structure better than a list of mappings
          * Each leaf node is a letter
          * Follow the path from the root to the letter
              * Going left: 0, going right: 1
  * Recursively generate the tree
      * All letters start in the root node
      * For all letters in a node:
          * If the encoding begins with 0, the letter belongs in the left subtree
          * Otherwise (the encoding begins with 1), the letter belongs in the right subtree
          * If this is the last bit of the encoding, make the letter a leaf node of that subtree
          * Shift the encoding by one bit
          * Process the left and right children
  * Tree properties
      * The length of the path from the root to a leaf is the depth of that leaf
      * The binary tree T corresponding to the optimal prefix code is full, i.e., each internal node has two children
  * Conclusions
      * Label the leaf nodes of this binary tree so that the nodes with the greatest depth have the least frequency
      * The two letters with the least frequency are definitely going to be siblings
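
As a worked example of the ABL formula, take a hypothetical 4-letter alphabet with assumed frequencies and an assumed prefix code (neither is from the text):

<code python>
# Assumed frequencies and a prefix code: no codeword is a prefix of another.
freq = {'a': 0.5, 'b': 0.25, 'c': 0.15, 'd': 0.10}
code = {'a': '0', 'b': '10', 'c': '110', 'd': '111'}

# ABL = Σ_{x∈S} f_x * |γ(x)|
abl = sum(freq[x] * len(code[x]) for x in freq)
print(abl)  # 0.5*1 + 0.25*2 + 0.15*3 + 0.10*3 = 1.75 bits per letter
</code>

This beats the 2 bits per letter that a fixed-length code over 4 symbols would need.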

**Huffman's Algorithm**
  * Data structures needed:
      * A binary tree for the prefix codes
      * A priority queue for choosing the node with the lowest frequency
  * Costs
      * Inserting a node into or extracting a node from the PQ: O(log n)
      * Number of insertions and extractions: O(n)
      * Total: O(n log n)
  
<code>
To construct a prefix code for an alphabet S, with given frequencies:
    If S has two letters then
        Encode one letter using 0 and the other letter using 1
    Else
        Let y* and z* be the two lowest-frequency letters
        Form a new alphabet S' by deleting y* and z* and replacing them
            with a new letter ω of frequency f(y*) + f(z*)
        Recursively construct a prefix code γ' for S', with tree T'
        Define a prefix code for S as follows:
            Start with T'
            Take the leaf labeled ω and add two children below it
                labeled y* and z*
    Endif
</code>
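
An iterative Python sketch of the same idea using the standard heapq module; repeatedly merging the two lowest-frequency nodes is equivalent to the recursion above. The tuple-based tree representation is an assumption:

<code python>
import heapq
from itertools import count

def huffman(freq):
    """freq: dict letter -> frequency; returns dict letter -> codeword."""
    # Heap entries are (frequency, tiebreak, tree); a tree is either a
    # letter or a (left, right) pair. The tiebreak counter keeps heapq
    # from ever comparing two trees directly.
    tick = count()
    heap = [(f, next(tick), letter) for letter, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two lowest-frequency nodes, as in the recursion above.
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tick), (t1, t2)))
    _, _, tree = heap[0]

    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):       # internal node: recurse
            walk(node[0], prefix + '0')   # going left appends a 0
            walk(node[1], prefix + '1')   # going right appends a 1
        else:
            codes[node] = prefix or '0'   # leaf: record the codeword
    walk(tree, '')
    return codes

print(huffman({'a': 0.5, 'b': 0.25, 'c': 0.15, 'd': 0.10}))
</code>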
====Section 4.9: Minimum-Cost Arborescences: A Multi-Phase Greedy Algorithm====