Differences
This shows you the differences between two versions of the page.
| Both sides previous revisionPrevious revisionNext revision | Previous revision | ||
| courses:cs211:winter2018:journals:boyese:chapter4 [2018/03/05 23:43] – [Section 4.5: The Minimum Spanning Tree Problem] boyese | courses:cs211:winter2018:journals:boyese:chapter4 [2018/03/12 18:25] (current) – [Section 4.8: Huffman Codes and Data Compression] boyese | ||
|---|---|---|---|
| Line 175: | Line 175: | ||
| * Cluster analysis | * Cluster analysis | ||
| - | // | + | // |
| - | Possible algorithms to find MST: | + | **Possible algorithms to find MST:** |
| * // | * // | ||
| * //Prim’s Algorithm:// | * //Prim’s Algorithm:// | ||
| Line 183: | Line 183: | ||
| **Properties of the MST** | **Properties of the MST** | ||
| - | * Cut property: Let S be any subset of nodes, and let e be the min cost edge with exactly one endpoint in S. Then MST contains e. | + | * //Cut property:// Let S be any subset of nodes, and let e be the min cost edge with exactly one endpoint in S. Then MST contains e. |
| - | * Cutset: A cut is a subset of nodes S. The corresponding cutset D is the subset of edges with exactly one endpoint in S. | + | * //Cutset:// A cut is a subset of nodes S. The corresponding cutset D is the subset of edges with exactly one endpoint in S. |
| - | * Cycle property: Let C be a cycle, and let f be the max cost edge belonging to C. Then MST does not contain f. (All edge costs ce are distinct) | + | * //Cycle property:// Let C be a cycle, and let f be the max cost edge belonging to C. Then MST does not contain f. (All edge costs ce are distinct) |
| - | * Cycle: Set of edges in the form a-b, b-c, c-d, …, y-z, z-a | + | * //Cycle:// Set of edges in the form a-b, b-c, c-d, …, y-z, z-a |
| - | * Cycle-Cut Intersection: | + | * //Cycle-Cut Intersection: |
| ====Section 4.6: Implementing Kruskal' | ====Section 4.6: Implementing Kruskal' | ||
| + | ===Union-Find Data Structure: | ||
| + | * Keeps track of a graph as edges are added | ||
| + | * Cannot handle when edges are deleted | ||
| + | * Maintains disjoint sets | ||
| + | * E.g., graph’s connected components | ||
| + | **Operations: | ||
| + | * Find(u): returns name of set containing u | ||
| + | * How is this utilized to see if two nodes are in the same set? | ||
| + | * Goal implementation: | ||
| + | * Union(A, B): merge sets A and B into one set | ||
| + | * Goal implementation: | ||
| + | |||
| + | ===Implementing Kruskal’s Algorithm=== | ||
| + | * Using the union-find data structure | ||
| + | * Build set of T edges in the MST | ||
| + | * Maintain set for each connected component | ||
| + | * Using best implementation of union-find | ||
| + | * Sorting: O(m log n) | ||
| + | * Union-find: O(m α (m, n)) | ||
| + | * O(m log n) | ||
| + | |||
| + | **The Algorithm: | ||
| + | < | ||
| + | Sort edge weights so that c< | ||
| + | T = {} | ||
| + | foreach (u ∈ V) make a set containing singleton u | ||
| + | for i = 1 to m | ||
| + | (u,v) = e< | ||
| + | if (u and v are in different sets) | ||
| + | T = T ∪ {e< | ||
| + | merge the sets containing u and v | ||
| + | return T | ||
| + | </ | ||
| ====Section 4.7: Clustering==== | ====Section 4.7: Clustering==== | ||
| + | ===Clustering=== | ||
| + | Given a set U of n objects (or points) labeled p< | ||
| + | * **Problem: | ||
| + | * Requires quantification of distance | ||
| + | * Applications | ||
| + | * Routing in mobile ad-hoc networks | ||
| + | * Identify patterns in gene expression | ||
| + | * Identifying patterns in web application use cases | ||
| + | * Sets of URLs | ||
| + | * Similarity searching in medical image databases | ||
| + | |||
| + | **Distance Function** | ||
| + | The numeric value specifying “closeness” of two objects | ||
| + | Assume distance function satisfies several natural properties: | ||
| + | * d(p< | ||
| + | * d(p< | ||
| + | * d(p< | ||
| + | |||
| + | |||
| + | **Problem: k-Clustering of Maximum Spacing** | ||
| + | * // | ||
| + | * // | ||
| + | * // | ||
| + | |||
| + | **Greedy Clustering Algorithm** | ||
| + | * Single-link k-clustering algorithm | ||
| + | * Form a graph on the vertex set U, corresponding to n clusters | ||
| + | * Find the closest pair of objects such that each object is in a different cluster and add an edge between them | ||
| + | * Repeat n-k times until there are exactly k clusters | ||
| + | * This is the same as Kruskal’s algorithm, except we stop when there are k connected components | ||
| + | * Equivalent to finding MST and deleting the k-1 most expensive edges | ||
| + | |||
| + | **// | ||
| ====Section 4.8: Huffman Codes and Data Compression==== | ====Section 4.8: Huffman Codes and Data Compression==== | ||
| + | ===Huffman Codes=== | ||
| + | **Problem: Encoding** | ||
| + | * Computers use bits: 0s and 1s | ||
| + | * Need to represent what humans know to what computers know | ||
| + | * Map symbol → unique sequence of 0s and 1s | ||
| + | * Process is called encoding | ||
| + | * The least number of bits needed to encode 26 letters, space, and 5 punctuation marks (, . ? ! ‘) (32 characters = 25), so 5 bits | ||
| + | * For a longer string of characters, we can use shorter encodings for frequently used characters like ETAOINSHRDLU | ||
| + | * Goal: optimal encoding that takes advantage of non-uniformity of letter frequencies | ||
| + | * Looking to represent data as compactly as possible | ||
| + | * Example: Morse code, which is an example of variable-length encoding | ||
| + | * We run into a problem when doing this: ambiguity caused by the encoding of one character being a prefix of encoding another | ||
| + | |||
| + | **Optimal Prefix Codes** | ||
| + | * Solution to ambiguity: Prefix codes, which are a map of letters to bit strings such that no encoding is a prefix of any other | ||
| + | * Goal: minimize the average number of bits per letter (ABL): | ||
| + | * Σ< | ||
| + | * fx : frequency that the letter x occurs | ||
| + | * |ϒ(x)|: length of encoding of x | ||
| + | * Minimize ABL = Σ< | ||
| + | * Binary tree to represent prefix codes | ||
| + | * Exposes structure netter than list of mappings | ||
| + | * Each leaf node is a letter | ||
| + | * Follow path to the letter | ||
| + | * Going left: 0, Going right: 1 | ||
| + | * Recursively generate tree | ||
| + | * All letters are in root node | ||
| + | * For all letters in node: | ||
| + | * If encoding begins with 0, letter belongs in left subtree | ||
| + | * Otherwise (encoding begins with 1), letter belongs in right subtree | ||
| + | * If last bit of encoding, make the letter a leaf node of that subtree | ||
| + | * Shift encoding one bit | ||
| + | * Process left and right children | ||
| + | * Tree Properties | ||
| + | * The length of the path from root to leaf is the depth | ||
| + | * The binary tree T corresponding to the optimal prefix code is full, i.e., each internal node has two children | ||
| + | * Conclusions | ||
| + | * The binary tree corresponding to the optimal prefix code is full, i.e., each internal node has two children | ||
| + | * We want to label the leaf nodes of the binary tree corresponding to the optimal prefix code such that nodes with greatest depth have least frequency | ||
| + | * The two letters with least frequency are definitely going to be siblings | ||
| + | |||
| + | **Huffman’s Algorithm** | ||
| + | * Data structures needed: | ||
| + | * Binary tree for the prefix codes | ||
| + | * Priority queue for choosing the node with lowest frequency | ||
| + | * Costs | ||
| + | * Inserting and extracting node into PQ: O(log n) | ||
| + | * Number of insertions and extractions: | ||
| + | * O(n log n) | ||
| + | < | ||
| + | To construct a prefix code for an alphabet S, with given frequencies: | ||
| + | If S has two letters then | ||
| + | Encode one letter using 0 and the other letter using 1 | ||
| + | Else | ||
| + | Let y* and z* be the two lowest-frequency letters | ||
| + | Form a new alphabet S’ by deleting y* and z* and replacing them with a new letter ω of frequency f< | ||
| + | Recursively construct a prefix code γ’ for S’, with tree T’ | ||
| + | Define a prefix code for S as follows: | ||
| + | Start with T’ | ||
| + | Take the leaf labeled ω and add two children below it labeled y* and z* | ||
| + | Endif | ||
| + | </ | ||
| ====Section 4.9: Minimum-Cost Arborescences: | ====Section 4.9: Minimum-Cost Arborescences: | ||
