This is an old revision of the document!
Table of Contents
4.8 Huffman Codes and Data Compression
The Problem
- Encoding Symbols Using Bits:
- We need schemes to encode text written in richer alphabets and converts this text into long strings of bits.
- The first thought would be to represent all of the symbol by the same number of bits
- However, this is not efficient as some symbols are frequent while others are less frequent.
- Thus symbols are represented differently depending on their frequency.
- Data Compression Algorithms are devoted at representing large files as compact as possibly, that is they take files as input and reduce their space through efficient encoding schemes
Variable-Length Encoding Schemes:
- Basically more frequent letters are represented differently from the less frequent ones.
- The goal is to represent symbols with as short bits as possible
Prefix codes:
- It's the encoding of symbols in such a way that no encoding is a prefix of any other
- A Prefix code for a set S of letters is a function γ that maps each letter x∈ S to some sequence of zeros and ones such that:
- For distinct x,y ∈ S, the sequence γ(x) is not a prefix of the sequence γ(y).
Optimal Prefix Codes :
- For each letter x in S, there is a frequency fx, that represents the fraction of letters in the text that are equal to x
- Assuming there are n letters, n*fx of these letters equal to x
- The frequencies sum to 1 –>∑x ∈ S fx = 1
- |γ(x)|: Length of encoding of x
- ABL = Σx in S fx*|γ(x)|
- ABL is the Average Bit per Letter
Designing the Algorithm
- The search space of the problem includes all of the possible ways of mapping letters to bit strings
- We need to develop a tree-based means of representing prefix codes that exposes their structure more clearly
Representing Prefix Codes using Binary Trees
- We label each leaf of the binary tree with a distinct letter in the size of the alphabet S
- For each letter x in S, we follow the path from the root to the leaf labeled x:
- Each time the path goes from a node to its left child, we write down a 0
- Each time the path goes from a node to its right child, we write down a 1
- The Encoding of S constructed from T is a prefix code
- The search for an optimal prefix code reduces to the search of a binary Tree T, together with a labeling of the leaves of T, that minimizes the ABL.