This is an old revision of the document!


4.8 Huffman Codes and Data Compression

The Problem

  • Encoding Symbols Using Bits:
    • We need schemes to encode text written in richer alphabets and converts this text into long strings of bits.
    • The first thought would be to represent all of the symbol by the same number of bits
    • However, this is not efficient as some symbols are frequent while others are less frequent.
    • Thus symbols are represented differently depending on their frequency.
    • Data Compression Algorithms are devoted at representing large files as compact as possibly, that is they take files as input and reduce their space through efficient encoding schemes


Variable-Length Encoding Schemes:

  • Basically more frequent letters are represented differently from the less frequent ones.
  • The goal is to represent symbols with as short bits as possible


Prefix codes:

  • It's the encoding of symbols in such a way that no encoding is a prefix of any other
  • A Prefix code for a set S of letters is a function γ that maps each letter x∈ S to some sequence of zeros and ones such that:
    • For distinct x,y ∈ S, the sequence γ(x) is not a prefix of the sequence γ(y).


Optimal Prefix Codes :

  • For each letter x in S, there is a frequency fx, that represents the fraction of letters in the text that are equal to x
  • Assuming there are n letters, n*fx of these letters equal to x
  • The frequencies sum to 1 –>∑x ∈ S fx = 1
  • |γ(x)|: Length of encoding of x
  • ABL = Σx in S fx*|γ(x)|
  • ABL is the Average number of Bits per Letter



Designing the Algorithm

  • The search space of the problem includes all of the possible ways of mapping letters to bit strings
  • We need to develop a tree-based means of representing prefix codes that exposes their structure more clearly



Representing Prefix Codes using Binary Trees

  • We label each leaf of the binary tree with a distinct letter in the size of the alphabet S
  • For each letter x in S, we follow the path from the root to the leaf labeled x:
    • Each time the path goes from a node to its left child, we write down a 0
    • Each time the path goes from a node to its right child, we write down a 1
  • The Encoding of S constructed from T is a prefix code
  • The search for an optimal prefix code reduces to the search of a binary Tree T, together with a labeling of the leaves of T, that minimizes the ABL.
  • Thus the length of the encoding of a letter x in S is simply the length from the root to the leaf labeled x: This length is termed the depth of the leaf, and the depth of a leaf v in T will be denoted by depthT(v).
  • So now our goal is to find the labeled tree T that minimizes the weighted average of the depths of all leaves, where the average is weighted by the frequencies of the letters that label the leaves: ∑x ∈ S fx*depthT(x).–> ABL
  • The Binary Tree corresponding to the optimal prefix code is full

Algorithm to Construct an Optimal Prefix Code

To construct a prefix code for an alphabet S, with given frequencies:
If S has 2 letters:
Encode one letter using 0 and the other using 1

Else:

Let y* and z* be the 2 lowest-frequency letters

Form a new alphabet S' by deleting y* and z* and replacing them with
a new letter ω of frequency fy*+ fz*

Recursively Construct a prefix code y' for S', with tree T'

Define a prefix code for S as follows:

Start with T'

Take the leaf labeled ω and add two children below it labeled y* and z*

End if


  • The algorithm above is referred to as the Huffman Algorithm and the prefix code it produces is termed the Huffman code
  • This algorithm produces an optimal prefix code
  • This algorithm also follows a bottom-up approach as seen: It focuses on the leaves representing the two lowest-frequency letters, and then continues by recursion.



Analyzing the Algorithm

  • The algorithm is optimal
  • ABL(T') = ABL(T) - fω
  • The Huffman code for a given alphabet achieves the minimum ABL of any prefix code

Implementation and Running Time

courses/cs211/winter2012/journals/jeanpaul/chapterfour_sectionviii.1331011113.txt.gz · Last modified: 2012/03/06 05:18 by mugabej
CC Attribution-Noncommercial-Share Alike 4.0 International
Driven by DokuWiki Recent changes RSS feed Valid CSS Valid XHTML 1.0