Sunday 2 December 2012

Huffman Coding


Huffman coding is an entropy encoding algorithm used for lossless data compression. It uses a specific method for choosing the representation for each symbol, resulting in a prefix code that expresses the most common source symbols using shorter strings of bits than are used for less common source symbols. Only in the special case where every symbol is equally likely (and the number of symbols is a power of two) does Huffman coding reduce to simple binary block encoding, i.e. a fixed-length code.
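As a small illustration (the symbols, frequencies and code words below are invented for the example), a prefix code assigns shorter bit strings to the more frequent symbols:

# A made-up prefix code for the message "aabacada": 'a' occurs most
# often, so it receives the shortest code word.
code = {'a': '0', 'b': '10', 'c': '110', 'd': '111'}

message = "aabacada"
encoded = ''.join(code[ch] for ch in message)
print(encoded)   # '0010011001110' -> 13 bits, versus 16 bits with a fixed 2-bit block code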

Compression

          
The technique works by creating a binary tree of nodes. These can be stored in a regular array, the size of which depends on the number of symbols.
A node can be either a leaf node or an internal node. Initially, all nodes are leaf nodes, each containing the symbol itself, the weight of the symbol and, optionally, a link to a parent node, which makes it easy to read the code (in reverse) starting from a leaf node. Internal nodes contain a weight, links to two child nodes and the optional link to a parent node. As a common convention, bit '0' represents following the left child and bit '1' represents following the right child.
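In code, such a node might look like the following minimal sketch (the class and field names are my own, chosen to mirror the description above):

class Node:
    """A Huffman-tree node: a leaf stores a symbol, an internal node
    stores links to its two children; both carry a weight."""

    def __init__(self, weight, symbol=None, left=None, right=None, parent=None):
        self.weight = weight    # frequency (or probability) of the subtree
        self.symbol = symbol    # set only for leaf nodes
        self.left = left        # child reached by bit '0'
        self.right = right      # child reached by bit '1'
        self.parent = parent    # optional back link, useful for reading codes in reverse

    def is_leaf(self):
        return self.left is None and self.right is None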

The simplest construction algorithm uses a priority queue where the node with the lowest probability is given the highest priority (a sketch in Python follows the list below):

  1. Create a leaf node for each symbol and add it to the priority queue.
  2. While there is more than one node in the queue:
    1. Remove the two nodes of highest priority (lowest probability) from the queue.
    2. Create a new internal node with these two nodes as children and with probability equal to the sum of the two nodes' probabilities.
    3. Add the new node to the queue.
  3. The remaining node is the root node and the tree is complete.
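A minimal sketch of these three steps in Python, using the heapq module as the priority queue and the hypothetical Node class sketched above (a counter breaks ties so that nodes with equal weights never need to be compared directly):

import heapq
from itertools import count

def build_huffman_tree(frequencies):
    """frequencies: dict mapping symbol -> weight. Returns the root Node."""
    tiebreak = count()   # avoids comparing Node objects when weights are equal
    heap = [(w, next(tiebreak), Node(w, symbol=s)) for s, w in frequencies.items()]
    heapq.heapify(heap)                   # step 1: every leaf enters the priority queue

    while len(heap) > 1:                  # step 2: merge until one node remains
        w1, _, n1 = heapq.heappop(heap)   # the two nodes of lowest weight
        w2, _, n2 = heapq.heappop(heap)
        merged = Node(w1 + w2, left=n1, right=n2)
        n1.parent = n2.parent = merged
        heapq.heappush(heap, (merged.weight, next(tiebreak), merged))

    return heap[0][2]                     # step 3: the remaining node is the root

Walking the finished tree and emitting '0' for each left branch and '1' for each right branch then yields the code word for every leaf.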

If the symbols are sorted by probability, there is a linear-time (O(n)) method to create a Huffman tree using two queues: the first contains the initial weights (along with pointers to the associated leaves), while combined weights (along with pointers to the trees) are put in the back of the second queue. This ensures that the lowest weight is always kept at the front of one of the two queues (a sketch follows the list below):

  1. Start with as many leaves as there are symbols.
  2. Enqueue all leaf nodes into the first queue (by probability in increasing order so that the least likely item is in the head of the queue).
  3. While there is more than one node in the queues:
    1. Dequeue the two nodes with the lowest weight by examining the fronts of both queues.
    2. Create a new internal node, with the two just-removed nodes as children (either node can be either child) and the sum of their weights as the new weight.
    3. Enqueue the new node into the rear of the second queue.
  4. The remaining node is the root node; the tree has now been generated. 
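Assuming the leaves are already sorted by increasing weight, this two-queue construction can be sketched with collections.deque, again using the hypothetical Node class from above:

from collections import deque

def build_huffman_tree_sorted(sorted_leaves):
    """sorted_leaves: leaf Nodes in increasing order of weight. Returns the root."""
    leaves = deque(sorted_leaves)   # first queue: the initial leaves
    internal = deque()              # second queue: combined (internal) nodes

    def pop_lightest():
        # The lightest remaining node is always at the front of one of the two queues.
        if not internal or (leaves and leaves[0].weight <= internal[0].weight):
            return leaves.popleft()
        return internal.popleft()

    while len(leaves) + len(internal) > 1:
        n1 = pop_lightest()
        n2 = pop_lightest()
        merged = Node(n1.weight + n2.weight, left=n1, right=n2)
        n1.parent = n2.parent = merged
        internal.append(merged)     # combined weights go to the rear of the second queue

    return pop_lightest()           # the single remaining node is the root

Because each node is enqueued and dequeued only once, the loop runs in linear time once the initial sort is available.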




Decompression

               
The process of decompression is simply a matter of translating the stream of prefix codes to individual byte values, usually by traversing the Huffman tree node by node as each bit is read from the input stream. Before this can take place, however, the Huffman tree must somehow be reconstructed. In the simplest case, where character frequencies are fairly predictable, the tree can be preconstructed and thus reused every time, at the expense of at least some measure of compression efficiency.
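A sketch of that bit-by-bit traversal, assuming a string of '0'/'1' characters produced by the matching encoder and the Node class from the compression section:

def decode(bits, root):
    """Translate a stream of prefix codes back into symbols by walking the
    Huffman tree from the root for every code word."""
    symbols = []
    node = root
    for bit in bits:
        node = node.left if bit == '0' else node.right
        if node.is_leaf():      # reaching a leaf ends one code word
            symbols.append(node.symbol)
            node = root         # restart at the root for the next code word
    return symbols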


Huffman coding implemented in Python can be downloaded from here.




