CSCI 235 Software Analysis & Design Spring, 2003

Instructor: Szenher

Programming Assignment #7: Huffman Trees

Due Date: May 14, 2003

Objectives:

Problem Specification: In every data stream there are characters that appear more often than others. The Huffman algorithm generates a binary tree (a Huffman tree) using the frequencies of characters in a data stream. In a Huffman tree every leaf node represents a character and every non-leaf node represents the sum of frequencies of its children. By traversing from the root to the tree to a leaf node we can generate a binary code to represent the character in the leaf.  More frequently occurring characters are stored in nodes closer to the root; less frequently occurring characters are stored in nodes farther from the root.  Therefore, more frequently occurring characters have shorter binary code lengths and vice versa. Huffman coding is typically used for data compression, but in this assignment we will only generate a Huffman tree and use it to decode compressed data.

Implementation:  You will be given a partial implementation of a program which:

1.       Generates a Huffman tree from given text

2.       Outputs each character and its associated binary code (the binary code will be attained from the Huffman Tree) in the form similar to Figure 3 below

3.       Reads “compressed”  text, solely composed of 1 and 0, and outputs the decoded characters to another file (described below)

To Generate a Huffman tree, you must first create a list of character and their frequencies. A HuffmanTree is composed of HuffmanTreeNodes. You will be using HuffmanTreeNodes to store a character, the character’s frequency (along with left and right pointers to the node’s children).

An example of how to generate the tree

For the particular data stream of ten characters: “PANAMANIAN

Figure 1

Character

Frequency

A

4

N

3

P

1

I

1

M

1

1.       We first evaluate how frequently each character occurs in the data stream. Then we sort the frequencies and choose the smallest two.

2.       Create a new tree node that will have the sum of the two smallest frequencies (2). Set the two smallest to be children of the new tree node.

3.       Sort the nodes and choose the smallest two.

4.       Create a new tree node (3) and set the two smallest to be children of the new node.

5.       Sort the nodes and choose the smallest two. Create a new tree node (6) and set the two smallest to be children of the new node.

6.       Sort the nodes and choose the smallest two. Create a new tree node (10) and set the two smallest to be children of the new node.
Notice: we reach 10 and we have used up all the nodes, the tree is complete

When the tree is complete, we can traverse from the root to each leaf to generate the code.

 

Figure 3

Code Table

Character

Code

A

0

N

11

P

100

I

1010

M

1011

Figure 2

Once the Huffman tree has been built, it can be used to decode compressed text. Decoding is a  tree traversal problem.  Compressed text (stored as bits) can be thought of as a set of branching instructions:  a zero-bit means "traverse to the left child" and a one-bit signals a branch to the right child.  When a leaf is reached, the character stored in the leaf should be printed and traversal begins again from the root of the tree, picking up at the next bit in the compressed text.

For example, the compressed text 110100 can be decoded as follows:

Figure 4

Read 1

Go right

-

Read 1

Go right

“N”  more to read -> go back to root

Read 0

Go left

“A”  more to read -> go back to root

Read 1

Go right

-

Read 0

Go left

-

Read 0

Go left

“P” no more to read -> stop

Thus, 110100, decompressed, is "NAP".

A side note: you may be thinking "How is it that 110100 is the compressed version of NAP? Isn't 110100 longer than NAP?" The answer: no. NAP consists of three characters and the storage of each character requires eight bits (in most systems) for a total of 24 bits. The encoded version (110100) is only six bits long.


As stated earlier, you will be given a partial implementation of a program. The program consists of:

The main program file [ main.cpp ]
The main program is fully implemented and documented and has two parts. The first part will generate a list of characters and their associated frequencies from a given text file, text.txt.  Each element in the list is a pointer to a HuffmanTreeNode. The type of list we are using is a vector.  If you are unfamiliar with vectors read this. The vector of HuffmanTreeNodePtr (HuffmanTreeNode pointers) will then be used to construct a HuffmanTree. When the tree is constructed you will use it to print the code generated by the tree to another file: codetable.txt.
The second part will use the tree to decode the compressed text in the file, codedtext.txt and output the decoded string to, decodedtext.txt. (Don't forget to copy text.txt and codedtext.txt into your working directory.)

HuffmanTree Class [ HuffmanTree.h and HuffmanTree.cpp ]
This class defines the Huffman tree. It is here that you will have to complete the implementation.  The tree is constructed from a vector of HuffmanTreeNodePtr and calls a method BuildHuffmanTree. In BuildHuffmanTree, you should implement the algorithm diagramed above. BuildTree calls three methods: Sort (this method is implemented for you) which sorts the nodes in the vector, MergeTwoSmallestIntoSubtree which will find the two nodes with the smallest frequencies and join them with a parent that will contain the sum of their frequencies, and CopyTree. CopyTree will be invoked when the Huffman tree is complete. CopyTree will copy the tree formed in the vector to the root of the HuffmanTree; CopyTree will operator recursively. In addition to MergeTwoSmallestIntoSubtree and CopyTree, you will have to implement three other methods. ClearTree will deallocate all the nodes of a tree recursively.  PrintCode, will recursively traverse the generated tree and print the characters and their associated code.  DecodeAndPrint will be a recursive algorithm that reads a binary data stream (of 1's and 0's) and decodes one character from it (as is done multiple times in the example above).

HuffmanTreeNode Class [ HuffmanTreeNode.h and HuffmanTreeNode.cpp ]
This class defines the Huffman tree nodes. Leaf nodes store a character, that character’s frequency. Internal nodes store the sum of the frequencies in the node's descendants, and pointers to the left and right children of the node. The class is fully implemented and documented.

Required Output:

  1. A code table similar to that given in Figure 3 for the Huffman tree generated from the text in text.txt.
  2. The message stored in codedtext.txt, decoded.