CSCI
235 Software Analysis & Design Spring, 2003
Instructor: Szenher
Programming Assignment #7: Huffman
Trees
Due
Date: May 14, 2003
Objectives:
Problem
Specification: In
every data stream there are characters that appear more often than others. The
Huffman algorithm generates a binary tree (a Huffman tree) using the
frequencies of characters in a data stream. In a Huffman tree every leaf node
represents a character and every non-leaf node represents the sum of
frequencies of its children. By traversing from the root to the tree to a leaf
node we can generate a binary code to represent the character in the leaf. More frequently occurring characters are
stored in nodes closer to the root; less frequently occurring characters are
stored in nodes farther from the root.
Therefore, more frequently occurring characters have shorter binary code
lengths and vice versa. Huffman coding is typically used for data compression,
but in this assignment we will only generate a Huffman tree and use it to
decode compressed data.
Implementation: You will be given a partial implementation
of a program which:
1.
Generates a Huffman tree from given text
2.
Outputs each character and its associated binary code (the
binary code will be attained from the Huffman Tree) in the form similar to
Figure 3 below
3.
Reads “compressed”
text, solely composed of 1 and 0, and outputs the decoded characters to
another file (described below)
To
Generate a Huffman tree, you must first create a list of character and their
frequencies. A HuffmanTree is composed of HuffmanTreeNodes. You will be using
HuffmanTreeNodes to store a character, the character’s frequency (along with
left and right pointers to the node’s children).
|
An example of how to generate the tree For
the particular data stream of ten characters: “PANAMANIAN”
1.
We first evaluate how frequently each character occurs in the data stream.
Then we sort the frequencies and choose the smallest two. 2.
Create a new tree node that will have the sum of
the two smallest frequencies (2). Set the two smallest to be children of the
new tree node. 3.
Sort the nodes and choose the smallest two. 4.
Create a new tree node (3) and set the two smallest
to be children of the new node. 5.
Sort the nodes and choose the smallest two. Create
a new tree node (6) and set the two smallest to be children of the new node. 6.
Sort the nodes and choose the smallest two. Create
a new tree node (10) and set the two smallest to be children of the new node. When the tree is complete, we can traverse from the
root to each leaf to generate the code.
|
Figure
2 |
|||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||
Once
the Huffman tree has been built, it can be used to decode compressed text.
Decoding is a tree traversal
problem. Compressed text (stored as bits) can be
thought of as a set of branching instructions:
a zero-bit means "traverse to the left child" and a one-bit
signals a branch to the right child.
When a leaf is reached, the character stored in the leaf should be printed and
traversal begins again from the root of the tree, picking up at the next bit in the
compressed text.
For
example, the compressed text 110100 can be decoded as follows:
|
Figure 4 |
||
|
Read
1 |
Go
right |
- |
|
Read
1 |
Go
right |
“N” more to read -> go back to root |
|
Read
0 |
Go
left |
“A” more to read -> go back to root |
|
Read
1 |
Go
right |
- |
|
Read
0 |
Go
left |
- |
|
Read
0 |
Go
left |
“P”
no more to read -> stop |
Thus, 110100, decompressed, is "NAP".
A side note: you may be thinking "How is it that
110100 is the compressed version of NAP? Isn't 110100 longer than NAP?" The answer:
no. NAP consists of three characters and the storage of each character requires
eight bits (in most systems) for a total of 24 bits. The encoded version (110100) is
only six bits long.
As
stated earlier, you will be given a partial implementation of a program. The
program consists of:
The main program file [ main.cpp ]
The main program is fully implemented and documented and has two parts. The
first part will generate a list of characters and their associated frequencies
from a given text file, text.txt. Each element in the list is a pointer to a
HuffmanTreeNode. The type of list we are using is a vector. If you are unfamiliar with vectors read this. The vector of HuffmanTreeNodePtr
(HuffmanTreeNode pointers) will then be used to construct a HuffmanTree. When
the tree is constructed you will use it to print the code generated by the tree
to another file: codetable.txt.
The second part will use the tree to decode the compressed text in the file, codedtext.txt
and output the decoded string to, decodedtext.txt. (Don't forget to copy text.txt and codedtext.txt into your
working directory.)
HuffmanTree Class [ HuffmanTree.h and HuffmanTree.cpp ]
This class defines the Huffman tree. It is
here that you will have to complete the implementation. The tree is constructed from a vector of
HuffmanTreeNodePtr and calls a method BuildHuffmanTree. In BuildHuffmanTree, you should
implement the algorithm diagramed above. BuildTree calls
three methods: Sort (this method is implemented for you) which sorts the nodes
in the vector, MergeTwoSmallestIntoSubtree which will find the two nodes with
the smallest frequencies and join them with a parent that will contain the sum of their
frequencies, and CopyTree. CopyTree will be invoked when the Huffman tree is
complete.
CopyTree will copy the tree formed in the vector to the root of the
HuffmanTree; CopyTree will operator recursively. In addition to MergeTwoSmallestIntoSubtree and CopyTree, you will
have to implement three other methods. ClearTree will deallocate all
the nodes of a tree recursively. PrintCode,
will recursively traverse the generated tree and print the characters and their associated
code. DecodeAndPrint will be a recursive
algorithm that reads
a binary data stream (of 1's and 0's) and decodes one character from it (as is done
multiple times in the
example above).
HuffmanTreeNode Class [ HuffmanTreeNode.h
and HuffmanTreeNode.cpp
]
This class defines the Huffman tree nodes. Leaf nodes store a character, that
character’s frequency. Internal nodes store the sum of the frequencies in the node's
descendants, and pointers to the left and right children of the node. The class is
fully implemented and documented.
Required Output: