OverviewIn the age of big data, there is a constant struggle with regard to storage. For example, a...

Question

OverviewIn the age of big data, there is a constant struggle with regard to storage. For example, a single Whole Genome Sequencing experiment can generate over 100 GB of raw data. For years, computer...

Overview

In the age of big data, there is a constant struggle with regard to storage. For example, a single Whole Genome Sequencing experiment can generate over 100 GB of raw data. For years, computer scientists have attempted to reduce the storage size of files without losing any information. In this assignment, we will be implementing our own file compression tool using the Huffman algorithm.

Task: File Compression and Uncompression

In this project, "all" you have to do is implement two programs: compress and uncompress. The usage should be as follows:

./compress

./uncompress

For grading purposes, you are guaranteed that original_file will be at most 10 MB.

File Compression

The compress program will take as input an arbitrary file (original_file) and will use Huffman Compression to create a compressed version of the file (compressed_file):

Construct a Huffman code for the contents of original_file

Use the constructed Huffman code to encode original_file

Write the results (with a header to allow for uncompression) to compressed_file

File Uncompression

The uncompress program will then take as input a compressed file created by your compress program (compressed_file) and will uncompress the file (uncompressed_file):

Construct a Huffman code by reading the header of compressed_file

Use the constructed Huffman code to decode compressed_file

Write the results to uncompressed_file

Note that uncompressed_file must be completely identical to original_file! In other words, calling diff should return no differences:

diff uncompressed_file original_file # this should not output anything

Reference Solution

You have been provided two programs (refcompress and refuncompress), which are a reference solution implementation of this Project. You can use them to help guide you. Note that the reference binaries were compiled to run on Ed, so if you attempt to run them on a different architecture, they will most likely not work.

Note: Your compress is not expected to work with refuncompress, and your uncompress is not expected to work with refcompress. The provided reference executable files are a matched pair.

Compiling and Running Your Code

We have provided a Makefile, but you are free to modify anything in the starter code that you'd like (including the Makefile and any of the .cpp or .hpp files) however you wish. Our only requirements are the following:

We need to be able to compile your code using the make command

Once compiled, we need to be able to run your compress and uncompress programs as described above

Project Workflow

As mentioned in the overview, as long as we can (1) compile your program via make, (2) run your program as described, and (3) successfully compress + uncompress a file, you are absolutely free to implement the project any way you think is best. However, here are some suggestions just in case you feel a bit lost.

Suggested Development Process

Before doing anything, it is a good idea to create your compress.cpp and uncompress.cpp with main() functions and to get your project files to the point that you can compile successfully with make (even though the compress and uncompress executables don't do anything). Once you get things compiling, start working on compress, and once you are confident that you have that working reasonably well, start working on uncompress. Use print statements, gdb, etc. as you develop to make sure you're implementing things correctly.

Suggested Control Flow for compress

Parse the command line arguments and throw an error message if the user runs your program incorrectly

Open the input file for reading

Read bytes from the file. Count the number of occurrences of each byte value

Use the byte counts to construct a Huffman coding tree. Each unique byte with a non-zero count will be a leaf node in the Huffman tree

Open the output file for writing

Write enough information (a "file header") to the output file to enable the coding tree to be reconstructed when the file is read by your uncompress program

Move back to the beginning of the input file to be able to read it, again

Using the Huffman coding tree, translate each byte from the input file into its code, and append these codes as a sequence of bits to the output file, after the header

Close the input and output files (note that this is handled for you; see Design Notes)

Suggested Control Flow for uncompress

Open the input file for reading

Read the file header at the beginning of the input file, and use it to reconstruct the Huffman coding tree

Open the output file for writing

Using the Huffman coding tree, decode the bits from the input file into the appropriate sequence of bytes, writing them to the output file

Close the input and output files (note that this is handled for you; see Design Notes)

Potential Development Process

If you choose to use the code structure we have laid out in the starter, one reasonable potential development process is the following:

Create compress.cpp and uncompress.cpp with main() functions that don't do anything for now

As you implement the components described below, you can use your main() functions in compress.cpp and uncompress.cpp to test them out one-by-one

Create HCTree.cpp and add skeletons of the HCTree functions you will need to implement

Refer to HCTree.h to see what function signatures are declared in the HCTree class

Implement the HCTree::build function to construct a Huffman Tree from symbol frequencies

You will want to use gdb or print statements to help you trace your tree to make sure it matches what you would get if you constructed it manually by hand (on small example files)

Implement the HCTree destructor

Implement the HCTree::encode function to encode a given symbol

Implement the HCTree::decode function

Once you are confident that all of the individual components are working properly, begin to piece them together in the main functions of compress.cpp and uncompress.cpp according to the suggested control flows above

Incorporate one step of the control flow, then test it thoroughly to make sure things work to that point, then incorporate the next step, test thoroughly, etc.

How you choose to implement your compressed file header is up to you, but we have some recommendations in the "Design Notes" section of this write-up, so be sure to refer to that

cs-lwg2wyis.zip

Answered Same DayMar 07, 2023

Nidhi · Accepted Answer

Answer Attached Below:

OverviewIn the age of big data, there is a constant struggle with regard to storage. For example, a single Whole Genome Sequencing experiment can generate over 100 GB of raw data. For years, computer...

Answer To: OverviewIn the age of big data, there is a constant struggle with regard to storage. For example, a...

Answer To This Question Is Available To Download

Related Questions & Answers

Submit New Assignment