Information Theoretic Modeling | Projects | Project II

Project - De Compression Competition

Amin Alizadeh, Lokesh Nyati

Description of compression algorithm

The main compression algorithm used on all datasets was bzip2, applied through the bz2 module of the Python standard library. Below is a brief description of this algorithm, followed by our observations on each data file and the methods we tried, noting whether they succeeded and, where they did not, the flaw in the method as we understood it.
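
The baseline compressor is essentially a thin wrapper around the standard library. A minimal sketch is shown below; the file names are placeholders, and the real submission also has to fit the runtest.sh interface:

    import bz2

    def compress_file(in_path, out_path):
        # Read the raw data file and write a bzip2-compressed copy of it.
        with open(in_path, 'rb') as f:
            data = f.read()
        with open(out_path, 'wb') as f:
            f.write(bz2.compress(data, 9))      # compresslevel 9 uses the largest (900 kB) blocks

    def decompress_file(in_path, out_path):
        # Restore the original bytes from the compressed file.
        with open(in_path, 'rb') as f:
            compressed = f.read()
        with open(out_path, 'wb') as f:
            f.write(bz2.decompress(compressed))

    if __name__ == '__main__':
        compress_file('curve1.dat', 'curve1.dat.bz2')        # hypothetical file names
        decompress_file('curve1.dat.bz2', 'curve1.restored.dat')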

Bzip2 algorithm: Bzip2 uses several compression techniques stacked on top of each other, applied in the following order:

1. Run-length encoding: Any sequence of 4 to 255 consecutive duplicate symbols is replaced by the first four symbols and a repeat length between 0 and 251. Runs of symbols are always transformed after four consecutive symbols, even if the run-length is set to zero, to keep the transformation reversible (a small sketch of this step is given after the list).
2. Burrows-Wheeler transform (BWT): This is the reversible block-sort that is at the core of bzip2. For the block-sort, a (notional) matrix is created in which row i contains the whole of the buffer, rotated to start from the i-th symbol. Following rotation, the rows of the matrix are sorted into alphabetic (numerical) order. A 24-bit pointer is stored marking the starting position for when the block is untransformed (a naive sketch of this step and the following move-to-front step is given after the list).
3. Move to Front: Each of the symbols in use in the document is placed in an array. When a symbol is processed, it is replaced by its location (index) in the array and that symbol is shuffled to the front of the array. The effect is that immediately recurring symbols are replaced by zero symbols (long runs of any arbitrary symbol thus become runs of zero symbols), while other symbols are remapped according to their local frequency.
4. Run-length encoding (RLE): long strings of repeated symbols in the output (normally zeros by this time) are replaced by a combination of the symbol and a sequence of two special codes, RUNA and RUNB.
5. Huffman coding: The standard Huffman coding algorithm is used to assign codewords to the symbols (a sketch of the basic construction is given after the list).
6. Multiple Huffman tables: several identically-sized Huffman tables can be used with a block if the gain from using them is greater than the cost of including the extra table.
7. Unary (base-1) encoding: if multiple Huffman tables are in use, the selection of each table (numbered 0..5) is done from a list by a zero-terminated bit run between one (1) and six (6) bits in length. The selection is into an MTF list of the tables.
8. Delta encoding (Δ): Huffman-code bit lengths are required to reconstruct each of the used canonical Huffman tables. Each bit length is stored as an encoded difference against the previous code bit length.
9. Sparse bit array: a bitmap is used to show which symbols are used inside the block and should be included in the Huffman trees. A sparse method is used: the 256 symbols are divided into 16 ranges, and a 16-bit array is included only for the ranges in which some symbol is actually used.
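
As an illustration of step 1, here is a rough Python sketch of the initial run-length encoding. This is our simplified reading of the scheme described above, not bzip2's exact on-disk format:

    def initial_rle(data):
        # Any run of 4 to 255 identical bytes becomes the first four bytes
        # followed by one count byte holding the number of extra repeats (0..251).
        out = bytearray()
        i = 0
        while i < len(data):
            run = 1
            while i + run < len(data) and data[i + run] == data[i] and run < 255:
                run += 1
            if run >= 4:
                out.extend(data[i:i + 4])   # literal copy of the first four symbols
                out.append(run - 4)         # repeat length in the range 0..251
            else:
                out.extend(data[i:i + run])
            i += run
        return bytes(out)

    print(initial_rle(b'aaaabbbbbbx'))      # -> b'aaaa\x00bbbb\x02x'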
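
Steps 2 and 3 can be sketched naively as follows. The full rotation matrix is built explicitly here only for readability; a real implementation works on bounded blocks and uses a proper block-sorting routine:

    def bwt(block):
        # Sort all rotations of the block and output the last column of the
        # sorted matrix; the returned index plays the role of the 24-bit pointer.
        n = len(block)
        order = sorted(range(n), key=lambda i: block[i:] + block[:i])
        last_column = bytes(block[(i - 1) % n] for i in order)
        pointer = order.index(0)            # row where the original block ended up
        return last_column, pointer

    def mtf_encode(data):
        # Replace each byte by its index in a recency list and move it to the
        # front; immediately recurring symbols therefore become zeros.
        recency = list(range(256))
        out = []
        for b in data:
            idx = recency.index(b)
            out.append(idx)
            recency.pop(idx)
            recency.insert(0, b)
        return out

    last, ptr = bwt(b'banana')
    print(last, ptr)                        # -> b'nnbaaa' 3
    print(mtf_encode(last)[:4])             # -> [110, 0, 99, 99]; repeats become zeros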
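
For step 5, the basic Huffman construction looks roughly like this. It is the generic textbook version; bzip2 additionally limits code lengths and stores the tables in canonical form, which is not shown:

    import heapq
    from collections import Counter

    def huffman_code(data):
        # Repeatedly merge the two least frequent subtrees; symbols in one
        # subtree get a '0' prepended to their codeword, the other a '1'.
        counts = Counter(data)
        if len(counts) == 1:                          # degenerate one-symbol input
            return {sym: '0' for sym in counts}
        heap = [(c, i, {sym: ''}) for i, (sym, c) in enumerate(counts.items())]
        heapq.heapify(heap)
        next_id = len(heap)                           # tie-breaker for the heap
        while len(heap) > 1:
            c1, _, left = heapq.heappop(heap)
            c2, _, right = heapq.heappop(heap)
            merged = {s: '0' + w for s, w in left.items()}
            merged.update({s: '1' + w for s, w in right.items()})
            heapq.heappush(heap, (c1 + c2, next_id, merged))
            next_id += 1
        return heap[0][2]                             # symbol -> codeword string

    # More frequent byte values receive shorter codewords.
    print(huffman_code(b'abracadabra'))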


Datasets:
1. Curve1.dat - Observations :- Unfortunately, we could not find any relation between the data and the side data, so we applied nothing beyond the basic compression. We tried to extrapolate the data with a mixture of functions; with some manipulation of the coefficients we got reasonably close fits, but nothing exact, and the fits were not good enough to justify including them in the compression algorithm.
2. Curve2.dat - Observations :- The situation was the same as with Curve1.dat.
3. Info.xpm - Observation :- Nothing special. The plain compression already gave a pretty good compression ratio, so we did not investigate this file any further.
4. Letter.dat - Observation :- Given the feature parameters in the side data, we considered a k-NN classifier to predict the letters (a minimal sketch follows this list). Since the letters are not independent of the features, and a k-NN classifier with a training set of 4000 samples gives a hit rate of over 75%, its predictions could have compressed the data further. Although the classifier was easy to implement (and we did implement it), we did not incorporate it into the final code because of the computation time: our implementation could not meet the one-hour limit. We brought the running time down to just over an hour, but not below it. We also tried arithmetic coding, since the symbol frequencies were already given, but we could not produce an efficient implementation of that either.
5. Mushroom.dat - Observation :- This file was easy to compress because the generating rules were clearly given in the side data. We simply implemented the rules recursively and were able to regenerate the data file from scratch.
6. Rat.txt - Observation :- This was a binary string, so we did not look into it much and just used the plain compression algorithm on it.
7. Forest.dat - Observation :- This data consisted of features that are independent of each other. The cover types were roughly normally distributed, so we thought of mapping the data to binary codes whose codelengths depend on the distribution. This would produce a sparse binary string with many more 0s than 1s (or vice versa). We then tried an algorithm that counts runs of zeros and assigns codewords to the run lengths. Although easy to comprehend, this proved much harder to implement; the biggest challenge was deciding on the number of codewords. After several attempts to find an efficient scheme, we turned to classification methods. A naive Bayes classifier (NBC) seemed perfect, as the features are independent of each other (a rough sketch follows this list), but once again the computation time was a hurdle: we could not get the running time under one hour, so in the end we used the base algorithm.
8. Extra data - We did not try to compress these files.
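
The k-NN idea for Letter.dat can be sketched as follows; the feature vectors and labels below are made up purely for illustration, and this naive all-pairs version has exactly the quadratic cost that kept our own implementation over the time limit:

    from collections import Counter

    def knn_predict(train_x, train_y, query, k=5):
        # Naive k-nearest-neighbours: squared Euclidean distance to every
        # training vector, then a majority vote among the k closest labels.
        dists = []
        for x, y in zip(train_x, train_y):
            d = sum((a - b) ** 2 for a, b in zip(x, query))
            dists.append((d, y))
        dists.sort()
        votes = Counter(label for _, label in dists[:k])
        return votes.most_common(1)[0][0]

    # Toy data: integer attribute vectors with their letter labels.
    train_x = [[2, 8, 3, 5], [5, 12, 3, 7], [4, 11, 6, 8]]
    train_y = ['A', 'B', 'B']
    print(knn_predict(train_x, train_y, [5, 11, 4, 7], k=1))   # -> B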
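
The naive Bayes classifier we considered for Forest.dat could look roughly like the sketch below; for simplicity the features are treated as categorical values with add-one smoothing, and the toy rows bear no relation to the real columns of the data:

    import math
    from collections import Counter, defaultdict

    def nbc_train(rows, labels):
        # Count class frequencies and, for each (class, feature position),
        # how often every feature value occurs.
        class_counts = Counter(labels)
        value_counts = defaultdict(Counter)          # (class, position) -> Counter
        for row, label in zip(rows, labels):
            for pos, value in enumerate(row):
                value_counts[(label, pos)][value] += 1
        return class_counts, value_counts

    def nbc_predict(model, row):
        # Choose the class maximising log P(class) + sum_i log P(value_i | class),
        # with a simple add-one smoothing so unseen values keep a small probability.
        class_counts, value_counts = model
        total = sum(class_counts.values())
        best_label, best_score = None, float('-inf')
        for label, count in class_counts.items():
            score = math.log(count / total)
            for pos, value in enumerate(row):
                seen = value_counts[(label, pos)]
                score += math.log((seen[value] + 1) / (count + len(seen) + 1))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    # Toy usage with made-up categorical attributes and cover types.
    rows = [('high', 'dry'), ('high', 'wet'), ('low', 'wet'), ('low', 'dry')]
    labels = ['pine', 'pine', 'spruce', 'spruce']
    model = nbc_train(rows, labels)
    print(nbc_predict(model, ('low', 'wet')))        # -> spruce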

More information on the various compression algorithms used in the bzip2 stack and on the classification methods we tried can be found on the web, especially on en.wikipedia.org, and in the book The Elements of Statistical Learning by Trevor Hastie et al.

Source code

Note: The programs must compile in Linux and conform to the runtest.sh script in the template.