Project I: Compressing Coin Flips

Marija Furdek, Michael Duku Kaakyire, Anupam Arohi

 

 

 

Introduction

 

The coin flips are encoded as binary strings and stored in files. We treat this binary representation as a stream of symbols (1 byte each) and apply the algorithm to those symbols. As a consequence, the algorithm works on character symbols rather than on individual bits, which is why we were unable to address the biased version of the compression task.

 

High-level description

 

Our solution is an implementation of adaptive arithmetic coding. The general concept of arithmetic coding is to divide the probability interval [0, 1) into subintervals whose sizes are proportional to the probabilities of the individual symbols. At each step the current interval is narrowed according to the symbol just read, using the following algorithm:

 

Set low to 0.0
Set high to 1.0
While there are still input symbols do
    get an input symbol
    range = high - low
    high = low + range * high_range(symbol)
    low = low + range * low_range(symbol)
End of While
output low
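
As an illustration only (not the project's actual arithmetic_codec code), a minimal floating-point sketch of this update for a two-symbol alphabet might look as follows; the function names and the fixed probability p0 are assumptions for the example. It runs out of precision very quickly, which is exactly the problem addressed later.

// Illustrative floating-point arithmetic encoder for a binary alphabet.
// Hypothetical names; practical implementations use integer arithmetic instead.
#include <iostream>
#include <string>

double encode(const std::string& bits, double p0) {   // p0 = probability of '0'
    double low = 0.0, high = 1.0;
    for (char c : bits) {
        double range = high - low;
        if (c == '0') {                 // '0' owns the subinterval [0, p0)
            high = low + range * p0;
        } else {                        // '1' owns the subinterval [p0, 1)
            low = low + range * p0;
        }
    }
    return low;                         // any number in [low, high) identifies the string
}

int main() {
    std::cout << encode("0110", 0.5) << "\n";   // prints 0.375
}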
 

The encoding algorithm has two parts: the probability model and the arithmetic encoder. The model reads the next symbol from the input stream and invokes the encoder, sending it the symbol and the two required cumulative frequencies. The model then increments the count of the symbol and updates the cumulative frequencies. The point is that the symbol’s probability is determined by the model from its old count, and the count is incremented only after the symbol has been encoded. This makes it possible for the decoder to mirror the encoder’s operations. The encoder knows what the symbol is even before it is encoded, but the decoder has to decode the symbol in order to find out what it is. The decoder can therefore use only the old counts when decoding a symbol.

Once the symbol has been decoded, the decoder increments its count and updates the cumulative frequencies in exactly the same way as the encoder (a minimal sketch of this model-update discipline is given after the list below). In a nutshell, the basic operations used by the compressor are:

·         Interval update and arithmetic operations

·         Symbol encoding (interval search)

·         Interval renormalization

·         Carry propagation and bit moves

·         Probability estimation (source modeling)
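
As a sketch of the model/encoder interaction described above (hypothetical names, not the interface of arithmetic_codec.h), an adaptive byte model can be kept as follows: the encoder is always called with the counts as they were before the current symbol, and the count is incremented only afterwards, so the decoder can reproduce exactly the same state.

// Adaptive order-0 model over 256 byte symbols (illustrative sketch).
#include <cstdint>
#include <vector>

struct AdaptiveModel {
    std::vector<uint32_t> count;          // per-symbol counts
    AdaptiveModel() : count(256, 1) {}    // start at 1 to avoid zero probabilities

    // Cumulative frequency of all symbols below s (the "low" cumulative frequency).
    uint32_t cum_below(int s) const {
        uint32_t c = 0;
        for (int i = 0; i < s; ++i) c += count[i];
        return c;
    }
    uint32_t total() const { return cum_below(256); }

    void update(int s) { ++count[s]; }    // called only AFTER the symbol is coded
};

// Encoding side (the decoder mirrors this exactly):
//   uint32_t lo = model.cum_below(sym);        // old counts
//   uint32_t hi = lo + model.count[sym];
//   encoder.encode(lo, hi, model.total());     // hypothetical encoder call
//   model.update(sym);                         // increment only after coding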


We faced challenges in scaling the algorithm to medium and large files: the basic method runs out of precision very quickly. It is therefore optimized using various methods, one of which is shifting out the high-order bits regularly, which simulates unlimited precision with fixed-size registers. The most significant digit is shifted out as soon as it becomes the same in the lower and upper bounds, and the remaining digits are shifted to the left. For each symbol we then count its occurrences and, according to its probability, assign it an index, a high and low bound, and a high and low cumulative frequency.
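
The renormalization step described above is typically implemented roughly as follows. This is only a sketch with hypothetical names and 32-bit registers; the actual logic lives in arithmetic_codec.cpp and may differ in detail.

// Shift settled most-significant bits out of the coder state (illustrative sketch).
#include <cstdint>
#include <iostream>

// Emit one bit plus any deferred (underflow) bits of the opposite value.
static void output_bit(int bit, uint32_t& pending_bits) {
    std::cout << bit;
    while (pending_bits > 0) { std::cout << (1 - bit); --pending_bits; }
}

static const uint32_t HALF = 0x80000000u, QUARTER = 0x40000000u;

static void renormalize(uint32_t& low, uint32_t& high, uint32_t& pending_bits) {
    for (;;) {
        if (high < HALF) {                        // both bounds start with bit 0
            output_bit(0, pending_bits);
        } else if (low >= HALF) {                 // both bounds start with bit 1
            output_bit(1, pending_bits);
            low -= HALF; high -= HALF;
        } else if (low >= QUARTER && high < HALF + QUARTER) {
            ++pending_bits;                       // bounds straddle 1/2: defer the bit
            low -= QUARTER; high -= QUARTER;
        } else {
            break;                                // top bits still undecided
        }
        low = 2 * low;                            // shift the settled digit out ...
        high = 2 * high + 1;                      // ... and pull a new one in
    }
}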

 

 

Analysis and test results

 

The table below gives the results of running the application on the provided test files:

 

File          Original Size (bytes)   Compressed Size (bytes)   Ratio
binom1.dat    10000                   5656                      1.77:1
binom2.dat    10000                   8108                      1.23:1
binom3.dat    10000                   9355                      1.07:1
binom4.dat    10000                   9984                      1.00:1
binom5.dat    10000                   10176                     0.98:1
binom6.dat    10000                   9986                      1.00:1
binom7.dat    10000                   9345                      1.07:1
binom8.dat    10000                   8070                      1.24:1
binom9.dat    10000                   5764                      1.73:1

 

We observed that in certain cases the program expanded the file instead of compressing it. One obvious solution (not implemented) is to compute the compression ratio and store the file uncompressed when compression is not fruitful; a sketch of such a fallback is given below. Further strategies could take advantage of known properties of the binary input, e.g. binary pattern repetition, the p function, etc. We learned a lot about the various approaches that can be taken and adopted; however, with our limited technical expertise and time, these ideas were not implemented.
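
Since this fallback was not implemented, the following is only a sketch of what it could look like, assuming a hypothetical arithmetic_compress() helper rather than the actual aczip code: the output starts with a one-byte flag saying whether the payload is compressed or stored raw.

// Store-raw fallback sketch (not part of the submitted aczip code).
#include <cstdint>
#include <vector>

// Hypothetical: returns the arithmetic-coded form of `input`.
std::vector<uint8_t> arithmetic_compress(const std::vector<uint8_t>& input);

// Prepend a 1-byte flag: 1 = compressed payload, 0 = raw payload.
std::vector<uint8_t> compress_or_store(const std::vector<uint8_t>& input) {
    std::vector<uint8_t> coded = arithmetic_compress(input);
    std::vector<uint8_t> out;
    if (coded.size() < input.size()) {          // compression was fruitful
        out.push_back(1);
        out.insert(out.end(), coded.begin(), coded.end());
    } else {                                    // expansion: store the raw bytes
        out.push_back(0);
        out.insert(out.end(), input.begin(), input.end());
    }
    return out;
}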

 

Application files:

 

Source:

1. arithmetic_codec.h

2. arithmetic_codec.cpp

3. acfile.cpp

 

Binary:

            aczip

 

Usage:

            compress:

     aczip -c <FileToBeCompressed> <CompressedFileName>

            decompress:

     aczip -d <CompressedFileName> <ExpandedFileName>

 

References

 

David Salomon, Data Compression: The Complete Reference, 3rd ed., pages 106-124

Mark Nelson, "Arithmetic Coding + Statistical Modeling = Data Compression"