Project I: Compressing Coin Flips

Deadline

Fri 12.10.

The task

This is a two-part project. In the first part task is to design a compressor (and a decompressor) program for lossless compression of strings that are generated from a known Bernoulli distribution (you can imagine them to be coin-flips with known bias towards heads or tails). The algorithm should have a parameter indicating the bias in the data to be compressed. The parameter need not be included in the compressed file. The empirical performance of this algorithm should be analyzed with data sets sampled from Bernoulli distributions with different but known biases.

The main task in the second part is to design, document and implement a compressor (and a decompressor) program whose performance is nearly as good as that of the Part I compressor in data sets sampled from Bernoulli distributions with unknown biases. In other words, the Part II compressor does not have a 'bias parameter'. Some test data sets are included in the template package. Both adaptive, i.e., ones that improve their behavior online, or batch algorithms are acceptable. All compression must be lossless meaning that your decompression programs must be able to restore the original file exactly as it was before compression.

Deliverables

Your documentation should describe the ideas/observations/principles underlying the two compressor algorithms. A complete project is a tar-file (all links should be relative and all files in the directory returned) including

ť high-level description (in English) of the compressors,
ť analysis and the results of the tests made,
ť program source for both compressor and decompressor programs,
ť instructions on how to run the programs in a simple test case

The programs must compile in Linux. No fancy interface is required. The form of the report is a WEB-page (the HTML template is included in the project template).

Download the project template

If you run into problems, email questions to Teemu & Jukka.

Hints

In order to open the template file say 'tar xzvf project1-template.tgz'.

As you will learn on the course, in the first part — when you are dealing with data sets with known distribution — there is a certain level of performance (expected code-length) that can not be beaten even in principle. There are certain solutions that achieve this performance up to a neglible margin.

The second part — the generating distribution is known to be within a fixed set of distributions — is a bit more interesting. The solution of the first part cannot be directly applied.




3 Concepts: Information	2007