Project in Information-Theoretic Modeling

582651
2
Algorithms and machine learning
Advanced studies
A programming project applying the techniques learned in the course Information-Theoretic Modeling, which is a prerequisite.
Year Semester Date Period Language In charge
2012 autumn 29.10-03.12. 2-2 English

Exercise groups

Group: 1
Time Room Instructor Date Observe
Tue 10-12 C220 Hannes Wettig 29.10.2012—07.12.2012

General

The course has been graded and the results are available. (21 December 2012)

This project course follows upon the lecture course 582650 Information-Theoretic Modeling. You need to have taken the lecture course or acquired the equivalent skills somewhere else. If you have not taken the lecture course and wish to participate in the project, please contact the instructor Hannes Wettig and explain how you have studied the topic.

You will also need good programming skills in a language of your choice. Programming is done in the department's Linux environment.

The course consists of a series of programming problems where the task is to compress a given data set. The work is expected to be done in groups of 2–3 students.

 

Learning objectives of the project

The purpose of the project is to deepen the skills and understanding gained on the lecture course by more hands-on work that also allows for more concentration on some specific example cases.

In particular, the student

  • gets practical experience of different coding methods, including implementation-level details
  • sees the usefulness of different models and algorithms in compressing data
  • puts concepts from the lecture course more closely into the context of applications in machine learning and data compression.

The exercises include applications of the Minimum Description Length (MDL) principle for model class selection, prediction and denoising problems.

 

Completing the course

Students are expected to work in groups of two or three throughout the project. To pass the course, the group must

  1. Turn in solutions to four programming tasks, due on weeks 2, 3, 4 and 6 of the course (see below for details). Each solution is a tar file which when unpacked includes a 'run.sh' script that produces all data that have been assigned so far. These packages have to be delivered by tuesday, 9 o'clock by mail (wettig@cs.helsinki.fi) such that they can be checked in time before class.
  2. Give a short presentation about their solutions to each programming task to other students
  3. Turn in a written final report covering all four problems.

All students in the same group receive the same grade for the course. The grade of a group is based on

  • Written end report (65% of grade)
  • Compression performance calculated from the sizes of the group's solutions to programming tasks, 1–4 (35% of grade).

The written report should show that the students are able to choose appropriate tools for compressing given data and can justify their choices. Where you choose to use readily available software, you must explain what it does and why this is appropriate for the task at hand.

For the first programming task, the data will be simple and students will be given the model from which is has been generated. The purpose of this task is to make sure that students get a good start and build up some tools they can use on later tasks where finding appropriate models is an essential part of the task. Therefore, for the first tasks students are required to use arithmetic coding. It is allowed (and even recommended) that you use an existing implementation of arithmetic coding, but you must make sure you understand how it works.

The second data set will be a simple time series. The third data set will be from a classification problem. The fourth data set will consist of image data.

You are allowed to use existing software packages, but in your written report you are required to explain the compression algorithms they are based on. You are also required to explain why you think a particular package is a better choice than using some of the methods you learned about in the course. A good explanation here would be based on actually trying something to find out how it works. We try to choose the data sets in such a way that it is possible to beat the performance of off-the-shelf packages by using coding algorithms we know from the course together with some relatively simple models of the data.

 

Results

There have been three groups. The package sizes in bytes (in case of problem 5 the error penalty) are listed below. The final score is a weighted sum of these numbers, namely 0.1*x1 + 0.2*x2 + 0.3*x3 + 0.4*x4/5 , where xi is the package size of the package for problem i. The final reports can also be found below.

Results
  problem 1 problem 2 problem 3 problem 4 problem 5 weighted sum report
Team A 9468 44629 43683 287183 189724 213740.3 http://www.cs.helsinki.fi/u/wettig/A.pdf
Team B 9344 77373 48062 291237 187155 222184.4 http://www.cs.helsinki.fi/u/wettig/B.pdf
Team C 16244 71258 79067 349815 243103 276743.3 http://www.cs.helsinki.fi/u/wettig/C.pdf