Information Theoretic Modelling Project

Report by Dmitrii Tisnek

콩나물 시루 | Bean Sprouts

Highlights

LZMA compression

File "letter.dat"

Programming language

Benchmarks of existing algorithms

Before embarking on implementing arithmetic coding or similar from scratch, I tested existing algorithms against the data sets in their raw form. Abbreviated results are here:

ppmd       32154
lzma       35940
ncompress  40768

One thing to note is that the difference between lzma and ncompress is not very large, even though there is a decade of research between them.

However, ppmd is not installed on CS computers by default, and inclusion of a ppmd binary would be a significant overhead. Therefore I chose to stay with lzma, which implements a dictionary-encoder front-end and a range-encoding back-end; the latter is equivalent to arithmetic coding.

My approach was to tackle the front-end, that is, to try to learn something about the data files and use that knowledge to represent the data in a more compact form, which is then passed to the existing back-end. Also noteworthy is that lzma, although a generic algorithm, works well on bytes and characters, a representation which is quite easy to work with, especially in scripting languages.
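A minimal sketch of this overall pipeline, where Python 3's lzma module stands in for the command-line tool actually used (the module is an assumption here, as is the hypothetical preprocess function):

    import lzma

    def compress_with_frontend(preprocess, raw_path, out_path):
        """Apply a data-set-specific front-end, then hand the result to the generic back-end."""
        with open(raw_path, "rb") as f:
            raw = f.read()
        compact = preprocess(raw)            # front-end: re-represent the data compactly
        with open(out_path, "wb") as f:
            f.write(lzma.compress(compact))  # back-end: dictionary coder + range coder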

Modern state of the art: the Prize for Compressing Human Knowledge

File "curve1.dat"

The data file is sorted by the side data file. Then each line is converted to a fixed-point (3.1) integer, and the value is increased by a constant ensuring that only positive integers remain. The resulting numbers are printed in fixed-width (4 characters) format without delimiters, and the resulting string is passed to the back-end.

Restoring the original data order is trivial in the decompressor: create an array of triplets (i, space for datum, side datum) where i is a monotonically increasing index, then sort this array by side data, insert the data, and restore the original order by sorting by i. Practical considerations are a stable sort algorithm and a compact implementation.
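A minimal sketch of both halves of this scheme (the offset constant below is a placeholder; the report does not state the exact value, and Python's sorted() is used because it is stable):

    SCALE, OFFSET = 10, 100     # 3.1 fixed point; the offset constant is an assumption

    def encode(values, side):
        # Compressor: sort by side data, convert to positive fixed-point integers,
        # print fixed-width with no delimiters.
        order = sorted(range(len(values)), key=lambda i: side[i])
        ints = [int(round(values[i] * SCALE)) + OFFSET for i in order]
        return "".join("%4d" % x for x in ints)

    def restore_order(sorted_values, side):
        # Decompressor: triplets (i, placeholder, side datum), stable-sorted by
        # side data, filled in, then sorted back by i.
        triples = [[i, None, s] for i, s in enumerate(side)]
        triples.sort(key=lambda t: t[2])
        for t, v in zip(triples, sorted_values):
            t[1] = v
        triples.sort(key=lambda t: t[0])
        return [t[1] for t in triples]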

This naïve approach already gave good results and made the compressed file quite small, so I chose to spend time trying to compress the larger files instead of improving the compression of this one.

Future research

An obvious improvement over my approach would be to create an analytical function for the range of data at any point in the side data, and another function for a p.m.f. of particular data values at any point in the side data.

One approach to this data set is similar to weather forecasting: given the sorted data, existing algorithms, or ideas from them, could be used.

File "curve2.dat"

Same as above, with an additional patch for the “-0.0” value.

File "info.xpm"

This file compresses well with lzma as it is, due to highly repetitive data. I chose not to spend too much time here, because the naively compressed file is already quite small.

Future research

Many alternatives come to mind; in the end the decompressor size will be the limiting factor. One option would be to generate the image in Python. Another, more scientific, would be to employ binary lambda calculus. Finally, the banner program already installed on CS machines could be used.

File "letter.dat"

The data file is a list of class labels and the side data file is a list of corresponding feature values.

Size of data file as is compressed with lzma: 12195.

Standard approach

A good way to compress this file would be to create a classifier that derives class labels from features. Such a classifier could be trained offline or online, and could be deterministic (that is, it predicts or mispredicts a label) or probabilistic (it outputs a set of possible labels with probabilities assigned). In the first case, corrections would have to be encoded; in the second, a choice from a p.m.f. would have to be taken and compressed with range coding.
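A minimal sketch of the deterministic variant, with a hypothetical predict() classifier standing in for whatever model would be trained; only the corrections are stored, since the decompressor can rerun the same classifier on the side data:

    def encode_corrections(labels, features, predict):
        """Store only where (and what) the classifier gets wrong."""
        return [(i, y) for i, (x, y) in enumerate(zip(features, labels))
                if predict(x) != y]

    def decode_labels(features, corrections, predict):
        labels = [predict(x) for x in features]
        for i, y in corrections:   # patch the mispredictions
            labels[i] = y
        return labels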

Unsupervised learning

Instead of the supervised learning outlined above, I chose to use unsupervised learning, e.g. cluster analysis. If the side data is truly structured (which is a requirement for good compression with supervised learning anyway), then it can be divided into clusters. If the data is then sorted by side data cluster number, the back-end compression algorithm should exploit the closer spatial locality of symbols and generate a much smaller file.

Size of data file sorted by features 12 and 10, then compressed with lzma: 8479.
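A minimal sketch of this sort-then-compress step, assuming the class labels are one character per record and Python 3's lzma module again stands in for the command-line tool (columns 12 and 10 are the feature indices mentioned above):

    import lzma

    def sort_and_compress(labels, features, key_columns=(12, 10)):
        # Order records so that similar feature vectors (and hence, hopefully,
        # identical labels) end up adjacent, improving locality for the back-end.
        order = sorted(range(len(labels)),
                       key=lambda i: tuple(features[i][c] for c in key_columns))
        stream = "".join(labels[i] for i in order).encode()
        return lzma.compress(stream)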

Cue validity

The clusters obtained would correspond to the inner structure of the data, or hidden variables; however, in the case of the letter set there are many: the letter name that we are interested in, the font name, the perturbations applied, and so on. Thus, to ensure the correct hidden variable is preferred, cue validity is calculated offline for all features, and features with less correlation to the class are given lower weight.
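A minimal sketch of one way to compute such weights offline; the exact statistic used in the report is not specified, so the correlation-ratio-style score below is an assumption:

    import numpy as np

    def cue_validity_weights(features, labels):
        """Fraction of each feature's variance explained by the class label;
        a higher score means the feature gets a higher weight."""
        X = np.asarray(features, dtype=float)
        y = np.asarray(labels)
        n = len(y)
        weights = np.zeros(X.shape[1])
        for j in range(X.shape[1]):
            col = X[:, j]
            total = col.var()
            if total == 0:
                continue
            between = sum((y == c).sum() * (col[y == c].mean() - col.mean()) ** 2
                          for c in np.unique(y)) / n
            weights[j] = between / total
        return weights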

Distance measure

Many distance measures are readily available; I tested Euclidean distance, its generalization the Minkowski distance, and Tanimoto similarity. Of these, Tanimoto similarity, which can be thought of as a cosine of the angle between two data vectors, actually gave the best results, but due to the computational overhead, and hoping to use the same algorithm on the forest data set, I stuck with Euclidean distance.
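Minimal numpy sketches of the three measures, each comparing a query vector q against the rows of a matrix X; applying the cue-validity weights inside the Euclidean distance is my assumption about where they would go:

    import numpy as np

    def euclidean(X, q, w=None):
        d = X - q
        if w is not None:                  # optional cue-validity weights
            d = d * w
        return np.sqrt((d ** 2).sum(axis=1))

    def minkowski(X, q, p=3):
        return (np.abs(X - q) ** p).sum(axis=1) ** (1.0 / p)

    def tanimoto(X, q):
        dot = np.dot(X, q)
        return dot / ((X ** 2).sum(axis=1) + (q ** 2).sum() - dot)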

Clustering and mapping

There is a large variety of unsupervised learning algorithms available; I briefly considered clustering, but decided that simple implementations are too computationally expensive, and I didn't want to specify presets such as the number of clusters. Instead I chose a mapping algorithm, namely a Self-Organizing Map (SOM), also known as a Kohonen network.

A SOM has the property that side data that is internally close will be mapped to network nodes that are close to each other. Thus I settled on training a one-dimensional network, because the Best Matching Unit (BMU) in such a network can be trivially used as a sorting key for the data, with a network size an order of magnitude larger than the number of class labels to allow for other hidden variables.
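A minimal numpy sketch of training a one-dimensional SOM and using the BMU index as a sorting key; the learning-rate and neighbourhood schedules below are assumptions, as the report does not give its actual parameters:

    import numpy as np

    def train_som_1d(X, n_nodes, n_iter=100000, seed=0):
        """Train a one-dimensional Kohonen map on the rows of X."""
        rng = np.random.RandomState(seed)
        nodes = X[rng.randint(len(X), size=n_nodes)].astype(float)  # init from samples
        positions = np.arange(n_nodes)
        for t in range(n_iter):
            x = X[rng.randint(len(X))]
            bmu = np.argmin(((nodes - x) ** 2).sum(axis=1))          # best matching unit
            lr = 0.5 * (1.0 - float(t) / n_iter)                     # decaying learning rate
            sigma = max(1.0, (n_nodes / 10.0) * (1.0 - float(t) / n_iter))
            h = np.exp(-((positions - bmu) ** 2) / (2.0 * sigma ** 2))
            nodes += lr * h[:, None] * (x - nodes)
        return nodes

    def bmu_keys(X, nodes):
        """Sorting key for each sample: index of its closest node on the 1-D map."""
        return np.array([np.argmin(((nodes - x) ** 2).sum(axis=1)) for x in X])

The data sorted by these BMU keys is then what gets handed to the lzma back-end.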

Size of data sorted by SOM BMU, compressed with lzma: 3108 (euclidean distance).

Size of data sorted by SOM BMU, compressed with lzma: 2890 (tanimoto similarity).

The cost

A naive implementation of a SOM in Python would only cope with a small map or a subset of the data, so I finally had to learn numpy, which sped things up considerably and made the code much smaller! A faster approach that I benchmarked but did not get around to implementing was to compute individual network updates, or the whole unsupervised learning, in C. This would allow early culling during the BMU search and some other optimizations, but the code could become large and definitely less elegant.

The training time on the letter side data is around 10 minutes. Memory usage is reasonable, although it is a concern because all side data needs to be loaded into memory. Code size is 190 lines for a nice, documented implementation and 35 lines for a minimal implementation, which compressed with lzma is only 551 bytes. The cue validity parameters take the number of features times one digit. Thus overall the SOM proved to be a very efficient extractor of knowledge from side data in terms of MDL.

Future research

Size of perfectly sorted data, compressed with lzma: 127. Need we say more?

File "mushroom.dat"

The file is small and compresses well with lzma, thus I decided not to put much effort here.

Future research

The data set description states that some features are strongly correlated with the class. This means a good classifier is possible.

File "rat.txt"

I tried several preprocessing algorithms, but in the end the most efficient was simply to convert the rat into binary, one bit per character, and let lzma do the rest.
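A minimal sketch of that packing, assuming the file contains exactly two distinct characters (the actual alphabet of rat.txt is not described here); the original length has to be stored separately so the final partial byte can be unpacked:

    def pack_bits(text, zero_char):
        """Pack a two-symbol text into bytes, one bit per character."""
        bits = [0 if c == zero_char else 1 for c in text]
        out = bytearray()
        for i in range(0, len(bits), 8):
            byte = 0
            for b in bits[i:i + 8]:
                byte = (byte << 1) | b
            out.append(byte)
        return bytes(out)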

File "forest.dat"

I tried the same SOM approach that worked well for the letter data set; however, I ran into practical problems, namely that loading the side data into memory through numpy takes 1.3 GB, which is more than the CS computers are willing to give. Of course, I tried it on an office server too, but the run time of the algorithm made it impractical.

Future research

If I were to continue with my approach, I would first have to simplify the side data, e.g. collapse all binary features into a binary string, and change the distance measure to use Euclidean distance for some features and Hamming distance for others.
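A minimal sketch of such a mixed measure, assuming the continuous and binary feature columns are known in advance; the split and the relative weighting are assumptions:

    import numpy as np

    def mixed_distance(X, q, cont_cols, bin_cols, alpha=1.0):
        """Euclidean distance on continuous columns plus weighted Hamming distance
        on binary columns; alpha balances the two parts."""
        e = np.sqrt(((X[:, cont_cols] - q[cont_cols]) ** 2).sum(axis=1))
        h = (X[:, bin_cols] != q[bin_cols]).sum(axis=1)
        return e + alpha * h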

I would also have to seriously consider reimplementing the SOM in C or C++, either in whole, or in part as a native extension to numpy.

An established machine learning package might do very well here, given the raw amount of data.

File "Juustohampurilainen.dat"

I picked the shorter of the raw form and the lzma-compressed form.

Future research

The MDL principle requires a short code that generates the data; however, all cryptographically strong PRNGs, namely stream ciphers and one-way hash functions, have rather large code sizes (the smallest around 1.5k). Even a cryptographically strong key takes 32 bytes. Therefore two possibilities exist: either this data was generated by something installed on CS computers, like md5sum, possibly with a secret key, in which case the only hope is to steal the key; or by a weak PRNG, in which case it is generally possible to recover the PRNG state. Even the most popular algorithm, the Mersenne Twister, is reversible after observing 624 consecutive outputs.
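As an illustration only (the actual generator, seed, and output format of the file are unknown), a minimal sketch of that reversal for Python's built-in Mersenne Twister: untemper 624 consecutive 32-bit outputs and feed them back in via random.setstate.

    import random

    M1, M2 = 0x9d2c5680, 0xefc60000

    def untemper(y):
        # Invert the MT19937 tempering transform, steps applied in reverse order.
        y ^= y >> 18
        y ^= (y << 15) & M2
        x = y
        for _ in range(5):             # converges: 7 more low bits fixed per pass
            x = y ^ ((x << 7) & M1)
        y = x
        x = y
        for _ in range(3):             # converges: 11 more high bits fixed per pass
            x = y ^ (x >> 11)
        return x & 0xffffffff

    def clone(outputs):
        """Rebuild a random.Random equivalent to the generator that produced
        624 consecutive 32-bit outputs."""
        state = [untemper(o) for o in outputs]
        r = random.Random()
        r.setstate((3, tuple(state) + (624,), None))
        return r

With the cloned generator, subsequent getrandbits(32) calls reproduce the original stream from the 625th output onward.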

In practice, however, the enormity of the guesswork required would stop this project, given that neither the initial conditions, nor the algorithm, nor even the domain of the algorithm is known.

File "GroupX.dat"

Same as above.

Choice of programming language

There are several non-trivial considerations in the choice of programming language for this course. First is the availability of the given platform on CS computers. Second is practical familiarity. Third is source or compiled code size. Fourth is runtime requirements.

I chose Python with the numpy library as the most elegant, general-purpose choice and the fastest to develop in.

Python

Because the interpreter is already installed on CS computers, and thus doesn't count towards the decompressor size, it makes sense to use the most powerful language available.

Most importantly, Python is beautiful.

Numpy

Many compression problems require a serious statistical approach. Numpy makes this easier and more elegant whenever the data, or an intermediate representation, can be expressed as a vector or a matrix.

C++ source

Clearly the most computationally intensive code deserves C or C++ (or even assembly) treatment. Moreover, large data is handled easily in this case too. However, the source code is usually quite large, although this can be partially alleviated by general-purpose compression.

Compiled code

Compacting binary programs has received much attention in the Demo Scene, and frameworks are already available. Naive assumptions about minimum binary size or dynamic shared object loading were defeated long ago. Thus a decoder program could easily be an executable binary.

Things I didn't do

It would make sense to implement a PPM, C-W, Guazzo, or range coder back-end from scratch, but only if all the juice had been squeezed out of the front-end. After all, lzma is already installed, whereas custom code counts towards the decompressor size.

An alternative approach would be to use an established machine learning package, and then encode the resulting decision tree, neural network, or equivalent in the decompressor. Offline compression has large advantages: it doesn't count towards the runtime or space constraints, and it can be done with whatever tools are available, not only those installed on CS computers.

In Conclusion

The course is worth only 2 credits. Furthermore, due to the rules of the competition, the team that takes an early lead is likely to keep it until the end.

Thus it naturally made sense to put most of the effort in at the very beginning and then, in case of an (unlikely) early win, to lazily check whether something could be improved by a large margin, or, in case of an early loss, to only check whether the new data sets could be reduced by a large margin.

December 2009, Dmitrii Tisnek

콩나물 시루 | Bean Sprouts