ITM Report

Team Juustohampurilainen

Mathias Helminger, Jan Schlüter, Liang Wang

I. Introduction

The purpose of this project is to let us apply what we learned in the course “Information Theoretical Modeling” during the previous period. The aim is very simple: compress the given data into as few bytes as possible.

Most data compression methods fall into one of two camps: dictionary-based schemes and statistical methods. By combining arithmetic coding with powerful modelling techniques, statistical methods can achieve better compression performance.

The data sets given in this project each have their own structure. If we can find the most suitable model for each of them, while keeping the model as simple as possible, we get a much better compression rate than a general-purpose compression program could achieve. Moreover, how to trade model complexity for goodness of fit is an important factor in deciding the final compression rate.

In the following sections, we first give a brief introduction to arithmetic coding and gzip, which we combine with our models. The discussion then focuses on the specific model for each data set, since that is the core of this project.

II. Shared Compression Algorithms

Arithmetic Coding

Background Information

When talking about data compression based on statistical methods, Huffman coding must be mentioned, even though it is a retired champion. The problem with Huffman coding is that the codes have to be an integral number of bits long. For example, if the probability of a character is 1/3, the optimum number of bits to code that character is around 1.6. The Huffman coding scheme has to assign either 1 or 2 bits to the code, and either choice leads to a longer compressed message than is theoretically necessary.

Huffman coding was eventually superseded by arithmetic coding, which completely bypasses the idea of replacing an input symbol with a specific code. Instead, it takes a stream of input symbols and replaces it with a single fractional output number. The longer (and more complex) the message, the more bits are needed in the output number. It was not until relatively recently that practical methods were found to implement this on computers with fixed-size registers.

About Implementation

At first glance, arithmetic coding seems completely impractical, since the calculation appears to require numbers of unbounded precision, while most computers only support floating point numbers of up to 80 bits or so. Some high-level languages do provide arbitrary-precision numbers, such as Fraction and Decimal in Python. However, such support is meant for 'simple' calculations rather than intensive ones; efficiency and speed are not guaranteed.

In fact, we did implement an arithmetic coder in Python using rational numbers (Python's Fraction type). The code is really simple and neat. However, even though the efficiency of the algorithm is not the first concern in this project (the goal being the shortest program), the speed of compression and decompression was not acceptable at all; the simplicity comes at the cost of efficiency.

In practice, we use fixed-width integers to simulate an infinite-precision number. The simulation works by rescaling the integer interval on the fly during compression and decompression. Whenever the most significant digit of the upper bound and the lower bound match, it can never change again, so it can be shifted out as one digit of the encoded number. The underflow situation must also be handled: the bounds may converge while their most significant digits still differ, as with 59999 and 60001.

Theoretically, the decoder should do something like 'divide the encoded number by the symbol probability'. In practice, we cannot perform such an operation on a number that could be billions of digits long. Instead, the decoder accomplishes the job by simulating the procedure of the encoder.
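
To make the rescaling idea concrete, here is a minimal sketch of such an encoder. It uses a binary register (the common textbook variant) rather than the decimal digits of the 59999/60001 example above, and the cumfreq interface is only assumed for illustration - this is not our actual implementation:

    # Sketch of fixed-precision arithmetic encoding with interval rescaling.
    # cumfreq(s) is assumed to return (c_lo, c_hi, total): the cumulative
    # counts of symbol s under the model.

    BITS = 32
    TOP, HALF, QUARTER = 1 << BITS, 1 << (BITS - 1), 1 << (BITS - 2)

    def encode(symbols, cumfreq):
        low, high = 0, TOP - 1
        pending = 0                        # bits postponed by the underflow case
        out = []

        def emit(bit):
            nonlocal pending
            out.append(bit)
            out.extend([1 - bit] * pending)   # flush the postponed opposite bits
            pending = 0

        for s in symbols:
            span = high - low + 1
            c_lo, c_hi, total = cumfreq(s)
            high = low + span * c_hi // total - 1   # shrink the interval to the
            low  = low + span * c_lo // total       # sub-range of this symbol
            while True:
                if high < HALF:                     # leading bit settled to 0
                    emit(0)
                elif low >= HALF:                   # leading bit settled to 1
                    emit(1)
                    low, high = low - HALF, high - HALF
                elif low >= QUARTER and high < HALF + QUARTER:
                    pending += 1                    # underflow: bounds converge
                    low, high = low - QUARTER, high - QUARTER
                else:
                    break
                low, high = 2 * low, 2 * high + 1   # shift the settled bit out
        pending += 1
        emit(0 if low < QUARTER else 1)             # pick a point in the final interval
        return out

The decoder runs the very same loop on its own copies of the bounds, which is exactly the 'simulating the encoder' trick mentioned above.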

GZIP

gzip is the other tool used in this project. It is based on the DEFLATE algorithm, a combination of LZ77 and Huffman coding. DEFLATE works on a series of blocks and produces an optimised Huffman tree customised for each block of data individually. The block sizes are arbitrary, except that non-compressible blocks are limited to 65,535 bytes. Instructions to generate the necessary Huffman trees immediately follow the block header.

Each block consists of two parts: a pair of Huffman code trees that describe the representation of the compressed data part, and a compressed data part. (The Huffman trees themselves are compressed using Huffman encoding.) The compressed data consists of a series of elements of two types: literal bytes (of strings that have not been detected as duplicated within the previous 32K input bytes), and pointers to duplicated strings, where a pointer is represented as a pair <length, backward distance>. The representation used in the "deflate" format limits distances to 32K bytes and lengths to 258 bytes, but does not limit the size of a block, except for uncompressible blocks, which are limited as noted above.

So, from the overview of the DEFLATE algorithm above, we can see that the effect of compression is achieved by two means:
- duplicated strings are replaced by short <length, backward distance> pointers (the LZ77 part), and
- the remaining literals and the pointers are encoded with Huffman trees adapted to each block, so that frequent symbols get shorter codes.

We are not going to dig into the details of the DEFLATE algorithm here. There is abundant information about gzip and DEFLATE on the internet; for further reading, see the references at the end of the report.
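
As a small usage illustration (not part of our compressors), Python's standard library exposes both the gzip container and the bare DEFLATE stream; letter.dat here simply stands for any of the project files:

    import gzip, zlib

    data = open("letter.dat", "rb").read()        # any of the project files will do

    # gzip container: a DEFLATE stream plus a small header and a CRC32 trailer
    gz = gzip.compress(data, compresslevel=9)

    # the same DEFLATE algorithm without the gzip wrapping
    co = zlib.compressobj(9, zlib.DEFLATED, -15)  # -15 = raw stream, 32K window
    raw = co.compress(data) + co.flush()

    print(len(data), len(gz), len(raw))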

III. The Data

info.xpm

For info.xpm we first tried to produce the image from scratch. The French flag in the background is easily built with a list comprehension in Python. We then tried to find the font used for the text, but did not find a close enough match. The next idea was to encode the exceptions from the flag (the black pixels) with our arithmetic encoder, but even when we only encoded the difference to the previous line, the total size was beyond 1 kB.
Right from the start we had also tried existing graphics formats. The PNG format set the bar at 553 bytes, plus some 200 bytes of code to convert it back to xpm (a PNG optimizer was used).
Because of the small size of the PNG file we did not try very hard to shave off a few more bytes; the total size of under 1 kB did not justify it.

PNG format

Since we used it, here is a short overview of the PNG standard:

The file starts with a fixed 8-byte header for identification and other purposes. After that, the file is split up into "chunks", each of which has a length, a type, data and a checksum.

The optimized version of our PNG file only contains these chunks: IHDR (the image header), PLTE (the color table), IDAT (the compressed image data) and IEND (the end marker).

The IDAT chunks are the interesting ones for PNG compression.
As a first step, the PNG encoder uses the 2D neighbourhood of a new pixel to predict it. For each line the encoder can choose a new filter. As the PNG standard is quite extensive, we only describe the most common case (and probably the only one used here). In the following picture the encoder/decoder already knows the green pixels; the current pixel is red, while the grey pixels are unknown ones that will be processed later.


The pixel references available to the png decoder.

For each pixel the encoder can predict it from pixel A (to the left), pixel B (above), their average, or the "Paeth" predictor based on A+B-C. If none of those helps, the raw reference to the color table in the PLTE chunk is stored without prediction. The output of these 5 options is then passed to a DEFLATE compressor, which further reduces the length by exploiting similarities to data seen before.
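
To make the filter options concrete, here is a small sketch of the per-scanline filtering for an 8-bit (palette) image: A is the byte to the left, B the byte above, C the byte above-left, and what gets stored is the difference between each byte and its prediction. This is our reading of the standard, simplified to one byte per pixel:

    def paeth(a, b, c):
        # Paeth predictor: whichever of a, b, c is closest to a + b - c
        p = a + b - c
        pa, pb, pc = abs(p - a), abs(p - b), abs(p - c)
        if pa <= pb and pa <= pc:
            return a
        return b if pb <= pc else c

    # the five PNG filter types, as predictions from the neighbours a, b, c
    PREDICTORS = {
        0: lambda a, b, c: 0,              # None: store the raw palette index
        1: lambda a, b, c: a,              # Sub: predict from the left
        2: lambda a, b, c: b,              # Up: predict from above
        3: lambda a, b, c: (a + b) // 2,   # Average of left and above
        4: paeth,                          # Paeth predictor (A + B - C rule)
    }

    def filter_line(line, prev, ftype):
        """Filter one scanline: each byte minus its prediction, modulo 256."""
        out = []
        for i, x in enumerate(line):
            a = line[i - 1] if i > 0 else 0        # left neighbour (A)
            b = prev[i]                            # neighbour above (B)
            c = prev[i - 1] if i > 0 else 0        # above-left neighbour (C)
            out.append((x - PREDICTORS[ftype](a, b, c)) % 256)
        return out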

For more detailed information about the file format, refer to the ISO standard or read the short introductory article by one of the creators; both can be found in the references.

Rat

The rat.txt file is a string of ones and zeroes which, plotted in 2D, displays a drawing of a rat.
Very soon we realized that a simple time-independent model would not yield good results, because it does not account for the huge white areas around the center. Another idea that we did not implement was run-length encoding: although it would benefit from the huge white areas, the inner part of the rat is poorly suited for it, as short bursts of ones and zeroes dominate.

The first implemented idea was inspired by the good results that PNG compression yields. As described in the last section, PNG exploits local similarities when predicting each pixel. Once we looked a little closer, we found that in this case a later step of the PNG compressor (DEFLATE) actually yields most of the gain, but we kept the idea of accounting for the 2D neighbourhood.

We now ran over the picture and, for each pixel, created a histogram from the neighbourhood that is already known when reading the file sequentially. That is, for a pixel with coordinates (x,y), the histogram is created from [(x-1,y),(x-1,y-1),(x,y-1),(x+1,y-1)]. At the borders we simply wrapped around to avoid more complex code, and the line above the image was assumed to be all white.

These histogram counts were then split by whether the pixel at (x,y) turned out white or black, giving us conditional probabilities that improve the prediction of the next pixel significantly compared to the time-independent approach.
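
A minimal sketch of this context computation (the function name and interface are ours, not the original code):

    NEIGHBOURS = [(-1, 0), (-1, -1), (0, -1), (1, -1)]   # already decoded pixels

    def context(img, x, y, width):
        """Count the white pixels among the known neighbours of (x, y).

        img holds the rows decoded so far as lists of 0/1 (1 = white);
        the borders wrap around, and the line above the image is treated
        as all white.
        """
        count = 0
        for dx, dy in NEIGHBOURS:
            nx, ny = (x + dx) % width, y + dy
            count += 1 if ny < 0 else img[ny][nx]
        return count        # 0 .. 4, the index into the probability tables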

We were not really satisfied with the results; for example, the white surroundings still took up a lot of space (>1/10 bit per pixel). To improve further, we added hand-drawn polygons (made in Inkscape). The first one separates the rat from the white background. We experimented with Bézier and spline polygons, but in the end took simple line polygons, because the extra data needed for the other two was too high compared to the improvement in fit and the additional code needed to draw them.

The two regions defined by the first polygon (inside vs. outside) gave us very different histograms, and the encoding improved significantly.

It was then further improved by adding two more regions, one with more than 95% black pixels and one with more than 90% white pixels (approximately).

The four regions of the rat image.

                 count of white pixels in the neighbourhood
region           0        1        2        3        4
< 5% white       0.012    0.156    0.231    0.471    0.8
"the rest"       0.046    0.222    0.446    0.711    0.942
> 90% white      0.05     0.266    0.625    0.846    0.99
> 99% white      0.036    0.726    0.806    0.894    0.9999

Probabilities for a white pixel, sampled from the different regions and histograms.

In the final compressor/decompressor the probabilities for black and white were hard-coded for each region and for each possible histogram (i.e. count of white pixels in the known neighbourhood). These probabilities then went straight to the arithmetic encoder.
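
Put together, the per-pixel model is a plain table lookup. A sketch of the compression loop, where P_WHITE stands for the hard-coded table above and region_of, context and encoder stand for the polygon test, the neighbourhood histogram and the arithmetic coder (all names and the coder interface are ours):

    def compress_rat(img, width, height, P_WHITE, region_of, context, encoder):
        # P_WHITE[r][n]: probability of a white pixel in region r, given n
        # white pixels in the known neighbourhood (the table above).
        for y in range(height):
            for x in range(width):
                p = P_WHITE[region_of(x, y)][context(img, x, y, width)]
                encoder.encode_bit(img[y][x], p)   # assumed coder interface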

Curves


curve1.dat plotted against curve1.sdat

curve2.dat plotted against curve2.sdat

When plotted against the sdata, both curve files turn into what looks like data points scattered around some function. The first step was to approximate these functions so that the scattered data would be centered around the x-axis. This fitting was done in gnuplot, trying polynomials of different orders.
For curve1 we used two functions because of the significant change at x=0.
Curve2 was a little more involved, because the data contained quite a few far-off values and a strange jump in the values at x=0.8. The jump was easy to remove by setting y to -y-180 for all values past this point. To get rid of the far-off values, we first fitted a function to the data and then used the difference to decide whether a data point was far off (blue crosses in the following picture) or belonged to the center. The fitting was then repeated with the points in the center area only (the sine shape was also used to improve the fit).
In both cases fairly few parameters were sufficient to get a good fit.

fitting of average:


fitting of curve1 - f(x): 3rd and 5th order polynomials combined

fitting of curve2 - f(x): 12th order polynomial

bias-cleaned data:


difference of curve1 and f(x)

difference of curve2 and f(x)

It turns out that the center part of curve2 can be modeled as evenly distributed in an interval whose size is a sine function of x, plus some sparse far-off errors, while curve1 follows a Gaussian distribution whose standard deviation can be approximated by a polynomial in x. The even distribution was relatively easy to build (take the y values in a small x interval and find the maximum absolute value, omitting far-off values). For the Gaussian we first had to calculate the local variance for every data point; this data was then used in gnuplot to fit several polynomials to the variance.
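
To feed these distributions to the arithmetic coder, each y value (stored at a fixed precision) gets a probability from the fitted conditional distribution at its x. A sketch of the idea; f, sigma and halfwidth stand for the fitted functions, and the quantisation step STEP is an assumption, not the actual value used:

    import math

    STEP = 0.01   # assumed precision of the y values in the data file

    def p_curve1(x, y, f, sigma):
        """Probability of y under the Gaussian model of curve1: mean f(x),
        standard deviation sigma(x), integrated over one quantisation step
        (approximated here by density * STEP)."""
        s = sigma(x)
        d = math.exp(-0.5 * ((y - f(x)) / s) ** 2) / (s * math.sqrt(2 * math.pi))
        return d * STEP

    def p_curve2_center(x, y, f, halfwidth):
        """Probability of y under the even (uniform) part of the curve2 model:
        an interval of half-width halfwidth(x) (the fitted sine) around f(x)."""
        w = halfwidth(x)
        return STEP / (2 * w) if abs(y - f(x)) <= w else 0.0

In the real model the far-off errors of curve2 get their own probability mass from the region counts described below, so the coder never sees a zero probability.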

fitting of gaussian and even distribution:


standard deviation of curve1 with function fitted to it

the interval for the equal distribution of curve2

What was left to do was to study the far-off errors in curve2. It turns out they are relatively frequent for x < -0.35, a little less frequent between -0.35 and 0.38, and there are no errors for x > 0.38. The plot was therefore divided into three mutually exclusive regions: the first one inside the sine curve, the second one left of -0.35 and the third one from -0.35 to 0.38. To cope with local problems with the fitted sine, a fourth area was added; it contains anything within 5 units of the sine.
For each of these regions, we counted the points and measured the area covered, to calculate the probability of a data point lying at a certain position in one of these areas. This approach could have been improved (by proper integration), but it would have increased the model size disproportionately.

Equipped with all the parameters for the distribution of a value y for a given x we were now able to compress the curves using the arithmetic encoder.

Letters

For the letters task, we had a data file consisting of letters from A to Z and an sdata file containing several attributes of images of these letters. For good compression, we needed to predict the letters from those attributes, so the letter task turns out to be a classification problem with 16-dimensional inputs, 26 classes and 20000 instances.

First round

As we already had some neural network code from a machine learning class in Munich, we first trained a neural network on it. The type of network used was a multi-layer perceptron with one hidden layer and an output layer of 26 neurons. The activation functions were the logistic sigmoid function for the hidden layer and the softmax function for the output layer (which has the advantage of always producing a probability distribution over the 26 classes in the output layer). The training method was backpropagation, where online learning turned out to be more robust than batch learning.
We trained the network with different numbers of hidden neurons. Our results with 8 to 64 neurons in the hidden layer varied between 17000 and 2500 misclassifications.
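
For reference, the forward pass of the network described above is only a few lines; a sketch with assumed weight shapes (not our exact code):

    import numpy as np

    def forward(x, W1, b1, W2, b2):
        """Forward pass of the letter network.

        x  : the 16 attributes of one instance
        W1 : hidden weights, shape (H, 16);  b1 : hidden biases, shape (H,)
        W2 : output weights, shape (26, H);  b2 : output biases, shape (26,)
        Returns a probability distribution over the 26 letter classes.
        """
        h = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))    # logistic sigmoid hidden layer
        o = W2 @ h + b2
        e = np.exp(o - o.max())                     # softmax output layer
        return e / e.sum()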

We now had to find methods to store both the weights (for the decompressor) and the misclassified characters (in the compressed file), and then find a good tradeoff between the number of hidden neurons and the number of misclassifications. As the arithmetic coder was not finished, we had to use other means. For the characters, the first idea we tried was to replicate the letter.dat file, replacing all correctly classified characters with spaces, and compressing the result with gzip. The resulting file was smaller than just compressing the letter.dat file directly with gzip, but not small enough. We reasoned that the neural network might still have a good second guess even if the first guess was wrong, and tried another encoding employing this. If the first guess was right, we replaced the character with an 'a', if the second guess was right, we replaced it with a 'b' and so on. This new compressor basically replaces the data file of letters from A to Z with a file of letters from a to z, but with heavily changed frequencies (a and b being far more frequent than in the original file), and then employs gzip. This was the best we could come up with without using arithmetic coding.
But even if the compressed file was small now (around 5 kB with 16 hidden neurons, as opposed to 12.5 kB you get by gzipping the letter.dat file directly), the weights took up too much space. The total number of weights is (17 * H) + ((H+1) * 26), where H is the number of hidden neurons. Using 64-bit float values, the weights for 16 hidden neurons would take up almost 6 kB. To reduce this size, we first rounded the weights to 32-bit floats, and later even discretized them to 16-bit and finally 8-bit integers. As the misclassification rate (and the size of the compressed file) only marginally increased (35 bytes for the compressed file), we adopted the 8-bit discretization.
Now that we had found a way to store the weights in fewer bytes, we trained again with more neurons and found a good trade-off at 32 hidden neurons, where 2700 misclassifications resulted in 3446 bytes of compressed data, 544 + 858 bytes for the weights and 244 bytes of Python code - our submission in the first round.
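
As an illustration of the rank-based recoding described above (the helper names are ours; gzip is applied at the end, as in our first-round submission):

    import gzip
    import string

    def recode(letters, rankings):
        """Replace each letter by the rank its true class got from the network.

        letters  : the characters of letter.dat
        rankings : for each instance, the 26 classes ordered from the
                   network's most to least probable guess
        A correct first guess becomes 'a', a correct second guess 'b', and
        so on; the result is heavily skewed towards 'a' and gzips well.
        """
        out = [string.ascii_lowercase[r.index(ch)]
               for ch, r in zip(letters, rankings)]
        return gzip.compress("".join(out).encode())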

Second round

Between the first and second submission, we tried several new techniques. Firstly we implemented the "Extreme Learning Machine" [Huang 2006] and tried it, but got bad results (even with several hundred neurons it was not as good as what we had).
Then we read some papers that worked with the letter task and found a machine learning tool called "Weka". With this tool, we tried some new methods (J48, AdaBoost J48, K*, KNearest). It did not work too well - the classification rate was seldom better than what we already had, and the models needed for this were quite huge (decision trees of more than 100 rules, or several hundred training instances for KNearest).
We then tried the Matlab Neural Network Toolbox to train different neural networks, varying the number and size of hidden layers, transfer functions and training methods. Again, we couldn't beat the classification rate we had with the previous 32-neuron-model.
In the end we returned to the original code. We implemented a new method to adapt the learning rate during training (copied from Matlab's "traingda" method) and tried to find a good model with 48 neurons. It always ended around 3500 misclassifications, while our previous model of 32 neurons had around 2700.
As a last try we rescaled the input data of the letters.sdat file to the interval from -1 to +1, and this did the trick: the misclassification rate dropped to about one fifth of the previous rate. Additionally, the arithmetic coder was finished by now, allowing us to use the class probabilities given by the network directly as a model for the data. The new trade-off between neuron count and prediction quality had its optimum at 48 neurons, where 763 misclassifications resulted in 663 bytes of compressed data, 604 + 724 bytes of weights and about 300 bytes of compressed decompression code - our submission from the second round on.

Later, we tried to further improve on this result by finding a model for compressing the weights using the arithmetic coder (and using the fact that the weights do not need to be losslessly compressed, but may be represented inaccurately), but we couldn't improve the current solution.

Forest

Similarly to the letter task, the forest task is basically a classification problem, this time with 54-dimensional inputs, 7 classes and 581012 instances.

Manual analysis of the data

By looking at the data, we found that the classes in the forest.dat file are not distributed randomly, but often form sequences, possibly because the data is ordered by location. Additionally, we noticed that the frequency of classes changes dramatically between the first 15123 instances and the rest.

Neural network solution

Again, we employed the neural network we already used for the letter task. We tried different methods to train the net on a subset of the given data only to save some computation time, but in the end returned to training on all instances (over night) because it gave better results. Having learned from the letter task, we rescaled the input vectors component-wise to the interval from -1 to +1, and additionally shifted them to a zero mean because it further improved the results.
The results were not as good as for the letter task - with 48 neurons we got 98626 misclassifications, resulting in more than 50 kB for the arithmetic coded data. LZMA on the forest.dat alone could compress it to 24 kB, so we still had to improve a lot.

Neural network plus last-symbol boost

As mentioned in the beginning, the instances seemed to be ordered, so actually our model wastes some information if the instances are regarded independently of one another. To easily fix this, we implemented what we termed a "last-symbol boost": The last seen symbol (i.e. class) is given a higher probability than what the neural network predicted. With the optimal boost value which we found out empirically, this method reduced the size of the compressed data to about 18 kB, already substantially smaller than the LZMA result. To further improve it, we chose a smaller boost value for the first part of the file (the first 15123 instances) than for the second part, because the first part seems to be less regular. This reduced the size of the compressed file further to 16.5 kB.
Our final result needed 16664 bytes for the compressed data and 1884 + 343 bytes for the weights. The code size is negligible - as the code is almost the same as for the letter task, gzip compresses the two files quite well when put together in a tar archive.
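
One way to implement such a boost is to mix the network output with a point mass on the previously seen class; whether this matches our exact formula is not essential, the idea is the same:

    def boost_last_symbol(probs, last_class, boost):
        """Give the previously seen cover type a higher probability.

        probs      : the network's probability vector over the 7 classes
        last_class : index of the class seen in the previous instance
                     (None at the very start of the file)
        boost      : mixing weight, tuned empirically and chosen smaller
                     for the first 15123 instances
        """
        if last_class is None:
            return probs
        mixed = [(1.0 - boost) * p for p in probs]
        mixed[last_class] += boost
        return mixed          # still sums to 1, ready for the arithmetic coder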

What we could not complete

Looking at the data in different editors, we noticed that there seems to be a strong influence not just from the preceding symbol but also from recently observed patterns. If you change the width of your editor window (with word wrap enabled), you come across areas that look like a 2D image. We decided to try this automatically and find the best line width. As it turned out, the optimal line width changes frequently. What we did was to start with a hand-fitted line length at the beginning of the file, then search for the optimal line width w for which another line (w steps further down) of the same length w deviates as little as possible from the current one. As the results looked very promising, we then switched from the .dat part to the .sdat part. Of the given values we deemed "elevation" to be the most useful for finding geographic correlations, and using the elevation turned out to be even more robust than just using forest.dat.
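
A sketch of this search for the local line width, on a 1D sequence of values (for us, the elevation column); the search bounds here are assumptions:

    def best_width(values, start, w_min=50, w_max=400):
        """Find the line width w for which the w values starting at `start`
        best match the w values one line (w positions) further down."""
        best_w, best_err = w_min, float("inf")
        for w in range(w_min, w_max + 1):
            line      = values[start     : start + w]
            next_line = values[start + w : start + 2 * w]
            if len(next_line) < w:
                break                                  # ran out of data
            err = sum(abs(a - b) for a, b in zip(line, next_line)) / w
            if err < best_err:
                best_w, best_err = w, err
        return best_w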


2D-forest plotted on console with a very small font

One big problem was that the length of the lines varied: if we wanted to compare lines to each other, there was no data available for the end of a line if the previous one was shorter. To cope with this, we improved the line matching to look for scaled versions of the current line instead of just a very close exact match; the same scaling could then be used to stretch the previous line to the length of the current one. We then tried a technique similar to the rat and the PNG algorithm, using the local neighbourhood to predict a new "pixel". As the data was a little different, we used a bigger area than just the 4-neighbourhood and gave weights to different spots. Using just this, combined with averaged probabilities for each tree type, we were able to push the size down to about 20 kB. We then tried to merge the neural network with the line predictor, as the errors of the two models seemed to have little correlation.
Sadly we were not able to finish this part. From the entropy data we had expected to get below 10kb total size with this approach.


2D-forest on console with fixed length rendering and bits per symbol for each line (nearest filter was used for sampling)

bits per symbol for the entire forest dataset using the line-predictor (average: 0.281 bits per symbol)

Mushroom

For the Mushroom data, the mushroom.info file contained a set of rules that can be used to predict the class (edible or not) from the mushrooms' attributes with 100% classification accuracy. We wrote down the shortest possible Python program we came up with to apply these rules, which is 148 bytes compressed with gzip. As the decompressor can now generate the .dat file completely from the .sdat file, our compressor outputs a zero byte file for this task.

Data submitted by other teams

We did not spend much time trying to compress the data from the other teams. The only things we tried were Markov chains of order 1 to 10 and looking for similarities to Python's random generator. Neither approach found anything significant, so we left the data in its original shape.
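
One way to run such a Markov chain test is to measure how many bits per symbol an order-k model would need; a sketch of that measurement (ignoring the cost of storing the model itself, and not necessarily exactly what we ran):

    from collections import Counter
    from math import log2

    def markov_bits_per_symbol(data, k):
        """Empirical entropy of `data` (a string or bytes) under an
        order-k Markov model, in bits per symbol."""
        context = Counter()
        joint = Counter()
        for i in range(k, len(data)):
            c = data[i - k:i]
            context[c] += 1
            joint[c, data[i]] += 1
        n = len(data) - k
        return -sum(cnt * log2(cnt / context[c])
                    for (c, _), cnt in joint.items()) / n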

References