Get-Together for String Algorithms Researchers in Helsinki

Event type: 
Guest lecture
Event time: 
19.01.2012 - 15:15 - 16:00
Exactum B119

The program consists of a talk by Simon J. Puglisi from King's College London:

Title: Storing and Accessing Big Web Collections


Compression techniques that support fast random access are a core component of any information system. Current state-of-the-art methods group documents into fixed sized blocks and compress each block with a general-purpose adaptive compressor such as GZIP. Random access to a specific document then requires decompression of a block. The choice of block size is critical: it trades between compression effectiveness and document retrieval time. In this work we present a scalable compression method for large document collections that allows fast random access. We build a representative sample of the collection and use it as a dictionary in an LZ77-like encoding of the rest of the collection, relative to the dictionary. We demonstrate on large web collections that, using a dictionary as small as 0.1% of the collection size, our algorithm is dramatically faster than previous methods, and in general gives much better compression.

This is joint work with Christopher Hoobin (Royal Melbourne) and Justin Zobel (University of Melbourne), to appear at the 38th International Conference on Very Large Databases, August 2012.


13.01.2012 - 15:58 Leena Salmela
13.01.2012 - 15:58 Leena Salmela