Niko Välimäki defends his PhD thesis on August 21st, 2012, on Applications of Compressed Data Structures on Sequences and Structured Data

MSc Niko Välimäki will defend his doctoral thesis Applications of Compressed Data Structures on Sequences and Structured Data Tuesday the 21st of August, 2012 at noon in the University of Helsinki Main Building, Unioninkatu 34, Auditorium XIV (old part), 3rd floor.

Applications of Compressed Data Structures on Sequences and Structured Data

Recent advancements in the field of compressed data structures create interesting opportunities for interdisciplinary research and applications. Compressed data structures provide essentially a time--space tradeoff for solving a problem; while traditional data structures use extra space in addition to the input, compressed data structures replace the input and require space proportional to the compressed size of the input. The amount of available memory is often fixed, thus, the user might be willing to spend more time if it allows the use of larger inputs. However, despite the potential behind compressed data structures, they have not quite reached the audience of other disciplines. We study how to take advantage of compressed data structures in the fields of bioinformatics, data analysis and information retrieval.

We present several novel applications for compressed data structures and include an experimental evaluation of the time-space tradeoffs achieved. More precisely, we propose (i) a space-efficient string mining algorithm to recognise substrings that admit the given frequency constraints, (ii) both theoretical and practical methods for computing approximate overlaps between all string pairs, (iii) a practical path-based graph kernel for predicting the function of unknown enzymatic reactions, and (iv) a compressed XML index that supports efficient XPath queries on both the tree-structure and textual content of XML documents. Problem (i) is motivated by knowledge discovery in databases, where the goal is to extract emerging substrings that discriminate two (or more) databases. Problem (ii) is one of the first phases in a sequence assembly pipeline and requires efficient algorithms due to the new high-throughput sequencing systems. Problem (iii) is motivated by machine learning, where kernels are used to measure the similarity of complex objects. Problem (iv) has its background in information retrieval.

The proposed methods achieve theoretical and practical improvements over the earlier state of the art. To raise the overall awareness of compressed data structures, our results have been published in interdisciplinary forums, including conferences and journals from the fields of bioinformatics, data engineering and data mining.

Availability of the dissertation

An electronic version of the doctoral dissertation is available on the e-thesis site at http://urn.fi/URN:ISBN:978-952-10-8017-3.

Printed copies are available on request from Niko Välimäki: (09) 191 51151 or niko.valimaki@cs.helsinki.fi.

07.08.2012 - 14:18 Pirjo Moen
07.08.2012 - 11:14 Pirjo Moen