Projects
- Self-indexes in permanent storage (see SuDS group): The project focuses on the theory of index structures for textual data. Such structures can be used to speed up the context retrieval tasks, such as to answer queries like "Does my pattern X appear inside the text?". Indexes let us reach query times proportional only to the length of the query word, while without indexing we would need to scan through the whole text on each query. Self-indexes are extremely space-efficient versions of classical text indexes. The name comes from the fact that they contain enough information to reproduce the whole text. Thus, a self-index is, in itself, an index for the text and its representation. The most successful self-indexes up-to-date only need memory close to the compressed text size. In other words, a compression mechanism has been found that offers added value to the pure reduced space demand. The aforementioned properties of self-indexes have also been validated in practice, but only in the context of main memory indexes. As main memory solutions, self-indexes are useful tools for enabling genomic-scale analyses of biological sequences, as without indexing those analyses would be too time-consuming. For this purpose self-indexes suite perfectly, as they occupy no more space than the sequences themselves. The purpose of this research is to study the opportunities of self-indexes in permanent storage. This is an important future challenge; only when working in permanent storage, can self-indexes be considered as a universal representation of textual data. Such solutions would have direct applications in textual databases, Internet search engines, or even as parts of file systems in the operating system level.
- Theory and Practice of Succinct Data Structures (SuDS): The project studies a new subfield of data compression - data structure compression. The new aspect compared to traditional compression is that the compressed data (structure) needs to be represented so that access to its internal parts is provided without uncompressing the whole structure. As an example, consider a binary tree of n nodes. It is possible to represent the tree succinctly using about 2n bits so that the children and parent of any node can be accessed in constant time. A standard link structure representation of a binary tree takes of order n log n bits. The project concentrates on compressed full-text index structures. Using these structures one can replace any file with a compressed one so that subword searches are possible without uncompressing the whole file. Moreover, the search time is only proportional to the length of the query word. More concretely, the technique provides an opportunity to replace .zip files with some other file format, which at the same time occupies less memory and provides efficient searches on the file. The first breakthroughs to make this scenario possible have already been achieved. The aim of the project, in addition to advancing the theory, is also to pay attention to the practical aspects in order to transfer the theory into technology.
- Content-Based Retrieval and Analysis of Harmony and other Music
Structures (C-BRAHMS):
String matching techniques can be used for efficient retrieval of query-melodies (produced e.g. by humming) from large music databases. I develop algorithms for so-called transposition invariant string matching, that can be used for searching melodies allowing an arbitrary shift in the pitch levels; i.e. person humming a piece does not have to start from the correct pitch level as long as (most of) the relative pitch levels are correct with respect to the first one. - Integrated Computational Methods for Genomic,
Proteomic and Metabolic Modeling (ICOMIC):
My contribution was to develop methods for automatic matching of 2D protein gel electrophoresis images.
Other interests
- Combinatorial pattern matching.
- Data compression.

