Combinatorial Pattern Matching and Information Retrieval 

Team
Collaborators outside the unit

Analysing proteomics data using twodimensional pattern matching,
figure by Veli Mäkinen
See also

This subproject continues our research on concepts, algorithms and software systems for combinatorial pattern matching in strings, sequences and in discrete structures of higher dimensions such as digital images, music files, and threedimensional arrays of voxels. This is mostly basic research of algorithmic problems in the field and serves the FDK unit as a whole. On the theoretical side, we will work on three central questions in the field: approximate pattern matching, indexing, and pattern synthesis. In approximate string matching one still open question is how to find the weakest but still significant occurrences of the pattern fast. This is in contrast with the strong occurrences that are easy to find using wellknown methods. This is a big problem in bioinformatics where one often wants to find very weak homologies between sequences. The current solutions seem slow and inaccurate. Designing efficient index structures that speed up the pattern search from large files is another problem of interest. Finding good indexes for approximate search is the main issue. To index text strings, we work for example on socalled LempelZiv index (which is sublinear in size) as well as on different q gram filters. The filtering approach has turned out very useful both in exact and approximate string searches. However, we are still lacking a systematic study of the theory and algorithms of such filters. Pattern synthesis means finding interesting 'common factors' that occur in a given set of sequences or other structures. Such questions arise for example in the analysis of DNA when one wants to find socalled binding sites of regulatory elements for a set of genes. The common factor can be used for modeling the binding sites. An important general direction of work is also to generalize the combinatorial pattern matching approach from simple linear strings of symbols to different structures of higher dimension. Practical applications of these and related methods will be studied by developing prototype implementations for various application areas. For example, we have recently developed a music retrieval system based on a socalled bitparallel algorithm for pattern matching in strings. Another application is twodimensional point pattern matching that occurs, e.g., in the analysis of proteomics gels and in electron microscopy. 