University of Helsinki Department of Computer Science
 

Annual report 2005

From Data to Knowledge - FDK

The From Data to Knowledge research unit (FDK) develops computing methods for discovering useful knowledge from large masses of data. The unit is multidisciplinary, combining in its research groups expertise in algorithmics, statistical methods and application fields such as bioinformatics and human language processing. The unit was appointed a Centre of Excellence of the Academy of Finland for a six-year period starting from 1 January 2002.

The FDK unit is shared by the University of Helsinki and Helsinki University of Technology. Most of its operations are located at the Department of Computer Science at the University of Helsinki. Professor Esko Ukkonen is the director of the unit, and Professors Helena Ahonen-Myka, Jaakko Hollmen (HUT), Heikki Mannila (Basic Research Unit/HIIT, Academy Professor) and Hannu Toivonen are members of it. In 2005 the personnel of the unit consisted of about sixty researchers and postgraduate students.

The core competence area of the unit is algorithmics for data analysis. The unit's internationally recognised areas of expertise are combinatorial pattern recognition and string matching on the one hand, and machine learning and data mining on the other. In its activities the unit emphasizes the interaction between theory development and practical applications. The goal is to find research problems whose conceptual basis and solution algorithms have wider application potential. The unit develops new algorithms and prototype implementations of them, and then studies their usage and performance.

The unit is organised as several closely connected main projects, and the same people are active in several of them. This facilitates internal communication and the utilisation of expert knowledge for different applications.

The first main theme is data mining and machine learning. The project develops original concepts and algorithms to strengthen a core area of the unit. We aim at results in theoretical basic research, and the relevance of the results is tested in various applications. Text databases and document collections as well as event sequences in telecommunication networks are examples of the data we use. Information filtering from the Internet and other human language technology belong to the field of this project, as does the use of machine learning in image analysis. Research on question answering (QA) systems has focused on methods for analysing queries. The language independence of the methods has been tested by developing QA systems for three languages (Finnish, French and English).

The second main theme focuses on applying the first theme in the field of bioinformatics. It studies methods for medical genetics and for analysing data on genomics, proteomics, and metabolism. Partners include UCLA, the European Bioinformatics Institute and several top national research groups. The project develops computational methods for modelling multiple disease gene localisation as well as various gene-regulation and metabolism networks on the basis of measurement data. The latest research focuses on such areas as haplotypes, mapping the overall architecture of genes, and systems biology. The project obtained many relevant results on determining haplotypes and using them for gene mapping. In cooperation with cancer researchers, the project opened a significant field of inquiry on locating gene-regulatory patterns in DNA.

Combinatorial pattern-matching and information retrieval belong to the focal areas of the unit. The main research questions include approximate pattern matching, efficient index structures, and learning patterns from data. The group continues to build a program library of string algorithms. An application that has been studied is the retrieval and analysis of music represented as musical notes. In connection with XML-information retrieval, the group has studied how best to divide XML documents into convenient indexing units.
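As an illustration of the kind of problem that arises when music is retrieved as sequences of notes, the following minimal sketch finds occurrences of a melody regardless of the key it is played in. It assumes that notes are given as integer pitch numbers (for example MIDI numbers) and uses the simple, well-known trick of comparing the intervals between consecutive notes. The function names are hypothetical, and this is not a description of the unit's own algorithms, which also address approximate matching.

```python
def intervals(pitches):
    """Differences between consecutive pitches; unchanged by transposition."""
    return [b - a for a, b in zip(pitches, pitches[1:])]


def find_transposed(pattern, text):
    """Return start positions in `text` where `pattern` occurs in some key.

    Assumption for this sketch: both arguments are lists of integer pitches.
    A naive O(n*m) scan is used for clarity, not efficiency.
    """
    if not pattern:
        return []
    if len(pattern) == 1:
        # A single note matches every position under some transposition.
        return list(range(len(text)))
    p = intervals(pattern)
    t = intervals(text)
    m = len(p)
    return [i for i in range(len(t) - m + 1) if t[i:i + m] == p]


# A C major scale as MIDI pitch numbers, searched for three notes
# that each lie a whole tone apart.
scale = [60, 62, 64, 65, 67, 69, 71, 72]
print(find_transposed([67, 69, 71], scale))
```

Because only intervals are compared, the pattern is found wherever the same melodic shape occurs, not only at the pitches at which it was written.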

In addition to basic research and doctoral education, the FDK unit also wants to serve as an algorithm 'atelier' that develops computational solutions to new problems in different fields. The unit is always in search of new partners who could pose computational problems at the cutting edge of research.

During 2005, two doctoral dissertations were completed in the unit, and its researchers are partners in one new patent.

Contact person: Professor Esko Ukkonen

Homepage: http://www.cs.helsinki.fi/research/fdk/

Publications:

Ahonen-Myka, H.
Mining all maximal frequent word sequences in a set of sentences.
Proceedings of the 14th ACM International Conference on Information and Knowledge Management, CIKM 2005, October 31 - November 5, 2005, Bremen, Germany, pp. 255-256.

Hintsanen, P. & Sevon, P. & Onkamo, P. & Eronen, L. & Toivonen, H.
An empirical comparison of case-control and trio-based study designs in high-throughput association mapping. Journal of Medical Genetics, Published Online First: 28 October 2005. doi:10.1136/jmg.2005.036020

Kivioja, T. & Arvas, M. & Saloheimo, M. & Penttilä, M. & Ukkonen, E.
Optimization of cDNA-AFLP experiments using genomic sequence data. Bioinformatics 21(11): 2573-2579 (2005)

Mäkinen, V. & Navarro, G. & Ukkonen, E.
Transposition invariant string matching. Journal of Algorithms 56, pp. 124-153

Yangarber R. & Jokipii L.
Redundancy-based Correction of Automatically Extracted Facts. In Proceedings of the Human Language Technology Conference/Conference on Empirical Methods in Natural Language Processing: HLT/EMNLP-2005, (2005) Vancouver, Canada.

Research projects in 2005:

Data mining and algorithmic machine learning

  • Information extraction
  • Paleoecological data analysis
  • APRIL II
  • PASCAL

Computational biology and bioinformatics

  • Computational methods for analysing genome structure and function in mammals
  • System biological analysis of physiological regulation
  • Finding predisposition genes in case-control material
  • A global molecular approach in the study of microbial stress
  • Yeast systems biology - Integrated analysis of metabolism-related data
  • BIOSAPIENS (EU NoE)
  • REGULATORY GENOMICS (EU)

Combinatorial pattern-matching and information retrieval

  • C-BRAHMS - music information retrieval
  • GLAS - Generic software library of algorithms on strings
  • Mobile and multilingual maintenance man

Computational structural biology

  • Structure, assembly and dynamics of biological macromolecular complexes

 

International visits

From the unit

Matti Kääriäinen
International Computer Science Institute, Berkeley, California, Algorithms Group, 4 April 2005-31 March 2006