UH / CS Department / Annual Report 2006

Document Management, Information Retrieval and Data Mining (Doremi)

The Doremi research group is active in the areas of document management, information retrieval, data mining and human language technology. The group has developed methods for question-answering systems, event detection and tracking, information retrieval in XML documents, and text mining.

Information retrieval from XML documents has attracted a great deal of attention lately, for example in the INEX evaluation programme. Doremi member Miro Lehtonen used the INEX test collection in his 2006 PhD thesis, which deals with indexing methods for XML documents. One finding was that indexing only the most text-intensive parts of a document yields a smaller index and also raises the quality of search results. Another was that term emphasis based on the XML mark-up enhances retrieval precision even further.
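
The idea of indexing only text-intensive parts can be illustrated with a minimal Python sketch. It is not the method of the thesis: the density measure (character data divided by serialized length), the thresholds min_density and min_chars, and the sample document are all assumptions made for illustration. Term emphasis based on mark-up is not covered.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one prose paragraph surrounded by data-like markup.
SAMPLE = (
    "<article><meta><id>42</id><lang>en</lang></meta><body>"
    "<p>Information retrieval from XML documents has attracted "
    "a great deal of attention.</p>"
    "<table><row><c>1</c><c>2</c></row><row><c>3</c><c>4</c></row></table>"
    "</body></article>"
)


def text_density(elem):
    """Share of the element's serialized length that is character data."""
    text_len = len("".join(elem.itertext()))
    total_len = len(ET.tostring(elem, encoding="unicode"))
    return text_len / total_len if total_len else 0.0


def select_indexable(root, min_density=0.6, min_chars=20):
    """Collect (tag, text) for the outermost text-intensive elements.

    An element is kept whole when it holds enough text and little markup;
    otherwise its children are examined instead. Both thresholds are
    illustrative assumptions, not values from the source.
    """
    selected = []

    def visit(elem):
        text = " ".join("".join(elem.itertext()).split())
        if len(text) >= min_chars and text_density(elem) >= min_density:
            selected.append((elem.tag, text))
            return
        for child in elem:
            visit(child)

    visit(root)
    return selected


if __name__ == "__main__":
    for tag, text in select_indexable(ET.fromstring(SAMPLE)):
        print(tag, "->", text)
```

On the sample, only the paragraph element passes both thresholds, so the metadata and table cells would stay out of the index.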

The project Mobile and Multilingual Maintenance Man (4M) ended, giving rise to the new project Cognitive Guidance and Knowledge Systems (CoGKS). Like 4M, CoGKS is a broad collaboration between several research groups at the University of Helsinki and Helsinki University of Technology as well as VTT Information Technology. The goal is to develop a communication and knowledge support system for expert communities (e.g. the maintenance personnel of a company), where the role of the 4M system is to follow the conversations between human experts and offer instructions and background information as needed. The Doremi group is responsible for developing information-retrieval methods that extract pertinent search words from the conversation and other sources and run dynamic searches over background material. In addition, we are developing methods, based mainly on information extraction, for assembling knowledge from large document collections, such as the problem and repair descriptions reported in maintenance documents.

Doremi has been collaborating with the EU's Joint Research Centre (JRC) to build a system that integrates information-retrieval and information-extraction technologies. The system collects and analyses bulletins on infectious diseases from international news sources. The Europe Media Monitor (EMM) system developed by the JRC uses keyword analysis to search thousands of online sources for news documents on topics that are important to many EU units. The documents found in this way are clustered by topic. The Pattern-based Understanding and Learning System (PULS) developed by Doremi analyses the documents clustered under infectious diseases and extracts facts such as which diseases have been discovered, in which country, and how many people have been infected. The integrated system Medisys, which is updated in real time, is available at medusa.jrc.it/.
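
Pattern-based fact extraction of this kind can be sketched in a few lines of Python. The sketch below is not the PULS implementation, whose patterns are not described here. The single regular expression, the Outbreak record and the example sentences are hypothetical and only show how a surface pattern can yield a disease, a country and a case count.

```python
import re
from typing import List, NamedTuple


class Outbreak(NamedTuple):
    """One extracted fact (hypothetical record layout)."""
    disease: str
    country: str
    cases: int


# Illustrative pattern: "<N> [new] cases of <disease> were reported in <Country>".
# Real systems use many patterns; this one is an assumption for the example.
PATTERN = re.compile(
    r"(?P<cases>\d[\d,]*)\s+(?:new\s+)?cases\s+of\s+"
    r"(?P<disease>[a-z][a-z ]*?)\s+"
    r"(?:were|have\s+been)\s+(?:reported|confirmed)\s+in\s+"
    r"(?P<country>[A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*)"
)


def extract_outbreaks(text: str) -> List[Outbreak]:
    """Return every disease/country/count triple matched by PATTERN."""
    facts = []
    for match in PATTERN.finditer(text):
        facts.append(
            Outbreak(
                disease=match.group("disease"),
                country=match.group("country"),
                cases=int(match.group("cases").replace(",", "")),
            )
        )
    return facts


if __name__ == "__main__":
    news = (
        "Health officials said 120 cases of cholera were reported in Angola "
        "last week. Separately, 1,250 new cases of avian influenza have been "
        "confirmed in Viet Nam."
    )
    for fact in extract_outbreaks(news):
        print(fact)
```

A single hand-written pattern misses most phrasings, which is why a practical system needs a large pattern set or a way to learn patterns from data.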

The group has also collaborated with the Research Institute for the Languages of Finland (Kotus) to develop an etymological database for the Finno-Ugric languages. The database is based on the dictionary Suomen sanojen alkuperä (The Origin of Finnish Words, SSA), which was previously available only on paper. The database will be used for developing and testing algorithms in computational etymology. The algorithms will search for genetic connections between the Finno-Ugric languages. The database will also be a valuable resource for research into Ugric etymology.

The Doremi group has also worked on question-answering (QA) systems. The idea of a QA system is that users can ask questions in natural language and the system finds the answer in a large body of text. Depending on the requirements, the answer is either a passage of text in which the reader can find the answer, or an exact answer, such as a proper noun.

Contact persons: Professor Helena Ahonen-Myka and Roman Yangarber, PhD

Website: http://www.cs.helsinki.fi/research/doremi/

Project:

Mobile and Multilingual Maintenance Man (4M)

Publications

Doucet, A. & Ahonen-Myka, H: Fast extraction of discontiguous sequences in text: a new approach based on maximal frequent sequences. In proceedings of IS-LTC 2006, Information Society - Language Technologies Conference, Ljubljana, Slovenia, October 9-14, 2006, p. 186-191.

Doucet, A. & Ahonen-Myka, H: Probability and Expected Document Frequency of Discontinued Word Sequences, an efficient method for their exact computation. TAL journal, special issue on "Scaling of Natural Language Processing: Complexity, Algorithms and Architectures", 46 (2): 25 pages, 2006.

Lehtonen, M: Designing User Studies for XML Retrieval. In proceedings of the ACM SIGIR 2006 Workshop on XML Element Retrieval Methodology, Seattle, USA, 10 August 2006, pages 28-34.

Lehtonen, M: Preparing Heterogeneous XML for Full-Text Search. ACM Transactions on Information Systems (TOIS), Special Issue on XML Retrieval, 24, 4, pages 455-474. ACM Press, October 2006.

Lehtonen, M: When a Few Highly Relevant Answers Are Enough. Lecture Notes in Computer Science, Advances in XML Information Retrieval and Evaluation: 4th International Workshop of the Initiative for the Evaluation of XML Retrieval, INEX 2005. Volume 3977/2006, p. 296-305.
