UH / CS Department / Annual Report 2007

Doremi Research Group

The Doremi research group concentrates on the field of Language Technology. A great deal of information is available only in plain human language: on the Web, in newswire, in books, etc. There is a rapidly growing need for automatic methods that help us handle this abundance of data. These methods range from organizing the information for easy access to a more refined understanding of the content.

The group studies the problems that arise in handling language data, and develops computational methods that enable us to make sense of these data.

Information Retrieval (IR): We study several problems related to IR. The goal of Topic Detection and Tracking (TDT) is to spot documents that report previously unseen events in news streams, and to track these events as they develop. Because some documents may have internal structure, we also work on exploiting that structure to find information more effectively.

Text mining: Information Extraction aims to find specific "facts" or events discussed in the text, and to use them to populate a structured database or to tag the original text with rich metadata. From a large body of text (a corpus) we also extract important key phrases, which describe the content of the text and help in retrieval tasks.

Analysis of words: In our work on lexical semantics, we investigate how words convey meaning, and how these meanings can be learned from the occurrence (distribution) of the words in large data sets. We also explore problems in etymology -- the origin of and relationships among words -- currently focusing on the Finno-Ugric (Uralic) language family.

Contact person: Professor Roman Yangarber

Homepage: http://doremi.cs.helsinki.fi

Selected publications

A. Doucet, M. Lehtonen: Unsupervised classification of text-centric XML document collections. Comparative Evaluation of XML Information Retrieval Systems, 5th International Workshop of the Initiative for the Evaluation of XML Retrieval. Springer Lecture Notes in Computer Science, Volume 4518 (2007) pp. 515-527

M. Lehtonen, N. Pharo, A. Trotman: A Taxonomy for XML Retrieval Use Cases. Comparative Evaluation of XML Information Retrieval Systems, 5th International Workshop of the Initiative for the Evaluation of XML Retrieval. Springer Lecture Notes in Computer Science, Volume 4518 (2007) pp. 430-439

R. Yangarber, R. Steinberger, C. Best, P. von Etter, F. Fuart, D. Horby: Combining Information Retrieval and Information Extraction for Medical Intelligence. Mining Massive Data Sets for Security, NATO Advanced Study Institute. (2007) Gazzada, Italy

R. Yangarber, C. Best, P. von Etter, F. Fuart, D. Horby, R. Steinberger: Combining Information about Epidemic Threats from Multiple Sources. Multi-source, Multilingual Information Extraction and Summarization, RANLP-2007. (2007) Borovets, Bulgaria
