Department of Computer Science, University of Helsinki

Computational Linguistics and Language Technology
Research Group

The scope of the Research Group covers all problems of analyzing linguistic data.

The Computational Linguistics and Language Technology Research Group is part of the Academy Centre of Excellence ALGODAN: Algorithmic Data Analysis.
(Previously the Centre of Excellence FDK: From Data to Knowledge of the Academy of Finland.)

Our work combines empirical and theoretical approaches to these problems, through research projects with international and domestic collaborators in academia, industry and government organizations.

Group leader: Dr Roman Yangarber.
Please visit the Project Pages, below, to learn more about the work and the people involved in it.

Research Focus: The PULS Project builds tools for semantic analysis of plain text—specifically for surveillance of on-line news media. The Group conducts research in Information Extraction (IE), which is a type of language-understanding technology. In IE, the task is to find certain types of facts, or events, in text, collect the facts into a knowledge base, and perform reasoning and inference over the collected knowledge. We focus at present on three subject domains:
  • epidemiological surveillance,
  • business intelligence,
  • cross-border security and criminal activity.
For further information and publications, please see the PULS Project home page.
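The IE task described above (find facts in text, collect them into a knowledge base, reason over the result) can be illustrated with a minimal sketch. It is not the PULS implementation: the regular expression, the function names `extract_events` and `build_knowledge_base`, and the sample sentences are all assumptions made for illustration, and summing case counts stands in for the inference step.

```python
import re
from collections import defaultdict

# Hypothetical surface pattern for one kind of epidemiological event.
# A real IE system would rely on linguistic analysis, not a single regex.
EVENT_PATTERN = re.compile(
    r"(?P<count>\d+) cases of (?P<disease>[a-z]+) "
    r"(?:were )?reported in (?P<location>[A-Z][a-z]+)"
)

def extract_events(text):
    """Return one record per disease event mentioned in `text`."""
    return [
        {
            "disease": m["disease"],
            "location": m["location"],
            "cases": int(m["count"]),
        }
        for m in EVENT_PATTERN.finditer(text)
    ]

def build_knowledge_base(documents):
    """Aggregate extracted events: total cases per (disease, location)."""
    kb = defaultdict(int)
    for doc in documents:
        for event in extract_events(doc):
            kb[(event["disease"], event["location"])] += event["cases"]
    return dict(kb)

# Invented sample text, not real news data.
documents = [
    "Officials said 12 cases of cholera were reported in Haiti.",
    "Another 5 cases of cholera reported in Haiti this week.",
    "Shares rose sharply in Helsinki.",
]
knowledge_base = build_knowledge_base(documents)
```

Here `knowledge_base` maps `("cholera", "Haiti")` to the combined count from the first two sentences; the third sentence contains no matching event and contributes nothing.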

A central research theme in PULS is automatic acquisition of domain-specific linguistic knowledge from plain text. We develop machine learning techniques, including weakly-supervised learning for rapid bootstrapping for new domains.
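As a rough illustration of weakly-supervised bootstrapping, the sketch below starts from a single seed term, learns the contexts in which known terms occur, and uses those contexts to find new terms. It is a generic textbook-style loop, not the group's algorithm; the function name `bootstrap`, the (previous word, next word) context representation, and the toy corpus are assumptions. Practical systems typically score and filter patterns to limit drift, which this sketch omits.

```python
def bootstrap(sentences, seeds, rounds=2):
    """Alternate between learning contexts from terms and terms from contexts."""
    terms = set(seeds)
    patterns = set()
    for _ in range(rounds):
        # Step 1: record the (previous word, next word) context of each known term.
        for tokens in sentences:
            for i, token in enumerate(tokens):
                if token in terms and 0 < i < len(tokens) - 1:
                    patterns.add((tokens[i - 1], tokens[i + 1]))
        # Step 2: any word that occurs in a learned context becomes a term.
        for tokens in sentences:
            for i in range(1, len(tokens) - 1):
                if (tokens[i - 1], tokens[i + 1]) in patterns:
                    terms.add(tokens[i])
    return terms, patterns

# Invented toy corpus, tokenized on whitespace.
corpus = [
    "outbreak of cholera in Haiti",
    "outbreak of measles in Samoa",
    "cases of measles were reported",
    "cases of dengue were reported",
    "shares of Nokia rose sharply",
]
sentences = [line.split() for line in corpus]
terms, patterns = bootstrap(sentences, seeds={"cholera"})
```

Starting from "cholera" alone, the first round picks up "measles" through the shared context "of ... in", and the second round reaches "dengue" through "of ... were". "Nokia" is never added, because its context is never learned from a known term.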

We work in collaboration with international organizations, which are both research partners and end-users.

A collaborative project between PULS and the TermFactory project at the Department of Modern Languages (formerly the Department of General Linguistics), focusing on large-scale ontology creation and maintenance for text-analysis tasks.
  • Research Focus: The Etymon Project is an extension of UraLink, a project jointly funded by the Academy of Finland and the Russian Fund for the Humanities. Etymon develops computational models for studying the relationship of Finnish to the languages genetically related to it, that is, the Uralic language family. The models are built from lexical data in etymological databases. The methods are based on information-theoretic principles, including MDL (Minimum Description Length). They are not specific to the Uralic family, and we explore their application to other language families.
  • Collaboration:
    • The Russian Academy of Sciences (RAS), Institute of Linguistics. The StarLing Project, a collection of large etymological databases for many language families of the world.
    • KOTUS: analysis and enhancement of data in the Finnish etymological dictionary "Suomen Sanojen Alkuperä". (This database is proprietary, and will be released for public access soon. Please contact us or KOTUS to request permission to access.)
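The MDL idea mentioned under Research Focus can be illustrated with a small sketch: a model that captures regular sound correspondences between related languages should encode aligned data in fewer bits than a model that treats the two languages as unrelated. The code below is not Etymon's model. It assumes a simple adaptive (prequential) code with add-one smoothing, and the function names, the four-symbol alphabet, and the toy correspondences are all invented for illustration.

```python
import math
from collections import Counter

def prequential_code_length(events, alphabet_size):
    """Bits needed to encode `events` one by one with add-one smoothing."""
    counts = Counter()
    bits = 0.0
    for n, event in enumerate(events):
        probability = (counts[event] + 1) / (n + alphabet_size)
        bits -= math.log2(probability)
        counts[event] += 1
    return bits

def joint_cost(pairs, alphabet):
    """Encode aligned (source, target) symbol pairs as single events."""
    return prequential_code_length(pairs, len(alphabet) ** 2)

def independent_cost(pairs, alphabet):
    """Encode the source and target symbols as two unrelated sequences."""
    k = len(alphabet)
    sources = [s for s, _ in pairs]
    targets = [t for _, t in pairs]
    return prequential_code_length(sources, k) + prequential_code_length(targets, k)

# Toy data: each source symbol always corresponds to the same target symbol.
alphabet = ["a", "b", "c", "d"]
regular_pairs = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")] * 10

joint_bits = joint_cost(regular_pairs, alphabet)
independent_bits = independent_cost(regular_pairs, alphabet)
```

On this perfectly regular toy data `joint_bits` is smaller than `independent_bits`: the joint model exploits the correspondences, so under the MDL principle it is the preferred description.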

Etymological BANANAS:
  • A HIIT Pump-Priming Project
  • Research Focus: Modeling genetic relationships among members of a language family, using methods from population genetics. Application to etymological data from different language families, starting with Uralic and Turkic.
  • Collaboration:


The Computational Linguistics and Language Technology Group has previously collaborated with the EU CLARIN project on building infrastructure for linguistic resources.