Computational Linguistics and Language Technology
Research Group
The scope of the Research Group covers all problems of analyzing linguistic data. We investigate:
- how language conveys information,
- how information can be extracted from linguistic data,
- how hidden, underlying structure can be learned from observed linguistic data.
(Previously the Centre of Excellence FDK: From Data to Knowledge of the Academy of Finland.) Our work combines empirical and theoretical approaches to these problems, through research projects with international and domestic collaborators in academia, industry and government organizations. Group leader: Dr Roman Yangarber.
Please visit the Project Pages, below, to learn more about the work and the people involved in it.
- PULS:
- Research Focus: The PULS Project builds tools for semantic analysis of plain text, specifically for surveillance of on-line news media. The Group conducts research in Information Extraction (IE), which is a type of language-understanding technology. In IE, the task is to find certain types of facts, or events, in text, collect the facts into a knowledge base, and perform reasoning and inference over the collected knowledge. We currently focus on three subject domains:
- epidemiological surveillance,
- business intelligence,
- cross-border security and criminal activity.
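The three IE steps described above (find events in text, collect them into a knowledge base, reason over the collected knowledge) can be illustrated with a toy sketch. This is not the PULS system: the regular expression, the function names, and the example sentences are all hypothetical, and a real IE system relies on much richer linguistic analysis than a single surface pattern.

```python
import re
from collections import defaultdict

# Hypothetical surface pattern for one event type:
# "<N> cases of <disease> [were] reported in <Location>"
EVENT_PATTERN = re.compile(
    r"(?P<count>\d+)\s+cases\s+of\s+(?P<disease>[a-z ]+?)\s+"
    r"(?:were\s+)?reported\s+in\s+(?P<location>[A-Z][a-zA-Z]+)"
)


def extract_events(text):
    """Step 1: find events of the target type in plain text."""
    return [
        {
            "disease": m.group("disease"),
            "location": m.group("location"),
            "count": int(m.group("count")),
        }
        for m in EVENT_PATTERN.finditer(text)
    ]


def build_knowledge_base(documents):
    """Step 2: collect the extracted events into one knowledge base."""
    kb = []
    for doc in documents:
        kb.extend(extract_events(doc))
    return kb


def total_cases(kb):
    """Step 3: a very simple inference, aggregating case counts
    per (disease, location) across all collected events."""
    totals = defaultdict(int)
    for event in kb:
        totals[(event["disease"], event["location"])] += event["count"]
    return dict(totals)


# Invented example documents, for illustration only.
documents = [
    "Officials said 12 cases of cholera were reported in Haiti on Monday.",
    "Another 5 cases of cholera reported in Haiti.",
]
kb = build_knowledge_base(documents)
totals = total_cases(kb)
```

Keeping extraction, storage, and inference as separate functions mirrors the way the task is described: each stage can be replaced independently, for example by swapping the pattern for a learned extractor.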
- ContentFactory:
- A collaborative project between PULS and the TermFactory project at the Department of Modern Languages (formerly, the Department of General Linguistics), focusing on large-scale ontology creation and maintenance, for text-analysis tasks.
- Etymon:
- Research Focus: The Etymon Project is an extension of UraLink, a project jointly funded by the Academy of Finland and the Russian Fund for the Humanities. Etymon develops computational models for studying the relationship of the Finnish language to the languages that are genetically related to it, namely the Uralic language family, based on lexical data in etymological databases. The methods are based on information-theoretic principles, including MDL (Minimum Description Length). The methods are applicable more generally, beyond the Uralic family, and we explore their applications to other families.
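As a rough illustration of the MDL idea mentioned above, the sketch below computes the code length, in bits, of a sequence of aligned symbol pairs under a simple adaptive (prequential) code. It is a generic textbook-style example, not the Etymon models: the function name, the Laplace smoothing, the assumed alphabet size, and the toy sound-pair data are all assumptions made here for illustration.

```python
import math
from collections import Counter


def prequential_code_length(events, alphabet_size):
    """Bits needed to encode `events` one at a time, where each event's
    probability is estimated from the events seen so far (add-one smoothing).
    `alphabet_size` is the assumed number of possible distinct events."""
    counts = Counter()
    total_bits = 0.0
    for n, event in enumerate(events):
        probability = (counts[event] + 1) / (n + alphabet_size)
        total_bits -= math.log2(probability)
        counts[event] += 1
    return total_bits


# Invented toy data: pairs of aligned sounds in two hypothetical related languages.
regular = [("k", "k")] * 4 + [("a", "o")] * 4          # recurring correspondences
irregular = [(s, t) for s, t in zip("abcdefgh", "ijklmnop")]  # no recurrence

regular_bits = prequential_code_length(regular, alphabet_size=16)
irregular_bits = prequential_code_length(irregular, alphabet_size=16)
```

Under this code, data with recurring correspondences costs fewer bits than data of the same length with no recurrence, which is the sense in which a shorter description is taken as evidence of underlying regularity.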
Collaboration:
- The Russian Academy of Sciences (RAS), Institute of Linguistics: the StarLing Project, a collection of large etymological databases for many language families of the world.
- KOTUS: analysis and enhancement of data in the Finnish etymological dictionary "Suomen Sanojen Alkuperä". (This database is proprietary, and will be released for public access soon. Please contact us or KOTUS to request permission to access.)
- Etymological BANANAS:
- A HIIT Pump-Priming Project.
- Research Focus: Modeling genetic relationships among members of a language family, using methods from population genetics. Application to etymological data from different language families, starting with Uralic and Turkic.
Collaboration:
- J Corander's group, at the Department of Mathematics and Statistics: population-genetics models. Part of the COIN Center of Excellence of the Academy of Finland.
- Russian Academy of Sciences: analysis of etymological databases, the StarLing Project.
- KOTUS: etymological databases of the Uralic language family.
- Clarin:
- The Computational Linguistics and Language Technology Group has previously collaborated with the EU Clarin project, in building infrastructures for linguistic resources.