Computational Linguistics Research Group
Applications in digital humanities and social sciences
The Group conducts research on core problems in natural language processing (NLP):
- how language conveys information,
- how information can be extracted from text,
- how latent structure can be learned from observed data.
Projects:
- PULS:
  Analysis of big-data streams of news media.
  Information Extraction (IE): find facts and events in text, collect them into a database, and perform reasoning over the collected data.
  Methods: neural networks, unsupervised and weakly supervised learning.
  Domains: general news, business intelligence, epidemiological surveillance, cross-border security and criminal activity.
  Collaboration: please see the project page.
  Funding: Tekes/Business Finland, European Commission.
- Revita:
  Computational modeling for supporting language learning. Revitalization of endangered languages from the Finno-Ugric, Turkic, and other language families.
  Collaboration: Yle, Opetushallitus.
  Funding: Academy of Finland, Project FinUgRevita.
- Etymon:
  Computational models of language evolution.
  Modeling the genetic relationship of the Finnish language to the Uralic language family, based on data in etymological databases.
  Methods: information theory, the Minimum Description Length (MDL) principle.
  The methods are also applied beyond the Uralic family, to Turkic, Indo-European, and Khoisan.
  Collaboration:
  - Russian Academy of Sciences (RAS), Institute of Linguistics: the StarLing Project, a collection of etymological databases for many language families of the world.
  - KOTUS: enhancement of the Finnish etymological dictionary "Suomen Sanojen Alkuperä". (The database is proprietary and will be released for public access soon.)
- SIGSLAV: Special Interest Group on Slavic Natural Language Processing of the Association for Computational Linguistics.
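
As an illustration of the MDL idea named under Etymon, the sketch below compares two ways of encoding a list of aligned sound pairs: coding each language's sounds independently, or coding the pairs jointly. It is a minimal sketch, not the Etymon project's actual model: the adaptive add-one (Laplace) code, the function name `prequential_bits`, and the toy alignment are all assumptions made for this example.

```python
import math
from collections import Counter


def prequential_bits(seq, alphabet_size):
    """Code length in bits of seq under an adaptive add-one (Laplace) model."""
    counts = Counter()
    bits = 0.0
    for seen, sym in enumerate(seq):
        # Probability of sym given everything encoded so far.
        p = (counts[sym] + 1) / (seen + alphabet_size)
        bits -= math.log2(p)
        counts[sym] += 1
    return bits


# Invented toy alignment, not data from any etymological database.
pairs = [("p", "b"), ("t", "d"), ("k", "g")] * 10
src = [a for a, _ in pairs]
tgt = [b for _, b in pairs]
k_src, k_tgt = len(set(src)), len(set(tgt))

# Model 1: the two languages are encoded independently of each other.
independent_bits = prequential_bits(src, k_src) + prequential_bits(tgt, k_tgt)
# Model 2: each aligned pair is encoded as a single joint event.
joint_bits = prequential_bits(pairs, k_src * k_tgt)

print(f"independent: {independent_bits:.1f} bits, joint: {joint_bits:.1f} bits")
```

On this toy input the joint code comes out shorter, because the correspondences are perfectly regular. Under MDL, the shorter description is read as evidence that the two sequences share structure.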
Prior projects:
- Etymological BANANAS:
  Research focus: modeling genetic relationships among members of a language family, using methods from population genetics. Application to etymological data from different language families, starting with Uralic and Turkic.
  Collaboration:
  - J. Corander's group at the Department of Mathematics and Statistics: population-genetics models. Part of the COIN Center of Excellence of the Academy of Finland.
  - Russian Academy of Sciences: analysis of etymological databases, the StarLing Project.
  - KOTUS: etymological databases of the Uralic language family.
- Clarin:
  The EU Clarin project for building infrastructures for linguistic resources.
- ContentFactory:
  Collaboration between PULS and the TermFactory project at the Department of Modern Languages, focusing on large-scale ontology creation and maintenance for text-analysis tasks.
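
The Information Extraction workflow described for PULS above (find facts and events in text, collect them into a database, reason over the collected data) can be sketched in a few lines. This is only an illustrative toy and not the PULS system, which is described as using neural and weakly supervised methods: the regular-expression pattern, the `acquisition` table, the function names, and the example sentences are all invented for this sketch.

```python
import re
import sqlite3

# Hypothetical pattern for one event type: "<Buyer> acquired <Target>".
PATTERN = re.compile(r"(?P<buyer>[A-Z]\w+) acquired (?P<target>[A-Z]\w+)")


def extract_events(text):
    """Step 1: find events in text as (buyer, target) tuples."""
    return [(m["buyer"], m["target"]) for m in PATTERN.finditer(text)]


def build_db(texts):
    """Step 2: collect the extracted events into an in-memory database."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE acquisition (buyer TEXT, target TEXT)")
    for text in texts:
        db.executemany("INSERT INTO acquisition VALUES (?, ?)",
                       extract_events(text))
    return db


def most_active_buyers(db):
    """Step 3: a simple aggregate query over the collected events."""
    return db.execute(
        "SELECT buyer, COUNT(*) AS n FROM acquisition "
        "GROUP BY buyer ORDER BY n DESC, buyer"
    ).fetchall()


# Invented example sentences; the company names are placeholders.
docs = [
    "Acme acquired Bolt last year.",
    "Acme acquired Crate. Delta acquired Echo.",
]
db = build_db(docs)
print(most_active_buyers(db))
```

A pattern-based extractor keeps the example self-contained. In a real system the extraction step would be replaced by a learned model, while the store-then-query structure could stay the same.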