Computational Linguistics Research Group
applications in digital humanities and social sciences

The Group conducts research on core problems in natural language processing (NLP):

- how language conveys information,
- how information can be extracted from text,
- how latent structure can be learned from observed data.

Our work combines empirical and theoretical approaches. We work on projects in a range of application domains with partners in academia, industry, and government organizations.

People:

Jue Hou
Jose María Hoya Quecedo
Sardana Ivanova
Anisia Katinskaia
Max Koppatz
Tomi Rikander
Group leader: Roman Yangarber
Visitors:
- Jukka Perkiö
- Michal Trzos

Recent alumni:
- Lidia Pivovarova, PhD
- Llorenç Escoter, MS
- Javad Nouri, MS
- Mian Du, PhD
- Matthew Pierce, MS
- Guowei Lv, MS
- Hannes Wettig, PhD
- Suvi Hiltunen, MS
- Peter von Etter, MS
- Silja Huttunen, MS
- Heikki Manninen, MS
- ...

Projects:

PULS:

- Analysis of big-data streams of news media. Information extraction: finding facts and events in text, and reasoning over the extracted data.
- Methods: neural networks, supervised and weakly supervised machine learning.
- Domains: general news, business intelligence, epidemiological surveillance, cross-border security and crime.
- Collaboration: see the project page for partners.
- Funding: Tekes/BusinessFinland, European Commission.

Revita:
- Computational modeling to support language learning.
- Revitalization of endangered languages from the Finno-Ugric, Turkic, and other language families.
- Collaboration: University of Helsinki (Department of Modern Languages; Department of Finnish, Finno-Ugrian and Scandinavian Studies);
  YLE; Opetushallitus; University of Jyväskylä; Università degli Studi di Milano.
- Funding: Academy of Finland, Project FinUgRevita.
Etymon:
- Computational models of language evolution. Modeling how Finnish is genetically related to the other languages of the Uralic family, based on data in etymological databases.
  Methods: information theory, Minimum Description Length (MDL) principle.
  Applying the methods beyond the Uralic family: Turkic, Indo-European, Khoisan.
- Collaboration:
  - Russian Academy of Sciences (RAS), Institute of Linguistics.
  - The StarLing Project, a collection of etymological databases for many language families of the world.
  - KOTUS: enhancement of the Finnish etymological dictionary "Suomen Sanojen Alkuperä".
    (The database is proprietary, and will be released for public access soon.)
- Funding: jointly funded by the Academy of Finland and the Russian Fund for the Humanities, project UraLink.
SIGSLAV: Special Interest Group of the Association for Computational Linguistics on NLP for Slavic languages

Previous projects:

Etymological BANANAS:
- Research: Modeling genetic relationships among members of a language family, using methods from population genetics.
  Application to etymological data from different language families, starting with Uralic and Turkic.
- Collaboration:
  - J. Corander's group at the Department of Mathematics and Statistics: population-genetics models.
    Part of the COIN Center of Excellence of the Academy of Finland.
  - Russian Academy of Sciences: analysis of etymological databases, the StarLing Project
  - KOTUS: etymological databases of the Uralic language family.
Clarin:
- The EU Clarin project for building infrastructure for linguistic resources.
ContentFactory:
- Collaboration between PULS and the TermFactory project at the Department of Modern Languages: large-scale ontologies for text-analysis tasks.
