readme

1 Training Data for Automatic Transliteration

These are 13 parallel corpora extracted from Wikipedia page titles, using Wikipedia Categories and Language Links.

We used the Russian Wikipedia dump dated 12 December 2012, and the English and Farsi Wikipedia dumps from approximately the same date.

Each dataset was semi-automatically cleaned, but some noise remains. For example, the English name corresponding to the Russian city "Санкт-Петербург" is "Saint-Petersburg", because this is the traditional English rendering, although [Sankt-Peterburg] would be phonetically more accurate. We tried to remove most cases where the titles of corresponding articles were translations rather than transliterations. We also deleted all patronymics from the Russian data, since in most cases they are omitted in other languages.

Each dataset is presented in two forms: raw and preprocessed.

All results reported in our papers have been achieved using the preprocessed data. Preprocessing consists of the following:

We also removed some inaccuracies discovered in the data at this stage, so the preprocessed data are cleaner than the raw data. The preprocessed data contain a single word per entry: names consisting of multiple tokens were split into separate entries. "Positional" properties of words, such as the distinction between given names and family names, are not preserved in the preprocessed data.
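
The token-splitting step can be sketched as follows. This is only an illustration, not the script used to build the data: the function name split_pair is hypothetical, and the sketch assumes that names are split on whitespace and that tokens are paired by position only when both sides have the same number of tokens. How pairs with mismatched token counts were actually handled is not described here, so the sketch simply drops them. The example names are illustrative and are not quoted from the data files.

```python
def split_pair(source_name, target_name):
    """Split a multi-token name pair into single-word pairs (hypothetical helper)."""
    source_tokens = source_name.split()
    target_tokens = target_name.split()
    # Assumption: tokens can be aligned by position only when the counts match.
    if len(source_tokens) != len(target_tokens):
        return []
    # After this step, given-name / family-name position is no longer recorded.
    return list(zip(source_tokens, target_tokens))


pairs = split_pair("Антон Чехов", "Anton Chekhov")
```

Hyphenated names such as "Санкт-Петербург" stay as a single token under this assumption, because only whitespace is used as a separator.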

The complete list of the data is shown in the table below. The dataset names correspond to the respective Wikipedia Category.

Dataset            Languages  # Pairs  Raw File Name       Preprocessed File Name
American Actors    EN-FA          841  AmericanActorsFA    american-actors-en-fa.utf8
                   EN-GK          410  AmericanActorsGR    american-actors-en-gr.utf8
                   EN-HE         1267  AmericanActorsHE    american-actors-en-he.utf8
                   EN-RU         1474  AmericanActorsRU    american-actors-en-Ru.utf8
French Cities      FR-RU          688  FrenchCitiesRU      french-cities-ru-fr.utf8
Iranian Cities     EN-FA          439  IranianCitiesEN     iranian-cities-en-fa.utf8
                   FA-RU          470  IranianCitiesRU     iranian-cities-ru-fa.utf8
Iranian Locations  FA-RU         1896  IranianLocationsRU  iranian-locations-ru-fa.utf8
Russian Cities     EN-RU         1136  RussianCitiesEN     russian-cities-ru-en.utf8
                   FA-RU          870  RussianCitiesFA     russian-cities-ru-fa.utf8
                   FR-RU          828  RussianCitiesFR     russian-cities-ru-fr.utf8
                   JP-RU          317  RussianCitiesJA     russian-cities-ru-jp.utf8
Russian Writers    EN-RU         1473  RussianWritersEN    russian-writers-en-ru.utf8
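
A preprocessed file from the list above could be read roughly as sketched below. The .utf8 extension suggests UTF-8 encoding, but the line layout is not documented here: the sketch assumes one pair per line with the two words separated by a tab, and makes no claim about which language comes first. The function name read_pairs and the delimiter parameter are hypothetical; adjust the delimiter after inspecting a file.

```python
def read_pairs(lines, delimiter="\t"):
    """Parse word pairs from an iterable of text lines (hypothetical helper)."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        fields = line.split(delimiter)
        # Assumption: a valid entry has exactly two fields.
        if len(fields) != 2:
            continue
        pairs.append((fields[0], fields[1]))
    return pairs


# Possible usage, assuming the file is in the current directory:
# with open("russian-writers-en-ru.utf8", encoding="utf-8") as handle:
#     pairs = read_pairs(handle)
```

Comparing len(pairs) with the "# Pairs" column is a quick way to check whether the assumed layout matches a given file.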

Notes:

______________________________________________________________________________

Date: Tue Apr 14 12:44:12 EEST 2013

Author: Roman Yangarber <Roman.Yangarber@cs.helsinki.fi>

Date: 2013/07/30 00:27:39