Rule Discovery and Probabilistic Modeling for Onomastic Data

Antti Leino, Heikki Mannila, Ritva Liisa Pitkänen

Paper presented at the 7th European Conference on Principles and Practice of Knowledge Discovery in Databases

Abstract

The naming of natural features such as hills, lakes, springs and meadows provides a wealth of linguistic information; the study of names and naming systems is called onomastics. We consider a data set containing all names and locations of about 58,000 lakes in Finland. Using computational techniques, we address two major onomastic themes. First, we examine whether there are local dependencies or repulsion between occurrences of names. For this, we derive a simple form of spatial association rules. The results partially validate and partially contradict results obtained by traditional onomastic techniques. Second, we examine whether there are relatively homogeneous spatial regions with respect to the distributions of place names. Using mixture modeling, we conduct a global analysis of the data set. The resulting clusters of regions are spatially connected and correspond quite well with the results obtained by other techniques; there are, however, interesting differences from previous hypotheses.


Antti Leino
Last modified: Wed Dec 3 15:07:28 EET 2003