Data Sets

The data is available in plain text format. We are considering whether to provide additional pre-processed formats to reduce the effort required of participants. In any case, the plain text versions will remain available and can be used in the challenge. Since pre-processing is likely to lose some information, some methods may benefit from using the original data.

There is one primary data-set, which determines the primary ranking of the participants. The total number of points earned on all the other data-sets (excluding the primary one) will be used as a secondary ranking scheme. For more information on the ranking schemes, see the Rules.

Disclaimer: All the data-sets are provided only for use related to the challenge. Any further use of the data requires explicit permission from the organizers and the original providers of the data.

About the Nexus format: The plain (aligned) text was converted to the Nexus format as follows. At each 'locus' (a place in the text), the words that appear there were sorted alphabetically, and each variant was assigned a symbol (A, B, C, ...) in that order. Thus, if three different variants of a given word appear in the texts, the symbols A, B, and C appear in the Nexus file at that locus. Missing words are replaced by a question mark (?). The Nexus files are directly compatible with, for instance, the PAUP* software. (See the Results.)
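The symbol assignment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the organizers' conversion script: the function name encode_loci, the witness names, and the example words are all made up, the sort uses Python's default string ordering (which may differ from the alphabetical order actually used), and the sketch stops at 26 variants per locus.

```python
from string import ascii_uppercase


def encode_loci(witnesses):
    """Map aligned words to Nexus-style symbols, locus by locus.

    witnesses: dict mapping a witness name to a list of words, one per
    locus; an empty string marks a missing word. All lists are assumed
    to have the same length.
    """
    names = list(witnesses)
    n_loci = len(next(iter(witnesses.values())))
    encoded = {name: [] for name in names}
    for i in range(n_loci):
        # Distinct variants at this locus, sorted (Python default ordering).
        variants = sorted({witnesses[n][i] for n in names if witnesses[n][i]})
        if len(variants) > len(ascii_uppercase):
            raise ValueError(f"locus {i}: more than 26 variants")
        symbol = {word: ascii_uppercase[k] for k, word in enumerate(variants)}
        for n in names:
            word = witnesses[n][i]
            # Missing words become a question mark.
            encoded[n].append(symbol[word] if word else "?")
    return {n: "".join(chars) for n, chars in encoded.items()}


# Made-up example: three witnesses, three loci, one missing word.
example = {
    "w1": ["piispa", "henrik", "surma"],
    "w2": ["pispa", "henrik", ""],
    "w3": ["piispa", "henric", "surma"],
}
print(encode_loci(example))
# {'w1': 'ABA', 'w2': 'BB?', 'w3': 'AAA'}
```

Note that a symbol is only meaningful within its own locus: A at one locus and A at another refer to unrelated words.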

Primary Data-Set: Heinrichi

The primary data-set is the result of an experiment performed by the organizers and 17 volunteers (see Credits). The original text is in old Finnish, and is known as Piispa Henrikin surmavirsi ('The Death-Psalm of Bishop Henry').

Some of the copied variants are 'missing', i.e., not included in the given data-set. The copies were made by hand (pencil and paper). Note that the structure of the possible stemma is not necessarily limited to a bifurcating tree: more than two immediate descendants (children) and more than one immediate predecessor (parents) are possible.
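To make the last point concrete, a stemma of this kind can be stored as a map from each manuscript to its immediate parents. The sketch below is purely illustrative: the manuscript names and the helper functions are hypothetical and do not describe the actual stemma of this data-set.

```python
# Hypothetical stemma: each manuscript maps to its immediate parents.
parents = {
    "original": [],
    "A": ["original"],
    "B": ["original"],
    "C": ["original"],  # "original" has three children
    "D": ["A", "B"],    # "D" was copied from two exemplars
}


def children_of(parents):
    """Invert the parent map into {manuscript: list of immediate children}."""
    children = {name: [] for name in parents}
    for child, ps in parents.items():
        for p in ps:
            children[p].append(child)
    return children


def looks_bifurcating(parents):
    """True if every node has at most one parent and zero or two children.

    This checks the local shape only; it does not verify that the graph
    is connected or free of cycles.
    """
    children = children_of(parents)
    return (all(len(ps) <= 1 for ps in parents.values())
            and all(len(cs) in (0, 2) for cs in children.values()))


print(looks_bifurcating(parents))  # False
```

A method that can only output bifurcating trees could not represent a structure like this one exactly.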

New: The correct stemma and an evaluation script are available at the Causality workbench.

Data-Set #2: Parzival (Validation Data)

This is also an artificial data-set made by copying a text by hand. The correct solution will be made available during the challenge in order to enable self-assessment. This data-set will not affect the final results.

'' If vaccilation dwell with the heart, the soul will see it. 
Shame and honour clash with the courage of a steadfast man is motley like a magpie. 
But such a man may yet make merry, for Heaven & Hell have equal part in him. ... ''

The text is the beginning of the German poem Parzival by Wolfram von Eschenbach, translated into English by A.T. Hatto. The data was kindly provided to us by Matthew Spencer and Heather F. Windram.

Data-Set #3: Notre Besoin

Another artificial data-set. The text is from Stig Dagerman, Notre besoin de consolation est impossible à rassasier, Paris: Actes Sud, 1952 (translated from Swedish into French by P. Bouquet), and was kindly provided to us by Caroline Macé.

'' Je suis dépourvu de foi et ne puis donc être heureux, car un homme qui risque
de craindre que sa vie ne soit errance absurde vers une mort certaine ne peut
être heureux. ... ''

Data-Set #4: Legend of St. Henry

This is a real data-set. The text is the Legend of St. Henry of Finland in Latin, written by the end of the 13th century at the latest. All 52 surviving versions are provided; some of them are severely damaged and fragmentary.

For ease of use, the texts are aligned so that each line in the file contains one word, and empty lines are added so that the lines match across different files (e.g. the 50th line in each file contains the word pontificem unless that part of the manuscript is missing or damaged). The text is roughly 900 words long.
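Because of this alignment, the files can be read line by line and compared by line number. The following Python sketch shows one way to do so. It is an illustration only: the function names, the "*.txt" file pattern, and the UTF-8 encoding are assumptions, not properties of the distributed files.

```python
from pathlib import Path


def parse_aligned(text):
    """Split one aligned file into loci; an empty string marks a gap."""
    return [line.strip() for line in text.splitlines()]


def load_witnesses(directory, pattern="*.txt", encoding="utf-8"):
    """Read every matching file into {file stem: list of words}.

    The file pattern and encoding are assumptions; adjust them to the
    files actually downloaded.
    """
    witnesses = {}
    for path in sorted(Path(directory).glob(pattern)):
        witnesses[path.stem] = parse_aligned(path.read_text(encoding=encoding))
    return witnesses


def words_at(witnesses, line_number):
    """Return the word each witness has on a 1-based line ('' if missing)."""
    return {
        name: words[line_number - 1] if line_number <= len(words) else ""
        for name, words in witnesses.items()
    }
```

With the files loaded this way, words_at(witnesses, 50) would list what each version has on its 50th line, with an empty string wherever that part of the manuscript is missing or damaged.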