The data is available in plain text format. We are considering whether to
provide additional pre-processed formats in order to reduce the effort
required of the participants. In any case, the plain text versions will
remain available and can be used in the challenge: since pre-processing
is likely to lose some information, some methods may benefit from using
the original data.
One primary data-set is used to determine the primary ranking of the
participants. The total number of points earned across all of the other
data-sets (excluding the primary one) will be used as a secondary
ranking scheme. For more information on the ranking schemes, see the
Rules.
Disclaimer: All the data-sets are provided only for use related
to the challenge. Further use of the data requires explicit
permission from the organizers and the original providers of the data.
About the Nexus format: The conversion from plain (aligned)
text to the Nexus format was performed by sorting, at each 'locus' (a
place in the text), the words that appear there into alphabetical
order and assigning each variant a symbol (A, B, C, ...). Thus,
if three different variants of a given word appear in the texts,
then the symbols A, B, and C appear in the Nexus file at that
locus. Missing words are replaced by a question mark (?).
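The symbol assignment described above can be illustrated with a minimal Python sketch. It is not the organizers' conversion script: the function names and the witness names are hypothetical, the sort is Python's default (code-point) ordering rather than a language-specific alphabetical order, and at most 26 variants per locus are assumed. It produces only the character rows, not a complete Nexus file.

```python
from string import ascii_uppercase


def encode_locus(words):
    """Map the words found at one locus to symbols A, B, C, ...

    An empty string or None marks a missing word and becomes '?'.
    Assumes no more than 26 distinct variants at the locus.
    """
    variants = sorted({w for w in words if w})
    symbol = {w: ascii_uppercase[i] for i, w in enumerate(variants)}
    return [symbol[w] if w else "?" for w in words]


def encode_texts(aligned):
    """Encode aligned texts (name -> list of words, all the same length).

    Returns name -> string with one symbol per locus.
    """
    names = list(aligned)
    # zip(*) walks the texts locus by locus
    columns = [encode_locus(col) for col in zip(*(aligned[n] for n in names))]
    return {n: "".join(col[i] for col in columns) for i, n in enumerate(names)}


# Hypothetical three-witness example with one missing word
texts = {
    "w1": ["the", "king", "came"],
    "w2": ["the", "kyng", ""],
    "w3": ["a", "king", "come"],
}
encoded = encode_texts(texts)
```

In this example the first locus has the variants "a" and "the", so "a" receives A and "the" receives B; the missing third word of w2 becomes a question mark.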
The Nexus files are directly compatible with, for instance, the
(See the Results.)
Primary Data-Set: Heinrichi
The primary data-set is a result of an experiment performed by the
organizers and 17 volunteers (see Credits).
The original text is in
old Finnish, and is known as Piispa Henrikin surmavirsi ('The
Death-Psalm of Bishop Henry').
Some of the copied variants are 'missing', i.e., not included in the
given data-set. The copies were made by hand (pencil and paper). Note
that the structure of the possible stemma is not necessarily limited
to a bifurcating tree: more than two immediate descendants (children)
and more than one immediate predecessor (parents) are possible.
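Since a copy may have several parents and a parent may have more than two children, a stemma of this kind is more naturally stored as a directed acyclic graph than as a binary tree. The following Python sketch shows one possible representation; the node names and the helper function are hypothetical and are not taken from the challenge data or its evaluation script.

```python
from collections import defaultdict


def children_of(parents):
    """Invert a child -> list-of-parents mapping into parent -> children."""
    children = defaultdict(list)
    for child, its_parents in parents.items():
        for parent in its_parents:
            children[parent].append(child)
    return dict(children)


# Hypothetical stemma: 'root' has three children (not bifurcating),
# and 'd' was copied from two exemplars, 'a' and 'b'.
parents = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "c": ["root"],
    "d": ["a", "b"],
}
children = children_of(parents)
multi_parent = [node for node, ps in parents.items() if len(ps) > 1]
```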
New: The correct stemma and an evaluation
script are available at
Data-Set #2: Parzival
This is also an artificial data-set made by copying a text by hand.
The correct solution will be made available during the challenge in
order to enable self-assessment. This data-set will not affect the
'' If vaccilation dwell with the heart, the soul will see it.
Shame and honour clash with the courage of a steadfast man is motley like a magpie.
But such a man may yet make merry, for Heaven & Hell have equal part in him. ... ''
The text is the beginning of the German poem Parzival by
Wolfram von Eschenbach, translated to English by A.T. Hatto. The data
was kindly provided to us by Matthew Spencer and Heather F. Windram.
Another artificial data-set.
The text is from Stig Dagerman's Notre besoin de consolation
est impossible à rassasier, Paris: Actes Sud, 1952 (translated to
French from Swedish by P. Bouquet), kindly provided to us by Caroline
'' Je suis dépourvu de foi et ne puis donc être heureux, car un homme qui risque
de craindre que sa vie ne soit errance absurde vers une mort certaine ne peut
être heureux. ... ''
Legend of St. Henry
This is a real data-set. The text is the Legend of St. Henry of Finland
in Latin, written by the end of the 13th century at the latest. The
surviving 52 versions are provided, some of which are severely damaged.
For ease of use, the texts are aligned so that each
line in the file contains one word, and empty lines are added so that
the lines match across different files (e.g. the 50th line in each file
contains the word pontificem unless that part of the manuscript
is missing or damaged). The text is roughly 900 words long.
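Because the files are aligned line by line, the same line number can be compared directly across manuscripts. The Python sketch below shows one way to load such files; the function names are hypothetical, and UTF-8 encoding is an assumption that should be checked against the actual files.

```python
from pathlib import Path


def parse_aligned(text):
    """Split an aligned text into one entry per line.

    An empty line (missing or damaged passage) becomes None.
    """
    return [line.strip() or None for line in text.splitlines()]


def read_aligned(path, encoding="utf-8"):
    """Read one aligned file; the encoding is an assumption."""
    return parse_aligned(Path(path).read_text(encoding=encoding))


def words_at(texts, line_number):
    """Return the word each manuscript has at a 1-based line number."""
    return {
        name: words[line_number - 1] if line_number <= len(words) else None
        for name, words in texts.items()
    }


# Hypothetical miniature example: the second manuscript lacks line 2
texts = {
    "m1": parse_aligned("sanctum\npontificem\n"),
    "m2": parse_aligned("sanctum\n\n"),
}
second_line = words_at(texts, 2)
```

With the real data, calling words_at(texts, 50) on the loaded files would collect the readings at the line where pontificem is expected.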