A challenge funded by the EU Pascal network. Participation is open to all.

The Challenge is over. The data-sets will be maintained and updated in order to create a benchmark for existing and new methods for stemmatology.

Stemmatology (a.k.a. stemmatics) studies relations among different variants of a document that have been gradually built from an original by copying and modifying earlier versions. The aim of such study is to reconstruct the family tree of the variants.

We invite applications of established and, in particular, novel approaches, including but not restricted to hierarchical clustering, graphical modeling, link analysis, phylogenetics, and string matching.

The objective of the challenge is to evaluate the performance of various approaches. Several sets of variants for different texts are provided, and the participants should attempt to reconstruct the relationships of the variants in each data-set. This enables the comparison of methods usually applied in unsupervised scenarios.
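As an illustration of the kind of approach invited above, the following is a minimal sketch of hierarchical clustering applied to text variants. It is not the challenge's evaluation script or any participant's method. The four toy variants, the word-level edit distance, and the helper names `edit_distance` and `cluster` are all assumptions made for this example.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # substitute x -> y
        prev = curr
    return prev[-1]


def cluster(texts):
    """Average-linkage agglomerative clustering of labelled variants.

    Returns a nested tuple of labels describing the merge order.
    """
    words = {name: text.split() for name, text in texts.items()}
    # Pairwise word-level distances between all variants.
    d = {frozenset(pair): edit_distance(words[pair[0]], words[pair[1]])
         for pair in combinations(sorted(words), 2)}
    # Each cluster is (subtree, list of member labels).
    clusters = [(name, [name]) for name in sorted(words)]
    while len(clusters) > 1:
        def avg(i, j):
            a, b = clusters[i][1], clusters[j][1]
            total = sum(d[frozenset((p, q))] for p in a for q in b)
            return total / (len(a) * len(b))
        # Merge the two clusters with the smallest average distance.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: avg(*ij))
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Hypothetical toy variants, invented for this sketch (not challenge data).
variants = {
    "A": "the king rode to the old city",
    "B": "the king rode to the ancient city",
    "C": "a queen walked to the old town",
    "D": "a queen walked to the new town",
}
tree = cluster(variants)
print(tree)
```

A dendrogram of this kind puts every variant at a leaf, whereas a stemma may place a surviving variant at an internal node, so the result is only a rough approximation of a family tree.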

Notifications (most recent first)

March 14, 2009: A paper on the challenge has appeared in Literary and Linguistic Computing.

November 11, 2008: Correct graph, evaluation script, and numeric version of Heinrichi data available at the Causality workbench.

October 15, 2008: More results added, including SplitsTree4 (see Results).

August 30, 2007: All data-sets provided in Nexus format.

August 13, 2007: Primary data provided as an aligned table. More results for the primary data-set.

June 14, 2007: Some more results included for comparison, including PAUP*.

May 4, 2007: Secondary ranking results announced. Winner (secondary ranking): Rudi Cilibrasi.

May 2, 2007: Results announced. Winner (primary ranking): Team Demokritos.

April 11, 2007: The score of the hierarchical clustering method is corrected (see Example).

March 28, 2007: The submission deadline has been extended from March 30 to April 14.

March 27, 2007: Some preliminary results available.

March 22, 2007: Submission is open.

February 20, 2007: Solution to validation data-set available.

December 1, 2006: Primary data-set available.

October 14, 2006: A discussion group for the Challenge is created.

October 6, 2006: First-phase data available.

Important Dates

First-phase data available October 6, 2006
Primary data-set available December 1, 2006
Solution to validation data-set available February 20, 2007
Submission deadline April 14, 2007
Results May 2, 2007


  • Teemu Roos (teemu.roos at …), Helsinki Institute for Information Technology (contact person)
  • Tuomas Heikkilä, Department of History, University of Helsinki
  • Petri Myllymäki, Department of Computer Science, University of Helsinki

Last updated: April 29, 2009
Complex Systems Computation Group | Department of Computer Science | Helsinki Institute for Information Technology