Web pages
---------

Data file: pages
Pattern file: MSN

The weight file for MSN (MSN.found) can be computed by using an RLCSA of the data file. After building the index, the command is

  rlcsa_test -i8 -l -w pages MSN

Terms is the set of search terms with frequency >= 100 that were used for constructing the data file (March 2011).

original/MSN is the original query log. Script split can be used to generate the pattern file from it. Stop words must be removed separately.


DBLP
----

Data file: dblp
Pattern file: authors_terms

The pattern file was created by scripts/parse_dblp.py. Initial patterns were all author names, as well as all terms occurring inside