Web pages
---------

Data file: pages
Pattern file: MSN

The weight file for MSN (MSN.found) can be computed by using an RLCSA of the
data file. After building the index, the command is

  rlcsa_test -i8 -l -w pages MSN

Terms is the set of search terms with frequency >= 100 that were used for
constructing the data file (March 2011).

original/MSN is the original query log. The script split can be used to
generate the pattern file from it. Stop words must be removed separately.


DBLP
----

Data file: dblp
Pattern file: authors_terms

The pattern file was created by scripts/parse_dblp.py. The initial patterns
were all author names, as well as all terms in the data that start and end
with an alphanumeric character. Short patterns and stop words (stopwords.txt)
were then removed from the file.

The final pattern file contains the above terms, sorted by the number of
occurrences in descending order. The frequencies are the same as for web pages
(combine.py). As there were more terms than for web pages, the surplus terms
were removed.

dblp: version 2011-03-29
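
The filtering and ordering described for the DBLP pattern file can be sketched
roughly as follows. This is an illustration only, not the code of
scripts/parse_dblp.py or combine.py. The function names, the MIN_LENGTH
threshold and the case-insensitive stop word check are assumptions, since this
file does not state them. The combine() step is one possible reading of "the
frequencies are the same as for web pages": frequencies are paired with
patterns by rank, and patterns without a matching frequency are dropped.

```python
from collections import Counter

# Hypothetical threshold: the README only says "short patterns" were removed.
MIN_LENGTH = 3


def is_candidate(term):
    """True if the term starts and ends with an alphanumeric character."""
    return bool(term) and term[0].isalnum() and term[-1].isalnum()


def build_patterns(terms, stop_words, min_length=MIN_LENGTH):
    """Return the distinct terms, most frequent first.

    Drops terms that fail is_candidate(), are shorter than min_length,
    or appear in stop_words (compared in lower case, an assumption).
    Ties are broken alphabetically to keep the output deterministic.
    """
    counts = Counter(t for t in terms if is_candidate(t))
    kept = [
        (term, n)
        for term, n in counts.items()
        if len(term) >= min_length and term.lower() not in stop_words
    ]
    kept.sort(key=lambda item: (-item[1], item[0]))
    return [term for term, _ in kept]


def combine(patterns, frequencies):
    """Pair patterns with frequencies by rank (assumed interpretation).

    zip() stops at the shorter input, so surplus patterns are dropped.
    """
    return list(zip(patterns, frequencies))
```

For example, build_patterns(terms, stop_words) could be called with the list
of extracted terms and the set of words read from stopwords.txt, and its result
passed to combine() together with the frequencies taken from the web pages
pattern file.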