To produce a 400-megabyte prefix of the fiwiki archive, with each revision of a page as a separate document, use the following command:

    bunzip2 -c fiwiki.bz2 | wikipedia_extract.py - 400 revision

Replace 'revision' with 'page' to have all revisions of a page as one document.

The following command extracts the first 10000 documents instead and stores them in a single file:

    bunzip2 -c fiwiki.bz2 | extract_sequences.py - 10000 revision

The Python script can be found in the RLCSA package.
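
To make the difference between the 'revision' and 'page' modes concrete, here is a minimal sketch of the idea in Python. It is not the script from the RLCSA package and does not reproduce its behaviour or output format. It assumes the input is a MediaWiki XML export in which each <page> element contains <revision> elements with a <text> child; the names iter_documents and take_documents, and the choice to cut only at document boundaries, are hypothetical and used for illustration only.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace such as '{http://...}page' down to 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_documents(stream, mode="revision"):
    """Yield one string per revision, or one string per page if mode == 'page'.

    Assumption: revision text lives in <text> elements nested inside <page>.
    """
    if mode not in ("revision", "page"):
        raise ValueError("mode must be 'revision' or 'page'")
    page_texts = []
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "text":
            page_texts.append(elem.text or "")
        elif name == "page":
            if mode == "revision":
                yield from page_texts       # every revision is its own document
            else:
                yield "".join(page_texts)   # all revisions form one document
            page_texts = []
            elem.clear()  # drop the finished page to limit memory use


def take_documents(docs, max_docs=None, max_bytes=None):
    """Stop after max_docs documents, or before exceeding max_bytes of UTF-8 text.

    Assumption: the prefix is cut at a document boundary, never mid-document.
    """
    total = 0
    for count, doc in enumerate(docs):
        if max_docs is not None and count >= max_docs:
            break
        size = len(doc.encode("utf-8"))
        if max_bytes is not None and total + size > max_bytes:
            break
        total += size
        yield doc


# Tiny hand-written sample standing in for a real dump.
SAMPLE = (
    b"<mediawiki>"
    b"<page><title>A</title>"
    b"<revision><text>a1</text></revision>"
    b"<revision><text>a2</text></revision>"
    b"</page>"
    b"<page><title>B</title>"
    b"<revision><text>b1</text></revision>"
    b"</page>"
    b"</mediawiki>"
)

for mode in ("revision", "page"):
    print(mode, list(iter_documents(io.BytesIO(SAMPLE), mode)))
```

For a real compressed dump, the stream could be opened with the standard library call bz2.open("fiwiki.bz2", "rb") instead of io.BytesIO, and a size limit such as 400 megabytes would be passed as max_bytes=400 * 1024 * 1024. Whether the RLCSA script counts megabytes as 10^6 or 2^20 bytes is not stated here, so that factor is also an assumption.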