Processing of large document collections, autumn term 2001

582410 Processing of large document collections, Exercise 3

Note! Change in the course requirements: the rest of the lectures are not obligatory (but highly recommended... :-) )

The solutions should be ready for inspection by Thursday 11.10.2001 (midnight).

Implement scripts/programs that remove stopwords and punctuation from the sample of Reuters news documents and produce a list of words. You can use any programming language.
- Perl
  - Sample scripts: word2int.pl, int2word.pl, epi_int2word.pl
  - If you also copy the file test.txt, you can test the first two scripts by running them from the command line:
    1. word2int.pl test.txt
    2. int2word.pl test.int
  - Substitution of a pattern (e.g. a character) with a replacement:
    $var =~ s/pattern/replacement/go;
    For instance, $line =~ s/\?/ /go; replaces every occurrence of a question mark in the string $line with a space.
  - Perl quick syntax reference
- XSLT:
  - Use XSLT if you want to pick up some elements from an XML document:
    1. Copy the file setup_xerces to your directory.
    2. Run the command: source setup_xerces (sets the CLASSPATH)
    3. Use the Xalan XSLT processor from the command line: java org.apache.xalan.xslt.Process -IN foo.xml -XSL foo.xsl -OUT foo.out
  - Samples: card.xml, card.xsl
  - Xalan Command-Line Utility
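
Since any language is allowed, the core of the task above can also be sketched in Python. This is only a minimal illustration: the stopword list is a tiny hypothetical one, the function name extract_words is made up for this example, and only ASCII punctuation is handled. Reading the actual Reuters files is left out.

```python
import string

# Hypothetical, deliberately tiny stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "in", "of", "the", "to"}

# Translation table that maps every ASCII punctuation character to a space,
# similar in spirit to repeated s/pattern/ /g substitutions in Perl.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def extract_words(text):
    """Return the lowercased words of text without punctuation or stopwords."""
    cleaned = text.translate(PUNCT_TO_SPACE).lower()
    return [word for word in cleaned.split() if word not in STOPWORDS]


if __name__ == "__main__":
    sample = "The price of oil rose, and the market reacted."
    print(extract_words(sample))
```

Replacing punctuation with a space (rather than deleting it) keeps words apart when they are separated only by a punctuation mark, but it also splits forms such as "Reuter's" into two tokens; whether that is acceptable depends on how the word list is used later.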

Reading: Sebastiani's article

Study subsections 6.2 (Probabilistic classifiers) and 6.9 (Example-based classifiers), and explain briefly and informally the basic idea of the learning methods they describe.
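
As a rough aid for thinking about probabilistic classifiers, the following sketch shows one common variant: a multinomial naive Bayes classifier with add-one smoothing. This choice is an assumption made for illustration and is not necessarily the exact formulation used in the article. The function names train and classify and the toy training data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Collect counts from a list of (class_label, list_of_words) pairs."""
    class_doc_counts = Counter()          # number of documents per class
    word_counts = defaultdict(Counter)    # word occurrences per class
    vocabulary = set()
    for label, words in labeled_docs:
        class_doc_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_doc_counts, word_counts, vocabulary


def classify(words, model):
    """Return the class with the highest log P(class) + sum log P(word | class)."""
    class_doc_counts, word_counts, vocabulary = model
    total_docs = sum(class_doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_doc_counts:
        score = math.log(class_doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in words:
            if word not in vocabulary:
                continue  # words never seen in training are ignored
            # Add-one smoothing avoids zero probabilities for unseen pairs.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    # Hypothetical toy data, not the course data set.
    training = [
        ("medicine", ["patient", "drug", "treatment"]),
        ("energy", ["oil", "power", "plant"]),
        ("environment", ["pollution", "forest", "water"]),
    ]
    print(classify(["oil", "power"], train(training)))
```

The key simplification is the independence assumption: the probability of a document given a class is treated as a product of per-word probabilities, which is why the score is a plain sum of logarithms.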

You can use the same data set of 10 documents as before to illustrate the ideas. The documents are divided into 3 classes (medicine, energy, and environment). Each document is represented by a wordlist, where each word is followed by a pair (frequency in the document, frequency in the collection).
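
The wordlist representation described above also lends itself to illustrating an example-based classifier. The sketch below is a k-nearest-neighbour classifier under several assumptions: each document is a dictionary mapping a word to its (frequency in the document, frequency in the collection) pair, the word weight is simply the ratio of the two, and similarity is measured with the cosine. The function names and the toy documents are hypothetical, not the course data.

```python
import math
from collections import Counter


def weights(wordlist):
    """Turn {word: (in_doc, in_collection)} into {word: weight}.

    The ratio in_doc / in_collection is an assumed, very simple weighting.
    """
    return {w: in_doc / in_coll for w, (in_doc, in_coll) in wordlist.items() if in_coll > 0}


def cosine(a, b):
    """Cosine similarity of two sparse {word: weight} vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(new_doc, training, k=3):
    """Majority vote among the k training documents most similar to new_doc.

    training is a list of (class_label, wordlist) pairs.
    """
    query = weights(new_doc)
    ranked = sorted(
        training,
        key=lambda item: cosine(query, weights(item[1])),
        reverse=True,
    )
    votes = Counter(label for label, _ in ranked[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical toy documents in the (in_doc, in_collection) format.
    training = [
        ("energy", {"oil": (3, 5), "power": (2, 4)}),
        ("energy", {"oil": (2, 5), "plant": (1, 2)}),
        ("medicine", {"drug": (4, 6), "patient": (2, 3)}),
        ("environment", {"forest": (2, 2), "water": (1, 3)}),
    ]
    print(knn_classify({"oil": (1, 5), "power": (1, 4)}, training, k=3))
```

Unlike the probabilistic approach, nothing is learned in advance here: the training documents themselves are kept, and all the work happens when a new document is compared against them.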

Helena.Ahonen-Myka, 5.10.2001