Note! Change in the course requirements: the rest of the lectures are not obligatory (but highly recommended... :-) )
The solutions should be ready for inspection by Thursday 11.10.2001 (midnight).
Implement scripts/programs that remove stopwords and punctuation from the sample of Reuters news documents and produce a list of words. You can use any programming language.
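As a starting point, here is a minimal sketch of one possible approach, written in Python since any language is allowed. The stopword list `STOPWORDS` and the function name `extract_words` are hypothetical placeholders, not part of the course material. A real solution would load a full stopword list and read the Reuters documents from files.

```python
import re

# Hypothetical, deliberately tiny stopword list; replace it with a real one.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}


def extract_words(text, stopwords=STOPWORDS):
    """Return the lowercased words of text, without punctuation or stopwords."""
    # Replace every character that is neither a word character
    # (letter, digit, underscore) nor whitespace with a space.
    cleaned = re.sub(r"[^\w\s]", " ", text.lower())
    # Split on whitespace and drop the stopwords.
    return [word for word in cleaned.split() if word not in stopwords]


if __name__ == "__main__":
    sample = "The price of oil rose on Monday, analysts said."
    print(extract_words(sample))
```

Note that this simple rule also splits abbreviations such as "U.S." into single letters, so you may want to treat some punctuation specially.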
Perl
Sample scripts: word2int.pl, int2word.pl, epi_int2word.pl
If you also copy the file test.txt, you can test the first two scripts by giving on the command line:
Substitution of a pattern (e.g. a character) with a replacement:
$var =~ s/pattern/replacement/go;
For instance, $line =~ s/\?/ /go; substitutes all the occurrences of the question mark in the string $line with a space.
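For comparison, if you work in Python rather than Perl, the same substitution can be sketched with `re.sub`. The variable name `line` and the sample string are only illustrative.

```python
import re

line = "Is this a question? Yes?"
# Rough equivalent of the Perl statement  $line =~ s/\?/ /go;
# re.sub replaces every match by default, so no "g" modifier is needed.
line = re.sub(r"\?", " ", line)
print(repr(line))
```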
XSLT:
source setup_xerces (sets the CLASSPATH)
java org.apache.xalan.xslt.Process -IN foo.xml -XSL foo.xsl -OUT foo.out
Reading: Sebastiani's article
Study the subsections 6.2 (Probabilistic classifiers) and 6.9 (Example-based classifiers) and explain informally and briefly the basic idea of the learning methods described.
You can use the same data set of 10 documents as before to illustrate the ideas. The documents are divided into 3 classes (medicine, energy, and environment). Each document is represented by a wordlist, where each word is followed by a pair (frequency in the document, frequency in the collection).
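To make the two ideas more concrete, the following Python sketch may help. It is only an illustration under stated assumptions: the training documents in `TRAIN` are invented toy data (not the course data set), only the in-document frequency is used (the collection frequency is ignored), and the probabilistic part is a multinomial Naive Bayes with add-one smoothing, which is one common variant and not necessarily the exact formulation in the article. The names `naive_bayes`, `knn`, and `cosine` are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Invented toy training data: (class label, {word: frequency in the document}).
TRAIN = [
    ("medicine", {"drug": 3, "patient": 2, "trial": 1}),
    ("medicine", {"patient": 1, "vaccine": 2}),
    ("energy", {"oil": 3, "price": 2, "barrel": 1}),
    ("energy", {"oil": 1, "gas": 2, "price": 1}),
    ("environment", {"pollution": 2, "oil": 1, "spill": 2}),
    ("environment", {"emission": 2, "pollution": 1}),
]


def naive_bayes(doc, train=TRAIN):
    """Probabilistic idea: choose the class c that maximizes
    log P(c) + sum over words of freq * log P(word | c)."""
    class_docs = Counter(label for label, _ in train)
    word_counts = defaultdict(Counter)
    for label, words in train:
        word_counts[label].update(words)
    vocab = {w for _, words in train for w in words}
    scores = {}
    for label in class_docs:
        total = sum(word_counts[label].values())
        score = math.log(class_docs[label] / len(train))
        for word, freq in doc.items():
            # Add-one smoothing keeps unseen words from giving probability zero.
            p = (word_counts[label][word] + 1) / (total + len(vocab))
            score += freq * math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)


def cosine(a, b):
    """Cosine similarity between two {word: frequency} dictionaries."""
    dot = sum(f * b.get(w, 0) for w, f in a.items())
    norm = math.sqrt(sum(f * f for f in a.values())) * math.sqrt(
        sum(f * f for f in b.values())
    )
    return dot / norm if norm else 0.0


def knn(doc, train=TRAIN, k=3):
    """Example-based idea: let the k most similar training documents vote."""
    nearest = sorted(train, key=lambda item: cosine(doc, item[1]), reverse=True)[:k]
    votes = Counter(label for label, _ in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    new_doc = {"oil": 2, "price": 1}
    print(naive_bayes(new_doc), knn(new_doc))
```

The contrast to notice: the probabilistic classifier first summarizes the training documents into per-class word probabilities, whereas the example-based classifier builds no model at all and compares the new document directly with the stored examples.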