582410 Processing of large document collections, Exercise 3

Note! Change in the course requirements: the rest of the lectures are not obligatory (but highly recommended... :-) )

The solutions should be ready for inspection by Thursday 11.10.2001 (midnight).

  1. Implement scripts/programs that remove stopwords and punctuation from the sample of Reuters news documents and produce a list of words. You can use any programming language.
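
One possible shape for such a script is sketched below in Python. It is only an illustration under stated assumptions: the stopword set is a tiny made-up sample (not a list supplied by the course), the function name `extract_words` is hypothetical, and the tokenization rule (lowercase, then keep runs of ASCII letters) is one choice among many.

```python
import re

# Tiny illustrative stopword set (an assumption; substitute a real list).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "for", "on"}


def extract_words(text, stopwords=STOPWORDS):
    """Return the lowercased words of `text` without punctuation and stopwords."""
    # Keeping only runs of letters drops punctuation (and digits) in one step.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stopwords]


if __name__ == "__main__":
    sample = "The price of oil rose in London, analysts said."
    print(extract_words(sample))
    # ['price', 'oil', 'rose', 'london', 'analysts', 'said']
```

Note that this rule also splits abbreviations such as "U.S." into single letters and discards numbers; whether that is acceptable depends on how the word list is used later.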


  2. Reading: Sebastiani's article

    Study subsections 6.2 (Probabilistic classifiers) and 6.9 (Example-based classifiers) and explain, informally and briefly, the basic idea of the learning methods described there.

    You can use the same data set of 10 documents as before to illustrate the ideas. The documents are divided into 3 classes (medicine, energy, and environment). Each document is represented by a wordlist, where each word is followed by a pair (frequency in the document, frequency in the collection).
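
To make the two ideas concrete, the sketch below applies one probabilistic classifier (multinomial naive Bayes with add-one smoothing) and one example-based classifier (k nearest neighbours with cosine similarity) to wordlists of the form described above. Everything here is an assumption made for illustration: the four toy documents are invented and are not the course data set, the function names are hypothetical, and these particular formulations are common textbook variants rather than necessarily the exact ones in Sebastiani's article. The collection frequency is stored but not used by either classifier.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy data: (class, {word: (freq in document, freq in collection)}).
DOCS = [
    ("medicine",    {"drug": (3, 5), "patient": (2, 2), "oil": (1, 6)}),
    ("medicine",    {"drug": (2, 5), "vaccine": (2, 2)}),
    ("energy",      {"oil": (4, 6), "gas": (2, 3)}),
    ("environment", {"pollution": (3, 3), "oil": (1, 6), "gas": (1, 3)}),
]


def classify_naive_bayes(docs, query):
    """Probabilistic idea: pick the class maximizing P(class) * prod P(word | class)."""
    class_docs = Counter(label for label, _ in docs)
    word_counts = defaultdict(Counter)
    for label, words in docs:
        for w, (tf, _cf) in words.items():
            word_counts[label][w] += tf
    vocab = {w for _, words in docs for w in words}
    best, best_score = None, -math.inf
    for label, n in class_docs.items():
        total = sum(word_counts[label].values())
        score = math.log(n / len(docs))  # log prior
        for w, tf in query.items():
            # Add-one smoothing keeps unseen words from zeroing the product.
            p = (word_counts[label][w] + 1) / (total + len(vocab))
            score += tf * math.log(p)
        if score > best_score:
            best, best_score = label, score
    return best


def cosine(a, b):
    """Cosine similarity of two sparse term-frequency vectors (dicts)."""
    dot = sum(v * b.get(w, 0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def classify_knn(docs, query, k=3):
    """Example-based idea: let the k most similar training documents vote."""
    scored = sorted(
        ((cosine(query, {w: tf for w, (tf, _cf) in words.items()}), label)
         for label, words in docs),
        reverse=True,
    )
    votes = Counter()
    for sim, label in scored[:k]:
        votes[label] += sim  # each neighbour's vote is weighted by its similarity
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    query = {"oil": 2, "gas": 1}  # term frequencies of a new, unlabeled document
    print(classify_naive_bayes(DOCS, query), classify_knn(DOCS, query))
```

The contrast to notice is that naive Bayes summarizes the training documents into per-class word statistics before any query arrives, whereas k-NN keeps the training documents themselves and compares the query against each of them at classification time.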




  • Helena.Ahonen-Myka, 5.10.2001