Processing of large document collections, autumn 2001

582410 Processing of large document collections, Exercise 1

The solutions should be ready for inspection by Thursday 27.9.2001 (midnight).

For the sample of Reuters news documents, sketch the processing steps needed to produce a list of "important" words from the text of these documents. This word list could then be used to represent the documents as vectors. The output can be, e.g., one word per line. You don't have to implement the steps (yet...), but try to be as detailed as you can, given your background knowledge.
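As a rough illustration of what such a pipeline might look like, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the tiny stopword list, the helper names tokenize and important_words, the document-frequency thresholds, and the three made-up sentences standing in for news documents. It covers tokenization, lowercasing, stopword removal and document-frequency filtering, and leaves out steps such as markup removal and stemming.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real list would be much longer).
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "for", "on", "said"}


def tokenize(text):
    """Lowercase the text and return its purely alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def important_words(documents, min_df=2, max_df_ratio=0.9):
    """Return, in alphabetical order, the words whose document frequency
    is at least min_df and at most max_df_ratio * number of documents."""
    df = Counter()
    for doc in documents:
        # Count each word once per document; drop stopwords and very short tokens.
        terms = {t for t in tokenize(doc) if t not in STOPWORDS and len(t) > 2}
        df.update(terms)
    max_df = max_df_ratio * len(documents)
    return sorted(w for w, n in df.items() if min_df <= n <= max_df)


# Made-up example sentences, not taken from the actual Reuters sample.
docs = [
    "Oil prices rose in London on Monday.",
    "The price of oil fell as OPEC raised output.",
    "Wheat exports rose, the ministry said.",
]

# Output format suggested in the exercise: one word per line.
for word in important_words(docs):
    print(word)
```

With these three sentences the sketch prints "oil" and "rose". Note that "prices" and "price" are counted as different words, which is one reason a stemming step is worth considering.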
Explain why the binary case in text categorization is more general than the multi-label case (see Sebastiani's article).
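One way to approach the second question is to note that any learner for the binary case can be reused for the multi-label case by training one independent yes/no classifier per category. The Python sketch below illustrates only this reduction. The keyword-based learner, the function names and the toy data are all hypothetical and are not taken from Sebastiani's article.

```python
def train_keyword_classifier(docs, labels):
    """Toy binary learner (an assumption for illustration): remembers the
    words that occur in positive documents but in no negative document."""
    pos, neg = set(), set()
    for doc, is_positive in zip(docs, labels):
        (pos if is_positive else neg).update(doc.lower().split())
    keywords = pos - neg
    # The returned classifier answers True if the document contains any keyword.
    return lambda doc: any(w in keywords for w in doc.lower().split())


def train_multilabel(docs, label_sets, categories):
    """Reduce the multi-label problem to one binary problem per category."""
    return {
        c: train_keyword_classifier(docs, [c in labels for labels in label_sets])
        for c in categories
    }


def predict_multilabel(classifiers, doc):
    """Collect every category whose binary classifier accepts the document."""
    return {c for c, accepts in classifiers.items() if accepts(doc)}


# Made-up training data: each document has a set of categories.
docs = ["oil prices rose", "wheat exports fell", "oil and wheat tariffs"]
label_sets = [{"energy"}, {"grain"}, {"energy", "grain"}]

classifiers = train_multilabel(docs, label_sets, ["energy", "grain"])
print(predict_multilabel(classifiers, "oil output"))     # {'energy'}
```

The reduction treats the categories as independent of each other. Going the other way is not generally possible: a classifier that may assign any number of categories cannot be forced to assign exactly one.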