582410 Processing of large document collections, Exercise 2

The solutions should be ready for inspection by Thursday 4.10.2001 (midnight).

Reading: Sebastiani's article

Use the data set of 10 documents in the following tasks. The documents are divided into 3 classes (medicine, energy, and environment). Each document is represented by a word list, where each word is followed by a pair (frequency in the document, frequency in the collection).

  1. Describe, giving examples, how the document terms are weighted if the TFIDF method is used.

  2. Compare the term (feature) selection methods

    giving examples of how these methods rank terms in the documents of the data set. The following article may be helpful: Yang, Y., Pedersen, J.O. A Comparative Study on Feature Selection in Text Categorization. Proceedings of the Fourteenth International Conference on Machine Learning (ICML'97), 1997, pp. 412-420.

  3. Simulate the Rocchio method in text categorization.

  4. Note that you do not have to calculate weights etc. for each term and/or category. Just use enough examples to make your point clear.
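
As a rough illustration of tasks 1 and 3, the following Python sketch weights terms with TFIDF and then builds Rocchio class profiles from the weighted documents. It is only a sketch under stated assumptions, not a model solution: the toy collection DOCS is hypothetical and is not the course data set, the weight is assumed to be tf * log(N / df) with cosine normalisation, "frequency in the collection" is assumed to mean the number of documents containing the term, and the values beta = 16 and gamma = 4 are chosen purely for illustration. All function names are made up for this example.

```python
import math
from collections import defaultdict

# Hypothetical toy collection, NOT the course data set.
# Each entry: (class label, {term: frequency in the document}).
DOCS = [
    ("medicine", {"patient": 3, "drug": 2, "energy": 1}),
    ("medicine", {"patient": 1, "therapy": 2}),
    ("energy", {"energy": 4, "oil": 2}),
    ("energy", {"energy": 1, "nuclear": 3, "waste": 1}),
    ("environment", {"waste": 2, "pollution": 3}),
]


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = defaultdict(int)
    for _, tf in docs:
        for term in tf:
            df[term] += 1
    return dict(df)


def tfidf(tf, df, n_docs):
    """Assumed weighting: tf * log(N / df), followed by cosine normalisation."""
    w = {t: f * math.log(n_docs / df[t]) for t, f in tf.items() if df.get(t)}
    norm = math.sqrt(sum(v * v for v in w.values()))
    return {t: v / norm for t, v in w.items()} if norm else w


def rocchio_profiles(vectors, labels, beta=16.0, gamma=4.0):
    """Profile of a class: beta * mean(positive docs) - gamma * mean(negative docs)."""
    profiles = {}
    for c in sorted(set(labels)):
        pos = [v for v, l in zip(vectors, labels) if l == c]
        neg = [v for v, l in zip(vectors, labels) if l != c]
        profile = defaultdict(float)
        for v in pos:
            for t, w in v.items():
                profile[t] += beta * w / len(pos)
        for v in neg:  # skipped automatically if there are no negative examples
            for t, w in v.items():
                profile[t] -= gamma * w / len(neg)
        profiles[c] = dict(profile)
    return profiles


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def classify(tf, df, n_docs, profiles):
    """Assign the class whose Rocchio profile is most similar to the document."""
    v = tfidf(tf, df, n_docs)
    return max(profiles, key=lambda c: cosine(v, profiles[c]))


DF = document_frequency(DOCS)
VECTORS = [tfidf(tf, DF, len(DOCS)) for _, tf in DOCS]
LABELS = [label for label, _ in DOCS]
PROFILES = rocchio_profiles(VECTORS, LABELS)

if __name__ == "__main__":
    # Show the TFIDF weights of the first toy document, highest first.
    for term, weight in sorted(VECTORS[0].items(), key=lambda x: -x[1]):
        print(f"{term:10s} {weight:.3f}")
    print(classify({"patient": 2, "drug": 1}, DF, len(DOCS), PROFILES))
```

In this sketch a term that occurs in every document gets weight 0, because log(N / df) is 0, and a term that occurs only in the negative examples of a class gets a negative weight in that class profile.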




    Helena.Ahonen-Myka