582410 Processing of large document collections, Exercise 2


  1. Compare the following classification problems.

    Consider, for instance, the following aspects:



  2. Assume that the effectiveness of a classifier is evaluated using a test set of 10 documents. The following table shows, for each document, the judgment recorded in the test set (category/testset) and the decision of the classifier (category/classifier). What are the recall and precision of the classifier with respect to each of the categories (energy, environment, medicine) separately? Calculate the global recall and precision using both microaveraging and macroaveraging. Also give the combined measure F1.

    Doc category/testset category/classifier
    1 energy energy
    2 environment medicine
    3 energy energy
    4 medicine environment
    5 medicine medicine
    6 medicine medicine
    7 energy energy
    8 energy energy
    9 environment environment
    10 environment energy
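
One way to check a hand calculation is the following minimal Python sketch. It copies the ten (test set, classifier) pairs from the table and counts true positives, false positives, and false negatives per category. The function names (`contingency`, `evaluate`, `f1`) are illustrative, not part of the course material. It also assumes that the macroaveraged F1 is computed from the macroaveraged precision and recall; averaging the per-category F1 values is another convention and gives a different number.

```python
from collections import Counter

# (category/testset, category/classifier) for documents 1..10, copied from the table.
judgments = [
    ("energy", "energy"),
    ("environment", "medicine"),
    ("energy", "energy"),
    ("medicine", "environment"),
    ("medicine", "medicine"),
    ("medicine", "medicine"),
    ("energy", "energy"),
    ("energy", "energy"),
    ("environment", "environment"),
    ("environment", "energy"),
]


def contingency(pairs):
    """Count true positives, false positives and false negatives per category."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, predicted in pairs:
        if truth == predicted:
            tp[truth] += 1
        else:
            fn[truth] += 1      # the correct category was missed
            fp[predicted] += 1  # the chosen category was wrong
    return tp, fp, fn


def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def evaluate(pairs):
    tp, fp, fn = contingency(pairs)
    categories = sorted({c for pair in pairs for c in pair})

    # Per-category precision and recall.
    per_category = {}
    for c in categories:
        p = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        r = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        per_category[c] = (p, r)

    # Microaveraging: sum the counts over all categories, then divide.
    tp_all, fp_all, fn_all = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro = (tp_all / (tp_all + fp_all), tp_all / (tp_all + fn_all))

    # Macroaveraging: average the per-category values.
    macro = (
        sum(p for p, _ in per_category.values()) / len(categories),
        sum(r for _, r in per_category.values()) / len(categories),
    )
    return per_category, micro, macro


per_category, micro, macro = evaluate(judgments)
for name, (p, r) in per_category.items():
    print(f"{name}: precision={p:.3f} recall={r:.3f} F1={f1(p, r):.3f}")
print(f"micro: precision={micro[0]:.3f} recall={micro[1]:.3f} F1={f1(*micro):.3f}")
print(f"macro: precision={macro[0]:.3f} recall={macro[1]:.3f} F1={f1(*macro):.3f}")
```

Because every document here has exactly one category in both columns, each misclassification counts once as a false negative and once as a false positive, so the microaveraged precision and recall coincide.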


  3. Calculate the information gain for the terms 'current', 'treatment', and 'network' in the collection of 10 documents. In the file, each document is marked up with its category (medicine, energy, or environment). After each term, the first number is the number of occurrences of the term in that document, and the second number is the number of occurrences of the term in the entire collection.

    Use base 2 for the logarithms. Some tips:

    log2 x = log10 x / log10 2

    log (x/y) = log x - log y

    log2 1 = 0, log2 2 = 1
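
The calculation can be sketched in Python as follows. This is a minimal sketch under stated assumptions: it uses the entropy-reduction form IG(t) = H(C) - H(C | t present or absent), which may be written differently in the course material, and it treats a term as a binary event (present or absent in a document), ignoring the occurrence counts. The four documents in `toy_docs` are hypothetical and are not the exercise collection; the names `entropy` and `information_gain` are illustrative.

```python
from math import log2


def entropy(counts):
    """Entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((n / total) * log2(n / total) for n in counts if n > 0)


def information_gain(docs, term):
    """Information gain of `term` over the categories.

    docs is a list of (category, set_of_terms) pairs.
    """
    categories = sorted({category for category, _ in docs})

    def category_counts(labels):
        return [labels.count(c) for c in categories]

    all_labels = [category for category, _ in docs]
    with_term = [category for category, terms in docs if term in terms]
    without_term = [category for category, terms in docs if term not in terms]

    # Start from the entropy of the categories, then subtract the
    # weighted entropy that remains after splitting on the term.
    gain = entropy(category_counts(all_labels))
    for subset in (with_term, without_term):
        if subset:
            weight = len(subset) / len(docs)
            gain -= weight * entropy(category_counts(subset))
    return gain


# Hypothetical toy collection, NOT the data of the exercise.
toy_docs = [
    ("medicine", {"treatment", "patient"}),
    ("medicine", {"treatment"}),
    ("energy", {"current", "network"}),
    ("energy", {"current"}),
]

for term in ("treatment", "patient", "network"):
    print(term, round(information_gain(toy_docs, term), 4))
```

In the toy collection, 'treatment' separates the two categories perfectly, so knowing whether it occurs removes all uncertainty about the category; a term that occurs in no document, or in every document, has zero gain.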


    1. The goal of this exercise is to prepare for our next application area, namely text summarization. Choose one of the texts: The 21st Century Belongs to... or L.A. Times Reorganizes...

      Give a summary of 5-10 sentences of the text.

    2. Compare your summary to the one generated by the FociSum summarization system http://www.cs.columbia.edu/~hjing/sumDemo/ (follow the link FociSum, then Examples).

      You can also try the same text with the MEAD summarizer.



Helena.Ahonen-Myka