582410 Processing of large document collections, Exercise 2

Compare the following classification problems.
- Spam detection
- Language identification (is this document written in Finnish?)
- Author detection (is this sonnet written by Shakespeare?)
Consider, for instance, the following aspects:
- Problem setting: What are the documents, what are the categories? Multi-label, single-label, binary?
- Where/how would you find training documents? What kind of training documents are good?
- What kind of terms would probably be good for each problem? Do you think that the terms found in the content are enough? Which (automatic/manual) methods could be used for term selection?
- Evaluation: are precision and recall equally important?
- Can you identify any (other) specific problems/issues?

Assume that the effectiveness of a classifier is evaluated using a test set of 10 documents. The following table shows, for each document, the judgment recorded in the test set (category/testset) and the decision of the classifier (category/classifier). What are the recall and precision of the classifier with respect to each of the categories (energy, environment, medicine) separately? Calculate the global recall and precision using both microaveraging and macroaveraging. Also give the combined measure F₁.

Doc	category/testset	category/classifier
1	energy	energy
2	environment	medicine
3	energy	energy
4	medicine	environment
5	medicine	medicine
6	medicine	medicine
7	energy	energy
8	energy	energy
9	environment	environment
10	environment	energy
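
One way to check hand calculations for the table above is a short script. The following is only a sketch: the names JUDGMENTS, ratio, f1 and evaluate are made up for this illustration, and it assumes that a wrong decision counts as a false negative for the test-set category and a false positive for the category the classifier chose. The macroaveraged F₁ is computed here from the macroaveraged precision and recall; averaging the per-category F₁ values instead is another convention and gives a different number.

```python
from collections import Counter

# (category/testset, category/classifier) for documents 1-10, copied from the table.
JUDGMENTS = [
    ("energy", "energy"),
    ("environment", "medicine"),
    ("energy", "energy"),
    ("medicine", "environment"),
    ("medicine", "medicine"),
    ("medicine", "medicine"),
    ("energy", "energy"),
    ("energy", "energy"),
    ("environment", "environment"),
    ("environment", "energy"),
]


def ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return ratio(2 * precision * recall, precision + recall)


def evaluate(pairs):
    """Return per-category, microaveraged and macroaveraged (precision, recall)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, predicted in pairs:
        if gold == predicted:
            tp[gold] += 1
        else:
            fn[gold] += 1       # missed for the test-set category
            fp[predicted] += 1  # wrongly assigned to the chosen category
    categories = sorted({c for pair in pairs for c in pair})
    per_category = {
        c: (ratio(tp[c], tp[c] + fp[c]), ratio(tp[c], tp[c] + fn[c]))
        for c in categories
    }
    # Microaveraging: pool the counts over all categories, then divide.
    tp_sum, fp_sum, fn_sum = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro = (ratio(tp_sum, tp_sum + fp_sum), ratio(tp_sum, tp_sum + fn_sum))
    # Macroaveraging: compute per category first, then take the plain mean.
    macro = (
        sum(p for p, _ in per_category.values()) / len(categories),
        sum(r for _, r in per_category.values()) / len(categories),
    )
    return per_category, micro, macro


per_category, micro, macro = evaluate(JUDGMENTS)
for name, (p, r) in per_category.items():
    print(f"{name}: precision={p:.3f} recall={r:.3f} F1={f1(p, r):.3f}")
print(f"micro: precision={micro[0]:.3f} recall={micro[1]:.3f} F1={f1(*micro):.3f}")
print(f"macro: precision={macro[0]:.3f} recall={macro[1]:.3f} F1={f1(*macro):.3f}")
```

Because every document has exactly one category in both columns, each error adds one false positive and one false negative, so the microaveraged precision and recall come out equal.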

Calculate the information gain for the terms 'current', 'treatment', and 'network' in the collection of 10 documents. In the file, each document is marked up with its category (medicine, energy, or environment). After each term, the first number is the number of occurrences of this term in this document. The second number is the number of occurrences of this term in the entire collection.

Use base 2 for the logarithms. Some tips:

log₂ x = log₁₀ x / log₁₀ 2

log (x/y) = log x - log y

log₂ 1 = 0, log₂ 2 = 1
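
The calculation can also be sketched in code. The version below is an illustration under stated assumptions, not necessarily the exact definition used in the lectures: it treats a term as simply present or absent in a document, estimates all probabilities from document counts, and computes information gain as the category entropy minus the weighted entropies of the documents with and without the term. The function names and the toy counts (categories "a" and "b") are hypothetical and are not taken from the exercise collection.

```python
import math


def entropy(counts):
    """Entropy in bits of the distribution given by non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts if n > 0)


def information_gain(docs_per_category, docs_with_term):
    """Information gain of a term from per-category document counts.

    docs_per_category: category -> number of documents in that category
    docs_with_term:    category -> number of those documents containing the term
    """
    categories = list(docs_per_category)
    totals = [docs_per_category[c] for c in categories]
    with_term = [docs_with_term.get(c, 0) for c in categories]
    without_term = [t - w for t, w in zip(totals, with_term)]
    n = sum(totals)
    p_term = sum(with_term) / n
    return (
        entropy(totals)
        - p_term * entropy(with_term)
        - (1 - p_term) * entropy(without_term)
    )


# Hypothetical toy collection: 4 documents, 2 per category.
toy = {"a": 2, "b": 2}
# A term found in every "a" document and no "b" document separates the categories.
print(information_gain(toy, {"a": 2, "b": 0}))
# A term found in one document of each category tells nothing about the category.
print(information_gain(toy, {"a": 1, "b": 1}))
```

For the exercise, the per-category counts would be read from the marked-up file; only whether the first number after a term is non-zero matters under the presence/absence assumption made here.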

1. The goal of this exercise is to prepare for our next application area, namely text summarization. Choose one of the texts: The 21st Century Belongs to... or L.A. Times Reorganizes...
  
  Give a summary of 5-10 sentences of the text.
2. Compare your summary to the one generated by the FociSum summarization system http://www.cs.columbia.edu/~hjing/sumDemo/ (follow the link FociSum, then Examples).
  
  You can also try the same text with the MEAD summarizer.

Helena.Ahonen-Myka