582410 Processing of large document collections, Exercise 3


  1. In this exercise we study the boosting algorithm AdaBoost. On slide 19 (on the weak learner) it is mentioned that "a term is chosen that minimizes epsilon(t) or 1 - epsilon(t)", where epsilon(t) is calculated as on slide 18. Thus a term with an error of, for instance, 0.8 can be chosen if no term among the candidates has an error smaller than 0.2. What happens later in the process (in the updating of the weights, and in the final classifier) if a chosen term has a large error? Why is it useful to choose this kind of term? (A sketch of the relevant update rule follows the reference below.)

    Schapire, Singer and Singhal: Boosting and Rocchio Applied to Text Filtering. Proceedings of SIGIR-98, the 21st ACM International Conference on Research and Development in Information Retrieval, 1998.
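
    As a hint, consider the sign of the term weight. Below is a minimal
    sketch of the standard AdaBoost update, assuming the usual choice
    alpha(t) = 0.5 * ln((1 - epsilon(t)) / epsilon(t)) (the textbook
    rule, not necessarily the paper's exact formulation); observe what
    happens when epsilon(t) > 0.5.

    import math

    def alpha(epsilon):
        """Weight of a weak hypothesis in standard AdaBoost.

        Negative when epsilon > 0.5, so the weak hypothesis is used
        'inverted': a term with error 0.8 carries as much information
        as a term with error 0.2.
        """
        return 0.5 * math.log((1.0 - epsilon) / epsilon)

    def update_weights(weights, correct, a):
        """One round of the example-weight update.

        weights: current weight of each training example
        correct: True where the weak hypothesis was right
        a:       alpha for this round
        Misclassified examples are multiplied by e^a: they gain weight
        when a > 0, but lose weight when a < 0 (an inverted term).
        """
        new = [w * math.exp(-a if c else a) for w, c in zip(weights, correct)]
        z = sum(new)                          # normalisation constant
        return [w / z for w in new]

    print(alpha(0.2))   #  0.69... positive vote
    print(alpha(0.8))   # -0.69... negative vote: the prediction is flipped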

  2. Exercises 2 and 3 are based on the article: Kupiec, Pedersen, Chen: A trainable document summarizer. Proceedings of the 18th ACM-SIGIR Conference, pp. 68-73, 1995. Also Chapter 5 (pp. 55-60) in Advances in Automatic Text Summarization, eds. Mani, Maybury. The MIT Press, 1999.

    This paper was also discussed in the lectures.

    Data for the following exercises


    In this exercise we study the method described in the paper in more detail. To simplify the approach, only three of the features are selected. They are described (and interpreted) below; a small code sketch of the features follows the definitions:

    
    F1: Sentence Length Cut-off Feature
    
    Given a threshold of 7 words, the feature is true for all sentences
    longer than the threshold, and false otherwise.
    
    That is:  
    F1(s) = 0 if sentence s has 7 or fewer words, and
    F1(s) = 1 if sentence s has more than 7 words
    
    F2: Paragraph Feature
    
    Sentences in a paragraph are distinguished according to whether
    they are paragraph-initial (i), paragraph-final (f), or
    paragraph-medial (m). 
    
    F2(s) = i if sentence s is the first sentence in a paragraph
    F2(s) = f if there are at least two sentences in a paragraph, and
                s is the last one
    F2(s) = m if there are at least three sentences in a paragraph, and
                s is neither the first nor the last sentence
    
    F3: Thematic Word Feature
    
    The most frequent content words are defined as thematic words.
    A set of thematic words has been selected for both the training set
    (13 words) and the test set (9 words). Each sentence is scored
    by frequency (= the sum of the frequencies of the thematic
    words contained in the sentence). The highest-scoring
    sentences are selected into a set HighScores.
    
    F3(s) = 0 if sentence s belongs to HighScores
    F3(s) = 1 if sentence s does not belong to HighScores
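
    Below is a minimal sketch, in Python, of how the three features
    could be computed. The sentence and paragraph representations and
    the thematic-word frequency table are illustrative assumptions,
    not part of the exercise data.

    def f1(sentence, threshold=7):
        """Sentence length cut-off: 1 if longer than the threshold."""
        return 1 if len(sentence.split()) > threshold else 0

    def f2(paragraph, index):
        """Position in paragraph: 'i' initial, 'f' final, 'm' medial."""
        if index == 0:
            return "i"
        if index == len(paragraph) - 1:
            return "f"
        return "m"

    def thematic_score(sentence, freq):
        """Sum of thematic-word frequencies occurring in the sentence."""
        return sum(freq.get(word, 0) for word in sentence.split())

    def f3(sentence, high_scores):
        """Thematic word feature: 0 if the sentence is in HighScores."""
        return 0 if sentence in high_scores else 1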
    
    
    There are 66 sentences in the training set and 37 sentences in the
    test set. Coding of the information:

    su  the sentence belongs to the set S
    n   the sentence does not belong to S
    H   the sentence belongs to HighScores
    L   the sentence does not belong to HighScores
    *   separates paragraphs
    The goal is now to estimate, for each test set sentence, the
    probability that it is a (good) summary sentence. For each test set
    sentence, we have to find the values of all the features F1, F2,
    and F3. That is, check from the data 1) whether the sentence is
    short, 2) whether it is paragraph-initial, paragraph-final, or
    paragraph-medial, and 3) whether it belongs to HighScores (of the
    test set).

    If now, e.g., a new test set sentence s' has the values (0,i,0), we
    can calculate the probability P(s' belongs to S | 0,i,0) that s'
    belongs to the set of summary sentences S using Bayes' rule:

    P(s belongs to S | F1,F2,F3) =
        P(F1 | s belongs to S) * P(F2 | s belongs to S)
        * P(F3 | s belongs to S) * P(s belongs to S)
        / ( P(F1) * P(F2) * P(F3) )

    The probabilities needed on the right-hand side are calculated from
    the training set. Example for F1: you have to calculate the
    probabilities of the possible values (i.e. 0 and 1) of the feature.
    Let's call these probabilities P1.

    P1(0) = 8/66  = 0.12   (8 of the 66 sentences have <= 7 words)
    P1(1) = 58/66 = 0.88   (58 of the 66 sentences have > 7 words)
    P1(0 | s belongs to S) = 2/16  = 0.12   (2 of the 16 summary sentences are short)
    P1(1 | s belongs to S) = 14/16 = 0.88   (14 of the 16 summary sentences are long)

    Given that the new sentence has the feature values (0,i,0), we
    replace P(F1 | s belongs to S) in the rule with the value of
    P1(0 | s belongs to S), and P(F1) with the value of P1(0).
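
    The following is a minimal sketch of the whole computation for a
    sentence with feature values (0,i,0). The P1 values and the prior
    P(s belongs to S) = 16/66 follow from the example above; the P2 and
    P3 tables are hypothetical placeholders that you must replace with
    values computed from the training data.

    # prior: 16 of the 66 training sentences are summary sentences
    p_s = 16 / 66

    p1   = {0: 8 / 66, 1: 58 / 66}             # P1(F1), from the text
    p1_s = {0: 2 / 16, 1: 14 / 16}             # P1(F1 | s in S), from the text

    p2   = {"i": 0.30, "f": 0.30, "m": 0.40}   # hypothetical P2(F2)
    p2_s = {"i": 0.50, "f": 0.30, "m": 0.20}   # hypothetical P2(F2 | s in S)

    p3   = {0: 0.20, 1: 0.80}                  # hypothetical P3(F3)
    p3_s = {0: 0.40, 1: 0.60}                  # hypothetical P3(F3 | s in S)

    def p_summary(v1, v2, v3):
        """P(s in S | F1,F2,F3) under the feature-independence assumption."""
        numerator = p1_s[v1] * p2_s[v2] * p3_s[v3] * p_s
        denominator = p1[v1] * p2[v2] * p3[v3]
        return numerator / denominator

    print(p_summary(0, "i", 0))   # probability for feature values (0,i,0)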

    Your task is now:



  3. Evaluate the results you got in exercise 2.
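
    One possible way to do this (a sketch under my own assumptions, not
    prescribed by the exercise) is to compare the sentences your
    classifier ranks highest against the actual summary sentences of
    the test set, e.g. with precision and recall:

    def precision_recall(selected, relevant):
        """Compare selected sentence ids against the true summary ids."""
        selected, relevant = set(selected), set(relevant)
        tp = len(selected & relevant)         # correctly chosen sentences
        precision = tp / len(selected) if selected else 0.0
        recall = tp / len(relevant) if relevant else 0.0
        return precision, recall

    # hypothetical example: sentences 1, 4, 9 selected; 1, 9, 12 are
    # the actual summary sentences
    print(precision_recall([1, 4, 9], [1, 9, 12]))   # (0.666..., 0.666...)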




Helena Ahonen-Myka