582410 Processing of large document collections, Exercise 3


  1. In this exercise we study the boosting algorithm AdaBoost. On slide 19 (on the weak learner) it is mentioned that "a term is chosen that minimizes epsilon(t) or 1 - epsilon(t)", where epsilon(t) is calculated as on slide 18. Thus a term with an error of, for instance, 0.8 can be chosen if no term among the candidates has an error smaller than 0.2. What happens later in the process (in the updating of the weights, and in the final classifier) if a chosen term has a large error? Why is it useful to choose this kind of term? (A sketch of the relevant update rule follows the reference below.)

    Schapire, Singer and Singhal: Boosting and Rocchio Applied to Text Filtering. Proceedings of SIGIR-98, the 21st ACM International Conference on Research and Development in Information Retrieval, 1998.
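
    As a hint, consider the sign of the term weight. Below is a minimal
    sketch of the standard AdaBoost update, assuming the usual choice
    alpha(t) = 0.5 * ln((1 - epsilon(t)) / epsilon(t)) (the textbook
    rule, not necessarily the paper's exact formulation); observe what
    happens when epsilon(t) > 0.5.

    import math

    def alpha(epsilon):
        """Weight of a weak hypothesis in standard AdaBoost.

        Negative when epsilon > 0.5, so the weak hypothesis is used
        'inverted': a term with error 0.8 carries as much information
        as a term with error 0.2.
        """
        return 0.5 * math.log((1.0 - epsilon) / epsilon)

    def update_weights(weights, correct, a):
        """One round of the example-weight update.

        weights: current weight of each training example
        correct: True where the weak hypothesis was right
        a:       alpha for this round
        Misclassified examples are multiplied by e^a: they gain weight
        when a > 0, but lose weight when a < 0 (an inverted term).
        """
        new = [w * math.exp(-a if c else a) for w, c in zip(weights, correct)]
        z = sum(new)                          # normalisation constant
        return [w / z for w in new]

    print(alpha(0.2))   #  0.69... positive vote
    print(alpha(0.8))   # -0.69... negative vote: the prediction is flipped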

  2. Exercises 2 and 3 are based on the article: Kupiec, Pedersen, Chen: A trainable document summarizer. Proceedings of the 18th ACM-SIGIR Conference, pp. 68-73, 1995. Also Chapter 5 (pp. 55-60) in Advances in Automatic Text Summarization, eds. Mani, Maybury. The MIT Press, 1999.

    This paper was also discussed in the lectures.

    Data for the following exercises


    In this exercise we study the method described in the paper in more detail. To simplify the approach, only three of the features are selected. They are described (and interpreted) below; a small code sketch of the features follows the definitions:

    
    F1: Sentence Length Cut-off Feature
    
    Given a threshold of 7 words, the feature is true for all sentences
    longer than the threshold, and false otherwise.
    
    That is:  
    F1(s) = 0 if sentence s has 7 or fewer words, and
    F1(s) = 1 if sentence s has more than 7 words
    
    F2: Paragraph Feature
    
    Sentences in a paragraph are distinguished according to whether
    they are paragraph-initial (i), paragraph-final (f), or
    paragraph-medial (m). 
    
    F2(s) = i if sentence s is the first sentence in a paragraph
    F2(s) = f if there are at least two sentences in a paragraph, and
                s is the last one
    F2(s) = m if there are at least three sentences in a paragraph, and
                s is neither the first nor the last sentence
    
    F3: Thematic Word Feature
    
    The most frequent content words are defined as thematic words.
    A set of thematic words has been selected for both the training set
    (13 words) and the test set (9 words). Each sentence is scored
    by frequency (= the sum of the frequencies of the thematic
    words contained in the sentence). The highest-scoring
    sentences are selected into a set HighScores.
    
    F3(s) = 0 if sentence s belongs to HighScores
    F3(s) = 1 if sentence s does not belong to HighScores
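
    Below is a minimal sketch, in Python, of how the three features
    could be computed. The sentence and paragraph representations and
    the thematic-word frequency table are illustrative assumptions,
    not part of the exercise data.

    def f1(sentence, threshold=7):
        """Sentence length cut-off: 1 if longer than the threshold."""
        return 1 if len(sentence.split()) > threshold else 0

    def f2(paragraph, index):
        """Position in paragraph: 'i' initial, 'f' final, 'm' medial."""
        if index == 0:
            return "i"
        if index == len(paragraph) - 1:
            return "f"
        return "m"

    def thematic_score(sentence, freq):
        """Sum of thematic-word frequencies occurring in the sentence."""
        return sum(freq.get(word, 0) for word in sentence.split())

    def f3(sentence, high_scores):
        """Thematic word feature: 0 if the sentence is in HighScores."""
        return 0 if sentence in high_scores else 1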
    
    
    There are 66 sentences in the training set and 37 sentences in the
    test set. Coding of the information:

    su  the sentence belongs to the set S
    n   the sentence does not belong to S
    H   the sentence belongs to HighScores
    L   the sentence does not belong to HighScores
    *   separates paragraphs
    The goal is now to estimate, for each test set sentence, the
    probability that it is a (good) summary sentence. For each test set
    sentence, we have to find the values of all the features F1, F2,
    and F3. That is, check from the data 1) whether the sentence is
    short, 2) whether it is paragraph-initial, paragraph-final, or
    paragraph-medial, and 3) whether it belongs to HighScores (of the
    test set).

    If now, e.g., a new test set sentence s' has the values (0,i,0), we
    can calculate the probability P(s' belongs to S | 0,i,0) that s'
    belongs to the set of summary sentences S using Bayes' rule:

    P(s belongs to S | F1,F2,F3) =
        P(F1 | s belongs to S) * P(F2 | s belongs to S)
        * P(F3 | s belongs to S) * P(s belongs to S)
        / ( P(F1) * P(F2) * P(F3) )

    The probabilities needed on the right-hand side are calculated from
    the training set. Example for F1: you have to calculate the
    probabilities of the possible values (i.e. 0 and 1) of the feature.
    Let's call these probabilities P1.

    P1(0) = 8/66  = 0.12   (8 of the 66 sentences have <= 7 words)
    P1(1) = 58/66 = 0.88   (58 of the 66 sentences have > 7 words)
    P1(0 | s belongs to S) = 2/16  = 0.12   (2 of the 16 summary sentences are short)
    P1(1 | s belongs to S) = 14/16 = 0.88   (14 of the 16 summary sentences are long)

    Given that the new sentence has the feature values (0,i,0), we
    replace P(F1 | s belongs to S) in the rule with the value of
    P1(0 | s belongs to S), and P(F1) with the value of P1(0).
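
    The following is a minimal sketch of the whole computation for a
    sentence with feature values (0,i,0). The P1 values and the prior
    P(s belongs to S) = 16/66 follow from the example above; the P2 and
    P3 tables are hypothetical placeholders that you must replace with
    values computed from the training data.

    # prior: 16 of the 66 training sentences are summary sentences
    p_s = 16 / 66

    p1   = {0: 8 / 66, 1: 58 / 66}             # P1(F1), from the text
    p1_s = {0: 2 / 16, 1: 14 / 16}             # P1(F1 | s in S), from the text

    p2   = {"i": 0.30, "f": 0.30, "m": 0.40}   # hypothetical P2(F2)
    p2_s = {"i": 0.50, "f": 0.30, "m": 0.20}   # hypothetical P2(F2 | s in S)

    p3   = {0: 0.20, 1: 0.80}                  # hypothetical P3(F3)
    p3_s = {0: 0.40, 1: 0.60}                  # hypothetical P3(F3 | s in S)

    def p_summary(v1, v2, v3):
        """P(s in S | F1,F2,F3) under the feature-independence assumption."""
        numerator = p1_s[v1] * p2_s[v2] * p3_s[v3] * p_s
        denominator = p1[v1] * p2[v2] * p3[v3]
        return numerator / denominator

    print(p_summary(0, "i", 0))   # probability for feature values (0,i,0)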

    Your task is now:



  3. Evaluate the results you got in exercise 2.
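
    One possible way to do this (a sketch under my own assumptions, not
    prescribed by the exercise) is to compare the sentences your
    classifier ranks highest against the actual summary sentences of
    the test set, e.g. with precision and recall:

    def precision_recall(selected, relevant):
        """Compare selected sentence ids against the true summary ids."""
        selected, relevant = set(selected), set(relevant)
        tp = len(selected & relevant)         # correctly chosen sentences
        precision = tp / len(selected) if selected else 0.0
        recall = tp / len(relevant) if relevant else 0.0
        return precision, recall

    # hypothetical example: sentences 1, 4, 9 selected; 1, 9, 12 are
    # the actual summary sentences
    print(precision_recall([1, 4, 9], [1, 9, 12]))   # (0.666..., 0.666...)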




Helena Ahonen-Myka