Information Retrieval Methods, Exercise 2, 29 Jan 2007



  1. Look at the text "Nike...".

    Give examples of terms that are, in your opinion, good or bad. Are there words that you would leave out of the document description completely? Give examples of words that have a high term frequency and of words that have an average or low frequency (within this document / in a document collection). What do you think the other documents in the collection are like? Would knowing the characteristics of the collection influence your decisions about how good a term is?
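
    Term frequencies within a single document can be counted with a few lines of code. The sketch below is only an illustration: the sample sentence and the small stop-word list are made up for this example (they are not the "Nike..." text), and the tokenizer simply keeps runs of letters after lowercasing.

```python
import re
from collections import Counter

# Hypothetical sample text; replace it with the document under study.
SAMPLE_TEXT = "The frog and the snake want the computer, and the user wants the frog."

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "and", "a", "an", "of", "to", "in"}


def term_frequencies(text, stop_words=frozenset()):
    """Count how often each word occurs, skipping the given stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in stop_words)


all_counts = term_frequencies(SAMPLE_TEXT)
content_counts = term_frequencies(SAMPLE_TEXT, STOP_WORDS)

print(all_counts.most_common(3))
print(content_counts.most_common(3))
```

    In the raw counts the function words come first, which is one reason such words are often left out of a document description.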

  2. The words in a document description can be modified in many ways. For example, the words can be stemmed: the words "accessed", "accessibility", "accessible" can be stemmed to "access". When English texts are stemmed, we often use the Porter algorithm. Read about the Porter algorithm and explain the main function of the algorithm. How does stemming affect precision and recall? What other modifications could we use? Study the document in the first task.
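
    The idea of conflating word forms can be tried out with a toy suffix stripper. Note that this is not the Porter algorithm: the suffix list, the minimum stem length and the function name toy_stem are assumptions chosen only so that the three example words above end up as "access".

```python
# Illustrative suffix list, checked in order (longer suffixes first).
SUFFIXES = ["ibility", "ible", "ing", "ed", "es", "s"]


def toy_stem(word, min_stem_len=3):
    """Strip one suffix from the word; a crude stand-in for a real stemmer."""
    word = word.lower()
    # Leave words such as "access" alone so that the final "s" is not cut off.
    if word.endswith("ss"):
        return word
    for suffix in SUFFIXES:
        stem_len = len(word) - len(suffix)
        if word.endswith(suffix) and stem_len >= min_stem_len:
            return word[:stem_len]
    return word


for w in ["accessed", "accessibility", "accessible", "access"]:
    print(w, "->", toy_stem(w))
```

    After stemming, a query for "access" matches documents containing any of these forms. A rule set this crude also merges unrelated words, which is the kind of effect to consider when discussing precision and recall.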

  3. Assume that we have a document collection described by the document-term matrix below (d1-d10 are documents, and the terms are frog, snake, computer, user, want and try). Each element of the matrix gives the term frequency (tf) of a term in a document.

         frog  snake  computer  user  want  try
    d1      1      3         0     1     4    1
    d2      0      0         4     1     5    1
    d3      2      1         0     0     0    4
    d4      0      0         0     1     7    0
    d5      0      1         0     0     1    1
    d6      0      0         0     1     2    3
    d7      1      0         0     1     0    2
    d8      0      0         0     0     4    0
    d9      0      0         0     0     3    1
    d10     0      0         1     1     1    0

    Compute the (tf x idf) weight for terms in documents d1-d5, in two ways:

    What kinds of differences can you find in the weights?
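
    A computation of this kind can be sketched as follows. The weighting used here is an assumption, not necessarily either of the two ways intended in this task: idf is taken as the natural logarithm of N/df (N = number of documents, df = number of documents containing the term), and the optional normalize_tf flag divides each tf by the largest tf in the same document. The function names are hypothetical.

```python
import math

TERMS = ["frog", "snake", "computer", "user", "want", "try"]

# Term frequencies from the matrix above; 0 means the term does not occur.
TF = {
    "d1":  [1, 3, 0, 1, 4, 1],
    "d2":  [0, 0, 4, 1, 5, 1],
    "d3":  [2, 1, 0, 0, 0, 4],
    "d4":  [0, 0, 0, 1, 7, 0],
    "d5":  [0, 1, 0, 0, 1, 1],
    "d6":  [0, 0, 0, 1, 2, 3],
    "d7":  [1, 0, 0, 1, 0, 2],
    "d8":  [0, 0, 0, 0, 4, 0],
    "d9":  [0, 0, 0, 0, 3, 1],
    "d10": [0, 0, 1, 1, 1, 0],
}


def document_frequencies(tf_matrix):
    """For each term, count the documents in which it occurs at least once."""
    return [sum(1 for row in tf_matrix.values() if row[j] > 0)
            for j in range(len(TERMS))]


def tf_idf(tf_matrix, normalize_tf=False):
    """Weight = tf * ln(N / df); optionally scale tf by the document's max tf."""
    n_docs = len(tf_matrix)
    df = document_frequencies(tf_matrix)
    idf = [math.log(n_docs / d) if d else 0.0 for d in df]
    weights = {}
    for doc, row in tf_matrix.items():
        scale = max(row) if normalize_tf and max(row) > 0 else 1
        weights[doc] = [(f / scale) * w for f, w in zip(row, idf)]
    return weights


raw = tf_idf(TF)
normalized = tf_idf(TF, normalize_tf=True)
for doc in ["d1", "d2", "d3", "d4", "d5"]:
    print(doc, [round(x, 2) for x in raw[doc]],
          [round(x, 2) for x in normalized[doc]])
```

    Comparing the two printed lists for each document shows how the choice of tf scaling changes the weights of frequent terms relative to rare ones.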



Helena Ahonen-Myka
Last modified: Mon Jan 22 12:55:41 EET 2007