Information Retrieval Methods, Exercise 3, 5 Feb 2007



  1. Assume that we have a document collection described by the document-term matrix below (d1-d10 are documents, and the terms are frog, snake, computer, user, want and try). The elements denote if a term occurs in a document (1 = occurs, " " = does not occur).

      frog snake computer user want try
    d1 1 1   1 1 1
    d2     1 1 1 1
    d3 1 1       1
    d4       1 1  
    d5   1     1 1
    d6       1 1 1
    d7 1   1 1   1
    d8         1  
    d9         1 1
    d10     1 1 1  

    a) What Boolean query would you constuct for the following retrieval task?

    1. "I am interested in frogs and snakes, but not in users."

    2. "I would like to see documents which tell about some computer user trying to draw animals, like frogs and snakes."

    What documents would be returned as results of these queries?

    b) Explain the execution of the following query when an inverted file for the collection is available.

    ((user and want) or try) and not computer

  2. Assume that we use the same document collection as in task 2.3:

      frog snake computer user want try
    d1 1 3   1 4 1
    d2     4 1 5 1
    d3 2 1       4
    d4       1 7  
    d5   1     1 1
    d6       1 2 3
    d7 1     1   2
    d8         4  
    d9         3 1
    d10     1 1 1  

    a) Compute the inner product and the cosine similarity value for the following document pairs:

    1. d2 and d6

    2. d2 and d8

    b) Assume that the user gives the query terms computer, want and try. Give cosine similarity value for the five documents that best match the query.

  3. Compute similarity values for the document pairs (d2, d6) and (d2, d8) using the following similarity measures: Overlap, Dice and Jaccard. You find the formulas in a separate file (pdf). 

    Do you notice any differences compared with the inner product and the cosine measure?

  4. Assume that the pairwise similarity values for documents A-G are:

          A     B    C    D    E    F    G

    A 1 0.6 0.9 0.6 0.8 0.8 0.7

    B 1 0.3 0.6 0.5 0.8 0.7

    C 1 0 0.7 0.3 0.5

    D 1 0.4 0.4 0.4

    E 1 0.7 0.3

    F 1 0.5

    G 1


    Explain the hierarchical agglomerative clustering method presented in the lecture when single link is used for comparing similarities between the clusters.

    If you did not attend the lecture, see Salton's book Automatic Text Processing and the example on pages 329-333.



Helena Ahonen-Myka
Greger Lindén