Assume that we have a document collection described by the document-term matrix below (d1-d10 are documents, and the terms are frog, snake, computer, user, want and try). The elements denote if a term occurs in a document (1 = occurs, " " = does not occur).
frog | snake | computer | user | want | try | |
---|---|---|---|---|---|---|
d1 | 1 | 1 | 1 | 1 | 1 | |
d2 | 1 | 1 | 1 | 1 | ||
d3 | 1 | 1 | 1 | |||
d4 | 1 | 1 | ||||
d5 | 1 | 1 | 1 | |||
d6 | 1 | 1 | 1 | |||
d7 | 1 | 1 | 1 | 1 | ||
d8 | 1 | |||||
d9 | 1 | 1 | ||||
d10 | 1 | 1 | 1 |
a) What Boolean query would you constuct for the following retrieval task?
"I am interested in frogs and snakes, but not in users."
"I would like to see documents which tell about some computer user trying to draw animals, like frogs and snakes."
What documents would be returned as results of these queries?
b) Explain the execution of the following query when an inverted file for the collection is available.
((user and want) or try) and not computer
Assume that we use the same document collection as in task 2.3:
frog | snake | computer | user | want | try | |
---|---|---|---|---|---|---|
d1 | 1 | 3 | 1 | 4 | 1 | |
d2 | 4 | 1 | 5 | 1 | ||
d3 | 2 | 1 | 4 | |||
d4 | 1 | 7 | ||||
d5 | 1 | 1 | 1 | |||
d6 | 1 | 2 | 3 | |||
d7 | 1 | 1 | 2 | |||
d8 | 4 | |||||
d9 | 3 | 1 | ||||
d10 | 1 | 1 | 1 |
a) Compute the inner product and the cosine similarity value for the following document pairs:
d2 and d6
d2 and d8
b) Assume that the user gives the query terms computer, want and try. Give cosine similarity value for the five documents that best match the query.
Compute similarity values for the document pairs (d2, d6) and (d2, d8) using the following similarity measures: Overlap, Dice and Jaccard. You find the formulas in a separate file (pdf).
Do you notice any differences compared with the inner product and the cosine measure?
Assume that the pairwise similarity values for documents A-G are:
A B C D E F G
A 1 0.6 0.9 0.6 0.8 0.8 0.7
B 1 0.3 0.6 0.5 0.8 0.7
C 1 0 0.7 0.3 0.5
D 1 0.4 0.4 0.4
E 1 0.7 0.3
F 1 0.5
G 1
Explain the hierarchical agglomerative clustering method presented
in the lecture when single link is used for comparing similarities
between the clusters.
If you did not attend the lecture, see Salton's book Automatic Text Processing and the example on pages 329-333.