581257-8 Information retrieval methods - Exercises 2/2001 (7.2.)


Tasks marked with (**) will be counted as double tasks.
1. (**) Let us define a small collection of 12 documents with the following 10 index terms: child, cloth, cotton, import, leather, material, production, summer, use, winter. The document vectors consist of weights of these index terms. The documents are here presented as the following lists:

D1 = (summer:0.8, material:0.3, production:0.9, cloth:1.0)
D2 = (child:0.5, material:0.3, winter:0.8, production:0.9, cloth:0.9)
D3 = (use:0.3, child:1.0, material:0.8, winter:0.6, leather:0.2)
D4 = (summer:0.2, child:0.2, cotton:0.6, production:0.8, cloth:0.2)
D5 = (use:0.6, child:1.0, cloth:1.0, leather:0.1)
D6 = (child:0.9, cloth:0.5)
D7 = (child:0.8, cloth:0.9, leather:0.1)
D8 = (import:0.4, cloth:0.4, leather:0.7)
D9 = (import:1.0, cloth:0.8)
D10 = (summer:0.5, child:0.8, cotton:0.4, import:0.8, cloth:0.7)
D11 = (child:0.7, import:0.9, cloth:0.2)
D12 = (cotton:1.0, production:0.8)

To see how different coefficients describe the similarity between the documents and a query, calculate some example values of the following similarity coefficients: inner product, cosine coefficient, Dice's coefficient and Jaccard's coefficient, for the query
Q = (1.0, 1.0, 0.7, 0, 0, 0.3, 0, 0, 0, 0),
i.e. (child:1.0, cloth:1.0, cotton:0.7, material:0.3).
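
As a starting point, the four coefficients can be sketched in Python for one document. This is only a sketch: it assumes the common weighted-vector forms (cosine = x.y / sqrt(x.x * y.y), Dice = 2 x.y / (x.x + y.y), Jaccard = x.y / (x.x + y.y - x.y)), which may differ in detail from the definitions used in the course text. The function names are illustrative, and the query is taken in its term-named form.

```python
from math import sqrt

# Sparse vectors: index term -> weight (terms not listed have weight 0).
Q = {"child": 1.0, "cloth": 1.0, "cotton": 0.7, "material": 0.3}
D1 = {"summer": 0.8, "material": 0.3, "production": 0.9, "cloth": 1.0}

def inner(x, y):
    # Sum of weight products over the terms the two vectors share.
    return sum(w * y[t] for t, w in x.items() if t in y)

def cosine(x, y):
    # Inner product normalized by the product of the vector lengths.
    return inner(x, y) / sqrt(inner(x, x) * inner(y, y))

def dice(x, y):
    # Assumed weighted form: 2 x.y / (x.x + y.y).
    return 2 * inner(x, y) / (inner(x, x) + inner(y, y))

def jaccard(x, y):
    # Assumed weighted form: x.y / (x.x + y.y - x.y).
    xy = inner(x, y)
    return xy / (inner(x, x) + inner(y, y) - xy)

for name, f in [("inner", inner), ("cosine", cosine),
                ("Dice", dice), ("Jaccard", jaccard)]:
    print(f"{name:8s} {f(D1, Q):.4f}")
```

The other documents can be added as further dictionaries in the same way; only the shared terms contribute to the inner product, so the coefficients differ only in how they normalize it.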

2. Explain the variation of the similarity coefficients mentioned in task 1 by calculating their values in some artificial situations:
- a document has e.g. t, t/2, t/5 or 4t/5 terms with weight 1 (generally t/k, 2t/k, ..., (k-1)t/k), the other terms being 0,
- a query has e.g. p 1's; p is either slightly or very much smaller than t (t is the total number of terms).

(As all coefficients are based on the inner product, the variations describe the effect of the normalization factor.)
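
For binary vectors the coefficients depend only on three counts, so the artificial situations can be tabulated with a short Python sketch. The names are hypothetical: m is the number of 1's in the document, p the number of 1's in the query and c the number of terms they share. The example assumes the largest possible overlap, c = min(m, p), and illustrative values t = 100 and k = 5; other overlaps can be tried by changing c.

```python
from math import sqrt

def binary_coefficients(m, p, c):
    # m: 1's in the document, p: 1's in the query, c: 1's they share.
    # For 0/1 weights the inner product is c and the squared lengths are m and p.
    return {
        "inner": c,
        "cosine": c / sqrt(m * p),
        "dice": 2 * c / (m + p),
        "jaccard": c / (m + p - c),
    }

t, k = 100, 5                      # illustrative values, not from the task
for p in (5, 80):                  # p much smaller / slightly smaller than t
    for j in range(1, k + 1):
        m = j * t // k             # document has t/k, 2t/k, ..., t 1's
        c = min(m, p)              # assumption: maximal overlap
        r = binary_coefficients(m, p, c)
        print(p, m, r["inner"], round(r["cosine"], 3),
              round(r["dice"], 3), round(r["jaccard"], 3))
```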

3. Consider the relevance feedback principle (Salton, p. 319-320).
a) What can we say about the length of the modified queries (the number of query terms)?
b) Is it possible to use this technique if the result of the initial query is empty? (Is it possible to prevent an empty result?)
c) Could you in some general way characterize the situations where the relevance feedback technique is usable or not usable?
d) Do the WWW search engines use any query modification techniques? (Techniques based on relevance feedback or on some other principle, or any other support for refining a query over several successive phases.)
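
To experiment with question a), one query modification step can be sketched in Python. The sketch assumes the widely used Rocchio-style update (the old query plus a weighted average of the relevant documents minus a weighted average of the non-relevant ones), which is not necessarily the exact formulation on the cited pages. The weights alpha, beta and gamma, the choice of D6 and D7 as relevant and D12 as non-relevant, and the dropping of non-positive weights are all illustrative assumptions.

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    # Start from the scaled original query.
    new_q = {t: alpha * w for t, w in query.items()}
    # Add the average relevant document, subtract the average non-relevant one.
    for docs, factor in ((relevant, beta), (nonrelevant, -gamma)):
        for d in docs:
            for t, w in d.items():
                new_q[t] = new_q.get(t, 0.0) + factor * w / len(docs)
    # Assumption: terms whose weight is not positive are dropped.
    return {t: w for t, w in new_q.items() if w > 0}

Q = {"child": 1.0, "cloth": 1.0, "cotton": 0.7, "material": 0.3}
D6 = {"child": 0.9, "cloth": 0.5}
D7 = {"child": 0.8, "cloth": 0.9, "leather": 0.1}
D12 = {"cotton": 1.0, "production": 0.8}

# Hypothetical user judgments: D6 and D7 relevant, D12 not relevant.
new_Q = rocchio(Q, [D6, D7], [D12])
print(len(Q), "terms before,", len(new_Q), "terms after")
print(sorted(new_Q))
```

With these inputs the modified query gains the term leather from D7, while production never enters it because its weight would be negative.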

4. (**) Explain the most important results in article [1] in a summary of 1-2 pages.

References:

1. Magennis, M. & van Rijsbergen, C.J., The potential and actual effectiveness of interactive query expansion. Proc. ACM SIGIR97 Conf., 1997. (http://dev.acm.org/pubs/contents/proceedings/ir/258525/p324-magennis/p324-magennis.pdf; the article is available at least from the workstations at the department, and a copy is also in the course folder (room A412).)
Hannu.Erkio@cs.Helsinki.FI