581257-8 Information retrieval methods - Exercises 1/2001 (31.1.)


Tasks marked with (**) will be counted as double tasks.
Tasks 1 and 2 are intended to give some practical background for information retrieval problems. The other tasks are quite introductory too.

1. Carry out some experiments with two library systems: the departmental library system (http://www.cs.helsinki.fi/cgi-bin/bibsearch?L=ENG) and the HELKA system of our university (http://wwls.lib.helsinki.fi/).

a) Which different types of searches can be performed with these systems? (A type of search is understood here as an expression of what the user needs, not as a technical question.) Also evaluate how well you succeed in your searches.
b) Evaluate the systems: what deficiencies or problems are there in using them? (Consider retrieval problems, not only the user interface.)

2. Explain the concepts recall, precision, and relevance in a situation where we have only 8 documents, e.g. textbooks of computer science. The titles of the books are sufficient here to represent the books. Give the documents and the queries as lists of terms, for example,
D1 = (sorting, searching, art, computer, programming)
= D. Knuth's work 'The Art of Computer Programming, Vol. 3: Sorting and Searching',
Q1 = (sort, program) = 'books on sorting'.
You do not need to consider any details in evaluating the query. We assume that, for example, document D1 belongs to the answer set of the query Q1 even if there are different spellings of sorting (sort vs. sorting).
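
As an illustration of how precision and recall are computed, here is a minimal Python sketch. Only D1 and Q1 come from the task text; documents D2 to D5, the relevance judgments, and the rule that a document is retrieved if any query term is a prefix of one of its terms are assumptions made for this example. It uses five documents for brevity rather than the eight asked for.

```python
# Toy collection: D1 is from the exercise, D2-D5 are hypothetical titles.
docs = {
    "D1": {"sorting", "searching", "art", "computer", "programming"},
    "D2": {"sorting", "networks"},
    "D3": {"programming", "languages"},
    "D4": {"database", "systems"},
    "D5": {"external", "merge", "algorithms"},
}

# Assumed relevance judgments for the need 'books on sorting'.
# D5 is judged relevant although its title never mentions sorting.
relevant = {"D1", "D2", "D5"}


def matches(query_term, doc_terms):
    """Crude matching (an assumption): the query term is a prefix of a document term."""
    return any(term.startswith(query_term) for term in doc_terms)


def retrieve(query, collection):
    """Return the documents that match at least one query term."""
    return {
        name
        for name, terms in collection.items()
        if any(matches(q, terms) for q in query)
    }


def precision_recall(retrieved, relevant_docs):
    """Precision = hits / retrieved, recall = hits / relevant."""
    hits = retrieved & relevant_docs
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant_docs) if relevant_docs else 0.0
    return precision, recall


q1 = ["sort", "program"]
answer_set = retrieve(q1, docs)
precision, recall = precision_recall(answer_set, relevant)
print(sorted(answer_set), precision, recall)
```

With these assumed judgments, D3 is retrieved but not relevant (which lowers precision), and D5 is relevant but not retrieved (which lowers recall).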

3. Relevance is an important concept when evaluating the result of a search. It is, however, not easy to determine exactly what it means.
a) Try to define or characterize the concept of relevance. What kinds of problems arise in deciding the relevance of a document?

b) (This part is based on a Finnish textbook and cannot really be translated into English. You might try to find some research articles on the WWW, e.g. with the search terms 'topical relevance', and read some of them.)

4. a) In 8.3.1, Salton gives some examples of using distance constraints as extensions of inverted index operations. Another way to extend these operations is to specify in which part of the document the term occurrences are (within the title, within the abstract, within the metadata, etc.). In what situations might these extensions be appropriate? Give examples.

b) Do these features exist in common search agents?
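
To make the two kinds of extension concrete, here is a minimal Python sketch of an inverted index that stores, for each term, the document, the field, and the word position of every occurrence. The two documents, the field names, and the function names (build_index, in_field, within) are hypothetical and chosen only for illustration; this is not Salton's implementation.

```python
from collections import defaultdict

# Hypothetical documents, each split into named fields.
docs = {
    "D1": {
        "title": "information retrieval systems",
        "abstract": "we study retrieval of text",
    },
    "D2": {
        "title": "database systems",
        "abstract": "information about storage and retrieval",
    },
}


def build_index(collection):
    """Map each term to a list of (doc_id, field, position) postings."""
    index = defaultdict(list)
    for doc_id, fields in collection.items():
        for field, text in fields.items():
            for pos, word in enumerate(text.lower().split()):
                index[word].append((doc_id, field, pos))
    return index


def in_field(index, term, field):
    """Documents in which the term occurs inside the given field."""
    return {doc for doc, f, _ in index.get(term, []) if f == field}


def within(index, term_a, term_b, k):
    """Documents in which both terms occur in the same field, at most k words apart."""
    result = set()
    for doc_a, field_a, pos_a in index.get(term_a, []):
        for doc_b, field_b, pos_b in index.get(term_b, []):
            if doc_a == doc_b and field_a == field_b and abs(pos_a - pos_b) <= k:
                result.add(doc_a)
    return result


index = build_index(docs)
print(in_field(index, "retrieval", "title"))
print(within(index, "information", "retrieval", 1))
print(within(index, "information", "retrieval", 4))
```

In this toy collection, restricting 'retrieval' to the title, or requiring 'information' and 'retrieval' to be adjacent, keeps only D1; loosening the distance to 4 words also admits D2.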

5. (**) In 8.5.2, Salton introduces quorum-level searches.

a) Suppose that the query terms are formed automatically from a natural language expression describing what the user wants to find. How would the following situations be handled with quorum-like queries:

- effects of poisons on plankton,
- professional fishing on Finnish lakes.

Are all the different parts of the queries meaningful?

b) Carry out experiments in which you submit the (quorum) sub-queries of the second example (fishing) to AltaVista (or some other common search agent). What are your conclusions about the quorum-level method, based on your experiments?
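
The sub-queries for such an experiment can be generated mechanically. The Python sketch below assumes one common reading of the quorum-level idea: with n query terms, the first level requires all n terms, the next level requires any n-1 of them, and so on down to any single term. The stop-word list and the function names are assumptions made for this example, and the AND/OR syntax would need to be adapted to the search agent actually used.

```python
from itertools import combinations

# Assumed stop-word list; a real system would use a much longer one.
STOP_WORDS = {"on", "of", "to", "the", "in"}


def query_terms(expression):
    """Form query terms automatically by lower-casing and dropping stop words."""
    return [w for w in expression.lower().split() if w not in STOP_WORDS]


def quorum_levels(terms):
    """Return one Boolean query per level, from 'all terms' down to 'any one term'."""
    levels = []
    for k in range(len(terms), 0, -1):
        # Every combination of k terms becomes one conjunction.
        conjunctions = [" AND ".join(combo) for combo in combinations(terms, k)]
        levels.append(" OR ".join(f"({c})" for c in conjunctions))
    return levels


terms = query_terms("professional fishing on Finnish lakes")
levels = quorum_levels(terms)
for level in levels:
    print(level)
```

For four terms this yields 1 + 4 + 6 + 4 = 15 conjunctions over four levels, which gives a sense of how quickly the number of sub-queries grows with the number of terms.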

c) Might it be possible to make some changes to the quorum-level method to improve it (without giving up the whole idea of the automatic construction of a Boolean-type query)? (The method as such is quite tedious, although systematic.)
Hannu.Erkio@cs.Helsinki.FI