581257-8 Information retrieval methods - Exercises 5/2001 (7.3.)


Tasks marked with (**) will be counted as double tasks.
1. Explain how pattern P will be found in string S by the KMP method (Knuth-Morris-Pratt) when

S = aababaabbabaabc, and P = abaabc.

How many comparisons are needed?

2. As exercise 1 but using BM (Boyer-Moore) method.

3. The command grep is a well-known technique in making searches in the file hierarchy (in Unix). (See e.g. Linux man pages and make experiments.) Evaluate the pros/cons of grep in general and compare it to 'standard' retrieval techniques (e.g. search engines in WWW) by considering the following points:

a) In what situations grep is (is not) good?

b) Effects of the volume and format (organisation) of the data being searched?

c) The volume of the search result?

d) Phrases in searching, multiple search terms?

e) Other factors?

4. This task concerns approximate string searches.

a) Make some experiments for approximate searching with a departmental library system (or some other suitable system). Is approximate searching viable in practice? Are there any problems?

b)In GLIMPSE system approximations are specified using the edit distance. The three error types may have different weights (specified as options -Dx, -Iy, -Sz). Are these features suitable for practical needs? Are there any other ways to specify the amount error in approximationsz?

5. (**) Explain the main principles of the LSI technique described in [1].

References:

1. Deerwester, S., Dumais, S. T., Landauer, T. K., Furnas, G. W. and Harshman, R. A. (1990): Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6), 391 - 407.
http://citeseer.nj.nec.com/deerwester90indexing.html

Hannu.Erkio@cs.Helsinki.FI