Information retrieval methods, spring 2001

581257-8 Information retrieval methods - Exercises 3/2001 (14.2.)

Tasks marked with (**) will be counted as double tasks.

1. (**) Apply the clustering algorithm given in Salton, Ch. 10.2.2, to a collection of documents A, B, C, D, E, F. The pairwise similarities are


        AD   0.9        EF   0.6        EB   0.3
        BD   0.8        AC   0.4        AF   0.2
        EC   0.8        AE   0.4        CD   0.2
        CF   0.7        BC   0.4        DE   0.1
        AB   0.7        BF   0.3        DF   0.1

In this task, use the 'single link' principle.
Check the result at a few different similarity levels.
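
The single link procedure can be sketched in a few lines of Python. This is only a minimal illustration, not the formulation given in Salton: the names SIM, sim, single_link, agglomerate and components are made up for this sketch, and ties between equally similar pairs are broken arbitrarily. The function components offers an independent check at a chosen similarity level, relying on the fact that single link clusters at a level are the connected components of the graph whose edges are the pairs with at least that similarity.

```python
from itertools import combinations

# Pairwise similarities copied from the table in task 1.
SIM = {
    "AD": 0.9, "EF": 0.6, "EB": 0.3,
    "BD": 0.8, "AC": 0.4, "AF": 0.2,
    "EC": 0.8, "AE": 0.4, "CD": 0.2,
    "CF": 0.7, "BC": 0.4, "DE": 0.1,
    "AB": 0.7, "BF": 0.3, "DF": 0.1,
}

def sim(x, y):
    """Similarity of two documents, whichever way the pair is written."""
    return SIM.get(x + y, SIM.get(y + x))

def single_link(c1, c2):
    """Single link: similarity of the most similar cross-cluster pair."""
    return max(sim(x, y) for x in c1 for y in c2)

def agglomerate(docs, linkage):
    """Repeatedly merge the two most similar clusters.

    Returns a list of (level, merged cluster) in merge order.
    """
    clusters = [frozenset(d) for d in docs]
    merges = []
    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=lambda p: linkage(*p))
        level = linkage(a, b)
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((level, "".join(sorted(a | b))))
    return merges

def components(docs, threshold):
    """Clusters at a given level: connected components of the graph
    whose edges are the pairs with similarity >= threshold."""
    groups = [{d} for d in docs]
    for pair, s in SIM.items():
        if s >= threshold:
            ga = next(g for g in groups if pair[0] in g)
            gb = next(g for g in groups if pair[1] in g)
            if ga is not gb:
                groups.remove(gb)
                ga |= gb
    return sorted("".join(sorted(g)) for g in groups)

for level, cluster in agglomerate("ABCDEF", single_link):
    print(f"{level:.1f}  {cluster}")
```

Calling components("ABCDEF", t) for a few values of t and comparing the groups with the merge list is one way to do the check asked for above.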

2. (**) Repeat the clustering of task 1, using
a) the 'complete link' and b) the 'group average' principle.
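
The two principles differ from single link only in how the similarity between two clusters is computed. The sketch below is a hedged illustration, not the textbook's formulation: it assumes that complete link uses the least similar cross-cluster pair and that group average is the mean over all pairs with one document in each cluster. The names SIM, sim, complete_link, group_average and agglomerate are invented for this example.

```python
from itertools import combinations

# Pairwise similarities copied from the table in task 1.
SIM = {
    "AD": 0.9, "EF": 0.6, "EB": 0.3,
    "BD": 0.8, "AC": 0.4, "AF": 0.2,
    "EC": 0.8, "AE": 0.4, "CD": 0.2,
    "CF": 0.7, "BC": 0.4, "DE": 0.1,
    "AB": 0.7, "BF": 0.3, "DF": 0.1,
}

def sim(x, y):
    """Similarity of two documents, whichever way the pair is written."""
    return SIM.get(x + y, SIM.get(y + x))

def complete_link(c1, c2):
    """Complete link: similarity of the least similar cross-cluster pair."""
    return min(sim(x, y) for x in c1 for y in c2)

def group_average(c1, c2):
    """Group average (assumed here): mean over all cross-cluster pairs."""
    return sum(sim(x, y) for x in c1 for y in c2) / (len(c1) * len(c2))

def agglomerate(docs, linkage):
    """Repeatedly merge the two most similar clusters.

    Returns a list of (level, merged cluster) in merge order.
    """
    clusters = [frozenset(d) for d in docs]
    merges = []
    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=lambda p: linkage(*p))
        level = linkage(a, b)
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((level, "".join(sorted(a | b))))
    return merges

for name, linkage in [("complete link", complete_link),
                      ("group average", group_average)]:
    print(name)
    for level, cluster in agglomerate("ABCDEF", linkage):
        print(f"  {level:.2f}  {cluster}")
```

Because only the linkage function changes, comparing the merge levels printed for the two principles shows directly how much later the complete link criterion joins the loosely connected groups.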

3. Read the main parts of the Scatter/Gather articles [1-3], and explain in particular how clustering is done in this method ([2]). Article [3] contains additional material, and article [1] (actually only some WWW pages) gives a general introduction to the method.

4. a) Compare algorithmic clustering (especially the Scatter/Gather technique) with ordinary classification (e.g. in Yahoo). What are the pros and cons of these techniques in retrieving information?

b) Documents could also be classified by some 'simple' attributes (author, publication time, language, etc.). Are these usable, and if so, how? Is it possible to combine their use with the other techniques?

References:

[1] Xerox PARC: About Scatter/Gather. http://www.parc.xerox.com/istl/projects/ia/sg-overview.html (introduction, examples, references)
[2] Cutting, D.R. et al., Scatter/Gather: a cluster-based approach to browsing large document collections. ACM SIGIR'92, 318-329. (the original presentation of the method; the principles are given here)
[3] Hearst, M. & Pedersen, J., Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results. ACM SIGIR'96, 76-84.

You can find articles [2] and [3] (as well as the proceedings of the other SIGIR conferences) in the ACM Digital Library. The digital library can be used from the workstations at the department. However, printing these articles seems to take fairly long; copies are also available in the course folder (room A412).
Hannu.Erkio@cs.Helsinki.FI