581257-8 Information retrieval methods - Exercises 3/2001 (14.2.)
Tasks marked with (**) will be counted as double tasks.
1. (**) Apply the clustering algorithm given in Salton, Ch. 10.2.2,
to a collection of documents A, B, C, D, E, F. The pairwise similarities are
AD 0.9 EF 0.6 EB 0.3
BD 0.8 AC 0.4 AF 0.2
EC 0.8 AE 0.4 CD 0.2
CF 0.7 BC 0.4 DE 0.1
AB 0.7 BF 0.3 DF 0.1
In this task, use the 'single link' principle.
Check the resulting clusters at a few similarity levels.
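
A hand-worked result for task 1 can be checked with a short script. The sketch below is not the algorithm from Salton's book; it relies on the fact that single-link clusters at a given similarity level are the connected components of the graph whose edges are the pairs with similarity at or above that level. The names SIMS and single_link_clusters are made up for this illustration.

```python
# Pairwise similarities from the task, keyed by the two document letters.
SIMS = {
    "AD": 0.9, "EF": 0.6, "EB": 0.3,
    "BD": 0.8, "AC": 0.4, "AF": 0.2,
    "EC": 0.8, "AE": 0.4, "CD": 0.2,
    "CF": 0.7, "BC": 0.4, "DE": 0.1,
    "AB": 0.7, "BF": 0.3, "DF": 0.1,
}


def single_link_clusters(sims, threshold):
    """Return the single-link clusters at the given similarity level.

    Assumption of this sketch: two documents share a cluster when a chain
    of pairs, each with similarity >= threshold, connects them.
    """
    docs = sorted({d for pair in sims for d in pair})
    parent = {d: d for d in docs}  # union-find forest

    def find(x):
        # Follow parent links to the representative, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for pair, s in sims.items():
        if s >= threshold:
            a, b = pair
            parent[find(a)] = find(b)  # merge the two components

    groups = {}
    for d in docs:
        groups.setdefault(find(d), []).append(d)
    return sorted("".join(g) for g in groups.values())


if __name__ == "__main__":
    for level in (0.9, 0.8, 0.7, 0.4):
        print(level, single_link_clusters(SIMS, level))
```

Running the script prints the clusters at each of the listed levels, which can be compared with the dendrogram drawn by hand.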
2. (**) Repeat the clustering of task 1, using
a) the 'complete link' principle, and
b) the 'group average' principle.
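
The two principles in task 2 differ only in how the similarity between two clusters is defined. The sketch below is a generic agglomerative procedure, not necessarily the formulation in Salton's book. It assumes that 'complete link' means the smallest pairwise similarity between the two clusters and that 'group average' means the mean of all pairwise similarities between them (other definitions of group average exist). All function names are made up for this illustration.

```python
from itertools import combinations

# Pairwise similarities from task 1, keyed by the two document letters.
SIMS = {
    "AD": 0.9, "EF": 0.6, "EB": 0.3,
    "BD": 0.8, "AC": 0.4, "AF": 0.2,
    "EC": 0.8, "AE": 0.4, "CD": 0.2,
    "CF": 0.7, "BC": 0.4, "DE": 0.1,
    "AB": 0.7, "BF": 0.3, "DF": 0.1,
}


def pair_sims(sims, c1, c2):
    """All similarities between a document of c1 and a document of c2."""
    return [sims.get(a + b, sims.get(b + a)) for a in c1 for b in c2]


def complete_link(sims, c1, c2):
    # Assumed definition: the least similar pair decides.
    return min(pair_sims(sims, c1, c2))


def group_average(sims, c1, c2):
    # Assumed definition: mean over all between-cluster pairs.
    values = pair_sims(sims, c1, c2)
    return sum(values) / len(values)


def agglomerate(sims, linkage):
    """Merge the two most similar clusters until one cluster remains.

    Clusters are strings of document letters. Returns a list of
    (merged_cluster, similarity_level) tuples in merge order.
    """
    clusters = sorted({d for pair in sims for d in pair})
    merges = []
    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2),
                     key=lambda p: linkage(sims, p[0], p[1]))
        level = linkage(sims, c1, c2)
        clusters = [c for c in clusters if c not in (c1, c2)]
        merged = "".join(sorted(c1 + c2))
        clusters.append(merged)
        merges.append((merged, level))
    return merges


if __name__ == "__main__":
    for name, linkage in (("complete link", complete_link),
                          ("group average", group_average)):
        print(name)
        for merged, level in agglomerate(SIMS, linkage):
            print(f"  {merged} at {level:.2f}")
```

If two candidate merges tie, max() simply picks the first one it encounters, so a collection with ties may need a more careful rule.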
3. Read the main parts of the Scatter/Gather articles [1-3], and
explain in particular how clustering is done in this method ([2]).
Article [3] contains additional material, and [1] (actually just a few
WWW pages) gives a general introduction to the method.
4. a) Compare algorithmic clustering (especially the Scatter/Gather
technique) with ordinary classification (e.g. in Yahoo). What are the
pros and cons of these techniques in retrieving information?
b) Documents could also be classified by some 'simple' attributes
(author, publication time, language, etc.). Are these usable, and if so, how?
Is it possible to combine their use with the other techniques?
References:
[1] Xerox PARC: About Scatter/Gather.
(http://www.parc.xerox.com/istl/projects/ia/sg-overview.html)
(introduction, examples, references)
[2] Cutting, D.R. et al., Scatter/Gather: a cluster-based approach to
browsing large document collections. ACM SIGIR'92, 318-329.
(the original presentation of the method; the principles are given here)
[3] Hearst, M. & Pedersen, J., Reexamining the Cluster Hypothesis:
Scatter/Gather on Retrieval Results. ACM SIGIR'96, 76-84.
You can find articles [2,3] (as well as the proceedings of the other SIGIR
conferences) in the ACM Digital Library. The digital library can be used from
the workstations at the department. However, printing these articles seems to
take fairly long; copies are also available in the course folder (room A412).
Hannu.Erkio@cs.Helsinki.FI