581257-8 Information retrieval methods - Exercises 6/2001 (14.3.)


Tasks marked with (**) will be counted as double tasks.
1. A document is divided into chapters and sections (5 chapters, each with 5 sections), and is implemented as a hypertext with 1 + 5 + 25 = 31 nodes. The essential contents of the document is in section nodes; the upper nodes contain only links to the next level.

The internal weights of section nodes A(i,j) (for some combination of query terms) are 10, 30, 30, 50, 50; 20, 20, 20, 20, 20; 10, 80, 10, 10, 10; 100, 0, 0, 0, 0; 10, 0, 0, 0, 0) in the natural order of the nodes (A(1,1), A(1,2), ..., A(5,5)).

a) Calculate the total weights of the nodes in situations where the external weight of a node is 1) 0.5 * (the sum of the total weights of its descendants), and 2) the average of the total weights of its descendants. We assume that there are no other links than those which maintain the hierarchical structure.

Evaluate the results: Are there some problems in these ways of calculation? Are there any better policies that could be used?

2. a) Suppose that there are also other links in the hypertext of task 1 than those specifying the structure: A(1,i) --> A(3,2) (for i = 1,...,5); A(2,i) --> A(4,i) (for i = 1,...,5); A(3,i) --> A(3,j) (for i,j = 1,...,5, i<>j). The additional links are thus reference links. The relationships induced by these links are also regarded here as descendant relationships. What is the effect of these additional links to the calculations of task 1?

b) Evaluate (generally, not necessarily with the example hypertext) following factors that might be used in calculating the total weight:

- the distance between the nodes (in links); is it possible to take the distance into account?
- the neighbourhood of the node (not depending on the direction of the link),
- normalization of the weights by the node size (length of the text), or by the total size of the nodes in the neighbourhood.

3. What different link types can be found on the Web pages? Use some examples, e.g. http://www.cs.helsinki.fi/opiskelu/index.en.html (with some neighbourhood), or the W3C HTML specification (part 'Links').

Consider the purpose of the link and not technical features as a criterion of the different type of a link.

4. (**) Explain the main principles of the studies described in [1].

References:

1. Bharat, K. & Henzinger, M.R., Improved algorithms for topic distillation in a hyperlinked environment. ACM SIGIR'98 Conference, 104-111. http://www.acm.org/pubs/contents/proceedings/ir/290941/

Hannu.Erkio@cs.Helsinki.FI