Information retrieval methods, spring 2001

581257-8 Information retrieval methods - Exercises 4/2001 (28.2.)

Tasks marked with (**) will be counted as double tasks.

1. Explain term weighting based on inverse document frequency (tf-idf) by calculating some example values for weights

a) in an artificial situation with N = 10000 documents, where a term occurs in 1, 2, 10, 100, 1000, or 10000 documents and 1, 2, ... times in a specific document,

b) when some real documents are used.

In b), you can use e.g. abstracts of scientific articles as substitutes for entire documents. In that case the term frequencies can even be determined manually. Document frequencies should be estimated by an 'educated guess'. (It is also fairly easy to count the term frequencies automatically, at least for some selected terms, i.e. with a few lines of code.)
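Counting term frequencies automatically could look roughly like the following Python sketch. The function name term_frequencies and the sample abstract are made up for illustration, and the tokenization (lowercased runs of the letters a-z) is a simplifying assumption, not a prescribed method.

```python
import re
from collections import Counter


def term_frequencies(text):
    """Count how often each lowercased word occurs in the text."""
    # Assumption: a "term" is any run of the letters a-z after lowercasing.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


# Hypothetical abstract, invented for this example.
abstract = (
    "Signature files are compared with inverted files. "
    "Inverted files need more space, but signature files are slower."
)

tf = term_frequencies(abstract)
for term in ("files", "signature", "inverted"):
    print(term, tf[term])
```

For real abstracts one would probably also want stemming or a stoplist, which this sketch leaves out.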

(The purpose of this task is to give a practical feeling for the frequency levels found in various situations. The calculations as such are not an end in themselves.)
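The values for 1a can be tabulated with a short Python sketch. It assumes one common variant of the weight, w = tf * log10(N / df); the task itself does not fix the formula or the logarithm base, and the term frequencies 1, 2, 3, 5, 10 are sample values chosen here.

```python
import math

N = 10_000  # collection size given in task 1a


def tf_idf(tf, df, n_docs=N):
    """Weight tf * log10(n_docs / df); an assumed variant, not the only one."""
    return tf * math.log10(n_docs / df)


doc_freqs = [1, 2, 10, 100, 1000, 10000]
term_freqs = [1, 2, 3, 5, 10]  # sample values; the task leaves these open

# Print one row per document frequency, one column per term frequency.
print("df".rjust(6) + "".join(f"tf={tf}".rjust(8) for tf in term_freqs))
for df in doc_freqs:
    row = "".join(f"{tf_idf(tf, df):8.2f}" for tf in term_freqs)
    print(f"{df:6d}" + row)
```

Under this variant the idf factor falls from 4 for a term found in a single document to 0 for a term found in all 10000 documents, whatever its term frequency.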

2. The tf-idf measure can be seen as a fairly coarse measure of the significance of a term. Many features are omitted: the number of occurrences is ignored when the document frequency is counted, and the position of an occurrence is ignored when the tf component is determined (is it central, as in the title, a subtitle, the abstract, or the beginning of the document, or less central, as in a reference, a footnote, or the like?).

Evaluate the tf-idf weighting scheme in this respect. Is it possible to refine the weighting by considering the above features? How? Are these further refinements also practical? In which kinds of documents? Make the situation concrete by analyzing the contents of at least two documents.
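One conceivable refinement of the tf component is to weight each occurrence by the part of the document it appears in. The Python sketch below only illustrates the idea: the zone names, the weights in ZONE_WEIGHTS, and the sample document are all invented for this example and do not come from any standard scheme.

```python
# Hypothetical zone weights; the values are illustrative only.
ZONE_WEIGHTS = {"title": 3.0, "abstract": 2.0, "body": 1.0, "references": 0.5}


def weighted_tf(term, zones, weights=ZONE_WEIGHTS):
    """Sum the occurrences of term over all zones, scaled by zone weight."""
    total = 0.0
    for zone, text in zones.items():
        count = text.lower().split().count(term)
        # Zones without a listed weight count with weight 1.0.
        total += weights.get(zone, 1.0) * count
    return total


# Made-up document split into zones.
doc = {
    "title": "signature files for text retrieval",
    "body": "we compare signature files with inverted files",
    "references": "a survey of signature files",
}

print(weighted_tf("signature", doc))
print(weighted_tf("files", doc))
```

Whether such a scheme is practical depends on whether the zones can be recognized reliably in the documents at hand, which is part of what the task asks you to evaluate.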

3. Give two examples of queries (term sets) where you consider it reasonable to use thesaurus connections such as BT (broader term) or NT (narrower term) in evaluating the query result. (We want to find situations where the query term probably does not occur in all relevant documents, but terms having a BT/NT connection with it do.) Are there any problems with this kind of use of a thesaurus?
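Mechanically, using BT/NT connections amounts to expanding the query term set before matching. A minimal Python sketch, assuming a toy thesaurus (the THESAURUS table and the function expand_query are hypothetical and cover only direct, one-step connections):

```python
# Toy thesaurus, invented for illustration:
# term -> (list of broader terms, list of narrower terms).
THESAURUS = {
    "dog": (["mammal"], ["poodle", "terrier"]),
    "mammal": (["animal"], ["dog", "cat"]),
}


def expand_query(terms, use_bt=True, use_nt=True):
    """Return the query terms plus their direct BT/NT neighbours."""
    expanded = set(terms)
    for term in terms:
        broader, narrower = THESAURUS.get(term, ([], []))
        if use_bt:
            expanded.update(broader)
        if use_nt:
            expanded.update(narrower)
    return expanded


print(sorted(expand_query(["dog"])))
print(sorted(expand_query(["dog"], use_bt=False)))
```

Comparing the two calls shows how each kind of connection enlarges the term set, and therefore the set of documents that can match.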

4. Try to find at least two examples of stoplists (from the WWW). At least one of them should concern a specific application area.

5. Evaluate the effects of term frequencies on inverted indexes, suffix structures (tree/array), and signature files. (The general wisdom of avoiding overly general terms as index terms can be set aside here.)
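For the inverted index part, a small experiment can make the effect visible. The Python sketch below builds a plain document-level inverted index over a made-up three-document collection; the function build_inverted_index and the sample texts are assumptions for illustration, and suffix structures and signature files are not covered.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Tiny made-up collection.
docs = [
    "the index stores the terms",
    "the signature file stores bit patterns",
    "suffix arrays index every position of the text",
]

index = build_inverted_index(docs)
for term in ("the", "index", "signature"):
    print(term, index[term])
```

In an index of this kind the length of a term's posting list equals its document frequency, so very common terms account for the longest lists.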

Hannu.Erkio@cs.Helsinki.FI