University of Helsinki Department of Computer Science
 


581257 Information Retrieval Methods (6 ECTS, 3 cu) Spring 2007

Project work

The project work is done in groups of 4-5 students. The groups are formed during the first exercise session on 22 January 2007. If you cannot make it to that session, please contact Helena or Niina. Each group should give at least one contact person and one email address to Helena and Niina.

Groups can be smaller, but you benefit most (e.g., in relevance evaluation) from having several people in a group.

Each project group will give an informal presentation during the last exercise session on Monday February 19th (starting at 12.15 in C221). The length of the presentation should be about 15-20 minutes. The project work does not have to be completed at the time of the presentation; the aim is to give an overview of the progress so far (what is your topic, what kind of queries and results you have studied, etc.).

The project report must be ready by midnight on Friday, 9 March 2007, at the latest. For each day the report is late, we will deduct two (2) points. The maximum for the work is 15 points.

More instructions on how to use Lucene. Old instructions by Antoine Doucet.

  • Note 1: The current version seems to be 2.0.0, but you can also use version 1.4.3, as mentioned in Antoine's instructions.
  • Note 2: You can download the retrieval engine from http://mirrors.isc.org/pub/apache/lucene/java/ (the link in the instructions seems to be outdated). The binaries for Linux should be fine (lucene-2.0.0.tar.gz).

Task

For each group

  • The project work (documents, reports, etc.) should be done in English. (This is for the convenience of everyone participating in the course. Try to write correct English. The language does not, however, affect the points you get for the work unless the text is impossible to understand.)
  • The group chooses some particular topic for the documents they will collect (see next point).
  • Each member collects 10 documents on the topic, e.g., from the Internet.
  • The documents are indexed with the Lucene retrieval engine. You will then be able to use the query interface of Lucene.
  • Each member should give two retrieval tasks.
  • For each retrieval task, you should form two queries:
    • A Boolean expression. (The result is the set of documents that satisfy the expression. The documents are not ordered.)
    • A list of terms, i.e., a so-called vector model query. (The result is a list of documents ranked according to their relevance.)
  • The group evaluates the relevance of all the documents with respect to each retrieval task. For each task there should be three independent evaluations.
  • Execute the queries with Lucene.
  • Compute recall and precision values for all results. When the result is an ordered list, please draw recall-precision curves (use the average results over all tasks).
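
The computations in the last step could be sketched roughly as follows. This is only an illustration in Python, not part of the course material: the function names and the document identifiers are hypothetical, and the use of interpolated precision at the 11 standard recall levels (0.0, 0.1, ..., 1.0) is an assumption about how to average curves over tasks, since the instructions do not prescribe a particular method.

```python
from typing import Iterable, List, Sequence, Tuple


def precision_recall(retrieved: Iterable[str], relevant: Iterable[str]) -> Tuple[float, float]:
    """Set-based precision and recall, suitable for an unordered Boolean result."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


def interpolated_precision(ranking: Sequence[str], relevant: Iterable[str]) -> List[float]:
    """Interpolated precision of a ranked list at recall levels 0.0, 0.1, ..., 1.0."""
    relevant_set = set(relevant)
    points = []  # (recall, precision) measured at each relevant document in the ranking
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant_set:
            hits += 1
            points.append((hits / len(relevant_set), hits / rank))
    levels = [i / 10 for i in range(11)]
    # Interpolated precision at level r: the highest precision at any recall >= r.
    # The small tolerance guards against floating-point rounding at the levels.
    return [max((p for r, p in points if r >= level - 1e-9), default=0.0)
            for level in levels]


def average_curve(curves: Sequence[Sequence[float]]) -> List[float]:
    """Average several 11-point curves level by level (one curve per task)."""
    return [sum(column) / len(curves) for column in zip(*curves)]


if __name__ == "__main__":
    relevant = ["d1", "d3", "d5"]  # hypothetical relevance judgments for one task
    print(precision_recall(["d1", "d2", "d3", "d4"], relevant))
    print(interpolated_precision(["d1", "d2", "d3", "d4", "d5"], relevant))
```

The averaged 11-point curve can then be written to a text file, one "recall precision" pair per line, and plotted with Gnuplot.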

Report

Each group writes a report in HTML. The report does not have to be very long, but all of the following parts should be included. Write in full sentences.

  • A description of the document collection: the number of documents, language, topic, and the number of words in total and per document (on average).
  • The retrieval tasks and queries.
  • Experiences from relevance evaluation. Did the evaluators agree? Why not?
  • Presentation of the retrieval results (e.g. a recall-precision graph for average results from the vector model queries, average precision and recall for Boolean queries). You can use Gnuplot for drawing the curves.
  • Some explanation of the differences in the usability of the results for different query types (Boolean vs. vector model queries).
  • As an additional task you can describe and try queries that would have resulted in optimal recall and precision. You can do this for all tasks or for some particular task.
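
For the part on relevance evaluation, one possible way to quantify how well the three evaluators agreed, and to merge their judgments into a single relevance set for computing recall and precision, is sketched below. This is an assumption, not a method required by the course: the function names, the majority-vote rule, and the pairwise agreement measure are all chosen here for illustration only.

```python
from itertools import combinations
from typing import Dict, List


def majority_vote(judgments: Dict[str, List[bool]]) -> Dict[str, bool]:
    """Mark a document relevant if more than half of its evaluators said so."""
    return {doc: sum(votes) * 2 > len(votes) for doc, votes in judgments.items()}


def pairwise_agreement(judgments: Dict[str, List[bool]]) -> float:
    """Fraction of evaluator pairs, counted over all documents, that gave the same judgment."""
    agree = 0
    total = 0
    for votes in judgments.values():
        for a, b in combinations(votes, 2):
            agree += int(a == b)
            total += 1
    return agree / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical judgments of three evaluators for one retrieval task.
    judgments = {
        "d1": [True, True, True],
        "d2": [True, False, True],
        "d3": [False, False, False],
        "d4": [False, True, False],
    }
    print(majority_vote(judgments))
    print(pairwise_agreement(judgments))
```

Documents on which the evaluators disagreed are good candidates for discussion in the report.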

Lucene

You can download the retrieval engine from http://mirrors.isc.org/pub/apache/lucene/java/. The binaries for Linux should be fine (lucene-2.0.0.tar.gz).

More information about Lucene is available on the Apache Lucene page.


Helena Ahonen-Myka, Greger Lindén