IRM, Spring 2007, Exercise 2

Information Retrieval Methods, Exercise 2, 29 Jan 2007

Look at the text "Nike...".
Give examples of terms that are, in your opinion, good or bad for describing the document. Are there words that you would leave out of the document description completely? Give examples of words that have a high term frequency and of words that have an average or low frequency (within this document, and in a document collection). What do you think the other documents in the collection are like? Would knowing the characteristics of the collection influence your decisions on the goodness of terms?
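As a starting point for the frequency questions, term frequencies within a single document can be counted mechanically. The following is a minimal sketch only: the sample text is invented for illustration (it is not the "Nike..." document), and the stop word list is an assumed, deliberately tiny one.

```python
import re
from collections import Counter

# Hypothetical sample text, invented for illustration only.
text = (
    "The shoe maker sells shoes. The shoes are sold in many shops, "
    "and the maker grows."
)

# Lowercase the text and split it into alphabetic tokens.
tokens = re.findall(r"[a-z]+", text.lower())

# Term frequency: number of occurrences of each token in this one document.
tf = Counter(tokens)

# Assumed stop word list: words one might leave out of the description.
stop_words = {"the", "are", "in", "and", "many"}
content_tf = {term: count for term, count in tf.items() if term not in stop_words}

# Print the remaining terms, most frequent first.
for term, count in sorted(content_tf.items(), key=lambda item: -item[1]):
    print(term, count)
```

Whether a frequent word is a good term still depends on the collection: a word that is frequent in every document separates documents poorly.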
The words in a document description can be modified in many ways. For example, the words can be stemmed: the words "accessed", "accessibility" and "accessible" can all be stemmed to "access". English texts are often stemmed with the Porter algorithm. Read about the Porter algorithm and explain its main function. How does stemming affect precision and recall? What other modifications could we use? Study these questions using the document from the first task.
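The conflation in the "access" example can be illustrated with a toy suffix stripper. Note that this is not the Porter algorithm, which applies several ordered rule phases with conditions on the remaining stem; the rule list and the name toy_stem below are invented for illustration.

```python
# Toy suffix stripper: NOT the Porter algorithm, only an illustration
# of how different word forms are conflated into one stem.
SUFFIXES = ("ibility", "ible", "ed")  # hypothetical rules, longest first


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["accessed", "accessibility", "accessible", "access"]
stems = [toy_stem(w) for w in words]
print(stems)
```

After stemming, a query for any one of these forms matches documents containing any of the others, which is the effect to consider when answering the precision and recall question.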

Assume that we have a document collection described by the document-term matrix below (d1-d10 are documents, and the terms are frog, snake, computer, user, want and try). Each element of the matrix gives the term frequency (tf) of a term in a document.

      frog  snake  computer  user  want  try
d1       1      3         0     1     4    1
d2       0      0         4     1     5    1
d3       2      1         0     0     0    4
d4       0      0         0     1     7    0
d5       0      1         0     0     1    1
d6       0      0         0     1     2    3
d7       1      0         0     1     0    2
d8       0      0         0     0     4    0
d9       0      0         0     0     3    1
d10      0      0         1     1     1    0

Compute the (tf x idf) weight for terms in documents d1-d5, in two ways:

tf = number of occurrences of a term in a document
tf = number of occurrences of a term in a document divided by the number of occurrences of the term that occurs most frequently in the same document. For example, in document d1, the term "want" occurs most frequently (4 times).

What kinds of differences can you find in the weights?
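One way to check hand-computed weights is a short script such as the sketch below. The sheet does not fix an idf formula, so idf = log10(N / df) is an assumption here (N is the number of documents, df the number of documents containing the term); other variants, such as a natural logarithm, change the values but not the method. The names TERMS, TF and weights are chosen for illustration.

```python
import math

TERMS = ["frog", "snake", "computer", "user", "want", "try"]

# Term frequencies copied from the document-term matrix (0 = term absent).
TF = {
    "d1":  [1, 3, 0, 1, 4, 1],
    "d2":  [0, 0, 4, 1, 5, 1],
    "d3":  [2, 1, 0, 0, 0, 4],
    "d4":  [0, 0, 0, 1, 7, 0],
    "d5":  [0, 1, 0, 0, 1, 1],
    "d6":  [0, 0, 0, 1, 2, 3],
    "d7":  [1, 0, 0, 1, 0, 2],
    "d8":  [0, 0, 0, 0, 4, 0],
    "d9":  [0, 0, 0, 0, 3, 1],
    "d10": [0, 0, 1, 1, 1, 0],
}
N = len(TF)

# Document frequency: number of documents in which each term occurs.
df = [sum(1 for row in TF.values() if row[j] > 0) for j in range(len(TERMS))]

# Assumed idf variant: log10(N / df).
idf = [math.log10(N / d) for d in df]


def weights(doc: str, normalize: bool) -> dict:
    """Return tf x idf per term; optionally divide tf by the document's largest tf."""
    row = TF[doc]
    max_tf = max(row)
    return {
        term: (tf / max_tf if normalize else tf) * idf[j]
        for j, (term, tf) in enumerate(zip(TERMS, row))
    }


for doc in ["d1", "d2", "d3", "d4", "d5"]:
    raw = weights(doc, normalize=False)
    norm = weights(doc, normalize=True)
    print(doc, {t: (round(raw[t], 3), round(norm[t], 3)) for t in TERMS})
```

Each printed pair holds the raw-tf weight first and the normalized-tf weight second, so the two definitions can be compared term by term.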

Helena Ahonen-Myka

Last modified: Mon Jan 22 12:55:41 EET 2007