Give examples of terms that are, in your opinion, good or bad. Are there words in the document that you would leave out of the document description completely? Give examples of words that have a high term frequency and of words that have an average or low frequency (within this document / in a document collection). What do you think the other documents in the collection are like? Would knowledge of the characteristics of the collection influence your decisions about how good a term is?
The words in a document description can be modified in
many ways. For example, the words can be stemmed: the words
"accessed", "accessibility", "accessible" can be stemmed to "access".
When English texts are stemmed, we often use the Porter algorithm. Read
about the Porter algorithm and explain the main function of the
algorithm. How does stemming affect precision and recall? What other
modifications could we use? Study the document in the first task.
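To make the idea of stemming concrete, here is a minimal sketch of suffix stripping. It is deliberately NOT the Porter algorithm: the suffix list, the minimum stem length of 3 and the function name `toy_stem` are assumptions chosen only so that the three example words above collapse to the same stem.

```python
# Toy suffix stripper for illustration only; this is not the Porter algorithm.
# The suffix list and the minimum stem length are assumptions.
SUFFIXES = ("ibility", "ible", "ing", "ed", "s")  # ordered longest first


def toy_stem(word):
    """Strip the first matching suffix if at least 3 characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            # Do not turn "access" into "acces": keep a final double s.
            if suffix == "s" and word.endswith("ss"):
                continue
            return word[: -len(suffix)]
    return word


for w in ("accessed", "accessibility", "accessible", "access"):
    print(w, "->", toy_stem(w))
```

A real stemmer applies many more rules and conditions than this; the sketch only shows the effect that several surface forms end up sharing one index term.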
Assume that we have a document collection described by the
document-term matrix below (d1-d10 are the documents, and the terms are
frog, snake, computer, user, want and try). Each element of the matrix
is the term frequency (tf) of the term in the document.
frog | snake | computer | user | want | try | |
---|---|---|---|---|---|---|
d1 | 1 | 3 | 1 | 4 | 1 | |
d2 | 4 | 1 | 5 | 1 | ||
d3 | 2 | 1 | 4 | |||
d4 | 1 | 7 | ||||
d5 | 1 | 1 | 1 | |||
d6 | 1 | 2 | 3 | |||
d7 | 1 | 1 | 2 | |||
d8 | 4 | |||||
d9 | 3 | 1 | ||||
d10 | 1 | 1 | 1 |
Compute the (tf x idf) weights of the terms in documents d1-d5 in two ways:
tf = number of occurrences of a term in a document
tf = number of occurrences of a term in a document divided by the number of occurrences of the term that occurs most frequently in the same document. For example, in document d1, the term "want" occurs most frequently (4 times).
What kinds of differences can you find in the weights?
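The two tf variants can be sketched in code as follows. This is a sketch under stated assumptions, not a solution to the task: the collection `docs` is a small hypothetical example (not the matrix above), and since the exercise does not define idf, the common form idf = ln(N / df) is assumed, where N is the number of documents and df the number of documents containing the term.

```python
import math

# Hypothetical toy collection (NOT the matrix from the exercise):
# document id -> {term: raw term frequency}
docs = {
    "a": {"frog": 2, "snake": 1},
    "b": {"frog": 1, "computer": 4},
    "c": {"computer": 1, "user": 3},
}


def idf(term, docs):
    """Assumed idf definition: ln(N / df); 0.0 if the term never occurs."""
    df = sum(1 for counts in docs.values() if term in counts)
    return math.log(len(docs) / df) if df else 0.0


def tfidf_raw(doc_id, term, docs):
    """Variant 1: tf is the raw number of occurrences."""
    return docs[doc_id].get(term, 0) * idf(term, docs)


def tfidf_normalized(doc_id, term, docs):
    """Variant 2: tf is divided by the count of the document's most frequent term."""
    counts = docs[doc_id]
    return counts.get(term, 0) / max(counts.values()) * idf(term, docs)


for doc_id, counts in docs.items():
    for term in counts:
        print(doc_id, term,
              round(tfidf_raw(doc_id, term, docs), 3),
              round(tfidf_normalized(doc_id, term, docs), 3))
```

With the second variant the tf factor always lies between 0 and 1, so the most frequent term of a document has a weight equal to its idf.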