Processing of large document collections, Spring 2006

582410 Processing of large document collections, Exercise 1

Assume we have the following set of 5 documents (only terms that have been selected are shown):
```
Doc1: cat cat cat
Doc2: cat cat cat dog
Doc3: cat dog mouse
Doc4: cat cat dog dog dog
Doc5: mouse
```
1. How would you represent each document as a vector?
2. Calculate the TF*IDF weights for the terms.
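
One possible way to approach both questions is sketched below. The exercise does not fix a weighting formula, so the sketch assumes raw term counts for TF and the natural-log IDF `log(N / df)`. Other common variants (log-scaled TF, base-2 or base-10 logarithms, length normalization) would give different numbers. The names `docs`, `vocab`, `tf_vector`, and `tfidf` are illustrative only.

```python
import math
from collections import Counter

# The five training documents from the exercise.
docs = {
    "Doc1": "cat cat cat",
    "Doc2": "cat cat cat dog",
    "Doc3": "cat dog mouse",
    "Doc4": "cat cat dog dog dog",
    "Doc5": "mouse",
}

# Fixed term order: each vector has one component per selected term.
vocab = ["cat", "dog", "mouse"]


def tf_vector(text):
    """Return the raw term counts of `text` in vocabulary order."""
    counts = Counter(text.split())
    return [counts[term] for term in vocab]


# Question 1: each document as a term-frequency vector.
tf = {name: tf_vector(text) for name, text in docs.items()}

# Document frequency: in how many documents does each term occur?
n_docs = len(docs)
df = [sum(1 for vec in tf.values() if vec[i] > 0) for i in range(len(vocab))]

# ASSUMPTION: idf = ln(N / df); the exercise does not name a formula.
idf = [math.log(n_docs / d) for d in df]

# Question 2: TF*IDF weight = term count * idf of the term.
tfidf = {name: [f * w for f, w in zip(vec, idf)] for name, vec in tf.items()}

for name in docs:
    print(name, tf[name], [round(w, 3) for w in tfidf[name]])
```

With this term order, the document frequencies are 4 for cat, 3 for dog, and 2 for mouse, so cat, the most widespread term, receives the lowest IDF.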
(Counts as 2 exercises.) We use the same document collection as in Exercise 1 above. Assume that we have a category C and want to build a classifier for it, that is, a classifier that can decide whether a given document belongs to C. We can use the document set above as training data. An expert has kindly told us that Doc1 and Doc2 belong to category C, while Doc3, Doc4, and Doc5 do not. We decide to use the Rocchio method.
1. Construct a classifier (manually is OK).
2. How does the classifier decide whether a new document Doc6 belongs to category C, if Doc6 contains the terms "cat cat dog"? What is the decision?
Use TF*IDF weights.
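
A minimal sketch of one Rocchio variant follows; it is not the only valid reading of the exercise. It assumes raw-count TF with natural-log IDF taken from the training collection, a profile of the form `beta * mean(positives) - gamma * mean(negatives)`, cosine similarity as the matching function, and a decision threshold of 0. The defaults `beta=16` and `gamma=4`, the threshold, and all function names are assumptions, not values given in the exercise.

```python
import math
from collections import Counter

TRAIN = {
    "Doc1": "cat cat cat",
    "Doc2": "cat cat cat dog",
    "Doc3": "cat dog mouse",
    "Doc4": "cat cat dog dog dog",
    "Doc5": "mouse",
}
POSITIVE = {"Doc1", "Doc2"}  # documents the expert assigned to category C
VOCAB = ["cat", "dog", "mouse"]

# ASSUMPTION: idf = ln(N / df), computed on the training documents only.
N = len(TRAIN)
DF = [sum(1 for text in TRAIN.values() if term in text.split()) for term in VOCAB]
IDF = [math.log(N / d) for d in DF]


def tfidf(text):
    """TF*IDF vector of `text`, using raw counts and the training IDF."""
    counts = Counter(text.split())
    return [counts[term] * w for term, w in zip(VOCAB, IDF)]


def mean(vectors):
    """Component-wise average of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def rocchio_profile(beta, gamma):
    """Profile = beta * mean(positive docs) - gamma * mean(negative docs)."""
    pos = [tfidf(t) for name, t in TRAIN.items() if name in POSITIVE]
    neg = [tfidf(t) for name, t in TRAIN.items() if name not in POSITIVE]
    return [beta * p - gamma * q for p, q in zip(mean(pos), mean(neg))]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def in_category(text, beta=16.0, gamma=4.0, threshold=0.0):
    """Assign `text` to C if its similarity to the profile exceeds the threshold."""
    return cosine(tfidf(text), rocchio_profile(beta, gamma)) > threshold


print(in_category("cat cat dog"))                        # assumed beta=16, gamma=4
print(in_category("cat cat dog", beta=1.0, gamma=1.0))   # equal weighting
```

The decision for Doc6 is sensitive to these choices. Under the assumed defaults the similarity is positive and Doc6 is assigned to C, whereas with `beta = gamma = 1` the dog weight in the profile turns negative enough that the similarity drops just below 0. An answer to the exercise should therefore state which parameters and threshold it uses.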