582410 Processing of large document collections, Exercise 1

  1. Assume we have the following set of 5 documents (only terms that have been selected are shown):


    Doc1: cat cat cat
    Doc2: cat cat cat dog
    Doc3: cat dog mouse
    Doc4: cat cat dog dog dog
    Doc5: mouse
    

    1. How would you represent each document as a vector?

    2. Calculate the TF*IDF weights for the terms.
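
One possible way to approach both parts is sketched below in Python. The exercise does not fix a TF*IDF variant, so the sketch assumes raw term counts for TF and IDF = ln(N / df), where N is the number of documents and df is the number of documents containing the term. Other variants (log base 2, length-normalized TF) give different numbers. The names `docs`, `vocab`, `tf_vector`, `idf`, and `tfidf_vector` are made up for this illustration.

```python
import math

# The five documents from the exercise, as lists of the selected terms.
docs = {
    "Doc1": ["cat", "cat", "cat"],
    "Doc2": ["cat", "cat", "cat", "dog"],
    "Doc3": ["cat", "dog", "mouse"],
    "Doc4": ["cat", "cat", "dog", "dog", "dog"],
    "Doc5": ["mouse"],
}

# Fixed term order: each term is one dimension of the document vector.
vocab = ["cat", "dog", "mouse"]


def tf_vector(terms):
    """Part 1: represent a document as a vector of raw term counts."""
    return [terms.count(t) for t in vocab]


def idf(term):
    """Assumed IDF variant: natural log of N / document frequency."""
    df = sum(1 for terms in docs.values() if term in terms)
    return math.log(len(docs) / df)


def tfidf_vector(terms):
    """Part 2: weight each count by the IDF of its term."""
    return [tf * idf(t) for tf, t in zip(tf_vector(terms), vocab)]


for name, terms in docs.items():
    print(name, tf_vector(terms), [round(w, 3) for w in tfidf_vector(terms)])
```

Note that "cat" occurs in four of the five documents, so under this variant it gets the lowest IDF of the three terms, while "mouse" (two documents) gets the highest.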

  2. (counts as 2 exercises) We use the same document collection as in Exercise 1 above. Assume that we have a category C and we would like to build a classifier for it, that is, a classifier that can decide whether a given document belongs to C or not. We can use the document set above as training data. An expert has told us that the documents Doc1 and Doc2 belong to the category C, while the documents Doc3, Doc4, and Doc5 do not. We decide to use the Rocchio method.

    1. Construct a classifier (manually is OK).

    2. How does the classifier decide whether a new document Doc6, containing the terms "cat cat dog", belongs to the category C? What is the decision?

    Use TF*IDF weights in both parts.
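
A minimal sketch of a Rocchio-style classifier for this collection follows. Several choices here are assumptions, not given in the exercise: TF is the raw count and IDF = ln(N / df) computed on the five training documents; the prototype is beta times the centroid of the positive examples minus gamma times the centroid of the negative examples, with beta = 16 and gamma = 4 (values often quoted in the text categorization literature; the course may use others); and a document is assigned to C when its cosine similarity to the prototype exceeds a threshold of 0. All function and variable names are hypothetical.

```python
import math

docs = {
    "Doc1": ["cat", "cat", "cat"],
    "Doc2": ["cat", "cat", "cat", "dog"],
    "Doc3": ["cat", "dog", "mouse"],
    "Doc4": ["cat", "cat", "dog", "dog", "dog"],
    "Doc5": ["mouse"],
}
vocab = ["cat", "dog", "mouse"]
positive = ["Doc1", "Doc2"]          # documents the expert placed in C
negative = ["Doc3", "Doc4", "Doc5"]  # documents the expert placed outside C


def idf(term):
    # Assumed variant: ln(N / df), computed on the training documents only.
    df = sum(1 for terms in docs.values() if term in terms)
    return math.log(len(docs) / df)


def tfidf_vector(terms):
    return [terms.count(t) * idf(t) for t in vocab]


def centroid(names):
    # Component-wise mean of the TF*IDF vectors of the named documents.
    vectors = [tfidf_vector(docs[n]) for n in names]
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def rocchio_prototype(beta=16.0, gamma=4.0):
    # Prototype = beta * positive centroid - gamma * negative centroid.
    pos, neg = centroid(positive), centroid(negative)
    return [beta * p - gamma * n for p, n in zip(pos, neg)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def belongs_to_c(terms, threshold=0.0, beta=16.0, gamma=4.0):
    # Returns (decision, similarity score) for a new document.
    score = cosine(rocchio_prototype(beta, gamma), tfidf_vector(terms))
    return score > threshold, score


doc6 = ["cat", "cat", "dog"]
decision, score = belongs_to_c(doc6)
print(decision, round(score, 3))
```

Under these assumptions Doc6 scores above zero and would be assigned to C. The outcome is sensitive to the parameters, though: with beta = gamma = 1 the same score turns slightly negative, so a written answer should state which values and which threshold were used.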