Processing of Large Document Collections: Exercise 1.

Solution for Exercise 1.1

Assume we have the following set of 5 documents (only terms that have been selected are shown):


Doc1: cat cat cat
Doc2: cat cat cat dog
Doc3: cat dog mouse
Doc4: cat cat dog dog dog
Doc5: mouse
      

  1. How would you represent each document as a vector?

    Each occurring term has to be assigned a unique number that corresponds to a dimension of the term space. This can be done in several different ways: for example, one could simply use the alphabetical ordinal of the term or its order of first appearance. Using the latter approach, the document vectors with term frequencies are as follows:

          dimensions
            1  2  3 
    ----------------------        
    Doc1: ( 3, 0, 0 ) 
    Doc2: ( 3, 1, 0 )
    Doc3: ( 1, 1, 1 )
    Doc4: ( 2, 3, 0 )
    Doc5: ( 0, 0, 1 )
    	  

    A binary vector (consisting of 1's and 0's) would do as well.
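
    For example, in Python the term space and the term-frequency vectors could be built along the following lines (a sketch; the variable names are only illustrative):

        # Assign each term a dimension in order of first appearance and
        # build a term-frequency vector for every document.
        docs = {
            "Doc1": "cat cat cat".split(),
            "Doc2": "cat cat cat dog".split(),
            "Doc3": "cat dog mouse".split(),
            "Doc4": "cat cat dog dog dog".split(),
            "Doc5": "mouse".split(),
        }

        term_index = {}                      # term -> dimension number
        for tokens in docs.values():
            for t in tokens:
                term_index.setdefault(t, len(term_index))

        tf_vectors = {
            name: [tokens.count(t) for t in term_index]
            for name, tokens in docs.items()
        }
        # e.g. tf_vectors["Doc4"] == [2, 3, 0]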


  2. Calculate the TF*IDF weights for the terms.

    The scaling with TF*IDF means that the number of occurrences of a term in the document (its term frequency) is multiplied by the inverse document frequency (IDF) [1]. The IDF represents the informativity (or specificity) of the term. If a term occurs in every document of the collection, it has low informativity; it is of little use, for instance, in determining the similarity of two documents.

    IDF is defined by the formula IDF(t) = log( N / d(t) ), where N is the number of documents in the collection and d(t) is the number of documents in the collection in which the term t occurs. (Below, the base-2 logarithm is used.)

    We calculate the inverse document frequencies as follows:

      IDF('cat') =     0.321928094887362 
      IDF('dog') =     0.736965594166206 
      IDF('mouse') =   1.32192809488736
    	  

    As a result, the TF*IDF [2] vectors are:

    Doc1: (  0.97, 0.00, 0.00 )
    Doc2: (  0.97, 0.74, 0.00 )
    Doc3: (  0.32, 0.74, 1.32 )
    Doc4: (  0.65, 2.21, 0.00 )
    Doc5: (  0.00, 0.00, 1.32 )
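
    For example, continuing the Python sketch of part 1, the IDF values and the TF*IDF vectors could be computed as follows (using the base-2 logarithm, as in the numbers above):

        from math import log2

        N = len(docs)
        # d(t): the number of documents in which the term t occurs
        df = {t: sum(1 for tokens in docs.values() if t in tokens)
              for t in term_index}
        idf = {t: log2(N / df[t]) for t in term_index}
        # idf ≈ {'cat': 0.32, 'dog': 0.74, 'mouse': 1.32}

        tfidf_vectors = {
            name: [tf * idf[t] for t, tf in zip(term_index, vec)]
            for name, vec in tf_vectors.items()
        }
        # e.g. tfidf_vectors["Doc2"] ≈ [0.97, 0.74, 0.00]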
    	  

    To calculate the normalized TF*IDF weights for the terms, the weight of each term in each vector is divided by the length of the vector. The vector length is calculated as follows:

    |a| = sqrt( x^2 + y^2 + z^2 )
       |Doc1| = sqrt( 0.97^2 + 0.00 + 0.00 ) = 0.97
       |Doc2| = sqrt( 0.97^2 + 0.74^2 + 0.00 ) = 1.21
       |Doc3| = sqrt( 0.32^2 + 0.74^2 + 1.32^2 ) = 1.54
       |Doc4| = sqrt( 0.65^2 + 2.21^2 + 0.00 ) = 2.30
       |Doc5| = sqrt( 0.00 + 0.00 + 1.32^2 ) = 1.32
    	  

    Finally, the normalized vectors are:

    Doc1: (  1.00, 0.00, 0.00 )
    Doc2: (  0.79, 0.61, 0.00 )
    Doc3: (  0.21, 0.48, 0.85 )
    Doc4: (  0.28, 0.96, 0.00 )
    Doc5: (  0.00, 0.00, 1.00 )
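
    Continuing the same sketch, the normalization could be done, for example, like this:

        from math import sqrt

        def normalize(vec):
            """Divide a vector by its Euclidean length."""
            length = sqrt(sum(w * w for w in vec))
            return [w / length for w in vec] if length > 0 else vec

        normalized = {name: normalize(v) for name, v in tfidf_vectors.items()}
        # e.g. normalized["Doc3"] ≈ [0.21, 0.48, 0.85]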
    	  

Solution for Exercise 1.2

We use the same document collection as in Exercise 1.1 above. Assume that we have a category C and we would like to build a classifier for this category. That is, we need a classifier that can decide whether some document belongs to the category C. We can use the document set above for training the classifier. Some expert has kindly told us that the documents Doc1 and Doc2 belong to the category C, while the documents Doc3, Doc4, and Doc5 do not belong to C. We use the Rocchio method.

  1. Construct a classifier (manually, just on paper).
  2. How does the classifier decide if a new document Doc6 belongs to the category C, if Doc6 contains the terms "cat cat dog"? What is the decision?

Use TF*IDF weights.

  1. The underlying assumption is that the distribution of terms occurring in the documents of class C is somehow different from that of not-C. Usually, it takes a considerable number of training samples to form a classifier that can make the distinction at a reasonable level of effectiveness.

    The Rocchio classifier examines the labeled documents and attempts to formulate an optimal query, i.e., a set of words that "maximizes the difference between the mean of the correlations of the relevant documents [...] and the mean of the correlations of the non-relevant documents" [3]. So what is correlation? Correlation is a measure of covariance: how much two variables covary (a high covariance implies that when one variable is high/low, the other one is high/low as well). Here, given two document vectors, we measure the angle between them: if the angle is small, the vectors are highly correlated, and vice versa. If we normalize the dot product of the vectors by their lengths, we get the cosine of the angle between them. Salton, for instance, uses correlation, association and similarity more or less synonymously. If two documents correlate, they have similar values for the same terms, i.e., the same terms occur with the same emphasis; therefore, they are similar and have a high association.

    Hence, when a query vector and a document vector correlate, they contain the same terms with similar weights, i.e., the vectors point in roughly the same direction.

    cos( q, d ) = dot( q, d ) / ( |q| |d| )

    where dot(q, d) is the dot-product (inner-product) of the two vectors.
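
    In Python, for example, the cosine could be computed as follows:

        from math import sqrt

        def dot(a, b):
            """Dot product (inner product) of two vectors."""
            return sum(x * y for x, y in zip(a, b))

        def cosine(a, b):
            """Cosine of the angle between vectors a and b."""
            return dot(a, b) / (sqrt(dot(a, a)) * sqrt(dot(b, b)))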

    So, let us assume we have a corpus (a collection of documents) D = { d1, d2, d3, ..., dN } of N documents and two subsets of documents: the set of documents of class C ⊆ D and the set of documents not belonging to class C, i.e., the complement of C, C^c ⊆ D.

    The optimal query vector q* contains a weight for each term such that it separates the documents of class C from those of not-C. It is a sort of descriptor of the class C. Rocchio derives an approximation of the optimal vector q*. The weight of the kth term equals

                Σ_{di ∈ C} di,k            Σ_{di ∈ C^c} di,k
    q_k  =  β * ---------------    -   γ * -----------------
                      |C|                        |C^c|
    where β and γ are suitable multipliers and di,k is the weight (here, the TF*IDF weight) of the kth term in the document di.
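
    For instance, with the TF*IDF vectors of Exercise 1.1 and the multipliers β = 2 and γ = 1 that are used later in this solution, the weight of 'cat' in the class C vector becomes

        q_cat = 2 * ( 0.97 + 0.97 ) / 2  -  1 * ( 0.32 + 0.65 + 0.00 ) / 3
              ≈ 1.93 - 0.32
              ≈ 1.61

    which matches the first entry of the class C row in the table below.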

    We build two class vectors, i.e., classifiers: one for the class C and another for not_C. To determine whether a particular (new, unlabeled) vector belongs to the class C or not, we measure the cosine of the new vector against each of these class vectors. If the vector is closer to C than to not_C, we are inclined to assign the new vector to C.

    Generally, the process of classification is not quite as straightforward as in this example. Often a single class vector is used together with a similarity threshold: if the cosine of a new vector and the vector of C yields a very low score (below the threshold), we say that the vector is not in C. In that case the not_C vector is not needed. However, the threshold does not appear out of nowhere. One often uses machine learning of some sort to derive it: the learning is typically learning from errors, and an error, by definition, requires two classes. (Think of learning to recognize Volvos in the rush-hour stream of cars: there would have to be cars of other makes, too, or otherwise we would not learn much.)

    The Rocchio vector could be calculated, for example, as follows:

            i = class;
            POS_i = number of documents of class i;
            NEG_i = number of documents not of class i;
            FOREACH t in terms
              begin
              weight = 0;
              FOREACH d in documents
                begin
                if ( class(d) == i )
                then
                  weight += beta * tfidf(t,d) / POS_i;   # reward occurrence in class i
                else
                  weight -= gamma * tfidf(t,d) / NEG_i;  # penalize occurrence outside class i
                end;
              end;
              c_i[t] = weight;
            end;
            

    As a result, we have a vector c_i = ( c_i[0], c_i[1], ..., c_i[n] ) representing the class i. One could replace TF*IDF with any other weights, but TF*IDF has been shown to work quite well in many classification tasks. One can vary the values of beta and gamma according to one's taste; here we assign β = 2 and γ = 1, which gives the following class vectors:

                  cat               dog                mouse
    --------------------------------------------------------------------- 
    class C:      1.60964047443681  -0.245655198055403 -0.881285396591573
    class NOT_C: -0.321928094887362  1.596758787360113 1.7625707931831467
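
    For example, continuing the Python sketches of Exercise 1.1, the class vectors above could be reproduced along these lines (the labels of Doc1 and Doc2 as class C come from the exercise description):

        beta, gamma = 2.0, 1.0
        in_C = {"Doc1", "Doc2"}              # documents labeled as class C
        terms = list(term_index)             # ['cat', 'dog', 'mouse']

        def rocchio(positive_docs):
            """Rocchio class vector from positive vs. negative examples."""
            pos = [v for name, v in tfidf_vectors.items() if name in positive_docs]
            neg = [v for name, v in tfidf_vectors.items() if name not in positive_docs]
            return [beta * sum(v[k] for v in pos) / len(pos)
                    - gamma * sum(v[k] for v in neg) / len(neg)
                    for k in range(len(terms))]

        class_C     = rocchio(in_C)                       # ≈ [ 1.61, -0.25, -0.88]
        class_not_C = rocchio(set(tfidf_vectors) - in_C)  # ≈ [-0.32,  1.60,  1.76]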
    
  2. **      input query
            >cat cat dog
    **      building query vector
    **      Reading IDF file in .
            cat     0.321928094887362
            dog     0.736965594166206
            mouse   1.32192809488736
    **      Calculating TFIDF values
    **      Using termspace: termspace.list
    **      generating vectors
            v = (  0.643856189774724 0.736965594166206 0 )
    
    
    **      cosine of the query and class C: 0.4720765039202388
    **      cosine of the query and class NOT_C: 0.412783983692336
    

    The difference between the two is small, but since the cosine with the class C vector is higher (0.47 > 0.41), Doc6 is assigned to the category C.
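
    Continuing the same Python sketch, the classification of Doc6 could be carried out, for example, as follows (the query is weighted with plain TF*IDF, as in the output above):

        query_tokens = "cat cat dog".split()
        query = [query_tokens.count(t) * idf[t] for t in terms]
        # query ≈ [0.64, 0.74, 0.00]

        sim_C     = cosine(query, class_C)      # ≈ 0.47
        sim_not_C = cosine(query, class_not_C)  # ≈ 0.41
        decision  = "C" if sim_C > sim_not_C else "not C"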

    Let's experiment with the normalized vectors:

    ORIGINAL
    
    Doc1: (  0.97, 0.00, 0.00 )
    Doc2: (  0.97, 0.74, 0.00 )
    Doc3: (  0.32, 0.74, 1.32 )
    Doc4: (  0.65, 2.21, 0.00 )
    Doc5: (  0.00, 0.00, 1.32 )
    
    NORMALIZED
    
    Doc1: (  1.00, 0.00, 0.00 )
    Doc2: (  0.79, 0.61, 0.00 )
    Doc3: (  0.21, 0.48, 0.85 )
    Doc4: (  0.28, 0.96, 0.00 )
    Doc5: (  0.00, 0.00, 1.00 )
    	  

    With the normalized document vectors, the Rocchio vectors would have been different, too. Notice how the large weight of 'dog' in Doc4 is dampened: the value of 'dog' in the class C vector is now positive.

                  cat               dog                mouse
    -----------------------------------------------------------------------
    class C:      1.632430674063078 0.1278330522452787 -0.618108320821812
    class NOT_C: -0.572386659625350 0.6542812111738740  1.23621664164362
    
    **      input query
            >cat cat dog
    **      building query vector
    **      Reading IDF file in .
            cat     0.321928094887362
            dog     0.736965594166206
            mouse   1.32192809488736
    **      Calculating TFIDF values
    **      Using termspace: termspace.list
    **      generating vectors
            v = (  0.643856189774724 0.736965594166206 0 )
    **      done
    
    **      cosine of the query and class C: 0.683279777689521
    **      cosine of the query and class NOT_C: 0.07852428326038

    Now the difference between the two cosines is clear, and Doc6 is again assigned to the category C.
    	  

References

  1. K. Spärck Jones (1972) A Statistical Interpretation of Term Specificity and Its Application in Retrieval. Journal of Documentation, 28(1): 11-21.
  2. G. Salton, A. Wong and C.S. Yang (1975) A Vector Space Model for Automatic Indexing. Communications of the ACM, 18(11): 613-620.
  3. J. J. Rocchio (1971) Relevance Feedback in Information Retrieval. In G. Salton (ed.), The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall, Englewood Cliffs, NJ, USA, 313-323.

Juha Makkonen, Miro Lehtonen