Processing of large document collections
Some discussion on the exercises 2.1-2.3
23.10.2001, Helena Ahonen-Myka

Vector representation of a document
-----------------------------------

In our sample document collection, there are 118 words. In alphabetical
order, the list starts with:

  absorption  1,1
  agriculture 1,1
  anaemia     2,1
  analyse     1,1
  application 1,2
  appropriate 1,1
  area        1,2
  assess      1,2
  benefit     1,1
  biological  2,1
  blood       1,2
  ...

(The first number is the number of times the word occurs in the
collection, the second the number of documents it occurs in.)

If we choose to consider each word as a term, we can represent all the
10 documents of the collection using vectors of 118 dimensions. You can
think about a document vector as an array of 118 elements, indexed,
e.g., 0-117. We can, for instance, let the 1st element represent the
1st word in the alphabetical order, the 2nd element the 2nd word, etc.
In our case, we can give the name di to the term array for the
document i.

If we are only interested in whether a term occurs in a document, the
vector for the document Doc1 would start like this:

  d1[0]=0   # absorption doesn't occur
  d1[1]=0   # agriculture -"-
  d1[2]=0   # anaemia -"-
  d1[3]=0   # analyse -"-
  d1[4]=1   # application occurs
  ...
  d1[21]=1  # current occurs
  d1[22]=0  # deficiency doesn't occur
  d1[23]=0  # demand -"-
  d1[24]=1  # develop occurs
  ...

Term weighting using TFIDF
--------------------------

Usually we want to say that some terms are more important than other
terms. We can express this by weighting the terms of a vector.
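(Before turning to weights, an aside: the binary vectors above can be
built with a few lines of Python. This is only a sketch; the toy
vocabulary and document below stand in for the full 118-word list and
the real exercise data.)

```python
# Toy stand-in for the 118-word vocabulary, in alphabetical order.
vocabulary = sorted(["absorption", "agriculture", "anaemia",
                     "analyse", "application", "current"])

def binary_vector(doc_words, vocab):
    """One element per vocabulary term: 1 if the term occurs
    in the document, 0 otherwise."""
    words = set(doc_words)
    return [1 if term in words else 0 for term in vocab]

# A tiny "Doc1" containing only two of the vocabulary words:
d1 = binary_vector(["application", "current"], vocabulary)
print(d1)  # [0, 0, 0, 0, 1, 1]
```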
Very often, the standard TFIDF function is used:

  tfidf(t_k,d_j) = #(t_k,d_j) * log(|Tr| / #Tr(t_k)),

in which

  #(t_k,d_j) denotes the number of times term t_k occurs in document d_j
  #Tr(t_k)   denotes the number of documents in Tr in which t_k occurs
  |Tr|       denotes the number of documents (in our case: 10)

For instance, in Doc1 the term 'application' occurs once, and in the
collection it occurs in 2 documents:

  tfidf(application,Doc1) = 1 * log(10 / 2) = log 5   ~ 0.70
  tfidf(current,Doc1)     = 1 * log(10 / 7) = log 1.4 ~ 0.15

(In the calculations here and below, the approximate values have been
calculated using base-10 logarithms.)

If we had some word that occurs 7 times in Doc1 and only in Doc1, the
weight of that word would be:

  tfidf(doc1word,Doc1) = 7 * log(10/1) = 7 * log 10 = 7

The effect of 'log' in this formula is a smoothing one. Assume there
were 1000 documents in the collection:

  5 * log (1000 / 1)   = 5 * 3.00 = 15.0
  5 * log (1000 / 2)   = 5 * 2.70 = 13.5
  5 * log (1000 / 3)   = 5 * 2.52 = 12.6
  5 * log (1000 / 10)  = 5 * 2.00 = 10.0
  5 * log (1000 / 11)  = 5 * 1.96 =  9.8
  5 * log (1000 / 100) = 5 * 1.00 =  5.0

It may be helpful to think that, in our case, we really only want to
make distinctions like:

  - a term occurs in 1 document
  - a term occurs in 2 documents
  - a term occurs in many documents
  - a term occurs in very many documents

that is, we don't care whether a term occurs in 50 or in 51 documents.
For weighting, this difference wouldn't be significant.

In Doc1, there are 17 words, which means that in the array d1 there are
17 non-zero weights. The rest of the elements have the weight 0. The
weights of the non-zero terms range from 0.15 to 2 (below, the first
number is the term frequency in Doc1 and the second is the document
frequency in the collection):

  1,1: log 10 = 1          (resonance, chemical, probe, solution)
  1,2: log (10 / 2) = 0.7  (magnetic, application, field, patient,
                            medical, imaging, research, two, develop)
  1,3: log (10 / 3) = 0.5  (nuclear)
  2,1: 2 * log 10 = 2      (biological)
  1,7: log (10 / 7) = 0.15 (current, project)

  d1[0]  = 0
  d1[1]  = 0
  ...
  d1[4]  = 0.7
  ...
  d1[21] = 0.15
  d1[24] = 0.7
  ...
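The weights above can be reproduced with a small helper (a sketch only;
base-10 logarithm, as in the calculations above):

```python
import math

def tfidf(tf, df, n_docs):
    """TFIDF as defined above: tf * log10(n_docs / df),
    where tf is the term frequency in the document and df the
    number of documents the term occurs in."""
    return tf * math.log10(n_docs / df)

# The Doc1 examples: 'application' (tf=1, df=2), 'current' (tf=1, df=7),
# and a hypothetical word occurring 7 times in Doc1 only.
print(round(tfidf(1, 2, 10), 2))  # 0.7
print(round(tfidf(1, 7, 10), 2))  # 0.15
print(tfidf(7, 1, 10))            # 7.0
```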
To obtain weights between 0 and 1, the weights can be normalized.

Term (feature) selection
------------------------

Often it is necessary to reduce the dimensionality of the term space,
e.g., to speed up learning or to produce more general results.
Reducing the dimensionality means in our case selecting just part of
the words and removing the others. After this, the document vectors
are shorter.

There are several methods for selecting "good" terms. The following
ones calculate an importance value for each term, after which we can
rank the terms according to the calculated value and choose the best
ones.

Document frequency (DF)
-----------------------

Document frequency is the number of documents in which a term occurs.
Based on this function, the ranking of the terms looks like the
following:

  7 project
  7 current
  4 environment
  3 nuclear
  2 application
  2 area
  2 assess
  2 blood
  2 change
  2 climate
  2 develop
  2 ecology
  2 effect
  2 energy
  2 environment
  2 field
  2 gene
  2 imaging
  2 impact
  2 instrument
  2 lung
  2 magnetic
  2 medical
  2 natural
  2 patient
  2 research
  2 treatment
  2 two
  2 water
  1 use
  1 temperature
  ...

We might, e.g., set the threshold to 2 and remove all the words that
occur only once.

Information gain (IG)
---------------------

In order to calculate information gain for a term, we have to estimate
the following probabilities:

P(t): probability of a term t,
  1. P(current) = 7/10, or
  2. P(current) = 1/118, or
  3. P(current) = 7/182, if we consider the proportion of the
     occurrences of 'current' of all the term occurrences

P(~t): probability of some other term than t,
  1. P(~current) = 3/10, or
  2. P(~current) = 117/118, or
  3. P(~current) = 175/182

P(c_i): probability of category i,
  1. P(medicine) = 1/3, if we assume 3 categories with equal
     probability, or
  2. P(medicine) = 4/10, if we consider the proportion of medicine
     documents in the collection

P(c_i | t): probability of category i if t is in the document, i.e.,
which proportion of the documents where t occurs belongs to the
category i

  P(medicine | current)   = 4/7
  P(medicine | treatment) = 2/2
  P(medicine | network)   = 0/1

P(c_i | ~t): probability of category i if t isn't in the document,
i.e., which proportion of the documents where t doesn't occur belongs
to the category i

  P(medicine | ~current)   = 0/3
  P(medicine | ~treatment) = 2/8
  P(medicine | ~network)   = 4/9

(For symmetry, these probabilities are also explained here, although
they are not needed with IG:

  P(t | c_i):  in which proportion of the documents in category c_i
               t occurs
  P(~t | c_i): in which proportion of the documents in category c_i
               t doesn't occur )

Information gain can now be calculated for each term:

  G(t) = - (P(c_1) log P(c_1) + P(c_2) log P(c_2) + P(c_3) log P(c_3))
         + P(t) (P(c_1|t) log P(c_1|t) + P(c_2|t) log P(c_2|t)
                 + P(c_3|t) log P(c_3|t))
         + P(~t) (P(c_1|~t) log P(c_1|~t) + P(c_2|~t) log P(c_2|~t)
                  + P(c_3|~t) log P(c_3|~t))

  P(medicine)    = 4/10, log(4/10) = -0.4
  P(energy)      = 2/10, log(2/10) = -0.7
  P(environment) = 3/10, log(3/10) = -0.5

so the first term (the entropy of the categories) is

  -(0.4 * -0.4 + 0.2 * -0.7 + 0.3 * -0.5) = 0.16 + 0.14 + 0.15 = 0.45

  P(nuclear)  = 3/10 = 0.3
  P(~nuclear) = 7/10 = 0.7

  P(medicine|nuclear)     = 2/3, log(2/3) = -0.18
  P(energy|nuclear)       = 1/3, log(1/3) = -0.48
  P(environment|nuclear)  = 0/3, log 0 is not defined (the whole term
                            0 * log 0 is taken to be 0)

  P(medicine|~nuclear)    = 2/7, log(2/7) = -0.54
  P(energy|~nuclear)      = 1/7, log(1/7) = -0.85
  P(environment|~nuclear) = 4/7, log(4/7) = -0.24

  IG(nuclear) = 0.45
                + 0.3 * (0.66 * -0.18 + 0.33 * -0.48 + 0)
                + 0.7 * (0.29 * -0.54 + 0.14 * -0.85 + 0.57 * -0.24)
              = 0.45 - 0.3 * 0.28 - 0.7 * 0.41
              = 0.45 - 0.08 - 0.29 = 0.08

  P(current)  = 7/10
  P(~current) = 3/10

  P(medicine|current)     = 4/7, log(4/7) = -0.24
  P(energy|current)       = 1/7, log(1/7) = -0.85
  P(environment|current)  = 2/7, log(2/7) = -0.54

  P(medicine|~current)    = 0/3, log 0 not defined (term taken to be 0)
  P(energy|~current)      = 1/3, log(1/3) = -0.48
  P(environment|~current) = 2/3, log(2/3) = -0.18

  IG(current) = 0.45 - 0.7 * 0.41 - 0.3 * 0.28 = 0.45 - 0.37 = 0.08

What if 'current' occurred in all 10 documents?
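Questions like this are easy to explore with a small helper that
evaluates the formula above (a sketch; base-10 logarithms, with the
convention that 0 * log 0 = 0, and a function name of my own choosing):

```python
import math

def entropy_term(p):
    """p * log10(p), with the convention 0 * log 0 = 0."""
    return p * math.log10(p) if p > 0 else 0.0

def information_gain(p_cats, p_t, p_cats_given_t, p_cats_given_not_t):
    """Information gain G(t) as in the formula above."""
    gain = -sum(entropy_term(p) for p in p_cats)
    gain += p_t * sum(entropy_term(p) for p in p_cats_given_t)
    gain += (1 - p_t) * sum(entropy_term(p) for p in p_cats_given_not_t)
    return gain

# IG(nuclear): categories medicine, energy, environment.
ig = information_gain([0.4, 0.2, 0.3],
                      3/10,
                      [2/3, 1/3, 0.0],
                      [2/7, 1/7, 4/7])
print(round(ig, 2))  # 0.08

# If 'current' occurred in all 10 documents, P(c_i|current) = P(c_i)
# and the two entropy sums cancel, so the gain is 0: such a term
# does not discriminate between the categories at all.
print(round(information_gain([0.4, 0.2, 0.3], 1.0,
                             [0.4, 0.2, 0.3], [0.0, 0.0, 0.0]), 2))  # 0.0
```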
  P(treatment)  = 2/10
  P(~treatment) = 8/10

  P(medicine|treatment)     = 2/2, log 1 = 0
  P(energy|treatment)       = 0/2, undefined (term taken to be 0)
  P(environment|treatment)  = 0/2, undefined (term taken to be 0)

  P(medicine|~treatment)    = 2/8, log(0.25) = -0.60
  P(energy|~treatment)      = 2/8, log(0.25) = -0.60
  P(environment|~treatment) = 4/8, log(0.5)  = -0.30

  IG(treatment) = 0.45 - 0.2 * 0 - 0.8 * 0.45 = 0.45 - 0.36 = 0.09

What if 'treatment' occurred in all of the 4 medicine documents?

Mutual information (MI)
-----------------------

  A: the number of documents in which term t and category c co-occur
  B: the number of documents in which t occurs without c
  C: the number of documents in which c occurs without t
  N: the total number of documents (= 10)

  MI(t,c) = log ((A * N) / ((A + C) * (A + B)))

  MI(t) = max_i {MI(t,c_i)}

For instance:

  MI(nuclear,medicine): A=2, B=1, C=2
  MI(nuclear,medicine) = log ((2 * 10) / (4 * 3)) = log 1.67 = 0.22

  MI(nuclear,energy): A=1, B=2, C=1
  MI(nuclear,energy) = log ((1 * 10) / (2 * 3)) = log 1.67 = 0.22

  MI(nuclear,environment): A=0, B=3, C=4
  MI(nuclear,environment) = log 0 = undefined

  MI(nuclear) = 0.22

The Rocchio method
------------------

In the Rocchio categorization method, the categories are represented by
vectors similar to the document vectors. That is, if in our example,
without dimensionality reduction, we have 118 terms, the categories
medicine, energy, and environment are represented by vectors of 118
elements.

Let POS_i and NEG_i be the sets of positive and negative examples,
respectively, for the category i. In our dataset, POS_medicine includes
the documents Doc1-Doc4 and NEG_medicine includes the documents
Doc5-Doc10, hence |POS_medicine| = 4 and |NEG_medicine| = 6. The
weights of the category vector c_medicine can now be set using the
formula:

  w_k_medicine =   B/|POS_medicine| * (w_k1 + w_k2 + w_k3 + w_k4)
                 - A/|NEG_medicine| * (w_k5 + w_k6 + w_k7 + w_k8
                                       + w_k9 + w_k10)

where w_k_medicine is the weight of the kth term/element of the
medicine vector. Similarly, w_k1 is the weight of the kth term of Doc1.
In this formula, A and B are control parameters that can be used to set
the relative importance of the positive and negative examples.
For instance, if we say that B=2 and A=1, we don't want the negative
examples to have as strong an influence as the positive examples.

In the positive sample documents for medicine, the following weights
are given for 'treatment' (term 110):

  w_110_1 = 0
  w_110_2 = 0.7
  w_110_3 = 0.7
  w_110_4 = 0

As 'treatment' does not occur in the negative examples, all the
corresponding weights are zero. Hence, the weight for the 110th term of
the vector for 'medicine' is:

  w_110_medicine = 2/4 * (0 + 0.7 + 0.7 + 0) - 1/6 * 0 = 0.7

Similarly, we can calculate a weight for the term 'nuclear', which also
occurs in one negative example:

  w_70_1 = 0.5
  w_70_2 = 0
  w_70_3 = 0
  w_70_4 = 0.5
  w_70_6 = 0.5

  w_70_medicine = 2/4 * (0.5 + 0.5) - 1/6 * 0.5 = 0.5 - 0.08 = 0.42

In this way, weights are calculated for each term in the term space,
and the resulting vector is used to represent the corresponding
category. When new documents are classified, the similarity between the
new document vector and each category vector is calculated, e.g., using
the cosine measure (the dot product of length-normalized vectors). The
new document is classified under the category whose vector is the most
similar.
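The whole procedure can be sketched as follows (a toy illustration:
2-element vectors stand in for the full 118-dimensional ones, and the
first element plays the role of the 'nuclear' weight from the example
above):

```python
import math

def rocchio_weights(pos_docs, neg_docs, b=2.0, a=1.0):
    """Category vector from the formula above:
    b/|POS| * sum over positives - a/|NEG| * sum over negatives,
    computed element by element (b=2, a=1 as in the example)."""
    dims = len(pos_docs[0])
    return [b / len(pos_docs) * sum(d[k] for d in pos_docs)
            - a / len(neg_docs) * sum(d[k] for d in neg_docs)
            for k in range(dims)]

def cosine(u, v):
    """Cosine similarity: dot product of length-normalized vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u))
                  * math.sqrt(sum(y * y for y in v)))

# Positives Doc1-Doc4 and negatives Doc5-Doc10; only Doc1, Doc4 and
# Doc6 contain 'nuclear' (weight 0.5), matching the example.
pos = [[0.5, 0], [0, 0], [0, 0], [0.5, 0]]
neg = [[0, 0], [0.5, 0], [0, 0], [0, 0], [0, 0], [0, 0]]
c_medicine = rocchio_weights(pos, neg)
print(round(c_medicine[0], 2))       # 0.42, as calculated above

# Classification compares a new document vector to each category
# vector, e.g.:
print(round(cosine([1, 0], [1, 1]), 2))  # 0.71
```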