Processing of large document collections
Some discussion on the exercises 2.1-2.3
23.10.2001, Helena Ahonen-Myka
Vector representation of a document
-----------------------------------
In our sample document collection, there are 118 words.
In alphabetical order, the list starts with:
absorption 1,1
agriculture 1,1
anaemia 2,1
analyse 1,1
application 1,2
appropriate 1,1
area 1,2
assess 1,2
benefit 1,1
biological 2,1
blood 1,2
...
If we choose to consider each word as a term,
we can represent all the 10 documents of the collection
using vectors of 118 dimensions. You can think of a document
vector as an array of 118 elements, indexed, e.g., 0-117.
We can, for instance, let the 1st element represent the
1st word in the alphabetical order, the 2nd element
the 2nd word, etc. In our case, we give the name di
to the term array for the document i. If we are only interested in
whether a term occurs in a document, the vector for
the document Doc1 would start like this:
d1[0]=0 # absorption doesn't occur
d1[1]=0 # agriculture -"-
d1[2]=0 # anaemia -"-
d1[3]=0 # analyse -"-
d1[4]=1 # application occurs
...
d1[21]=1 # current occurs
d1[22]=0 # deficiency doesn't occur
d1[23]=0 # demand -"-
d1[24]=1 # develop occurs
...
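Such a binary vector can be sketched in Python; the short
vocabulary and word set below are illustrative stand-ins for the
full 118-word list, not the real collection:

```python
# Build a binary term vector over a fixed, alphabetically sorted
# vocabulary. Both collections are small illustrative samples,
# not the real 118-word vocabulary.
vocabulary = ["absorption", "agriculture", "anaemia", "analyse", "application"]
doc1_words = {"application", "current", "develop"}  # words occurring in Doc1

# d1[k] = 1 if the k-th vocabulary word occurs in the document, else 0
d1 = [1 if term in doc1_words else 0 for term in vocabulary]
# d1 == [0, 0, 0, 0, 1]
```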
Term weighting using TFIDF
--------------------------
Usually we want to express that some terms are more important
than others. We can do this by weighting the terms of
a vector. Very often, the standard TFIDF function is used:
tfidf(t_k,d_j) = #(t_k,d_j) * log(|Tr|/#Tr(t_k)),
in which
#(t_k,d_j) denotes the number of times term t_k occurs
in document d_j
#Tr(t_k) denotes the number of documents in Tr in which
t_k occurs
|Tr| denotes the number of documents (in our case: 10)
For instance, in Doc1, the term 'application' occurs once,
and in the collection it occurs in 2 documents:
tfidf(application,Doc1) = 1 * log(10 / 2) = log 5 ~ 0.70
tfidf(current,Doc1) = 1 * log(10 / 7) = log 1.4 ~ 0.15
(In the calculations here and below, the approximate values have
been calculated using base 10 in logarithms.)
If we had some word that occurs 7 times in Doc1 and only in Doc1,
the weight of that word would be:
tfidf(doc1word,Doc1) = 7 * log(10/1) = 7 * log 10 = 7
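The TFIDF formula can be computed directly; the sketch below uses
base-10 logarithms to match the approximate values above (the
function name is our own):

```python
import math

def tfidf(tf, df, n_docs):
    """TFIDF weight: #(t,d) * log(|Tr| / #Tr(t)), with a base-10 log."""
    return tf * math.log10(n_docs / df)

# the worked examples, with |Tr| = 10 documents
w_application = tfidf(1, 2, 10)  # 1 * log(10/2) ~ 0.70
w_current     = tfidf(1, 7, 10)  # 1 * log(10/7) ~ 0.15
w_doc1word    = tfidf(7, 1, 10)  # 7 * log(10/1) = 7.0
```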
The effect of 'log' in this formula is a smoothing one.
Assume there were 1000 documents in the collection:
5 * log (1000 / 1) = 5 * 3.00 = 15.0
5 * log (1000 / 2) = 5 * 2.70 = 13.5
5 * log (1000 / 3) = 5 * 2.52 = 12.6
5 * log (1000 / 10) = 5 * 2.00 = 10.0
5 * log (1000 / 11) = 5 * 1.96 = 9.8
5 * log (1000 / 100) = 5 * 1.00 = 5.0
It may be helpful to think that, in our case, we really
only want to make distinctions like:
- a term occurs in 1 document
- a term occurs in 2 documents
- a term occurs in many documents
- a term occurs in very many documents
that is, we don't care whether a term occurs in 50 or in 51
documents. For weighting, this difference wouldn't be
significant.
In Doc1, there are 17 distinct terms, which means that in the array d1
there are 17 non-zero weights. The rest of the elements have
the weight 0. The weights of the non-zero terms range from 0.15 to 2
(each line below gives the term frequency and the document frequency,
the resulting weight, and the terms in question):
1,1: 1 * log(10/1) = 1 (resonance, chemical, probe, solution)
1,2: 1 * log(10/2) ~ 0.7 (magnetic, application, field, patient, medical,
imaging, research, two, develop)
1,3: 1 * log(10/3) ~ 0.5 (nuclear)
2,1: 2 * log(10/1) = 2 (biological)
1,7: 1 * log(10/7) ~ 0.15 (current, project)
d1[0] = 0
d1[1] = 0
...
d1[4] = 0.7
...
d1[21] = 0.15
d1[24] = 0.7
...
To obtain weights between 0 and 1, the weights can be normalized.
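The text leaves the normalization method open; one common choice
is cosine (length) normalization, where each weight is divided by
the Euclidean length of the vector:

```python
import math

def normalize(weights):
    """Divide every weight by the vector's Euclidean length, so that
    the normalized vector has length 1 and all weights fall in [0, 1]
    (assuming the original weights are non-negative)."""
    length = math.sqrt(sum(w * w for w in weights))
    return [w / length for w in weights] if length > 0 else list(weights)

# some of the non-zero weights of d1, which range from 0.15 to 2
v = normalize([2.0, 0.7, 0.5, 0.15])
```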
Term (feature) selection
------------------------
Often it is necessary to reduce the dimensionality of
the term space, e.g., to speed up learning or
to produce more general results. In our case, reducing the
dimensionality means selecting only a subset of the words and
removing the others. After this, the document vectors are
shorter.
There are several methods for selecting "good" terms.
The following ones calculate an importance value for
each term, after which we can rank the terms according
to the calculated value and choose the best ones.
Document frequency (DF)
-----------------------
Document frequency is the number of documents in which
a term occurs. Based on this function, the ranking of the terms
looks like the following:
7 project
7 current
4 environment
3 nuclear
2 application
2 area
2 assess
2 blood
2 change
2 climate
2 develop
2 ecology
2 effect
2 energy
2 field
2 gene
2 imaging
2 impact
2 instrument
2 lung
2 magnetic
2 medical
2 natural
2 patient
2 research
2 treatment
2 two
2 water
1 use
1 temperature
...
We might, e.g., set the threshold to 2 and remove all the words
that occur in only one document.
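Selection by document frequency then amounts to a simple filter;
the table below is a toy excerpt of the ranking above:

```python
# Document frequencies for a handful of the terms listed above
df = {"project": 7, "current": 7, "environment": 4, "nuclear": 3,
      "application": 2, "use": 1, "temperature": 1}

threshold = 2
# keep the terms that occur in at least `threshold` documents
selected = {term for term, freq in df.items() if freq >= threshold}
# 'use' and 'temperature' are dropped
```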
Information gain (IG)
---------------------
In order to calculate information gain for a term, we have to
estimate the following probabilities:
P(t) probability of a term t,
1. P(current) = 7/10, or
2. P(current) = 1/118, or
3. P(current) = 7/182, if we consider the proportion of
the occurrences of 'current' among all term occurrences
P(~t) probability of some other term than t,
1. P(~current) = 3/10, or
2. P(~current) = 117/118, or
3. P(~current) = 175/182
P(c_i) probability of category i,
1. P(medicine) = 1/3, if we assume 3 equally probable categories, or
2. P(medicine) = 4/10 if we consider the proportion of medicine
documents in the collection
P(c_i | t) probability of category i if t is in the document,
i.e. which proportion of the documents where t occurs
belong to the category i
P(medicine | current) = 4/7
P(medicine | treatment) = 2/2
P(medicine | network) = 0/1
P(c_i | ~t) probability of category i if t isn't in the document,
i.e., which proportion of the documents where t doesn't occur
belongs to the category i
P(medicine | ~current) = 0/3
P(medicine | ~treatment) = 2/8
P(medicine | ~network) = 4/9
(For symmetry, also these probabilities are explained here, although
they are not needed with IG:
P(t | c_i) in which proportion of documents in category c_i t occurs
P(~t | c_i) in which proportion of documents in category c_i t doesn't occur
)
Information gain can now be calculated for each term:
G(t) = - (P(c_1) log P(c_1) + P(c_2) log P(c_2) + P(c_3) log P(c_3))
+ P(t) (P(c_1|t) log P(c_1|t) + P(c_2|t) log P(c_2|t)
+ P(c_3|t) log P(c_3|t))
+ P(~t) (P(c_1|~t) log P(c_1|~t) + P(c_2|~t) log P(c_2|~t)
+ P(c_3|~t) log P(c_3|~t))
P(medicine) = 4/10, log(4/10) ~ -0.4
P(energy) = 2/10, log(2/10) ~ -0.7
P(environment) = 3/10, log(3/10) ~ -0.5
P(nuclear) = 3/10 = 0.3
P(~nuclear) = 7/10 = 0.7
P(medicine|nuclear) = 2/3, log(2/3) ~ -0.18
P(energy|nuclear) = 1/3, log(1/3) ~ -0.48
P(environment|nuclear) = 0/3; the term 0 * log 0 is taken as 0
P(medicine|~nuclear) = 2/7, log(2/7) ~ -0.54
P(energy|~nuclear) = 1/7, log(1/7) ~ -0.85
P(environment|~nuclear) = 4/7, log(4/7) ~ -0.24
IG(nuclear) = -(0.4 * -0.4 + 0.2 * -0.7 + 0.3 * -0.5)
    + 0.3 * (0.67 * -0.18 + 0.33 * -0.48 + 0)
    + 0.7 * (0.29 * -0.54 + 0.14 * -0.85 + 0.57 * -0.24)
    = 0.45 - 0.3 * 0.28 - 0.7 * 0.41 = 0.45 - 0.37 = 0.08
(The first term, the entropy of the categories, is
0.16 + 0.14 + 0.15 = 0.45; note that IG is never negative.)
P(current) = 7/10
P(~current) = 3/10
P(medicine|current) = 4/7, log(4/7) ~ -0.24
P(energy|current) = 1/7, log(1/7) ~ -0.85
P(environment|current) = 2/7, log(2/7) ~ -0.54
P(medicine|~current) = 0/3; the term 0 * log 0 is taken as 0
P(energy|~current) = 1/3, log(1/3) ~ -0.48
P(environment|~current) = 2/3, log(2/3) ~ -0.18
IG(current) = 0.45 - 0.7 * 0.41 - 0.3 * 0.28 = 0.45 - 0.37 = 0.08
What if 'current' occurred in 10 documents?
P(treatment) = 2/10
P(~treatment) = 8/10
P(medicine|treatment) = 2/2, log 1 = 0
P(energy|treatment) = 0/2; the term 0 * log 0 is taken as 0
P(environment|treatment) = 0/2; the term 0 * log 0 is taken as 0
P(medicine|~treatment) = 2/8, log(0.25) ~ -0.60
P(energy|~treatment) = 2/8, log(0.25) ~ -0.60
P(environment|~treatment) = 4/8, log(0.5) ~ -0.30
IG(treatment) = 0.45 - 0.2 * 0 - 0.8 * 0.45 = 0.45 - 0.36 = 0.09
What if 'treatment' occurred in all of the 4 medicine documents?
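The IG computation can be sketched as a small function; logs are
base 10 as elsewhere in these notes, each 0 * log 0 term is taken
as 0, and the function and variable names are our own:

```python
import math

def plogp(p):
    """p * log10(p), with the usual convention 0 * log 0 = 0."""
    return p * math.log10(p) if p > 0 else 0.0

def info_gain(p_cats, p_t, p_cats_t, p_cats_not_t):
    """Information gain of a term, following the formula above.
    p_cats: P(c_i); p_t: P(t);
    p_cats_t: P(c_i|t); p_cats_not_t: P(c_i|~t)."""
    gain = -sum(plogp(p) for p in p_cats)            # category entropy
    gain += p_t * sum(plogp(p) for p in p_cats_t)
    gain += (1 - p_t) * sum(plogp(p) for p in p_cats_not_t)
    return gain

# the 'treatment' example (categories: medicine, energy, environment)
ig_treatment = info_gain([0.4, 0.2, 0.3], 0.2,
                         [2/2, 0/2, 0/2],
                         [2/8, 2/8, 4/8])  # ~ 0.09
```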
Mutual information (MI)
-----------------------
A: the number of times term t and category c co-occur
B: the number of times t occurs without c
C: the number of times c occurs without t
N: the total number of documents (=10)
MI(t,c) = log ((A * N) / ((A + C) * (A + B)))
MI(t) = max {MI(t,c_i)} (for instance; an average over the
categories could also be used)
MI(nuclear, medicine):
A=2, B=1, C=2
MI(nuclear,medicine) = log ((2 * 10) / (4 * 3)) = log 1.67 ~ 0.22
MI(nuclear, energy):
A=1, B=2, C=1
MI(nuclear,energy) = log ((1 * 10) / (2 * 3)) = log 1.67 ~ 0.22
MI(nuclear, environment):
A=0, B=3, C=4
MI(nuclear,environment) = log 0 = undefined (such pairs are
skipped when taking the maximum)
MI(nuclear) = 0.22
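The same counts can be plugged into a small function (a sketch;
`None` stands in for the undefined log 0 case, and the names are
our own):

```python
import math

def mi(A, B, C, N):
    """MI(t,c) = log10((A * N) / ((A + C) * (A + B))).
    Returns None when A = 0, since log 0 is undefined."""
    if A == 0:
        return None
    return math.log10((A * N) / ((A + C) * (A + B)))

# 'nuclear' against the three categories, N = 10 documents
mi_med = mi(2, 1, 2, 10)  # log(20/12) ~ 0.22
mi_ene = mi(1, 2, 1, 10)  # log(10/6)  ~ 0.22
mi_env = mi(0, 3, 4, 10)  # undefined
mi_nuclear = max(v for v in (mi_med, mi_ene, mi_env) if v is not None)
```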
The Rocchio method
------------------
In the Rocchio categorization method, the categories are
represented by vectors similar to the document vectors. That is, if
in our example, without dimensionality reduction, we have
118 terms, the categories medicine, energy, and environment
are represented by vectors of 118 elements.
Let POS_i and NEG_i be the set of positive and negative
examples, respectively, for the category i.
In our dataset, POS_medicine includes the documents Doc1-Doc4
and NEG_medicine includes the documents Doc5-Doc10, hence
|POS_medicine| = 4 and |NEG_medicine| = 6. The weights of
the category vector c_medicine can now be set using the formula:
w_k_medicine = B/|POS_medicine| * (w_k1 + w_k2 + w_k3 + w_k4)
-A/|NEG_medicine| * (w_k5 + w_k6 + w_k7 + w_k8 + w_k9 + w_k10)
where w_k_medicine is the weight of the kth term/element of
the medicine vector, and similarly w_k1 is the weight of the
kth term of Doc1. In this formula, B and A are control parameters
that set the relative importance of the positive and the negative
examples, respectively. For instance, if we set B=2 and A=1,
the negative examples do not have as strong an influence as
the positive examples.
In the positive sample documents for medicine, the following weights
are given for 'treatment' (term 110):
w_110_1 = 0
w_110_2 = 0.7
w_110_3 = 0.7
w_110_4 = 0
As 'treatment' does not occur in the negative examples, all the
corresponding weights are zero. Hence, the weight for the
110th term of the vector for 'medicine' is:
w_110_medicine = 2/4 * (0 + 0.7 + 0.7 + 0)
- 1/6 * 0 = 0.7
Similarly, we can calculate a weight for the term 'nuclear',
which also occurs in one negative example:
w_70_1 = 0.5
w_70_2 = 0
w_70_3 = 0
w_70_4 = 0.5
w_70_6 = 0.5
w_70_medicine = 2/4 * (0.5 + 0.5) - 1/6 * 0.5 = 0.5 - 0.08 = 0.42
In this way, weights are calculated for each term in the term space,
and the resulting vector is used to represent the corresponding
category. When new documents are classified, the similarity between
the new document vector and each category vector is calculated,
e.g., using the cosine measure (the dot product of the
length-normalized vectors). The new document is classified
under the category whose vector is the most similar.
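The whole method can be sketched in a few lines; `rocchio` builds
a category vector from positive and negative example vectors
(with B=2 and A=1 as in the worked example), and `cosine` gives
the similarity used at classification time. All names are our own:

```python
import math

def rocchio(pos, neg, B=2.0, A=1.0):
    """Category vector: B/|POS| * sum(pos) - A/|NEG| * sum(neg),
    computed element by element."""
    dims = len(pos[0])
    return [B * sum(v[k] for v in pos) / len(pos)
            - A * sum(v[k] for v in neg) / len(neg)
            for k in range(dims)]

def cosine(u, v):
    """Cosine similarity: dot product of the length-normalized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    lu = math.sqrt(sum(a * a for a in u))
    lv = math.sqrt(sum(b * b for b in v))
    return dot / (lu * lv) if lu > 0 and lv > 0 else 0.0

# the 'nuclear' weights (term 70) in Doc1-Doc4 (positive examples)
# and Doc5-Doc10 (negative examples), reduced to one dimension
w = rocchio([[0.5], [0.0], [0.0], [0.5]],
            [[0.0], [0.5], [0.0], [0.0], [0.0], [0.0]])
# w[0] = 2/4 * 1.0 - 1/6 * 0.5 ~ 0.42
```

A new document would then be assigned to the category whose
vector maximizes `cosine(document_vector, category_vector)`.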