Processing of Large Document Collections: Exercise 2.

Solution for Exercise 2.1

Compare the following classification problems.

Consider, for instance, the following aspects:

Spam detection

This can be seen as a binary classification problem: spam vs. not-spam. One needs to recognize unwanted documents (emails, posts on a forum, ...) without losing any of the relevant ones. Basically, one builds a classifier that is run on a stream of documents.

The training material consists of user-labeled documents. Typically, a learning process requires both positive and negative examples. In other words, it is difficult to learn what spam looks like without ever seeing anything other than spam.

The spam documents are likely to show some aberration in terms of the vocabulary or the distribution of term frequencies. References to human genitalia, pornography or business/mortgage proposals are presumably less frequent amongst the non-spam documents. Naturally, non-informative words (the, a, be, than, ...) would be equally common in both classes. There is a variety of methods for choosing the "best" terms for the classification problem [see, e.g., 1], most of which assign a larger weight to a term if it occurs mostly in one of the two classes (and not so much in the other one). Emails also carry metadata in addition to the content, and the classifier may look into the sender, receiver, cc (carbon copy) and title; this information could be fed to the classifier as well.
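
A rough illustration of such term weighting is sketched below: a toy log-odds score (not any particular method from [1]) computed over made-up example documents; the smoothing constant is likewise just an assumption for the example.

    import math
    from collections import Counter

    # Toy training documents (made up for the example): (text, label) pairs.
    docs = [
        ("cheap mortgage offer click now", "spam"),
        ("cheap pills and a business proposal", "spam"),
        ("meeting agenda for monday", "ham"),
        ("lecture notes for monday attached", "ham"),
    ]

    # Count in how many documents of each class a term occurs.
    counts = {"spam": Counter(), "ham": Counter()}
    for text, label in docs:
        for term in set(text.split()):
            counts[label][term] += 1

    def skew(term, smooth=0.5):
        """Higher score = the term occurs mostly in one of the two classes."""
        spam = counts["spam"][term] + smooth
        ham = counts["ham"][term] + smooth
        return abs(math.log2(spam / ham))

    vocabulary = set(counts["spam"]) | set(counts["ham"])
    for term in sorted(vocabulary, key=skew, reverse=True)[:5]:
        print(f"{term:12s} {skew(term):.2f}")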

There is an obvious trade-off between precision and recall. One can achieve perfect recall by considering all documents spam, but only at the cost of poor precision. If spam detection leads to a high number of false alarms, i.e., considers something spam when it is not, it performs poorly. Losing an important email is worse than letting some spam through, so high precision is more important than high recall. Yet, a spam filter with a very low recall is obviously useless.

The spammers tend to fill their emails with distracting content words, for instance total gibberish or rare medical terms. This simple method is used to mislead the classifiers: in the presence of rare terms, the classifier is less likely to label the document as spam. Another problem is the sheer volume and variety of spam.

Language detection

The purpose is to recognize the language in which a document is written. The categorization is thus multi-class, and the classes are naturally languages: Finnish, English, Swedish, ... Alternatively, the problem could be binary, e.g., Finnish vs. not-Finnish. Any written material qualifies as a document. Documents could also be multi-labeled, i.e., one is to recognize all the languages that appear in a document.

Obviously, documents which contain several languages are difficult to classify: a list of Finnish Eurovision candidates is in Finnish but all the names of the songs are in English.

Frequent terms (character strings) that occur in only one language are obviously good features, whereas words like 'into' and 'me' exist in both Finnish and English. Taking the n most frequent terms of each language (as represented by the documents) as the feature set might be a good start.
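
A minimal sketch of that frequent-term idea, with tiny made-up training samples per language; real language identifiers typically use character n-grams rather than whole words:

    from collections import Counter

    # Tiny made-up training samples; real systems would use much more text.
    training = {
        "finnish": "tämä on suomenkielinen esimerkki ja tässä on vielä toinen lause",
        "english": "this is an english example and here is yet another sentence",
    }
    N = 20  # keep the N most frequent terms of each language as its profile

    profiles = {
        lang: {term for term, _ in Counter(text.split()).most_common(N)}
        for lang, text in training.items()
    }

    def detect(text):
        """Pick the language whose frequent-term profile overlaps the text most."""
        words = set(text.split())
        return max(profiles, key=lambda lang: len(words & profiles[lang]))

    print(detect("tässä on esimerkki"))   # -> finnish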

Neither precision nor recall needs to be stressed.

The sample of a given language as represented by the documents might be somewhat small.

Author detection

As in language detection, this can be either multi-class (Aristotle, Hume, Leibniz, Plato, ...) or binary (Gartland, not-Gartland). Typically novels and poems have one author, but there have been joint efforts; research papers and articles often have multiple authors.

The material could be excerpts or full texts by novelists, philosophers, historians, or poets. It could also be blackmail notes, threats, memos, etc. (usually the same person does not produce a whole corpus of blackmail notes to learn from).

Assuming the data is in one language only, the common words would be pretty much the same across the texts. Older authors are, of course, more likely to use old, possibly obsolete words. Some authors might be more eloquent or flamboyant than others, and the vocabulary would reflect this. In addition to vocabulary, authors differ in terms of syntactic structure and stylistic devices (e.g., repetition). Plato, for example, writes dialogues (a form not widely used among philosophers). Parsers might experience some problems with poems, plays and some short stories (e.g., Beckett).

The relationship between precision and recall needs no adjusting - unless for some personal preference.

Authors change their style over their careers, and they might experiment with new forms. Recognition might be easier with authors who work distinctively in some particular genre. Since English authors all write in English, the vocabulary may vary more between books by the same author than between books by two different authors.


Solution for exercise 2.2

Assume that the effectiveness of a classifier is evaluated using a test set of 10 documents. In the following table, you can see the judgment as recorded in the test set (category/testset) and the decision of the classifier (category/classifier) for each document. What is the recall and precision of the classifier with respect to each of the categories (energy, environment, medicine) separately? Calculate the global recall and precision using both microaveraging and macroaveraging. Give also the combined measure F1.

Doc  category/testset  category/classifier
  1  energy            energy
  2  environment       medicine
  3  energy            energy
  4  medicine          environment
  5  medicine          medicine
  6  medicine          medicine
  7  energy            energy
  8  energy            energy
  9  environment       environment
 10  environment       energy

The somewhat standard way of evaluating effectiveness is to use precision and recall [2]:

                 relevant          non-relevant
retrieved        true positive     false positive
not retrieved    false negative    true negative

Adapting this table to the classification task, we get:

                       |TP_i|         (the number of correct classifications to class i)
     precision(i) = ---------------                       
                   |TP_i| + |FP_i|    (the number of classifications to class i)


                         |TP_i|         (the number of correct classifications to class i)
     recall(i)    = --------------                       
                    |TP_i| + |FN_i|     (the size of class i)
      

So, to this end we build a class-by-class matrix listing the distribution of classifications per class, as shown below. The columns stand for the labels as they appear in the test set; these are the real labels of the documents. The rows stand for the results as given by the system. Each cell stores the number of classifications: the energy-energy cell says how many real members of the class energy were classified as energy, and the total of that row says that the system assigned 5 documents to the class energy.

                         Corpus (labels)
                      energy  environment  medicine  total
System  energy             4            1         0      5
        environment        0            1         1      2
        medicine           0            1         2      3
        total              4            3         3     10
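
The same matrix can be built programmatically; a minimal sketch from the (test-set label, classifier decision) pairs above, with variable names chosen just for the example:

    from collections import defaultdict

    # (category/testset, category/classifier) pairs for the 10 documents.
    pairs = [
        ("energy", "energy"), ("environment", "medicine"), ("energy", "energy"),
        ("medicine", "environment"), ("medicine", "medicine"), ("medicine", "medicine"),
        ("energy", "energy"), ("energy", "energy"), ("environment", "environment"),
        ("environment", "energy"),
    ]
    classes = ["energy", "environment", "medicine"]

    # confusion[system_class][corpus_class] = number of such decisions
    confusion = defaultdict(lambda: defaultdict(int))
    for true_label, decision in pairs:
        confusion[decision][true_label] += 1

    for decision in classes:
        row = [confusion[decision][true_label] for true_label in classes]
        print(decision, row, "total", sum(row))
    # energy [4, 1, 0] total 5
    # environment [0, 1, 1] total 2
    # medicine [0, 1, 2] total 3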

Now the calculation of effectiveness is straightforward:

ENERGY:

                    4 (energy-energy)
      precision = --------------------- = 0.8
                    5 (total-energy) 

                    4 (energy-energy)
      recall =    --------------------- = 1.0
                    4 (energy-total)

ENVIRONMENT:

                    1 (env-env)
      precision = --------------------- = 0.5
                    2 (total-env) 

                    1 (env-env)
      recall =    --------------------- = 0.33
                    3 (env-total)


MEDICINE:

                    2 (med-med)
      precision = --------------------- = 0.67
                    3 (total-med) 

                    2 (med-med)
      recall =    --------------------- = 0.67
                    3 (med-total)

Micro-average calculates the average over all of the data, i.e., it considers all the decisions as a single group.

               4 + 1 + 2            7
p_micro = ---------------------  = --- = 0.7
          (4+1) + (1+1) + (1+2)     10

               4 + 1 + 2            7
r_micro = ---------------------  = --- = 0.7
          (4) + (1+1+1) + (1+2)     10

              2 * p_micro * r_micro     0.98
f_1_micro =  ----------------------- = ------ = 0.7
                p_micro + r_micro       1.4

Macro-averaging averages the per-class effectiveness values.

           4     1     2       24 + 15 + 20
          --- + --- + ---     ---------------
           5     2     3             30           59
p_macro = ---------------- = ----------------- = ---- = 0.656
                 3                   3            90

           4     1     2
          --- + --- + ---
           4     3     3          2
r_macro = -------------------  = --- = 0.667
                   3              3

              2 * p_macro * r_macro     0.875
f_1_macro =  ----------------------- = -------- = 0.661
                p_macro + r_macro       1.323
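
A minimal sketch that reproduces the per-class, micro-averaged and macro-averaged figures above; the confusion counts are typed in directly from the matrix, and the variable names are just for the example:

    # Confusion counts from the matrix above: confusion[system_class][corpus_class].
    classes = ["energy", "environment", "medicine"]
    confusion = {
        "energy":      {"energy": 4, "environment": 1, "medicine": 0},
        "environment": {"energy": 0, "environment": 1, "medicine": 1},
        "medicine":    {"energy": 0, "environment": 1, "medicine": 2},
    }

    tp = {c: confusion[c][c] for c in classes}
    found = {c: sum(confusion[c].values()) for c in classes}             # row totals
    real = {c: sum(confusion[s][c] for s in classes) for c in classes}   # column totals

    precision = {c: tp[c] / found[c] for c in classes}
    recall = {c: tp[c] / real[c] for c in classes}

    # Micro-averaging: pool all decisions together.
    p_micro = sum(tp.values()) / sum(found.values())
    r_micro = sum(tp.values()) / sum(real.values())
    f1_micro = 2 * p_micro * r_micro / (p_micro + r_micro)

    # Macro-averaging: average the per-class values.
    p_macro = sum(precision.values()) / len(classes)
    r_macro = sum(recall.values()) / len(classes)
    f1_macro = 2 * p_macro * r_macro / (p_macro + r_macro)

    print(round(p_micro, 3), round(r_micro, 3), round(f1_micro, 3))      # 0.7 0.7 0.7
    print(round(p_macro, 3), round(r_macro, 3), round(f1_macro, 3))      # 0.656 0.667 0.661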

Solution for exercise 2.3

Calculate information gain for the terms 'current', 'treatment', and 'network' in the collection of 10 documents. In the file, each document is marked-up with the category (medicine, energy, or environment). After each term, the first number means the number of the occurrences of this term in this document. The second number means the number of occurrences of this term in the entire collection. Use base 2 for the logarithms.

"Entropy measures the amount of information of a random variable. It is normally measured in bits (hence the log to the base 2), but using any other base yields only a linear scaling of the result." [3]

Entropy [4] can also be characterized as the uncertainty of a given random variable. If the values of the random variable are equally probable, the entropy is high. For example, a fair die has a probability of 1/6 for each outcome, which makes the random variable quite uncertain, and thus we need more bits to represent the results. On the contrary, if the probabilities are biased, for example, P(x=6) = 4/6 and P(x=1) = P(x=2) = P(x=3) = P(x=4) = P(x=5) = 1/15, the outcome is less uncertain: there is less entropy and thus we need fewer bits to code the information contained in the random variable. In the first case the entropy H(X) would be:

[Note: 'H' is Greek 'eta'.]

H(X) = - 0.167 * log(0.167) - ... - 0.167 * log(0.167)
     = - 6 * (1/6) * log(1/6)
     = log(6)
     = 2.585

So, it takes 2.585 bits on the average to code the outcome of a die: 000 = '1', 001 = '2', 010 = '3', 011 = '4', 100 = '5', 101 = '6'. We still have 110 and 111 left over, so we don't need all three bits -- on the average.

The latter case, where P(x=6) = 4/6 and the probability is 1/15 for the rest, would result in an entropy of:

H(X) = - 0.067 * log(0.067) - ... - 0.067 * log(0.067) - 0.667 * log(0.667)
     = - 5 * (1/15) * log(1/15) - (2/3) * log(2/3)
     = 1.302 + 0.390
     = 1.692

Now, we need only 1.692 bits on the average to represent the information contained in the random variable. The outcome is less uncertain, i.e., it is quite likely to be a '6', and thus a loaded die is less surprising and has less entropy.

Note: if the die always gives a '6', i.e., P(x=6) = 1 and P(x=1) = P(x=2) = P(x=3) = P(x=4) = P(x=5) = 0, there is no uncertainty about the outcome and therefore no entropy:

H(X) = - 5 * 0 - 1 * log(1) 
     = 0 - 1 * 0 
     = 0 
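
A minimal sketch of these entropy calculations; the helper function and the three example distributions simply mirror the text above:

    import math

    def entropy(probs, base=2):
        """H(X) = - sum p * log(p), with 0 * log(0) taken as 0."""
        return sum(-p * math.log(p, base) for p in probs if p > 0)

    print(round(entropy([1/6] * 6), 3))            # fair die      -> 2.585
    print(round(entropy([1/15] * 5 + [4/6]), 3))   # loaded die    -> 1.692
    print(round(entropy([0] * 5 + [1]), 3))        # always a '6'  -> 0.0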

If we consider the class of the document as a random variable with three possible values (energy, medicine, environment), we can calculate the entropy (or base entropy) as follows:

** Reading 10docs.txt for term and class frequencies.
** Frequencies calculated: total of 10 documents
        in class energy 2 documents
        in class medicine 4 documents
        in class environment 4 documents

** Resolving the base entropy with logbase of 2

        H(X) =  - SUM p(class) log(p(class))
             = - ( 0.2 * log(0.2) + 0.4 * log(0.4) + 0.4 * log(0.4) )
             = 1.52192809488736


Information gain (not related to semantics) is a measure of the change in uncertainty: it measures how much the entropy decreases once the distribution of a term is known. As a reminder:

P(c|t)   probability of class c in case we observe term t,
         i.e. which proportion of the documents where t occurs
         are of class c.
P(c|!t)  probability of class c in case we do not observe term t,
         i.e. which proportion of the documents where t does not
         occur are of class c.

            P(c,t)
P(c|t) = --------------
             P(t)
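
Combining these, the information gain of a term t is the base entropy minus the expected class entropy once the presence or absence of t is known; this is the form used in the calculations below:

     IG(t) = H(C) - P(t) * H(C|t) - P(!t) * H(C|!t)

where H(C|t) = - SUM_c P(c|t) log P(c|t), and similarly for H(C|!t).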

And now, the information gain values:


** Resolving the information gain of 'treatment' (occurs in 2 documents)
        P(treatment) = 0.2
        P(!treatment) = 0.8
        P(energy|treatment)     = 0 / 2 = 0
        P(energy|!treatment)    = 2 / 8 = 0.25
        P(environment|treatment)        = 0 / 2 = 0
        P(environment|!treatment)       = 4 / 8 = 0.5
        P(medicine|treatment)   = 2 / 2 = 1
        P(medicine|!treatment)  = 2 / 8 = 0.25
IG(treatment) = 1.52192809488736 + ( 0.2 * 0 ) + ( (1-0.2) * -1.5 ) =  0.321928094887362

** Resolving the information gain of 'current' (occurs in 9 documents)
        P(current) = 0.9
        P(!current) = 0.1
        P(energy|current)       = 1 / 9 = 0.111111111111111
        P(energy|!current)      = 1 / 1 = 1
        P(environment|current)  = 4 / 9 = 0.444444444444444
        P(environment|!current) = 0 / 1 = 0
        P(medicine|current)     = 4 / 9 = 0.444444444444444
        P(medicine|!current)    = 0 / 1 = 0
IG(current) = 1.52192809488736 + ( 0.9 * -1.39214722366453 ) + ( (1-0.9) * 0 ) =  0.268995593589281

** Resolving the information gain of 'network' (occurs in 1 document)
        P(network) = 0.1
        P(!network) = 0.9
        P(energy|network)       = 1 / 1 = 1
        P(energy|!network)      = 1 / 9 = 0.111111111111111
        P(environment|network)  = 0 / 1 = 0
        P(environment|!network) = 4 / 9 = 0.444444444444444
        P(medicine|network)     = 0 / 1 = 0
        P(medicine|!network)    = 4 / 9 = 0.444444444444444
IG(network) = 1.52192809488736 + ( 0.1 * 0 ) + ( (1-0.1) * -1.39214722366453 ) =  0.268995593589281
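
A minimal sketch that reproduces these information gain values; the per-class document counts are copied from the output above rather than parsed from 10docs.txt, and the names are chosen just for the example:

    import math

    def entropy(probs):
        return sum(-p * math.log2(p) for p in probs if p > 0)

    # Class sizes and, per term, the number of documents of each class containing it
    # (transcribed from the frequency output above).
    class_sizes = {"energy": 2, "medicine": 4, "environment": 4}
    term_counts = {
        "treatment": {"energy": 0, "medicine": 2, "environment": 0},
        "current":   {"energy": 1, "medicine": 4, "environment": 4},
        "network":   {"energy": 1, "medicine": 0, "environment": 0},
    }
    n_docs = sum(class_sizes.values())
    base_entropy = entropy([n / n_docs for n in class_sizes.values()])   # 1.522

    for term, counts in term_counts.items():
        df = sum(counts.values())                  # documents containing the term
        p_t = df / n_docs
        h_given_t = entropy([counts[c] / df for c in class_sizes])
        h_given_not_t = entropy([(class_sizes[c] - counts[c]) / (n_docs - df)
                                 for c in class_sizes])
        ig = base_entropy - p_t * h_given_t - (1 - p_t) * h_given_not_t
        print(f"IG({term}) = {ig:.3f}")
    # IG(treatment) = 0.322
    # IG(current) = 0.269
    # IG(network) = 0.269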

Solution for exercise 2.4

  1. Choose one of the texts: The 21st Century Belongs to... or L.A. Times Reorganizes...

    Give a summary of 5-10 sentences of the text.

  2. Compare your summary to the one generated by the FociSum summarization system http://www.cs.columbia.edu/~hjing/sumDemo/ (follow the link FociSum, then Examples).

    You can also try the same text with the MEAD summarizer.

FociSum on "The 21st Century Belongs to..."

9 4   Next Sunday, China's President, Jiang Zemin, arrives in the United States for an official visit.
15 3  "We are obsessively obsessed by China because of its huge numbers and rapid growth rates," said Jagdish Bhagwati, professor of economics at Columbia University.
7 2   BODY: LISTENING to President Clinton's speeches in Latin America last week, audiences might have concluded that Washington's economic hopes and preoccupations lay to the south.
32 2  By 2010, Clinton advisers say, United States trade with Latin America will surpass trade with Europe and Japan combined.
34 2  "If you look at U.S. policy toward Latin America, to a much, much larger extent, we have pressed for democracy," she said, adding that it might be time to revise old opinions in light of the currency turmoil in Southeast Asia.
48 2  When they do allow a U.S. brand name to be sold in China, they immediately ask for an export plan.
51 2  It is a hoax to say that U.S. jobs are dependent on China.

MEAD summarizer

[1, 1] LISTENING to President Clinton's speeches in Latin America last week audiences might have concluded that Washington's economic hopes and preoccupations lay to the south.
[1, 4] And chances are that visions of a vast Chinese market with a bottomless appetite for consumer goods -- or of a repressive and militaristic China -- will relegate every other place to the sidelines again.
[1, 5] As the summit meeting approaches television sets and newspaper columns seminar rooms and breakfast talkfests are already resounding with opinions about a nation that seems for good or evil to loom over the next century like no other country in the world.
[1, 7] Add to that the so-far seamless return to China of British Hong Kong an economic dynamo in its own right and it is hard to avoid the conclusion that China is a land of destiny writ large.
[1, 8] But some economists -- as well as politicians and foreign policy experts -- think this obsession with China is not healthy or smart not for foreign policy or even for trade and investment especially when other areas of the world are also enjoying robust economic growth.
[1, 15] Among the regions economists suggest as alternatives worth looking at are not just Latin America -- where the average growth rate has settled at just over 4 percent a climb out of negative growth two years ago -- but also South Asia dominated by India and Pakistan.
[1, 19] The nations of the European Union had an average growth rate of only 1.6 percent and the United States' gross domestic product increased by only 2.4 percent last year but few find either fact reason enough to avoid those nations.
[1, 20] Professor Bhagwati who toured Latin America recently said the region is reaping the benefits of democratic government a free climate for economic activity and a new respect for economists.
[1, 27] Susan Kaufman Purcell vice president of the Council of the Americas in New York said that Latin American countries have always faced harsher scrutiny than have China or other Asian nations -- and that this puts them at a disadvantage.
[1, 28] If you look at U.S. policy toward Latin America to a much much larger extent we have pressed for democracy she said adding that it might be time to revise old opinions in light of the currency turmoil in Southeast Asia.
[1, 29] There was this vision that Asia was a very stable place and that even though many of its governments were authoritarian it was perceived that somehow they had better business values than the Latin American governments.

The automatic summarization tends to:
  - use the sentences of the source text as they appear in the source,
  - prefer direct quotations from the source,
  - prefer sentences with proper names.
What would it require to 'understand' the text?

References

  1. Yiming Yang and Jan O. Pedersen, A Comparative Study on Feature Selection in Text Categorization. Proceedings of the Fourteenth International Conference on Machine Learning (ICML 1997), Nashville, Tennessee, USA, July 8-12. Morgan Kaufmann, San Francisco, CA, USA, 1997, 412-420.
  2. Allen Kent, Madeline M. Berry, Fred U. Luehrs, J. R. Perry, J. W. Perry, Machine literature searching VIII: Operational criteria for designing information retrieval systems. American Documentation, Vol. 6(2), 1955, 93-101.
  3. Christopher D. Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, Massachusetts, USA, 1999.
  4. Claude E. Shannon, The Mathematical Theory of Communication. University of Illinois Press, Urbana, Illinois, USA, 1998 (1949).