Processing of Large Document Collections: Exercise 4.

Solution for Exercise 4.1

Both the Boguraev-Kennedy method and the MEAD method use cross-sentence dependencies to decide which fragments of the text are important. Sketch a hybrid method that combines these methods. For instance, consider how the Boguraev-Kennedy method could be modified to work in the multiple-document summarization case, and/or how the ideas of the MEAD method could be used in the single-document summarization of the Boguraev-Kennedy method. You can replace features with your own favorites and also make other modifications/simplifications to the methods, if you like.

Before outlining a new method, let us have a look at the MEAD and Boguraev-Kennedy methods:

MEAD

MEAD is a multi-document summarizer. It employs a centroid vector, a collection of the important terms in the documents of the cluster being summarized. The importance is based on statistical observations. The sentences in the documents are evaluated on three features:

Centroid value: a sort of "inner product" of the centroid and the sentence. The more words the sentence shares with the cluster centroid, the higher the score. Words that do not occur in the cluster centroid have no impact.
Positional value: the first sentence of the document has the greatest value. The values of the subsequent sentences are discounted.
First-sentence overlap: the first sentence is the most important. The subsequent sentences are compared to it: the more terms a sentence has in common with the first one, the higher the score.

These values are compiled into a score via linear combination:

score = a*C + b*P + c*F,

where C, P, and F are the three values and a, b, and c are the corresponding weights. Each sentence is assigned such a score. The score is then decreased by a "redundancy penalty": the more a sentence overlaps with a higher-scoring sentence, the more its value is reduced. The higher-scoring sentence already contains the information, so repeating it would be redundant. Finally, the sentences are ranked by their scores.
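The linear combination above can be sketched in a few lines of Python. This is a simplified illustration, not the actual MEAD implementation: the function names, the default weights, the linear positional discount, and the toy sentences and centroid are all assumptions made for the example, and the redundancy penalty is left out.

```python
def centroid_value(sentence, centroid):
    """Sum the centroid weights of the distinct words in the sentence."""
    return sum(centroid.get(word, 0.0) for word in set(sentence))

def positional_value(index, n_sentences):
    """1.0 for the first sentence, linearly discounted after that (assumed)."""
    return (n_sentences - index) / n_sentences

def first_sentence_overlap(sentence, first):
    """Number of distinct words shared with the first sentence."""
    return len(set(sentence) & set(first))

def mead_scores(sentences, centroid, a=1.0, b=1.0, c=1.0):
    """score = a*C + b*P + c*F for every sentence of one document."""
    n = len(sentences)
    return [a * centroid_value(s, centroid)
            + b * positional_value(i, n)
            + c * first_sentence_overlap(s, sentences[0])
            for i, s in enumerate(sentences)]

# Toy data, made up for illustration only.
sentences = [
    ["bomb", "exploded", "church"],
    ["police", "reported", "bomb"],
    ["weather", "nice"],
]
centroid = {"bomb": 2.0, "church": 1.0}

scores = mead_scores(sentences, centroid)
ranking = sorted(range(len(sentences)), key=lambda i: -scores[i])
print(ranking)  # [0, 1, 2]
```

Changing a, b, and c shifts the balance between the three features; with b = c = 0 the ranking depends on the centroid alone.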

Boguraev-Kennedy (BK)

BK summarizes one document at a time. It produces a capsule overview - a normalized representation of the document - based on topic stamps: phrasal units that portray the content of the document. The representation is not necessarily readable text, but rather a collection of excerpts of the highlights of the text.

In order to find the topic stamps, BK relies on the (lexico-syntactic) similarity of "technical terms" that represent topicality, and on discourse properties that reflect the importance of the phrasal units. There are algorithms that identify and extract technical terms with moderate success, but some challenges remain:

Undergeneration: technical terms alone are not sufficient descriptors of the content. BK suggests an exhaustive approach: in the phrasal analysis, it considers every expression in which a participant of the event occurs. Such expressions can be pronouns, reduced descriptions, or complex nominals, for instance.
Overgeneration: extending the phrasal analysis beyond technical terms results in a huge number of candidates. Hence, BK attempts to link the phrases referring to the same discourse object via anaphora resolution. In other words, the interpretation of a sentence quite often depends on the preceding sentences and words. BK wants to collapse all the different ways of referring to the same thing into a single reference and thus reduce the number of potential topic stamps.
Differentiation: documents about the same topic are likely to produce similar topic stamps even though they might be about different sub-topics or different aspects. To this end, BK ranks the terms by their saliency. Two documents will then produce the same terms but in a different order.

Now, a hybrid of these two could have characteristics and features like:

The discourse segmentation and phrasal analysis of BK is carried out for a cluster of documents. Once anaphora resolution has collapsed the references, redundancy increases the importance of a phrasal unit. Positional weighting could be applied (taking advantage of the document structure).
The centroid could be composed of topic stamps. There could also be separate centroids per paragraph or section.
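One possible reading of these two ideas, sketched under strong assumptions: each document in the cluster has already been reduced by a BK-style analysis to a list of (topic stamp, saliency) pairs with references collapsed. That analysis is not implemented here, and the function name, the `top_k` parameter, and the numbers are hypothetical. Summing the saliency values over the cluster makes a topic stamp that recurs across documents more important, and the top-ranked stamps form the centroid.

```python
from collections import defaultdict

def topic_stamp_centroid(documents, top_k=3):
    """Build a cluster centroid from per-document (topic stamp, saliency) pairs."""
    totals = defaultdict(float)
    for stamps in documents:
        for stamp, saliency in stamps:
            # A stamp that recurs across documents accumulates weight.
            totals[stamp] += saliency
    # Highest total first; ties are broken alphabetically for determinism.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return dict(ranked[:top_k])

# Made-up saliency values for two documents about the same event.
documents = [
    [("bomb", 0.9), ("church", 0.5)],
    [("bomb", 0.8), ("police", 0.4)],
]
print(topic_stamp_centroid(documents, top_k=2))
```

The resulting dictionary could then be used as the centroid in a MEAD-style sentence score.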

In exercises 2 and 3 we try to figure out what might happen in the lexical analysis and name recognition phases of an information extraction process. Study the following document fragments.

Police sources have reported that unidentified individuals planted a
bomb in front of a Mormon Church in Talcahuano District. The bomb,
which exploded and caused property damage worth 50,000 pesos, was
placed at a chapel of the Church of Jesus Christ of Latter-Day Saints
located at No 3856 Gomez Carreno Street.

Prosecutor Juan Carbone Herrera requested the 25 years imprisonment
for General Rolando Cabezas Alarcon of the Republican Guard for
ordering the shooting of 124 of the San Pedro prison inmates.

Last night in San Clemente District, 9 km north of Pisco, a
group of terrorists dynamited machinery belonging to Albolones
Peruanos, Inc.

Solution for Exercise 4.2

Give examples of information that is available in the lexical analysis of these sentences. You can assume that some language analyser or special dictionaries are available. You don't have to analyse all the text, just give some examples of the output of the lexical analysis phase using the sample text fragments.

You can find examples and descriptions of what language analysers can do on the web pages of the following language analysers:

Connexor Machinese Phrase Tagger
- Description and Demo
Lingsoft ENGTWOL and FINTWOL
- Demo
- List of morphological tags

ENGTWOL produces the following (the non-HTML tags do not show):

	"police"       N NOM SG/PL  @NN
	"source"       N NOM PL  @SUBJ
	"have"         V PRES -SG3 VFIN  @+FAUXV
	"report"       PCP2  @-FMAINV
	"that"         CS @CS
	"unidentified" A ABS  @AN
	"individual"   N NOM PL  @SUBJ
	"plant"        V PAST VFIN @+FMAINV
	"a"            DET CENTRAL ART SG @DN
	"bomb"         N NOM SG  @OBJ
	"in=front=of"  PREP  @NOM @ADVL
	"a"            DET CENTRAL ART SG @DN
	"mormon"       N NOM SG  @NN
	"church"       N NOM SG  @P
	"in"           PREP  @NOM @ADVL
	"talcahuano"  N NOM SG @NN
	"district"     N NOM SG  @P

With the lexical information at hand, one can do various things:

We have the part of speech for each word, so 'report' can be disambiguated (whether a participle is a part of speech is another question, and what a part of speech is, yet another). An external ontology, such as WordNet, can be used to find relations between words; it usually requires part-of-speech information for the words.
Locations relate to other locations via a geographical ontology. Disambiguation may be difficult (Kingston?).
Named entities often start with an upper-case letter, and sometimes the type of the entity can be resolved with the parsing information.
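As a small illustration of working with this lexical information, the sketch below parses analyser lines in the layout printed above into a base form, morphological tags, and syntactic tags (the ones starting with "@"), and then picks out the nouns. The function names and the dictionary layout are assumptions made for the example; they are not part of ENGTWOL.

```python
def parse_tagged_line(line):
    """Split one analyser line into base form, morphological and syntactic tags."""
    parts = line.split()
    base = parts[0].strip('"')
    morph = [p for p in parts[1:] if not p.startswith("@")]
    syntax = [p for p in parts[1:] if p.startswith("@")]
    return {"base": base, "morph": morph, "syntax": syntax}

def nouns(lines):
    """Return the base forms of tokens whose first morphological tag is N."""
    parsed = [parse_tagged_line(line) for line in lines if line.strip()]
    return [token["base"] for token in parsed if token["morph"][:1] == ["N"]]

# Four lines copied from the analyser output above.
sample = [
    '"police"       N NOM SG/PL  @NN',
    '"source"       N NOM PL  @SUBJ',
    '"have"         V PRES -SG3 VFIN  @+FAUXV',
    '"report"       PCP2  @-FMAINV',
]
print(nouns(sample))  # ['police', 'source']
```

The same parsed structure could feed a WordNet lookup, which typically needs the part of speech together with the base form.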

Solution for Exercise 4.3

Give examples of names and other special forms in the sample fragments. Try to formulate informal rules for finding the names and special forms, using the knowledge you found above in (2). You can also try to formulate the rules using regular expressions.

"Police sources have reported that unidentified individuals planted a bomb in front of a Mormon Church in Talcahuano District."

"Police" could be seen as the name of an institution we want to extract, and it is fairly easy to recognize (unless it refers to the band). "Talcahuano District" contains a word (District) that refers to a geographical area, and it is preceded by the preposition 'in'. Hence, it is a location name. The phrase "a Mormon Church" will probably end up being tagged as a name, but the indefinite article "a" in front of it leaves it unclear which church (of all the Mormon churches) is meant.

"The bomb, which exploded and caused property damage worth 50,000 pesos, was placed at a chapel of the Church of Jesus Christ of Latter-Day Saints located at No 3856 Gomez Carreno Street."

Numbers are seldom useful in Information Retrieval unless we can associate them with a noun that indicates what they quantify. Hence, "50,000 pesos" is an amount of money. "the Church of Jesus Christ of Latter-Day Saints" is a chain of genitive structures. It could include "a chapel of", but that is not really part of the name. Finally, there is the address, "No 3856 Gomez Carreno Street".
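The number-plus-noun rule can be written as a regular expression. The sketch below is only one way to do it: the list of unit words is an assumption chosen to cover the sample fragments, and a real system would take the units from a dictionary.

```python
import re

# Assumed, illustrative list of unit words.
UNITS = ("pesos", "km", "years")

# Digits with optional thousands separators, then whitespace, then a unit.
QUANTITY = re.compile(r"\b(\d+(?:,\d{3})*)\s+(" + "|".join(UNITS) + r")\b")

def find_quantities(text):
    """Return (number, unit) pairs for numbers directly followed by a known unit."""
    return [(int(number.replace(",", "")), unit)
            for number, unit in QUANTITY.findall(text)]

text = ("The bomb, which exploded and caused property damage worth 50,000 pesos, "
        "was placed at a chapel of the Church of Jesus Christ of Latter-Day Saints "
        "located at No 3856 Gomez Carreno Street.")
print(find_quantities(text))  # [(50000, 'pesos')]
```

The street number 3856 is not reported, because no unit word follows it.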

"Prosecutor Juan Carbone Herrera requested the 25 years imprisonment for General Rolando Cabezas Alarcon of the Republican Guard for ordering the shooting of 124 of the San Pedro prison inmates."

The title 'prosecutor' followed by a sequence of strings with an upper-case first letter is typically a title-name combination. The same goes for 'general'. Working out what the number '124' refers to requires that we find the first plural noun that follows 'of the San Pedro...'. 'San Pedro' would be easy to recognize as a location, but disambiguation is a bit more tricky ("tricky" is glib, really):

```
cities.txt:san_pedro    buenos_aires    argentina       southern_south_america  latin_america
cities.txt:san_pedro    jujuy   argentina       southern_south_america  latin_america
cities.txt:san_pedro    santa_cruz      bolivia central_south_america   latin_america
cities.txt:san_pedro    belize  belize  central_america latin_america
cities.txt:san_pedro    metropolitana   chile   southern_south_america  latin_america
cities.txt:san_pedro    antioquia       colombia        northern_south_america  latin_america
cities.txt:san_pedro    sucre   colombia        northern_south_america  latin_america
cities.txt:san_pedro    valle_del_cauca colombia        northern_south_america  latin_america
cities.txt:san_pedro    alajuela        costa_rica      central_america latin_america
cities.txt:san_pedro    heredia costa_rica      central_america latin_america
cities.txt:san_pedro    san_jose        costa_rica      central_america latin_america
cities.txt:san_pedro    coahuila        mexico  central_america latin_america
cities.txt:san_pedro    southern_tagalog        philippines     southeastern_asia       asia
```
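The two rules for this sentence, the title-name combination and the gazetteer lookup, can be sketched as follows. The title list, the function names, and the reading of the gazetteer columns as name, region, country are assumptions; the gazetteer lines are copied from the listing above.

```python
import re

# Assumed title list; a title followed by one or more capitalised words.
TITLE_NAME = re.compile(r"\b(Prosecutor|General)((?:\s+[A-Z][a-z]+)+)")

def find_titled_names(text):
    """Return (title, name) pairs found in the text."""
    return [(title, name.strip()) for title, name in TITLE_NAME.findall(text)]

def load_gazetteer(lines):
    """Map a place name to its candidate (region, country) readings."""
    gazetteer = {}
    for line in lines:
        # Drop the "cities.txt:" prefix, then split the columns on whitespace.
        fields = line.split(":", 1)[1].split()
        name, region, country = fields[0], fields[1], fields[2]
        gazetteer.setdefault(name, []).append((region, country))
    return gazetteer

def candidates(gazetteer, name, country=None):
    """Look up a name; optionally keep only readings in the given country."""
    found = gazetteer.get(name.lower().replace(" ", "_"), [])
    return [c for c in found if country is None or c[1] == country]

sentence = ("Prosecutor Juan Carbone Herrera requested the 25 years imprisonment "
            "for General Rolando Cabezas Alarcon of the Republican Guard for "
            "ordering the shooting of 124 of the San Pedro prison inmates.")
gazetteer = load_gazetteer([
    "cities.txt:san_pedro    jujuy   argentina       southern_south_america  latin_america",
    "cities.txt:san_pedro    santa_cruz      bolivia central_south_america   latin_america",
    "cities.txt:san_pedro    belize  belize  central_america latin_america",
])
print(find_titled_names(sentence))
print(candidates(gazetteer, "San Pedro"))
print(candidates(gazetteer, "San Pedro", country="bolivia"))
```

Without a country hint every reading of "San Pedro" is returned, which is exactly the disambiguation problem; a hint from the surrounding text narrows the list, and may also leave it empty if the gazetteer lacks the right entry.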

"Last night in San Clemente District, 9 km north of Pisco, a group of terrorists dynamited machinery belonging to Albolones Peruanos, Inc."

Temporal expressions such as "Last night" require the date of the utterance to be properly understood. Recognizing and formalizing them would require an automaton, for instance. "San Clemente District" is preceded by "in" and contains the word "District", both of which point to a location name. "9 km" is a typical numeral-measure pair. With "Pisco" our gazetteer is somewhat less verbose:
```
cities.txt:pisco        ica     peru    western_south_america latin_america
```
The names of companies are generally a bit more tricky than this. Collapsing all the different instances of Albolones Peruanos into one would probably mean tracking instances with and without "Inc.", possibly with or without "Peruanos". There could also be various names of divisions etc. that would need to be stripped.
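The location rule ("in" followed by capitalised words ending in "District") and the simplest part of company-name collapsing can be sketched with regular expressions. The keyword "District", the list of legal-form suffixes, and the function names are assumptions; variants without "Peruanos" and division names are not handled.

```python
import re

# "in", then one or more capitalised words, then the keyword "District".
IN_DISTRICT = re.compile(r"\bin\s+((?:[A-Z][a-z]+\s+)+District)\b")

# Assumed list of legal-form suffixes, with optional comma and full stop.
COMPANY_SUFFIX = re.compile(r",?\s+(Inc|Ltd|Corp)\.?$", re.IGNORECASE)

def find_districts(text):
    """Return location names matched by the 'in ... District' rule."""
    return IN_DISTRICT.findall(text)

def normalize_company(name):
    """Strip a trailing legal-form suffix and lower-case the name."""
    return COMPANY_SUFFIX.sub("", name.strip()).lower()

text = ("Last night in San Clemente District, 9 km north of Pisco, a group of "
        "terrorists dynamited machinery belonging to Albolones Peruanos, Inc.")
print(find_districts(text))                           # ['San Clemente District']
print(normalize_company("Albolones Peruanos, Inc."))  # albolones peruanos
```

Two mentions can then be treated as the same company when their normalized forms are equal.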

Last modified: Tue May 2 10:50:22 EEST 2006