Both the Boguraev-Kennedy method and the MEAD method use cross-sentence dependencies to decide which fragments of the text are important. Sketch a hybrid method that combines these methods. For instance, consider how the Boguraev-Kennedy method could be modified to work in the multiple-document summarization case, and/or how the ideas of the MEAD method could be used in the single-document summarization of the Boguraev-Kennedy method. You can replace features with your own favorites and also make other modifications/simplifications to the methods, if you like.
Before outlining a new method, let us first look at the MEAD and Boguraev-Kennedy methods:
MEAD is a multi-document summarizer: it employs a centroid vector, a collection of the statistically most important terms in the documents of the cluster being summarized. The sentences in the documents are evaluated on three features:
Centroid value: a sort of "inner product" of the centroid and the sentence. The more words the sentence shares with the cluster centroid, the higher the score. Words that do not occur in the centroid have no impact.
Positional value: the first sentence of the document gets the highest value; the values of subsequent sentences are progressively discounted.
First-sentence overlap: since the first sentence is taken to be the most important, each subsequent sentence is compared to it; the more terms a sentence shares with the first one, the higher its score.
These values are compiled into a score via linear combination:
score = a*C + b*P + c*F,
where C, P, and F are the three values and a, b, and c are the corresponding weights. Each sentence is assigned such a score. The score is then decreased by a "redundancy penalty": the more a sentence overlaps with a higher-scoring sentence, the more its value is reduced, since the higher-scoring sentence already conveys that information and repeating it would be redundant. Finally, the sentences are ranked by score.
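A minimal sketch of this scoring scheme in Python (the tokenization, the plain term-frequency centroid, and all names below are my own simplifications; MEAD proper uses TF*IDF weights and tuned coefficients):

```python
import re
from collections import Counter

def tokenize(text):
    # Crude word tokenizer: lowercase alphabetic tokens only.
    return re.findall(r"[a-z]+", text.lower())

def centroid(documents, top_n=10):
    # Centroid: the most frequent terms across the cluster, with their
    # counts as weights (plain TF here; MEAD proper uses TF*IDF).
    counts = Counter(t for doc in documents for t in tokenize(doc))
    return dict(counts.most_common(top_n))

def mead_scores(sentences, cent, a=1.0, b=1.0, c=1.0):
    # score = a*C + b*P + c*F for the sentences of one document.
    first = set(tokenize(sentences[0]))
    n = len(sentences)
    c_max = max(cent.values())
    scores = []
    for i, sent in enumerate(sentences):
        toks = tokenize(sent)
        C = sum(cent.get(t, 0.0) for t in toks)   # centroid value
        P = (n - i) / n * c_max                   # positional value, discounted
        F = len(first & set(toks))                # first-sentence overlap
        scores.append(a * C + b * P + c * F)
    return scores

def redundancy_penalty(sentence, chosen):
    # Overlap with the most similar already-selected sentence;
    # subtract this (times a weight) from the sentence's score.
    toks = set(tokenize(sentence))
    return max((len(toks & set(tokenize(s))) for s in chosen), default=0)
```

For example, scoring the three sentences of a toy document against a centroid {"bomb": 2, "exploded": 1} ranks the first, centroid-rich sentence highest.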
BK summarizes one document at a time. It produces a capsule overview, a normalized representation of the document, based on topic stamps: phrasal units that portray the content of the document. The representation is not necessarily readable text, but rather a collection of excerpts of the highlights of the text.
In order to find the topic stamps, BK relies on the (lexico-syntactic) similarity of "technical terms" that represent topicality, and on discourse properties that reflect the importance of the phrasal units. For identifying and extracting the technical terms, there are algorithms that show moderate success, but some challenges remain.
Now, a hybrid of the two could have features like the following:
In exercises 2 and 3 we try to figure out what might happen in the lexical analysis and name recognition phases of an information extraction process. Study the following document fragments.
Police sources have reported that unidentified individuals planted a bomb in front of a Mormon Church in Talcahuano District. The bomb, which exploded and caused property damage worth 50,000 pesos, was placed at a chapel of the Church of Jesus Christ of Latter-Day Saints located at No 3856 Gomez Carreno Street. Prosecutor Juan Carbone Herrera requested the 25 years imprisonment for General Rolando Cabezas Alarcon of the Republican Guard for ordering the shooting of 124 of the San Pedro prison inmates. Last night in San Clemente District, 9 km north of Pisco, a group of terrorists dynamited machinery belonging to Albolones Peruanos, Inc.
Give examples of information that is available in the lexical analysis of these sentences. You can assume that some language analyser or special dictionaries are available. You don't have to analyse all the text; just give some examples of the output of the lexical analysis phase on the sample text fragments.
You can find examples and descriptions on what language analysers can do on the web pages of the following language analysers:
Connexor Machinese Phrase Tagger
Lingsoft ENGTWOL and FINTWOL
ENGTWOL produces the following analysis (ENGTWOL tags enclosed in angle brackets resemble HTML and do not show here):
"police" N NOM SG/PL @NN "source" N NOM PL @SUBJ "have" V PRES -SG3 VFIN @+FAUXV "report" PCP2 @-FMAINV "that" CS @CS "unidentified" A ABS @AN "individual" N NOM PL @SUBJ "plant" V PAST VFIN @+FMAINV "a" DET CENTRAL ART SG @DN "bomb" N NOM SG @OBJ "in=front=of" PREP @NOM @ADVL "a" DET CENTRAL ART SG @DN "mormon" N NOM SG @NN "church" N NOM SG @P "in" PREP @NOM @ADVL "talcahuano" N NOM SG @NN "district" N NOM SG @P
With the lexical information at hand, one can do various things.
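For example, the readings could be parsed into (base form, tags) pairs and filtered by part of speech. A minimal sketch, assuming the one-reading-per-line format shown above (the function names and the noun-extraction task are my own choices):

```python
import re

# One ENGTWOL reading per line: a quoted base form followed by
# morphosyntactic tags, e.g. '"bomb" N NOM SG @OBJ'.
READING = re.compile(r'"([^"]+)"\s+(.+)')

def parse_reading(line):
    # Return (base form, list of tags), or None for malformed lines.
    m = READING.match(line.strip())
    if not m:
        return None
    return m.group(1), m.group(2).split()

def nouns(lines):
    # Collect base forms analysed as nouns (tag N); these could serve,
    # for instance, as candidate "technical terms".
    result = []
    for line in lines:
        parsed = parse_reading(line)
        if parsed and "N" in parsed[1]:
            result.append(parsed[0])
    return result
```

Running `nouns` over the sample output above would yield base forms such as "police", "source", "bomb", "church", and "district".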
Give examples of names and other special forms in the sample fragments. Try to formulate informal rules for finding the names and special forms, using the knowledge you found above in (2). You can also try to formulate the rules using regular expressions.
Last night in San Clemente District, 9 km north of Pisco, a group of terrorists dynamited machinery belonging to Albolones Peruanos, Inc.

Looking the place names up in a gazetteer file (cities.txt) gives, for example:

cities.txt:san_pedro buenos_aires argentina southern_south_america latin_america
cities.txt:san_pedro jujuy argentina southern_south_america latin_america
cities.txt:san_pedro santa_cruz bolivia central_south_america latin_america
cities.txt:san_pedro belize belize central_america latin_america
cities.txt:san_pedro metropolitana chile southern_south_america latin_america
cities.txt:san_pedro antioquia colombia northern_south_america latin_america
cities.txt:san_pedro sucre colombia northern_south_america latin_america
cities.txt:san_pedro valle_del_cauca colombia northern_south_america latin_america
cities.txt:san_pedro alajuela costa_rica central_america latin_america
cities.txt:san_pedro heredia costa_rica central_america latin_america
cities.txt:san_pedro san_jose costa_rica central_america latin_america
cities.txt:san_pedro coahuila mexico central_america latin_america
cities.txt:san_pedro southern_tagalog philippines southeastern_asia asia
cities.txt:pisco ica peru western_south_america latin_america

The names of companies are generally a bit trickier than this. To collapse all the different instances of Albolones Peruanos into one, one would probably need to track instances with and without "Inc.", possibly with or without "Peruanos". There could also be various names of divisions etc. that would need to be stripped.
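Such rules can be approximated with regular expressions. A rough, illustrative attempt (the patterns, suffix list, and function names are my own assumptions, not a complete rule set; single capitalised words such as sentence-initial "Last" will come out as false positives):

```python
import re

# A "name" here is a run of capitalised words, optionally followed by a
# corporate suffix such as ", Inc.". This is deliberately naive.
NAME = re.compile(r'[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*(?:,\s*Inc\.)?')

def find_names(text):
    # Return every maximal match of the NAME pattern in the text.
    return [m.group(0) for m in NAME.finditer(text)]

def normalize_company(name):
    # Collapse variants like "Albolones Peruanos, Inc." and
    # "Albolones Peruanos" onto a single key by stripping suffixes.
    return re.sub(r',?\s*(Inc\.|Corp\.|Ltd\.)$', '', name)
```

On the fragment above, `find_names` would pick up "San Clemente District", "Pisco", and "Albolones Peruanos, Inc.", and `normalize_company` would reduce the last of these to "Albolones Peruanos"; person names with titles ("General Rolando Cabezas Alarcon") would need further rules keyed on trigger words such as "General" or "Prosecutor".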