In this exercise, we study the AutoSlog-TS algorithm (Riloff, E.: "Automatically Generating Extraction Patterns from Untagged Text"). Assume our text collection contains the following documents. Explain the process of AutoSlog-TS on this data and give the ranking of the extraction patterns that are generated.
text 1; relevant
    s: A group of terrorists  v: attacked  do: a post  pp: in Nuevo Progreso.
text 2; relevant
    s: The National offices  v: were attacked  time: today.
    s: Unidentified individuals  v: detonated  do: a bomb.
    s: The bomb  v: destroyed  do: a car.
text 3; not relevant
    s: The Armed Forces units  v: killed  do: one rebel.
    s: They  v: destroyed  do: an underground hideout.
text 4; relevant
    s: Unidentified individuals  v: attacked  do: a high tension tower.
    s: They  v: destroyed  do: it.
text 5; not relevant
    s: The coca growers  v: protest  do: the destruction of their fields.
    s: The strike  v: is supported  pp: by the Shining Path.
AutoSlog is a system for dictionary construction. It forms extraction patterns automatically based on a set of heuristic rules. As input, the system is given text in which the target phrases are annotated with tags. The system relies on a parser to identify the syntactic constituents of each sentence, such as the subject, verb, direct object, and prepositional phrases.
There are three kinds of rules, based on the syntactic class of the target noun phrase. A rule is instantiated from the sentence in which the target phrase appears. For instance, "Juha Makkonen, the teaching assistant, was kidnapped for a ridiculous ransom" could result in the pattern "[SUBJ] was kidnapped" if the proper name was tagged as a target (as the [SUBJ] passive-verb rule matches).
AutoSlog-TS is an extension of AutoSlog. It generates an extraction pattern for every noun phrase in the training corpus (no tagging is needed) and then evaluates the patterns. Pattern extraction is carried out with 15 heuristic rules. Each pattern is ranked with a relevance rate, which is a conditional probability:
P( relevant text | text has pattern i ) =

      P( relevant text AND text has pattern i )
      -----------------------------------------
              P( text has pattern i )
This relevance rate is then weighted with the frequency of the pattern (provided that the rate exceeds 0.5; otherwise the score is zero):
Score( i ) = rel( i ) * log_2( freq( i ) )
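As a sketch (the function names are my own, not from the paper), the relevance rate and the score can be computed as:

```python
import math

def relevance(rel_freq, total_freq):
    """rel(i) = P(relevant text | text has pattern i)."""
    return rel_freq / total_freq

def score(rel_freq, total_freq):
    """Score(i) = rel(i) * log2(freq(i)), or 0 when rel(i) <= 0.5."""
    r = relevance(rel_freq, total_freq)
    return r * math.log2(total_freq) if r > 0.5 else 0.0

# e.g. a pattern found in 2 relevant texts out of 3 texts overall:
print(round(score(2, 3), 2))  # 1.06
```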
The rules are:
RULE                          EXAMPLE
----------------------------------------------------------------
 1. [SUBJ] passive-verb       [victim] was murdered
 2. [SUBJ] active-verb        [perp] bombed
 3. [SUBJ] verb infin.        [perp] attempted to kill
 4. [SUBJ] aux noun           [victim] was victim
 5. passive-verb [DOBJ]       killed [victim]   (** a hack! **)
 6. active-verb [DOBJ]        bombed [target]
 7. infin. [DOBJ]             to kill [victim]
 8. verb infin. [DOBJ]        tried to kill [victim]
 9. gerund [DOBJ]             killing [victim]
10. noun aux [DOBJ]           fatality was [victim]
11. noun prep [NP]            bomb against [target]
12. active-verb prep [NP]     killed with [instrument]
13. passive-verb prep [NP]    was aimed at [target]
14. [SUBJ] active-verb DOBJ   [perp] pillaged the village
15. infin. prep [NP]          to triumph in [loc]
Now, we apply the rules to the texts:
-------------
text 1; relevant
"A group of terrorists attacked a post in Nuevo Progreso."
 * [SUBJ] attacked                     (rule 2)
 * attacked [DOBJ]                     (rule 6)
 * a post in [NP]                      (rule 11)
 * [SUBJ] attacked a post              (rule 14)
-------------
text 2; relevant
"The National offices were attacked today."
 * [SUBJ] were attacked                (rule 1)
"Unidentified individuals detonated a bomb."
 * [SUBJ] detonated                    (rule 2)
 * [SUBJ] detonated a bomb             (rule 14)
 * detonated [DOBJ]                    (rule 6)
"The bomb destroyed a car."
 * [SUBJ] destroyed                    (rule 2)
 * [SUBJ] destroyed a car              (rule 14)
 * destroyed [DOBJ]                    (rule 6)
-------------
text 3; not relevant
"The Armed Forces units killed one rebel."
 * [SUBJ] killed                       (rule 2)
 * [SUBJ] killed one rebel             (rule 14)
 * killed [DOBJ]                       (rule 6)
"They destroyed an underground hideout."
 * [SUBJ] destroyed                    (rule 2)
 * [SUBJ] destroyed an underground hideout  (rule 14)
 * destroyed [DOBJ]                    (rule 6)
-------------
text 4; relevant
"Unidentified individuals attacked a high tension tower."
 * [SUBJ] attacked                     (rule 2)
 * attacked [DOBJ]                     (rule 6)
 * [SUBJ] attacked a high tension tower  (rule 14)
"They destroyed it."
 * [SUBJ] destroyed                    (rule 2)
 * [SUBJ] destroyed it                 (rule 14)
 * destroyed [DOBJ]                    (rule 6)
-------------
text 5; not relevant
"The coca growers protest the destruction of their fields."
 * [SUBJ] protest                      (rule 2)
 * [SUBJ] protest the destruction of their fields  (rule 14)
 * protest [DOBJ]                      (rule 6)
"The strike is supported by the Shining Path."
 * [SUBJ] is supported                 (rule 1)
 * is supported by [NP]                (rule 13)
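A minimal sketch of how patterns are instantiated from a parsed clause (the function and argument names are my own; only rules 1/2, 6, and 14 are covered, not the full rule set):

```python
def clause_patterns(verb, dobj=None, passive=False):
    """Instantiate extraction patterns for one parsed clause.

    Covers rule 1 ([SUBJ] passive-verb), rule 2 ([SUBJ] active-verb),
    rule 6 (active-verb [DOBJ]) and rule 14 ([SUBJ] active-verb DOBJ).
    """
    pats = ["[SUBJ] " + verb]                        # rule 1 or 2
    if dobj is not None and not passive:
        pats.append(verb + " [DOBJ]")                # rule 6
        pats.append("[SUBJ] " + verb + " " + dobj)   # rule 14
    return pats

print(clause_patterns("attacked", "a post"))
# ['[SUBJ] attacked', 'attacked [DOBJ]', '[SUBJ] attacked a post']
print(clause_patterns("were attacked", passive=True))
# ['[SUBJ] were attacked']
```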
And now we calculate the relevance rates:
rel( "[SUBJ] attacked" )                        = 2/2 = 1
rel( "[SUBJ] attacked a high tension tower" )   = 1/1 = 1
rel( "[SUBJ] attacked a post" )                 = 1/1 = 1
rel( "[SUBJ] detonated" )                       = 1/1 = 1
rel( "[SUBJ] detonated a bomb" )                = 1/1 = 1
rel( "[SUBJ] destroyed" )                       = 2/3 = 0.667
rel( "[SUBJ] destroyed it" )                    = 1/1 = 1
rel( "[SUBJ] destroyed an underground hideout" ) = 0/1 = 0
rel( "[SUBJ] destroyed a car" )                 = 1/1 = 1
rel( "[SUBJ] is supported" )                    = 0/1 = 0
rel( "[SUBJ] killed" )                          = 0/1 = 0
rel( "[SUBJ] killed one rebel" )                = 0/1 = 0
rel( "[SUBJ] protest" )                         = 0/1 = 0
rel( "[SUBJ] protest the destruction of their fields" ) = 0/1 = 0
rel( "[SUBJ] were attacked" )                   = 1/1 = 1
rel( "attacked [DOBJ]" )                        = 2/2 = 1
rel( "destroyed [DOBJ]" )                       = 2/3 = 0.667
rel( "detonated [DOBJ]" )                       = 1/1 = 1
rel( "killed [DOBJ]" )                          = 0/1 = 0
rel( "protest [DOBJ]" )                         = 0/1 = 0
rel( "a post in [NP]" )                         = 1/1 = 1
rel( "is supported by [NP]" )                   = 0/1 = 0
We need to calculate scores only for those patterns whose relevance rate exceeds 0.5 (otherwise the score is zero):
score( "[SUBJ] destroyed" )                     = 0.667 * log2(3) = 1.06
score( "destroyed [DOBJ]" )                     = 0.667 * log2(3) = 1.06
score( "[SUBJ] attacked" )                      = 1 * log2(2) = 1
score( "attacked [DOBJ]" )                      = 1 * log2(2) = 1
score( "[SUBJ] attacked a high tension tower" ) = 1 * log2(1) = 0
score( "[SUBJ] attacked a post" )               = 1 * log2(1) = 0
score( "[SUBJ] destroyed it" )                  = 1 * log2(1) = 0
score( "[SUBJ] destroyed a car" )               = 1 * log2(1) = 0
score( "[SUBJ] detonated" )                     = 1 * log2(1) = 0
score( "[SUBJ] detonated a bomb" )              = 1 * log2(1) = 0
score( "[SUBJ] were attacked" )                 = 1 * log2(1) = 0
score( "detonated [DOBJ]" )                     = 1 * log2(1) = 0
score( "a post in [NP]" )                       = 1 * log2(1) = 0
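The ranking can be checked with a short script (the counts are copied from the relevance rates above, abridged to a few patterns that illustrate the cut-off; names are my own):

```python
import math

# pattern -> (relevant texts with pattern, texts with pattern)
counts = {
    "[SUBJ] destroyed":  (2, 3),
    "destroyed [DOBJ]":  (2, 3),
    "[SUBJ] attacked":   (2, 2),
    "attacked [DOBJ]":   (2, 2),
    "[SUBJ] detonated":  (1, 1),
    "killed [DOBJ]":     (0, 1),
}

def score(rel_freq, freq):
    r = rel_freq / freq
    return r * math.log2(freq) if r > 0.5 else 0.0

for p in sorted(counts, key=lambda p: score(*counts[p]), reverse=True):
    print(f"{p:18s} {score(*counts[p]):.2f}")
```

The printout reproduces the ranking above: the two "destroyed" patterns first (1.06), the two "attacked" patterns next (1.00), and everything with freq = 1 or rel <= 0.5 at zero.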
In this exercise, we study the paper Riloff & Jones: "Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping".
Assume we have the following document collection that has been analysed by a syntactic analyser (only parts relevant to this task are shown). Abbreviations: n = noun, np = noun phrase, av = active verb, p = preposition:
np(Mason) av(waits) p(with) np(n(dozens) p(of) np(other n(tourists))) p(in) np(a long line).
np(Sixteen charter planes) av(landed) p(in) np(a single n(day)) p(at) np(the sea n(resort) p(of) np(Hurghada)).
Last year np(Egypt) av(attracted) many tourists who av(came) p(to) np(the Middle East).
He av(runs) np(a papyrus n(shop)) p(in) the old n(city) p(of) np(Cairo).
np(Stone Town) is the urban n(center) p(of) np(Zanzibar).
Few cars av(came) p(to) np(the south n(coast) p(of) np(Zanzibar)).
The package includes a half-day tour to the n(city) p(of) np(Hurghada).
The shop is located right at the city n(center) p(of) np(Cairo).
Labor av(united) p(with) np(immigrants) on reform issues.
np(n(City) p(of) np(Nairobi)) unveils a new user-friendly bike map.
The government of the region av(asked) the security adviser at the U.S. n(Embassy) p(in) np(Nairobi) about the warning.
The attackers blew up the U.S. n(Embassy) p(in) np(Dar es Salaam).
The n(city) p(of) np(Zanzibar) av(consists) p(of) np(Stone Town and Ngambo).
Previous n(visitors) p(to) np(Mount Kumgang) had to go by ferry.
His n(visit) p(to) np(Cairo) was delayed.
In 1964 Tanganyika av(united) p(with) np(Zanzibar) to form Tanzania.
Assume further that we want to use only the following two AutoSlog heuristic rules:
noun prep <noun-phrase>
active-verb prep <noun-phrase>
If the set of seed words is Cairo and Zanzibar, which other words would be added to the semantic lexicon? Why? It is enough if you study the first part of the method ('Mutual Bootstrapping') only.
As the data set is very small, you can use a simpler score, e.g. score(pattern) = R * F.
First, we do the pattern matching.
ROUND 1
LEX: zanzibar cairo

PATTERN         F  N
----------------------------------------
dozens of X     0  1
resort of X     0  1
city of X       1  4
center of X     1  2
coast of X      1  1
city of X       0  4
center of X     1  2
city of X       0  4
embassy in X    0  2
embassy in X    0  2
city of X       1  4
visitors to X   0  1
visit to X      1  1
waits with X    0  1
landed in X     0  1
came to X       0  2
came to X       1  2
united with X   0  2
consists of X   0  1
united with X   1  2

PATTERN         score
-----------------------------------------
landed in X     0
came to X       0.5
dozens of X     0
embassy in X    0
city of X       1
visitors to X   0
consists of X   0
resort of X     0
center of X     2
visit to X      1
coast of X      1
waits with X    0
united with X   0.5

BEST PATTERN: center of X   2
LEX zanzibar exists already
LEX cairo exists already

ROUND 2
LEX: zanzibar cairo

PATTERN         F  N
----------------------------------------
dozens of X     0  1
resort of X     0  1
city of X       1  4
coast of X      1  1
city of X       0  4
city of X       0  4
embassy in X    0  2
embassy in X    0  2
city of X       1  4
visitors to X   0  1
visit to X      1  1
waits with X    0  1
landed in X     0  1
came to X       0  2
came to X       1  2
united with X   0  2
consists of X   0  1
united with X   1  2

PATTERN         score
-----------------------------------------
landed in X     0
came to X       0.5
dozens of X     0
embassy in X    0
city of X       1
visitors to X   0
consists of X   0
resort of X     0
visit to X      1
waits with X    0
coast of X      1
united with X   0.5

BEST PATTERN: city of X   1
LEX cairo exists already
Adding hurghada to LEX
Adding nairobi to LEX
LEX zanzibar exists already

ROUND 3
LEX: zanzibar hurghada cairo nairobi

PATTERN         F  N
----------------------------------------
dozens of X     0  1
resort of X     1  1
coast of X      1  1
embassy in X    1  2
embassy in X    0  2
visitors to X   0  1
visit to X      1  1
waits with X    0  1
landed in X     0  1
came to X       0  2
came to X       1  2
united with X   0  2
consists of X   0  1
united with X   1  2

PATTERN         score
-----------------------------------------
landed in X     0
came to X       0.5
dozens of X     0
embassy in X    0.5
visitors to X   0
consists of X   0
resort of X     1
visit to X      1
waits with X    0
coast of X      1
united with X   0.5

BEST PATTERN: resort of X   1
LEX hurghada exists already

ROUND 4
LEX: zanzibar hurghada cairo nairobi

PATTERN         F  N
----------------------------------------
dozens of X     0  1
coast of X      1  1
embassy in X    1  2
embassy in X    0  2
visitors to X   0  1
visit to X      1  1
waits with X    0  1
landed in X     0  1
came to X       0  2
came to X       1  2
united with X   0  2
consists of X   0  1
united with X   1  2

PATTERN         score
-----------------------------------------
landed in X     0
came to X       0.5
dozens of X     0
embassy in X    0.5
visitors to X   0
consists of X   0
visit to X      1
waits with X    0
coast of X      1
united with X   0.5

BEST PATTERN: visit to X   1
LEX cairo exists already

ROUND 5
LEX: zanzibar hurghada cairo nairobi

PATTERN         F  N
----------------------------------------
dozens of X     0  1
coast of X      1  1
embassy in X    1  2
embassy in X    0  2
visitors to X   0  1
waits with X    0  1
landed in X     0  1
came to X       0  2
came to X       1  2
united with X   0  2
consists of X   0  1
united with X   1  2

PATTERN         score
-----------------------------------------
landed in X     0
came to X       0.5
dozens of X     0
embassy in X    0.5
visitors to X   0
consists of X   0
waits with X    0
coast of X      1
united with X   0.5

BEST PATTERN: coast of X   1
LEX zanzibar exists already

ROUND 6
LEX: zanzibar hurghada cairo nairobi

PATTERN         F  N
----------------------------------------
dozens of X     0  1
embassy in X    1  2
embassy in X    0  2
visitors to X   0  1
waits with X    0  1
landed in X     0  1
came to X       0  2
came to X       1  2
united with X   0  2
consists of X   0  1
united with X   1  2

PATTERN         score
-----------------------------------------
landed in X     0
came to X       0.5
dozens of X     0
embassy in X    0.5
visitors to X   0
consists of X   0
waits with X    0
united with X   0.5

BEST PATTERN: came to X   0.5
Adding the middle east to LEX
Adding the south coast of zanzibar to LEX

Thus, mutual bootstrapping adds hurghada, nairobi, the middle east, and the south coast of zanzibar to the semantic lexicon.
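One scoring round can be sketched as follows, using the round-1 totals from the tables above (per-pattern F and N summed over occurrences; names are my own):

```python
# Round-1 totals: pattern -> (F, N), where F = extractions already in the
# lexicon {zanzibar, cairo} and N = all extractions of the pattern
counts = {
    "center of X":   (2, 2),
    "city of X":     (2, 4),
    "coast of X":    (1, 1),
    "visit to X":    (1, 1),
    "came to X":     (1, 2),
    "united with X": (1, 2),
    "resort of X":   (0, 1),
    "embassy in X":  (0, 2),
}

def score(f, n):
    """Simplified mutual-bootstrapping score: R * F, with R = F / N."""
    return (f / n) * f

best = max(counts, key=lambda p: score(*counts[p]))
print(best, score(*counts[best]))  # center of X 2.0
```

This reproduces round 1: "center of X" wins with score 2, and its extractions (zanzibar, cairo) are added to the lexicon, where they already exist.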
Study the ProMED-PLUS Epidemiological Fact Base http://doremi.cs.helsinki.fi/puls/ and use the Web interface.
Search for cases of "avian influenza" and try to find examples where the extraction system has made mistakes.
docno: UH20050420.1110 [1, 2, 3, 4, 5, 6]
pub_date: 2005.04.20
disease: Avian Influenza
count: 51
status: sick
location: Viet Nam
state: Vietnam
case_descriptor: 51 people
In fact, the 51 people had died, and they came from Thailand, Cambodia, and Vietnam; the most recent cases occurred in Cambodia, near the Vietnamese border.
docno: UH20050415.1085 [1, 2, 3, 4, 5, 6]
pub_date: 2005.04.15
count: 200000000
end_time: 2004
status: sick
state: Italy
location: Italy
case_descriptor: some 200 million birds
The speaker was Italian, the birds Asian.
docno: UH20050420.1109 [1, 2]
pub_date: 2005.04.20
location: Russia
case_descriptor: birds
state: Russia
The speaker was a Russian minister; the poultry was Italian turkeys.