Processing of Large Document Collections: Exercise 5.

Solution for Exercise 5.1

In this exercise, we study the AutoSlog-TS algorithm (Riloff, E.: "Automatically Generating Extraction Patterns from Untagged Text"). Assume that our text collection contains the following documents. Explain the process of AutoSlog-TS using these documents and give the ranking for the extraction patterns that are generated.

text 1; relevant
s: A group of terrorists
v: attacked
do: a post
pp: in Nuevo Progreso.

text 2; relevant
s: The National offices
v: were attacked
time: today.
s: Unidentified individuals
v: detonated
do: a bomb.
s: The bomb
v: destroyed
do: a car.

text 3; not relevant
s: The Armed Forces units
v: killed
do: one rebel.
s: They
v: destroyed
do: an underground hideout.

text 4; relevant
s: Unidentified individuals
v: attacked
do: a high tension tower.
s: They
v: destroyed
do: it.

text 5; not relevant
s: The coca growers
v: protest
do: the destruction of their fields.
s: The strike
v: is supported
pp: by the Shining Path.

AutoSlog is a system for dictionary construction. It forms extraction patterns automatically based on a set of heuristic rules. As input, the system is given text in which the target phrases are annotated with tags. The system relies on a parser to identify the syntactic constituents of each sentence, such as the subject, the verb, the direct object, and the prepositional phrases.

There are three kinds of rules, grouped by the syntactic role of the target noun phrase (subject, direct object, or noun phrase inside a prepositional phrase). A rule is instantiated based on the sentence in which the target phrase appears. For instance, "Juha Makkonen, the teaching assistant, was kidnapped for a ridiculous ransom" could result in the pattern "[SUBJ] was kidnapped", if the proper name was tagged as a target (the [SUBJ] passive-verb rule matches).

AutoSlog-TS is an extension of AutoSlog. Instead of annotated text, it needs only a corpus of texts classified as relevant or not relevant: it generates an extraction pattern for every noun phrase in the training corpus and then evaluates the patterns statistically. Pattern extraction is carried out with 15 heuristic rules. Each pattern is ranked with a relevance rate, which is a conditional probability:

                                                 P( relevant text AND text has pattern i )
	P( relevant text | text has pattern i) = ------------------------------------------
                                                 P( text has pattern i )	                                       
      

Then, this relevance rate rel(i) is weighted with the frequency freq(i) of the pattern; the weighted score is used only if the rate is greater than 0.5 (otherwise the score is 0):

      score( i ) = rel( i ) * log_2( freq( i ) )
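
As a minimal sketch of this ranking (the names rel_freq, for the number of occurrences of a pattern in relevant texts, and freq, for its number of occurrences in the whole collection, are our own), the score could be computed as follows:

from math import log2

def rel(rel_freq, freq):
    # relevance rate: P( relevant text | text has pattern i )
    return rel_freq / freq

def score(rel_freq, freq):
    # frequency-weighted rate; patterns with a rate of at most 0.5 score 0
    r = rel(rel_freq, freq)
    return r * log2(freq) if r > 0.5 else 0.0

print(score(2, 3))   # a pattern seen in 3 texts, 2 of them relevant: ~1.06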

The rules are:

RULE                          EXAMPLE
----------------------------------------------------------------
1. [SUBJ] passive-verb        [victim] was murdered
2. [SUBJ] active-verb         [perp] bombed
3. [SUBJ] verb infin.         [perp] attempted to kill
4. [SUBJ] aux noun            [victim] was victim


5. passive-verb [DOBJ]        killed [victim] (** a hack! **)
6. active-verb [DOBJ]         bombed [target]
7. infin. [DOBJ]              to kill [victim]
8. verb infin. [DOBJ]         tried to kill [victim]
9. gerund [DOBJ]              killing [victim]
10. noun aux [DOBJ]           fatality was [victim]

11. noun prep [NP]            bomb against [target]
12. active-verb prep [NP]     killed with [instrument]
13. passive-verb prep [NP]    was aimed at [target]

14. [SUBJ] active-verb DOBJ   [perp] pillaged the village
15. infin. prep [NP]          to triumph in [loc]
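
As an illustration (not Riloff's implementation), the following sketch shows how a parsed clause could instantiate a few of the rules above (1, 2, 6, 11, 14). The clause representation, with subj/verb/dobj/pps fields and a passive flag, is an assumption made only for this example:

def instantiate_patterns(clause):
    # Return the extraction patterns a parsed clause gives rise to.
    patterns = []
    verb = clause["verb"]
    if clause.get("passive"):
        patterns.append(f"[SUBJ] {verb}")                       # rule 1: [SUBJ] passive-verb
    else:
        patterns.append(f"[SUBJ] {verb}")                       # rule 2: [SUBJ] active-verb
        if clause.get("dobj"):
            patterns.append(f"{verb} [DOBJ]")                   # rule 6: active-verb [DOBJ]
            patterns.append(f"[SUBJ] {verb} {clause['dobj']}")  # rule 14: [SUBJ] active-verb DOBJ
    for noun, prep in clause.get("pps", []):
        patterns.append(f"{noun} {prep} [NP]")                  # rule 11: noun prep [NP]
    return patterns

# the first sentence of text 1
clause = {"subj": "A group of terrorists", "verb": "attacked",
          "dobj": "a post", "pps": [("a post", "in")], "passive": False}
print(instantiate_patterns(clause))
# ['[SUBJ] attacked', 'attacked [DOBJ]', '[SUBJ] attacked a post', 'a post in [NP]']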

Now, we apply the rules to the texts:

-------------

text 1; relevant
"A group of terrorists attacked a post in Nuevo Progreso."

 * [SUBJ] attacked           (rule 2)
 * attacked [DOBJ]           (rule 6)
 * a post in [NP]            (rule 11)
 * [SUBJ] attacked a post    (rule 14)
 
-------------

text 2; relevant
"The National offices were attacked today."

 * [SUBJ] were attacked      (rule 1)


"Unidentified individuals detonated a bomb."

 * [SUBJ] detonated          (rule 2)
 * [SUBJ] detonated a bomb   (rule 14)
 * detonated [DOBJ]          (rule 6)

"The bomb destroyed a car."

 * [SUBJ] destroyed          (rule 2)
 * [SUBJ] destroyed a car    (rule 14)
 * destroyed [DOBJ]          (rule 6)

-------------

text 3; not relevant
"The Armed Forces units killed one rebel."

 * [SUBJ] killed             (rule 2)
 * [SUBJ] killed  one rebel  (rule 14)
 * killed [DOBJ]             (rule 6)

"They destroyed an underground hideout."

 * [SUBJ] destroyed          (rule 2)
 * [SUBJ] destroyed an und...(rule 14)
 * destroyed [DOBJ]          (rule 6)

-------------

text 4; relevant
"Unidentified individuals attacked a high tension tower."

 * [SUBJ] attacked           (rule 2)
 * attacked [DOBJ]           (rule 6)
 * [SUBJ] attacked a high ...(rule 14)

"They destroyed it."

 * [SUBJ] destroyed          (rule 2)
 * [SUBJ] destroyed it       (rule 14)
 * destroyed [DOBJ]          (rule 6)

-------------

text 5; not relevant
"The coca growers protest the destruction of their fields."

 * [SUBJ] protest            (rule 2)
 * [SUBJ] protest the dest.. (rule 14)
 * protest [DOBJ]            (rule 6)


"The strike is supported by the Shining Path."

 * [SUBJ] is supported       (rule 1)
 * is supported by [NP]      (rule 13)

And now we calculate the relevance rates:

rel( "[SUBJ] attacked" ) =             2/2 = 1
rel( "[SUBJ] attacked a high" ) =      1/1 = 1
rel( "[SUBJ] attacked a post" ) =      1/1 = 1
rel( "[SUBJ] detonated" ) =            1/1 = 1
rel( "[SUBJ] detonated a bomb" ) =     1/1 = 1      
rel( "[SUBJ] destroyed" ) =            2/3 = 0.667
rel( "[SUBJ] destroyed it" ) =         1/1 = 0
rel( "[SUBJ] destroyed an und.." ) =   0/1 = 0
rel( "[SUBJ] destroyed a car" ) =      1/1 = 0
rel( "[SUBJ] is supported" ) =         0/1 = 0
rel( "[SUBJ] killed" ) =               0/1 = 0
rel( "[SUBJ] killed one rebel" ) =     0/1 = 0
rel( "[SUBJ] protest" ) =              0/1 = 0
rel( "[SUBJ] protest the destr..." ) = 0/1 = 0
rel( "[SUBJ] were attacked" ) =        1/1 = 1  

rel( "attacked [DOBJ]" ) =             2/2 = 1 
rel( "destroyed [DOBJ]" ) =            2/3 = 0.667  
rel( "detonated [DOBJ]" ) =            1/1 = 1  
rel( "killed [DOBJ]" ) =               0/1 = 0  
rel( "protest [DOBJ]" ) =              0/1 = 0  

rel( "a post in [NP]" ) =              1/1 = 1  
rel( "is supported by [NP]" ) =        0/1 = 0

We need to calculate the scores only for those patterns whose relevance rate exceeds 0.5 (for the others the score is zero):

score( "[SUBJ] destroyed" )        = 0.667 * log2(3) = 1.06
score( "destroyed [DOBJ]" )        = 0.667 * log2(3) = 1.06
score( "[SUBJ] attacked" )         = 1 * log2(2) = 1
score( "attacked [DOBJ]" )         = 1 * log2(2) = 1
score( "[SUBJ] attacked a high" )  = 1 * log2(1) = 0
score( "[SUBJ] attacked a post" )  = 1 * log2(1) = 0
score( "[SUBJ] detonated" )        = 1 * log2(1) = 0
score( "[SUBJ] detonated a bomb" ) = 1 * log2(1) = 0
score( "[SUBJ] were attacked" )    = 1 * log2(1) = 0
score( "detonated [DOBJ]" )        = 1 * log2(1) = 0
score( "a post in [NP]" )          = 1 * log2(1) = 0

Solution for Exercise 5.2

In this exercise, we study the paper Riloff, Jones: Learning Dictionaries for Information Extraction by Multi-level Bootstrapping.

Assume we have the following document collection that has been analysed by a syntactic analyser (only parts relevant to this task are shown). Abbreviations: n = noun, np = noun phrase, av = active verb, p = preposition:


np(Mason) av(waits) p(with) np(n(dozens) p(of) np(other n(tourists))) 
p(in) np(a long line).

np(Sixteen charter planes) av(landed) p(in) np(a single n(day)) 
p(at) np(the sea n(resort) p(of) np(Hurghada)).

Last year np(Egypt) av(attracted) many tourists who av(came) 
p(to) np(the Middle East).

He av(runs) np(a papyrus n(shop)) p(in) the old n(city) p(of) np(Cairo).

np(Stone Town) is the urban n(center) p(of) np(Zanzibar).

Few cars av(came) p(to) np(the south n(coast) p(of) np(Zanzibar)).

The package includes a half-day tour to the n(city) p(of) np(Hurghada).

The shop is located right at the city n(center) p(of) np(Cairo).

Labor av(united) p(with) np(immigrants) on reform issues.

np(n(City) p(of) np(Nairobi)) unveils a new user-friendly bike map.

The government of the region av(asked) the security adviser 
at the U.S. n(Embassy) p(in) np(Nairobi) about the warning.

The attackers blew up the U.S. n(Embassy) p(in) np(Dar es Salaam).

The n(city) p(of) np(Zanzibar) av(consists) p(of) np(Stone Town and
Ngambo).

Previous n(visitors) p(to) np(Mount Kumgang) had to go by ferry.

His n(visit) p(to) np(Cairo) was delayed.

In 1964 Tanganyika av(united) p(with) np(Zanzibar) to form Tanzania. 


Assume further that we want to use only the following two AutoSlog heuristic rules:

noun prep <noun-phrase>
active-verb prep <noun-phrase>

If the set of seed words is Cairo and Zanzibar, which other words would be added to the semantic lexicon, and why? It is enough to study only the first part of the method ('Mutual Bootstrapping').

As the data set is very small, you can use a simpler score, e.g. score(pattern) = R * F.
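
Before walking through the rounds, here is a rough sketch of the mutual bootstrapping loop with this simpler score. The extractions dictionary (mapping a pattern to the noun phrases it extracts from the corpus) is only an illustrative subset; note also that the sketch uses exact matching against the lexicon, whereas the trace below also counts an extraction that merely contains a lexicon word (e.g. "the south coast of zanzibar").

extractions = {
    "center of X": ["zanzibar", "cairo"],
    "city of X":   ["cairo", "hurghada", "nairobi", "zanzibar"],
    "resort of X": ["hurghada"],
    "visit to X":  ["cairo"],
    "came to X":   ["the middle east", "the south coast of zanzibar"],
}

lexicon = {"zanzibar", "cairo"}          # seed words
used = set()                             # patterns already selected

for round_no in range(1, 4):             # a few bootstrapping rounds
    best, best_score = None, 0.0
    for pattern, nps in extractions.items():
        if pattern in used:
            continue
        f = sum(np in lexicon for np in nps)   # F: extractions already in the lexicon
        r = f / len(nps)                       # R: reliability of the pattern
        if r * f > best_score:
            best, best_score = pattern, r * f
    if best is None:
        break
    used.add(best)
    lexicon.update(extractions[best])          # add the best pattern's extractions
    print(f"round {round_no}: best pattern = {best}, LEX = {sorted(lexicon)}")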


First, we do pattern matching with the two rules:

  noun prep [NP]
  active-verb prep [NP]

In the tables below, each row corresponds to one occurrence of a pattern in the corpus. The F column is 1 if the noun phrase extracted by that occurrence matches an entry of the current lexicon LEX (here, an NP that contains a lexicon word, such as "the south coast of Zanzibar", also counts as a match), and 0 otherwise; N is the total number of occurrences of the pattern. For each pattern, R = (sum of F over its occurrences) / N, and score(pattern) = R * (sum of F).

ROUND 1

LEX:  zanzibar  cairo 
PATTERN		F	N
----------------------------------------
dozens of X	0	1
resort of X	0	1
city of X	1	4
center of X	1	2
coast of X	1	1
city of X	0	4
center of X	1	2
city of X	0	4
embassy in X	0	2
embassy in X	0	2
city of X	1	4
visitors to X	0	1
visit to X	1	1
waits with X	0	1
landed in X	0	1
came to X	0	2
came to X	1	2
united with X	0	2
consists of X	0	1
united with X	1	2

PATTERN		score
-----------------------------------------
landed in X	0
came to X	0.5
dozens of X	0
embassy in X	0
city of X	1
visitors to X	0
consists of X	0
resort of X	0
center of X	2
visit to X	1
coast of X	1
waits with X	0
united with X	0.5
BEST PATTERN: center of X 	2
LEX zanzibar exists already
LEX cairo exists already



ROUND 2

LEX:  zanzibar  cairo 
PATTERN		F	N
----------------------------------------
dozens of X	0	1
resort of X	0	1
city of X	1	4
coast of X	1	1
city of X	0	4
city of X	0	4
embassy in X	0	2
embassy in X	0	2
city of X	1	4
visitors to X	0	1
visit to X	1	1
waits with X	0	1
landed in X	0	1
came to X	0	2
came to X	1	2
united with X	0	2
consists of X	0	1
united with X	1	2

PATTERN		score
-----------------------------------------
landed in X	0
came to X	0.5
dozens of X	0
embassy in X	0
city of X	1
visitors to X	0
consists of X	0
resort of X	0
visit to X	1
waits with X	0
coast of X	1
united with X	0.5
BEST PATTERN: city of X 	1
LEX cairo exists already
Adding hurghada to LEX
Adding nairobi to LEX
LEX zanzibar exists already



ROUND 3

LEX:  zanzibar  hurghada  cairo  nairobi 
PATTERN		F	N
----------------------------------------
dozens of X	0	1
resort of X	1	1
coast of X	1	1
embassy in X	1	2
embassy in X	0	2
visitors to X	0	1
visit to X	1	1
waits with X	0	1
landed in X	0	1
came to X	0	2
came to X	1	2
united with X	0	2
consists of X	0	1
united with X	1	2

PATTERN		score
-----------------------------------------
landed in X	0
came to X	0.5
dozens of X	0
embassy in X	0.5
visitors to X	0
consists of X	0
resort of X	1
visit to X	1
waits with X	0
coast of X	1
united with X	0.5
BEST PATTERN: resort of X 	1
LEX hurghada exists already



ROUND 4

LEX:  zanzibar  hurghada  cairo  nairobi 
PATTERN		F	N
----------------------------------------
dozens of X	0	1
coast of X	1	1
embassy in X	1	2
embassy in X	0	2
visitors to X	0	1
visit to X	1	1
waits with X	0	1
landed in X	0	1
came to X	0	2
came to X	1	2
united with X	0	2
consists of X	0	1
united with X	1	2

PATTERN		score
-----------------------------------------
landed in X	0
came to X	0.5
dozens of X	0
embassy in X	0.5
visitors to X	0
consists of X	0
visit to X	1
waits with X	0
coast of X	1
united with X	0.5
BEST PATTERN: visit to X 	1
LEX cairo exists already



ROUND 5

LEX:  zanzibar  hurghada  cairo  nairobi 
PATTERN		F	N
----------------------------------------
dozens of X	0	1
coast of X	1	1
embassy in X	1	2
embassy in X	0	2
visitors to X	0	1
waits with X	0	1
landed in X	0	1
came to X	0	2
came to X	1	2
united with X	0	2
consists of X	0	1
united with X	1	2

PATTERN		score
-----------------------------------------
landed in X	0
came to X	0.5
dozens of X	0
embassy in X	0.5
visitors to X	0
consists of X	0
waits with X	0
coast of X	1
united with X	0.5
BEST PATTERN: coast of X 	1
LEX zanzibar exists already



ROUND 6

LEX:  zanzibar  hurghada  cairo  nairobi 
PATTERN		F	N
----------------------------------------
dozens of X	0	1
embassy in X	1	2
embassy in X	0	2
visitors to X	0	1
waits with X	0	1
landed in X	0	1
came to X	0	2
came to X	1	2
united with X	0	2
consists of X	0	1
united with X	1	2

PATTERN		score
-----------------------------------------
landed in X	0
came to X	0.5
dozens of X	0
embassy in X	0.5
visitors to X	0
consists of X	0
waits with X	0
united with X	0.5
BEST PATTERN: came to X 	0.5
Adding the middle east to LEX
Adding the south coast of zanzibar to LEX

So, starting from the seed words zanzibar and cairo, mutual bootstrapping adds hurghada, nairobi, the middle east, and the south coast of zanzibar to the semantic lexicon: in each round the extractions of the best-scoring pattern are added, and only the rounds won by "city of X" (round 2) and "came to X" (round 6) produce entries that are not already in the lexicon.

Solution for Exercise 5.3

Study the ProMED-PLUS Epidemiological Fact Base http://doremi.cs.helsinki.fi/puls/ and use the Web interface.

Search for cases of "avian influenza" and try to find examples where the extraction system has made mistakes.


docno: UH20050420.1110 [1, 2, 3, 4, 5, 6] 
pub_date: 2005.04.20
disease: Avian Influenza
count: 51
status: sick
location: Viet Nam
state: Vietnam
case_descriptor: 51 people
      

In fact, the 51 people were dead, not sick, and they were from Thailand, Cambodia, and Vietnam. The most recent cases were in Cambodia, near the Vietnamese border.

docno: UH20050415.1085 [1, 2, 3, 4, 5, 6] 
pub_date: 2005.04.15
count: 200000000
end_time: 2004
status: sick
state: Italy
location: Italy
case_descriptor: some 200 million birds

The speaker was Italian, but the birds were Asian.

docno: UH20050420.1109 [1, 2] 
pub_date: 2005.04.20
location: Russia
case_descriptor: birds
state: Russia

The speaker was a Russian minister, but the poultry in question came from Italian turkeys.

