582410 Processing of large document collections, Exercise 5



  1. In this exercise, we study the algorithm of AutoSlog-TS (Riloff, E.: "Automatically Generating Extraction Patterns from Untagged Text"). Assume our text collection contains the following documents:


    
    text 1; relevant
    
    s: A group of terrorists
    v: attacked
    do: a post
    pp: in Nuevo Progreso.
    
    
    text 2; relevant
    
    s: The National offices
    v: were attacked
    time: today.
    s: Unidentified individuals
    v: detonated
    do: a bomb.
    s: The bomb
    v: destroyed
    do: a car.
    
    
    text 3; not relevant
    
    s: The Armed Forces units
    v: killed
    do: one rebel.
    s: They
    v: destroyed
    do: an underground hideout.
    
    
    text 4; relevant
    
    s: Unidentified individuals
    v: attacked
    do: a high tension tower.
    s: They
    v: destroyed
    do: it.
    
    
    text 5; not relevant
    
    s: The coca growers
    v: protest
    do: the destruction of their fields.
    s: The strike
    v: is supported
    pp: by the Shining Path.
    
    

    Explain the process of AutoSlog-TS using these documents and give the ranking for the extraction patterns that are generated.

    Abbreviations: s=subject, v=verb, do=direct object, pp=prepositional phrase
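
    The ranking stage of AutoSlog-TS can be sketched as follows. The pattern
    counts below are a hand-tallied, illustrative subset of the patterns the
    heuristics generate from texts 1-5, not the full answer; tally the
    complete set yourself when solving the exercise.

```python
# Sketch of the AutoSlog-TS ranking stage. Each candidate pattern is
# scored by
#   relevance rate = rel_freq / total_freq
#   rank score     = relevance_rate * log2(total_freq)
# The counts below are an illustrative, hand-tallied subset of the
# patterns generated from texts 1-5 above.
from math import log2

# pattern -> (occurrences in relevant texts, occurrences in all texts)
counts = {
    "<subj> attacked":  (2, 2),   # texts 1 and 4 (both relevant)
    "attacked <dobj>":  (2, 2),
    "<subj> destroyed": (2, 3),   # texts 2, 4 (relevant) and 3 (not)
    "destroyed <dobj>": (2, 3),
    "<subj> detonated": (1, 1),   # text 2 only
    "killed <dobj>":    (0, 1),   # text 3 (not relevant) only
}

def rank_score(rel_freq, total_freq):
    relevance_rate = rel_freq / total_freq
    return relevance_rate * log2(total_freq)

# highest-ranked patterns first
ranking = sorted(counts, key=lambda p: rank_score(*counts[p]), reverse=True)
```

    Note that log2(1) = 0, so a pattern seen only once ranks at the bottom
    regardless of its relevance rate; in the paper, patterns whose relevance
    rate is at most 0.5 are additionally discarded as negatively correlated
    with the domain.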


  2. In this exercise, we study the paper Riloff, E., Jones, R.: "Learning Dictionaries for Information Extraction by Multi-level Bootstrapping".

    Assume we have the following document collection that has been analysed by a syntactic analyser (only parts relevant to this task are shown). Abbreviations: n = noun, np = noun phrase, av = active verb, p = preposition:


    np(Mason) av(waits) p(with) np(n(dozens) p(of) np(other n(tourists))) 
    p(in) np(a long line).
    
    np(Sixteen charter planes) av(landed) p(in) np(a single n(day)) 
    p(at) np(the sea n(resort) p(of) np(Hurghada)).
    
    Last year np(Egypt) av(attracted) many tourists who av(came) 
    p(to) np(the Middle East).
    
    He av(runs) np(a papyrus n(shop)) p(in) the old n(city) p(of) np(Cairo).
    
    np(Stone Town) is the urban n(center) p(of) np(Zanzibar).
    
    Few cars av(came) p(to) np(the south n(coast) p(of) np(Zanzibar)).
    
    The package includes a half-day tour to the n(city) p(of) np(Hurghada).
    
    The shop is located right at the city n(center) p(of) np(Cairo).
    
    Labor av(united) p(with) np(immigrants) on reform issues.
    
    np(n(City) p(of) np(Nairobi)) unveils a new user-friendly bike map.
    
    The government of the region av(asked) the security adviser 
    at the U.S. n(Embassy) p(in) np(Nairobi) about the warning.
    
    The attackers blew up the U.S. n(Embassy) p(in) np(Dar es Salaam).
    
    The n(city) p(of) np(Zanzibar) av(consists) p(of) np(Stone Town and
    Ngambo).
    
    Previous n(visitors) p(to) np(Mount Kumgang) had to go by ferry.
    
    His n(visit) p(to) np(Cairo) was delayed.
    
    In 1964 Tanganyika av(united) p(with) np(Zanzibar) to form Tanzania. 
    
    

    Assume further that we want to use only the following two AutoSlog heuristic rules:

    noun prep <noun-phrase>
    active-verb prep <noun-phrase>
    

    If the set of seed words is {Cairo, Zanzibar}, which other words would be added to the semantic lexicon? Why? It is enough to study only the first part of the method ('Mutual Bootstrapping').

    As the data set is very small, you can use a simpler score, e.g. score(pattern) = R * F, where F is the number of the pattern's unique extractions that are already in the semantic lexicon, N is the total number of its unique extractions, and R = F / N.
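
    One way to sketch the Mutual Bootstrapping loop with this simplified
    score is shown below. The pattern-to-extractions table is a hand-derived
    subset of what the two heuristic rules produce on these sentences, so
    treat it as an assumption to verify against the parses, not the complete
    answer.

```python
# Sketch of the Mutual Bootstrapping loop with the simplified score
#   score(pattern) = R * F, where
#   F = number of the pattern's extractions already in the lexicon,
#   N = total number of unique NPs the pattern extracts, and R = F / N.
# The table below is a hand-derived subset of the patterns the two
# heuristic rules yield on the sentences above; verify it yourself.
extractions = {
    "n(city) p(of) <np>":      {"Hurghada", "Cairo", "Zanzibar", "Nairobi"},
    "n(center) p(of) <np>":    {"Zanzibar", "Cairo"},
    "av(united) p(with) <np>": {"immigrants", "Zanzibar"},
    "n(Embassy) p(in) <np>":   {"Nairobi", "Dar es Salaam"},
    "n(visit) p(to) <np>":     {"Cairo"},
}

lexicon = {"Cairo", "Zanzibar"}        # the seed words
chosen = []                            # patterns accepted so far

def score(pattern):
    f = len(extractions[pattern] & lexicon)
    return (f / len(extractions[pattern])) * f      # R * F

for _ in range(3):                     # a few bootstrapping rounds
    # pick the best pattern not yet chosen ...
    best = max((p for p in extractions if p not in chosen), key=score)
    chosen.append(best)
    # ... and add every NP it extracts to the semantic lexicon
    lexicon |= extractions[best]
```

    Note how "n(Embassy) p(in) <np>" only gains score once Nairobi has
    entered the lexicon; a later round picking it would also pull in
    Dar es Salaam, illustrating how bootstrapped lexicons can drift.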


  3. Study the ProMED-PULS Epidemiological Fact Base http://doremi.cs.helsinki.fi/puls/ and try its Web interface.

    Search for cases of "avian influenza" (bird flu) and try to find examples where the extraction system has made mistakes.


  4. Give feedback about the course: [ In Finnish] [ In English]



Helena Ahonen-Myka
Last modified: Tue Apr 18 11:15:27 EEST 2006