Information extraction from text, Week 5

The solutions should be ready for inspection by Thursday 13.3.2003 (midnight).

Remember: whenever you are in doubt about what to do, you can ask Lili or send a message to our newsgroup!

In this exercise, we study the paper by Lerman, Knoblock, and Minton, "Automatic Data Extraction from Lists and Tables in Web Sources" (IJCAI-2001 Workshop on Adaptive Text Extraction and Mining). Try to find answers to the following questions:
- What is a Web wrapper?
- What is meant by wrapper induction?
- Describe briefly the steps of the method presented in the paper. You can concentrate on understanding the problem(s) that each step tries to solve. We'll discuss the steps in more detail in the lecture.
Let's try to apply the method above to some real web pages. Search on the Amazon.com web page for books on
1. "daughter of fortune" and
2. "shadow warriors".
Study the result pages and their HTML source ("View page source" etc.) and describe what each step of the method would do if these pages were given as input. You can restrict yourself to some representative samples of the data. You can also simplify the HTML code if you like. As above, concentrate on understanding the problems, not so much on the solutions.
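
If you want to simplify the HTML code mechanically rather than by hand, one possible approach is to reduce a page to a flat sequence of tag and text tokens and drop all attributes. The sketch below is only an illustration of that idea, not part of the method in the paper. The names TokenStream and simplify are made up for this example, and the sample fragment is hand-made, not real Amazon.com markup.

```python
from html.parser import HTMLParser


class TokenStream(HTMLParser):
    """Collect a flat list of tag and text tokens, dropping attributes."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Keep only the tag name; attributes are ignored on purpose.
        self.tokens.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        # Collapse runs of whitespace and skip whitespace-only text.
        text = " ".join(data.split())
        if text:
            self.tokens.append(text)


def simplify(html):
    """Return the simplified token sequence for an HTML string."""
    parser = TokenStream()
    parser.feed(html)
    parser.close()
    return parser.tokens


# Hypothetical, hand-made fragment (an assumption for illustration only).
sample = """
<table>
  <tr><td><b>Example Title One</b></td><td class="p">$10.00</td></tr>
  <tr><td><b>Example Title Two</b></td><td class="p">$12.50</td></tr>
</table>
"""

tokens = simplify(sample)
for token in tokens:
    print(token)
```

Running simplify on the saved source of both result pages and comparing the two token sequences side by side may make it easier to see which parts of a page repeat and which parts vary.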

Last modified: Wed Mar 5 18:35:14 EET 2003