582410 Processing of large document collections, Exercise 5

The solutions should be ready for inspection by Thursday 1.11.2001 (midnight).


  1. Study the FociSum summarization system described on the web page http://www.cs.columbia.edu/~hjing/sumDemo/ (follow the link FociSum). Describe the architecture of the system. Try the demo (link: Examples) with the article you chose in Exercise 4 (or choose one now) and compare the generated summary to your own. Describe what kind of information the page provides for summarization. More information on the system is available in the article Kan, McKeown: Information extraction and summarization: domain independence through focus types (local PDF), but you are not required to read it.

  2. Try the Conexor Functional Dependency Grammar (FDG) demo. From the HTML page that you get as a result, extract a list of the nouns that appear in your sample text. Tip: If a word is a noun, the last column of its line contains the string " N ".

    You can use any tools and programming languages, but below are some instructions for XSLT.

    XML tools such as the XSLT processor Xalan require well-formed XML as input. The result file is HTML, but it is not XML, so you first have to convert it into XML.

    1. Use the program Tidy to clean up the HTML file and convert it into XML:

            tidy --output-xml yes --add-xml-decl yes --doctype omit \
                 --numeric-entities yes < inputfile > outputfile
      

      You can find the executable program here, or download binaries or the source code (ANSI C) from the Tidy page.

    2. Remove the <!DOCTYPE ....> declaration from the output file. Tidy should not output it at all when given the option --doctype omit, but remove it if it is there.

    3. Write an XSLT transformation to pick out the information you need. There is a partial solution available. Note how the transformation contains a template for each node in the document tree. You can first run it unmodified with Xalan to see what it outputs (see the instructions for Exercise 3.1). Some tips:

      • The <xsl:apply-templates select="tr"/> instruction (within the template for "table") says that the output for a "table" is produced by processing all of its "tr" children.

      • The <xsl:value-of select="td[3]"/> instruction outputs the text content of the third "td" child of the current node.

      • The <xsl:text> and </xsl:text> pair lets you output whitespace, e.g. tabs and newlines. Just type the character between the tags. For instance,

        <xsl:text>
        </xsl:text>
        

        produces a newline.

      • There is an IF construction:

        <xsl:if test="...">
        ...
        </xsl:if>
        

        where the test can be, for example, a function call such as contains(td[1],'1'), which checks whether the first "td" element contains the string "1".
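
Since any tools and programming languages are allowed, the same steps (drop the DOCTYPE, walk the "tr" rows, test the last "td" for " N ") can also be sketched in Python. This is only a minimal sketch under several assumptions: the SAMPLE rows are made up for illustration and are not real FDG demo output, the word form is assumed to sit in the second column (the hypothetical word_column parameter), the function names strip_doctype and extract_nouns are invented here, and the tidied file is assumed to carry no XML namespace.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a tidied FDG result page; the real layout may differ.
SAMPLE = """<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html><body><table>
<tr><td>1</td><td>The</td><td>the</td><td>det:&gt;2</td><td>@DN&gt; %&gt;N DET</td></tr>
<tr><td>2</td><td>dog</td><td>dog</td><td>subj:&gt;3</td><td>@SUBJ %NH N NOM SG</td></tr>
<tr><td>3</td><td>barks</td><td>bark</td><td>main:&gt;0</td><td>@+FMAINV %VA V PRES SG3</td></tr>
</table></body></html>"""


def strip_doctype(xml_text):
    """Remove the first <!DOCTYPE ...> declaration (step 2 above).

    Assumes a simple declaration without an internal subset.
    """
    return re.sub(r"<!DOCTYPE[^>]*>", "", xml_text, count=1)


def extract_nouns(xml_text, word_column=1):
    """Return the words of all rows whose last cell contains ' N '.

    word_column is the zero-based index of the cell assumed to hold the word.
    """
    root = ET.fromstring(xml_text)
    nouns = []
    for row in root.iter("tr"):
        # Collect the full text of every cell in this row.
        cells = ["".join(td.itertext()) for td in row.findall("td")]
        # The tip from the exercise: nouns have " N " in the last column.
        if len(cells) > word_column and " N " in cells[-1]:
            nouns.append(cells[word_column].strip())
    return nouns


print(extract_nouns(strip_doctype(SAMPLE)))
```

With the made-up sample above, only the row for "dog" passes the test, because its last cell contains " N " surrounded by spaces. For a real result page, check which column actually holds the word and adjust word_column accordingly.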




    Helena.Ahonen-Myka