University of Helsinki Department of Computer Science
 

Department of Computer Science

Department information

 

Generating Grammars for Structured Documents Using Grammatical Inference Methods

Helena Ahonen: Generating Grammars for Structured Documents Using Grammatical Inference Methods. PhD Thesis, Report A-1996-4, Department of Computer Science, University of Helsinki, November 1996. 107 pages. <http://www.cs.helsinki.fi/TR/A-1996/4>

Full paper: gzip'ed Postscript file
Metadata: XML file

Abstract

Dictionaries, user manuals, encyclopedias, and annual reports are typical examples of structured documents. Structured documents have an internal, usually hierarchical, organization that can be used, for instance, to help in retrieving information from the documents and in transforming documents into another form. The document structure is typically represented by a context-free or regular grammar. Many structured documents, however, lack the grammar: the structure of individual documents is known but the general structure of the document class is not available. Examples of this kind of documents include documents that have Standard Generalized Markup Language (SGML) tags but not a Document Type Definition (DTD).

In this thesis we present a technique for generating a grammar describing the structure of a given structured document instances. The technique is based on ideas from machine learning. It forms first finite-state automata describing the given instances completely. These automata are modified by considering certain context conditions; the modifications correspond to generalizing the underlying language. Finally, the automata are converted into regular expressions, which are then used to construct the grammar. Some refining operations are also presented that are necessary for generating a grammar for a large and complicated document. The technique has been implemented and it has been experimented using several document types.

Index Terms

Categories and Subject Descriptors:
I.2.6 Artificial Intelligence: Learning
I.7.2 Text Processing: Document Preparation
F.4.3 Mathematical Logic and Formal Languages: Formal Languages

General Terms: Learning, Algorithms, Theory, Experimentation

Additional Key Words and Phrases: document management, grammar generation, SGML


Online Publications of Department of Computer Science, Anna Pienimäki