<?xml version="1.0" encoding="ISO-8859-1"?>
<report no="A-1996-4"
  title="Generating Grammars for Structured Documents Using Grammatical Inference Methods"
  date="November 1996"
  pages="107"
  genterms="Learning, Algorithms, Theory, Experimentation"
  keywords="document management, grammar generation, SGML"
  issn="1238-8645"
  isbn="951-45-7532-6">
<author name="Helena Ahonen"/>
<phd/>
<class name="I.2.6 Artificial Intelligence: Learning"/>
<class name="I.7.2 Text Processing: Document Preparation"/>
<class name="F.4.3 Mathematical Logic and Formal Languages: Formal Languages"/>
<file url="A-1996-4.ps.gz"/>
<abstract>
<p>
Dictionaries, user manuals, encyclopedias, and annual reports
are typical examples of structured documents.
Structured documents have an internal, usually hierarchical,
organization
that can be used, for instance, to help in retrieving information
from the documents and in transforming documents into another form.
The document structure is typically represented
by a context-free or regular grammar.
Many structured documents, however, lack such a grammar:
the structure of individual documents is known, but
the general structure of the document class is not available.
Examples include documents that
have Standard Generalized Markup Language (SGML) tags but
no Document Type Definition (DTD).
</p><p>
In this thesis we present a technique for
generating a grammar describing
the structure of a given set of structured document instances.
The technique is based on ideas from machine learning.
It first forms finite-state automata describing
the given instances completely.
These automata are modified by considering certain
context conditions; the modifications correspond to
generalizing the underlying language. Finally,
the automata are converted into regular expressions,
which are then used to construct the grammar.
Some refining operations that are necessary for
generating a grammar for a large and
complicated document are also presented.
The technique has been implemented
and evaluated experimentally on several document types.
</p>
</abstract>
</report>

