University of Helsinki Department of Computer Science


Generalization of Document Structures and Document Assembly

Barbara Heikkinen: Generalization of Document Structures and Document Assembly. PhD Thesis, Report A-2000-2, Department of Computer Science, University of Helsinki, April 2000. 179 pages.

Full paper: gzip'ed Postscript file


The accelerating evolution of the World Wide Web has made numerous digital document collections widely available to the public. There is a clear need for new tools that help users gather, combine, and reuse information from existing document collections. At the same time, the number of finely structured documents will increase enormously in the near future, since the Extensible Markup Language (XML) is rapidly gaining popularity in various communities. Compared to HTML, XML makes more versatile processing and customization of documents possible. However, explicit structuring using XML leads to heterogeneously structured document collections, which causes problems when fragments of documents are combined and reused.

Document assembly is the computer-aided construction of new documents from existing document collections. Such reuse includes finding relevant document fragments, modifying them as needed, and combining the fragments. This thesis describes a document assembly model based on versatile recognition and manipulation of document fragments: coherent, contiguous, and relatively independent document parts that serve as the basis for new assemblies. We also introduce a general document assembly system, SAW, and a specialized system for tailoring textbooks via the Web.
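The three steps named above (finding fragments, copying them for modification, and combining them) can be illustrated with a minimal Python sketch. It is not the SAW system or the assembly model of the thesis; the function names `find_fragments` and `assemble`, the keyword-matching rule, and the sample documents are assumptions made only for illustration.

```python
import copy
import xml.etree.ElementTree as ET


def find_fragments(root, tag, keyword):
    """Return elements of the given type whose text content mentions keyword."""
    keyword = keyword.lower()
    return [el for el in root.iter(tag)
            if keyword in "".join(el.itertext()).lower()]


def assemble(sources, tag, keyword, root_tag="assembly"):
    """Build a new document from deep copies of the matching fragments.

    Copies are used so that fragments can be modified later without
    touching the source documents. The root element name is hypothetical.
    """
    new_root = ET.Element(root_tag)
    for xml_text in sources:
        source_root = ET.fromstring(xml_text)
        for fragment in find_fragments(source_root, tag, keyword):
            new_root.append(copy.deepcopy(fragment))
    return new_root


# Two toy source documents with different root element types.
SOURCES = [
    "<book><section><title>XML basics</title><p>Markup.</p></section>"
    "<section><title>History</title><p>Old times.</p></section></book>",
    "<article><section><title>Parsing</title>"
    "<p>An XML parser.</p></section></article>",
]

assembled = assemble(SOURCES, "section", "xml")
titles = [s.findtext("title") for s in assembled.findall("section")]
```

The sketch works only because both toy sources happen to name their fragments `section`. In a heterogeneous collection the same logical unit may carry different element names, which is the problem the following paragraphs address.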

If the assembled documents are to be further processed, the heterogeneous structures of the original documents also have to be unified. This work presents an element-type classification method that facilitates uniform processing of heterogeneous structures. The method contains a decision procedure for mapping an arbitrary structure element to a predefined generic class. The generic classes are defined in a Document Type Definition (DTD) called the generic DTD, which can be seen as a metastructure definition describing typical logical structures of electronic documents.

The element-type classification extracts information from document instances by inspecting element relations and average text lengths of element instances. In this way various structures, such as hierarchies and element containers wrapping logical units, can be recognized. The method is formally presented by using the concept of grammar morphism. Various practical examples of applying the generic classes are provided, and the method is applied to several well-known public document types. In addition to document assembly, the results of the element-type classification method can be used, for instance, in automatic generation of stylesheets for structured documents.
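As a rough illustration of classifying element types from instance data, the following Python sketch gathers, for each element type, the set of child element types and the average text length, and then applies a toy decision procedure. The class names (`hierarchy`, `container`, `short-text`, `text-block`), the rules, and the 30-character threshold are all assumptions invented for this example; they are not the generic classes or the decision procedure defined in the thesis.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET


def element_stats(root):
    """Collect, per element type, its child types and average text length."""
    children = defaultdict(set)
    lengths = defaultdict(list)
    for el in root.iter():
        children[el.tag].update(child.tag for child in el)
        text = "".join(el.itertext())
        lengths[el.tag].append(len(text.strip()))
    return {tag: {"children": children[tag],
                  "avg_len": sum(lens) / len(lens)}
            for tag, lens in lengths.items()}


def classify(tag, stats, short_text=30):
    """Map an element type to a hypothetical class (illustrative rules only)."""
    info = stats[tag]
    if tag in info["children"]:
        return "hierarchy"    # the element type nests inside itself
    if info["children"]:
        return "container"    # wraps other elements
    if info["avg_len"] <= short_text:
        return "short-text"   # e.g. headings or labels
    return "text-block"       # e.g. paragraphs


DOC = (
    "<doc><sec><head>Intro</head>"
    "<para>This paragraph is long enough to count as running text.</para>"
    "<sec><head>Details</head>"
    "<para>Another paragraph with a fair amount of text in it.</para></sec>"
    "</sec></doc>"
)

stats = element_stats(ET.fromstring(DOC))
classes = {tag: classify(tag, stats) for tag in stats}
```

Because the rules look only at element relations and text lengths and never at element names, the same procedure could be run on a document that calls its sections `div` or `chapter`, which is the point of classifying by structure rather than by name.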

Index Terms

Categories and Subject Descriptors:
I.7.1 Document and Text Processing: Document and Text Editing
I.7.2 Document and Text Processing: Document Preparation
I.7.4 Document and Text Processing: Electronic Publishing
F.4.3 Mathematical Logic and Formal Languages: Formal Languages
H.3.3 Information Storage and Retrieval: Information Search and Retrieval
H.3.7 Information Storage and Retrieval: Digital Libraries

General Terms: Documentation, Algorithms, Experimentation, Design

Additional Key Words and Phrases: Document management, Document assembly, Metastructures, Context-free grammars, XML, SGML, Semistructured data, Digital Libraries

Online Publications of Department of Computer Science, Anna Pienimäki
Last updated Mar 16, 2001