Structured and Intelligent Documents

"Älykkäät ja rakenteiset dokumentit"

© Document Management Research Group

University of Helsinki
Department of Computer Science


Project Introduction

Structured and Intelligent Documents (SID) is a three-year research project that studies and develops methods for attaching intelligent features to structured documents. The purpose of these features is to make the manipulation of documents easier.

The SID project started on August 1, 1995 and will end on July 31, 1998. SID is a part of the Electronic Printing and Publishing programme started by the Finnish Technology Development Centre (TEKES). Funding for SID is provided by TEKES and a group of supporting companies.

Primary Concepts

Intelligent Documents

The role and even the concept of a document are undergoing a tremendous change. A document is no longer a passive linear presentation of text. Text with a structure is quite common: dictionaries, reference manuals, annual reports, etc., are typical examples. We create structured documents by using markup methods, such as SGML or the HTML standard of the World-Wide Web; however, there is more to come!

An intelligent document contains knowledge about itself and its environment. It supports assembly of documents based on inputs given by the user. An active intelligent document is able to construct and transform itself dynamically.

Intelligent Assembly

One of the basic problems in document management is to provide on-demand generation of individualised documents through dynamic document assembly. Document assembly composes new documents from an existing collection of documents. Naturally, document markup and structure contribute to the retrieval of the document fragments.

Document assembly is intelligent when it uses application-domain-specific information about the document in addition to the contents and the structure. Inherent information is present in the document or directly computable from the document, e.g., keyword or phrase lists. Besides inherent information, supplementary information is associated with the document. Supplementary information includes, e.g., references, thesauri and common-sense knowledge. Supplementary information can be described, for example, in dependencies or conceptual hierarchies, and can reside in additional document markup, as well as in separate databases.
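The distinction between inherent and supplementary information can be illustrated with a minimal Python sketch. It is not the project's method: the stop-word list, the thesaurus contents, the sample document and the function names are all hypothetical, and simple word frequency stands in for whatever keyword extraction a real system would use.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "for", "from"}

# Hypothetical thesaurus standing in for supplementary information
# kept outside the document (e.g., in a separate database).
THESAURUS = {"assembly": {"composition"}, "document": {"text"}}


def inherent_keywords(marked_up_text, top_n=3):
    """Compute a keyword list directly from the document (inherent information)."""
    plain = re.sub(r"<[^>]+>", " ", marked_up_text)  # strip markup tags
    words = re.findall(r"[a-z]+", plain.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    # Sort by descending frequency, then alphabetically, for a stable result.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_n]]


def expand_with_thesaurus(keywords, thesaurus=THESAURUS):
    """Add related terms taken from supplementary information."""
    expanded = set(keywords)
    for word in keywords:
        expanded |= thesaurus.get(word, set())
    return expanded


# Assumed sample document with made-up element names.
doc = (
    "<sec><title>Document assembly</title>"
    "<p>Assembly of a document from document fragments.</p></sec>"
)
keywords = inherent_keywords(doc)
print(keywords)
print(sorted(expand_with_thesaurus(keywords)))
```

The keyword list is derived from the document alone, whereas the expansion step depends on a resource stored outside it, which is the separation the paragraph above describes.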

Automated assembly consists of three phases:

  1. The user expresses his demands.
  2. Appropriate documents or document fragments are found and returned.
  3. The returned fragments are merged into a single uniform assembled document.

The result is presented to the user on screen or on paper. The World-Wide Web is an important presentation medium, and the Java language further increases the possibilities for manipulating the presented documents.
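The three phases can be sketched as a small Python pipeline. This is only an illustration under stated assumptions: the fragment collection, its keyword sets, the element names in the output and all function names are hypothetical, and keyword overlap is a stand-in for real retrieval.

```python
import re

# Hypothetical collection: each fragment carries keywords used for retrieval.
COLLECTION = [
    {"id": "f1", "keywords": {"sgml", "markup"}, "text": "SGML defines markup languages."},
    {"id": "f2", "keywords": {"java"}, "text": "Java applets run in browsers."},
    {"id": "f3", "keywords": {"sgml", "assembly"}, "text": "Fragments can be assembled."},
]


def express_demands(query):
    """Phase 1: turn the user's request into a set of query terms."""
    return set(re.findall(r"[a-z]+", query.lower()))


def find_fragments(terms, collection):
    """Phase 2: return fragments sharing at least one keyword with the query."""
    return [frag for frag in collection if frag["keywords"] & terms]


def merge_fragments(fragments, title):
    """Phase 3: merge the fragments into one uniformly marked-up document.

    The fragment text is inserted as-is; a real system would escape it.
    """
    body = "\n".join(f'  <p source="{f["id"]}">{f["text"]}</p>' for f in fragments)
    return f"<doc>\n  <title>{title}</title>\n{body}\n</doc>"


terms = express_demands("SGML assembly")
hits = find_fragments(terms, COLLECTION)
assembled = merge_fragments(hits, "SGML overview")
print(assembled)
```

Keeping the phases as separate functions means the retrieval step could be replaced, for instance by one that also consults supplementary information, without changing how demands are expressed or how fragments are merged.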

Goals of the Project

The goals of the SID project are

In particular, we want the assembled document to be a finished product.

As a basis for the project we consider structured documents marked up according to the Standard Generalized Markup Language (SGML), which is an ISO standard for defining document markup languages.

The project combines methods and tools from fields such as structured-document management, information retrieval, pattern matching, machine learning, data mining, and distributed systems. When dealing with documents in morphologically rich languages such as Finnish, natural language processing is also vital to the success of document assembly.



Supporting companies

The supporting companies of the SID project are leading Finnish enterprises involved in electronic printing. The companies are as follows:


The following researchers take part in the project:



Department of Computer Science
P.O. Box 26 (Teollisuuskatu 23)
FIN-00014 University of Helsinki

Phone: +358 9 70 851
Fax: +358 9 7084 4441


You can contact any of us by email. Questions and inquiries concerning the project are most suitably emailed to project manager Pekka Kilpeläinen at Pekka.Kilpelainen@cs.Helsinki.FI.

Oskari Heinonen, SID Project, DocMan Group, March 25, 1996. Last updated November 19, 1997.