University of Helsinki Department of Computer Science
 

Department of Computer Science

Department information

 

Nested Text-Region Algebra

Jani Jaakkola, and Pekka Kilpeläinen: Nested Text-Region Algebra. Report C-1999-2, Department of Computer Science, University of Helsinki, January 1999. 24 pages. <http://www.cs.helsinki.fi/TR/C-1999/2>

Full paper: gzip'ed Postscript file
Metadata: XML file

Abstract

So called region algebras operating on sets of text fragments have been proposed and implemented as query languages for text documents. Text documents often comprise nested regions like lists within lists or procedures within procedures. Earlier versions of region algebra do not support querying nested regions. We address this deficiency by proposing a new, unrestricted region algebra. The new algebra allows dynamic definition of nested regions. This makes it suitable for querying without any preprocessing documents, whose hierarchical structure is indicated by embedded markup. We demonstrate that this nested region algebra can be efficiently implemented, by presenting and analyzing algorithms for its operations. Especially, we show that any fixed nested region algebra expression on text of length n can be evaluated in the worst case in time O(n2), and in practice in linear time. Nested region algebra has been implemented in a publicly available Unix text search tool called sgrep.

Index Terms

Categories and Subject Descriptors:
H.2.3 [Database management]: Languages
H.3.3 [Information storage and retrieval]: Information search and retrieval
I.7.1 [Document and text processing]: Document and Text Editing

General Terms: Algorithms, Languages

Additional Key Words and Phrases: text searching, structured documents, SGML, XML


Online Publications of Department of Computer Science, Anna Pienimäki