Nested Text-Region Algebra
Jani Jaakkola, and Pekka Kilpeläinen: Nested Text-Region Algebra. Report C-1999-2, Department of Computer Science, University of Helsinki, January 1999. 24 pages. <http://www.cs.helsinki.fi/TR/C-1999/2>
Full paper: gzip'ed Postscript file
So called region algebras operating on sets of text fragments have been proposed and implemented as query languages for text documents. Text documents often comprise nested regions like lists within lists or procedures within procedures. Earlier versions of region algebra do not support querying nested regions. We address this deficiency by proposing a new, unrestricted region algebra. The new algebra allows dynamic definition of nested regions. This makes it suitable for querying without any preprocessing documents, whose hierarchical structure is indicated by embedded markup. We demonstrate that this nested region algebra can be efficiently implemented, by presenting and analyzing algorithms for its operations. Especially, we show that any fixed nested region algebra expression on text of length n can be evaluated in the worst case in time O(n2), and in practice in linear time. Nested region algebra has been implemented in a publicly available Unix text search tool called sgrep.
Categories and Subject Descriptors:
General Terms: Algorithms, Languages
Additional Key Words and Phrases: text searching, structured documents, SGML, XML
Online Publications of Department of Computer Science, Anna Pienimäki