Extending a DBMS for Sequence Querying

Gösta Grahne Raul Hakli Matti Nykänen Esko Ukkonen
Department of Computer Science
P.O. Box 26, FIN-00014 University of Helsinki, FINLAND
{grahne,hakli,mnykanen,ukkonen}@cs.helsinki.fi

0 Introduction

We have extended a database management system with a sequence language able to express queries based on the structures of the sequences. Our system is being developed for querying and processing of genetic information represented as sequences of symbols.

1 Querying and constructing

Applications like genome mapping require high level database management support and methods for exploring the combinatorial structure of genetic sequences. Compared to the other data management tools developed for sequence processing, our query language is much more powerful yielding full Turing computability. For example, it allows the user to declaratively express properties such as a sequence being a palindrome, or that one sequence occurs in another, or that one sequence is an inversion of another. In addition to the data extracting capabilities, our system also offers versatile possibilities for creating new sequences from the old ones. It is for instance possible to construct a new sequence by concatenating or shuffling existing sequences. As an example we consider concatenation of two sequences as shown in Figure 1.

2 Alignment Calculus

Our database language is based on Alignment Calculus [1], an extension of relational calculus using modal logic and capable of expressing properties of individual sequences as well as properties relating sequences to each other. These properties are expressed using both modal operators and non-modal formulae. The logic is state-based like for example temporal and dynamic logic. In a state, called an alignment, the sequences are aligned on top of each other in a certain way. Figure 2 illustrates several different alignments of sequences aat, gc and gcaat. Properties of an alignment are expressed with respect to the vertical window. For instance, in the second alignment of Figure 2 the following proposition is true: "window position of the middle sequence and the bottom sequence are equal". On the other hand, the following proposition is false in that alignment: "the window position of the topmost sequence is different from t, and the window position of the middle sequence equals c". Alignments are connected to each other through state transitions called transposes. A transpose says that certain rows in the alignment should be shifted one position to the left or to the right. Transposes can be used to examine the structures of the sequences as shown in Figure 2. The middle sequence and the bottom sequence are gradually slided to the left, and on each step the equality of their window positions is verified. When the end of the middle sequence is reached, the topmost sequence starts sliding to the left with the bottom sequence in the same manner the middle sequence did. If the equality conditions hold in every state and all of the sequences reach the end, the bottom sequence is a concatenation of the other two. Formally the property that sequence z is a concatenation of sequences y and x is expressed in the following way:

3 Sequence language

The expressions of Alignment Calculus can be embedded in queries specified in relational calculus. In addition to the formal calculus we have defined a user-friendly version which is being used in the practical implementation. The logical expression above can be written using our sequence language as shown in Figure 3. The first statement means that sequences x and y can be repeatedly shifted one position to the left provided that the resulting window positions are equal. The second statement says the same thing about sequences x and z. The last statement verifies that all the sequences have reached the end. If all these three statements can be applied to sequences x, y and z, sequence z must be concatenation of sequences y and x. In addition to the transposes of the sequences the syntax of the sequence language includes structures for repetitional selection.

4 Computation

In order to procedurally interpret expressions in the sequence language we use finite state acceptors to evaluate the statements dealing with sequences, while the purely relational part of the query is handled by the query language of the database management system. The expressions of the sequence language are used to construct a transition table for a multitape finite state acceptor. The tapes of the acceptor contain the sequences involved in the expression and the evaluation is carried out by running the acceptor. Figure 4 shows a 3-tape finite state acceptor which corresponds to the expressions above. In our prototype we are using the Postgres95 [2] database management system developed at the University of California at Berkeley. The query language is Postgres SQL extended with the sequence language. The sequence language compiler and other necessary functions are implemented with the C programming language. As a testbed we are using a simplified version of a genome sequence database. The database schema includes relations for sequence data, organism species, and comments.

References

[1] G. Grahne, M. Nykänen, and E. Ukkonen. Reasoning about Strings in Databases. In ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 303-312, 1994. URL: http://www.cs.helsinki.fi/~grahne/pods94.ps. Extended version: http://www.cs.helsinki.fi/~grahne/fullpods94.ps.
[2] A. Yu and J. Chen. The POSTGRES95 User Manual. University of California at Berkeley, California, 1995.



hakli@cs.helsinki.fi