University of Helsinki Department of Computer Science
 

Department of Computer Science

Department information

 

Finding a Good Collection of Patterns Covering a Set of Sequences

Alvis Brazma, Esko Ukkonen, and Jaak Vilo: Finding a Good Collection of Patterns Covering a Set of Sequences. Report C-1995-60, Department of Computer Science, University of Helsinki, December 1995. 22 pages. <http://www.cs.helsinki.fi/TR/C-1995/60>

Full paper: gzip'ed Postscript file
Metadata: XML file

Abstract

We consider a problem of learning of unions of pattern languages from positive examples. We consider three different classes of patterns - regular patterns, substring patterns and the so called PROSITE patterns. By regular patterns we understand patterns where each variable symbol can appear only once. By substring patterns we understand a subclass of regular patterns of the type xffy, where x and y are variables and ff is a string of constant symbols. The PROSITE patterns is a class of patterns used for classification of bio-sequences in PROSITE database. We present an algorithm which, given a set of sequences, finds a `good' collection of patterns `covering' this set. The notion of a `good covering' is defined as the most probable collection of patterns likely to produce the examples in some simple and natural probabilistic model. We show that this criterion is equivalent to the so called Minimum Description Length (MDL) principle. We present a polynomial-time algorithm for approximating the optimal cover within a logarithmic factor and prove its performance guarantees. In the case of substring patterns the running time of the algorithm is almost linear.

Index Terms

Categories and Subject Descriptors:
I.2.6, F.2.2, J.3

General Terms:

Additional Key Words and Phrases: Pattern discovery, strings, MDL, PROSITE patterns, union of patterns, weighted set cover


Online Publications of Department of Computer Science, Anna Pienimäki