Department of Computer Science, P.O Box 26 (Teollisuuskatu 23)
FIN-00014 University of Helsinki, Finland.
Abstract. Approximate string matching problem is to find all
occurrences of a pattern of size m in a text of size n allowing at most
k errors (insertions, deletions, and substitions). Dynamic programming
can be applied to the problem, providing a O(mn) search method in the worst
case, and O(kn) in the average case. Using bit-parallelism, the dynamic
programming algorithms can be improved by a factor w, where w is the length
of the computer word. Another way to speed up approximate string matching
is indexing; a static text can be preprocessed by constructing an index
structure for the text such that queries of its context can be answered
quickly. Exact string matching queries can be answered in O(m) time using
a suffix tree as an index structure, but a similar straightforward
method for approximate string matching requires a depth-first search on
suffix tree (limited to the level m+k+1), and has time complexity O(c^(m+k)),
where c is the size of the alphabet.
In this thesis, two new bit-parallel algorithms are examined and compared. Modifications for both algorithms are presented such that they can be used in depth-first search based queries on suffix indices, like suffix tree, suffix cactus, and suffix array. The speed up against the basic strategy (depth-first search combined with O(mn) dynamic programming algorithm) is experimentally verified.
The use of bit-parallel algorithms do not ease the problem with depth-first search; for long patterns and high error levels, searching may be faster with online-algorithms. Pattern partitioning can be used to overcome this problem. Pattern can be partitioned into smaller pieces so that each piece is searched with fewer errors. All the occurrences of these pieces are checked against the pattern, and the matching ones are reported. This has the advantage that the depth-first search is limited earlier, but then again there are more occurrences to be checked. This is a tradeoff-problem, where the number of pattern partitions should be chosen to minimize the overall cost of the search.
In this thesis, a new method to estimate the optimal pattern partitioning is developped, based on the statistics of the text obtained from the suffix tree combined with open form formulas for the probability of an approximate match. Comparison against an estimate, that is based on the asymtotic behaviour of the probability of an approximate match, shows that the new method is noticeably more accurate. Still, the choice of the right partitioning is shown to be crucial to gain a significant speed up against online-algorithms, because often only one partitioning stands out from the rest. The estimates used in the thesis are still too inaccurate to predict the right partitioning every time. However, the thesis shows that the use of text statistics obtained from the suffix tree is a promising strategy, and should be the subject of a more extensive study.