Speeding up approximate string matching using index structures

Speeding up approximate string matching using index structures (Master's thesis in Finnish)

Veli Mäkinen

vmakinen@cs.Helsinki.FI
Department of Computer Science, P.O Box 26 (Teollisuuskatu 23)
FIN-00014 University of Helsinki, Finland.

Abstract. Approximate string matching problem is to find all occurrences of a pattern of size m in a text of size n allowing at most k errors (insertions, deletions, and substitions). Dynamic programming can be applied to the problem, providing a O(mn) search method in the worst case, and O(kn) in the average case. Using bit-parallelism, the dynamic programming algorithms can be improved by a factor w, where w is the length of the computer word. Another way to speed up approximate string matching is indexing; a static text can be preprocessed by constructing an index structure for the text such that queries of its context can be answered quickly. Exact string matching queries can be answered in O(m) time using a suffix tree as an index structure, but a similar straightforward method for approximate string matching requires a depth-first search on suffix tree (limited to the level m+k+1), and has time complexity O(c^(m+k)), where c is the size of the alphabet.
    In this thesis, two new bit-parallel algorithms are examined and compared. Modifications for both algorithms are presented such that they can be used in depth-first search based queries on suffix indices, like suffix tree, suffix cactus, and suffix array. The speed up against the basic strategy (depth-first search combined with O(mn) dynamic programming algorithm) is experimentally verified.
    The use of bit-parallel algorithms do not ease the problem with depth-first search; for long patterns and high error levels, searching may be faster with online-algorithms. Pattern partitioning can be used to overcome this problem. Pattern can be partitioned into smaller pieces so that each piece is searched with fewer errors. All the occurrences of these pieces are checked against the pattern, and the matching ones are reported. This has the advantage that the depth-first search is limited earlier, but then again there are more occurrences to be checked. This is a tradeoff-problem, where the number of pattern partitions should be chosen to minimize the overall cost of the search.
    In this thesis, a new method to estimate the optimal pattern partitioning is developped, based on the statistics of the text obtained from the suffix tree combined with open form formulas for the probability of an approximate match. Comparison against an estimate, that is based on the asymtotic behaviour of the probability of an approximate match, shows that the new method is noticeably more accurate. Still, the choice of the right partitioning is shown to be crucial to gain a significant speed up against online-algorithms, because often only one partitioning stands out from the rest. The estimates used in the thesis are still too inaccurate to predict the right partitioning every time. However, the thesis shows that the use of text statistics obtained from the suffix tree is a promising strategy, and should be the subject of a more extensive study.