Matching Numeric Strings under Noise

Veli Mäkinen, Gonzalo Navarro and Esko Ukkonen


Abstract. Numeric string is a sequence of symbols from an alphabet Sigma subset of U, where U is some numerical universe closed under addition and subtraction. Given two numeric strings A=a(1)...a(m) and B=b(1) \cdots b(n) and a distance function d(A,B) that gives the score of the best (partial) matching of A and B, the transposition invariant distance is min {d(A+t,B), t in U}, where A+t = (a(1)+t)(a(2)+t)...(a_(m)+t). The corresponding matching problem is to find occurrences j of A in B where d(A+t,B(j'...j)) is smaller than some given threshold and B(j'...j) is a substring of B. In this paper, we give efficient algorithms for matching numeric strings --- with and without transposition invariance --- under noise; we consider distance functions d(A,B) such that symbols a in A and b in B can be matched if |b-a| <= delta, or the kappa largest differences |b-a| can be discarded.