Hashing and indexing: succinct data structures and smoothed analysis

Event type: 
Guest lecture
Event time: 
02.12.2014 - 14:00 - 15:00
Lecturer : 
Nicola Prezza
Exactum C124

In this talk I will describe how we tackled the problem of indexing a text T with a light data structure in order to support fast approximate search of patterns P under the Hamming distance. Our strategy employs two classes of hash functions - dubbed Hamming-aware and de Bruijn - to drastically reduce search space and memory footprint of the index, respectively. Let m=|P|, n=|T|, σ be the alphabet size, and k be maximum number of allowed errors. Hamming-aware hash functions are distance-preserving maps that permit to "squeeze" the k-radius Hamming ball centered in the pattern P to a O(k)-radius Hamming ball centered in the hash of the pattern h(P). de Bruijn hash functions are homomorphisms on de Bruijn graphs and permit to reduce space occupancy of the hash index from O(n log n) bits to n log σ + o(n log σ) bits with no significant slowdown in the lookup operation (whose cost increases from O(1) to O(log m)).  We used our succinct hash data structure to solve the k-mismatch search problem in 2n log σ + o(n log σ) bits of space with a randomized algorithm having smoothed complexity O((2σ)^k (log n)^k (log m + ξ) + (occ + 1) · m), where ξ is a term depending on m, n, and on the amplitude 'epsilon' of the noise perturbing text and pattern. Most importantly, we obtained that for any 'epsilon' > 0, for m large enough, ξ ∈ O(log m): this results improves upon previous linear-space solutions of the k-mismatch problem.

Short bio - Nicola Prezza is a PhD student in computer science at the university of Udine (Italy). His advisor is prof. Alberto Policriti. Nicola did his udergraduate studies at the university of Udine and at the university of Southern Denmark (Odense), and graduated in computer science at the university of Udine with thesis projects related to succinct text indexing and string algorithms for the analysis of epigenetic data. His current research is focused on succinct hashing, compressed text indexes, Burrows-Wheeler transform construction algorithms, and DNA methylation analysis algorithms. During his master degree and PhD he collaborated with the Institute of Applied Genomics (IGA, Udine), and with SciLifeLab (Stockholm) on projects related to differential DNA methylation analysis.

28.11.2014 - 12:27 Fabio Cunial
28.11.2014 - 12:27 Fabio Cunial