ALFALFA: a Novel Algorithm for Long Fragment Mapping and Alignment using Maximal Exact Matches

Event type: 
Guest lecture
Event time: 
05.02.2013 - 14:15 - 15:00
Lecturer : 
Michaël Vyverman
Exactum, B222

Abstract: Mapping and alignment of sequencing reads against reference sequences is a crucial step in many genome analysis pipelines [1]. Due to the continuous growth in sequencing data, the development of fast and accurate read mapping algorithms has received much attention in recent years. However, for historical reasons most algorithms focus on short reads and are not well equipped for the longer reads produced by a growing number of sequencing platforms. We developed ALFALFA, a new Algorithm for Long Fragment Alignment. It is based on the classical seed-and-extend approach and utilizes maximal exact matches as seeds. Benchmark results on real and simulated data show a practical speed-memory-accuracy trade-off. ALFALFA is faster and more accurate compared to other heuristic long read mappers, even in the presence of read errors. The maximal exact matches are computed using essaMEM [2], a tool for finding maximal exact matches that can be used in genome comparison and read mapping. essaMEM enhances an existing sparse suffix array implementation with a sparse child array. Tests indicate that the enhanced algorithm for finding maximal exact matches is much faster, while maintaining the same memory footprint. In this way, sparse suffix arrays remain competitive with the more complex compressed suffix arrays.

[1] Vyverman, M., De Baets, B., Fack, V., and Dawyndt, P. (2012). Prospects and limitations of full-text index structures in genome analysis. Nucleic Acids Research, 40(15), 6993–7015.

[2] Vyverman, M., De Baets, B., Fack, V., and Dawyndt, P. (2013). essaMEM: Finding Maximal Exact Matches Using Enhanced Sparse Suffix Arrays. Bioinformatics, to appear.

Bio: Michaël Vyverman is a PhD student in Computer Science at Ghent University (UGent), Belgium. In 2008, he received his Bachelor degree in Mathematics at Ghent University, followed by the degree of Master of Mathematical Informatics in 2010, also at Ghent University. Currently, he is part of the research group Computational Biology at the Department of Applied Mathematics and Computer Science at UGent. The topic of his PhD research is "Advanced and Distributed Index Structures for Genome Analysis". His supervisors are Prof. Dr. Peter Dawyndt and Prof. Dr. Veerle Fack from the Department of Applied Mathematics and Computer Science (UGent) and Prof. Dr. Bernard De Baets from the Department of Mathematical Modelling, Statistics and Bioinformatics (UGent). He is also affiliated to the Multidisciplinary Research Partnership “From Nucleotides to Networks (N2N)”. His research interests include (compressed) index structures, algorithms on strings and sequence and other algorithms for genome analysis.

25.01.2013 - 16:09 Veli Mäkinen
25.01.2013 - 16:09 Veli Mäkinen