Daniel Valenzuela defends his PhD thesis on Algorithms and Data Structures for Sequence Analysis in the Pan-Genomic Era on June 9th, 2017

M.Sc. Daniel Valenzuela will defend his doctoral thesis Algorithms and Data Structures for Sequence Analysis in the Pan-Genomic Era on Friday the 9th of June 2017 at 12 o'clock noon in the University of Helsinki Exactum Building, Auditorium CK112 (Gustaf Hällströmin katu 2b). His opponent is Directeur de recherche Gregory Kucherov (Le Centre national de la recherche scientifique, France), and custos Professor Veli Mäkinen (University of Helsinki). The defence will be held in English.

Algorithms and Data Structures for Sequence Analysis in the Pan-Genomic Era

This thesis is motivated by two important processes in bioinformatics, namely variation calling and haplotyping. The contributions range from basic algorithms for sequence analysis, to the implementation of pipelines to deal with real data.

Variation calling characterizes an individual's genome by identifying how it differs from a reference genome. It uses reads -- small DNA fragments -- extracted from a biological sample, and aligns them to the reference to identify the genetic variants present in the donor's genome. A related procedure is haplotype phasing. Sexual organisms have their genome organized in two sets of chromosomes, with equivalent functions. Each set is inherited from the mother and the father respectively, and its elements are called haplotypes. The haplotype phasing problem is, once genetic variants are discovered, to attribute them to either of the haplotypes.

The first problem we consider is to efficiently index large collections of genomes. The Lempel-Ziv compression algorithms is a useful tool for this. We focus on two of its exponents, namely the RLZ and LZ77 algorithms. We analyze the first, and propose some modifications to both, to finally develop a scalable index for large and repetitive collections.

Then, using that index, we propose a novel pipeline for variation calling to replace the single reference by thousands of them. We test our variation calling pipeline on a mutation-rich subsequence of a Finnish population genome. Our approach consistently outperforms the single-reference approach to variation calling.

The second part of this thesis revolves around the haplotype phasing problem. First, we propose a generalization of sequence alignment for diploid genomes. Next we extend this model to offer a solution for the haplotype phasing problem in the family-trio setting (that is, when we know the variants present in an individual and in her parents). Finally, in the context of an existing read-based approach to haplotyping, we go back to basic algorithms, where we model the problem of pruning a set of reads aligned to a reference as an interval scheduling problem. We propose a exact solution that runs in subquadratic time and a 2-approximation algorithm that runs in linearithmic time.

Availability of the dissertation

An electronic version of the doctoral dissertation is available on the e-thesis site of the University of Helsinki at http://urn.fi/URN:ISBN:978-951-51-3231-4.

Printed copies will be available on request from Daniel Valenzuela: daniel.valenzuela@helsinki.fi.

 

24.05.2017 - 12:24 Pirjo Moen
23.05.2017 - 16:36 Pirjo Moen