Genome-scale algorithmics (GSA)

Summer internship positions

Diploid to Haploid alignment

Supervisor: Veli Mäkinen

To evaluate performance of variant calling with high-throughput sequencing reads, the following experiment can be created: Generate an artificial diploid genome from a reference haploid genome applying common frequent variations reported in databases, and some novel ones. Simulate sequencing reads, i.e. fragments of DNA, from the artificial diploid genome. Variant caller is supposed to align the reads to the reference haploid genome, and based on differences voted by several reads, to discover what were the artificially created variations. Due to invariances, slightly mispositioned indels, etc., evaluating the precision and recall is not straightforward. One approach to tackle this problem is to align the predicted haploid genome (created by applying predicted variations on top of the haploid reference) to the artificial diploid genome. Optimal alignment should be able to handle the invariances and slightly mispositioned indels. The topic of the internship is to evaluate this approach and develop an optimal or near optimal genome-scale diploid to haploid alignment method. One can start from known results on haploid-haploid alignment for which fast methods exist whose running time depends on the distance between the two genomes, which is also the property to be exploited here.


Reconstructing phylogenies with string compressors

Supervisor: Fabio Cunial

Alignment-free methods reconstruct the tree of life of known organisms by analyzing the repertoire and frequency of their substrings of fixed length. Specifically, a genome is represented as a vector whose components are all the possible substrings of length $k$, and the distance between two genomes is estimated using standard notions of dissimilarity between vectors. Reducing the size of these vectors has obvious effects in practice (it speeds up distance computation and it limits space), but it would also have nontrivial implications on the nature of the minimal evolutionary signal carried by genomes. In this project, for the first time, you will use the words selected by substitutional string compression algorithms (like LZ77, LZW and their variants) to reconstruct the tree of life from known genomes. You will also hack with existing algorithms and adapt them to discover more general types of repetition.

 

Succinct Colored De Bruijn Graphs

Supervisor: Veli Mäkinen

Finding variations between two genomes can be approached using De novo assembly with colored de Bruijn graphs [1]. With uncompressed colored / coverega annotated de Bruijn graph, such approach suffers from requiring enermous amount of RAM. However, there exist a succinct representation for de Bruijn graphs [2]. The goal of the project is to implement the succinct de Bruijn graph and explore ways to compress colored / coverage annotated de Bruijn graphs on real read data sets, to enable more space-efficient approach for De novo assembly -based variant calling.

[1] Z. Iqbal, M. Caccamo, I. Turner, P. Flicek, G. McVean. De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat Genet. 2012 Jan 8;44(2):226-32. doi: 10.1038/ng.1028.

[2] A. Bowe, T. Onodera, K. Sadakane, and T. Shibuya. Succinct de Bruijn Graphs. In Proc. WABI 2012. Springer, LNCS 7534, pp. 225-235, 2012.

15.02.2013 - 17:23 Veli Mäkinen
25.01.2013 - 16:31 Veli Mäkinen