Indexing Paths in Genome Graphs

Event type: 
Guest lecture
Event time: 
13.12.2017 - 13:15 - 14:00
Lecturer : 
Jouni Sirén
Exactum B222

Reference genomes provide a prior to guide our interpretation of DNA sequence data. However, if we use a single reference sequence, our interpretations are biased towards that sequence. Genome graphs, which collapse multiple sequences into a graph, provide a richer reference model. Graphs can reduce the reference bias significantly, if we choose the variation we include in them carefully.

The positional Burrows-Wheeler transform (PBWT) represents individual genomes relative to a reference sequence. Each haplotype is encoded as a binary sequence over variant sites. At each site, the sequence tells whether the haplotype contains the reference allele or the alternate allele. The sequences are stored in a structure related to the FM-index, providing fast search for haplotypes with a given combination of adjacent alleles. Because the haplotypes are often very similar, run-length encoding compresses the index well. Hence the PBWT is also a space-efficient representation for large collections of individual genomes.

In the genome graph context, haplotypes correspond to paths in the graph. The binary sequences become sequences of node identifiers. PBWT has been extended store haplotypes over acyclic graphs. In this talk, I will present a conceptual simplification that makes the extension work with cyclic graphs. I will also discuss the algorithmic ideas that make the implementation scalable to thousands and potentially to tens of thousands of individual genomes.

About the speaker:

Jouni Sirén develops space-efficient data structures for large collections of biological sequence data. He received his PhD from the University of Helsinki in 2012 and worked as a postdoc at the University of Chile and the Wellcome Trust Sanger Institute. He will soon start as a research scientist at the UCSC Genomics Institute.

01.12.2017 - 17:18 Veli Mäkinen
01.12.2017 - 17:18 Veli Mäkinen