Department of Computer Science

Haplotype-aware error correction with a coloured de Bruijn graph

The genome of an organism can be investigated with DNA sequencing. Current DNA sequencing technologies cannot read a whole genome in one go; instead they produce sets of short reads, i.e. randomly sampled substrings of the genome. Sequencing is also imperfect: the reads contain errors such as substitutions, insertions, and deletions. Typically, enough reads are produced to cover the genome several times. This redundancy also helps to mitigate sequencing errors, because each genomic position is sequenced several times.
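
To illustrate why repeated coverage helps, the following minimal C++ sketch counts k-mers (substrings of a fixed length k) across a set of reads. Counting k-mers is an assumption made here for illustration, not a method described in this text, and the function name count_kmers is hypothetical. The intuition is that a k-mer from the true genome tends to appear in several reads, while a k-mer created by a sequencing error tends to be rare.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Count how often each k-mer (substring of length k) occurs across all reads.
// Hypothetical helper for illustration; reverse complements are ignored.
std::unordered_map<std::string, int> count_kmers(
    const std::vector<std::string>& reads, std::size_t k) {
    std::unordered_map<std::string, int> counts;
    for (const std::string& read : reads) {
        // Skip reads that are too short to contain a single k-mer.
        if (k == 0 || read.size() < k) continue;
        for (std::size_t i = 0; i + k <= read.size(); ++i) {
            ++counts[read.substr(i, k)];
        }
    }
    return counts;
}
```

For example, given the reads ACGTAC, CGTACG, GTACGG and an erroneous read CGTTCG with k = 4, the k-mer GTAC is counted three times while CGTT is counted only once.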

Error correction aims to correct the sequencing errors in reads prior to further processing. Current tools struggle to differentiate between sequencing errors and genetic variants, especially with low-coverage data. Pooling similar samples (e.g. samples from different human individuals from the same population) increases coverage but also introduces more genetic variants. We have previously developed an error correction method that uses a de Bruijn graph to correct sequencing data from a single sample. In this project you will extend this method to multi-sample data by using a coloured de Bruijn graph.
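
As a rough illustration of the data structure involved, the sketch below shows one simple way a coloured de Bruijn graph could be represented in C++. It is not the existing method or its implementation. The class name ColouredDeBruijnGraph and its member functions are hypothetical, and several simplifying assumptions are made: nodes are k-mers stored as strings, each node keeps one occurrence count per sample (its "colours"), an edge is assumed wherever two stored k-mers overlap by k-1 characters, and reverse complements are ignored.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical minimal coloured de Bruijn graph: nodes are k-mers, and each
// node stores one occurrence count per sample ("colour").
class ColouredDeBruijnGraph {
public:
    ColouredDeBruijnGraph(std::size_t k, std::size_t num_samples)
        : k_(k), num_samples_(num_samples) {}

    // Add every k-mer of a read, crediting it to the given sample.
    void add_read(const std::string& read, std::size_t sample) {
        if (k_ == 0 || sample >= num_samples_ || read.size() < k_) return;
        for (std::size_t i = 0; i + k_ <= read.size(); ++i) {
            std::vector<int>& counts = nodes_[read.substr(i, k_)];
            if (counts.empty()) counts.assign(num_samples_, 0);
            ++counts[sample];
        }
    }

    // Number of times a k-mer was seen in one sample (0 if never seen).
    int count(const std::string& kmer, std::size_t sample) const {
        auto it = nodes_.find(kmer);
        if (it == nodes_.end() || sample >= num_samples_) return 0;
        return it->second[sample];
    }

    // Number of times a k-mer was seen across all samples.
    int total(const std::string& kmer) const {
        int sum = 0;
        for (std::size_t s = 0; s < num_samples_; ++s) sum += count(kmer, s);
        return sum;
    }

    // Stored k-mers whose first k-1 characters equal the last k-1 of kmer.
    std::vector<std::string> successors(const std::string& kmer) const {
        std::vector<std::string> result;
        if (kmer.size() != k_ || k_ == 0) return result;
        for (char base : std::string("ACGT")) {
            std::string next = kmer.substr(1) + base;
            if (nodes_.count(next) > 0) result.push_back(next);
        }
        return result;
    }

private:
    std::size_t k_;
    std::size_t num_samples_;
    std::unordered_map<std::string, std::vector<int>> nodes_;
};
```

With per-sample counts, a branch in the graph that is well supported in one sample but absent from the others looks different from a branch that has very low counts everywhere. This is one way the colours might help separate genetic variants from sequencing errors; the actual criteria are left to the project.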

The project will be implemented in C++, and knowledge of algorithms and data structures is required. Knowledge of bioinformatics is beneficial but not necessary, as the required biological background can be learned at the beginning of the project.

More information: Leena Salmela