Algorithms for DNA sequencing data
The genome of an organism can be investigated with DNA sequencing. DNA sequencing breaks the genome into small fragments and reports the nucleotide sequences of these fragments, i.e. substrings of the genome. We develop data structures and algorithms for analysing this kind of sequencing data. Possible topics for the summer internship include (i) lossy compression of sequencing data and (ii) indexing discriminating substrings of genomic data. The actual topic will be tailored to the interests of the chosen applicant.
Programming skills and knowledge of algorithms and data structures are needed. Knowledge of biology or bioinformatics is beneficial but not necessary. The topics in this project are suitable for Master's thesis work.
More detailed descriptions of possible topics
- Lossy compression of sequencing data: DNA sequencing is not perfect, and the sequencing data contains errors, i.e. substitutions, deletions, and insertions. The noise introduced by these errors can severely degrade the compression of the data. Furthermore, the likely errors will be discarded in downstream processing anyway. Thus we propose to investigate compression methods that discard the errors already when compressing the data.
- Indexing discriminating substrings of genomic data: Suppose you are given a set of similar genomes (e.g. strains of the same species) and sequencing data. Your task is to find out which of the genomes are present in the sequencing data. A straightforward method to answer this query is to index the whole set of genomes. Here we will explore an alternative strategy where we first find substrings of the genomes that distinguish them from each other and then only index these to obtain a smaller index. This smaller index should be sufficient to answer the needed queries.
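To make the second topic concrete, the following is a minimal sketch of the idea, not the method the group uses. It assumes that discriminating substrings are fixed-length k-mers occurring in exactly one genome, and it ignores sequencing errors and reverse complements. The function names, the toy strain sequences, and the `min_hits` threshold are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def discriminating_kmers(genomes, k):
    """Map each genome name to the k-mers that occur in no other genome.

    Assumption for this sketch: a "discriminating substring" is a k-mer
    that is present in exactly one genome of the set.
    """
    owners = defaultdict(set)
    for name, seq in genomes.items():
        for kmer in kmers(seq, k):
            owners[kmer].add(name)

    unique = {name: set() for name in genomes}
    for kmer, names in owners.items():
        if len(names) == 1:
            unique[next(iter(names))].add(kmer)
    return unique


def genomes_present(reads, index, k, min_hits=1):
    """Report genomes with at least min_hits discriminating k-mer hits.

    min_hits is a hypothetical threshold; real data with sequencing
    errors would need a more careful decision rule.
    """
    kmer_to_genome = {
        kmer: name for name, unique in index.items() for kmer in unique
    }
    hits = defaultdict(int)
    for read in reads:
        for kmer in kmers(read, k):
            name = kmer_to_genome.get(kmer)
            if name is not None:
                hits[name] += 1
    return {name for name, count in hits.items() if count >= min_hits}


# Toy example: two "strains" that differ in a single position.
genomes = {"strain_A": "ACGTACGTGA", "strain_B": "ACGTACCTGA"}
index = discriminating_kmers(genomes, k=4)
reads = ["TACGTG", "ACGTAC"]
found = genomes_present(reads, index, k=4)
```

In this toy example only the k-mers that overlap the differing position end up in the index, so the index is smaller than one over all k-mers of both genomes, and the read `ACGTAC`, which is shared by both strains, contributes no evidence for either.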
Group: Algorithms for Biological Sequencing Data
Supervisor: Leena Salmela, leena.salmela@helsinki.fi