Here is a list of projects available as summer intern / Master thesis projects in the Graph Algorithms team of the Algorithmic Bioinformatics group at the University of Helsinki.

If you have no prior experience with Bioinformatics, these desriptions might be very hard to understand. That's fine! No previous Bioinformatics experience is needed (though it consitutes a plus), but you are required to get familiar with the basic concepts of the project, during your internship / thesis work.

1. Flowtigs

implementation of a novel simple genome assembly algorithm, first stand-alone and then possibly integration into a real genome assembler. The stand-alone variant shall be tested on simulated data, while the real assembler should also be run on real data. The algorithm will likely need some engineering to avoid errors in the assembly. This project requires knowledge of a programming language such as Rust, C++ or Python. Some familiarity with genomic data would be ideal, but is not necessary. To integrate the algorithm in a real assembler, knowledge of C is required as well, but this can also be done by the supervisor. Experiments will be conducted using snakemake and slurm.

UNAVAILABLE ANYMORE: 2. Polishing transcript predictions with accurate short-read RNA-Seq data

Emergence of long-read RNA sequencing substantially improved the accuracy of gene and transcript discovery. However, due to high error rate, reaching up to 15% in Oxford Nanopore reads, accuracy of individual splice junction may become an issue. In this project we propose to exploit traditional short-read Illumina sequencing to correct predictions based on the long reads. Although the idea can be quite straightforward, it may still require developing new algorithms to ensure optimal results.

This project involves analysis of sequencing data, dealing with various tools under Linux (sometimes not so user-friendly), coding in Python, learning existing code and libraries. This project will be implemented on top of the IsoQuant tool.

UNAVAILABLE ANYMORE: 3. Integer Linear Programming for viral quasispecies assembly

Minimum flow decomposition, the problem of finding a minimum set of paths that perfectly decomposes a flow, is a classical problem in Computer Science, and variants of it are potent models in multiassembly problems in Bioinformatics (e.g. RNA assembly). Many RNA assemblers also use integer linear programming (ILP) formulations of such practical variants, having the major limitation they need to encode all the potentially exponentially many solution paths.

The primary motivation for a project is multiassembly, where we seek to reconstruct multiple genomic sequences from mixed samples using short substrings (called reads) generated cheaply and accurately from next-generation sequencing technology.Major multiassembly problems involve RNA assembly and viral quasispecies assembly.

No study has been done for viral quasispecies assembly, but existing tools seek minimum-sized decompositions. However, such tools cannot guarantee minimal decomposition since the complexity of minimal flow decomposition does not accommodate large inputs, resulting in less accurate assembly and being prone to errors.

In this project, we are looking into developing a tool based on our integer linear programming models for viral quasispecies assembly. We hope that our performance using integer linear programming can remove the current tradeoff between the complexity of a multiassembly model and that it can lie at the core of future practical RNA assembly tools. Our previous work and supporting material can be found at github.com/algbio/MFD-ILP, link.springer.com/chapter/10.1007/978-3-031-04749-7_14, and arxiv.org/abs/2209.00042.

4. Effects of missing coverage in genome assembly

We would like to identify low-coverage areas in real genomes, i.e. areas that are covered by less reads than others. Then we would like to simulate missing coverage in these regions, and measure the number of errors this introduces into an omnitig-based assembly. Additionally, we would like to test an error-mitigation strategy on the simulated data, as well as possibly in a real assembler. This project requires knowledge of a programming language such as Rust, C++ or python. Some familiarity with genomic data would be ideal, but is not necessary. To implement the error-mitigation strategy into a real assembler, knowledge of C is required as well, but this can also be done by the supervisor. Experiments will be conducted using snakemake and slurm. A similar project was published recently: https://www.biorxiv.org/content/10.1101/2022.01.20.477068v2.

5. Optimal plain-text compression of k-mer sets

We have an algorithm (matchtigs) that finds a set of strings representing a set of k-mers with minimum cumulative length. This algorithm can be extended to account for properties of the file-format the strings are stored in. Alternatively (or additionally), an alternative variant of the algorithm can be implemented, exchanging a many-to-many shortest path query with a text index, which likely uses less memory and may also be faster. This project requires knowledge of a programming language such as Rust, C++ or Python. Some familiarity with text indexes or algorithmic theory would be ideal, but is not necessary. For the text index, also experience with a profiler is beneficial. Experiments will be conducted using snakemake and slurm.