Genomescale algorithmics (GSA)
GSA group
News
 Simon Puglisi selected to an Academy Research Fellow position.
 Our textbook is now out: GenomeScale Algorithm Design: Biological Sequence Analysis in the Era of HighThroughput Sequencing. Cambridge University Press, 2015. [Book webpage]
 See an interview of Djamal on the story behind his STOC 2014 paper.
 We were selected to the top 10 at Top Bioinformatics Contributions 2013.
 Jouni Sirén rewarded for his PhD thesis.
Group description
We develop algorithms and data structures for the analysis of genomescale data. Such data is abundant due to modern molecular biology measurement techniques like highthroughput sequencing. We are especially interested in applications of compressed data structures, that make it possible to analyse the often highly redundant data within the space of their information content. We also study other scalability aspects like distributed computation/storage around genomescale data.
The group is a member of Helsinki Institute for Information Technology and Finnish Centre of Excellence in Cancer Genetics Research.
Research topics
Pangenome indexing
An example of our recent developments is an extension of BurrowsWheeler transform to finite automaton representing reference genome together with its common variations among the population (pangenome). This enables a spaceefficient index structure to be constructed to support efficient read alignment to a rich model of the population. The image below shows the improved accuracy of our tool GCSA compared to BWA on highlypolymorphic areas of human genome, with Best as the theoretical upperbound.
Another way to exploit pangenomic information is to build an index directly on the multiple alignment representing individual genomes. Recently, we have developed LempelZiv indexing approaches suitable for this scenario. We are working on seamless ways to improve variant calling exploiting these indexes.
 Sirén, Välimäki, Mäkinen. Indexing Graphs for Path Queries with Applications in Genome Research. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 11(2): 375388, 2014. [Article online][Implementation]
 Ferrada, Gagie, Hirvola, Puglisi, Hybrid indexes for repetitive datasets, Philosophical Transactions of the Royal Society A, Volume 372, (2014).
 Travis Gagie and Simon J. Puglisi, Searching and indexing genomic databases via kernelization, Frontiers in Bioengineering and Biotechnology, 3(12) (2015)
Time and spaceoptimal sequence analysis
We consider the classical sequence analysis tasks such as computing maximal repeats, maximal exact matches, maximal unique matches, string kernels, etc. using space within constant factor from the information theoretical minimum and time linear in the input length independent of the alphabet size. In ESA 2013 we showed that there exists a spaceefficient variant of the bidirectional BWT index that can be used for linear time analysis, and in STOC 2014 Belazzougui showed how to build such an index in linear time and small space. That result required randomization, but in the combined extended version of these paper we show that deterministic construction and analysis is feasible within the same bounds, using a variant called unidirectional BWT index. A corollary is that suffix arrays and suffix trees can be built in linear time using sublinear extra space in the output size.

Belazzougui, Cunial, Kärkkäinen, Mäkinen. Versatile succinct representations of the bidirectional BurrowsWheeler transform. In Proc. ESA 2013.
[Article online] [Implementation of MUMS in SDSL] [BW4SA library]  Belazzougui. Linear time construction of compressed text indices in compact space. In Proc. STOC 2014.
 Cunial, Belazzougui. Indexed matching statistics and shortest unique substrings. In Proc. SPIRE 2014.
 Belazzougui, Cunial, Kärkkäinen, Mäkinen. Time and spaceoptimal string indexing and analysis. Manuscript in preparation.
Transcript assembly and quantification
The assembly of RNASeq reads can be naturally formulated graph theoretic problem of explaining a weighted DAG (directed acyclic graph) with weighted paths. In RECOMBSeq 2013 we gave a polynomialtime solution
for a variant of this problem, based on minimumcost network flow. In WABI 2013 we showed that another variant is NPhard, but that it admits an efficient solution based on dynamic programming. In a later paper (submitted) we proposed a unifying problem statement for genomeguided multiassembly problems. Moreover, we gave a more efficient algorithm for this general problem which exploits DAG decompositions based on a new graph parameter (arcwidth). In RECOMBSeq 2014 we investigated different problems which add long read, or pairedend read information to some multiassembly formulations, which are solvable by minimumcost network flow, or NPhard, respectively.
 Tomescu, Kuosmanen, Rizzi, Mäkinen. A Novel MinCost Flow Method for Estimating Transcript Expression with RNASeq. BMC Bioinformatics, 14(Suppl 5):S15 (RECOMBSeq 2013 Supplement). [Article online] [Slides] [Implementation]
 Article online] . [
 Tomescu, Gagie, Popa, Rizzi, Kuosmanen, Mäkinen. Explaining a Weighted DAG with Few Paths for Solving GenomeGuided Multiassembly. Accepted to IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2015.

Rizzi, Tomescu, Veli Mäkinen. On the Complexity of Minimum Path Cover withSubpath Constraints for MultiAssembly. BMC Bioinformatics, 15(Suppl 9), S5, 2014 (RECOMBseq 2014 supplement). [Article online] [Slides]
Recombinationaware sequence alignment
We have proposed generalizations of the traditional global alignment algorithms for the scenario in which one or two of the input genomes are represented with two strings; namely, motherinherited and fatherinherited chromosomes. These generalizations take into account that individual diploid genomes evolve through a mutation and recombination process, and that predictions through variant calling and haplotype assembly may be erroneous in both dimensions.
Our more general formulation of this alignment problem considers each genome as a DAG. A covering alignment consists of two paths in each DAG that cover all the nodes. The cost is given by the pairwise edit distance of the strings spelled by these paths. The core problem of finding the optimal covering alignment is still open, but many variants of the setting can be solved efficiently.
We are currently working in applications to evaluate and optimize variant calling and haplotyping algorithms.
 Mäkinen, Rahkola. Haploid to diploid alignment for variation calling assessment. BMC Bioinformatics 14(S15): S13, 2013. [Article online] [Implementation]
 Mäkinen, Valenzuela. Recombinationaware alignment of diploid individuals. BMC Genomics 15(Suppl 6):S15, 2014. [Article online]
 Mäkinen, Valenzuela. Diploid alignments and haplotyping. Accepted to ISBRA 2015.
Genome assembly
We have made several contributions (in collaboration especially with Leena Salmela) to the problem of assembling highthroughput sequencing reads into the originating genome. This problem is commonly split into three steps: (1) contig assembly, (2) scaffolding, and (3) gap filling (the figure below illustrates this last step).
Our contributions for (1) contig assembly, (2) scaffolding, and (3) gap filling are
 Välimäki, Ladra, Mäkinen. Approximate allpairs suffix/prefix overlaps. Information & Computation, 213:4958, 2012. CPM 2010 Special Issue. [Article online] [Implementation]
 Salmela, Mäkinen, Välimäki, Ylinen, Ukkonen. Fast Scaffolding with Small Independent Mixed Integer Programs. Bioinformatics 27(23): 32593265, 2011. [Article online]
 Salmela, Sahlin, Mäkinen, Tomescu. Gap Filling as Exact Path Length Problem. Accepted to RECOMB 2015. [Implementation]
We have also studied assembly validation:

Mäkinen, Salmela, and Ylinen. Normalized N50 Assembly Metric using GapRestricted CoLinear Chaining.
BMC Bioinformatics, 13:255 (3 October 2012). [Article online] [Implementation]
The scaffolding and validation tools have been used in a recent assembly project:
 Ahola V., Lehtonen R., Somervuo P. et al. The Glanville fritillary genome retains an ancient karyotype and reveals selective chromosomal fusions in Lepidoptera. Nature Communications, 5:4737, Sept. 2014.