# Theoretical advances for comparative genomics

The Department has started a new “**Research result of the month**” series written by Professor and Deputy Head Sasu Tarkoma. In the first result of the month report, we interview Dr. Djamal Belazzougui about his recent accepted scientific article in the prestigious STOC conference, the ACM Symposium on the Theory of Computing.

### Research result of the month: February 2014

The Department has started a new “Research result of the month” series written by Professor and Deputy Head Sasu Tarkoma. This series presents the researchers behind the latest scientific achievements, gives glimpses of scientists in action, and discusses the societal and industrial relevance of the results without forgetting the big picture.

#### Theoretical advances for comparative genomics

In the first result of the month report, we interview Dr. Djamal Belazzougui about his recent accepted scientific article in the prestigious STOC conference, the ACM Symposium on the Theory of Computing (link http://www.columbia.edu/~cs2035/stoc/stoc2014/index.shtml). The main result of the work pertains to optimal space-efficient randomized algorithms for constructing compressed suffix arrays and compressed suffix trees. The result has applications in multiple domains, in particular tools for comparative genomics. The theoretical result is already being implemented by a group of students in a software engineering project for practical impact.

Dr. Belazzougui completed an engineering degree in Algeria in 2005 and then a PhD degree in France in 2011. After visiting Denmark in the Spring of 2012, he started as a post doctoral researcher at the Department of Computer Science in the Fall of 2012 in Professor Veli Mäkinen’s group “Genome-scale Algorithmics” (link http://www.cs.helsinki.fi/en/gsa/).

To gain more insight into the latest discovery, we asked Dr. Belazzougui to tell a bit about the background. The work builds on the classic suffix array and the suffix tree structures that have been known for a very long time. “Fourteen years ago a new way was found for compressing suffix arrays and suffix trees that allowed the reduction of space by a factor of 10,” he explains. Minimizing the space of the structure is key for many applications where large data sets are examined and manipulated. For example, this breakthrough enabled the indexing of the human genome on a computer with 4 gigabytes of memory that was not possible with the earlier structures.

The latest discovery was kindled by an open problem in the early work on compressed suffix arrays and trees, namely how to quickly build the compressed structure in the same space. “This open problem was almost completely solved about ten years ago for compressed suffix arrays, but it remained open for the compressed suffix tree,” Dr. Belazzougui clarifies.

A powerful variant of the compressed suffix array, the Bidirectional Burrows-Wheeler transform, was discovered around 2010. This newer variant is superior to the previous algorithm in that it runs in linear time. Professor Veli Mäkinen and Dr. Belazzougui had an idea to apply this new variant for various sequence analysis problems. “We eventually found that most of the sequence analysis described in the literature that were previously solved using the classical suffix tree could now be solved in linear time with the Bidirectional Burrows-Wheeler transform while being more space efficient,” he explains the key insight.

*From the right: Dr. Belazzougui, Professor Veli Mäkinen and Dr. Fabio Cunial presenting and discussing the new result. *

After presenting an article on the various variants of the Bidirectional Burrows-Wheeler transform at the European Symposium on Algorithms (ESA) in 2013 (link http://link.springer.com/chapter/10.1007%2F978-3-642-40450-4_12), one major open question remained: How to build those variants of the Burrows-Wheeler transform quickly and with minimal space.

After working on this open problem, Dr. Belazzougui found a solution: “Surprisingly it turned out that the Bidirectional Burrows-Wheeler can allow to build the compressed suffix tree in small space with a very simple way”. This result would not have been possible to obtain five years ago, but building on techniques described in the last two to three years, most notably by the team of Professor Enno Ohlebusch, it was possible to solve the problem.

The scientific article builds on the past work and presents a set of optimal space-efficient randomized algorithms for constructing compressed suffix arrays and compressed suffix trees. Space-efficiency of the algorithms means that the maximum space required by them is asymptotically bounded by the size of the input.

The result has many applications in the field of string algorithms. In particular sequence analysis problems can now be solved in linear time and with space that is essentially a constant factor larger than the size of the input and output of the problem being solved.

The work solves an important open problem; however, there are still many research questions that need to be addressed. The algorithms elaborated in the scientific article are optimal with the caveat that they require randomization. The challenge of devising a deterministic worst-case linear time algorithm is still open. Implementation of the algorithms for more practical impact is another important activity. “We are implementing some of the algorithms presented in the article and in our previous work from ESA 2013,” he explains the current implementation plans.

As general advice for PhD students on making discoveries, he recommends reading articles from different topics and fields. “You may make unexpected connections that no one has made before,” he concludes.

Link to the manuscript in the ArXiv collection (link http://arxiv.org/abs/1401.0936).