Scaffolding and read pair distributions

Event type: 
03.09.2014 - 13:15 - 14:00
Kristoffer Sahlin
Exactum, C222
I will present an overview of my PhD work so far, structured in three parts. First, I will talk about accurately estimating gap sizes between contigs. The second part describes a scaffolding algorithm, and the third part is about generalising the gap size theory to other applications, e.g. structural variation.
Generalization of gap sizes:
Next Generation Sequencing (NGS) data are now commonly used for answering
various biological questions. In many applications, the insert size distribution
from paired read protocols plays an important role, for example in genome
assembly and structural variation detection. However, many of the models
that are being used suffer from bias. This bias arises when assuming that all
insert sizes within a distribution are equally likely to be observed, when in fact,
size matters. It can be shown that these systematic errors exist in popular
software even when the assumptions made about the data hold.
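
The effect can be illustrated with a small numerical sketch. It is not taken from the talk or from GetDistr; it assumes a normally distributed library, flanking contigs much longer than any insert, and ignores read length. All function names are hypothetical. Under these assumptions, an insert of size x can be placed across a gap of size g in roughly x - g - 1 ways, so longer inserts are over-represented among the pairs that span the gap.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def spanning_placements(insert_size, gap):
    """Number of start positions at which an insert covers the whole gap.

    Assumption: the insert must overlap each flanking contig by at least one
    base, both contigs are much longer than the insert, read length is ignored.
    """
    return max(0, insert_size - gap - 1)


def observed_mean(mu, sigma, gap):
    """Mean insert size among pairs that span the gap (size-weighted library)."""
    lo = max(1, int(mu - 6 * sigma))
    hi = int(mu + 6 * sigma)
    sizes = range(lo, hi + 1)
    weights = [normal_pdf(x, mu, sigma) * spanning_placements(x, gap) for x in sizes]
    total = sum(weights)
    return sum(x * w for x, w in zip(sizes, weights)) / total


def naive_gap_estimate(mu, sigma, gap):
    """Gap estimate that treats all insert sizes as equally likely to span.

    The observed mean distance from the reads to the contig ends is
    observed_mean - gap; subtracting it from the library mean gives the estimate.
    """
    return mu - (observed_mean(mu, sigma, gap) - gap)


if __name__ == "__main__":
    library_mean, library_sd = 3000, 500  # illustrative values only
    for true_gap in (0, 1000, 2000):
        print(
            true_gap,
            round(observed_mean(library_mean, library_sd, true_gap), 1),
            round(naive_gap_estimate(library_mean, library_sd, true_gap), 1),
        )
```

In this toy setting the mean of the spanning inserts exceeds the library mean, and the excess grows with the gap, so an estimator that uses the unweighted library mean underestimates the gap.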
We have previously shown that this bias occurs for scaffolders in genome assembly,
where our method was constrained to this particular application and to normally
distributed insert sizes. Here, we generalize the theory and give
examples to illustrate its potential use in different settings. We also relax the
assumption of normality and account for all insert size distributions using
non-parametric models with binning distributions. Furthermore, coverage is
introduced as an optional parameter to the model and we show how this affects
results. We provide examples of where bias occurs in state-of-the-art software,
explain why, and improve them using our model. The results are useful for
everyone working with paired read data and insert size distributions. The theory
is implemented in a tool called GetDistr. GetDistr is highly modular and easily
integrated into other software.
Kristoffer Sahlin
PhD student

Department of Computational Biology

Royal Institute of Technology
Science for Life Laboratory
School of Computer Science and Communication
01.09.2014 - 13:29 Veli Mäkinen