RNA-Seq data analysis: a combinatorial approach based on local transcriptome assembly

Event type: 
Event time: 12.06.2013 - 14:15 - 15:00
Lecturer: Francesca Nadalin


RNA-Seq data analysis is central to many kinds of experiments, e.g., epigenetic studies, RNA editing, gene-fusion detection, somatic variant classification, and tissue response to stress. Estimating gene expression is crucial in these contexts, and the tools that solve this task are among the most delicate that the bioinformatics community is called on to produce and analyse.
We propose a methodology that addresses the problem of estimating expression levels by accurately computing coverage profiles and by identifying high-quality exons. The first step aligns RNA-Seq reads against the reference genome using an indirect method that first builds a partial assembly and then retrieves the read placements. Once reads are assembled into longer sequences (contigs), these can be mapped against the reference genome (e.g., with BLAST), so that read placements can be retrieved from the contigs' layout. This approach is intended to overcome the well-known limitations of aligning short queries: when a read spans an exon/intron junction, the portion that aligns on either side can be much shorter than the NGS read itself. Finally, the coverage profile is computed from the placed reads.
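As a rough illustration of this first step, the sketch below lifts a read from contig coordinates to genome coordinates and then accumulates a per-base coverage profile. It is not the pipeline presented in the talk: the function names, the representation of a contig alignment as ungapped blocks (contig_start, genome_start, length), and the half-open coordinate convention are all assumptions made for this example.

```python
def lift_read(blocks, read_offset, read_length):
    """Map a read placed on a contig to genome intervals.

    blocks: hypothetical ungapped alignment blocks of the contig,
            each (contig_start, genome_start, length).
    The read occupies contig positions [read_offset, read_offset + read_length).
    Returns a list of half-open genome intervals (start, end).
    """
    read_end = read_offset + read_length
    intervals = []
    for contig_start, genome_start, length in blocks:
        lo = max(read_offset, contig_start)
        hi = min(read_end, contig_start + length)
        if lo < hi:  # the read overlaps this aligned block
            shift = genome_start - contig_start
            intervals.append((lo + shift, hi + shift))
    return intervals


def coverage_profile(intervals, region_length):
    """Per-base coverage over [0, region_length) from half-open intervals."""
    diff = [0] * (region_length + 1)
    for start, end in intervals:
        start = max(start, 0)
        end = min(end, region_length)
        if start < end:
            diff[start] += 1   # coverage rises where the interval begins
            diff[end] -= 1     # and falls just past its last base
    coverage, depth = [], 0
    for delta in diff[:region_length]:
        depth += delta
        coverage.append(depth)
    return coverage


# A 10-base contig whose halves align to two exons separated by an intron.
blocks = [(0, 10, 5), (5, 30, 5)]
placed = lift_read(blocks, 3, 4) + lift_read(blocks, 0, 5)
profile = coverage_profile(placed, 40)
```

In this toy case the read at contig offset 3 crosses the junction, so it contributes two genome intervals, one on each exon, even though each piece is only two bases long.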
The second step selects, for each gene, a maximum-weight, minimal-cardinality subset of exons that is sufficient to completely “explain” the expression levels of each isoform of the gene. We define, for each gene, a combinatorial problem whose variables are the expression levels of the annotated isoforms. Then we select a square sub-system of highest quality (where an exon's quality depends both on the gene's structure and on the accuracy and consistency of its coverage) and solve it to obtain the unique solution, which gives the expression level of each isoform.
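One way to picture this second step is the toy sketch below. It rests on assumptions that are not stated in the abstract: each exon's coverage is modelled as the sum of the expression levels of the isoforms containing it, the quality of a sub-system is taken to be the sum of hypothetical per-exon quality scores, and the best sub-system is found by brute-force enumeration, which is only feasible for small genes. The actual quality function and selection strategy of the method may differ.

```python
from fractions import Fraction
from itertools import combinations


def solve_square(rows, rhs):
    """Gauss-Jordan elimination on a square system; None if singular."""
    n = len(rows)
    m = [[Fraction(v) for v in row] + [Fraction(b)]
         for row, b in zip(rows, rhs)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            return None  # no unique solution for this choice of exons
        m[col], m[pivot] = m[pivot], m[col]
        pivot_value = m[col][col]
        m[col] = [v / pivot_value for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    return [m[i][n] for i in range(n)]


def estimate_isoforms(membership, coverage, quality):
    """Pick the highest-quality solvable square sub-system and solve it.

    membership[e][i] is 1 if exon e belongs to isoform i, else 0.
    coverage[e] is the observed coverage of exon e.
    quality[e] is a hypothetical per-exon quality score.
    Returns (chosen exon indices, isoform expression levels) or None.
    """
    n_isoforms = len(membership[0])
    best = None
    for subset in combinations(range(len(membership)), n_isoforms):
        solution = solve_square([membership[e] for e in subset],
                                [coverage[e] for e in subset])
        if solution is None:
            continue
        score = sum(quality[e] for e in subset)
        if best is None or score > best[0]:
            best = (score, subset, solution)
    return None if best is None else (best[1], best[2])


# Two isoforms over three exons; exon 2 has a noisy, low-quality coverage.
membership = [[1, 0], [1, 1], [0, 1]]
coverage = [10, 14, 5]
quality = [0.9, 0.8, 0.3]
chosen, expression = estimate_isoforms(membership, coverage, quality)
```

Here the two high-quality exons form a non-singular system, so the noisy third exon is left out and the isoform levels are read off from the remaining two equations.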
We will show preliminary results on simulated RNA-Seq experiments. Future work will be devoted to refining the second step of our pipeline by (i) choosing the most suitable quality function, and (ii) introducing heuristics to integrate biological evidence into the model. We plan to experiment on real data and to compare our results with state-of-the-art tools for the same problem.
BIO: Francesca Nadalin received her MSc degree in Mathematics in 2010 from the University of Udine, Italy. In 2010 she was employed at the Applied Genomics Institute in Udine. She is now a third-year PhD student in Computer Science at the University of Udine. Her research field is Bioinformatics; in particular, her work has focused on the study and development of algorithms for (local) genome assembly and for gene expression estimation.
10.06.2013 - 13:44 Alexandru Tomescu