A combinatorial and integrated method to analyse High Throughput Sequencing reads

Event type:

Guest lecture

Event time:

14.01.2011 - 10:15 - 11:00

Lecturer:

Eric Rivals

Place:

B222, Exactum

Description:

Eric Rivals, CNRS & Université Montpellier 2
http://www.lirmm.fr/~rivals
Work in collaboration with M. Salson (U. Rouen), N. Philippe and T. Commes (U. Montpellier 2).

Next-generation sequencing technologies are presently being used to answer key biological questions at the scale of the entire genome and with unprecedented depth. Whether determining genetic or genomic variations, cataloguing transcripts and assessing their expression levels, finding recurrent mutations in cancer, identifying DNA-protein interactions or chromatin modifications, or surveying the species diversity in an environmental sample, all these tasks are now tackled with High Throughput Sequencing (HTS). For genomics and transcriptomics data sets, the current paradigm for analysing large read sets consists of:
1. mapping the reads contiguously to a reference genome, allowing as many differences as one expects to be necessary to accommodate sequence errors and small polymorphisms;
2. using uniquely mapped reads to determine covered genomic regions, either for computing a local coverage to predict SNPs and filter out sequence errors (cf. program ERANGE), or for delimiting expressed exons approximately (with RNA-seq; cf. programs TopHat and GMORSE);
3. re-aligning unmapped reads, which were not mapped contiguously at step one, to reveal exon boundaries or larger indels.
As the results of approaches following this paradigm show, several pitfalls must be dealt with: mapping errors induce false predictions at later steps, indels larger than 4 bp are not handled, SNPs cannot be distinguished from sequence errors at the mapping stage, exon boundaries are located imprecisely, etc.
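
As a rough illustration of the three-step paradigm listed above, here is a minimal Python sketch. It is not the code of ERANGE, TopHat or GMORSE: the function names (hamming_hits, classic_pipeline) are hypothetical, the mapper is a naive scan that only allows substitutions, and step 3 is reduced to returning the unmapped reads.

```python
def hamming_hits(read, genome, max_mismatches):
    """Step 1 (naive): start positions where `read` aligns contiguously
    to `genome` with at most `max_mismatches` substitutions (no indels)."""
    hits = []
    for start in range(len(genome) - len(read) + 1):
        mismatches = 0
        for a, b in zip(read, genome[start:start + len(read)]):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break
        else:
            # The inner loop finished without exceeding the mismatch budget.
            hits.append(start)
    return hits


def classic_pipeline(reads, genome, max_mismatches=1):
    """Steps 2 and 3 (simplified): per-base coverage from uniquely mapped
    reads, plus the list of reads left for a later re-alignment step."""
    coverage = [0] * len(genome)
    unmapped = []
    for read in reads:
        hits = hamming_hits(read, genome, max_mismatches)
        if len(hits) == 1:
            # Only uniquely mapped reads contribute to the coverage.
            for pos in range(hits[0], hits[0] + len(read)):
                coverage[pos] += 1
        elif not hits:
            unmapped.append(read)
    return coverage, unmapped


if __name__ == "__main__":
    toy_genome = "ACGTTGCAAGGCTA"
    cov, left_over = classic_pipeline(["ACGTA", "TTTTT"], toy_genome)
    print(cov)
    print(left_over)
```

In this toy model a sequencing error and a SNP look identical at step 1, since both are simply a tolerated mismatch, which is one of the pitfalls mentioned above.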

In contrast, we have developed an exact mapper for short reads, called MPSCAN (Rivals et al. 09), and analysed its performance in detecting uniquely mapped regions as a function of tag length (Philippe et al. 09). We showed that one can estimate, depending on the genome length, a substring length k that will on average point to a single genomic location. Building on this work, we have designed a new approach to analyse today's longer reads (> 50 bp). For every k-mer along a read, we record its matching genomic positions and its number of occurrences in the reads, and then analyse these profiles jointly to determine whether the read can be mapped contiguously or to detect the various causes of alignment disruption: large indels, introns, rearrangements. In this talk, we will present this procedure and the underlying data structures, and show that it distinguishes SNPs from sequence errors and combines sensitivity and specificity in the prediction of exon boundaries, indels, and rearrangements.
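
To make the idea of a per-read k-mer profile more concrete, here is a small Python sketch under stated assumptions. It is not the procedure or the data structures presented in the talk: all function names are hypothetical, the index is a plain dictionary, and the choice of k uses a naive uniform random-sequence model (expected occurrences of a k-mer roughly genome_length / 4**k), which ignores strands and sequence composition.

```python
from collections import Counter, defaultdict


def min_unique_k(genome_length):
    """Smallest k with genome_length / 4**k < 1 (naive uniform model)."""
    k = 1
    while 4 ** k <= genome_length:
        k += 1
    return k


def build_index(genome, k):
    """Map each k-mer of the genome to the list of its start positions."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index


def count_read_kmers(reads, k):
    """Number of occurrences of each k-mer over the whole read set."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_profile(read, index, read_counts, k):
    """For each k-mer of the read: (offset in read, genomic positions,
    number of occurrences in the reads)."""
    profile = []
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        profile.append((i, index.get(kmer, []), read_counts[kmer]))
    return profile


def diagonals(profile):
    """Genome position minus read offset for uniquely located k-mers,
    None otherwise. A constant value suggests a contiguous mapping; a
    jump between two constant runs suggests an indel or an intron."""
    return [locs[0] - i if len(locs) == 1 else None
            for i, locs, _support in profile]


if __name__ == "__main__":
    genome = "ACGTTGCAAGGCTATCCGAT"
    k = 4  # chosen by hand for this toy genome
    reads = ["TGCAAGGC", genome[0:6] + genome[12:18]]  # second read skips 6 bases
    index = build_index(genome, k)
    counts = count_read_kmers(reads, k)
    for read in reads:
        print(read, diagonals(kmer_profile(read, index, counts, k)))
```

In this sketch the k - 1 k-mers that overlap a junction have no genomic location, and the diagonal changes by the length of the skipped segment. The third field of each profile entry (support in the reads) is the kind of information one could use to separate a poorly supported sequence error from a well supported SNP, although the decision rule itself is not given in the abstract.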

Related publications:

* Using reads to annotate the genome: influence of length, background distribution, and sequence errors on prediction capacity
N. Philippe*, A. Boureux*, L. Bréhèlin, J. Tarhio, T. Commes, E. Rivals
Nucleic Acids Research (NAR) doi:10.1093/nar/gkp492; 2009.
* MPSCAN: fast localisation of multiple reads in genomes
E. Rivals, L. Salmela, P. Kiiskinen, P. Kalsi, J. Tarhio
Proc. 9th Workshop on Algorithms in Bioinformatics
Lecture Notes in BioInformatics (LNBI), Springer-Verlag, Vol. 5724, p. 246-260, 2009.

Last updated: 04.01.2011 - 14:20 Veli Mäkinen
Post date: 04.01.2011 - 12:34 Veli Mäkinen

Permanent link: https://www.cs.helsinki.fi/en/node/60915


