Biological Sequence Analysis (guided self study)

582483
5
Bioinformatics
Advanced studies
The course covers selected high-throughput methods for the analysis of biological sequences, including advanced alignment methods, Hidden Markov Models, and next-generation sequencing data analysis methods. Prerequisites: Basics of bioinformatics and algorithms.
Year Semester Date Period Language In charge
2015 spring 12.01-26.02. 3-3 English Veli Mäkinen

Lectures

Time Room Lecturer Date
Mon 12-14 B222 Veli Mäkinen 12.01.2015-23.02.2015

Exercise groups

Group: 1
Time Room Instructor Date Observe
Thu 10-12 B222 Veli Mäkinen 15.01.2015-26.02.2015

General

The course covers selected high-throughput methods for the analysis of biological sequences. Topics include advanced alignment methods, algorithms around hidden Markov models, and core data structures for read alignment and genome analysis. This edition of the course is guided self study: more home study is expected, as there is one fewer lecture slot than usual. In the study groups we discuss the week's topic (the exact form of study group work depends on the number of participants). Exercises test knowledge of the study group material and its extensions to related topics. Some assignments are tailored to the student's background: a choice between deeper theory assignments for mathematically oriented students and more laborious implementation assignments for those who prefer learning by doing.

Completing the course

There is no course exam. Grading is based on activity during the course. Monday study groups are mandatory (you should attend at least 4 out of 6). Exercises determine the grade: 50% gives grade 1 and 85% gives grade 5. Solutions can be submitted by email.

Content

  • Mon 12.1 12-14. Introductory lecture: Biology primer, Markov chains, alignments, score schemes, log-odds, BLOSUM, GC-content, GC-skew, CAI. Sections 1-1.2 + [slides.ppt] [slides.pdf]
  • Thu 15.1 10-12. Exercise 1 [pdf] [solution ex6.py]
  • Mon 19.1 12-14. Study group: Dynamic programming for various alignment models + shortest detour. Sections 6.1-6.1.2, 6.3-6.4.3.
  • Thu 22.1 10-12. Exercise 2 [pdf] [solutions]
  • Mon 26.1 12-14. Study group: Invariant technique, sparse dynamic programming, affine gap model. Sections 6.2 [..Ha*], 6.4.4 [He*..M*], 6.4.5 [N*..]
  • Thu 29.1 10-12. Exercise 3 [pdf] [solutions]
  • Mon 2.2 12-14. Study group: Hidden Markov models, forward, backward, Baum-Welch. Chapter 7
  • Thu 5.2 10-12. Exercise 4 [pdf]
  • Mon 9.2 12-14. Study group: Multiple alignments, jumping alignments. Section 6.6
  • Thu 12.2 10-12. Exercise 5 [pdf] [solutions]
  • Mon 16.2 12-14. Study group: High-throughput sequencing (HTS) overview, variant calling, Burrows-Wheeler transform and indexes, search space pruning. Sections 1.3, 9.1-9.4.1, 10-10.5, 14.1.1 [slides.pptx] [slides.pdf] (it is enough to focus on the conceptual ideas; compression techniques for the data structures are covered in the concurrent data compression techniques course)
  • Thu 19.2 10-12. Exercise 6 [pdf]
  • Mon 23.2 12-14. Study group: Genome analysis, maximal repeats, unique and exact matches on suffix tree and on bidirectional BWT index. Sections 8.4, 11.1
  • Thu 26.2 10-12. Exercise 7 [pdf]
  • An alternative way to take the course is by separate exam: http://www.cs.helsinki.fi/exams
  • Take the variation calling challenge project in period IV to learn practical skills related to the topic of the course.
  • Transcriptomics and other "upstream" analyses that build on the underlying sequence analysis are covered in Algorithms in Molecular Biology, period IV.
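
For orientation, two of the statistics named for the introductory lecture, GC-content and GC-skew, can be sketched in a few lines of Python. This is an illustrative sketch and not part of the course material: the function names and the convention of returning 0.0 for empty input are assumptions made here. It uses the common definitions GC-content = (G + C) / length and GC-skew = (G - C) / (G + C).

```python
def gc_content(seq):
    """Fraction of bases in seq that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_skew(seq):
    """(G - C) / (G + C) over seq; defined here as 0.0 when seq has no G or C."""
    seq = seq.upper()
    g = seq.count("G")
    c = seq.count("C")
    if g + c == 0:
        return 0.0
    return (g - c) / (g + c)


if __name__ == "__main__":
    example = "ATGGGC"  # hypothetical toy sequence
    print(gc_content(example), gc_skew(example))
```

In genome analysis GC-skew is typically computed over a sliding window along the sequence; the functions above handle a single window.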

Literature and material

The course is based on selected chapters from the book:

  • Veli Mäkinen, Djamal Belazzougui, Fabio Cunial, and Alexandru Tomescu. Genome-Scale Algorithm Design: Biological sequence analysis in the era of high-throughput sequencing. Cambridge University Press, in press.

More in-depth probabilistic modeling of alignments and hidden Markov models can be found in the book:

  • R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic models of proteins and nucleic acids. Cambridge University Press, 1998.

The first lecture is largely based on the book:

  • R. C. Deonier, S. Tavaré, and M. S. Waterman. Computational Genome Analysis: An Introduction. Springer, 2005.