Biological Sequence Analysis

Advanced studies
The course covers selected high-throughput methods for the analysis of biological sequences, including advanced alignment methods, Hidden Markov Models, and next-generation sequencing data analysis methods. Prerequisities: Basics of bioinformatics and algorithms.
Year Semester Date Period Language In charge
2016 spring 18.01-03.03. 3-3 English Esko Ukkonen


Time Room Lecturer Date
Mon 12-14 B119 Esko Ukkonen 18.01.2016-03.03.2016
Thu 10-12 B119 Esko Ukkonen 18.01.2016-03.03.2016

Exercise groups

Group: 1
Time Room Instructor Date Observe
Mon 14-16 B222 Dominik Kempa 25.01.2016—04.03.2016 The meeting time may still change!


The course covers selected computational methods for the analysis of biological sequences. Topics include advanced alignment methods, algorithms around hidden Markov models, and core data structures for read alignment and genome analysis. The course will be organized in *active learning* mode meaning that there will be very few lectures but more home work and study groups during the course. In study groups we discuss the week's topic. It is essential that the students read in advance the texts for each study group session (the exact form of study group work depends on the number of participants). Exercises test the knowledge of the study group material and their extensions to related topics. There will be some tailored assignments depending on the student's background: choice between deeper theory assignments for mathematically oriented and more software implementation-oriented assignments for those who prefer learning by doing.There will be no course exam. Instead, you have to show sufficient activity during the course to pass.


Completing the course

There is no course exam. The grading is based on the activity during the course. The study groups are mandatory (you should attend at least 4 out of 6). Exercises determine the grade: 50% gives 1, 85% gives 5. If you cannot attend some exercise session you may return your solutions before the session by email to Dominik Kempa at dominik.kempa[at]


Content and weekly schedule

  • Mon18.1. 12-14:  Opening lecture: Bioinformatics? Biology primer, orthologs and paralogs, sequence homology, alignments, score schemes, log-odds, BLOSUM, multiple alignments, signals (motifs) in DNA, PWM and PSSM, k-mers, GC-content, GC-skew, Markov-chains, CAI. [slides.ppt] [slides.pdf]
  • Thu 21.1. 10-12:  Opening lect (cont)
  • Mon 25.1. 12-14:  Study group: Dynamic programming for various alignment models  + shortest detour. Sections 6.1-6.1.2, 6.3-6.4.3. of Mäkinen et al book [MBCT].
  • Mon 25.1. 14-16  B222:  Exercise session 1: [problems] [solutions]
  • Thu 28.1. 10-12  B119:  Exercise session 2: [problems] [solutions]
  • Mon 1.2. 12-14  B119: Study group: LCS, sparse dynamic programming, affine gap model, gene alignment. Reading: Sections 6.2, 6.4.4, 6.4.5, 6.5 of [MBCT]
  • Thu 4.2. 10-12  B119: Exercise session 3: [problems] [solutions]
  • Mon 8.2. 12-14  B119: Study group:
      Hidden Markov Models, Viterbi algorithm, Forward and Backward algorithms, HMM parameter estimation
      Reading:  Chapter 7 of [MBCT]
  • Thu 11.2. 10-12  B119: Exercise session 4: [problems] [solutions]
  • Mon 15.2. 12-14 B119: Study group: Gene alignment, Multiple alignment: scoring schemes,  progressive alignment, DAG alignment, jumping alignment. Reading: Sections 6.5 & 6.6 of [MBCT]
  • Thu 18.2. 10-12 B119: Lecture: profile HMMs and index structures (suffix trees and suffix arrays)
  • Mon 22.2. 12-14 B119: Study group: Classical indexes: k-mer index, suffix array, suffix tree, applications. Reading:  Sections 8, 9.1-2 of [MBCT]
  • Mon 22.2. 14-16 B222: Exercise session 5: [problems] [solutions]
  • Thu 25.2. 10-12 B119: Exercise session 6: [problems] [solutions]
  • Mon 29.2. 12-14  B119: Study group:

        Read alignment: pattern partitioning, dynamic programming along suffix tree paths;
        comparing genomes without an alignment; Genomics: variation calling 
        Reading: Chapter 10, pages 201-206; Chapter 11, pages 220-221 & 229-233; Chapter 14, pages 307-312 of [MBCT]

  • Thu 3.3. 10-12  B119: Exercise session 7: [problems]

Literature and material

The course is based on selected chapters from:
  • [MBCT]   V. Mäkinen, D. Belazzougui, F. Cunial, and A. Tomescu. Genome-Scale Algorithm Design: Biological sequence analysis in the era of high-throughput sequencing. Cambridge University Press, 2015 [access electronic version].
  • R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic models of proteins and nucleid acids. Cambridge University Press, 1998 [access electronic version].
  • R. C. Deonier, M. S. Waterman, S. Tavaré. Computational Genome Analysis: An Introduction. Springer, 2005 [access electronic version].

Note: electronic versions require access from the university network.