Biological Sequence Analysis

Advanced studies
The course covers selected high-throughput methods for the analysis of biological sequences, including advanced alignment methods, Hidden Markov Models, and next-generation sequencing data analysis methods. Prerequisities: Basics of bioinformatics and algorithms.


02.05.2011 16.00 A111
Year Semester Date Period Language In charge
2011 spring 14.03-27.04. 4-4 English Veli Mäkinen


Time Room Lecturer Date
Mon 10-12 D122 Veli Mäkinen 14.03.2011-27.04.2011
Wed 10-12 D122 Veli Mäkinen 14.03.2011-27.04.2011

Exercise groups

Group: 1
Time Room Instructor Date Observe
Wed 12-14 D122 Veli Mäkinen 16.03.2011—16.03.2011
Wed 12-14 D122 Veli Mäkinen 21.03.2011—29.04.2011

The lectures and exercise sessions are cancelled on 4th, 6th and 11th of April!


The course covers the basic probabilistic methods for modelling and analysis of biological sequences. Special focus will be on the feasibility issues related to the analysis of modern high-throughput sequencing data.

Course starts with Hidden Markov Model (HMM) techniques following loosely Chapters 3-5 of the book Durbin R., Eddy S., Krogh A. and Mitchinson G.: Biological sequence analysis, Cambridge University Press, 1998. The course then continues with techniques essential for efficient analysis of high-throughput sequencing (HTS) data, such as Burrows-Wheeler Transform (BWT) -based indexing for read alignment. The probabilistic framework is again applied to the further steps in HTS data analysis, such as  variation calling, alternative splicing, gene expression, and transcription factor binding site studies.  Course conludes with further applications of BWT indexing, such as computation of maximal exact matches and maximal unique matches, that are useful for comparative genomics studies.

Prerequisities: Knowledge on basic bioinformatics concepts such as alignment algorithms and score matrices (e.g. from Elements of Bioinformatics course).
Useful background: Knowledge of full-text indexing concepts (e.g. from String processing methods course).

Completing the course

Course consists of lectures and exercises, and of a compulsory project work. Exam gives maximum 40 points, weekly exercises 5 points, and project work 15 points.  Grading is based on total points gathered.  However, to pass the course, one must pass the exam with at least 20 points, and pass the project work.

Literature and material

We follow the course script.  The  script is still preliminary and will be updated during the course (especially the latter chapters).

Exercise solutions may appear here (based on voluntary activity).

Schedule (tentative):

  • 14.3. Alignments revisited, sparse dynamic programming, invariant technique and its application to affine gap score computation:
    Course script Sections 3.1 and 3.4 quickly revisited, 3.2 and 3.4.5 in detail
  • 16.3. PAM matrices, Hidden Markov Models (HMM):
    Course script Sections 3.4.6, 4-4.6
  • 16.3. Exercise 1:
    This will be study group, we will solve selected exercises at the end of the course script Chapter 3 on the fly.
  • 21.3. Baum-Welch, More complex HMMs, pair HMMs, suboptimal alignments, sampling:
    Course script Sections 4.7-4.10
  • 23.3.  Profile HMMs and their application to multiple alignment, jumping alignments:
    Course script Sections 4.11-4.12
  • 23.3. Exercise 2:  
    Assignments 5, 8, 13 from end of Chapter 3 and 1, 2, 3 from end of Chapter 4 in course script
  • 28.3. High-throughput sequencing (HTS) overview:
    Course script Section 7-7.1.2 (note: a mistake corrected from read filtering part)
  • 30.3. HTS analysis fundamentals: Burrows-Wheeler transform, FM-index, search space pruning:
    Lecture slides & Course script  Sections 5.6., 6.2.2, 6.3.1, 6.4.1, 7.1.3
  • 30.3. Exercise 3:
    Assignments 4,5,7,9,10,11  from end of Chapter 4 in course script
  • 4.4. No lecture (overlapping course in Viikki)
  • 6.4. No lecture (overlapping course in Viikki)
  • 11.4. No lecture (overlapping course in Viikki)
  • 13.4. RNA-sequencing, splice variants:
    Course script  Sections 7.1.4-7.1.5, 7.3.3
  • 13.4. Exercise 4:
    Assignments 1-6 from the end of Chapter 7 in course script
    (Note: BWA pruning was incorrectly defined in the script (now corrected) and the correspondin assignment changed accordingly. An exercise added to the end of the script to show that now it works correctly (with a solution).)
  • 18.4. Variant calling, Peak detection, Co-linear chaining,
    Course script  Sections 7.3.1, 7.3.2, 7.3.4 
  • 20.4. Whole genome alignment, MEMs and MUMs and related algorithms on suffix tree and on FM-index variants
    Course script sections 7.2.1. and 7.3.5 and slides on MUMs
  • 20.4. Exercise 5:
    Assignments 7-9,11,13 from the end of Chapter 7 in course script
  • Data for the project work: Genome resequencing competition (team work recommended):
    • You are given reference genome A and simulated short read data D from genome B. Genome B is variant of A with some simulated  SNPs, and short and large indels.
    • The task is to reconstruct B' from A and D such that B' is as close to B as possible.
    • Winner is the one obtaining best whole genome alignment score between B' and B.
    • Download project data from here
  • Course exam 2.5: You can choose among two alternatives a) and b):
    a) There will be 4 questions on the topics covered during lectures / exercises -> 4 cr with the passed project work.
    b) There will be 5th question on parts of lecture script not covered during lectures -> 4 cr without the project work if getting half of the maximum points for 5th question. Project work can then be done separately, giving extra 2 cr.
  • 2.5.-19.5. Guidance for the project work (B218, pop in, or contact by email)
  • 20.5. Genome resequencing competition deadline
  • Renewal exam / Separate exams: The first separate exam will be renewal exam and you can choose options a) and b) as in course exam 2.5. (see above). 
    Later separate exams will examine the whole course script and give 4 cr without the project work. Project work can then be done separately on request.