Data Structures in Genome Analysis (guided self study)

Syventävät opinnot
The course covers fundamental data structures that are used for high-throughout sequencing applications in genome analysis. The focus is on Burrows-Wheeler transform -based succinct/compressed index structures for representing, querying, and analyzing collections of DNA sequences. Applications/topics include read alignment, maximal repeat detection, succinct de Bruijn graphs, Lempel-Ziv compression, string kernels, and metagenomics analysis. Prerequisites: Basics of algorithms and data structures. The course follows selected chapters from the textbook Veli Mäkinen, Djamal Belazzougui, Fabio Cunial, and Alexandru I. Tomescu: Genome-Scale Algorithms Design: Biological sequence analysis in the era of high-throughput sequencing. Cambridge University Press. 2015.
Vuosi Lukukausi Päivämäärä Periodi Kieli Vastuuhenkilö
2016 kevät 14.03-02.05. 4-4 Englanti Veli Mäkinen


Aika Huone Luennoija Päivämäärä
Ma 12-14 B119 Veli Mäkinen 14.03.2016-02.05.2016


Group: 1
Aika Huone Ohjaaja Päivämäärä Huomioitavaa
To 10-12 B119 Veli Mäkinen 14.03.2016—06.05.2016

Ilmoittautuminen tälle kurssille alkaa tiistaina 16.2. klo 9.00.

Registration for this course starts on Tuesday 16th of February at 9.00.


Week I:

Week II:

  • Monday 21.3: Exercise 1 (Jarno Alanko): 3.1, 3.10, 3.12, 3.9, 3.6 (from course book pages 27-29)
    • Hint for question 3.6. General Update-operation will be difficult (impossible?) with this solution, but one can update a value to a smaller one. Consider storing to vEB tree only keys whose value may be an answer to some semi-infinite range mininimum query. You can assume the vEB tree supports also successor operation (inverse of predecessor) to make things easier.
  • Easter break 24.3-30.3
  • Thursday 31.3: Study group on k-mer indexing and suffix arrays, Sections 8-8.2 (Veli Mäkinen)

Week III:

  • Monday 4.4: Exercise 2 (Jarno Alanko): 8.2, 8.3, 8.5, 8.10, 8.11
  • Wednesday 6.4 12.30-14 B222: Study group on suffix trees, rest of Chapter 8 (Veli Mäkinen)

Week IV:

  • Monday 11.4: Exercise 3 (Jarno Alanko): 8.7, 8.15, 8.16, 8.20 (typo in I[...], should be I[1..d]), 8.19 OR 5b simulation below
    • 5b: Simulate suffix-prefix overlap computation (Sect. 8.4.4) on some simple but illustrative input; see the images from Wednesday study group above.
  • Thursday 17.4: Study group on Burrows-Wheeler indexing, Sections 9-9.4 (Veli Mäkinen)

Week V:

  • Monday 18.4: Exercise 4 (Jarno Alanko): 9.2, 9.3, 9.5, 9.6, 9.7
  • Thursday 21.4: Study group on de Bruijn graphs and overlap computation, Section 9.7 and Section 13.2.3 (Veli Mäkinen)

Week VI:

  • Monday 25.4: Exercise 5 (Jarno Alanko): 9.18, 9.19, 9.21, 13.11, 13.10
  • Thursday 28.4: Study group on maximal unique matches etc., Sections 11-11.1 (Veli Mäkinen)

Week VII:

  • Monday 2.5: Study group on string kernels etc., Sections 11.2.1 and 11.2.3 (Veli Mäkinen): black board on telescoping technique
  • Thursday 5.5:  Helatorstai (holiday)
  • Exercise 6: Return by email to Jarno Alanko by 6.5. Solve any 5 of the following: 11.1, 11.2, 11.6, 11.10, 11.11, 11.13, 11.7, 11.17, 11.21, 11.20, 11.22
    • Extra 1-2 points from solutions to 13.10



Kurssin suorittaminen

There is no course exam. Grading is based on the weekly exercises. There are five assignments per week. One should mark one assignment each week to pass. 12 points gives grade 1, 15 points gives grade 2, 18 points gives grade 3, 21 points gives grade 4, and 24 points gives  grade 5. (Limits updated as we had no solution to Exercise 13.10; 29 is new maximum)

Kirjallisuus ja materiaali

Veli Mäkinen, Djamal Belazzougui, Fabio Cunial, and Alexandru I. Tomescu. Genome-Scale Algorithm Design: Biological Sequence Analysis in the Era of High-Throughput Sequencing. Cambridge University Press. [Book webpage] [HY institutional e-book]