String Processing Algorithms
Koe
Vuosi  Lukukausi  Päivämäärä  Periodi  Kieli  Vastuuhenkilö 

2016  syksy  01.1115.12.  22  Englanti  Juha Kärkkäinen 
Luennot
Aika  Huone  Luennoija  Päivämäärä 

Ti 1214  B222  Juha Kärkkäinen  01.11.201615.12.2016 
To 1214  B222  Juha Kärkkäinen  01.11.201615.12.2016 
Harjoitusryhmät
Aika  Huone  Ohjaaja  Päivämäärä  Huomioitavaa 

Ti 1012  C222  Jarno Alanko  31.10.2016—16.12.2016 
Aika  Huone  Ohjaaja  Päivämäärä  Huomioitavaa 

Ti 1416  D123  Jarno Alanko  31.10.2016—02.12.2016  
Ti 1416  C123  Jarno Alanko  13.12.2016—13.12.2016 
Registration for this course starts on Tuesday 4th of October at 9.00.
Yleistä
COURSE EXAM
 Solutions and scoring
 Please contact the lecturer to see the exam papers and get feedback
The course introduces basic algorithms and data structures for string processing including: exact and approximate string matching, string sorting, dictionary data structures and text indexing.
The course is one of the elective courses on the subprogram of Algorithms, Data Analytics and Machine Learning and its predecessor Algorithms and Machine Learning.
The course is also useful for students in the Algorithmic Bioinformatics subprogram (and its predecessor Master's degree program for Bioinformatics), particularly for those interested in biological sequence analysis.
The students are expected to have basic knowledge on algorithms, data structures, finite automata and algorithm analysis. Recommened prerequisite courses are Data Structures, Models of Computation, and Design and Analysis of Algorithms.
The course is followed by Project in String Processing Algorithms in period III.
Kurssin suorittaminen
The course consists of lectures, study groups, exercises and an exam.
Lectures: Tue and Thu 1214 in B222 (except when replaced by a study group meeting).
Attending lectures is not obligatory but it is useful. Lecture notes covering key facts will be posted on this page, but there will be additional examples and explanations during the lectures.
Study groups: Four lectures are replaced by a study group: Thu 10.11., Thu 24.11., Thu 01.12., and Thu 15.12.
The students read some material in advance and then discuss the material in groups during the meeting. Attending the study group meetings is mandatory. If you cannot attend, please contact the lecturer as soon as possible for a replacement assignment.
Exercises: Tue 1012 in C222 and Tue 1416 in D123.
Exercise problems will be available on this page about a week before each exercise session. Model solutions will be available after the exercise session.
The students should solve the problems at home and be prepared to present their solutions at the exercise session. The students are not required to solve all problems, but additional points are awarded according to how many problems have been solved:
 7/35 marked problems is required and gives 1 point.
 30/35 marked problems gives the maximum of 10 points.
EXAMS
Course exam Mon 19.12. at 14:00 (tentative, check later)
The exam covers the lectures and the exercises. There will be no questions on the study group material. See last year's course for examples of exam problems.
The exam lasts 2.5 hours. No notes or other material is allowed in the exam.
The grading is based on the sum of the points from the exercises (max. 10 points) and the exam (max. 50 points).
 30 points is required to pass and gives the lowest grade 1.
 50 points or more gives the highest grade 5.
Renewal Exam
The renewal exam requires participation to the course and can be taken only if eligible for the course exam:
 Participation to all four study groups during the course (or completion of the replacement assignments).
 At least 7 solved exercise problems.
The renewal exam covers the same material as the course exam (lectures and exercise but not study groups). The exercise points (max. 10) will be added to the exam score when determining the grade.
The renewal exam is organized together with a separate exam and there will be a joint question paper with four joint questions, one question for renewal exam only and one question for separate exam only. The question for separate exam only is about the study group material.
Separate exams
The separate exams do not require course participation and the grade is based on the exam score only.
The separate exams cover lectures, exercises and study groups.
To prepare for the separate exam study group question:
 For each of the four study group assignments, choose one of the groups.
 Read the material assigned to the groups you chose.
 Prepare to answer questions related to the main discussion topics assigned to your chosen groups.
The first separate exam is organized together with a renewal exam and have a joint question paper with four joint questions, one question for renewal exam only and one question for separate exam only. The question for separate exam only is about the study group material.
For an example question, see last year's exam.
Kirjallisuus ja materiaali
All essential content can be found in the lecture notes and other material that will be posted here during the course.
The course will be similar (but not necessarily identical) to last year's course.
WEEK 1
Tue 01.11.
 1012 and 1416: Exercises 1: Problems

1214: Lecture 1: Slides (2x4 format)
 Introduction, strings and model of computation, search trees for strings
Thu 03.11.

1214: Lecture 2: Slides (2x4 format) Edit: Fix labels of proofs (slides 35 and 41).
 Longest common prefixes, string sorting, string quicksort, radix sort
WEEK 2
Tue 08.11.
 1012 and 1416: Exercises 2: Problems

1214: Lecture 3: Slides (2x4 format)
 Lcpcomparisons, string mergesort, string binary search, KarpRabin hashing/fingerprints
Thu 10.11.

1214: Study Group 1: Assignments
 Please contact the lecturer if your name is not mentioned in the assignments.
WEEK 3
Tue 15.11.
 1012 and 1416: Exercises 3: Problems

1214: Lecture 4: Slides (2x4 format)
 Exact string matching, (Knuth)MorrisPratt, ShiftAnd, KarpRabin, Horspool
Thu 17.11.

1214: Lecture 5: Slides (2x4 format)
 BNDM, Crochemore, AhoCorasick
WEEK 4
Tue 22.11.
 1012 and 1416: Exercises 4: Problems

1214: Lecture 6: Slides (2x4 format)
 Edit distance, dynamic programming, approximate string matching, Ukkonen's cutoff algorithm
Thu 24.11.
 1214: Study group 2: Assignments
WEEK 5
Tue 29.11.
 1012 and 1416: Exercises 5: Problems

1214: Lecture 7: Slides (2x4 format)
 Myers' bitparallel algorithm, filtering, BaezaYatesPerleberg
Thu 01.12.
 1214: Study group 3: Assignments
WEEK 6
Wed 07.12.
 1012 (DK112) and 1416 (B222): Exercises 6: Problems

1214: Lecture 8 (D112): Slides (2x4 format)
 Suffix tree, McCreight's algorithm, applications of suffix trees
Thu 08.12.

1214: Lecture 9: Slides (2x4 format)
 Suffix array, LCP array, applications of suffix array, BWT, backward search
WEEK 7
Tue 13.12.
 1012 (C222) and 1416 (C123): Exercises 7: Problems
 1214: Lecture 10: Slides (2x4 format)
Thu 15.12.
 1214: Study group 4: Assignments
The course does not follow any book. The books and articles below can be useful supplementary material but are not required reading (except when a part of a study group assignment).
Books
 M. Crochemore, C. Hancart and T. Lecroq. Algorithms on Strings. Cambridge University Press, 2007.
 M. Crochemore and W. Rytter. Jewels of Stringology. World Scientific Publishing, 2002.
 D. Gusfield. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.
 E. Ohlebusch. Bioinformatics Algorithms: Sequence Analysis, Genome Rearrangements, and Phylogenetic Reconstruction, 2013.
 V. Mäkinen, D. Belazzougui, F. Cunial and A. I. Tomescu: GenomeScale Algorithm Design: Biological Sequence Analysis in the Era of HighThroughput Sequencing. Cambridge University Press, 2015.
 G. Navarro and M. Raffinot. Flexible Pattern Matching in Strings. Cambridge University Press, 2002.
 B. Smyth. Computing Patterns in Strings. Addison Wesley, 2003.
Survey articles
 S. Faro and T. Lecroq. The exact online string matching problem: A review of the most recent results. ACM Computing Surveys 45, 2, Article 13 (March 2013), 42 pages.
 G. Navarro. A Guided tour to approximate string matching. ACM Computing Surveys 33(1), 2001, pp. 31  88.
 S.J. Puglisi, W.F. Smyth, A.H. Turpin. A Taxonomy of Suffix Array Construction Algorithms. ACM Computing Surveys 39(2), Article 4, 2007.
 G. Navarro and V. Mäkinen. Compressed fulltext indexes. ACM Computing Surveys 39(1), Article 2, 2007.
Links
 Exact String Matching Algorithms
 SMART: String Matching Research Tool
 Pizza & Chili Corpus: Compressed Indexes and their Testbeds
 Yuta Mori's implementations of suffix array construction algorithms: SAIS and divsufsort
 PADS: external memory algorithms for suffix array and LCP array construction