String Processing Algorithms

58093
5
Algorithms and machine learning
Advanced studies
Basic algorithms and data structures for string processing: exact and approximate string matching, string sorting, dictionary data structures, text indexing.

Exam

17.12.2014 17.00 CK112
Year Semester Date Period Language In charge
2014 autumn 28.10-11.12. 2-2 English Juha Kärkkäinen

Lectures

Time Room Lecturer Date
Tue 12-14 B222 Juha Kärkkäinen 28.10.2014-11.12.2014
Thu 12-14 B222 Juha Kärkkäinen 28.10.2014-11.12.2014

Exercise groups

Group: 1
Time Room Instructor Date Observe
Tue 14-16 B119 Dominik Kempa 27.10.2014—12.12.2014

Registration for this course starts on Tuesday 7th of October at 9.00.

General

NEWS

Course exam 17.12.2014

Renewal/separate exam 06.02.2015

  • Exam | Solutions
  • The grade is based on max ( exam_points, exam_points/1.2 + exercise_points ).
  • Please contact the lecturer to see your exam papers and get feedback.

The course introduces basic algorithms and data structures for string processing including: exact and approximate string matching, string sorting, dictionary data structures and text indexing.

The course is one of the elective courses on the subprogram of Algorithms, Data Analytics and Machine Learning and its predecessor Algorithms and Machine Learning.

The course is also useful for students in the Algorithmic Bioinformatics subprogram (and its predecessor Master's degree program for Bioinformatics), particularly for those interested in biological sequence analysis.

The students are expected to have basic knowledge on algorithms, data structures, finite automata and algorithm analysis. Recommened prerequisite courses are Data Structures, Models of Computation, and Design and Analysis of Algorithms.

The course is followed by Project in String Processing Algorithms in period III.

Completing the course

The course consists of lectures, study groups, exercises and an exam.

Lectures: Tue and Thu 12-14 in B222 (except when replaced by a study group meeting).

Attending lectures is not obligatory but it is useful. Lecture notes covering key facts will be posted on this page, but there will be additional examples and explanations during the lectures.

Study groups: Thu 6.11., Tue 18.11., Thu 27.11. and Thu 11.12., always 12-14 in B222.

The students read some material in advance and then discuss the material in groups during the meeting. Attending the study group meetings is mandatory. If you cannot attend, please contact the lecturer as soon as possible.

The course exam and the renewal exam will not have questions on the study group material but the separate exam will. To prepare for a separate exam:

  • For each of the four study group assignments, choose one of the groups.
  • Read the material assigned to the groups you chose.
  • Prepare to answer questions related to the main discussion topics assigned to your chosen groups.

Exercises: Tue 14-16 in B119.

Exercise problems will be available on this page about a week before each exercise session. Model solutions will be available after the exercise session.

The students should solve the problems at home and be prepared to present their solutions at the exercise session. The students are not required to solve all problems, but additional points are awarded according to how many problems have been solved:

  • 7/35 marked problems is required and gives 1 point.
  • 30/35 marked problems gives the maximum of 10 points.

EXAMS

Course Exam (Wed 17.12. at 17-20 in CK112): Course Exam | Solutions and scoring

The exam covers the lectures and the exercises. There will be no questions on the study group material. See last year's course for examples of exam problems.

The exam lasts 2.5 hours. No notes or other material is allowed in the exam.

The grading is based on the sum of the points from the exercises (max. 10 points) and the exam (max. 50 points).

  • 30 points is required to pass and gives the lowest grade 1.
  • 50 points or more gives the highest grade 5.

Renewal Exam: Fri 06.02.2015 at 16-20 in B123 (Please re-check the time and place a few days before the exam.)

The renewal exam requires participation to the course and can be taken only if eligible for the course exam:

  • Participation to all four study groups during the course (or completion of the replacement assignments).
  • At least 7 solved exercise problems.

The renewal exam covers the same material as the course exam (lectures and exercise but not study groups). The exercise points (max. 10) will be added to the exam score when determining the grade.

The renewal exam is organized together with a separate exam and there will be a joint question paper with four joint questions, one question for renewal exam only and one question for separate exam only. The question for separate exam only is about the study group material.

Separate exams

The separate exams do not require course participation and the grade is based on the exam score only.

The separate exams cover lectures, exercises and study groups. See above how to prepare for the study group part.

Some separate exams are organized together with a renewal exam and have a joint question paper with four joint questions, one question for renewal exam only and one question for separate exam only. The question for separate exam only is about the study group material.

Literature and material

All essential content can be found in the lecture notes and other material that will be posted here during the course.

The course will be similar (but not necessarily identical) to last year's course.

WEEK 1

  • Lecture 1 (Tue 28.10. at 12-14 in B222): Slides (2x4 format) Change: Numbering of examples etc. has changed.
    • Introduction, strings, search trees for strings
  • Exercises 1 (Tue 28.10. at 14-16 in B119): Problems
  • Lecture 2 (Thu 30.10. at 12-14 in B222): Slides (2x4 format)
    • Longest common prefixes, string sorting, string quicksort, radix sort

WEEK 2

  • Lecture 3 (Tue 04.11. at 12-14 in B222): Slides (2x4 format)
    • Lcp-comparisons, string mergesort, string binary search, Karp-Rabin hashing/fingerprints
  • Exercises 2 (Tue 04.11. at 14-16 in B119): Problems
  • Study Groups 1 (Thu 06.11. at 12-14 in B222): Groups and assignments

WEEK 3

  • Lecture 4 (Tue 11.11. at 12-14 in B222): Slides (2x4 format)
    • Exact string matching, (Knuth-)Morris-Pratt, Shift-And, Karp-Rabin, Horspool
  • Exercises 3 (Tue 11.11. at 14-16 in B119): Problems
  • Lecture 5 (Thu 13.11. at 12-14 in B222): Slides (2x4 format)
    • BNDM, Crochemore, Aho-Corasick

WEEK 4

  • Study Groups 2 (Tue 18.11. at 12-14 in B222): Groups and assignments
  • Exercises 4 (Tue 18.11. at 14-16 in B119): Problems
  • Lecture 6 (Thu 20.11. at 12-14 in B222): Slides (2x4 format)
    • Edit distance, dynamic programming, approximate string matching, Ukkonen's cut-off algorithm

WEEK 5

  • Lecture 7 (Tue 25.11. at 12-14 in B222): Slides (2x4 format)
    • Myers' bitparallel algorithm, filtering, Baeza-Yates-Perleberg
  • Exercises 5 (Tue 25.11. at 14-16 in B119): Problems
  • Study Groups 3 (Thu 27.11. at 12-14 in B222): Groups and assignments

WEEK 6

  • Lecture 8 (Tue 02.12. at 12-14 in B222): Slides (2x4 format)
    • Suffix tree, McCreight's algorithm, applications of suffix trees
  • Exercises 6 (Tue 02.11. at 14-16 in B119): Problems
  • Lecture 9 (Thu 04.12. at 12-14 in B222): Slides (2x4 format)
    • Suffix array, LCP array, applications of suffix array, RMQ preprocessing, BWT, prefix doubling

WEEK 7

WEEK 8

  • Exam (Wed 17.12. at 17-20 in CK112)
    • The exam starts 17:00 and lasts 2.5 hours. No notes or other material is allowed in the exam.
    • The exam covers the lectures and the exercises. There will be no questions on the study group material.
    • See last year's course for examples of exam problems.

The course does not follow any book. The books and articles below can be useful supplementary material but are not required reading (except when a part of a study group assignment).

Books

  • M. Crochemore, C. Hancart and T. Lecroq. Algorithms on Strings. Cambridge University Press, 2007.
  • M. Crochemore and W. Rytter. Jewels of Stringology. World Scientific Publishing, 2002.
  • D. Gusfield. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.
  • E. Ohlebusch. Bioinformatics Algorithms: Sequence Analysis, Genome Rearrangements, and Phylogenetic Reconstruction, 2013.
  • G. Navarro and M. Raffinot. Flexible Pattern Matching in Strings. Cambridge University Press, 2002.
  • B. Smyth. Computing Patterns in Strings. Addison Wesley, 2003.

Survey articles

Links