Data Mining (guided self study)

582634
5
Algorithms and machine learning
Advanced studies
This course focuses on concepts and methods for frequent pattern discovery, also known as association analysis. This edition of the course is a structured and guided self-study course with weekly tasks and supervision, with mandatory attendance. Prerequisites: BSc degree and the course Introduction to Machine Learning or equivalent. Course book: Tan P., Steinbach M. & Kumar V.: Introduction to Data Mining, Chapters 6 and 7. Addison Wesley, 2006.
Year Semester Date Period Language In charge
2017 spring 20.01-03.03. 3-3 English Hannu Toivonen

Lectures

Time Room Lecturer Date
Fri 10-12 B222 Hannu Toivonen 20.01.2017-03.03.2017

Information for international students

This course is given in English.

General

This course will familiarize the participants with concepts and methods for identifying interesting patterns in large datasets. Data mining is about trying to make sense of data, usually without clear questions or clear success criteria. The course will focus on the discovery of frequent patterns in data, a fundamental data mining task that can help extract knowledge and previously unknown patterns even from largely unstructured data.
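
As a rough illustration of what discovering frequent patterns means in practice, the sketch below finds all itemsets that occur in at least a given number of transactions, using a simple level-wise search in the spirit of Apriori. It is not taken from the course material: the function name `frequent_itemsets`, the absolute `min_support` threshold, and the toy shopping baskets are assumptions made here for illustration only.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support count} for itemsets with count >= min_support.

    Hypothetical helper for illustration; min_support is an absolute count.
    """
    transactions = [frozenset(t) for t in transactions]
    # Level 1: every single item is a candidate itemset.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Support count = number of transactions containing the candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates and keep only those
        # whose k-subsets are all frequent (the Apriori pruning principle).
        k += 1
        candidates = {
            a | b
            for a, b in combinations(level, 2)
            if len(a | b) == k
            and all(frozenset(s) in level for s in combinations(a | b, k - 1))
        }
    return frequent


# Toy data (made up for this example): five shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
result = frequent_itemsets(baskets, min_support=3)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

With a threshold of 3, four single items and four pairs are frequent in this toy data, while no triple reaches the threshold. Real datasets call for the more efficient counting and candidate-generation techniques covered in the course book.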

Completing the course

This instance of the course is based on self-study, following a given study schedule and supported by weekly mentoring by the professor. Mentoring follows the so-called flipped classroom model: students study the material first, and the Friday meetings are used to answer students' questions, fill in gaps, and so on.

The course is completed solely by taking a final exam on 10 March 2017 (or 25 April). Check https://www.cs.helsinki.fi/en/exams for possible changes to exam schedules. Participation in Friday sessions is voluntary. There are no exercise sessions.

NEW (24 Mar 2017): The exam has been graded and results are available at https://ilmo.cs.helsinki.fi/tulokset/studies. You should be able to see your points for each task in the exam. If you have any questions, contact Hannu by dropping in at his lab (rooms B233/B232) without an appointment. Good times to find him: Wed (29 Mar) 9:30-12, Thu (30 Mar) 14-16, Fri (31 Mar) 10-14.

Schedule

The following topics are to be studied before the respective meeting date. The meetings are based on students' needs, not on planned lectures.  

  • Week 2: Frequent itemset generation (Sections 6.1-6.2 except 6.2.4)
  • Week 3: Compact representation of frequent itemsets (Section 6.4)
  • Week 4: Alternative methods for generating frequent itemsets and FP-growth (Sections 6.5-6.6)
  • Week 5: Rule generation and evaluation of association patterns (Sections 6.3 and 6.7 except 6.3.2)
  • Week 6: Handling categorical and continuous attributes and a concept hierarchy (Sections 7.1-7.3) (NB: Hannu will not be present this week)
  • Week 7: Sequential patterns (Section 7.4)

The course has a closed Facebook group where students can share hints as well as ask for and give advice regarding the course.

Feedback

Please give feedback on the course using the department's anonymous feedback form (look for "Data Mining" under Advanced studies). Thank you.

Literature and material

Course book: Tan P., Steinbach M. & Kumar V.: Introduction to Data Mining, Chapters 6 and 7. Addison Wesley, 2006. Links:

Additional material on the same topics (note: notations may differ):

Useful material for self studies can also be found from previous editions of this course:

Many of the exercises done in the classroom are from the "weekly tests" of the 2016 course. The 2016 course page also contains links to their solutions.
 

Definitive course contents (covered in exams) 

Chapters 6 and 7 of Tan et al., except the following: Sections 6.2.4 (Support Counting), 6.3.2 (Rule Generation in Apriori Algorithm), 6.8 (Effect of Skewed Support Distribution), 7.5 (Subgraph Patterns), 7.6 (Infrequent Patterns).