Introduction to Machine Learning

582631
5
Algorithms and machine learning
Advanced studies
Basic concepts and methods of machine learning, in theory and in practice. Supervised learning (classification, regression) and unsupervised learning (clustering). The course serves as preparation for various courses on data analysis, machine learning and bioinformatics. Course book: An Introduction to Statistical Learning with Applications in R, Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, Springer, 2013.

Exam

11.12.2013 09.00 B123
Year Semester Date Period Language In charge
2013 autumn 29.10-06.12. 2-2 English Jyrki Kivinen

Lectures

Time Room Lecturer Date
Tue 10-12 D122 Jyrki Kivinen 29.10.2013-06.12.2013
Fri 10-12 D122 Jyrki Kivinen 29.10.2013-06.12.2013
Mon 12-14 C220 Jyrki Kivinen 09.12.2013-09.12.2013

Exercise groups

Group: 1
Time Room Instructor Date Observe
Fri 14-16 B222 Yuan Zou 04.11.2013-06.12.2013

On Tuesday 19th of November the lecture is moved to room B222!

Registration for this course starts on Tuesday October 8th at 9.00. There is additional guidance for Matlab/R on Tue October 29th at 12-14 in B221 and Fri November 1st at 12-14 in B221.

Information for international students

The course will be taught in English.  All materials will appear on the English version of this page.

Announcements

  • The course has been graded.  The course results are available in the department intranet. 
  • Please fill in a feedback form for the course!
  • Details of the programming assignment for separate examinations are now available in the Examinations tab.

General

The course will cover the basics of machine learning.  The course consists of lectures, homework exercises and a course examination.

Machine learning employs many concepts and techniques from mathematics.  Students are expected to know the basics of probability theory, linear algebra and calculus.  This course does not need any advanced techniques, but a general familiarity with mathematical manipulations will make the course easier.

A significant proportion of the exercises will require the use of a computer to implement machine learning algorithms and experiment with them.  Most students will probably find it easiest to solve these problems using Matlab, R or similar tools.  During the first week, there will be some instruction in the use of such tools (see below for details).  It is assumed that all students already have fairly good skills in computer programming in general.

Completing the course

There are two ways of completing the course.

  1. Taking the lecture course in Period II, including homework exercises and a course exam.  This option is the main focus of these pages.  The homework makes up 40% of the grade, the course exam 60%.  In order to pass, you must score at least half of the points in both the homework and the exam.  If you have done the homework but are unable to attend the course exam, or do not pass it, you may replace it with a separate examination.
  2. There will be separate examinations according to the usual policy of the department.  This option requires that you additionally complete a programming project.  The details of the project will be made available well in advance of the first separate exam (currently planned for 4 February 2014).
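
The grading rule for option 1 can be illustrated with a minimal sketch.  It assumes that homework and exam points are each converted to a fraction of their maximum and combined linearly with the 40/60 weights; the function name and the point totals in the usage lines are hypothetical, and the mapping from the weighted score to a final grade is not stated on this page, so it is not modelled here.

```python
def course_result(homework_points, homework_max, exam_points, exam_max):
    """Return (weighted_score, passed) for option 1.

    Assumption: each component is scaled to a fraction of its maximum,
    then combined as 40% homework + 60% exam.
    """
    if homework_max <= 0 or exam_max <= 0:
        raise ValueError("maximum points must be positive")

    homework_fraction = homework_points / homework_max
    exam_fraction = exam_points / exam_max
    weighted_score = 0.4 * homework_fraction + 0.6 * exam_fraction

    # Passing requires at least half of the points in BOTH components;
    # integer-style comparison avoids rounding issues at exactly 50%.
    passed = (2 * homework_points >= homework_max
              and 2 * exam_points >= exam_max)
    return weighted_score, passed


# Hypothetical point totals, for illustration only.
score, passed = course_result(30, 60, 45, 60)
print(f"weighted score: {score:.2f}, passed: {passed}")
```

Note that a high weighted score is not enough on its own: a student with less than half of the homework points does not pass under this rule, however well the exam goes.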

See the Examinations tab to get an idea of the type of questions in the exams.  (Notice that the option of replacing this course with the similar online course offered by Coursera is no longer available.)

Literature and material

Textbook

The course textbook is Introduction to Data Mining (2005) by Tan, Steinbach and Kumar.  We will mainly cover Chapters 1–5 and 8–10.  More detailed pointers to the textbook will be posted here as the course progresses.  However, the course does not follow the textbook precisely.  Students are expected to learn both the material in the assigned parts of the textbook, and the material presented in lectures and exercises.

Lectures

Lecture notes will appear here as the course progresses. They are mainly based on material from previous instances of this course, created by Patrik Hoyer and others.

  • Tutorial on multivariate distributions (P. Hoyer)
  • Some quick notes about Bayes error
  • Week 1: Notes for lecture 1 and lecture 2 are available. Corresponding to this, you should read Chapters 1 and 2 of the textbook.
  • Week 2: Notes for lecture 3 and lecture 4 are available. Corresponding to this, you should read Sections 3.1, 3.2, 3.3, 4.1, 4.2, 5.2, 5.3.1, 5.3.2, 5.3.4, 5.7 and 5.8 of the textbook. This looks a bit fragmented because we cover topics in a different order than the textbook. Parts of this may become clearer as we get further into the course.
  • Week 3: Notes for lecture 5 and lecture 6 are available now.  From the textbook, you should read Sections 4.4, 4.5, 5.3.3 and 5.6.3.  Pages 1–17 of Patrik Hoyer's tutorial on multivariate distributions may also be helpful.
  • Week 4: Notes for lecture 7 and lecture 8 are available now.  From the textbook, you should read Sections 4.3 and 5.1, and Appendix D.
  • Week 5: Notes for lecture 9 and lecture 10 are available now.  From the textbook, you should read Sections 5.4.0–5.4.1 and 8.0–8.3.
  • Week 6: Notes for lecture 11 are available now.  From the textbook, you should read Sections 9.2.2 and 8.5, and Chapter 10.

Homework exercises

There will be compulsory weekly homework consisting of both pen-and-paper and computer exercises. The exercise sessions will be held every Friday beginning from lecture week 2, and mainly cover topics from the previous week's lectures. Attendance at the exercise sessions is voluntary, but to get credit you need to hand in your solutions following the instructions on the problem sheet. The deadline is 9:00 am on the Wednesday before the session.

Exercise points (on department intranet, listed by student number). "LH" means pen-and-paper problems and "HT" programming problems, both with a running numbering so that, for example, "LH5" is pen-and-paper problem 2 in set 2. Column "LH15" contains the extra points awarded for willingness to present solutions at the exercise sessions.

Additional material (data sets, tutorials etc.) has been collected on a separate tab.