Introduction to Machine Learning

582631
5
Algorithms and machine learning
Advanced studies
Basic concepts and methods of machine learning, in theory and in practice. Supervised learning (classification, regression) and unsupervised learning (clustering). The course serves as preparation for various courses on data analysis, machine learning and bioinformatics. Course book: An Introduction to Statistical Learning with Applications in R, Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, Springer, 2013.
Year Semester Date Period Language In charge
2010 autumn 02.11-10.12. 2-2 English

Lectures

Time Room Lecturer Date
Tue 10-12 D122 Patrik Hoyer 02.11.2010-10.12.2010
Fri 10-12 D122 Patrik Hoyer 02.11.2010-10.12.2010

Exercise groups

Group: 1
Time Room Instructor Date Observe
Thu 16-18 B222 Antti Hyttinen 08.11.2010-10.12.2010
Group: 2
Time Room Instructor Date Observe
Fri 12-14 D122 Doris Entner 08.11.2010-10.12.2010

Registration for this course starts on Tuesday 12th of October at 9.00. There's additional guidance for Matlab on Tue 2nd at 12-14 in B221 and Fri 5th at 12-14 in B121.

General

Quick link: Course page in Moodle

Note: All information on this page refers to the 2010 course. For up-to-date information see the autumn 2011 course page.

News (22.12.2010): The full details of what needs to be done to take a renewal exam or a separate exam are now provided below (see the section 'completing the course').

News (3.11.2010): The regular exercise session on Fridays (starting 12.11.2010) has been moved to room D122 (same room as the lectures) to make sure there is enough space. (Note that this does not affect the Matlab/Octave/R practice session on Friday 5.11.2010.)

Machine learning and data mining deal with designing computer algorithms that find interesting patterns in data and that can learn from experience. As the cost of measuring, storing, and transmitting data has plummeted in recent years, the amount of data being collected and analyzed has grown at an amazing pace in both business and scientific applications.

For example, internet search engine companies routinely use techniques from machine learning to help users find the information they seek, the financial sector uses data mining techniques to identify fraudulent credit card transactions, and medical companies use statistical methods in drug development. Today, almost every business uses some form of data analysis.

Similarly, much of modern science depends on computational methods for discovering relationships between variables in high-dimensional datasets. In bioinformatics, the advent of measurement technology for sequencing whole genomes and measuring the expression of thousands of genes has required the development of completely new data analysis methods. In many other fields as well, sensors have become cheap to the point where the main bottleneck is the analysis of the resulting data rather than the measurement technology.

This course provides an introduction to machine learning and data mining techniques, and serves as preparation for a variety of courses on data analysis, machine learning, and bioinformatics. While one goal is to present a broad overview of the field, the course will also give the students a basic understanding of standard problems such as classification, regression, data clustering, and anomaly detection. The students will obtain an understanding of the relevant techniques by applying them to real-world data sets.

Main themes and learning objectives: Detailed here in English and in Finnish

Course staff: The lectures will be given by Dr. Patrik Hoyer, while the exercises will be held by Doris Entner and Antti Hyttinen. There are no designated office hours; please set up an appointment by e-mail if necessary.

Exercises first week: Instead of the regular exercises, during the first week (1.11-5.11) there will be guidance on Matlab/Octave/R on Tuesday 12-14 (in B221) and Friday 12-14 (in B121). (Note: these sessions are identical, so please attend only one.) The purpose is to familiarize the students with the software packages that will be used in the course for implementing the various algorithms. These extra sessions are voluntary and, exceptionally, carry no exercise points. Students without any prior familiarity with Matlab, Octave or R are strongly encouraged to attend.

Completing the course

In addition to the lectures, the course consists of weekly exercises and a final exam. The exercises constitute 40% of the course total, while the exam makes up 60% of the total points. 

To pass the course, the student must

  • pass the final exam (obtain at least half the available points in the exam), and
  • obtain at least half of the total available points (exercises + exam)
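As an illustration only, the two passing conditions above can be sketched as a small check. The 100-point scale (40 points for exercises, 60 for the exam) and the function name passes_course are assumptions made for this example; the course page only specifies the 40%/60% split and the two "at least half" requirements.

```python
# Hypothetical 100-point scale, assumed for illustration:
# exercises are worth up to 40 points, the exam up to 60.
MAX_EXERCISE_POINTS = 40
MAX_EXAM_POINTS = 60


def passes_course(exercise_points, exam_points):
    """Return True if both passing conditions are met.

    Condition 1: at least half of the exam points.
    Condition 2: at least half of the total points (exercises + exam).
    """
    total_available = MAX_EXERCISE_POINTS + MAX_EXAM_POINTS
    exam_ok = 2 * exam_points >= MAX_EXAM_POINTS
    total_ok = 2 * (exercise_points + exam_points) >= total_available
    return exam_ok and total_ok
```

Under these assumptions, full exercise points cannot compensate for a failed exam (40 exercise points with 29 exam points does not pass), and passing the exam alone is not enough if the total stays below half (0 exercise points with 30 exam points does not pass).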

To obtain points for the weekly exercises, the students must each week turn in their solutions to the organizers. More details will be given here at the start of the course.

Prerequisites:

  • Some basic probability theory and linear algebra (the course textbook provides a refresher of the most basic concepts in the appendices).
  • Some programming skills (we will use Matlab/Octave/R but no prior exposure to these particular environments is needed)

Please register for the course using the university registration system (see the link on the left). Only registered students can be assigned credits.

Please also sign up for the course in Moodle. All the course material will be available in Moodle and students will be kept informed of current course events using email from Moodle.

News: For those wishing to take a renewal exam or a separate exam in the spring/summer of 2011, all the details and instructions are provided in Moodle. Please 'sign up' for the course to get access to the material. In brief: anybody who was eligible to take the 14.12.2010 exam can retake it in the spring/summer of 2011 without additional exercises (with 40% of the final grade based on the weekly exercises from the course in the autumn). Anybody who wishes to take part in a separate exam (in which the weekly exercise points are not counted) needs to first successfully complete some programming exercises. These (and the instructions) are in Moodle.

Literature and material

The textbook for the course will consist of (selected parts from): Tan, Steinbach, Kumar (2005): Introduction to Data Mining (publisher, amazon.co.uk, bookplus)

There are a total of 12 copies available in the Kumpula Science Library (of these, one copy is a "reading room copy" and so cannot be borrowed; it should always be there).

Other material (e.g. reading lists, lecture slides, exercise sets, instructions and documentation for Matlab, Octave, and R, and links to the datasets used) will be put in Moodle as the course progresses.