Data Mining Project (guided self study)

582635
2
Algorithms and machine learning
Advanced studies
Application of data mining to a data analysis problem. The project covers the whole data mining process, and includes either implementing a data mining algorithm or using a wider range of available implementations. The project is completed by a research report describing and justifying the steps taken and decisions made, and discussing the results obtained. Prerequisites: The course Data Mining. The project can only be taken during the specified period. There are no final exams.
Year Semester Date Period Language In charge
2017 spring 13.03-13.03. 4-4 English Simo Linkola

Lectures

Time Room Lecturer Date
Mon 15-16 B119 Simo Linkola 13.03.2017-13.03.2017

Registration for this course starts on Tuesday 16 February at 9.00. The first lecture on Mon 13.3. at 15-16 in B119 is mandatory for everyone!

General

The participants in the data mining project will work on a topic of their own choosing. The projects should contain two main components: implementation of an algorithm for frequent pattern mining and application of it to real data, including interpretation and assessment of the results.
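As an illustration of the first component, here is a minimal Python sketch of level-wise frequent itemset mining in the style of Apriori. It is only one possible approach, not the implementation expected for the project. The function name `apriori`, the use of an absolute support count for `min_support`, and the toy basket data are all assumptions made for this example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset (frozenset): support count} for all frequent itemsets.

    min_support is an absolute count (an assumption of this sketch).
    """
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k:
                # Prune: every (k-1)-subset must itself be frequent.
                if all(frozenset(sub) in current
                       for sub in combinations(union, k - 1)):
                    candidates.add(union)
        # Support counting: one pass over the data per level.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        current = {s: n for s, n in counts.items() if n >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical toy market-basket data, made up for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
result = apriori(baskets, min_support=3)
```

On real data the support counting step dominates the running time, which is why the pruning step and more compact data structures (as in FP-Growth) matter in practice.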

The project will be done either individually or in teams of 2-4 students. Even in team work, each student must participate in both the implementation and the application of data mining algorithms. For participants who wish to work in a team, the teams will be formed during the first meeting. Some example topics and datasets will be provided by the course staff.

Course duration and grading

  • The project is 2 credits, but larger projects with extra credit can also be undertaken. If you choose to do so, check with Simo that the topic you are considering is suitable, and keep track of the hours you use. All projects should be finished by the end of the 4th period.
  • The project will be graded fail / pass / 5, where 5 corresponds to excellent, pass to good and fail to fail.

Submissions

All submissions during the course are made in Moodle. The enrolment key can be found in the starting lecture's slides.

Reserving Guidance

You can reserve individual or project guidance from Simo (slinkola@cs.helsinki.fi).

Simo is also available weekly in B233 on Wednesdays 13-15 (you can also try your luck any other time and drop by B233).

Project timeline

  • Mon 13.3. 15-16: Starting lecture, slides
  • Finding a team (or deciding to work alone)
  • DL Fri 17.3.: Enrol in the course on Moodle. All messages concerning the course are sent through Moodle.
  • Selecting a topic: task/algorithm and the data to be used
  • Working on the topic to decide whether it is feasible to do in a few credits
  • DL 31.3. 23.59: Reporting the topic of the project on Moodle
  • Working on the topic -- individual or project guidance hours can be reserved with Simo (slinkola@cs.helsinki.fi)
  • Presenting your work on Wed 3.5. 12-15 in CK107, Exactum. There will be a computer with an internet connection, on which you can download and show PDFs. Each group's presentation should last around 10-15 minutes, after which there will be time for questions. Overall, the presentation should be at a level that can be followed by anyone who has taken the DM course earlier.
  • DL 5.5. 23.59: Submitting the source code and the project report on Moodle. Use https://github.com/UniversityHelsinkiTKTL/tktltiki2 as the LaTeX template for the report.
  • Finish

About Report

The report on the project should contain:

  • Project overview: What was your goal, and how did you achieve it?
  • Description of your task: the patterns that are mined, the algorithm used to mine them, and pseudocode of the algorithm
  • Implementation details: Any preprocessing done on the data, optimisations, etc.
  • Instructions for compiling and running the code: How can others replicate your results?
  • Analysis of the results: How should your results be understood? For example, do not list all the frequent itemsets; give some examples of them and analyse what they mean for the data at hand!
  • Conclusions: What was good and what went wrong? Any possible directions for future work?
  • Time allocations used for the project for each group member and a short description of what each member did.

The report should not be too long, but it should be understandable to others who have taken the DM course earlier. Focus on the things specific to your project. Make your analysis meaningful but concise.

Literature and material

Possible Topics:

  • All the data mining tasks and algorithms covered in the course
  • Itemsets: Apriori, FP-Growth, Depth-first methods
  • Association rule sets
  • Sequence mining: text mining from a set of documents (tweets, Wikipedia, novels, etc.)
  • Graph mining: frequent subgraphs, etc. (from social graphs, molecular graphs, etc.)
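
To make the association rule topic above more concrete, the following Python sketch derives rules X -> Y from a table of itemset support counts and scores them by confidence and lift. It is a rough sketch under stated assumptions: the function name `association_rules`, its parameters, and the support counts are hypothetical, and the table is assumed to contain every non-empty subset of each itemset it lists (which holds for the output of an Apriori-style miner).

```python
from itertools import combinations


def association_rules(supports, n_transactions, min_confidence):
    """Derive rules X -> Y from a {frozenset: support count} table.

    Assumes `supports` is downward closed, i.e. it contains every
    non-empty subset of each itemset it lists.
    """
    rules = []
    for itemset, count in supports.items():
        if len(itemset) < 2:
            continue
        # Try every non-empty proper subset as the antecedent X.
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                consequent = itemset - antecedent
                # confidence(X -> Y) = support(X u Y) / support(X)
                confidence = count / supports[antecedent]
                if confidence >= min_confidence:
                    # lift = confidence / relative support of Y
                    lift = confidence / (supports[consequent] / n_transactions)
                    rules.append((antecedent, consequent, confidence, lift))
    # Strongest rules first.
    rules.sort(key=lambda rule: rule[2], reverse=True)
    return rules


# Hypothetical support counts over 5 transactions, made up for illustration.
supports = {
    frozenset({"beer"}): 3,
    frozenset({"diapers"}): 4,
    frozenset({"beer", "diapers"}): 3,
}
rules = association_rules(supports, n_transactions=5, min_confidence=0.8)
```

With these counts only the rule beer -> diapers passes the confidence threshold; the reverse rule has confidence 3/4 and is filtered out. A lift above 1 suggests the two sides co-occur more often than they would if they were independent.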

Datasets:

Select a dataset that you are comfortable working with. Familiar datasets make analysis and debugging easier. Remember that in this project we want to find interesting patterns, not use machine learning to, e.g., predict the values of some variables. Select the dataset accordingly.