Data Mining Project (guided self study)
Year | Semester | Date | Period | Language | In charge |
---|---|---|---|---|---|
2017 | spring | 13.03-13.03. | 4-4 | English | Simo Linkola |
Lectures
Time | Room | Lecturer | Date |
---|---|---|---|
Mon 15-16 | B119 | Simo Linkola | 13.03.2017-13.03.2017 |
Registration for this course starts on Tuesday 16 February at 9.00. The first lecture on Mon 13.3. at 15-16 in B119 is obligatory for everybody!
General
The participants in the data mining project will work on a topic of their own choosing. The projects should contain two main components: implementation of an algorithm for frequent pattern mining and application of it to real data, including interpretation and assessment of the results.
The project will be done either individually or in teams of 2-4 students. Even in team work, each student must participate in both the implementation and the application of the data mining algorithms. Teams will be formed during the first meeting for those who wish to work in one. Some example topics and datasets will be provided by the course staff.
Course duration and grading
- The project is 2 credits, but larger projects with extra credit can also be undertaken. If you choose to do so, ask Simo if the topic you are considering is good and keep track of the hours you are using. All the projects should be finished by the end of the 4th period.
- The project will be graded fail / pass / 5, where 5 corresponds to excellent, pass to good and fail to fail.
Submissions
All submissions during the course are done in Moodle. The enrolment key can be found in the starting lecture's slides.
Reserving Guidance
You can reserve individual or project guidance from Simo (slinkola@cs.helsinki)
Simo is also available weekly in B233 on Wednesdays 13-15 (you can also try your luck any other time and drop by B233).
Project timeline
- Mon 13.3. 15-16: Starting lecture, slides
- Finding a team (or deciding to work alone)
- DL Fri 17.3.: Enrol on the course in Moodle. All messages concerning the course are sent through Moodle.
- Selecting a topic: task/algorithm and the data to be used
- Working on the topic to decide whether it is feasible to do in a few credits
- DL 31.3. 23.59: Reporting the topic of the project on Moodle
- Working on the topic; individual or project guidance hours can be reserved with Simo (slinkola@cs.helsinki.fi)
- Presenting your work on Wed 3.5. 12-15 in CK107, Exactum. There will be a computer with an internet connection, where you can download and show PDFs. Each group's presentation should take around 10-15 minutes, after which there will be time for questions. Overall, the presentation should be on a level that can be followed by anyone who took the DM course earlier.
- DL 5.5. 23.59: Submitting the source code and the project report on Moodle. Use https://github.com/UniversityHelsinkiTKTL/tktltiki2 as the LaTeX template for the report.
- Finish
About Report
The report on the project should contain:
- Project overview: What was your goal, and how did you achieve it?
- Description of your task: the patterns that are mined and the algorithm used to mine them, including pseudocode of the algorithm
- Implementation details: Any preprocessing done for the data, optimisations, etc.
- Instructions for compiling and running the code: How can others replicate your results?
- Analysis of the results: How should your results be understood? For example, do not list all the frequent itemsets; give some examples of them and analyse what they mean for the data at hand!
- Conclusions: What was good and what went wrong? Any possible directions for future work?
- Time spent on the project by each group member and a short description of what each member did.
The report should not be too long, but it should be understandable to others who have taken the DM course earlier. Focus on the things specific to your project. Make your analysis meaningful but concise.
Literature and material
Possible Topics:
- All the data mining tasks and algorithms covered in the course
- Itemsets: Apriori, FP-Growth, Depth-first methods
- Association rule sets
- Sequence mining: text mining from a set of documents (tweets, wikipedia, novels, etc.)
- Graph mining: frequent subgraphs, etc. (from social graphs, molecular graphs, etc)
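As a starting point for the itemset topics, the level-wise idea behind Apriori can be sketched as follows. This is only a minimal illustration, not a reference solution for the project: the function name `apriori`, the use of absolute support counts for `min_support`, and the toy transactions are assumptions made for this example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset (frozenset): support count} for all frequent itemsets.

    min_support is an absolute count (an assumption of this sketch).
    """
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune step: every (k-1)-subset must itself be frequent.
            if all(frozenset(sub) in frequent
                   for sub in combinations(union, k - 1)):
                candidates.add(union)

        # Count the support of the surviving candidates.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        frequent = {s: n for s, n in counts.items() if n >= min_support}
        result.update(frequent)
        k += 1
    return result


# Toy data, invented for illustration only.
toy = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent_itemsets = apriori(toy, min_support=3)
```

The sketch rescans all transactions for every candidate, which is fine for small test data but is exactly the kind of cost that the optimisations discussed in the report (or methods such as FP-Growth) are meant to reduce.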
Datasets:
Select a dataset that you are confident working with. Familiar datasets make analysis and debugging easier. Remember that in this project we want to find interesting patterns, not use machine learning to, e.g., predict the values of some variables. Select the dataset accordingly.
- Some preprocessed datasets to help you develop your frequent itemset mining implementation. The small datasets (e.g. chess, retail, etc.) are only usable for testing your algorithm; they are not to be used for the final report!
- UCI Machine Learning Repository (scroll down for more recent examples). Many of the datasets are not suitable for the project!
- Movie Lens dataset consisting of movie ratings
- Global election data
- NYC Taxi data (several gigabytes)
- HSL open data (only in Finnish)
- FMI open data (weather data from the Finnish Meteorological Institute, seems to be only in Finnish)
- Various open datasets (names of the datasets seem to be mostly in Finnish)
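When testing an implementation on a small transaction dataset, a loader along the following lines may be handy. It is a sketch under a stated assumption: each line holds one transaction, with item identifiers separated by whitespace. The file formats of the datasets listed above are not specified here, so check the actual data before relying on this; the names `read_transactions` and `item_counts` are made up for the example.

```python
from collections import Counter


def read_transactions(lines):
    """Parse an iterable of text lines into a list of frozensets.

    Assumed format: one transaction per line, items separated by whitespace.
    Blank lines are skipped. Items are kept as strings.
    """
    transactions = []
    for line in lines:
        items = line.split()
        if items:
            transactions.append(frozenset(items))
    return transactions


def item_counts(transactions):
    """Count in how many transactions each single item occurs."""
    counts = Counter()
    for t in transactions:
        counts.update(t)
    return counts


# Usage with an in-memory sample; an open file object works the same way,
# e.g. `with open(path) as f: transactions = read_transactions(f)`.
sample = ["1 2 3\n", "\n", "2 3\n", "3 4\n"]
transactions = read_transactions(sample)
counts = item_counts(transactions)
```

Looking at the single-item counts first gives a rough idea of which minimum support thresholds are sensible for the data.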