# Data Mining Project

2

Algorithms and machine learning

Advanced studies

Application of data mining to a data analysis problem. The project covers the whole data mining process, and includes either implementing a data mining algorithm or using a wider range of available implementations. The project is completed by a research report describing and justifying the steps taken and decisions made, and discussing the results obtained. Prerequisites: The course Data Mining. The project can only be taken during the specified period. There are no final exams.

Year | Semester | Date | Period | Language | Person in charge |
---|---|---|---|---|---|
2014 | Spring | 05.05-16.05. | 4-4 | English | Fabio Cunial |

## Lectures

Time | Room | Lecturer | Date |
---|---|---|---|
Tue 14-16 | B222 | Fabio Cunial | 06.05.2014-06.05.2014 |
Mon 10-12 | B222 | Fabio Cunial | 12.05.2014-12.05.2014 |
Fri 10-14 | B222 | Fabio Cunial | 16.05.2014-16.05.2014 |

Registration for this course starts on Tuesday, 18 February, at 9.00.

## General

The objectives of this project are:

- to gain exposure to advanced concepts or practices in itemset and association rule discovery;
- to understand where the field is currently going;
- *to do something cool that you could put on your CV;*
- to have fun :-)

## Completing the course

The project can be completed using one of the following mutually exclusive strategies. Regardless of the strategy, the student must submit a detailed report of her activity.

1. **(Algorithms)** Study one of the papers listed in section "Literature and material", and either:
   - write a detailed summary of the paper, or
   - implement the main idea described in the paper, or
   - improve the theoretical results of the paper.
2. **(Implementations)** Perform an in-depth review of the implementations that are currently available for itemset and association rule discovery. In particular, choose one of the options below:
   1. Review *the whole state of the art*. What is the architecture of such implementations? Do they support parallelism? How do they handle large datasets? Which implementation choices do they make? Which of them performs best on benchmark datasets? Collect and plot performance metrics.
   2. Study the fine details of *one specific implementation*. Answer the same questions as in point (2.1), but in greater depth. Read and possibly change the source code.
3. **(Datasets)** Using the algorithms studied in the Data Mining course, and possibly interacting with a domain expert, design a controlled set of experiments to find semantically meaningful patterns in the course datasets. Perform a detailed analysis of the discovered patterns.
4. **(Applications)** Design and implement an innovative application of the algorithms studied in the Data Mining course (for example a smartphone app, a Facebook app, a Gmail app, or a Google Calendar app -- for possible inspiration, see e.g. this blog post, this Facebook app, this smartphone app, and this example of app integration: can you do better?). The application must be agreed on beforehand with the instructor, and it must have a well-defined purpose and a clear utility (but it can use existing algorithms and implementations). The student is expected to have prior working knowledge of the technologies required to implement the application.

Strategies (2), (3) and (4) allow students to form groups of at least two people and to submit a joint report.
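For strategy (2), collecting performance metrics could start from a small timing harness. The sketch below is only an illustration under stated assumptions: `frequent_itemsets` is a deliberately naive level-wise miner standing in for whichever real implementation is under review, and `benchmark`, `TOY`, and the chosen support thresholds are hypothetical names and values, not part of the course material.

```python
import time
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Naive level-wise miner (stand-in for a real implementation).

    Returns a dict mapping each frequent itemset (a frozenset) to its
    absolute support count.
    """
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    result = {}
    k = 1
    candidates = [frozenset([item]) for item in items]
    while candidates:
        # Count the support of every candidate with a full database scan.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        result.update(frequent)
        # Join pairs of frequent k-itemsets into (k+1)-item candidates.
        k += 1
        candidates = list(
            {a | b for a, b in combinations(frequent, 2) if len(a | b) == k}
        )
    return result


def benchmark(miner, transactions, supports, repeats=3):
    """Time `miner` at each support threshold, keeping the best of `repeats` runs.

    Returns a list of (min_support, number_of_itemsets, seconds) rows.
    """
    rows = []
    for min_support in supports:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            found = miner(transactions, min_support)
            best = min(best, time.perf_counter() - start)
        rows.append((min_support, len(found), best))
    return rows


# Hypothetical toy dataset; a real study would use benchmark datasets.
TOY = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

if __name__ == "__main__":
    for min_support, n_itemsets, seconds in benchmark(frequent_itemsets, TOY, [2, 3, 4]):
        print(f"min_support={min_support} itemsets={n_itemsets} time={seconds:.6f}s")
```

The rows returned by `benchmark` can be passed to any plotting tool. Reporting the best of several runs is one common way to reduce timing noise, but the report should state whichever measurement protocol is actually used.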

## Literature and material

- Summarizing probabilistic frequent patterns: a fast approach
- Mining high utility episodes in complex event sequences
- Permutation-based sequential pattern hiding
- Mining probabilistic frequent spatio-temporal sequential patterns with gap constraints from uncertain databases
- Mining statistically significant sequential patterns
- Enumeration of time series motifs of all lengths
- Dominance programming for itemset mining
- Mining dependent frequent serial episodes from uncertain sequence data
- Efficiently mining top-k high utility sequential patterns
- Mining probabilistic representative frequent patterns from uncertain data
- Itemset based sequence classification
- A relevance criterion for sequential patterns
- Fast and exact mining of probabilistic data streams

Any other paper from the following conferences/journals can be used as well, but the student needs to prove to the instructor that the chosen paper conforms to the learning objectives of the project.

- 2013 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
- 2013 IEEE International Conference on Data Mining (ICDM)
- 2013 SIAM International Conference on Data Mining (SDM)
- 2013 European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD)
- *Data Mining and Knowledge Discovery*