Data Mining : Problem #3
Problem description
This time you can choose your own data and specify your own exact problem, as long as they are about finding frequent patterns in sequences. The data can consist of a number of sequences, or one long sequence, depending on the application.
Two possible data sets are provided:
- A new version of the course registration data is provided with temporal information, in the sense that it contains the sequence of courses each student enrolled to. So the order of the courses is now available. Using this additional information, can you deepen your analysis of students enrollement to provide a better understanding of the type of curricula following? You are supposed to formulate your own question, i.e., define yourself precisely what patterns you are looking for in this dataset.
- Phrases extracted from blogs and news articles gathered over a year and grouped by month can be found online: www.memetracker.org/data.html#raw . This data can be analysed to identify catch phrases that have spread quickly around the web and study their evolution. This is known as meme tracking. A first, simple version of meme tracking would be to identify sequences of words that appear frequently within those phrases. The data is very large, you do not need to use the whole of it, nor to consider the full information, url, timestamp, etc. You can for example focus on the raw phrases (lines prefixed with Q) that appeared in April 2008 and try to find frequent word subsequences in that subset.
If you decide to use some other dataset, make sure it is large enough so looking for frequent patterns makes sense.
Instructions
A. The first task is to specify exactly the type of patterns you aim to find and an algorithm for finding them (in addition to choosing a dataset). Deadline: report in 1 week from opening Problem #3.
B The second task is to implement (as a group, or individually if you so choose) the algorithm and to apply it on the data. Deadline: report in 2 weeks from opening Problem #3.
C. Keep a personal learning diary during the work on this problem (see instructions on a separate tab). Deadline: in 2 weeks from opening Problem #3.
Hints
New concepts to be learned and used: frequent pattern problem specification, sequential patterns and generalized Apriori (i.e., levelwise search).
(Note that the course data has a more complex structure: it is not simply a sequence of courses, but a sequence of sets of courses (each set corresponds to courses taken parallel during one term). The course book covers this case. The wiki category data can be seen more simply as a sequence of words or a sequence of characters. Feel free to choose either one of these problems and settings.)
Concepts you will need to refresh and generalize: Apriori algorithm (or another frequent pattern algorithm) and candidate generation. How can their ideas be applied to mining frequent sequences? What is the search space? How can candidates be generated and pruned when dealing with sequences? How about subsequences (allowing gaps between items) versus substrings (no gaps) as sequential patterns?
A1. First specify precisely the type of patterns you are looking for in the data: What are patterns? When does a pattern occur in data? What is the support/frequency of a pattern? These may sound trivial but often are not!. (What about closed or maximal patterns, or rules derived from supports of two patterns?)
A2. Then give an algorithm. Are the patterns and support monotone so that the Apriori principle can be used? What is the search strategy, e.g. Apriori/levelwise or Eclat/breadth-first? How are candidates generated and potentially pruned? Can you show that the algorithm is complete, i.e., finds all frequent patterns as you have specified above? Can you something about its efficiency, e.g., not generating the same candidate many times?
Background reading in Moodle: https://moodle.helsinki.fi/mod/resource/view.php?id=362233

