Data Mining: Problem #2
Problem Description
Essentially the same data and the same question, but let's consider more advanced concepts and different algorithms.
1) More advanced concepts.
Based on the results for Problem #1, it is evident that frequent itemsets as such are not very useful. Some of them are simply not interesting, many are redundant, and dependencies between courses are not clearly expressed. A support threshold, or minimum and maximum sizes of frequent itemsets, is not sufficient to select a reasonable number of the most interesting or descriptive itemsets.
This time, consider the use of maximal frequent itemsets and closed frequent itemsets as compact representations of frequent itemsets. Also consider association rules to describe statistical dependencies between courses. Use confidence and some other interestingness measures to identify rules that are potentially more useful.
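The three concepts can be illustrated on a toy example (hypothetical course IDs and made-up supports, not the course data): an itemset is closed if no proper superset has the same support, and maximal if no proper superset is frequent at all; the confidence of a rule X -> Y is support(X u Y) / support(X).

```python
# Toy frequent itemsets with their supports (hypothetical course IDs,
# made-up counts, minimum support already applied).
freq = {
    frozenset({"A"}): 5,
    frozenset({"B"}): 4,
    frozenset({"A", "B"}): 4,
    frozenset({"A", "C"}): 3,
}

# Closed: no proper superset has the same support.
closed = {s for s in freq
          if not any(s < t and freq[t] == freq[s] for t in freq)}

# Maximal: no proper superset is frequent at all.
maximal = {s for s in freq if not any(s < t for t in freq)}

# Confidence of the association rule {B} -> {A}:
# support({A, B}) / support({B}) = 4 / 4 = 1.0
conf = freq[frozenset({"A", "B"})] / freq[frozenset({"B"})]
```

Note that every maximal itemset is closed, but the closed itemsets additionally preserve the exact supports of all frequent itemsets they subsume, so they are a lossless compact representation while the maximal ones are not.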
2) Different algorithms.
Write your own implementation of (A) Apriori, (B) a depth-first method (e.g. ECLAT) or (C) FP-Growth. To find maximal or closed frequent itemsets, either (1) use a suitably modified version of a frequent itemset mining algorithm, or (2) use a basic algorithm to find all frequent itemsets, then postprocess that result to extract the maximal or closed frequent itemsets.
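As a rough sketch of the shape an option (A) solution might take (this is a naive version with plain candidate counting, no hash trees or other speed-ups; the names and structure are just one possible design):

```python
from collections import defaultdict
from itertools import combinations

def apriori(transactions, minsup):
    """Minimal Apriori sketch: return {itemset: support} for all
    frequent itemsets. Naive counting, for illustration only."""
    transactions = [frozenset(t) for t in transactions]

    # Frequent 1-itemsets.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[frozenset([item])] += 1
    freq = {s: c for s, c in counts.items() if c >= minsup}
    result = dict(freq)

    k = 2
    while freq:
        # Join: size-k unions of two frequent (k-1)-itemsets.
        prev = list(freq)
        candidates = {a | b for a, b in combinations(prev, 2)
                      if len(a | b) == k}
        # Prune: drop candidates with an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in freq
                             for s in combinations(c, k - 1))}
        # Count candidate occurrences in the transactions.
        counts = defaultdict(int)
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        freq = {s: n for s, n in counts.items() if n >= minsup}
        result.update(freq)
        k += 1
    return result
```

Once all frequent itemsets are available, postprocessing option (2) amounts to keeping those itemsets whose frequent proper supersets either do not exist (maximal) or all have strictly smaller support (closed).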
For Apriori-like approaches, see link "Knowledge Discovery in Databases, course notes" in Moodle for efficient implementation of candidate generation.
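For illustration, one common efficient scheme is the prefix join: represent each frequent (k-1)-itemset as a sorted tuple and join only pairs that agree on their first k-2 items (a sketch under that assumption; the course notes may describe a different variant):

```python
from itertools import combinations

def generate_candidates(level, k):
    """Prefix-join candidate generation: join two sorted (k-1)-itemsets
    that share their first k-2 items, then prune every candidate that
    has an infrequent (k-1)-subset. `level` holds the frequent
    (k-1)-itemsets as sorted tuples."""
    prev = sorted(level)
    candidates = []
    for i, a in enumerate(prev):
        for b in prev[i + 1:]:
            if a[:-1] != b[:-1]:   # prefixes diverge; later b's do too
                break
            cand = a + (b[-1],)    # the union, still in sorted order
            if all(s in level for s in combinations(cand, k - 1)):
                candidates.append(cand)
    return candidates
```

Because the level is sorted, each itemset is joined only with its immediate prefix-sharing neighbours, which avoids forming and filtering all pairwise unions as a naive join would.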
Each student individually produces at least one implementation. A group would ideally produce different algorithms and would then be able to compare them. Experiment with different data sets/versions of the data set, with different parameter values, etc., and try to find differences in the behaviour of the different methods.
Reporting: This time, each student produces a report of their own, based on their implementation and their results. In addition, the group as a whole produces a report that summarizes the findings and compares the approaches taken by the students in the group. Note the deadlines for these two components!
A new version of the dataset is available on the Moodle course page. It contains information about the program each course belongs to, the years it was taught, its level, etc. Have a look at the README file for more details.
Again, try out different study methods, work out exercises, and experiment with the data!
Additional resources
Here are pointers to material related to the new concepts in this problem:
- Entries for Frequent Pattern, Apriori Algorithm, Association Rule, and Frequent Itemset in Encyclopedia of Machine Learning (http://www.cs.helsinki.fi/u/htoivone/pubs/)
- Slides on closed sets and generators: http://www.cs.helsinki.fi/u/htoivone/teaching/timuS02/closedsets.pdf
And to go deeper in the topic, you might be interested in looking at these additional resources:
- Finding maximal frequent itemsets: Roberto J. Bayardo Jr: Efficiently Mining Long Patterns from Databases. In Proceedings ACM SIGMOD International Conference on Management of Data, June 1998, Seattle, Washington, USA. 85-93.
- Generating "representative" association rules using closed itemsets: Marzena Kryszkiewicz: Closed Set Based Discovery of Representative Association Rules. In Advances in Intelligent Data Analysis, Lecture Notes in Computer Science 2189. Springer 2001. 350-359. (The link should be accessible from the university network.)

