Data mining methods, autumn 2005 (S05), take-home exam

582448 Data mining methods
Take-home exam, 21.11.2005
Hannu Toivonen
(In English: see below)

This is a take-home exam. You may work on the tasks in your own time and make use of source material. However, everyone must take the exam alone. Collaboration with others is forbidden, as is copying ready-made solutions or text from any other source. Answer concisely but justify your answers.

Answers are returned electronically (txt, pdf, ps or doc) to hannu.toivonen.xxx.cs.helsinki.fi by 5.12.2005 at the latest. Late submissions are not graded. The take-home exam cannot be retaken; regular exams will be arranged instead.

The maximum score for this exam is 60 points. Those who completed the practical assignments in autumn 2005 additionally receive, as bonus points for the exam, their assignment points in excess of ten (e.g. 18 assignment points give 8 bonus points for the exam).

Compare the whole data mining process in clustering, classification and association rule mining. Which things should be done differently, how, and why? (19 p)
In the following tasks, methods covered in the course are applied by hand to small datasets. You may use software to check your results if you wish. (21 p)
- A. Cluster the following two-dimensional dataset with a method of your choice, either into two clusters or hierarchically. Briefly describe the method you use. Also give the results of the key intermediate steps.
```
2 8
3 8
7 2
4 7
8 1
9 3
1 7
8 3
```
- B. Build a decision tree classifier for the following data (the last column is the class to be predicted). State which method you use by referring to the textbook; do not give the algorithm here. Also give the results of the key intermediate steps.
```
A X 4 1
A X 3 0
B X 2 1
A Y 1 0
B Y 5 1
A Y 8 0
B X 7 1
A Y 6 0
```
- C. Using the main principles of the Apriori algorithm, find from the following dataset the association rules whose support is at least 4 (as a count) and whose confidence is at least 0.8. Also give the candidate sets the algorithm uses in each iteration (those whose support is counted from the database) and the frequent (large) itemsets. Also give the rules tested by the algorithm whose confidence is below 0.8.
```
A C E F H I K
B G L
A B C D F H J K
C G H K L
B C E F
B D E F H I J K
D E G H K L
A F H K L
```
The study register of the University of Helsinki contains, for each student, information on the courses they have completed, the sizes of the courses in credit units, the completion dates and grades, and the time of graduation. Consider each of the following hypothetical ideas for utilizing this data. Explain, with justification, what kind of method you would primarily use to mine the data in each situation, what choices you would make in handling the data and applying the method, what problems might be associated with the approach or the data, etc. (20 p)
- A. The study advisors of the Department of Computer Science want to get an overview of the department's students.
- B. The enrollment system is to be made easier to use: the system predicts, for each student, the courses he or she is most likely to enroll in and makes enrolling in them easy.
- C. The enrollment system is to be extended to help students choose and assess courses. For this purpose, each student's grades are predicted for those courses of the major subject that he or she has not yet taken.
- D. The department aims to reduce the number of students dropping out. The most likely drop-outs should be identified before they drop out, so that guidance can be targeted more effectively.

Add to your answer the text "I have done the answers myself without help from anyone." and, as confirmation, write your name below it. (Since the answers are returned electronically, this does not need to be signed by hand.)

The exam is graded only for those who have registered for the course. So register if necessary.

582448 Data mining methods
Take-home exam, 21st November 2005
Hannu Toivonen

This is a take-home exam. You may answer the questions in your own time using reference material. However, you must take the exam alone. Co-operation with other students is forbidden, and so is copying answers or material from any source. Be concise in your answers, but remember to justify them.

Return your answers by email to hannu.toivonen.xxx.cs.helsinki.fi (txt, pdf, ps, and doc are acceptable formats) by 5th Dec 2005 at the latest. Late answers are not graded. The take-home exam cannot be retaken; however, regular exams will be organized later.

The maximum number of points is 60.

Compare the whole knowledge discovery process when doing clustering, classification or association rule mining. Which parts of the process should be done in different ways, how, and why? (19 p)
In the next tasks, methods covered in the course are manually applied to small datasets. You may use software to check your results. (21 p)
- A. Cluster the following two-dimensional dataset with a method of your own choice, either into two clusters or hierarchically. Briefly describe the method you use. Also give the key intermediate results of the clustering.
```
2 8
3 8
7 2
4 7
8 1
9 3
1 7
8 3
```
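Since the task allows software to be used for checking results, one possible check for the two-cluster variant is a plain k-means run. The sketch below is an illustration under stated assumptions, not a model answer: k-means is only one of the admissible methods, and the Euclidean distance, the use of the first two data points as initial centroids, and the names `POINTS` and `kmeans` are all choices made here, not part of the exam.

```python
from math import dist  # Euclidean distance, Python 3.8+

# The eight points from task A.
POINTS = [(2, 8), (3, 8), (7, 2), (4, 7), (8, 1), (9, 3), (1, 7), (8, 3)]


def kmeans(points, k=2, max_iter=100):
    """Plain k-means; initial centroids are the first k points (an assumption)."""
    centroids = [tuple(map(float, p)) for p in points[:k]]
    labels = None
    history = []  # (labels, centroids) after each iteration that changed something
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
        history.append((list(labels), list(centroids)))
    return labels, centroids, history


labels, centroids, history = kmeans(POINTS)
for step, (step_labels, step_centroids) in enumerate(history, start=1):
    print(f"iteration {step}: labels={step_labels} centroids={step_centroids}")
```

The `history` list holds the assignment and the centroids after each iteration, which corresponds to the intermediate results the task asks for. A different initialization may need a different number of iterations.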
- B. Build a decision tree classifier for the following data (the last column is the class to be predicted). Specify the method you use by referring to the textbook; do not write out the algorithm here. Also give the key intermediate results.
```
A X 4 1
A X 3 0
B X 2 1
A Y 1 0
B Y 5 1
A Y 8 0
B X 7 1
A Y 6 0
```
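The key intermediate results of a decision tree construction are typically the scores of the candidate splits at each node. The sketch below computes them for the root node only, assuming an entropy-based information gain criterion (as in ID3-style methods). The exam does not prescribe a criterion, so this is an assumption, as are the column labels `col0`, `col1`, `col2`, the treatment of the third column as numeric with midpoint thresholds, and all function names.

```python
from collections import Counter
from math import log2

# Rows from task B: two categorical columns, one numeric column, class label.
ROWS = [
    ("A", "X", 4, 1), ("A", "X", 3, 0), ("B", "X", 2, 1), ("A", "Y", 1, 0),
    ("B", "Y", 5, 1), ("A", "Y", 8, 0), ("B", "X", 7, 1), ("A", "Y", 6, 0),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def info_gain(rows, partition_key):
    """Entropy reduction obtained by grouping rows by partition_key(row)."""
    groups = {}
    for row in rows:
        groups.setdefault(partition_key(row), []).append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - remainder


def candidate_splits(rows):
    """Multiway splits on the categorical columns, binary thresholds on col2."""
    splits = {"col0": lambda r: r[0], "col1": lambda r: r[1]}
    values = sorted({r[2] for r in rows})
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2  # midpoint between consecutive observed values
        splits[f"col2 <= {t}"] = lambda r, t=t: r[2] <= t
    return splits


gains = {name: info_gain(ROWS, key) for name, key in candidate_splits(ROWS).items()}
best = max(gains, key=gains.get)
for name, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{name:12s} gain = {gain:.4f}")
print("best root split:", best)
```

The same two functions can be applied again to the subset of rows in each branch to continue the tree below the root. Other criteria, such as gain ratio or the Gini index, may rank the splits differently.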
- C. Use the principles of the Apriori algorithm to find, from the following dataset, association rules with support at least 4 and confidence at least 0.8. Also give the candidate sets of each iteration (those counted in the database), as well as the frequent ("large") itemsets. Also give those association rules that have confidence below 0.8 but are tested by the algorithm.
```
A C E F H I K
B G L
A B C D F H J K
C G H K L
B C E F
B D E F H I J K
D E G H K L
A F H K L
```
The Study Register of the University of Helsinki maintains information about courses taken by students. For each student it contains the courses he or she has taken, the numbers of credit units, the dates of completion of the courses, the grades, and the date of MSc graduation, if graduated. Consider the following hypothetical ideas for utilizing this data. Explain which method you would primarily use to mine the data, the choices you would make when handling the data or applying the algorithm, what kind of problems could be associated with the data or method, etc. Justify your answers. (20 p)
- A. Study advisors at the Department of Computer Science want to get an overview of the students at the department.
- B. The usability of the course enrollment system will be improved. The system would predict for each user the courses he or she is most likely to enroll in, and make it easier to enroll in these courses.
- C. The course enrollment system will be further extended, to help students choose and assess courses: the system would predict for each user his or her grades in the remaining courses of his/her major subject.
- D. The department wants to cut down the number of drop-outs. They would like to be able to identify the most likely drop-outs before they stop their studies, in order to improve their study guidance.

Add the following text to your answer: "I certify that the submitted work was done by me without help from any other person", and add your name below this text. (Since the answers are returned electronically, there is no need to sign by hand.)

To take this exam, you need to register for the course using the course registration system. Please register if needed.