ECML/PKDD-2002 Tutorial

An Introduction to Quality Assessment in Data Mining

20 August 2002 9:00-13:00

Index

Motivation · Objectives · Target Audience · Outline · Presenters

Motivation

Data Mining is mainly concerned with methodologies for extracting patterns from large data repositories. There are many data mining methods, each of which accomplishes a limited set of tasks and produces a particular enumeration of patterns over datasets. The main data mining tasks are: i) Clustering, ii) Classification, iii) Association Rule Extraction.

Since a data mining system can generate a large number of patterns, questions arise about the quality of the data mining results, such as which of the extracted patterns are interesting and which of them represent valid knowledge.

In general, a pattern is interesting if it is easily understood, valid, potentially useful and novel. A pattern is also considered interesting if it validates a hypothesis that a user sought to confirm. An interesting pattern represents useful knowledge.

The interestingness of patterns depends on the quality of both the analysed data and the data mining results. Thus, several techniques have been developed for evaluating and preparing the data used as input to the data mining process. A number of techniques and measures have also been developed for evaluating and interpreting the extracted patterns.

Objectives

In this tutorial we address the important issue of assessing the quality of data mining results. We introduce the fundamental concepts of this area and present a review of clustering validity indices, as well as approaches and measures for evaluating the classification process and the interestingness of association rules.

Target Audience

The target audience consists of researchers, practitioners and advanced students with some knowledge of data mining who desire an introduction to data mining quality assessment techniques.

The tutorial is targeted at scientists with a basic understanding of data mining but no knowledge of quality assessment in data mining. The relevant data mining concepts will be reviewed, and the quality criteria and techniques for evaluating data mining results will be introduced and explained through examples.

Outline

1. Introduction and Motivation

This part discusses issues concerning the validity of data mining results that recent techniques leave under-addressed. It gives the motivation for introducing approaches that indicate the quality of data mining results, and then introduces the fundamental concepts of this area.

2. Cluster Validity Fundamental Concepts

This part addresses an important issue of the clustering process: the quality assessment of the clustering results. This issue is also related to the inherent features of the data set under consideration. A review of the clustering validity indices and approaches available in the literature is presented. More specifically, this part of the tutorial discusses the following sub-topics:

2.1 What is cluster validity?

2.2 Cluster Validity Criteria

2.3 Cluster Validity Indices

A review of cluster validity indices based on:

External Criteria
Internal Criteria
Relative Criteria
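As an illustration of an index based on external criteria, the following minimal sketch computes the Rand index, which compares a clustering against a reference partition by counting the pairs of points on which the two agree. The function name `rand_index` and the label-list representation are assumptions made for this example; they are not taken from the tutorial material.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two partitions agree.

    A pair agrees when both partitions put the two points in the same
    cluster, or both put them in different clusters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same points")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("at least two points are required")
    agree = sum(
        1
        for i, j in pairs
        if (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
    )
    return agree / len(pairs)
```

A value of 1 means the two partitions are identical up to a renaming of the cluster labels; lower values indicate more disagreement.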

2.4 Experimental Study

3. Evaluation of Classification Methods

3.1 Classification Model Accuracy

The most common techniques for assessing classifier accuracy will be discussed:

Hold-out method
k-fold cross-validation
Bootstrapping
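The three techniques differ mainly in how the available samples are divided into training and test sets. The sketch below shows one possible way to generate the index splits using only the Python standard library; the function names, the default test fraction and the fixed seeds are illustrative assumptions, not part of the tutorial.

```python
import random


def holdout_split(n, test_fraction=0.3, seed=0):
    """Single hold-out split: return (train, test) index lists."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = int(round(n * test_fraction))
    return idx[n_test:], idx[:n_test]


def kfold_splits(n, k=5, seed=0):
    """k-fold cross-validation: yield k (train, test) index lists.

    Every sample appears in exactly one test fold.
    """
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def bootstrap_split(n, seed=0):
    """Bootstrap: draw n training indices with replacement.

    Samples that were never drawn form the test set.
    """
    rng = random.Random(seed)
    train = [rng.randrange(n) for _ in range(n)]
    chosen = set(train)
    test = [j for j in range(n) if j not in chosen]
    return train, test
```

In each case a classifier would be trained on the `train` indices and its accuracy measured on the `test` indices; for k-fold cross-validation the k accuracy values are typically averaged.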

3.2 Interestingness Measures of Classification Rules

This part discusses some representative measures for ranking the usefulness and utility of discovered classification patterns (classification rules):

Rule-Interest Function
Smyth and Goodman's J-Measure
General Impressions
Gago and Bento's Distance Metric
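The first two measures in this list have compact closed forms. The sketch below assumes the usual definitions: Piatetsky-Shapiro's rule-interest function compares the observed co-occurrence count of antecedent and consequent with the count expected under independence, and Smyth and Goodman's J-measure weights the information content of a rule by the probability of its antecedent. The function and parameter names are hypothetical, and the J-measure code assumes the consequent probability lies strictly between 0 and 1.

```python
import math


def rule_interest(n_both, n_antecedent, n_consequent, n_total):
    """Observed minus expected co-occurrence count under independence."""
    return n_both - (n_antecedent * n_consequent) / n_total


def j_measure(p_antecedent, p_consequent, p_consequent_given_antecedent):
    """J-measure of a rule 'if antecedent then consequent' (in bits).

    Assumes 0 < p_consequent < 1. Terms with zero probability
    contribute zero, following the convention 0 * log(0) = 0.
    """
    def term(p, q):
        return 0.0 if p == 0 else p * math.log2(p / q)

    p = p_consequent_given_antecedent
    return p_antecedent * (term(p, p_consequent) + term(1 - p, 1 - p_consequent))
```

Both measures are zero when the antecedent and consequent are statistically independent, and grow as the rule departs from independence.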

4. Association Rules Interestingness Measures

A review of measures that indicate the importance and confidence of association rules will be presented. These measures can express the predictive advantage of a rule, and thus help to identify interesting patterns of knowledge in data and to support decision making.

Strength
Coverage
Support
Leverage
Lift
Other Interestingness Measures

Klemettinen et al.'s Rule Templates
Gray and Orlowska's Interestingness
Dong and Li's Interestingness
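The basic measures listed above (strength, coverage, support, leverage and lift) can all be derived from simple transaction counts. The following minimal sketch assumes that "strength" refers to what is commonly called confidence, and that transactions are given as collections of items; the function name `rule_measures` and the example data are hypothetical.

```python
def rule_measures(transactions, lhs, rhs):
    """Basic interestingness measures for the rule lhs -> rhs."""
    if not transactions:
        raise ValueError("at least one transaction is required")
    lhs, rhs = set(lhs), set(rhs)
    n = len(transactions)
    n_lhs = sum(1 for t in transactions if lhs <= set(t))
    n_rhs = sum(1 for t in transactions if rhs <= set(t))
    n_both = sum(1 for t in transactions if (lhs | rhs) <= set(t))

    coverage = n_lhs / n                         # P(LHS)
    p_rhs = n_rhs / n                            # P(RHS)
    support = n_both / n                         # P(LHS and RHS)
    strength = n_both / n_lhs if n_lhs else 0.0  # P(RHS | LHS), assumed = confidence
    leverage = support - coverage * p_rhs        # excess over independence
    lift = strength / p_rhs if p_rhs else 0.0    # ratio to independence
    return {
        "coverage": coverage,
        "support": support,
        "strength": strength,
        "leverage": leverage,
        "lift": lift,
    }


# Hypothetical market-basket data for the rule {bread} -> {milk}.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]
measures = rule_measures(baskets, {"bread"}, {"milk"})
```

A lift below 1 or a negative leverage, as in this example, suggests that the left-hand side makes the right-hand side slightly less likely than it is overall.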

5. Summary and Trends

This part summarizes the main points of the tutorial regarding the quality assessment of data mining results. It also presents trends in the field and directions for further work.

Presenters

Maria Halkidi and Michalis Vazirgiannis
Dept of Informatics
Athens University of Economics & Business
Patision 76 Street, Athens 10434, Greece
Voice: +30-10-8203513(519)
Fax: +30-10-8203517

Dr. Michalis Vazirgiannis
Dr. Vazirgiannis is an Assistant Professor in the Dept. of Informatics of Athens Univ. of Economics & Business. He holds a degree in Physics (1986), an MSc. in Robotics (1988), and an MSc. in Knowledge Based Systems. In 1994 he obtained a Ph.D. degree in Informatics. Since then, he has conducted research in the Knowledge & DB Lab (of N.T.U. Athens, Greece), in GMD-IPSI (Darmstadt, Germany), in Fern-Universitaet (Hagen, Germany) and in project VERSO in INRIA/Paris. His research interests and work range from Data Mining to Spatiotemporal databases. He has twice received the ERCIM fellowship. He has published two books and over 50 papers in international conferences and journals. He is currently leading three international basic research projects funded by the EU. He has served as a conference committee member and as a reviewer for international conferences and journals.

Mrs. M. Halkidi, (MSc) PhD candidate
Maria Halkidi received a B.Sc. degree in Informatics in 1997. In 1999 she received an M.Sc. degree in Information Systems from Athens University of Economics and Business (AUEB). She is now a PhD student in the Dept. of Informatics (AUEB). Her research area is Quality and Uncertainty handling in Data Mining. She is also a member of the DB-NET research group in AUEB, participating in National and European-funded projects. Her research interests include Knowledge & Data Mining, Web Mining, Novel Data Management Systems (pattern-based systems, data management in a mobile environment), and Representation & Manipulation of uncertainty in database systems. She received an award from IKY (Greek Fellowships Foundation) for the academic year 1997-98. She has published nine papers in international conferences and two papers in journals. She is a student member of ACM and IEEE.