ECML/PKDD-2002 Tutorial

Inductive Databases and Constraint-based Mining

19 August 2002 14:30-18:00

Tutorial supported by the "cInQ" European Project
cInQ, Consortium on Discovering Knowledge with Inductive Queries

See also 1st International Workshop on Knowledge Discovery in Inductive Databases (KDID'02) in conjunction with ECML/PKDD-2002

Index

Motivation ·
Objectives ·
Target Audience ·
Outline ·
Presenters ·

Motivation

Ever since the start of the field of data mining, it has been realized that the data mining process should be supported by database technology. In recent years, this idea has been formalized in the concept of inductive databases introduced by Imielinski and Mannila in a seminal paper that appeared in CACM 1996 (Vol. 39, Issue 11, pages 58-64). Inductive databases are databases that, in addition to data, also contain generalizations, i.e., patterns, extracted from the data. Within the inductive database framework KDD is modelled as an interactive process in which users can query both data and patterns to gain insight about the data. To this aim a so-called inductive query language is used. Very often inductive queries impose constraints on the patterns of interest (e.g., w.r.t. frequency, syntax, generality, accuracy, significance). The constraint-based approach to data mining is also closely related to the issue of declarative bias in machine learning, e.g., to syntactic bias, which imposes certain (syntactic) constraints on the patterns learned.

Today, a number of specialized inductive query languages have already been proposed and implemented (e.g., MINE RULE by Meo et al., MSQL by Imielinski et al., DMQL by Han et al.). Most of these inductive query languages extend SQL. This is motivated by the industrial perspective of relational database mining, which has focused on efficient and portable implementations of SQL. In addition to specific proposals for inductive query languages, there has also been a large body of research work concerned with identifying key concepts to be used in inductive databases. On the one hand, this includes the many types of constraints that have been employed, and on the other hand, the many efficient algorithms and representations for solving constraint-based queries that have been developed in both machine learning and data mining. This includes, e.g., the level-wise algorithm, Apriori, the version space algorithm, boundary set representations, condensed representations, etc. Despite these many contributions, we are still far away from a deep understanding of the issues in inductive databases. In this respect, one can see an analogy with the state-of-the-art in databases in the early seventies. Today, there exist many different approaches and implementations of data mining systems but our theoretical understanding is still imperfect. In a sense, we are looking for the equivalent of Codd's relational database theory for data mining.

Some Good Reasons for Attending

The presenters consider that ECML and PKDD 2002 is the right forum and timing for presenting a tutorial on inductive databases because of a variety of reasons. First, there is quite much activities on Inductive Databases in Europe, e.g., the IST cInQ project and also through the work of other groups, e.g., in Warsaw, in Pisa, in Marseille. Second, there is an overlap and a link to various machine learning techniques and problems (such as version spaces, borders and declarative bias) that should be of interest to both the KDD and symbolic machine learning communities. In addition, there are strong links to databases. Third, given the increasing attention to this topic in Europe and world wide, it is the right time to inform fellow researchers of the state-of-the-art with the hope of getting them interested in this very basic and central research problem. Last but not the least, there is the related international workshop on Inductive Databases (KDID'02) that gives an opportunity to learn about the latest developments in the field. Indeed, this full-day KDID'02 workshop will consist of invited talks by experts of the field, paper sessions with technical and position papers, and a panel discussion. In addition, short 5-minute presentations of new ideas and visions will be included in the program.

Objectives

The goals of the tutorial are:

To introduce the concept and architecture of inductive databases. This includes the introduction of the key concepts such as inductive queries, query languages, inductive solvers, etc.

To give an overview of the state-of-the-art in inductive databases by reviewing some proposals for inductive query languages, such as Mine Rule, MSQL and DMQL.

To introduce constraint-based mining as a key mechanism for inductive query evaluation and thus knowledge discovery, to introduce the constraints used so far and to investigate their properties, such as monotonicity and anti-monotonicity.

To discuss the key algorithms that have been used to solve constraint-based queries, i.e. the levelwise algorithm, concepts of borders, the use of version spaces, condensed representations based on closures, etc.

To discuss future research and application issues in inductive databases.

Target Audience

This tutorial is meant for researchers and practitioners interested in these key questions about inductive databases and constraint-based mining, it hopes to attract interest from a wide range of possible fields, including: data mining, machine learning, databases, constraint programming, etc.

A solid background in machine learning and/or data mining is useful but no prior knowledge on inductive databases or constraint programming is needed.

Outline

The length of the tutorial is 3 hours.

1. Introduction to inductive databases (45')

What are inductive databases?
A querying approach to KDD processes
Constraint-based mining and inductive querying
Examples of applications (association rule mining, dependency discovery, frequent datalog queries, molecular fragment mining, etc.)

2. A survey of available data mining query languages and their expressivity (45')

MINE RULE (Meo, Psaila and Ceri, DMKD'98)
MSQL (Imielinski and Virmani, DMKD'99)
DMQL (Han et al. KDD'96)
MolFea - Molecular Feature Miner (Kramer, De Raedt and Helma, SIGKDD'01), a domain specific inductive database query language

3. Underlying principles of inductive querying: constraint-based mining (70')

3.1. An analogy with constraint programming

Pattern domains (e.g., item sets, episodes, datalog queries)
Solvers and algorithms
Various types of constraints available (frequency, generality, freeness)
Properties of constraints (monotonicity, anti-monotonicity, succinctness, etc.)
Borders of solution spaces (as in version spaces and the levelwise algorithm)
Relation to declarative bias in machine learning

3.2. Query evaluation for constraint based queries

Principles
The levelwise algorithm by Mannila and Toivonen
Searching for borders only (e.g., MaxMiner, version spaces, etc.)
Advanced algorithms for solving specific combinations of primitive constraints (e.g., Close)
Scaling issues

3.3. What is still missing?

Predictive learning
Accuracy
Optimisation primitives

4. Conclusions (20')

Perspectives: Applications and theory
Related work
Further reading

Presenters

Jean-François Boulicaut is currently associate professor at INSA Lyon. He got his Ph.D. and his Habilitation de Recherche from INSA Lyon in 1992 and 2001. His main research interests are: logic programming, (deductive) databases and knowledge discovery from databases. He has been invited researcher at the Department of Computer Science in the University of Helsinki from November 97 until August 98. He created the Data Mining group at INSA in October 98. He has served as PC member for the annual French-speaking conference on Logic Programming and Constraint Programming (JFPLC) many times. Concerning database mining and machine learning, he has served in the PC of PKDD'00, ECML-PKDD'01, ILP'01, BDA'01, ECML-PKDD'02, DTDM'02, FCAKDD'02 and serves as the ECAI 2002 tutorial chair. He is coordinator of the EU IST project cInQ consortium on discovering knowledge with Inductive queries.

Luc De Raedt is a full professor at the Albert-Ludwigs-University of Freiburg since 1999. Before that he was employed at the Katholieke Universiteit Leuven and the Fund for Scientific Research Flanders, where he was a post-doc and a part-time associate professor. He was local organiser, and program co-chairman of ECML and PKDD 2001, which was organised in Freiburg and which constituted the first co-location of a machine learning and a data mining conference. He was also co-chairman of the programme committee of the 7th European Conference on Machine Learning (Sicily, 1994), and of the IJCAI'97 Workshop on "Frontiers of Inductive Logic Programming", and chairman of the 5th International Inductive Logic Programming Workshop (Leuven, 1995). He is or was a member of the editorial boards of Machine Learning (1997-2000), Journal of AI Research (1997-2000), Journal of Machine Learning Research, New Generation Computing, AI Communications, Intelligent Data Analysis and Informatica. He has delivered tutorials at important conferences such as IJCAI'97, ICML'97, ICML'99, PKDD'99, LOPSTR'96, GP'98, ILPS'93, ACAI'99, etc., and invited talks at PKDD'00, ILP'98, MSL'00, AISC'98, MSL'96, etc. He was a coordinator of the ESPRIT ILP I and II projects, the recent EU IST April project, and was or is a principal organiser of the EU projects Aladin and cInQ. He has been working mainly in the field of inductive logic programming though he has recently started working on inductive databases. The domain specific inductive database MolFea, developed at Freiburg, received the runner up best application award at KDD 2001.

This tutorial is supported by the European Union project cInQ IST-2000-26469 (May 2001 - May 2004), funded by the Future and Emerging Technologies arm of the IST Programme. The partners of cInQ (consortium on discovering knowledge with Inductive Queries) are Nokia Research Center (Helsinki, Finland), Politecnico di Milano (Italy), University of Torino (Italy), University of Freiburg (Germany) and INSA Lyon (France).