Features of knowledge discovery systems

Helena Ahonen, Wilhelm-Schickard-Institut, University of Tuebingen,
Sand 13, D-72076 Tuebingen, Germany;
Phone: +49-7071-2975481, fax: +49-7071-295958,
Email: helena.ahonen@acm.org, http://www.cs.helsinki.fi/~hahonen


1. Knowledge discovery in databases and in document collections

Currently, two trends in information management can be clearly seen:
first, the ease of capturing and storing digital information makes it
attractive to maintain large data collections.  Second, a constantly
increasing proportion of these rapidly growing volumes of digital data
consists of (unstructured or structured) textual data.  Within many
business areas the value of the collected data has been
acknowledged. For instance, data may contain information about
critical markets, competitors, and customers. In manufacturing, data
may reveal performance and optimization opportunities, as well as
keys to improving processes and troubleshooting problems. Therefore,
projects for the automatic discovery of this kind of knowledge from
databases have been initiated. Similarly, document collections may
contain knowledge that exceeds their original purpose, and hence, also
in this field, possibilities for knowledge discovery should be
considered.

In general, knowledge discovery can be defined as the process of
identifying interesting new patterns in data.  These patterns can be,
e.g., relations, events or trends, and they can reveal both
regularities and exceptions. At the core of the process, data mining
methods are used to extract and verify patterns, whereas several pre-
and postprocessing steps form the other phases of the process.

When discovering knowledge from databases, we usually deal with
objects that have a set of attributes. For instance, in market basket
analysis, objects are cash receipts, which contain the products a
customer has purchased. Based on this knowledge, interesting customer
groups can be found, e.g., people who always buy only vegetables and
bread. Similarly, when we want to discover knowledge from documents, we
have to decide what the target objects are and what their attributes
are. Some SGML documents may resemble databases to the extent that
this is obvious, but usually the hierarchical structure calls for
careful consideration. Clearly, the structure adds value to knowledge
discovery. However, no general methodology for utilizing the
structure exists: it always depends on the semantics of the structure
and the intended knowledge discovery task.
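
The choice of target objects and attributes can be illustrated with a
minimal Python sketch. All data below (receipt ids, product names, and
the layout of the structured document) is invented for illustration;
the helper names are hypothetical and do not refer to any existing
tool.

```python
# Hypothetical data: each receipt is an object, its products are attributes.
receipts = {
    "r1": {"vegetables", "bread"},
    "r2": {"vegetables", "bread", "milk"},
    "r3": {"bread", "vegetables"},
    "r4": {"beer", "chips"},
}

def receipts_with_exactly(receipts, products):
    """Return the ids of receipts whose product set equals `products`."""
    target = set(products)
    return sorted(rid for rid, items in receipts.items() if items == target)

# A hierarchical document (assumed layout) admits several object choices.
document = {
    "title": "Annual report",
    "sections": [
        {"heading": "Markets", "terms": {"competitor", "customer"}},
        {"heading": "Production", "terms": {"process", "customer"}},
    ],
}

def document_as_object(doc):
    """Choice 1: the whole document is one object with all its terms."""
    terms = set()
    for section in doc["sections"]:
        terms |= section["terms"]
    return {doc["title"]: terms}

def sections_as_objects(doc):
    """Choice 2: every section is an object of its own."""
    return {s["heading"]: set(s["terms"]) for s in doc["sections"]}

only_veg_and_bread = receipts_with_exactly(receipts, {"vegetables", "bread"})
print(only_veg_and_bread)                      # ['r1', 'r3']
print(sorted(sections_as_objects(document)))   # ['Markets', 'Production']
```

Which of the two document views is appropriate depends, as stated
above, on the semantics of the structure and on the discovery task.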

Another document-specific aspect is the form of the content.  Although
the structure of an SGML document is well-defined, the contents of the
elements may not be. In order to discover knowledge from free text,
several normalization and transformation steps, and probably also some
kind of feature extraction phase, are necessary: in the end, data
mining algorithms can handle only very rigid formats.
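
A rough sketch of such a normalization and feature extraction step is
given below. The stop word list is a tiny illustrative assumption, and
treating every remaining lowercase word as a feature is only one of
many possible choices.

```python
import re

# Tiny illustrative stop word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}

def normalize(text):
    """Lowercase the text, keep alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def extract_features(text):
    """Represent a free-text element as a rigid set of word features."""
    return set(normalize(text))

features = extract_features("The Buyer paid the price of the product.")
print(sorted(features))  # ['buyer', 'paid', 'price', 'product']
```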


2. What kinds of knowledge can be discovered?

In principle, two types of knowledge discovery tasks can be found:
description and prediction.  In description, a system finds
patterns and presents them to users in an
understandable form.  Examples of descriptive methods include
association rule discovery and clustering.  A clustering tool takes a
collection of objects, e.g., documents, and creates a grouping: objects
that belong to the same cluster are somehow similar to each other,
whereas they differ from the objects in the other clusters.  An
association rule discovery tool reveals co-occurrences, like

A => B,  confidence(0.7), support(12),

which tells us that if A occurs, B also occurs with a probability of
0.7; additionally, A and B occur together 12 times in the data.
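
The two measures can be computed as in the following sketch, which
assumes that every data object is a set of items and that support is
an absolute count, as in the rule above. The transactions are invented
and smaller than in the example, so the support is 7 rather than 12.

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (confidence, support) of the rule antecedent => consequent.

    support    = number of transactions containing both sides
    confidence = support / number of transactions containing the antecedent
    """
    a, b = set(antecedent), set(consequent)
    n_a = sum(1 for t in transactions if a <= t)
    n_ab = sum(1 for t in transactions if (a | b) <= t)
    confidence = n_ab / n_a if n_a else 0.0
    return confidence, n_ab

# Invented data: A occurs 10 times, 7 of them together with B.
transactions = [{"A", "B"}] * 7 + [{"A"}] * 3 + [{"B"}] * 2

confidence, support = rule_stats(transactions, {"A"}, {"B"})
print(confidence, support)  # 0.7 7
```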

With descriptive methods, the understandability of the pattern
representations is crucial. The results may be visualized graphically.
Moreover, a clustering tool may characterize each cluster with some
concept.  Similarly, the number of discovered association rules should
not be so overwhelming that the real pearls of information remain
unnoticed.
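
As an illustration of clustering and of characterizing clusters, the
sketch below uses a simple single-pass ("leader") scheme with Jaccard
similarity. This technique is chosen here only for brevity and is an
assumption, as are the documents, the threshold, and the idea of
describing a cluster by the terms all its members share.

```python
def jaccard(a, b):
    """Similarity of two term sets: size of intersection over union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def leader_clustering(docs, threshold):
    """Put each document into the first cluster whose leader (its first
    member) is similar enough; otherwise start a new cluster."""
    clusters = []  # list of (leader_terms, member_ids)
    for doc_id, terms in docs.items():
        for leader, members in clusters:
            if jaccard(leader, terms) >= threshold:
                members.append(doc_id)
                break
        else:
            clusters.append((terms, [doc_id]))
    return [members for _, members in clusters]

def describe(docs, members):
    """Characterize a cluster by the terms shared by all its members."""
    return set.intersection(*(docs[m] for m in members))

# Invented documents, already reduced to term sets.
docs = {
    "d1": {"sgml", "markup", "document"},
    "d2": {"sgml", "document", "structure"},
    "d3": {"market", "customer", "price"},
    "d4": {"customer", "price", "product"},
}

clusters = leader_clustering(docs, threshold=0.4)
print(clusters)  # [['d1', 'd2'], ['d3', 'd4']]
print([sorted(describe(docs, c)) for c in clusters])
# [['document', 'sgml'], ['customer', 'price']]
```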

Predictive systems find patterns to predict the future behavior of
some objects.  For instance, we could have a categorization tool that
learns to file documents into specific predefined folders.
Predictive systems need a training phase: a categorization tool is
given a sample set of documents, each document labeled with the
respective categories.  Analyzing these examples, the tool learns the
necessary patterns to be used with new uncategorized documents.  The
resulting patterns of a predictive system need not be understandable
if the prediction seems to work. Understandability may be desirable,
though, so that one can trust the prediction. Usually, the results are
evaluated using a test set, i.e., a new set of documents is given to
the tool, this time without the categories. As the categories of these
documents are known to the evaluators, it is easy to compare the
original categories with the categories assigned by the tool.
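
The training and evaluation cycle can be sketched as follows. The
categorizer is a deliberately naive word-count profile per category,
chosen only to keep the example short; it is an assumption and not a
method proposed here. The documents and category names are invented.

```python
from collections import Counter, defaultdict

def train(labeled_docs):
    """Build a word-frequency profile for every category."""
    profiles = defaultdict(Counter)
    for text, category in labeled_docs:
        profiles[category].update(text.lower().split())
    return profiles

def categorize(profiles, text):
    """Pick the category whose profile matches the words best."""
    words = text.lower().split()
    scores = {cat: sum(counts[w] for w in words)
              for cat, counts in profiles.items()}
    return max(scores, key=scores.get)

def accuracy(profiles, test_docs):
    """Share of test documents whose known category is reproduced."""
    correct = sum(1 for text, cat in test_docs
                  if categorize(profiles, text) == cat)
    return correct / len(test_docs)

training_set = [
    ("invoice payment due", "finance"),
    ("payment received invoice", "finance"),
    ("meeting agenda room", "admin"),
    ("room booking meeting", "admin"),
]
# The evaluators know these categories; the tool does not see them.
test_set = [
    ("invoice overdue payment", "finance"),
    ("agenda for meeting", "admin"),
    ("room payment invoice", "admin"),  # misfiled by the naive tool
]

profiles = train(training_set)
result = accuracy(profiles, test_set)
print(round(result, 2))  # 0.67
```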

As the concept of knowledge discovery is not rigidly defined, there is
no need to exclude some related tasks and methods.  For instance,
many tasks are verificative in nature: we already have a hypothesis
and seek support for it in the data. Text retrieval can be seen this
way: we state a query, and if we receive some answer, we know that
documents fulfilling the conditions exist and we can access them.
Furthermore, techniques known as information
extraction aim to fill predefined templates by identifying and
extracting knowledge from free text. For instance, a template can have
slots for a seller, a buyer, a product, and a price, and the
extraction tool analyzes free text, identifies fragments that describe
selling actions, and extracts as much of the knowledge needed to fill
the template slots as possible. The filled templates can then be
stored in a database.
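
A toy version of such template filling is sketched below. It assumes
that selling actions are phrased as "X sold Y to Z", optionally
followed by "for N dollars"; real extraction tools rely on far richer
linguistic analysis. The pattern, the sentences, and the company names
are invented.

```python
import re

# Hypothetical pattern for one phrasing of a selling action.
SALE_PATTERN = re.compile(
    r"\b(?P<seller>[A-Z]\w+) sold (?P<product>.+?) to (?P<buyer>[A-Z]\w+)"
    r"(?: for (?P<price>\d+) dollars)?"
)

def fill_templates(text):
    """Return one template (a dict of slots) per selling action found.
    Slots that cannot be filled are left as None."""
    return [match.groupdict() for match in SALE_PATTERN.finditer(text)]

text = ("Acme sold two trucks to Bolt for 90000 dollars. "
        "The weather was fine. Corex sold a patent to Dyna.")

templates = fill_templates(text)
for template in templates:
    print(template)
# {'seller': 'Acme', 'product': 'two trucks', 'buyer': 'Bolt', 'price': '90000'}
# {'seller': 'Corex', 'product': 'a patent', 'buyer': 'Dyna', 'price': None}
```

The second template shows a partially filled result: the price slot
stays empty because the text does not state it.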

As can be seen from the above, knowledge discovery utilizes methods
from several traditional fields, like statistics, database management,
information retrieval, machine learning, and natural language
processing. Knowledge discovery is a process that gives a framework
for applying various methods, and an ideal knowledge discovery system
controls the whole life span from defining the discovery task to
utilizing the results.  Systems that support multiple tasks for
discovering knowledge from, e.g., SGML documents and that also
support the entire life span of the discovery process may not exist
yet. By contrast, single-task tools for extracting features (like
technical terms and names) from text, as well as clustering and
categorization tools, are available.


3. Knowledge discovery process

Knowledge discovery is an iterative and interactive process, usually
with many decisions made by the user.  The details of the process
vary, but at least the following phases can be found in most cases.

1) Selecting the goals of the discovery process
2) Selecting the data
3) Data preprocessing
4) Applying the data mining methods
5) Interpretation and evaluation of the results
6) Utilizing the results

3.1 Selecting the goals of the discovery process

At least a rough idea of the goals of discovery is necessary
to guide the other phases. The final goal may be to construct a
specific tool to be integrated into some product.  Or the goal may
be an overall exploration of the document collection, directed by
prior background knowledge of the collection.  The starting point can
even be some surprising co-occurrences that have been accidentally
detected but whose generality remains uncertain.


3.2 Selecting the data

As mentioned above, we have to select a set of data objects that are
to be used in discovery. Text retrieval queries can collect data, even
from multiple, heterogeneous sources. In particular, if the goal of
the discovery is not clear, queries can be used to experiment with the
data, so that the first hypotheses about what kind of knowledge could
be discovered can be formulated.  Directing the focus to a few
specific features at a time usually produces more understandable
results. Additionally, data mining techniques, although able to handle
large datasets, are rather time-consuming, which is a further reason
to restrict the selection.


3.3 Data preprocessing 

Data has to be transformed into the form required by the data
mining methods.  For instance, a rule discovery method may take as
input a line for each data object, and each line has to contain a set
of attribute values.  The amount of necessary preprocessing depends on
the method used and the quality of the data.  Some single-task text
mining tools may take natural language text and perform the
preprocessing automatically. Data may have to be cleaned if there are
some obvious flaws.  Erroneous items should be removed, or at least
one should be aware of them. If data is gathered from multiple
sources, it has to be normalized.  Preprocessing is usually the most
time-consuming phase in knowledge discovery, accounting for up to 75%
of the overall work.
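
The cleaning, normalization, and transformation steps can be sketched
as follows. The two sources, their field names, and the
"attribute=value" line format are assumptions made for this example
only.

```python
def normalize_record(record, field_map):
    """Rename source-specific fields to common names and normalize
    string values (trim whitespace, lowercase)."""
    out = {}
    for source_name, common_name in field_map.items():
        value = record.get(source_name)
        if isinstance(value, str):
            value = value.strip().lower()
        out[common_name] = value
    return out

def is_clean(record):
    """Treat records with a missing or empty value as erroneous."""
    return all(value not in (None, "") for value in record.values())

def to_line(record, fields):
    """Produce one line per data object, as a rule discovery tool
    might expect (hypothetical format)."""
    return " ".join(f"{field}={record[field]}" for field in fields)

# Two invented sources that name and format the same attributes differently.
source_a = [{"Author": " Smith ", "Lang": "EN"}, {"Author": "", "Lang": "FI"}]
source_b = [{"writer": "Jones", "language": "fi"}]
map_a = {"Author": "author", "Lang": "language"}
map_b = {"writer": "author", "language": "language"}

records = ([normalize_record(r, map_a) for r in source_a]
           + [normalize_record(r, map_b) for r in source_b])
lines = [to_line(r, ["author", "language"]) for r in records if is_clean(r)]
print(lines)  # ['author=smith language=en', 'author=jones language=fi']
```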


3.4 Applying the data mining methods

For the actual extraction of patterns, several alternative methods are
often available. For instance, categorization tools may be based on
neural nets, decision trees, or rule discovery methods.  Many
general statistical methods are also used.  Clearly, the chosen data
mining method should be reliable.  Just as statistics can lie,
data mining techniques can produce incorrect inferences if not
properly applied.  For instance, with predictive methods the selection
of training and test sets may change the results significantly.
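
The effect of the split can be demonstrated with a small sketch. To
keep it short, the "tool" is a trivial stand-in that always predicts
the most frequent category of its training set; this stand-in, the
labels, and the two splits are assumptions for illustration.

```python
from collections import Counter

def majority_label(labels):
    """Stand-in predictor: the most frequent training label."""
    return Counter(labels).most_common(1)[0][0]

def accuracy_for_split(labels, test_indices):
    """Train on everything outside `test_indices`, test on the rest."""
    test_indices = set(test_indices)
    train = [lab for i, lab in enumerate(labels) if i not in test_indices]
    test = [lab for i, lab in enumerate(labels) if i in test_indices]
    predicted = majority_label(train)
    return sum(1 for lab in test if lab == predicted) / len(test)

# Invented collection that happens to be stored sorted by category.
labels = ["report"] * 5 + ["memo"] * 3

tail_split = accuracy_for_split(labels, [5, 6, 7])   # last three as test set
mixed_split = accuracy_for_split(labels, [0, 3, 6])  # test set drawn across
print(tail_split, round(mixed_split, 2))  # 0.0 0.67
```

The same data and the same predictor give an accuracy of 0.0 or 0.67,
depending only on which documents end up in the test set.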


3.5 Interpretation and evaluation of the results

Depending on the task and method, this phase can contain several
postprocessing steps. The results may be visualized, e.g., with charts
or graphs. A rule discovery tool may produce a large set of rules,
so an interactive pruning tool that enables users to focus on their
personal interests should be offered.  The results of
predictive methods are evaluated with test sets as explained above.
Returning to the previous steps is often necessary: many methods have
thresholds or other parameters that have to be adjusted. Shifting
the focus of the data may also prove helpful.
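
The core of such a pruning step might look like the sketch below,
which filters rules by adjustable thresholds and by an optional set of
items the user is interested in. The rule representation, the
parameter names, and the rules themselves are hypothetical.

```python
def prune(rules, min_confidence=0.0, min_support=0, focus=None):
    """Keep rules that reach both thresholds and, if `focus` is given,
    mention at least one of the focus items. Strongest rules first."""
    kept = []
    for rule in rules:
        if rule["confidence"] < min_confidence:
            continue
        if rule["support"] < min_support:
            continue
        if focus and not (focus & (rule["lhs"] | rule["rhs"])):
            continue
        kept.append(rule)
    return sorted(kept, key=lambda r: (-r["confidence"], -r["support"]))

# Invented rules of the form lhs => rhs.
rules = [
    {"lhs": {"sgml"}, "rhs": {"markup"}, "confidence": 0.9, "support": 40},
    {"lhs": {"price"}, "rhs": {"buyer"}, "confidence": 0.7, "support": 12},
    {"lhs": {"sgml"}, "rhs": {"price"}, "confidence": 0.3, "support": 2},
    {"lhs": {"buyer"}, "rhs": {"seller"}, "confidence": 0.8, "support": 25},
]

strong = prune(rules, min_confidence=0.6, min_support=10)
about_buyers = prune(rules, min_confidence=0.6, min_support=10,
                     focus={"buyer"})
print([r["confidence"] for r in strong])        # [0.9, 0.8, 0.7]
print([r["confidence"] for r in about_buyers])  # [0.8, 0.7]
```

Adjusting the thresholds and rerunning the filter corresponds to the
return to previous steps described above.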

3.6 Utilizing the results
 
Again, depending on the task, results can be utilized in various
ways. A clustering of documents might be used for creating hypertext
links, and a categorization for filing documents or forwarding emails
to the people responsible for different topics. Knowledge revealed by
association rules may affect strategic decisions, and so on.
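
Two of these uses can be sketched briefly: deriving link candidates
from a clustering and routing a categorized email to a responsible
person. The document ids, categories, and addresses are invented, and
both functions are hypothetical helpers.

```python
from itertools import combinations

def links_from_clusters(clusters):
    """Propose a hypertext link between every pair of documents
    that ended up in the same cluster."""
    links = []
    for members in clusters:
        links.extend(combinations(sorted(members), 2))
    return links

def route(category, recipients, default):
    """Return the address responsible for a category, or a fallback."""
    return recipients.get(category, default)

clusters = [["d2", "d1", "d3"], ["d4"]]
links = links_from_clusters(clusters)
print(links)  # [('d1', 'd2'), ('d1', 'd3'), ('d2', 'd3')]

recipients = {"finance": "finance@example.org", "admin": "office@example.org"}
print(route("finance", recipients, "info@example.org"))  # finance@example.org
print(route("legal", recipients, "info@example.org"))    # info@example.org
```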


4. Conclusion


The features that a knowledge discovery system should have depend
heavily on the discovery task and also on the intended user.  Many
generic, single-task tools are available.  Such tools often
support the data mining step only, and significant pre- and
postprocessing is required. These tools are typically targeted at
developers who integrate them with other modules as part of a
complete application; therefore, simple interfaces and
adaptability are prerequisites.  If the system is to
be used by end-users directly, it is important that all the tools in
the various phases use concepts and vocabulary that are familiar to
the user. As the process cannot be fully automated, the
intuitiveness of all the tools is critical. Indeed, the biggest
challenge of knowledge discovery may be to combine the application area
knowledge of the content specialists with the discovery expertise of
knowledge discovery tools.