Features of knowledge discovery systems

Helena Ahonen, Wilhelm-Schickard-Institut, University of Tuebingen, Sand 13, D-72076 Tuebingen, Germany; Phone: +49-7071-2975481, fax: +49-7071-295958, Email: helena.ahonen@acm.org, http://www.cs.helsinki.fi/~hahonen

1. Knowledge discovery in databases and in document collections

Currently, two trends in information management can be clearly seen. First, the ease of capturing and storing digital information makes it attractive to maintain large data collections. Second, a constantly increasing proportion of these rapidly growing volumes of digital data consists of (unstructured or structured) textual data.

In many business areas the value of the collected data has been acknowledged. For instance, data may contain information about critical markets, competitors and customers. In manufacturing, data may reveal performance and optimization opportunities, as well as keys to improving processes and troubleshooting problems. Therefore, projects for the automatic discovery of this kind of knowledge from databases have been initiated. Similarly, document collections may contain knowledge that exceeds their original purpose, and hence possibilities for knowledge discovery should be considered in this field as well.

In general, knowledge discovery can be defined as the process of identifying interesting new patterns in data. These patterns can be, e.g., relations, events or trends, and they can reveal both regularities and exceptions. At the core of the process, data mining methods are used to extract and verify patterns, whereas several pre- and postprocessing steps form the other phases of the process.

When discovering knowledge from databases, we usually deal with objects that have a set of attributes. For instance, in market basket analysis, objects are cash receipts, which contain the products a customer has purchased.
Based on this knowledge, interesting customer groups can be found, e.g., people who always buy only vegetables and bread. Similarly, when we want to discover knowledge from documents, we have to decide what the target objects are and what their attributes are. Some SGML documents may resemble databases to the extent that this is obvious, but usually the hierarchical structure calls for careful consideration. Clearly, the structure adds value to knowledge discovery. However, no general methodologies for utilizing the structure exist: it always depends on the semantics of the structure and the intended knowledge discovery task.

Another document-specific aspect is the form of the content. Although the structure of an SGML document is well-defined, the contents of the elements may not be. In order to discover knowledge from freetext, several normalization and transformation steps, and probably also some kind of feature extraction phase, are necessary: in the end, the data mining algorithms can handle only very rigid formats.

2. What kinds of knowledge can be discovered?

In principle, two types of knowledge discovery tasks can be found: description and prediction. Through description, a system finds patterns in order to present them to users in an understandable form. Examples of descriptive methods include association rule discovery and clustering. A clustering tool takes a collection of objects, e.g., documents, and creates a grouping: objects that belong to the same cluster are somehow similar to each other, whereas they differ from the objects in the other clusters. An association rule discovery tool reveals co-occurrences, like A => B, confidence(0.7), support(12), which tells us that if A occurs, B also occurs with a probability of 0.7; additionally, A and B occur together 12 times. With the descriptive methods, the understandability of the pattern representations is crucial. The results may be visualized graphically.
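
As an illustration of how the two figures attached to an association rule can be obtained, the following Python sketch counts support and confidence for a single rule over a small set of invented cash receipts. The function name rule_stats, the receipts, and the choice of Python are assumptions made for this example only; they do not describe any particular tool. Support is taken here as an absolute count, as in the rule above.

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent => consequent.

    support    = number of transactions containing antecedent and consequent
    confidence = support / number of transactions containing the antecedent
    """
    antecedent = set(antecedent)
    both = antecedent | set(consequent)
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    n_both = sum(1 for t in transactions if both <= set(t))
    # Avoid division by zero when the antecedent never occurs.
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return n_both, confidence


# Hypothetical cash receipts: each receipt is the set of purchased products.
receipts = [
    {"bread", "vegetables"},
    {"bread", "vegetables", "milk"},
    {"bread", "milk"},
    {"vegetables"},
    {"bread", "vegetables"},
]

support, confidence = rule_stats(receipts, {"bread"}, {"vegetables"})
print(f"bread => vegetables, confidence({confidence:.2f}), support({support})")
# prints: bread => vegetables, confidence(0.75), support(3)
```

A real rule discovery tool would enumerate many candidate rules and keep only those above user-given support and confidence thresholds; the sketch evaluates one rule to make the two measures concrete.
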
A clustering tool may also characterize each cluster with some concept. Similarly, the number of discovered association rules should not be so overwhelming that the real pearls of information remain unnoticed.

Predictive systems find patterns to predict the future behavior of some objects. For instance, we could have a categorization tool that learns to file documents into specific predefined folders. Predictive systems need a training phase: a categorization tool gets a sample set of documents, each document labeled with its respective categories. By analyzing these examples, the tool learns the patterns needed to categorize new, uncategorized documents. The resulting patterns of a predictive system do not necessarily have to be understandable, as long as the prediction seems to work. Understandability may be desirable, though, so that one can trust the prediction. Usually, the results are evaluated using a test set, i.e., a new set of documents is given to the tool, this time without the categories. As the categories of these documents are known to the evaluators, it is easy to compare the original categories to the categories assigned by the tool.

As the concept of knowledge discovery is not very fixed, it is unnecessary to exclude some related tasks and methods. For instance, many tasks are verificative in nature: we already have a hypothesis and seek support for it in the data. Text retrieval, for example, can be seen this way. We state a query, and if we receive an answer, we know that documents fulfilling the conditions exist and we can access them. Furthermore, techniques known as information extraction aim to fill predefined templates by identifying and extracting knowledge from freetext.
For instance, a template can have slots for a seller, a buyer, a product, and a price; the extraction tool analyzes freetext, identifies fragments that describe selling actions, and extracts as much of the knowledge needed to fill the template slots as possible. The filled templates can then be stored in a database.

As can be seen from the above, knowledge discovery utilizes methods from several traditional fields, like statistics, database management, information retrieval, machine learning, and natural language processing. Knowledge discovery is a process that gives a framework for applying various methods, and an ideal knowledge discovery system controls the whole life span from defining the discovery task to utilizing the results. Systems that support multiple tasks for discovering knowledge from, e.g., SGML documents, and that also support the entire life span of the discovery process, may not exist yet. Single-task tools for extracting features (like technical terms and names) from text, as well as clustering and categorization tools, are available, however.

3. Knowledge discovery process

Knowledge discovery is an iterative and interactive process, usually with many decisions made by the user. The details of the process vary, but at least the following phases can be found in most cases.

1) Selecting the goals of the discovery process
2) Selecting of data
3) Data preprocessing
4) Applying the data mining methods
5) Interpretation and evaluation of the results
6) Utilizing the results

3.1 Selecting the goals of the discovery process

At least a rough idea of the goals of discovery is necessary to guide the other phases. The final goal may be to construct a specific tool to be integrated into some product. Or the goal may be an overall exploration of the document collection, directed by prior background knowledge of the collection.
The starting point can even be some surprising co-occurrences that have been accidentally detected but whose generality remains uncertain.

3.2 Selecting of data

As mentioned above, we have to select a set of data objects that are to be used in discovery. Text retrieval queries can collect data, even from multiple, heterogeneous sources. In particular, if the goal of the discovery is not clear, queries can be used to experiment with the data, and the first hypotheses of what kind of knowledge could be discovered can be formulated. Directing the focus to some specific features at a time usually produces more understandable results. Additionally, the data mining techniques, although able to handle large datasets, are rather time-consuming.

3.3 Data preprocessing

Data has to be transformed into the form required by the data mining methods. For instance, a rule discovery method may take as input one line for each data object, and each line has to contain a set of attribute values. The amount of necessary preprocessing depends on the method used and the quality of the data. Some single-task text mining tools may take natural language text and perform the preprocessing automatically. Data may have to be cleaned if there are obvious flaws. Erroneous items should be removed, or at least one should be aware of them. If data is gathered from multiple sources, it has to be normalized. Preprocessing is usually the most time-consuming phase in knowledge discovery, accounting for up to 75 % of the overall work.

3.4 Applying the data mining methods

For the actual extraction of patterns, several alternative methods are often available; for instance, categorization tools may be based on neural nets, decision trees, or rule discovery methods. Many general statistical methods are also used. Clearly, the data mining method used should be reliable. Just as statistics can lie, data mining techniques can also produce incorrect inferences if not properly applied.
For instance, with predictive methods the selection of training and test sets may change the results significantly.

3.5 Interpretation and evaluation of the results

Depending on the task and method, this phase can contain several postprocessing steps. The results may be visualized, e.g., with charts or graphs. A rule discovery tool may produce a large set of rules; thus, an interactive pruning tool that enables the user to focus on his or her personal interests should be offered. The results of predictive methods are evaluated with test sets as explained above. Often it is necessary to return to the previous steps: many methods have thresholds or other parameters that have to be adjusted. Shifting the focus of the data may also prove helpful.

3.6 Utilization of the results

Again, depending on the task, results can be utilized in various ways. A clustering of documents might be used for creating hypertext links, and a categorization for filing documents or forwarding emails to the people responsible for different topics. Knowledge revealed by association rules may affect strategic decisions, and so on.

4. Conclusion

The features that a knowledge discovery system should have depend heavily on the discovery task and also on the intended user. There are a lot of generic, single-task tools available. Such tools often support the data mining step only, and significant pre- and postprocessing is required. These tools are typically targeted at developers who integrate them with other modules as part of a complete application, and therefore simple interfaces and adaptability are prerequisites. If the system is to be used by end-users directly, it is important that all the tools in the various phases use concepts and vocabulary that are familiar to the user. As the process cannot be fully automated, the intuitiveness of all the tools is critical.
Indeed, the biggest challenge of knowledge discovery may be to combine the application area knowledge of the content specialists with the discovery expertise of knowledge discovery tools.