ECML/PKDD-2002 Tutorial

Text Mining and Internet Content Filtering

19 August 2002 9:00-13:00

Index

Motivation · Objectives · Target Audience · Outline · Presenter · References

Motivation

In recent years, we have witnessed an impressive growth in the availability of information in electronic format, mostly in the form of text, due to the Internet and the increasing number and size of digital and corporate libraries. This overwhelming amount of text is hard for an average human being to consume, which leads to an information overload problem. Just as traditional Data Mining (or more properly, Knowledge Discovery in Databases, KDD) is about finding patterns in data, Text Data Mining (Text Mining, or TM for short) is about uncovering patterns in data when the data is text. In other words, the goal of TM is to turn the information buried in text into valuable knowledge that alleviates information overload.

TM is an emerging research and development field that addresses the information overload problem by borrowing techniques from data mining, machine learning, information retrieval, natural-language understanding, case-based reasoning, statistics, and knowledge management to help people gain rapid insight into large quantities of semi-structured or unstructured text. TM includes several text processing and classification techniques, such as text categorization, clustering and retrieval, and information extraction, but it also involves the development of new methods for information analysis, digestion and presentation.

A prototypical application of TM techniques is Internet information filtering. The ease of Internet-based information publishing and communication makes it prone to misuse. For instance, websites devoted to pornography, racism, terrorism, etc. are accessed daily by easily influenced minors. Also, Internet email users have to bear intrusive unsolicited bulk email, which makes email less valuable and more expensive as a means of communication. Internet filtering through TM techniques is a promising field of work that will provide the Internet community with more accurate and cheaper systems for limiting young people's access to illegal and offensive Internet content, and for alleviating the unsolicited bulk email problem.

Objectives

The goal of this tutorial is to familiarize the audience with the emerging area of Text Mining in a practical way. This goal will be achieved by illustrating the concepts of the field through two Text Categorization [39, 40, 33] applications focused on Internet information filtering: the detection of offensive websites [6], and the detection of unsolicited bulk email (see e.g. [2, 11, 21, 29, 30]). Being relatively simple, these applications will allow the audience to understand the main topics in Text Mining.

Target Audience

The tutorial is of interest to both researchers and practitioners of KDD and machine learning (and thus to those attending ECML or PKDD). Researchers will get a practical overview of the TM field from the point of view of an applied, interactive KDD process. Practitioners will get a better understanding of the specific problems of KDD when the data is text, and of their relation to the recurrent problems in KDD.

A basic knowledge of machine learning and KDD is recommended. Familiarity with the Java programming language is helpful.

Outline

The tutorial is divided into two main parts. The first part is an overview of TM topics, focusing on the specific problems of TM in relation to KDD. The concepts will be covered in a classification-task-oriented fashion, reviewing a number of supervised and unsupervised learning tasks. The second part will illustrate the concepts in TM through a detailed analysis of the two previously mentioned Internet filtering tasks. In particular, for the detection of offensive websites, an operational system will be quickly produced by reusing a number of open-source tools, including the Muffin proxy system and the Waikato Environment for Knowledge Analysis (WEKA) learning library.

In particular, the tutorial will cover the following topics:

1. TM: what is it and what is it not? This section will cover introductory topics (see e.g. [17, 20, 37]), state the main problems specific to TM (in relation to KDD), and include a review of currently prominent Text Mining applications.

2. Learning from text when we know what to learn: document categorization (e.g. [39, 40, 33]) and filtering (e.g. [3, 4, 16, 28]); topic detection and tracking (e.g. [1, 36]); term identification, extraction and categorization, including text representation models [32, 14], Part-Of-Speech Tagging [7] and Word Sense Disambiguation [8, 25, 24, 18]; information extraction (e.g. [5]).

3. Learning from text when we do not know what to learn: document clustering [14, 27] and term clustering (including Latent Semantic Indexing [10, 12], automatic thesaurus construction [31, 32], etc.); discovering relations among documents and terms, and key phrase extraction [38, 15]; document summarization [26, 19].

4. Tools for TM: review of available commercial and research tools; the Waikato Environment for Knowledge Analysis.

5. Application to the detection of offensive websites: motivation, web page analysis and processing, learning useful regularities among offensive web pages, evaluating detection systems, and an operational solution based on open-source software.

6. Application to the detection of unsolicited bulk email: motivation, email message analysis and processing, learning useful regularities among unsolicited email messages, and evaluating detection systems.

7. Challenges in TM: exploratory text analysis with the aid of visualization tools for finding relations among facts.

Presenter

José María Gómez Hidalgo is a lecturer and researcher at the Computer Science School of the Universidad Europea CEES in Madrid, Spain. He has been conducting research in the area of Natural Language Engineering for around eight years, during which he has taken part in several R&D projects, most of them involving text content analysis, user profiling, information filtering and related topics. In 2002/03 he will be leading a team at the Universidad Europea CEES in a European Commission funded R&D project focused on the development of an offensive web content filtering tool, called POESIA. He has published a number of research reports and articles related to the topics covered in the tutorial (including [21, 34, 13, 23, 22, 9, 35, 26]).

José María has been a lecturer for seven years at the Computer Science Schools of the Universidad Complutense de Madrid, Colegio Universitario Domingo de Soto, and Universidad Europea CEES. He has also given several courses at the request of companies. In the present term, he is teaching a Natural Language Processing course, among others, at the Universidad Europea CEES.

References

This list of references is provided as a sample of the material covered in the tutorial.

[1] J. Allan, J.G. Carbonell, G. Doddington, J. Yamron, and Y. Yang. Topic detection and tracking pilot study final report. In Proceedings of the Broadcast News Transcription and Understanding Workshop (Sponsored by DARPA), 1998.

[2] I. Androutsopoulos, G. Paliouras, V. Karkaletsis, G. Sakkis, C.D. Spyropoulos, and P. Stamatopoulos. Learning to filter spam e-mail: A comparison of a naive Bayesian and a memory-based approach. In H. Zaragoza, P. Gallinari, and M. Rajman, editors, Proceedings of the Workshop on Machine Learning and Textual Information Access, 4th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2000), pages 1-13, Lyon, France, 2000.

[3] N.J. Belkin and W.B. Croft. Information filtering and information retrieval: Two sides of the same coin? Communications of the ACM, 35(12):29-38, 1992.

[4] Eric Bloedorn, Inderjeet Mani, and T. Richard MacMillan. Machine learning of user profiles: Representational issues. In AAAI/IAAI Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 433-438, 1996.

[5] C. Cardie. Empirical methods in information extraction. AI Magazine, 18(4):65-80, 1997.

[6] Konstantinos V. Chandrinos, Ion Androutsopoulos, Georgios Paliouras, and Constantine D. Spyropoulos. Automatic Web rating: Filtering obscene content on the Web. In Jose L. Borbinha and Thomas Baker, editors, Proceedings of ECDL00, 4th European Conference on Research and Advanced Technology for Digital Libraries, pages 403-406, Lisbon, PT, 2000. Springer Verlag, Heidelberg, DE. Published in the "Lecture Notes in Computer Science" series, number 1923.

[7] K. Church. A stochastic parts program and noun phrase parser for unrestricted text. In Proceedings of Second Conference on Applied Natural Language Processing (ANLP'88), 1988.

[8] Scott Cotton, Phil Edmonds, Adam Kilgarriff, and Martha Palmer. SENSEVAL-2: Second International Workshop on Evaluating Word Sense Disambiguation Systems. Association for Computational Linguistics, 2001.

[9] M. de Buenaga, J.M. Gómez, and B. Díaz. Using WordNet to complement training information in text categorization. In N. Nicolov and R. Mitkov, editors, Recent Advances in Natural Language Processing II: Selected Papers from RANLP'97, volume 189 of Current Issues in Linguistic Theory (CILT), pages 353-364. John Benjamins, 2000.

[10] Scott C. Deerwester, Susan T. Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman. Indexing by latent semantic analysis. JASIS, 41(6):391-407, 1990.

[11] Harris Drucker, Donghui Wu, and Vladimir Vapnik. Support vector machines for spam categorization. IEEE Transactions on Neural Networks, 10(5):1048-1054, 1999.

[12] Susan T. Dumais. Latent semantic indexing (LSI): TREC-3 report. In Proceedings of TREC, 1994.

[13] A. Díaz Esteban, M. Maña López, J.M. Gómez Hidalgo, and P. Gervás. Using linear classifiers in the integration of user modeling and text content analysis in the personalization of a Web-based Spanish news service. In Proceedings of the Workshop on Machine Learning, Information Retrieval and User Modeling, 8th International Conference on User Modeling, 2001.

[14] W. Frakes and R. Baeza-Yates. Information Retrieval: Data Structures and Algorithms. Prentice-Hall, Englewood Cliffs, N.J., 1992.

[15] E. Frank, G.W. Paynter, I.H. Witten, C. Gutwin, and C.G. Nevill-Manning. Domain-specific keyphrase extraction. In Proc. Sixteenth International Joint Conference on Artificial Intelligence, 1999.

[16] Nathaniel Good, J. Ben Schafer, Joseph A. Konstan, Al Borchers, Badrul M. Sarwar, Jonathan L. Herlocker, and John Riedl. Combining collaborative filtering with personal agents for better recommendations. In AAAI/IAAI Proceedings of the Sixteenth National Conference on Artificial Intelligence, pages 439-446, 1999.

[17] Marko Grobelnik, Dunja Mladenic, and Natasa Milic-Frayling. Text mining as integration of several related research areas: Report on KDD-2000 workshop on text mining. SIGKDD Explorations, 2(2), 2001.

[18] Louise Guthrie, James Pustejovsky, Yorick Wilks, and Brian M. Slator. The role of lexicons in natural language processing. Communications of the ACM, 39(1):63-72, 1996.

[19] Udo Hahn and Inderjeet Mani. The challenges of automatic summarization. Computer, 33(11):29-36, 2000.

[20] Marti A. Hearst. Untangling text data mining. In Proceedings of ACL'99: the 37th Annual Meeting of the Association for Computational Linguistics, 1999.

[21] J.M. Gómez Hidalgo. Evaluating Cost-Sensitive Unsolicited Bulk Email Categorization. ACM Symposium on Applied Computing, Special Track on Information Access and Retrieval, 2002.

[22] J.M. Gómez Hidalgo, M. Maña López, and E. Puertas Sanz. Combining text and heuristics for cost-sensitive spam filtering. In Proceedings of the Fourth Computational Natural Language Learning Workshop, CoNLL-2000. Association for Computational Linguistics, 2000.

[23] J.M. Gómez Hidalgo, R. Murciano Quejido, A. Díaz Esteban, M. de Buenaga Rodríguez, and E. Puertas Sanz. Categorizing photographs for user-adapted searching in a news agency e-commerce application. In Proceedings of the 1st International Workshop on New Developments in Digital Libraries (NDDL-2001), International Conference on Enterprise Information Systems (ICEIS 2001), 2001.

[24] Nancy Ide and Jean Veronis. Introduction to the special issue on word sense disambiguation: The state of the art. Computational Linguistics, 24:1-40, 1998.

[25] A. Kilgarriff. Senseval: An exercise in evaluating word sense disambiguation programs. In Proc. LREC, 1998.

[26] M.J. Maña López, M. de Buenaga Rodríguez, and J.M. Gómez Hidalgo. Using and evaluating user-directed summaries to improve information access. In Research and Advanced Technology for Digital Libraries (Lecture Notes in Computer Science, Vol. 1696). Springer-Verlag, 1999.

[27] Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, 1999.

[28] Douglas W. Oard and Gary Marchionini. A conceptual framework for text filtering process. Technical Report CS-TR-3643, 1996.

[29] Mehran Sahami, Susan Dumais, David Heckerman, and Eric Horvitz. A Bayesian approach to filtering junk e-mail. In Learning for Text Categorization: Papers from the 1998 Workshop, Madison, Wisconsin, 1998. AAAI Technical Report WS-98-05.

[30] Georgios Sakkis, Ion Androutsopoulos, Georgios Paliouras, Vangelis Karkaletsis, Constantine D. Spyropoulos, and Panagiotis Stamatopoulos. Stacking classifiers for anti-spam filtering of e-mail. In Proceedings of EMNLP01, 6th Conference on Empirical Methods in Natural Language Processing, Pittsburgh, US, 2001. Association for Computational Linguistics, Morristown, US.

[31] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.

[32] Gerard Salton. Automatic text processing: the transformation, analysis, and retrieval of information by computer. Addison Wesley, 1989.

[33] Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 2002. Forthcoming.

[34] L.A. Ureña, M. de Buenaga, and J.M. Gómez Hidalgo. Integrating linguistic resources in TC through WSD. Computers and the Humanities, May 2001.

[35] L.A. Ureña, J.M. Gómez, and M. de Buenaga. Information retrieval by means of word sense disambiguation. In Proceedings of the TSD 2000 Third International Workshop on TEXT, SPEECH and DIALOGUE, 2000.

[36] Charles L. Wayne. Multilingual topic detection and tracking: Successful research enabled by corpora and evaluation. In Proc. LREC, 2000.

[37] S. M. Weiss, C. Apte, F. J. Damerau, D. E. Johnson, F. J. Oles, T. Goetz, and T. Hampp. Maximizing text-mining performance. IEEE Intelligent Systems, July-August 1999.

[38] Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 1999.

[39] Y. Yang and J.O. Pedersen. A comparative study on feature selection in text categorization. In Proceedings of ICML97, 14th International Conference on Machine Learning, 1997.

[40] Yiming Yang. An evaluation of statistical approaches to text categorization. Information Retrieval, 1(1-2), 1999.