Department of Computer Science

58304304 Research Seminar on Intelligent Systems (1-2 cu), Fall 2004

Description of the topic area

The strength of the Internet is that billions of pages are available, presenting information on an amazing variety of topics in an amazing variety of styles (e.g., newsgroups, magazines, references, technical data, tutorials, and sales literature). The downside of the Internet is that there are billions of pages of information, most of them titled according to the whim of their author and written in subtly different terminology that defeats keyword search. Subject-specific search sites have emerged to help with this situation, yet they are time-consuming to maintain, only sometimes provide good coverage (Citeseer for computer science research papers is one successful example), and rarely provide a sophisticated interface. Moreover, more sophisticated methods, such as analysis of the structure of pages with their mixed topics, stylistic variations, and choice of terminology, are only beginning to be understood.

Consequently, given the vast amount of information available on the Internet, document search systems will evidently constitute a fundamental part of the Internet in the future. However, the abundance of available information poses new challenges for even the best current search engines, and what is needed are qualitatively better ways to answer user queries. In this seminar we will study the mathematical models and computing techniques needed for the next generation of Internet search services. The topics focus on statistical modeling techniques such as multinomial Principal Component Analysis (mPCA), and address both the theoretical development and the applied aspects required for handling very large (gigabyte- and terabyte-scale) document data sets. Such methods are needed to implement new features such as hierarchical multi-aspect clustering, automatic extraction of subject-specific topic hierarchies, and intelligent query matching.

Suggested reading


Petri Myllymäki


Basic knowledge of probability theory is needed in the seminar.


The seminar will be held in English.


16.09.-09.12.2004, Thursdays at 16-18 in B222.

Date Speaker Topic
16.09. Petri Myllymäki Seminar kick-off.
23.09. No seminar.
30.09. Wray Buntine S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, S.R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins. Hypersearching the Web. Scientific American, June 1999.
Sergey Brin and Lawrence Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks and ISDN Systems, v30, 1998.
07.10. Sami Perttu W. Buntine, J. Löfström, J. Perkiö, S. Perttu, V. Poroshin, T. Silander, H. Tirri, A. Tuominen, V. Tuulos, A Scalable Topic-Based Open Source Search Engine. In Proceedings of the IEEE/WIC/ACM Conference on Web Intelligence (WI'04) (Beijing, China, September 2004).
14.10. Ville Tuulos V. Tuulos, H. Tirri, Combining Topic Models and Social Networks for Chat Data Mining. In Proceedings of the IEEE/WIC/ACM Conference on Web Intelligence (WI'04) (Beijing, China, September 2004).
21.10. Petteri Nurmi John Canny, GaP: A Factor Model for Discrete Data.
28.10. Jaakko Löfström Cancelled. (S. Chakrabarti, M. van den Berg and B. Dom, Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery. 8th World Wide Web Conference, 1999.)
04.11. Jukka Perkiö J. Perkiö, W. Buntine, S. Perttu, Exploring Independent Trends in a Topic-Based Search Engine. In Proceedings of the IEEE/WIC/ACM Conference on Web Intelligence (WI'04) (Beijing, China, September 2004).
11.11. Mirva Salminen S-B. Kim, H-C. Seo, H-C. Rim, Information retrieval using word senses: root sense tagging approach. SIGIR 2004.
18.11. No seminar.
25.11. Vladimir Poroshin Semantic analysis of natural language (based on Prof. Tuzov's book "Computational Linguistics. Developing experience of computer-based dictionaries" and on Vladimir's previous work with him in Saint-Petersburg).
02.12. Petri Uusitalo Cancelled.
09.12. Miikka Miettinen On the Feasibility of Supporting Encyclopedia Navigation with Proactive Search.

