58304304 Research Seminar on Intelligent Systems (1-2 cu), Fall 2004
Description of the topic area
The strength of the Internet is that billions of pages are available, presenting information on an amazing variety of topics in an amazing variety of styles (e.g., newsgroups, magazines, references, technical data, tutorials, sales literature). The downside of the Internet is that there are billions of pages of information, most of them titled according to the whim of their author and using subtly different terminology that fools keyword search. Subject-specific search sites have emerged to help with this situation, yet they are time-consuming to maintain, only sometimes provide good coverage (Citeseer for computer science research papers is one successful example), and rarely provide a sophisticated interface. Moreover, more sophisticated methods, such as the analysis of pages and their structure, with their mixed topics, stylistic variations, and choice of terminology, are only beginning to be understood.
Consequently, given the vast amount of information available on the Internet, document search systems will be a fundamental part of the Internet in the future. However, the abundance of available information sets new challenges for even the best current search engines, and what is needed are qualitatively better ways to answer user queries. In this seminar we will study mathematical models and computing techniques needed for the next generation of Internet search services. The topics focus on statistical modeling techniques such as multinomial Principal Component Analysis (mPCA), and address both the theoretical development and the applied aspects required for handling very large (gigabyte- and terabyte-scale) document data sets. Such methods are needed to implement new features like hierarchical multi-aspect clustering, automatic extraction of subject-specific topic hierarchies, and intelligent query matching.
- S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, S.R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins. Hypersearching the Web. Scientific American, June 1999.
- Sergey Brin and Lawrence Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks and ISDN Systems, v30, 1998.
- A.N. Langville and C.D. Meyer, Deeper Inside PageRank. Internet Mathematics, 2004.
- T.H. Haveliwala, Topic Sensitive PageRank. WWW 2002.
- X. Long and T. Suel, Optimized Query Execution in Large Search Engines with Global Page Ordering. 29th VLDB, 2003.
- S-B. Kim, H-C. Seo, H-C. Rim, Information retrieval using word senses: root sense tagging approach. SIGIR 2004.
- S. Chakrabarti, M. Joshi, K. Punera and D. Pennock, The Structure of Broad Topics on the Web. WWW 2002.
- O. Drori, Display of Search Results in Google-based Yahoo! vs. LCC&K Interfaces: A Comparison Study. Informing Science, 2003.
- P. Ogilvie, J. Callan, Combining Structural Information and the Use of Priors in Mixed Named-Page and Homepage Finding. TREC 2003.
- S. Chakrabarti, M. van den Berg and B. Dom, Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery. 8th World Wide Web, 1999.
- W. Buntine, A. Jakulin, Applying Discrete PCA in Data Analysis. UAI-2004.
- M. Keller and S. Bengio, Theme Topic Mixture Model: A Graphical Model for Document Representation. Technical Report IDIAP-RR 04-05.
- Thomas Hofmann, Gaussian Latent Semantic Models for Collaborative Filtering. 26th Annual International ACM SIGIR Conference, 2003.
- T. Kohonen, S. Kaski, K. Lagus, J. Salojärvi, J. Honkela, V. Paatero, and A. Saarela. Self Organization of a Massive Document Collection. IEEE Transactions on Neural Networks, Special Issue on Neural Networks for Data Mining and Knowledge Discovery, volume 11, number 3, pages 574-585. May 2000.
- Justin Basilico, Thomas Hofmann, Unifying Collaborative and Content-Based Filtering.
Basic knowledge of probability theory is needed in the seminar.
The seminar will be held in English.
16.09.-09.12.2004, Thursdays at 16-18 in B222.
|16.09.||Petri Myllymäki||Seminar kick-off.|
|30.09.||Wray Buntine||S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, S.R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins. Hypersearching the Web. Scientific American, June 1999. Sergey Brin and Lawrence Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks and ISDN Systems, v30, 1998.|
|07.10.||Sami Perttu||W. Buntine, J. Löfström, J. Perkiö, S. Perttu, V. Poroshin, T. Silander, H. Tirri, A. Tuominen, V. Tuulos, A Scalable Topic-Based Open Source Search Engine. In Proceedings of the IEEE/WIC/ACM Conference on Web Intelligence (WI'04) (Beijing, China, September 2004).|
|14.10.||Ville Tuulos||V. Tuulos, H. Tirri, Combining Topic Models and Social Networks for Chat Data Mining. In Proceedings of the IEEE/WIC/ACM Conference on Web Intelligence (WI'04) (Beijing, China, September 2004).|
|21.10.||Petteri Nurmi||John Canny, GaP: A Factor Model for Discrete Data.|
|28.10.||Jaakko Löfström||Cancelled. (S. Chakrabarti, M. van den Berg and B. Dom, Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery. 8th World Wide Web, 1999.)|
|04.11.||Jukka Perkiö||J. Perkiö, W. Buntine, S. Perttu, Exploring Independent Trends in a Topic-Based Search Engine. In Proceedings of the IEEE/WIC/ACM Conference on Web Intelligence (WI'04) (Beijing, China, September 2004).|
|11.11.||Mirva Salminen||S-B. Kim, H-C. Seo, H-C. Rim, Information retrieval using word senses: root sense tagging approach. SIGIR 2004.|
|25.11.||Vladimir Poroshin||Semantic analysis of natural language (based on Prof. Tuzov's book "Computational Linguistics. Developing experience of computer-based dictionaries" and on Vladimir's previous work with him in Saint-Petersburg).|
|09.12.||Miikka Miettinen||On the Feasibility of Supporting Encyclopedia Navigation with Proactive Search|