Plenty of high-quality research is carried out at the Algodan Centre of Excellence. This page highlights some of the most significant results and the most interesting lines of research of the centre's researchers.
- Computational creativity
- Estimation of unnormalized probabilistic models
- Fast evaluation of Tutte polynomials
- Fast zeta transforms for lattices
- Global map of human gene expression
- Indexing repetitive collections and their automata representations
- Methods for network abstraction
- New methods for estimating linear and nonlinear causal models from data
- Randomization methods for sequences
- Sinuhe - machine translation by machine learning
- Subgraph mining
- Testing independent components
Computational creativity
The goal of computational creativity is to model, simulate or enhance creativity. It is a modern research area: the first international conference (ICCC) dedicated to it was held in 2010.
We have recently taken up computational creativity as a new research area. It is interesting in its own right, but also as an application area for data mining and machine learning methods. We have developed novel, data mining inspired methods for computational poetry that minimize the need for manually coded or language-specific knowledge (ICCC 2012). We have also proposed methods for computational humour (ICAART 2012). Our existing activities in network analysis and browsing have also proved useful in supporting creative problem solving (ICCC 2010). We are currently exploring automated composition of music based on sleep measurements (manuscript). We have also contributed to the computational creativity community by organizing an international autumn school on the topic (http://www.cs.helsinki.fi/en/ascc/).
Estimation of unnormalized probabilistic models
Based on our initial work on score matching in 2005, we developed a new approach for estimating statistical models that are not properly normalized (i.e., the partition function is not known analytically and is very difficult to compute). The new method is based on the intuitive idea of learning a classifier to distinguish between the real data and artificially generated noise. The project resulted in a long JMLR paper which also improves on known results in modelling natural image statistics. Further generalizations of the framework were presented in two UAI papers.
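The classifier idea can be illustrated with a small self-contained sketch (a toy example written for this page, not the authors' implementation): we estimate a one-dimensional unnormalized Gaussian, treating the log normalizing constant c as a free parameter, by logistic regression between the observed data and Gaussian noise.

```python
import math
import random

random.seed(0)

# Unnormalized model: log p(x; mu, c) = -(x - mu)^2 / 2 + c, where c plays
# the role of the unknown log normalizing constant and is estimated as a
# free parameter -- the distinctive feature of noise-contrastive estimation.
def log_model(x, mu, c):
    return -(x - mu) ** 2 / 2 + c

# Noise distribution: standard normal, with a known log-density.
def log_noise(x):
    return -x * x / 2 - 0.5 * math.log(2 * math.pi)

data = [random.gauss(1.5, 1.0) for _ in range(2000)]   # true mean is 1.5
noise = [random.gauss(0.0, 1.0) for _ in range(2000)]

mu, c = 0.0, 0.0
lr = 0.1
for _ in range(400):
    g_mu = g_c = 0.0
    for x in data:                       # samples labelled "real"
        h = 1 / (1 + math.exp(log_noise(x) - log_model(x, mu, c)))
        g_mu += (1 - h) * (x - mu)       # d/dmu log_model(x) = x - mu
        g_c += 1 - h
    for x in noise:                      # samples labelled "noise"
        h = 1 / (1 + math.exp(log_noise(x) - log_model(x, mu, c)))
        g_mu -= h * (x - mu)
        g_c -= h
    mu += lr * g_mu / len(data)
    c += lr * g_c / len(data)
```

At convergence, mu approaches the true mean and exp(c) approaches the reciprocal of the partition function, here 1/sqrt(2*pi), even though no normalization was ever computed explicitly.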
References
- M. U. Gutmann and A. Hyvärinen. Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics. J. Machine Learning Research 13:307-361, 2012.
- M. Pihlaja, M. Gutmann and A. Hyvärinen. A Family of Computationally Efficient and Simple Estimators for Unnormalized Statistical Models. Proc. Conf. on Uncertainty in Artificial Intelligence (UAI2010), Catalina Island, California, 2010.
- M. U. Gutmann and J. Hirayama. Bregman Divergence as General Framework to Estimate Unnormalized Statistical Models. Proc. Conf. on Uncertainty in Artificial Intelligence (UAI2011), Barcelona, Spain, 2011.
Fast evaluation of Tutte polynomials
The Tutte polynomial of a graph is among the most fundamental invariants in graph theory, with connections to areas such as graph coloring, network reliability, the mathematical theory of knots, and statistical physics. These connections essentially stem from the fact that the Tutte polynomial is a universal invariant, in the sense that any graph invariant admitting a linear recurrence in terms of deletions and contractions of edges reduces to an evaluation of the Tutte polynomial.
Compared with the classical deletion-contraction algorithm for computing the Tutte polynomial, we [Björklund, Husfeldt, Kaski, Koivisto; FOCS 2008] have developed a substantially faster algorithm that runs in time within a polynomial factor of the number of connected vertex sets. The algorithm evaluates a multivariate generalization of the Tutte polynomial by (i) making use of a combinatorial identity due to Fortuin and Kasteleyn and (ii) building on our prior "fast subset convolution" techniques [Björklund, Husfeldt, Kaski, Koivisto; STOC 2007].
We anticipate that this result will be included in future textbooks in the field.
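For contrast, the classical deletion-contraction recurrence that the new algorithm improves upon fits in a few lines of Python; the following exponential-time sketch (for illustration only) evaluates T(G; x, y) on small graphs.

```python
def reachable(edges, u, v):
    """True iff v can be reached from u using the given edge list."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, stack = {u}, [u]
    while stack:
        w = stack.pop()
        for n in adj.get(w, ()):
            if n not in seen:
                seen.add(n)
                stack.append(n)
    return v in seen

def tutte(edges, x, y):
    """Evaluate the Tutte polynomial T(G; x, y) by deletion-contraction.
    edges is a list of (u, v) pairs; multi-edges and self-loops allowed."""
    if not edges:
        return 1                                   # edgeless graph
    (u, v), rest = edges[0], edges[1:]
    if u == v:                                     # loop: T(G) = y * T(G - e)
        return y * tutte(rest, x, y)
    contracted = [(u if a == v else a, u if b == v else b) for a, b in rest]
    if not reachable(rest, u, v):                  # bridge: T(G) = x * T(G / e)
        return x * tutte(contracted, x, y)
    return tutte(rest, x, y) + tutte(contracted, x, y)

triangle = [(0, 1), (1, 2), (2, 0)]
```

For the triangle, T = x^2 + x + y, so T(1, 1) = 3 counts its spanning trees and T(2, 2) = 2^3 = 8 counts its spanning subgraphs, two of the many specializations that make the polynomial universal.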
Fast zeta transforms for lattices
The zeta transform and its inverse, the Möbius transform, play a central role in the combinatorics of various partial orders. The recent success of fast, FFT-like algorithms for computing these transforms on the subset lattice of finite sets has raised the question of whether fast algorithms also exist for other classes of partial orders. We have discovered such fast algorithms: first for the lattice of ideals of a partial order (a sublattice of the subset lattice), which has direct applications in structure learning of Bayesian networks (Parviainen & Koivisto, AISTATS'10); and then more generally for arbitrary lattices that have a relatively small number of join-irreducible elements (Björklund, Husfeldt, Kaski, Koivisto, Nederlof & Parviainen, SODA'12).
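On the subset lattice the fast transforms are particularly simple. The standard sketch below computes g(S) = Σ_{T ⊆ S} f(T) for all S using n·2^n additions instead of the naive sum over all subset pairs; the SODA'12 result generalizes this trick beyond the subset lattice.

```python
def zeta_transform(f, n):
    """Fast zeta transform on the subset lattice of an n-element set:
    returns g with g[S] = sum of f[T] over all subsets T of S, where
    sets are encoded as n-bit masks. Uses n * 2^n additions."""
    g = list(f)
    for i in range(n):            # sweep one lattice "dimension" at a time
        for s in range(1 << n):
            if s & (1 << i):
                g[s] += g[s ^ (1 << i)]
    return g

def mobius_transform(g, n):
    """Inverse of the zeta transform, recovering f from g."""
    f = list(g)
    for i in range(n):
        for s in range(1 << n):
            if s & (1 << i):
                f[s] -= f[s ^ (1 << i)]
    return f
```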
Global map of human gene expression
In a joint project with the European Bioinformatics Institute (Alvis Brazma, Margus Lukk), we analyzed a globally collected dataset of gene expression in humans (Affymetrix U133A microarray data on 5372 samples from 206 different studies generated in 163 different laboratories). By applying various clustering and dimension reduction approaches to this very high-dimensional (18609 dimensions) data, we quite unexpectedly found a ‘continent’ structure, i.e., a small number of distinct expression profile classes with quite natural biological interpretations (http://www.genomeweb.com/informatics/ebi-helsinki-team-integrates-array-data-thousands-samples-map-global-expression-).
Publications
- M. Lukk, M. Kapushesky, J. Nikkilä, H. Parkinson, A. Goncalves, W. Huber, E. Ukkonen & A. Brazma. A global map of human gene expression. Nature Biotechnology 28, 4 (April 2010), 322–324.
Indexing repetitive collections and their automata representations
Our long-standing research on compressed index structures has evolved towards answering the intriguing question of how to manage the massive sequence collections resulting from next-generation DNA sequencing technologies. We have focused on sequence collections with a highly repetitive structure, such as individual genomes that share large portions in common, and developed self-indexes whose space requirement depends on the number of mutations in the collection [MNSV10]. Our self-indexes provide the elementary operations for accessing the collection, namely fast extraction of any substring and fast search for exact and approximate occurrences of a pattern. For more complex tasks, the self-indexes can serve as a basis: one can build a compressed suffix tree on top of a self-index and thereby simulate many classical suffix-tree-based algorithms for efficient sequence analysis and pattern discovery [FMN09, FMV08]. More recently, we extended our approach to work with an acyclic finite automaton representing a genome together with its common variants [SVM11]. As an illustration of the benefits of the approach, the index built on the human genome with the Finnish subset of variants reported in the 1000 Genomes Project occupies 3.3 GB and enables fast alignment of sequencing reads directly against known variants, significantly enhancing standard variant calling pipelines.
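As a point of comparison for the elementary operations, here is the classical, uncompressed way to support pattern search: a plain suffix array with binary search (a textbook sketch, not our self-indexes, which aim to provide the same interface in compressed space).

```python
def suffix_array(text):
    """Indices of all suffixes of text in lexicographic order.
    O(n^2 log n) construction -- fine for a small illustration."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def locate(text, sa, pat):
    """All starting positions of pat in text, by two binary searches
    over the suffix array (matching suffixes form a contiguous block)."""
    m = len(pat)
    lo, hi = 0, len(sa)
    while lo < hi:                       # first suffix whose m-prefix >= pat
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pat:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                       # first suffix whose m-prefix > pat
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pat:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[i] for i in range(start, lo))

genome = "ACGTACGTGACG"
sa = suffix_array(genome)
```

Here locate(genome, sa, "ACG") finds the occurrences at positions 0, 4 and 9. Substring extraction is trivial for a plain string, but becomes the non-trivial counterpart operation once the text itself is stored only in compressed form.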
Publications
- [SVM11] Jouni Sirén, Niko Välimäki, and Veli Mäkinen. Indexing Finite Language Representation of Population Genotypes. In Proc. WABI 2011, Saarbrücken, Germany, September 5-7, 2011.
- [MNSV10] Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki. Storage and Retrieval of Highly Repetitive Sequence Collections. Journal of Computational Biology 17(3):281-308, 2010. Earlier in RECOMB 2009 and SPIRE 2008.
- [FMV08] Johannes Fischer, Veli Mäkinen, and Niko Välimäki. Space-Efficient String Mining under Frequency Constraints. ICDM 2008, Pisa, Italy, 2008.
- [FMN09] Johannes Fischer, Veli Mäkinen, and Gonzalo Navarro. Faster Entropy-Bounded Compressed Suffix Trees. Theoretical Computer Science, 410(51):5354-5364, 2009.
Methods for network abstraction
Graphs and networks are used in numerous applications to describe relationships between entities, such as social relations between persons, links between web pages, flow of traffic, or interactions between proteins. Graphs with even a few dozen nodes and edges may already be difficult to visualize and understand. Therefore, techniques to simplify graphs are needed.
We have developed a range of novel methods for making large, weighted graphs simpler or for extracting relevant information from them.
One approach is based on simplifying weighted graphs by pruning the least important edges from them. We have given concepts, bounds, and algorithms for the problem (ICDM 2010, IDA 2010). A more elaborate approach is to compress graphs into smaller ones by grouping nodes and edges such that the original graph can be approximately recovered (ACM SIGKDD 2011, manuscript). These techniques also have applications in privacy-preserving graph mining (manuscript).
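A much-simplified sketch of the edge-pruning idea (an illustration of the general principle only, not the algorithms of the cited papers): greedily remove edges whose endpoints stay connected by an alternative path that is at least as short, so that all shortest-path distances are preserved exactly.

```python
import heapq

def shortest_dist(adj, src, dst, skip):
    """Dijkstra distance from src to dst, ignoring the edge `skip`."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj[u]:
            if (u, v) == skip or (v, u) == skip:
                continue
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return float("inf")

def prune_redundant(nodes, edges):
    """Drop edges (heaviest first) whose endpoints remain connected by
    an alternative path no longer than the edge itself."""
    kept = list(edges)
    for u, v, w in sorted(edges, key=lambda e: -e[2]):
        adj = {x: [] for x in nodes}
        for a, b, wt in kept:
            adj[a].append((b, wt))
            adj[b].append((a, wt))
        if shortest_dist(adj, u, v, (u, v)) <= w:
            kept.remove((u, v, w))
    return kept
```

On a triangle with edge weights 1, 1 and 2, only the weight-2 edge is redundant and gets pruned; the remaining two edges carry all shortest paths.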
Often the user is interested in those parts of a network that are related to some given nodes. We have defined the problem of most reliable subgraph extraction and given several algorithms for it (DMKD 2009, ASONAM 2010, PAKDD 2010). We have also proposed flexible models for retrieving a non-redundant set of relevant objects (ACM SAC 2012).
The more applied line of this research has produced Biomine, a search engine prototype that integrates and indexes data from several publicly available biological databases (BMC Bioinformatics, accepted). Biomine represents the data as a large weighted graph, and its query tools aid explorative discovery of non-trivial connections between biological entities, such as genes and phenotypes. See http://biomine.cs.helsinki.fi/
New methods for estimating linear and nonlinear causal models from data
Linear causal models based on directed acyclic graphs have been studied for approximately 20 years, and many methods have been developed for estimating them. However, a common assumption of these methods has been that the data is normally distributed, an assumption that significantly restricts the identifiability of the models. In 2006, we showed that if the data is non-Gaussian, one can obtain much more information on the underlying data generating process. Extending our earlier work on causal discovery, we have developed novel measures of causal directionality for just two variables (i.e., does x cause y or does y cause x), measures for inferring causal relationships among high-dimensional variables, a procedure for estimating the strength of causal effects in the presence of hidden variables, and identifiability results for linear cyclic causal models based on randomized experiments.
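For the two-variable case, the likelihood-ratio idea can be sketched as follows (a simplified variant using Hyvärinen's classical maximum-entropy approximation of differential entropy, not the exact estimator of the papers below): the correct direction is the one whose regression residual yields the smaller total entropy, i.e., the higher likelihood.

```python
import math
import random

def standardize(u):
    m = sum(v for v in u) / len(u)
    s = math.sqrt(sum((v - m) ** 2 for v in u) / len(u))
    return [(v - m) / s for v in u]

def entropy(u):
    """Maximum-entropy approximation of differential entropy for a
    standardized variable (constants from Hyvarinen, 1998)."""
    k1, k2, gamma = 79.047, 7.4129, 0.37457
    t1 = sum(math.log(math.cosh(v)) for v in u) / len(u) - gamma
    t2 = sum(v * math.exp(-v * v / 2) for v in u) / len(u)
    return 0.5 * (1 + math.log(2 * math.pi)) - k1 * t1 ** 2 - k2 * t2 ** 2

def causal_direction(x, y):
    """Decide between x -> y and y -> x for non-Gaussian data: the true
    direction leaves a residual that makes the two-term entropy sum
    smaller (equivalently, the factorized model more likely)."""
    x, y = standardize(x), standardize(y)
    rho = sum(a * b for a, b in zip(x, y)) / len(x)
    r_yx = standardize([b - rho * a for a, b in zip(x, y)])  # resid. of y on x
    r_xy = standardize([a - rho * b for a, b in zip(x, y)])  # resid. of x on y
    forward = entropy(x) + entropy(r_yx)
    backward = entropy(y) + entropy(r_xy)
    return "x->y" if forward < backward else "y->x"

# Toy data in which x causes y, with uniform (non-Gaussian) source and noise.
random.seed(1)
x = [random.uniform(-1, 1) for _ in range(5000)]
y = [a + 0.5 * random.uniform(-1, 1) for a in x]
```

With Gaussian data the two entropy sums coincide and the direction is unidentifiable, which is exactly why non-Gaussianity helps.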
Publications
- A. Hyvärinen and S. M. Smith. Pairwise Likelihood Ratios for Estimation of Non-Gaussian Structural Equation Models. Under revision for J. of Machine Learning Research.
- D. Janzing, P. O. Hoyer, and B. Schölkopf. Telling cause from effect based on high-dimensional observations. Proceedings of the 27th International Conference on Machine Learning (ICML-2010), Haifa, Israel, 2010.
- D. Entner, P. O. Hoyer, and P. Spirtes. Statistical test for consistent estimation of causal effects in linear non-Gaussian models. Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS-2012), La Palma, Canary Islands, 2012.
- A. Hyttinen, F. Eberhardt, and P. O. Hoyer. Learning linear cyclic causal models with latent variables. Under revision for J. of Machine Learning Research.
Randomization methods for sequences
Based on our earlier work on randomization in data analysis, we have applied randomization and statistical testing methodology to the analysis of sequences. As an example on (word) sequences, we have shown that even the simplest statistics, such as word frequencies in a text, may be misleading: many words are falsely identified as frequent under typical assumptions, such as the bag-of-words assumption, that ignore the spatial patterns in language (Lijffijt et al. 2011). This is part of our focus on computational linguistics in collaboration with the group of Prof. Terttu Nevalainen (Nevalainen et al. 2011; others). We have also analyzed sequences in bioinformatics and time series: Kallio et al. (2011) apply randomization methods to assess the significance of gene periodicity results, and Papapetrou et al. (2011) study subsequence matching in time series databases.
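A toy Monte Carlo experiment makes the pitfall concrete (our illustration, not the method of Lijffijt et al.): under the bag-of-words null, word occurrences are exchangeable across positions, so a word whose occurrences are clumped into one part of the text looks wildly significant there, even though the clump may simply reflect one document's topic.

```python
import random

def window_pvalue(positions, text_len, window, iters=999, seed=0):
    """Monte Carlo p-value, under the bag-of-words null, of seeing at
    least as many occurrences of a word in the first `window` positions
    as were actually observed. positions: indices where the word occurs."""
    rng = random.Random(seed)
    observed = sum(1 for p in positions if p < window)
    hits = 0
    for _ in range(iters):
        shuffled = rng.sample(range(text_len), len(positions))
        if sum(1 for p in shuffled if p < window) >= observed:
            hits += 1
    return (hits + 1) / (iters + 1)      # standard permutation-test correction

# 50 occurrences clumped into the first 100 of 1000 positions vs. spread out.
p_bursty = window_pvalue(list(range(0, 100, 2)), 1000, 100)
p_spread = window_pvalue(list(range(0, 1000, 20)), 1000, 100)
```

The clumped word gets a tiny p-value while the evenly spread one does not, although both have exactly the same overall frequency; accounting for such burstiness is what the inter-arrival-time and bootstrapping approach is about.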
Publications
- Jefrey Lijffijt, Panagiotis Papapetrou, Kai Puolamäki, and Heikki Mannila. Analyzing word frequencies in large text corpora using inter-arrival times and bootstrapping. In Proc ECML PKDD, 2011.
- Terttu Nevalainen, Helena Raumolin-Brunberg, and Heikki Mannila. The diffusion of language change in real time: Progressive and conservative individuals and the time depth of change. Language Variation and Change, 23: 1-43, 2011.
- Aleksi Kallio, Niko Vuokko, Markus Ojala, Niina Haiminen, and Heikki Mannila. Randomization Techniques for Assessing the Significance of Gene Periodicity Results. BMC Bioinformatics, 12:330, 2011.
- Panagiotis Papapetrou, Vassilis Athitsos, Michalis Potamias, George Kollios, and Dimitrios Gunopulos. Embedding-based Subsequence Matching in Time Series Databases. ACM TODS, 36(3): 17, 2011.
Sinuhe - machine translation by machine learning
The goal in statistical machine translation (SMT) is to automatically learn from a parallel bilingual corpus a model that can translate natural language A to natural language B (say, French to English). SMT is an active research topic and also widely used in practice (e.g., Google Translate). Still, the machine translation problem is far from solved.
Sinuhe is a phrase-based SMT system developed within the SMART EU project by Matti Kääriäinen. The key component in Sinuhe is a new translation model which replaces the naive independence assumption and local parameter estimation schemes used in current SMT systems by well-founded machine learning technologies. The hope is that the better statistical foundations and clear model structure will lead to improved understanding of the translation process, straightforward integration of abstract (linguistically motivated) local features, and eventually to better translation quality. The phrase-based translation model in Sinuhe takes the form of a conditional exponential family probability distribution over a phrase-level feature representation.
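In miniature, such a model sets p(target | source) proportional to exp(w · φ(source, target)) over the candidate target phrases of a source phrase. The tiny phrase table, feature map, and weights below are hypothetical, purely to show the shape of the model:

```python
import math

def translation_probs(weights, candidates, features):
    """Toy conditional exponential family model: p(tgt | src) is a softmax
    of linear feature scores over the candidate target phrases of src."""
    out = {}
    for src, tgts in candidates.items():
        scores = {t: sum(weights.get(f, 0.0) * v
                         for f, v in features(src, t).items())
                  for t in tgts}
        z = sum(math.exp(s) for s in scores.values())   # partition function
        out[src] = {t: math.exp(s) / z for t, s in scores.items()}
    return out

# Hypothetical phrase table and features, for illustration only.
candidates = {"maison": ["house", "home"]}

def features(src, tgt):
    return {("pair", src, tgt): 1.0, "target_length": len(tgt.split())}

weights = {("pair", "maison", "house"): 2.0, ("pair", "maison", "home"): 1.0}
probs = translation_probs(weights, candidates, features)
```

Because the features are arbitrary functions of the phrase pair, linguistically motivated features can be added without changing the model structure; training then means fitting the weight vector globally rather than estimating local probabilities separately.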
The Sinuhe translation system is released under the GPLv3 open source license. The release includes C++ source code for tools for training new translation models and producing translations with them, as well as some pre-trained models to ease experimenting with the system. Sinuhe scales well to large data: we have already trained models with almost 100M parameters on corpora with more than 22M sentence pairs and 10G words. In prior work with similar but simpler conditional random field models, the number of model parameters has to our knowledge always been at least an order of magnitude smaller. To make training feasible, we have implemented a distributed client-server architecture with which training can be parallelized to hundreds of machines. Predicting translations can likewise be distributed to low-end end-user machines.
Earlier versions of Sinuhe have already been used in the SMART project's Wikipedia user trials, and newer versions will be used in the forthcoming SMART use cases. Sinuhe is currently used for research purposes outside the SMART project, including outside universities.
So far, the main benefits of Sinuhe are its fast translation speed, relatively small memory usage, and the ability to produce internal confidence estimates for translations. The possibility to produce reasonable raw translations extremely fast (about 40 translations per second with a single CPU on Europarl data) makes it possible to translate very large corpora in reasonable time. This may open up new application areas in, e.g., semantic indexing for cross-lingual information access.
Publications
- Matti Kääriäinen. Statistical Machine Translation using a Globally Trained Conditional Exponential Family Translation Model. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 1027–1036, Singapore, 2009.
Subgraph mining
We have defined novel frameworks for subgraph extraction [1, 2]. The models are based on simple probabilistic graphs [1] or their first-order generalizations [2], which allow an elegant formulation of the problem of locating a subgraph of maximum relevance to some user-specified nodes. We have also introduced efficient algorithms for these tasks.
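A minimal sketch of the underlying probabilistic-graph semantics (a Monte Carlo estimate of two-terminal reliability, for illustration; the algorithms in the papers are of course more refined): each edge exists independently with its own probability, and the reliability of a subgraph is the probability that the query nodes stay connected.

```python
import random

def connected(edges, s, t):
    """True iff s and t are in the same component of the given edge set."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, stack = {s}, [s]
    while stack:
        u = stack.pop()
        for w in adj.get(u, ()):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return t in seen

def reliability(prob_edges, s, t, iters=5000, seed=0):
    """Monte Carlo estimate of the probability that s and t are connected
    when each edge (u, v, p) exists independently with probability p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(iters):
        present = [(u, v) for u, v, p in prob_edges if rng.random() < p]
        if connected(present, s, t):
            hits += 1
    return hits / iters

# Two independent two-edge paths from s to t, each edge present w.p. 0.9:
# true reliability is 1 - (1 - 0.81)^2 = 0.9639.
g = [("s", "a", 0.9), ("a", "t", 0.9), ("s", "b", 0.9), ("b", "t", 0.9)]
r = reliability(g, "s", "t")
```

Finding a small subgraph that retains as much of this connection probability as possible is the extraction problem formulated in [1].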
Publications
- [1] Petteri Hintsanen, Hannu Toivonen. Finding Reliable Subgraphs from Large Probabilistic Graphs. Data Mining and Knowledge Discovery 17 (1): 3-23. 2008.
- [2] Luc De Raedt, Kristian Kersting, Angelika Kimmig, Kate Revoredo, Hannu Toivonen. Compressing Probabilistic Prolog Programs. Machine Learning 70 (2-3): 151-168. 2008.
Testing independent components
In the ICA research community, the focus has been almost exclusively on estimation, while testing of the results has received very little attention. However, in any practical application it would be extremely important to be able to test the statistical significance of the components; otherwise, we do not know whether the results are merely due to random noise or to local minima. We have developed a new testing framework for ICA based on the idea of having many datasets (or splitting one dataset into many), applying ICA separately to each dataset, and then investigating whether the results are sufficiently similar across the datasets. We were able to formulate a proper null hypothesis and use the conventional machinery of statistical hypothesis testing. A Matlab package has also been made available on the web.
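The split-half idea can be sketched end to end in two dimensions (a toy kurtosis-based FastICA written for this illustration, not the released package): estimate the components separately on two halves of the data, map the unmixing directions back to observation space, and check that the directions found on the two halves match.

```python
import math
import random

def whiten(xs):
    """Center and whiten 2-D data; also return the rows of the whitening
    matrix V so directions can be mapped back to observation space."""
    n = len(xs)
    mx = sum(p[0] for p in xs) / n
    my = sum(p[1] for p in xs) / n
    xs = [(p[0] - mx, p[1] - my) for p in xs]
    a = sum(p[0] * p[0] for p in xs) / n
    b = sum(p[0] * p[1] for p in xs) / n
    c = sum(p[1] * p[1] for p in xs) / n
    t = math.sqrt(((a - c) / 2) ** 2 + b * b)     # 2x2 eigendecomposition
    rows = []
    for lam in ((a + c) / 2 + t, (a + c) / 2 - t):
        ex, ey = ((lam - c, b) if abs(b) > 1e-12 else
                  ((1.0, 0.0) if abs(lam - a) < abs(lam - c) else (0.0, 1.0)))
        s = math.hypot(ex, ey) * math.sqrt(lam)
        rows.append((ex / s, ey / s))
    z = [(rows[0][0] * p[0] + rows[0][1] * p[1],
          rows[1][0] * p[0] + rows[1][1] * p[1]) for p in xs]
    return z, rows

def ica_directions(xs, iters=60):
    """Both unmixing directions in observation space, via the fixed-point
    iteration w <- E[z (w.z)^3] - 3w for the kurtosis contrast."""
    z, rows = whiten(xs)
    w = (1.0, 0.0)                                # fixed initial vector
    for _ in range(iters):
        g0 = g1 = 0.0
        for p in z:
            u = (w[0] * p[0] + w[1] * p[1]) ** 3
            g0 += p[0] * u
            g1 += p[1] * u
        w0, w1 = g0 / len(z) - 3 * w[0], g1 / len(z) - 3 * w[1]
        nrm = math.hypot(w0, w1)
        w = (w0 / nrm, w1 / nrm)
    dirs = []
    for wv in (w, (-w[1], w[0])):   # 2nd component: orthogonal in whitened space
        dx = wv[0] * rows[0][0] + wv[1] * rows[1][0]
        dy = wv[0] * rows[0][1] + wv[1] * rows[1][1]
        nrm = math.hypot(dx, dy)
        dirs.append((dx / nrm, dy / nrm))
    return dirs

# Two uniform sources, linearly mixed; split the sample into two halves.
random.seed(0)
data = []
for _ in range(6000):
    s1, s2 = random.uniform(-1, 1), random.uniform(-1, 1)
    data.append((s1 + 0.5 * s2, 0.5 * s1 + s2))
d1, d2 = ica_directions(data[:3000]), ica_directions(data[3000:])
# For each direction of half 1, its best match (|cosine|) in half 2.
similarity = [max(abs(a[0] * b[0] + a[1] * b[1]) for b in d2) for a in d1]
```

Consistently high cross-half similarity is the raw ingredient that the actual framework turns into a proper null hypothesis and p-values.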
For more information, see http://www.cs.helsinki.fi/u/ahyvarin/code/isctest/