Algodan - Software Library

BEANDisco - Bayesian Exact and Approximate Network Discovery

BEANDisco implements exact dynamic programming algorithms and approximate Markov chain Monte Carlo (MCMC) methods for Bayesian structure discovery in Bayesian networks. The notion of partial orders is used to reduce space requirements, to enable easy and massive parallelization, and to construct an efficient sampling space for MCMC.

BernoulliMix

The BernoulliMix program package provides tools for working with finite mixture models of multivariate Bernoulli distributions, also known as Bernoulli mixtures. The package can be used for probabilistic modeling of 0-1 data. The target audience includes researchers, teachers, and students in machine learning and data mining.
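To make the model concrete, the following minimal Python sketch evaluates the log-likelihood of 0-1 data under a finite Bernoulli mixture. It is an illustration only, not BernoulliMix's own code or interface; the function name `bernoulli_mixture_loglik` and its arguments are hypothetical.

```python
import math

def bernoulli_mixture_loglik(data, weights, thetas):
    """Log-likelihood of 0-1 rows under a finite Bernoulli mixture.

    data    : list of rows, each a list of 0/1 values
    weights : mixing proportions, one per component (should sum to 1)
    thetas  : thetas[k][j] is the probability that dimension j equals 1
              in component k
    All three names are illustrative assumptions, not a library API.
    """
    total = 0.0
    for row in data:
        row_likelihood = 0.0
        for weight, theta in zip(weights, thetas):
            # Within a component the dimensions are independent Bernoullis.
            component = weight
            for x, p in zip(row, theta):
                component *= p if x == 1 else (1.0 - p)
            row_likelihood += component
        total += math.log(row_likelihood)
    return total
```

For real data with many dimensions, the per-row products would normally be accumulated in log space to avoid underflow; the direct product is kept here for readability.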

Biomine - A biological search engine

We view biological databases of sequences, proteins, genes etc. as a weighted graph and develop methods for information search and discovery in such graphs.

Coral

Coral is an error correction algorithm for reads from DNA sequencing platforms such as the Illumina Genome Analyzer, Illumina HiSeq, or the Roche/454 Genome Sequencer.

EEL - Enhancer Element Locator

Enhancer Element Locator, or EEL, is a tool for locating distal gene enhancer elements in mammalian genomes by comparative genomics.

FiD - Fragment iDentificator

Fragment iDentificator (FiD) is a Windows application for identifying molecular fragments from tandem mass spectrometry data. FiD is aimed at mass spectrometrists and chemists, and assists in interpreting and analysing MS/MS spectra. FiD exhaustively lists suitable fragment structures for each measured mass-to-charge ratio peak, and also uses mixed integer linear programming techniques to suggest the whole fragmentation pattern, i.e. the set of fragments that explains the whole spectrum with a minimal number of bond changes.

FourierICA

FourierICA is an unsupervised learning method suitable for the analysis of rhythmic activity in EEG/MEG recordings. The method performs independent component analysis (ICA) on short-time Fourier transforms of the data. As a result, more "interesting" sources with (amplitude-modulated) oscillatory behaviour are uncovered and appropriately ranked. By means of a complex-valued mixing matrix, the method is also capable of revealing spatially distributed sources that appear with different phases in different EEG/MEG channels.

GCSA - Generalized Compressed Suffix Array

Compressed full-text indexes based on the Burrows-Wheeler transform are widely used in bioinformatics. Their most successful application so far has been mapping short reads to a reference sequence. GCSA is a generalization of these indexes to handle finite automata, e.g. those representing the known genetic variation within a population.

Hybrid SHREC

Hybrid SHREC is an error correction algorithm for reads from various DNA sequencing platforms. The code builds on an earlier version of SHREC intended for Solexa/Illumina reads.

InvCoal - a coalescent simulator

InvCoal is a coalescent simulator for generating synthetic SNP data sets with a simulated inversion. It also uses a multiple crossover model with a chiasma interference model for the modelling of gene flow between inverted and noninverted haplotypes.

LabelsOnBd

Connects features on a map to labels on the map boundary. Maintains optimal placement of the labels as a user zooms and pans the map.

The details appear in the paper:

M. Nöllenburg, V. Polishchuk, M. Sysikaski. Dynamic One-Sided Boundary Labeling, 18th International Conference on Advances in Geographic Information Systems, ACM SIGSPATIAL GIS'10

Maplab

This Java application performs intelligent placement of line numbers on a public transit map. It loads Google Transit data, adds route numbers, and produces an overlay on Google Maps.

The details appear in the paper:

V. Polishchuk, A. Vihavainen. Periodic Multi-Labeling of Public Transit Lines. 6th International Conference on Geographic Information Science, GIScience 2010

MIP Scaffolder

MIP Scaffolder is a program for scaffolding contigs produced by fragment assemblers using mate pair data such as those generated by ABI SOLiD or Illumina Genome Analyzer.

MOODS - Motif Occurrence Detection Suite

MOODS is a suite of algorithms for matching position weight matrices (PWMs) against DNA sequences. It features advanced matrix matching algorithms implemented in C++ that can be used to scan hundreds of matrices against chromosome-sized sequences in a few seconds.
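For readers unfamiliar with the task, the Python sketch below shows what matching a position weight matrix against a DNA sequence means. It is a naive sliding-window scan written for illustration, not one of the advanced algorithms MOODS implements and not the MOODS interface; the function name `scan_pwm` and the matrix layout (rows in the order A, C, G, T, one column per motif position) are assumptions.

```python
def scan_pwm(sequence, pwm, threshold):
    """Return (position, score) for every window scoring >= threshold.

    sequence  : string over the alphabet A, C, G, T
    pwm       : four rows (A, C, G, T), one score per motif position
    threshold : minimum total score for a window to be reported
    The layout and names are illustrative assumptions.
    """
    row_of = {"A": 0, "C": 1, "G": 2, "T": 3}
    width = len(pwm[0])
    hits = []
    for start in range(len(sequence) - width + 1):
        # Score of a window = sum of the matrix entries selected by its bases.
        score = 0.0
        for offset in range(width):
            score += pwm[row_of[sequence[start + offset]]][offset]
        if score >= threshold:
            hits.append((start, score))
    return hits
```

This scan takes time proportional to the sequence length times the motif width for each matrix, which is the baseline cost that faster matching algorithms aim to reduce.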

readaligner

A tool for mapping (short) DNA reads to reference sequences. It consists of algorithms based on the Burrows-Wheeler transform and backward backtracking. It also includes a novel data structure called the rotation index, which finds alignments with a higher number of mismatches in feasible time (at the cost of a larger index and a fixed pattern length).
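As background, the Python sketch below shows plain backward search on a Burrows-Wheeler transform, the exact-matching building block that backtracking extends to mismatches. It is not readaligner's code and does not implement backtracking or the rotation index. The function names are hypothetical, the index is built by naive suffix sorting, and the text is assumed not to contain the sentinel `$` or any character that sorts before it.

```python
def build_bwt(text):
    """Burrows-Wheeler transform via naive suffix sorting (illustration only)."""
    text += "$"  # sentinel, assumed smaller than every other character
    order = sorted(range(len(text)), key=lambda i: text[i:])
    # Each BWT character is the one preceding the corresponding sorted suffix.
    return "".join(text[i - 1] for i in order)

def count_occurrences(bwt, pattern):
    """Count exact occurrences of pattern using backward search."""
    # first_row[c] = number of characters in the text smaller than c.
    counts = {}
    for ch in bwt:
        counts[ch] = counts.get(ch, 0) + 1
    first_row = {}
    total = 0
    for ch in sorted(counts):
        first_row[ch] = total
        total += counts[ch]

    lo, hi = 0, len(bwt)  # half-open range of sorted suffixes
    for ch in reversed(pattern):
        if ch not in first_row:
            return 0
        # Rank queries are done by plain counting here; a real index
        # answers them in constant or logarithmic time.
        lo = first_row[ch] + bwt[:lo].count(ch)
        hi = first_row[ch] + bwt[:hi].count(ch)
        if lo >= hi:
            return 0
    return hi - lo
```

The pattern is processed from its last character to its first, narrowing the range of suffixes that begin with the part of the pattern read so far.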

ReMatch

ReMatch is a web-based tool for integrating user-given stoichiometric metabolic models into a database collected from public data sources such as KEGG, MetaCyc, ChEBI and ARM. ReMatch is geared particularly towards 13C metabolic flux analysis: the model can be augmented with carbon mappings and exported for analysis in 13C flux analysis software.

Re[al-valued] Re[description] Mi[ning]

ReReMi is a Python implementation of a greedy algorithm for redescription mining. It uses on-the-fly discretization to handle numerical and categorical attributes. ReReMi is available for download here.

Details can be found in:

Esther Galbrun and Pauli Miettinen. From Black and White to Full Colour: Extending Redescription Mining Outside the Boolean World. In Statistical Analysis and Data Mining, 2012.

ReTrace

ReTrace is a computational method for inferring branching pathways in genome-scale metabolic networks.

RLCSA - Run-Length Compressed Suffix Array

RLCSA is a compressed suffix array implementation that has been optimized for highly repetitive text collections. Examples of such collections include version control data and individual genomes. This implementation has also been used as a testbed for various techniques related to compressed suffix arrays, such as space-efficient construction of the Burrows-Wheeler transform for large collections of texts.
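The Python sketch below illustrates why repetitive collections suit run-length compression: the Burrows-Wheeler transform of a repetitive text consists of a small number of long runs of equal characters. It is a toy illustration under stated assumptions, not RLCSA's data structure or interface; the function names are hypothetical and the transform is built by naive suffix sorting with `$` as an assumed sentinel.

```python
from itertools import groupby

def build_bwt(text):
    """Burrows-Wheeler transform via naive suffix sorting (illustration only)."""
    text += "$"  # sentinel, assumed smaller than every other character
    order = sorted(range(len(text)), key=lambda i: text[i:])
    return "".join(text[i - 1] for i in order)

def run_lengths(s):
    """Run-length encode a string as a list of (character, run length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]

# A text made of many copies of the same unit yields far fewer runs
# than characters, so storing runs instead of characters saves space.
repetitive = "abc" * 20
runs = run_lengths(build_bwt(repetitive))
print(len(repetitive) + 1, "characters,", len(runs), "runs")
```

The number of runs, rather than the text length, is what governs the size of a run-length encoded representation, which is why near-identical documents or genomes compress well this way.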

Sinuhe - Statistical Machine Translation tool

Sinuhe is a Statistical Machine Translation tool developed by Dr. Matti Kääriäinen. Its main characteristics are a conditional exponential family translation model utilizing parallel machine learning and a very fast decoder, which make it well suited for online information retrieval. Sinuhe is the default SMT engine in the SMART Search Engine. Sinuhe is freely available for download under the GPL.

Internals of Sinuhe are described in the following paper:

Matti Kääriäinen. Statistical Machine Translation using a Globally Trained Conditional Exponential Family Translation Model. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 1027–1036, Singapore, 2009

SMART Search Engine

SMART Search Engine is a web-based demonstrator for searching Wikipedia in one language using queries in another, and translating relevant pages on the fly back into the query language. The search engine was developed by the Algodan Machine Learning team and the HIIT/CosCo group as part of the EU FP6 STREP Statistical Multilingual Analysis for Retrieval and Translation. It integrates several cross-lingual information retrieval (CLIR) engines with statistical machine translation (SMT) tools.

SuDS project cst - compressed suffix tree implementation

Our implementation of compressed suffix trees (Sadakane, 2007) supports all typical suffix tree operations, including suffix links and lowest common ancestor queries, and requires less memory than a plain suffix array.

Testing Independent Components

A method for testing the statistical significance of independent components, based on doing ICA on many related datasets (e.g. for different subjects in neuroimaging) and evaluating the consistency of the results.