Data mining in genetics


Research topics

  • Biomine project: Knowledge discovery in biological databases
  • HPM (Haplotype Pattern Mining) and TreeDT: association analysis using haplotype or genotype data
  • HaploRec: haplotyping population-based genotype data
  • Defining and utilising the haplotype block structure of the human genome: what constitutes a block, and how can the block structure be extracted from haplotype and genotype data?
  • Oligogenic models for binary traits, in particular, Bayesian inference using recurrence risk data and Markov chain Monte Carlo (MCMC) simulation methods
  • Population genetic studies based on population simulations
  • AsVis: Visualization of association rules in SNP neighborhoods (on-line demo)


Locating genes that predispose to diseases is highly important for understanding the etiology of complex common diseases, such as heart disease or asthma. Gene mapping is the process of locating the genes likely to be involved in a given disease, based on phenotypic and marker data for a sample of people.

At the same time, public biological databases contain huge amounts of rich data: annotated sequences, proteins, domains, orthology groups, genes and gene expression data, gene and protein interactions, scientific articles, and ontologies. The Biomine project develops methods for the analysis of such collections of data, with candidate gene analysis as an example problem.

Mapping a disease can yield tens or hundreds of candidate genes. The next problem is then to identify the most promising genes for further research. The current state of the art consists largely of manual exploration of public databases, for instance to find connections between genes and phenotypes. The Biomine project develops methods for automated discovery and prediction of previously unknown and potentially biologically relevant connections. These methods help geneticists assess the potential relationship of their candidate genes to the disease under study.

For gene mapping with association analysis, the sample of patients and controls, and potentially their relatives, is genotyped and haplotyped, i.e. the two alleles at each marker locus in each individual are ordered according to parental origin. Association methods then search for alleles, and short strings of alleles at nearby (consecutive) markers, that correlate with patient-control status. The aim is to pinpoint the location of the disease susceptibility (DS) mutation as accurately as possible. Population history also plays an important role in how well any particular gene can be located, and its effect should be taken into account, e.g. by means of population simulations. We have developed two methods, HPM and TreeDT, for computationally efficient and accurate gene mapping based on association analysis. We have also developed tools for population simulations.
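The idea of searching for allele strings that correlate with patient-control status can be illustrated with a minimal sketch. It is not the HPM or TreeDT algorithm; it only scores one hypothetical haplotype pattern with a Pearson chi-square statistic on a 2x2 table. The function names, the `*` wildcard convention, and the toy haplotypes are assumptions made for illustration.

```python
def pattern_matches(haplotype, pattern):
    """Return True if the haplotype carries the pattern ('*' matches any allele)."""
    return all(p == "*" or p == h for h, p in zip(haplotype, pattern))


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the table [[a, b], [c, d]].

    No continuity correction is applied; returns 0.0 if any margin is empty.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def pattern_association(cases, controls, pattern):
    """Score how strongly a haplotype pattern is associated with case status."""
    a = sum(pattern_matches(h, pattern) for h in cases)      # cases with pattern
    c = sum(pattern_matches(h, pattern) for h in controls)   # controls with pattern
    return chi_square_2x2(a, len(cases) - a, c, len(controls) - c)


# Toy haplotypes over four consecutive markers, alleles coded '1' and '2'.
cases = ["1221", "1222", "1221", "2112"]
controls = ["2112", "2111", "1221", "2112"]
score = pattern_association(cases, controls, "12**")
```

In a real analysis many patterns would be scored along the chromosome, and significance would have to account for multiple testing, for example by permuting the case-control labels.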

Most association methods rely on the availability of haplotype data, which requires either recruiting and genotyping relatives of the study subjects or using population-based haplotyping methods. Recruiting relatives tends to increase study costs and the time spent on recruitment, and is sometimes not possible at all. To facilitate efficient association analyses, methods are needed that statistically reconstruct haplotypes from population-based genotype data without extra sampling of relatives. Our group has addressed this issue by developing HaploRec, a method for accurate and efficient reconstruction of haplotypes over long genetic distances.
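To make the haplotyping problem concrete, the sketch below enumerates every pair of haplotypes that is compatible with one unphased genotype. It is not HaploRec and contains no statistical model; it only shows why reconstruction is ambiguous: a genotype with k heterozygous markers has 2^(k-1) possible haplotype pairs for k >= 1. The data representation (one unordered allele pair per marker) and the function name are assumptions made for illustration.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """List all unordered haplotype pairs consistent with an unphased genotype.

    genotype: list of (allele, allele) tuples, one per marker, order unknown.
    """
    het = [i for i, (x, y) in enumerate(genotype) if x != y]
    pairs = []
    # The phase of the first heterozygous marker is fixed so that mirror-image
    # pairs (h1, h2) and (h2, h1) are not both generated.
    for flips in product([False, True], repeat=max(len(het) - 1, 0)):
        flip_at = dict(zip(het[1:], flips))
        h1, h2 = [], []
        for i, (x, y) in enumerate(genotype):
            if flip_at.get(i, False):
                x, y = y, x
            h1.append(x)
            h2.append(y)
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


# Three markers; the first and third are heterozygous, so two phasings exist.
genotype = [(1, 2), (1, 1), (1, 2)]
candidates = compatible_haplotype_pairs(genotype)
```

A population-based method then has to choose among such candidates, typically by preferring haplotypes that are estimated to be frequent in the sample; the exhaustive enumeration shown here grows exponentially and is only practical for short marker windows.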

HIIT Basic Research Unit Department of Computer Science