Data mining in genetics
HPM and TreeDT -- Data Mining in Linkage Disequilibrium Mapping
Gene Localization by Haplotype Pattern Mining and Tree Disequilibrium Test
Haplotype Pattern Mining (HPM) is the first data-mining-based approach to gene localization. It is particularly powerful in localizing susceptibility genes for human multifactorial diseases. It uses an efficient data mining algorithm to search for frequent patterns that are associated with the trait of interest (pseudocode representation). Unlike many other methods, HPM does not require the scientist to specify the disease model explicitly. HPM has been extended to handle quantitative phenotypes, to accommodate covariates, and to use genotype data directly instead of haplotype data. Furthermore, the method is highly robust to missing and erroneous data.
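As a rough illustration of what "patterns associated with the trait" can mean, the sketch below matches a haplotype pattern against haplotypes and scores the association with a Pearson chi-square statistic on a 2x2 table. It is a minimal sketch under stated assumptions, not the distributed HPM implementation: the wildcard symbol, the allele coding, the fixed marker count, and all function names are hypothetical, and the chi-square statistic is only one possible association measure.

```c
#include <assert.h>
#include <stddef.h>

#define N_MARKERS 4   /* assumed number of markers per haplotype */
#define WILDCARD  0   /* assumed "don't care" symbol; real alleles are 1, 2, ... */

/* Returns 1 if every non-wildcard position of the pattern equals the
 * corresponding haplotype allele, 0 otherwise. */
int pattern_matches(const int *pattern, const int *haplotype, size_t n)
{
    assert(pattern != NULL && haplotype != NULL);
    for (size_t i = 0; i < n; i++) {
        if (pattern[i] != WILDCARD && pattern[i] != haplotype[i])
            return 0;
    }
    return 1;
}

/* Pearson chi-square statistic for the 2x2 table
 *              affected  control
 *   match         a         b
 *   no match      c         d
 * Returns 0 when a row or column total is zero (statistic undefined). */
double chi_square_2x2(double a, double b, double c, double d)
{
    double n = a + b + c + d;
    double denom = (a + b) * (c + d) * (a + c) * (b + d);
    double diff = a * d - b * c;
    if (denom == 0.0)
        return 0.0;
    return n * diff * diff / denom;
}

/* Counts how the pattern splits affected and control haplotypes and
 * returns the resulting chi-square statistic. */
double pattern_association(const int *pattern,
                           const int haplotypes[][N_MARKERS],
                           const int *affected, size_t n_haplotypes)
{
    double a = 0, b = 0, c = 0, d = 0;
    for (size_t h = 0; h < n_haplotypes; h++) {
        int m = pattern_matches(pattern, haplotypes[h], N_MARKERS);
        if (m && affected[h])       a++;
        else if (m)                 b++;
        else if (affected[h])       c++;
        else                        d++;
    }
    return chi_square_2x2(a, b, c, d);
}
```

A caller would pass an array of haplotypes, a parallel array of 0/1 affection flags, and a candidate pattern, and keep the pattern if the returned statistic exceeds a chosen threshold. How candidate patterns are enumerated efficiently is the data mining part of the method and is not shown here.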
An implementation of the HPM method (in the C programming language) is available at http://www.cs.helsinki.fi/group/genetics/licentia_final.zip. The source files may be used freely for non-commercial purposes as long as you give credit to the first reference below ("Data mining applied to linkage disequilibrium mapping" by Toivonen et al.). The source files are provided "as is" without any warranty. You should compile them yourself for your specific environment, with commands such as
cd hpm_sources
gcc *.c -o hpm -lm
See Laitinen et al., Science, 9 April 2004, for a breakthrough: an asthma gene was located using HPM!
Tree Disequilibrium Test (TreeDT) extracts, essentially in the form of substrings and prefix trees, information about historical recombinations in the population. The information is used to locate fragments potentially inherited from a common diseased founder, and to map the disease gene into the most likely fragment. Like HPM, TreeDT does not require explicit specification of the disease model.
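To make the idea of a prefix tree over haplotypes concrete, the sketch below inserts the part of each haplotype that lies to the right of a candidate location into a tree and keeps case and control counts at every node. Haplotypes that share a long prefix from that location end up in the same deep subtree, which is the kind of structure in which shared ancestral fragments would show up. This is only an illustration of the data structure under stated assumptions, not the distributed TreeDT implementation: the allele coding (0 to 3), the struct layout, and all function names are hypothetical, and the statistical test itself is not shown.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_ALLELES 4  /* assumed: alleles are coded 0..MAX_ALLELES-1 */

typedef struct TrieNode {
    struct TrieNode *child[MAX_ALLELES];
    int n_cases;     /* case haplotypes whose prefix passes through this node */
    int n_controls;  /* control haplotypes whose prefix passes through this node */
} TrieNode;

/* Allocates an empty node; returns NULL on allocation failure. */
TrieNode *trie_new_node(void)
{
    TrieNode *node = malloc(sizeof *node);
    if (node == NULL)
        return NULL;
    for (int a = 0; a < MAX_ALLELES; a++)
        node->child[a] = NULL;
    node->n_cases = 0;
    node->n_controls = 0;
    return node;
}

/* Inserts alleles[start..n-1] below root, updating counts along the path.
 * Returns 0 on success, -1 on an invalid allele or allocation failure. */
int trie_insert(TrieNode *root, const int *alleles,
                size_t start, size_t n, int is_case)
{
    TrieNode *node = root;
    assert(root != NULL && alleles != NULL);
    if (is_case) root->n_cases++; else root->n_controls++;
    for (size_t i = start; i < n; i++) {
        int a = alleles[i];
        if (a < 0 || a >= MAX_ALLELES)
            return -1;
        if (node->child[a] == NULL) {
            node->child[a] = trie_new_node();
            if (node->child[a] == NULL)
                return -1;
        }
        node = node->child[a];
        if (is_case) node->n_cases++; else node->n_controls++;
    }
    return 0;
}

/* Frees a node and all of its descendants. */
void trie_free(TrieNode *node)
{
    if (node == NULL)
        return;
    for (int a = 0; a < MAX_ALLELES; a++)
        trie_free(node->child[a]);
    free(node);
}
```

A subtree in which cases heavily outnumber controls is a candidate for a fragment inherited from a common founder. A tree for the left side of the location could be built the same way by inserting the alleles in reverse order.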
An implementation of the TreeDT method (in the C programming language) is available at http://www.cs.helsinki.fi/group/genetics/treedt.tar. The source files may be used freely for non-commercial purposes as long as you give credit to the last reference below ("TreeDT: Tree pattern mining for gene mapping" by Sevon et al.). The source files are provided "as is" without any warranty.
Methodology papers on HPM and TreeDT
For more publications, see this more complete list of publications by the research group.
Data sets and figures referred to in the AJHG 2000 article
Simulated data sets (tarred and zipped):
Figures in the article:
Address for correspondence: