Data mining in genetics -- HPM and TreeDT

HPM and TreeDT -- Data Mining in Linkage Disequilibrium Mapping

Gene Localization by Haplotype Pattern Mining and Tree Disequilibrium Test

Haplotype Pattern Mining (HPM) is the first data-mining-based approach to gene localization. It is particularly powerful in localizing susceptibility genes for human multifactorial diseases. It uses an efficient data mining algorithm to search for frequent patterns that are associated with the trait of interest (pseudocode representation). Unlike many other methods, HPM does not require the scientist to explicitly specify the disease model. HPM has been extended to handle quantitative phenotypes, to accommodate covariates, and to use pure genotype data instead of haplotype data. Furthermore, the method is highly robust to missing and erroneous data.
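To make the idea of trait-associated patterns concrete, here is a minimal Python sketch. It is not the authors' implementation and not the algorithm in the pseudocode: it assumes haplotypes are equal-length strings of allele symbols, enumerates only contiguous patterns by brute force, measures association with a plain 2x2 chi-square statistic, and scores each marker by the number of associated patterns that span it. The function names, the max_len limit, and the 3.84 threshold are all illustrative assumptions.

```python
from collections import Counter


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def contiguous_patterns(haplotype, max_len):
    """Yield (start, alleles) for every window of at most max_len markers."""
    for start in range(len(haplotype)):
        last = min(start + max_len, len(haplotype))
        for end in range(start + 1, last + 1):
            yield start, tuple(haplotype[start:end])


def marker_scores(cases, controls, max_len=3, threshold=3.84):
    """Per marker, count the case-associated patterns that span it.

    Assumption: a pattern counts as associated when its chi-square
    statistic reaches the threshold and it is more frequent in cases.
    """
    case_counts = Counter(
        p for h in cases for p in contiguous_patterns(h, max_len))
    control_counts = Counter(
        p for h in controls for p in contiguous_patterns(h, max_len))
    scores = [0] * len(cases[0])
    for (start, alleles), a in case_counts.items():
        c = control_counts[(start, alleles)]
        stat = chi_square_2x2(a, len(cases) - a, c, len(controls) - c)
        if stat >= threshold and a * len(controls) > c * len(cases):
            for marker in range(start, start + len(alleles)):
                scores[marker] += 1
    return scores


# Toy data (hypothetical): four case and four control haplotypes, four markers.
cases = ["1221", "1222", "2221", "1221"]
controls = ["1111", "2112", "1211", "2121"]
print(marker_scores(cases, controls))  # prints [1, 4, 4, 1]
```

In this toy data the middle two markers are spanned by the most associated patterns, so they would be the candidates for the gene location. A real analysis would also need patterns with gaps, handling of missing alleles, and a significance assessment, none of which is attempted here.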

An implementation of the HPM method (in the C programming language) is available at http://www.cs.helsinki.fi/group/genetics/licentia_final.zip. The source files may be used freely for non-commercial purposes as long as you give credit to the first reference below ("Data mining applied to linkage disequilibrium mapping" by Toivonen et al.). The source files are provided "as is" without any warranty. You should compile them yourself for your specific environment, with commands such as

cd hpm_sources
gcc *.c -o hpm -lm

See Laitinen et al., Science, 9 April 2004, for a breakthrough: an asthma gene was located using HPM!

Tree Disequilibrium Test (TreeDT) extracts, essentially in the form of substrings and prefix trees, information about historical recombinations in the population. The information is used to locate fragments potentially inherited from a common diseased founder, and to map the disease gene into the most likely fragment. Like HPM, TreeDT does not require explicit specification of the disease model.
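The prefix-tree representation mentioned above can be illustrated with a small Python sketch. It is only a sketch of the data structure, not the TreeDT test itself or the authors' C code: it assumes haplotypes are equal-length strings of allele symbols, builds a trie over the part of each haplotype to the right of a chosen marker, and stores case and control counts in every node. The function name build_prefix_tree and the dictionary layout are hypothetical.

```python
def build_prefix_tree(haplotypes, statuses, start):
    """Build a trie over haplotype[start:], counting cases/controls per node.

    haplotypes: equal-length strings of allele symbols (assumed format)
    statuses:   True for a case haplotype, False for a control haplotype
    start:      index of the first marker included in the tree
    """
    root = {"cases": 0, "controls": 0, "children": {}}
    for hap, is_case in zip(haplotypes, statuses):
        key = "cases" if is_case else "controls"
        node = root
        node[key] += 1
        for allele in hap[start:]:
            # Haplotypes that share a prefix share the path down to this node.
            node = node["children"].setdefault(
                allele, {"cases": 0, "controls": 0, "children": {}})
            node[key] += 1
    return root


# Toy data (hypothetical): three case haplotypes and one control haplotype.
tree = build_prefix_tree(
    ["1221", "1222", "2221", "1211"], [True, True, True, False], start=1)
shared = tree["children"]["2"]["children"]["2"]
print(shared["cases"], shared["controls"])  # prints 3 0
```

A subtree that holds many cases and few controls, like the "22" branch here, corresponds to a shared fragment of the kind the method looks for. How such subtrees are turned into a test statistic is not covered by this sketch.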

An implementation of the TreeDT method (in the C programming language) is available at http://www.cs.helsinki.fi/group/genetics/treedt.tar. The source files may be used freely for non-commercial purposes as long as you give credit to the last reference below ("TreeDT: Tree pattern mining for gene mapping" by Sevon et al.). The source files are provided "as is" without any warranty.

Methodology papers on HPM and TreeDT

Data mining applied to linkage disequilibrium mapping by Toivonen HTT, Onkamo P, Vasko K, Ollikainen V, Sevon P, Mannila H, Herr M and Kere J. American Journal of Human Genetics 67:133-145, 2000.
Gene mapping by haplotype pattern mining by Hannu TT Toivonen, Päivi Onkamo, Kari Vasko, Vesa Ollikainen, Petteri Sevon, Heikki Mannila, and Juha Kere. In IEEE International Symposium on Bio-Informatics and Biomedical Engineering (BIBE 2000), 99 - 108, Arlington, Virginia, November 2000. IEEE.
TreeDT: Gene mapping by tree disequilibrium test by Petteri Sevon, Hannu TT Toivonen, and Vesa Ollikainen. In The Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2001), 365 - 370, San Francisco, California, August 2001. ACM.
Mining associations between genetic markers, phenotypes and covariates by Petteri Sevon, Vesa Ollikainen, Päivi Onkamo, Hannu TT Toivonen, Heikki Mannila, and Juha Kere. Genetic Epidemiology, 21(Suppl 1): S588 - S593, 2001.
Association analysis by data mining tools by Päivi Onkamo, Petteri Sevon, Vesa Ollikainen, Hannu TT Toivonen, Heikki Mannila, and Juha Kere. American Journal of Human Genetics 69(4, Suppl. 1): 1320, October 2001.
Association analysis for quantitative traits by data mining: QHPM by Onkamo P, Ollikainen V, Sevon P, Toivonen HTT, Mannila H, Kere J. Annals of Human Genetics 66:419-429, 2002.
Algorithms for Association-Based Gene mapping, PhD thesis, Petteri Sevon. Department of Computer Science, Report A-2004-4.
Gene Mapping by Pattern Discovery by Petteri Sevon, Hannu T.T. Toivonen, and Päivi Onkamo. In J. Wang et al (Eds.), Data Mining in Bioinformatics, 105-126. Springer, 2005. (manuscript)
TreeDT: Tree pattern mining for gene mapping by Petteri Sevon, Hannu Toivonen, Vesa Ollikainen. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 3 (2): 174-185, April-June 2006.

For more publications, see the more complete list of publications by the research group.

Data sets and figures referred to in the AJHG 2000 article

Simulated data sets (tarred and zipped):

HPM algorithm:

Pseudocode representation
An implementation of the program (in the C programming language) can be requested by sending e-mail to Jyrki Ingman at Licentia Ltd.

Figures in the article:

Results:

Tables in numeric format

Address for correspondence:

Hannu Toivonen
Department of Computer Science
P.O. Box 68
FI-00014 University of Helsinki
Finland
E-mail: firstname.lastname@cs.helsinki.fi