Lauri Eronen, HIIT/BRU
Haplotypes consisting of genetic markers (polymorphisms in DNA) are important for many genetic applications, such as gene mapping. Current laboratory methods produce genotypes, i.e., an unordered pair of alleles for each marker, one from each chromosome of the respective pair, but without information about which allele comes from which chromosome (haplotype) in the pair.
We consider population-based haplotyping: given a set of unrelated, genotyped subjects, the task is to infer their haplotypes. The search space is exponential already for a single individual: with k heterozygous markers, the number of different possible haplotype configurations is 2^{k-1}. Under certain realistic assumptions on the genetic processes behind the data, it is, however, often possible to infer the haplotypes with high confidence.
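To make the size of the search space concrete, the following minimal Python sketch enumerates the haplotype pairs consistent with one genotype. The function name `haplotype_configurations` and the encoding of a genotype as a list of allele pairs are assumptions made for illustration only; they are not taken from the talk. The phase of the first heterozygous marker is held fixed so that each unordered pair is counted once, which gives 2^{k-1} configurations for k >= 1 heterozygous markers.

```python
from itertools import product


def haplotype_configurations(genotype):
    """Enumerate the unordered haplotype pairs consistent with a genotype.

    genotype: list of (a, b) allele pairs, one unordered pair per marker
    (hypothetical encoding chosen for this sketch).
    """
    # Indices of heterozygous markers, i.e. markers with two different alleles.
    het = [i for i, (a, b) in enumerate(genotype) if a != b]
    if not het:
        # Fully homozygous genotype: both haplotypes are identical.
        h = tuple(a for a, _ in genotype)
        return [(h, h)]

    configs = []
    # Keep the first heterozygous marker as given and try both phases for
    # each of the remaining k - 1 heterozygous markers.
    for flips in product([False, True], repeat=len(het) - 1):
        flip_at = dict(zip(het[1:], flips))
        h1, h2 = [], []
        for i, (a, b) in enumerate(genotype):
            if flip_at.get(i, False):
                a, b = b, a
            h1.append(a)
            h2.append(b)
        configs.append((tuple(h1), tuple(h2)))
    return configs


# Three heterozygous markers (positions 0, 2 and 3), one homozygous marker.
example = [(0, 1), (1, 1), (0, 1), (1, 0)]
print(len(haplotype_configurations(example)))
```

Exhaustive enumeration of this kind is feasible only for a handful of heterozygous markers, which is why haplotyping methods rely on assumptions about the haplotype distribution in the population.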
Several methods have recently been developed for haplotyping under the assumption of a small number of different haplotypes, and they are usually based on counting frequencies of different haplotypes. However, when the markers are located farther apart or the population is older, more recombinations will have occurred between the markers, and the number of different haplotypes is potentially very large; in the extreme case, each haplotype in the sample can actually be unique. Previous methods also tend to have high computational complexity in the number of markers.
In this talk we introduce models and algorithms for haplotyping that are based on local haplotype fragments instead of complete haplotypes. The motivation is to obtain better generalization ability even when recombinations are relatively abundant in the haplotypes. We give a family of Markovian probability models for haplotype distributions. The models are based on simple Markov chains (of order 1), Markov chains of higher order, and Markov chains of variable order. We also outline methods that approximate the most likely haplotype configurations under a given model. The methods scale up to large numbers of markers.
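As an illustration of the simplest member of this model family, the sketch below scores haplotypes with a first-order Markov chain and finds the most likely haplotype pair for a single genotype by dynamic programming over the phase at each marker. It is not the algorithm presented in the talk: the function name `most_likely_phase`, the biallelic 0/1 encoding, the parameters `init` and `trans`, and the assumption that the two haplotypes are independent draws from the same chain with strictly positive probabilities are all choices made for this example. In practice the parameters would have to be estimated from the genotype sample, which the sketch does not cover.

```python
import math


def most_likely_phase(genotype, init, trans):
    """Most likely haplotype pair under a first-order Markov chain (sketch).

    genotype: list of (a, b) allele pairs with alleles in {0, 1}.
    init[a]: assumed probability of allele a at the first marker.
    trans[i][a][b]: assumed probability of allele b at marker i + 1
        given allele a at marker i. All probabilities must be > 0.
    Returns (haplotype 1, haplotype 2, log-probability of the pair).
    """
    def states(g):
        # Ordered allele assignments (haplotype 1, haplotype 2) at a marker.
        a, b = g
        return [(a, b)] if a == b else [(a, b), (b, a)]

    # best[s] = (log-probability, haplotype 1 so far, haplotype 2 so far)
    best = {
        s: (math.log(init[s[0]]) + math.log(init[s[1]]), [s[0]], [s[1]])
        for s in states(genotype[0])
    }
    for i in range(1, len(genotype)):
        t = trans[i - 1]
        new_best = {}
        for s in states(genotype[i]):
            # Extend every surviving path by state s and keep the best one.
            candidates = [
                (lp + math.log(t[p[0]][s[0]]) + math.log(t[p[1]][s[1]]), h1, h2)
                for p, (lp, h1, h2) in best.items()
            ]
            score, h1, h2 = max(candidates, key=lambda c: c[0])
            new_best[s] = (score, h1 + [s[0]], h2 + [s[1]])
        best = new_best
    score, h1, h2 = max(best.values(), key=lambda c: c[0])
    return tuple(h1), tuple(h2), score


# Two heterozygous markers whose alleles tend to be inherited together.
init = [0.5, 0.5]
trans = [[[0.9, 0.1], [0.1, 0.9]]]
h1, h2, logp = most_likely_phase([(0, 1), (0, 1)], init, trans)
print(h1, h2, math.exp(logp))
```

Because each marker contributes at most two phase states, the running time of this sketch grows linearly with the number of markers, in contrast to the exponential number of configurations per individual. Higher-order or variable-order chains would condition each allele on a longer local fragment instead of a single preceding marker.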
Experimental results with simulated and real data show that the proposed approach performs very well compared to existing methods, especially when the markers are spaced sparsely or the number of markers is large.