--------------------------------------
- HaploRec version 2.3 -
- Documentation -
--------------------------------------
1. Introduction
---------------
HaploRec is a statistical haplotype reconstruction algorithm targeted for
large-scale disease association studies. Its most detailed and up-to-date
description is reference [1].
1.1 Platform requirements
The program is implemented in java, JDK version 1.5.
It should work on all platforms with a java 1.5 compatible virtual
machine. To see if your platform has java 1.5, type "java -version"
on the command line. Any version equal to or higher than 1.5.0
should be OK. A java virtual machine can be freely downloaded from
http://www.java.com/en/download/.
2. Installing the program
-------------------------
The program is bundled as an executable java archive (jar) file "HaploRec.jar".
No specific installation actions are required; once copied to a local
directory, the software can be run with the following command line:
java -Xmx<memory>m -jar HaploRec.jar <options> <inputfile>
Here, <memory> is the amount of memory (in megabytes)
allocated to the java virtual machine. For example, use -Xmx1024m to allocate
1GB of memory. <options> may be left empty (see section 4 for details on
the options).
For convenience, a wrapper script "haplorec" is provided (works only on
linux and unix). Using the script, HaploRec can be run more concisely as:
haplorec <options> <inputfile>
On unix-based systems, it may be necessary to give execution rights to the
wrapper script. This can be done with the following command:
chmod u+x haplorec
To handle data sets with a large number of markers, the program needs to be
run in "windowing mode", where the data set is partitioned into smaller,
partially overlapping windows that are haplotyped separately and then
combined into complete haplotypes. This is described in section 5.
3. File Formats
---------------
3.1 Input file
The program takes as an argument a single file containing a (multilocus)
genotype for each subject. The first line of the input file must be a header
containing the following items:
- text "Id" (column title for subject id)
- Names of markers. (Note that markers must be ordered according to
their order on the chromosome)
The rest of the file contains the genotype data. Each genotype is divided
into two lines of the input file, with the first field denoting the subject
id and the rest of the fields denoting alleles at each marker. The allele
pair at each locus is thus split across two lines. Alleles smaller than 10
are given as integers starting from 1 (0 is not permitted). Alleles
starting from 10 are represented with letters starting from A (A for
the 10th allele, B for the 11th allele and so on). The maximum number of
different alleles is thus 35: 9 for numeric characters and 26 for
letters of the alphabet. Note that memory usage is more efficient
when using only numbers starting from 1 (e.g. {1,2,3,4} instead
of {A,B,C,D}). Missing alleles are denoted by a zero.
Fields can be separated by an arbitrary amount of white space.
A simple example of a genotype file follows:
Id M1 M2 M3
1 1 1 4
1 2 1 3
2 2 1 1
2 2 2 2
3 3 0 3
3 1 0 3
Any strings of non-white-space characters can be used as names of markers
and subject ids.
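As an illustration of the encoding above, the following is a minimal Java sketch (not part of HaploRec; the class and method names are hypothetical) that decodes allele characters ('0' = missing, '1'-'9' = alleles 1-9, 'A'-'Z' = alleles 10-35) and splits a subject's two input lines into allele arrays:

```java
import java.util.Arrays;

public class GenotypeParser {
    // Decode one allele character as described above:
    // '0' = missing, '1'-'9' = alleles 1-9, 'A'-'Z' = alleles 10-35.
    static int decodeAllele(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'A' && c <= 'Z') return c - 'A' + 10;
        throw new IllegalArgumentException("Invalid allele: " + c);
    }

    // Parse the two input lines of one subject into a pair of allele arrays.
    // Fields are separated by arbitrary white space; the first field is the id.
    static int[][] parseSubject(String line1, String line2) {
        String[] f1 = line1.trim().split("\\s+");
        String[] f2 = line2.trim().split("\\s+");
        int n = f1.length - 1;                 // number of markers
        int[][] alleles = new int[2][n];
        for (int i = 0; i < n; i++) {
            alleles[0][i] = decodeAllele(f1[i + 1].charAt(0));
            alleles[1][i] = decodeAllele(f2[i + 1].charAt(0));
        }
        return alleles;
    }

    public static void main(String[] args) {
        // Subject 3 from the example file above (0 = missing allele).
        int[][] a = parseSubject("3 3 0 3", "3 1 0 3");
        System.out.println(Arrays.deepToString(a));  // [[3, 0, 3], [1, 0, 3]]
    }
}
```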
3.2 Output files
There are three types of output files. The main output file "XXX.reconstructed"
is a list of best reconstructions (one haplotype pair for each genotype). File
"XXX.pairs" lists a number (specified as a parameter) of the most probable
reconstructions for each genotype.
File "XXX.summary" gives summary information about the results.
The output file format is almost identical to the input file format.
Again, there are two lines for each subject. This time, each line
denotes one haplotype (in the genotype file format the alleles were
arbitrarily divided between the two lines of the subject).
The pairs and summary files are only output if requested
with the corresponding options.
The following additional fields are included:
* CONFIDENCE = probability of the reconstructed haplotype
configuration for this subject, according to the learned model.
* CONFID_ID = index of the reconstructed configuration; 1 for the
best configuration, 2 for the second-best and so on.
This field only occurs in the ".pairs" file.
* CONFIG_PRIOR_LOG_PROB = logarithm of the probability of drawing
the reconstructed haplotype configuration
from the distribution specified by
the learned model. Note that this field
is only output when using option -od.
* HAPLO_PRIOR_LOG_PROB = logarithm of the probability of drawing the
(reconstructed) individual haplotype from the
distribution specified by the learned model.
Note that this field is only output when using
option -od.
* GENOTYPE_LOG_PROB = logarithm of the probability of drawing the
genotype from the distribution specified by the
learned model. Note that this field is only output
when using option -od.
The summary file contains three fields:
* LOG_LIKELIHOOD_OF_DATA = logarithm of the probability of drawing
the sample of genotypes from the distribution of haplotypes specified
by the current model. It can be interpreted as a "goodness of fit" for
the model. It is obtained by multiplying the probabilities of the individual
genotypes. The probability of a genotype is just the sum of the
probabilities of all haplotype configurations consistent with the
genotype; however, as it is usually computationally infeasible to check all
compatible configurations, the summation in practice only includes the
configurations evaluated by the used pruning strategy.
* LOG_LIKELIHOOD_OF_RECONSTRUCTED_CONFIGURATIONS=logarithm
of the probability for obtaining the reconstructed set of haplotype
configurations by a random draw from the haplotype distribution
specified by the current model.
* MODEL_SIZE=size of the data structure used to store the
model (fragments and their frequencies).
The summary file is only output when using option -s or -summaryfile.
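To make the LOG_LIKELIHOOD_OF_DATA computation concrete, here is a small hypothetical Java sketch (not actual HaploRec code, using toy probabilities): the probability of each genotype is the sum of the probabilities of its evaluated configurations, and multiplying these probabilities across genotypes corresponds to summing their logarithms.

```java
public class SummarySketch {
    // Approximate log-likelihood of the data, as described above: sum the
    // probabilities of the evaluated configurations of each genotype, then
    // sum the logarithms over all genotypes (multiplying probabilities is
    // adding logs). Hypothetical sketch, not actual HaploRec code.
    static double logLikelihoodOfData(double[][] configProbsPerGenotype) {
        double ll = 0.0;
        for (double[] configProbs : configProbsPerGenotype) {
            double pGenotype = 0.0;
            for (double p : configProbs) pGenotype += p;  // sum over configs
            ll += Math.log(pGenotype);                    // multiply = add logs
        }
        return ll;
    }

    public static void main(String[] args) {
        // Two genotypes: one with two evaluated configurations, one with one.
        double[][] probs = { { 0.6, 0.3 }, { 0.8 } };
        System.out.println(logLikelihoodOfData(probs));  // log(0.9) + log(0.8)
    }
}
```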
4. Running the program
----------------------
The program can be run without any additional arguments besides the
input file; in this case default parameters are used. It is,
however, advisable to use options to tune the algorithm, depending on
the properties of the data set. For large data sets, this may be necessary
to obtain a reasonable trade-off between the quality of results and running
time, or to prevent running out of memory.
If other arguments are specified, they are specified using command-line options.
For each option there is a long name and possibly a short name and a default
value. For some options there is a fixed set of possible values [val1, val2,
val3, ...]. If this is not the case, the option requires either a numeric
or file name argument, or no argument at all.
In the following list, each option is described as follows:
-longname (-shortname) [val1, val2 val3, ...] defaultval
4.1 Modeling options
-haplotype_probability_model (-p) [VMM, VMM_AVG, S], default=VMM_AVG
This option selects the model that is used to combine haplotype fragment
frequencies into a probability for complete haplotype.
Possible values:
* VMM=variable order Markov model
* VMM_AVG=variable order Markov model, probabilities smoothed
over several different context lengths
* S=segmentation model
The set of fragments included in the model is governed by the following parameters:
-minfreq_abs (-mfa) default=2.0 for VMM models, 1.2 for the Segmentation model
Minimum absolute frequency for fragments included in the model.
Cannot be used with minfreq_rel.
-minfreq_rel (-mfr)
Minimum relative frequency for fragments included in the model.
Cannot be used with minfreq_abs.
-max_pattern_length (-maxpl, default=30 with the VMM models, 50 with the
segmentation model)
Maximum length of fragments to be included in the model.
-max_num_missing_alleles (-m, default=3)
Maximum number of missing alleles in a fragment.
Allowing missing alleles significantly increases memory usage,
so reducing this further from the default value of 3 may be
necessary for large data sets with many missing alleles. On the
other hand, for smaller data sets, increasing this may slightly
increase accuracy.
4.2 Reconstruction algorithm options
-num_iters (-n), default=-1
Number of iterations to perform. Each iteration consists of learning
a model, according to the current haplotype estimates, and
reconstructing the haplotypes, according to the newly learned model.
If -1 is given, the number of iterations is not predefined; instead,
the iteration terminates when the likelihood of the
data no longer improves.
The first iteration is special in that there is not
yet any information about the haplotypes; the model is learned directly
from the genotype data, as described in [2]. Note that
when the number of iterations is 1 and pruning_strategy=pl, the program
works as the algorithm described in [2].
-pruning_strategy (-ps) [sequential, pl], default=sequential
Method for restricting the search of most probable configurations
for a single genotype. Two options are provided:
* sequential=building configuration from left to right one marker
at a time
* pl=hierarchical divide and conquer approach
-max_atomic_segment_length (-a), default=8
Maximum length for a partial configuration that is reconstructed
by exhaustive search. For the pl strategy, the whole map is
broken into atomic segments. For the sequential strategy,
this gives the number of leftmost markers that are reconstructed
exhaustively; the rest of the markers are ligated one by one.
-num_best_partial_configurations_to_ligate (-B), default=25
For the pl pruning strategy, the maximum number of partial
configurations to ligate with the neighboring segment.
For the sequential strategy, the maximum number of
already-constructed configurations to ligate with each new marker.
Specify "all" for no restriction. Increasing this may lead to
slightly more accurate results.
-num_best_configurations_one_iter (-C), default=10
Number of best configurations of each genotype to use for learning
the fragment frequencies for the next iteration.
Specify "all" for no restriction. Increasing this may decrease
overfitting and lead to slightly more accurate results, especially
for data sets with a small number of subjects.
4.3 Output options
-num_best_configurations_to_output (-o), default=1
If a value greater than 1 is given, then another output file,
"basename.pairs", is produced, listing the given number of best
configurations for each genotype.
-output_files_prefix (-op) default=
By default, the output file name is just the input file name with
the corresponding suffix (".reconstructed", ".pairs", or ".summary")
appended.
-pattern_format [short, long], default=long
Format to use when outputting fragment files. "long" is more
readable than "short", but takes more space.
-output_intermediate_results (-i)
Output result files for each iteration, instead of just the final
reconstructed haplotypes. Note that this is the only way to output
the pattern frequencies.
-compute_patterns_for_last_iter (-cp)
Usually the last step of the algorithm is reconstructing haplotypes. This
option will make the program learn the fragment frequencies one more
time, from the final reconstructed haplotypes.
-output_detailed_probabilities (-od)
Write to output file also the following fields:
CONFIG_PRIOR_LOG_PROB, HAPLO_PRIOR_LOG_PROB, GENOTYPE_LOG_PROB
-summaryfile <filename>
Write summary information to the given file.
-output_summary_info (-s)
Write summary information to file basename.summary.
4.4 Misc. options
-loglevel [1, 2, 3, 4], default=2
1=debug, 2=info, 3=warning, 4=error
It is advisable to use the default "info" log level. The debug level
is mainly used for software development; it may provide more
information in case of an unexpected error. Using 3 or 4 is not
advised, as then no normal progress output will be produced; only
warnings/errors will be shown.
-help (-h)
Output short help.
5. Handling data sets with a large number of markers
----------------------------------------------------
The HaploRec program is directly applicable to data files containing up to
500--1500 markers. This practical limit depends on the number of genotypes,
amount of missing data, complexity of the haplotype distribution, program
parameters and the amount of available main memory.
We have extended the basic algorithm by a simple but efficient feature that
allows it to handle an arbitrarily large number of markers. Briefly,
the idea is to sequentially haplotype a window of w markers
at a time. In each window except the first one, phased haplotypes for the
first o markers are obtained from the haplotypes in the previous, overlapping
window. After a window is haplotyped, r of its last markers are discarded;
they are only used to get better estimates for the w-o-r markers in
the middle of the window. I.e., each window has length w=o+u+o+r, where u is
the number of markers unique to this window, the first o represents the
haplotypes obtained from the previous window and the second o those that are
used by the next window. The window is moved o+u markers at a time.
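The window layout above can be sketched in Java as follows (a hypothetical illustration, not actual HaploRec code); each window advances o+u = w-o-r markers, and the last window is simply truncated at the end of the marker map:

```java
import java.util.ArrayList;
import java.util.List;

public class WindowLayout {
    // Compute [start, end) marker ranges for the windowing mode described
    // above: each window has length w = o + u + o + r and is moved
    // o + u = w - o - r markers at a time; the last window is truncated
    // at the end of the marker map. Hypothetical sketch, not HaploRec code.
    static List<int[]> windows(int n, int w, int o, int r) {
        List<int[]> result = new ArrayList<>();
        int step = w - o - r;  // = o + u markers per move
        int start = 0;
        while (true) {
            int end = Math.min(start + w, n);
            result.add(new int[]{start, end});
            if (end == n) break;
            start += step;
        }
        return result;
    }

    public static void main(String[] args) {
        // 300 markers with the suggested defaults w=100, o=10, r=10.
        for (int[] win : windows(300, 100, 10, 10))
            System.out.println(win[0] + ".." + win[1]);
        // prints 0..100, 80..180, 160..260, 240..300
    }
}
```

Note how consecutive windows share 20 markers here (o carried over plus r discarded), so each marker in the interior of the map is estimated with context on both sides.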
This "windowing" version of haplorec is used by giving three additional parameters on the command line:
- w, the maximum window size (number of markers haplotyped by one run of the algorithm)
- o, the number of markers obtained from the previous window
- r, the number of "extra" markers in each window
The program is run in the windowing mode with the following command line:
haplorec <options> <inputfile> <w> <o> <r>
Here, <options> are options for the basic HaploRec algorithm (these may
be left empty). Reasonable default values for the windowing parameters
are w=100, o=10 and r=10.
An example command line (using default options for the basic algorithm):
haplorec data.hpm 100 10 10
6. Additional Notes
-------------------
- Although the program allows missing genotypes in the input data, it does
not perform inference of missing genotypes; markers missing in the input
file will also be missing in the result.
- The software uses the term "pattern" instead of the term
"fragment" used in [1].
- If the program seems to halt in the pattern search stage with no sign of
progress, the problem is probably that the set of patterns used in the
model is not restricted enough. This is governed by the following
options:
-num_patterns (-np)
-minfreq_abs (-mfa)
-minfreq_rel (-mfr)
-max_pattern_length (-maxpl)
- If the program seems to halt while reconstructing an atomic segment, decrease
-max_atomic_segment_length (-a)
- If the program seems to halt while ligating configurations, decrease
-num_best_partial_configurations_to_ligate (-B)
- If the program fails with an OutOfMemoryException, there are several
possibilities:
* allocate more memory to the java virtual machine.
For example, command "java -Xmx1024m -jar HaploRec.jar infile"
runs the virtual machine with 1024 megabytes of memory.
* use smaller model by using more stringent value for one of:
-num_patterns (-np)
-minfreq_abs (-mfa)
-minfreq_rel (-mfr)
-max_pattern_length (-maxpl)
-max_num_missing_alleles (-m)
* restrict the number of best configurations considered for each subject
by reducing one of:
-num_best_partial_configurations_to_ligate (-B)
-num_best_configurations_one_iter (-C)
7. License
-------------
This version of HaploRec is licensed for academic use only. For terms
of the license agreement, see the file "academiclicense.pdf".
8. Support
----------
Please send all questions, comments and bug reports to Lauri Eronen at
Department of Computer Science, University of Helsinki.
E-mail: lauri.eronen@cs.helsinki.fi
9. Version history
------------------
2.3 (February 2008) Current version. Introduces an improved Markov model,
where the conditional probabilities are smoothed over several different
context lengths. This is now the default model. More efficient splitting
and combining of files in the windowing version intended for chromosome-
wide data sets. Minor improvements in memory management.
2.2 (May 2007) Conceptually identical to version 2.1.
Extension program which enables handling of chromosome-wide data sets
is now implemented in java instead of perl, improving platform
independence.
2.1 (September 2006) The version described in [1]. Introduces an
improved segmentation model that gives more accurate results than
the segmentation model used in version 2.0.
2.0 (June 2005) This version introduced an EM-based algorithm and
a segmentation-based haplotype probability model, and featured several
computational improvements, allowing the use of larger data sets.
1.0 (January 2004) Original implementation of the algorithm introduced in [2].
Used the VMM model. Used a simple "optimistic matching" strategy for
fragment frequency estimation (instead of the EM algorithm used by
the later versions).
10. References
--------------
[1] L. Eronen, F. Geerts, H. Toivonen. HaploRec: Efficient
and accurate large-scale reconstruction of haplotypes.
BMC Bioinformatics 7:542, 2006.
[2] L. Eronen, F. Geerts, H. Toivonen. A Markov Chain
Approach to Reconstruction of Long Haplotypes. In
Proceedings of the 9th Pacific Symposium on
Biocomputing (PSB), 104-115, January 2004. World Scientific.