--------------------------------------
-        HaploRec version 2.3        -
-           Documentation            -
--------------------------------------

1. Introduction
---------------

HaploRec is a statistical haplotype reconstruction algorithm targeted
for large-scale disease association studies. Its most detailed and
up-to-date description is reference [1].

1.1 Platform requirements

The program is implemented in Java, JDK version 1.5. It should work on
all platforms with a Java 1.5 compatible virtual machine. To see if
your platform has Java 1.5, type "java -version" on the command line.
Any version higher than or equal to 1.5.0 should be OK. A Java virtual
machine can be freely downloaded from http://www.java.com/en/download/.

2. Installing the program
-------------------------

The program is bundled as an executable java archive (jar) file
"HaploRec.jar". No specific installation actions are required; once
copied to a local directory, the software can be run with the following
command line:

  java -Xmx<mem>m -jar HaploRec.jar <options> <genotype file>

Here, <mem> is the amount of memory (in megabytes) allocated to the
Java virtual machine. For example, use -Xmx1024m to allocate 1 GB of
memory. <options> may be left empty (see section 4 for details on the
options).

For convenience, a wrapper script "haplorec" is provided (works only on
Linux and Unix). Running HaploRec using the script can be done more
shortly as:

  haplorec <options> <genotype file>

In Unix-based systems, it may be necessary to give execution rights to
the wrapper script. This can be done with the following command:

  chmod u+x haplorec

To handle data sets with a large number of markers, the program needs
to be run in "windowing mode", where the data set is partitioned into
smaller, partially overlapping windows that are haplotyped separately
and then combined into complete haplotypes. This is described in
section 5 of these instructions.

3. File Formats
---------------

3.1 Input file

The program takes as an argument a single file containing a
(multilocus) genotype for each subject.
The first line of the input file must be a header, which contains the
following items:
 - the text "Id" (column title for subject id)
 - the names of the markers (note that markers must be ordered
   according to their order on the chromosome)

The rest of the file contains the genotype data. Each genotype is
divided into two lines of the input file, with the first field denoting
the subject id and the rest of the fields denoting the alleles at each
marker. The allele pair at each locus is thus divided between the two
lines. Alleles smaller than 10 are given as integers starting from 1
(0 is not permitted). Alleles starting from 10 are represented with
letters starting from A (A for the 10th allele, B for the 11th allele,
and so on). The maximum number of different alleles is thus 35: 9 for
numeric characters and 26 for letters of the alphabet. Note that memory
usage is more efficient when using only numbers starting from 1 (e.g.
{1,2,3,4} instead of {A,B,C,D}). Missing alleles are denoted by a zero.
Fields can be separated by an arbitrary amount of white space.

A simple example of a genotype file follows:

  Id M1 M2 M3
  1  1  1  4
  1  2  1  3
  2  2  1  1
  2  2  2  2
  3  3  0  3
  3  1  0  3

Any strings of non-white space characters can be used as names of
markers and subject ids.

3.2 Output files

There are 3 types of output files. The main output file
"XXX.reconstructed" is a list of best reconstructions (1 haplotype pair
for each genotype). File "XXX.pairs" lists a number (specified as a
parameter) of most probable reconstructions for each genotype. File
"XXX.summary" gives summary information about the results. The pairs
and summary files are only output if specified with the corresponding
options.

The output file format is almost identical to the input file format.
Again, there are two lines for each subject. This time, each line
denotes one haplotype (in the genotype file format the alleles were
arbitrarily divided between the two lines of the subject).
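The allele coding used by the input and output files (digits 1-9,
letters A-Z for alleles 10-35, 0 for a missing allele) can be sketched
with a small helper. This is an illustration only, not code from
HaploRec:

```java
// Sketch of the allele coding described above: alleles 1-9 are written
// as digits, alleles 10-35 as the letters A-Z, and 0 marks a missing
// allele. Illustrative only; this class is not part of HaploRec.
public class AlleleCode {

    // Convert an allele number (0-35) to its single-character
    // representation in the genotype/haplotype files.
    static char encode(int allele) {
        if (allele < 0 || allele > 35)
            throw new IllegalArgumentException("allele out of range: " + allele);
        if (allele <= 9)
            return (char) ('0' + allele);     // 0 = missing, 1-9 as digits
        return (char) ('A' + (allele - 10));  // 10 -> A, 11 -> B, ..., 35 -> Z
    }

    // Convert a file character back to the allele number.
    static int decode(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'A' && c <= 'Z') return c - 'A' + 10;
        throw new IllegalArgumentException("bad allele character: " + c);
    }
}
```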
There are the following additional fields:

 * CONFIDENCE = probability of the reconstructed haplotype
   configuration for this subject, according to the learned model.
 * CONFID_ID = index of the reconstructed configuration; 1 for the best
   configuration, 2 for the second best, and so on. This field only
   occurs in the ".pairs" file.
 * CONFIG_PRIOR_LOG_PROB = logarithm of the probability of drawing the
   reconstructed haplotype configuration from the distribution
   specified by the learned model. Note that this field is only output
   when using option -od.
 * HAPLO_PRIOR_LOG_PROB = logarithm of the probability of drawing the
   (reconstructed) individual haplotype from the distribution specified
   by the learned model. Note that this field is only output when using
   option -od.
 * GENOTYPE_LOG_PROB = logarithm of the probability of drawing the
   genotype from the distribution specified by the learned model. Note
   that this field is only output when using option -od.

The summary file contains three fields:

 * LOG_LIKELIHOOD_OF_DATA = logarithm of the probability of drawing the
   sample of genotypes from the distribution of haplotypes specified by
   the current model. It can be interpreted as a "goodness of fit" for
   the model. It is obtained by multiplying the probabilities of the
   individual genotypes. The probability of a genotype is just the sum
   of the probabilities of all haplotype configurations consistent with
   the genotype; however, as it is usually computationally infeasible
   to check all compatible configurations, the summation in practice
   only includes the configurations evaluated by the used pruning
   strategy.
 * LOG_LIKELIHOOD_OF_RECONSTRUCTED_CONFIGURATIONS = logarithm of the
   probability of obtaining the reconstructed set of haplotype
   configurations by a random draw from the haplotype distribution
   specified by the current model.
 * MODEL_SIZE = size of the data structure used to store the model
   (fragments and their frequencies).

The summary file is only output when using option -s or -summaryfile.
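The definition of LOG_LIKELIHOOD_OF_DATA above can be sketched as
follows. The class name and input representation are hypothetical, not
HaploRec's internal API:

```java
// Sketch of LOG_LIKELIHOOD_OF_DATA as described above: for each
// subject, P(genotype) is the sum of the probabilities of the
// evaluated haplotype configurations compatible with the genotype;
// the product over subjects becomes a sum of logarithms.
// Illustrative only; this class is not part of HaploRec.
public class DataLogLikelihood {

    // configProbs[i] holds the probabilities of the configurations
    // evaluated for subject i (in practice, only those kept by the
    // pruning strategy).
    static double logLikelihood(double[][] configProbs) {
        double total = 0.0;
        for (double[] probs : configProbs) {
            double genotypeProb = 0.0;
            for (double p : probs)
                genotypeProb += p;           // sum over evaluated configurations
            total += Math.log(genotypeProb); // product over subjects = sum of logs
        }
        return total;
    }
}
```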
4. Running the program
----------------------

The program can be run without any additional arguments besides the
input file; in this case default parameters are used. It is, however,
advisable to use options to tune the algorithm, depending on the
properties of the data set. For large data sets, this may be necessary
to obtain a reasonable trade-off between the quality of results and
running time, or to prevent running out of memory.

If other arguments are specified, they are specified using command-line
options. For each option there is a long name and possibly a short name
and a default value. For some options there is a fixed set of possible
values [val1, val2, val3, ...]. If this is not the case, the option
requires either a numeric or file name argument, or no argument at all.
In the following list, each option is described as follows:

  -longname (-shortname) [val1, val2, val3, ...] defaultval

4.1 Modeling options

-haplotype_probability_model (-p) [VMM, VMM_AVG, S], default=VMM_AVG
  Selects the model that is used to combine haplotype fragment
  frequencies into a probability for a complete haplotype. Possible
  values:
   * VMM = variable order Markov model
   * VMM_AVG = variable order Markov model, probabilities smoothed over
     several different context lengths
   * S = segmentation model

The set of fragments included in the model is governed by the following
parameters:

-minfreq_abs (-mfa), default=2.0 for the VMM models, 1.2 for the
segmentation model
  Minimum absolute frequency for fragments included in the model.
  Cannot be used together with -minfreq_rel.

-minfreq_rel (-mfr)
  Minimum relative frequency for fragments included in the model.
  Cannot be used together with -minfreq_abs.

-max_pattern_length (-maxpl), default=30 with the VMM models, 50 with
the segmentation model
  Maximum length of fragments to be included in the model.

-max_num_missing_alleles (-m), default=3
  Maximum number of missing alleles in a fragment.
  Allowing missing alleles significantly increases memory usage, so
  reducing this further from the default value of 3 may be necessary
  for large data sets with many missing alleles. On the other hand,
  for smaller data sets, increasing this may slightly increase
  accuracy.

4.2 Reconstruction algorithm options

-num_iters (-n), default=-1
  Number of iterations to perform. Each iteration consists of learning
  a model, according to the current haplotype estimates, and
  reconstructing the haplotypes, according to the newly learned model.
  If -1 is given, the number of iterations is not predefined; instead,
  the iteration terminates when the likelihood of the data no longer
  improves. The first iteration is special in that there is not yet
  any information about the haplotypes; the model is learned directly
  from the genotype data, as described in [2]. Note that when the
  number of iterations is 1 and pruning_strategy=pl, the program works
  as the algorithm described in [2].

-pruning_strategy (-ps) [sequential, pl], default=sequential
  Method for restricting the search of the most probable
  configurations for a single genotype. Two options are provided:
   * sequential = building the configuration from left to right, one
     marker at a time
   * pl = hierarchical divide and conquer approach

-max_atomic_segment_length (-a), default=8
  Maximum length for a partial configuration that is reconstructed by
  exhaustive search. For the pl strategy, the whole map is broken into
  atomic segments. For the sequential strategy, this gives the number
  of leftmost markers that are reconstructed exhaustively; the rest of
  the markers are ligated one by one.

-num_best_partial_configurations_to_ligate (-B), default=25
  For the pl pruning strategy, the maximum number of partial
  configurations to ligate with the neighboring segment. For the
  sequential strategy, the maximum number of already constructed
  configurations to ligate with each new marker. Specify "all" for no
  restriction.
  Increasing this may lead to slightly more accurate results.

-num_best_configurations_one_iter (-C), default=10
  Number of best configurations of each genotype to use for learning
  the fragment frequencies for the next iteration. Specify "all" for
  no restriction. Increasing this may decrease overfitting and lead to
  slightly more accurate results, especially for data sets having a
  small number of subjects.

4.3 Output options

-num_best_configurations_to_output (-o), default=1
  If a value greater than 1 is given, then another output file,
  "basename.pairs", is produced, listing the given number of best
  configurations for each genotype.

-output_files_prefix (-op), default=<input file name>
  By default, the output file name is just the input file name with
  the corresponding suffix (".reconstructed", ".pairs", or ".summary")
  appended.

-pattern_format [short, long], default=long
  Format to use when outputting fragment files. "long" is more
  readable than "short", but takes more space.

-output_intermediate_results (-i)
  Output result files for each iteration, instead of just the final
  reconstructed haplotypes. Note that this is the only way to output
  the pattern frequencies.

-compute_patterns_for_last_iter (-cp)
  Usually the last step of the algorithm is reconstructing haplotypes.
  This option will make the program learn the fragment frequencies one
  more time, from the final reconstructed haplotypes.

-output_detailed_probabilities (-od)
  Also write the following fields to the output file:
  CONFIG_PRIOR_LOG_PROB, HAPLO_PRIOR_LOG_PROB, GENOTYPE_LOG_PROB.

-summaryfile
  Write summary information to the given file.

-output_summary_info (-s)
  Write summary information to file <basename>.summary.

4.4 Misc. options

-loglevel [1, 2, 3, 4], default=2
  1=debug, 2=info, 3=warning, 4=error
  It is advisable to use the default "info" log level. The debug level
  is mainly used for software development; it is possible that using
  this level will provide more information in case of an unexpected
  error.
  Using 3 or 4 is not advised, as then no normal progress output will
  be produced; only warnings/errors will be shown.

-help (-h)
  Output short help.

5. Handling data sets with a large number of markers
----------------------------------------------------

The HaploRec program is directly applicable to data files containing
up to 500--1500 markers. This practical limit depends on the number of
genotypes, the amount of missing data, the complexity of the haplotype
distribution, the program parameters, and the amount of available main
memory.

We have extended the basic algorithm with a simple but efficient
feature that allows it to handle an arbitrarily large number of
markers. Briefly, the idea is to sequentially haplotype a window of w
markers at a time. In each window except the first one, phased
haplotypes for the first o markers are obtained from the haplotypes in
the previous, overlapping window. After a window is haplotyped, r of
its last markers are discarded; they are only used to get better
estimates for the w-o-r markers in the middle of the window. That is,
each window has length w=o+u+o+r, where u is the number of markers
unique to this window, the first o represents the haplotypes obtained
from the previous window, and the second o those that are used by the
next window. The window is moved o+u markers at a time. For example,
with w=100, o=10 and r=10, each window contains u=70 unique markers
and the window is advanced o+u=80 markers at a time.

This "windowing" version of HaploRec is used by giving three
additional parameters on the command line:
 - w, the maximum window size (number of markers haplotyped by one run
   of the algorithm)
 - o, the number of markers obtained from the previous window
 - r, the number of "extra" markers in each window

The program is run in the windowing mode with the following command
line:

  haplorec <genotype file> <w> <o> <r> <options>

Here, <options> are options for the basic HaploRec algorithm (these
may be left empty). Reasonable default values for the windowing
parameters are w=100, o=10 and r=10. An example command line (using
default options for the basic algorithm):

  haplorec data.hpm 100 10 10
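The window arithmetic above can be sketched with a small helper; this
is an illustration only, not HaploRec's actual windowing code:

```java
// Sketch of the windowing arithmetic described above: each window of
// w markers starts with o markers phased by the previous window, ends
// with r throw-away markers, and contributes u = w - 2*o - r markers
// unique to it. Illustrative only; this class is not part of HaploRec.
public class Windowing {

    // Number of markers unique to each (non-boundary) window.
    static int uniqueMarkers(int w, int o, int r) {
        return w - 2 * o - r; // w = o + u + o + r  =>  u = w - 2o - r
    }

    // Step by which the window is advanced: o + u markers at a time.
    static int step(int w, int o, int r) {
        return o + uniqueMarkers(w, o, r);
    }
}
```

With the suggested defaults w=100, o=10, r=10, each window contains 70
unique markers and the window advances 80 markers at a time.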
6. Additional Notes
-------------------

 - Although the program allows missing genotypes in the input data, it
   does not perform inference of missing genotypes; markers missing in
   the input file will also be missing in the result.

 - The software uses the term "pattern" instead of the term "fragment"
   used in [1].

 - If the program seems to halt in the pattern search stage with no
   sign of progress, the problem is probably that the set of patterns
   used in the model is not restricted enough. This is governed by the
   following options:
     -num_patterns (-np)
     -minfreq_abs (-mfa)
     -minfreq_rel (-mfr)
     -max_pattern_length (-maxpl)

 - If the program seems to halt while reconstructing an atomic
   segment, decrease -max_atomic_segment_length (-a).

 - If the program seems to halt while ligating configurations,
   decrease -num_best_partial_configurations_to_ligate (-B).

 - If the program fails with an OutOfMemoryException, there are
   several possibilities:
    * Allocate more memory to the Java virtual machine. For example,
      the command "java -Xmx1024m -jar HaploRec.jar infile" runs the
      virtual machine with 1024 megabytes of memory.
    * Use a smaller model by giving a more stringent value for one of:
        -num_patterns (-np)
        -minfreq_abs (-mfa)
        -minfreq_rel (-mfr)
        -max_pattern_length (-maxpl)
        -max_num_missing_alleles (-m)
    * Restrict the number of best configurations considered for each
      subject by reducing one of:
        -num_best_partial_configurations_to_ligate (-B)
        -num_best_configurations_one_iter (-C)

7. License
----------

This version of HaploRec is licensed for academic use only. For the
terms of the license agreement, see the file "academiclicense.pdf".

8. Support
----------

Please send all questions, comments and bug reports to Lauri Eronen at
the Department of Computer Science, University of Helsinki.
E-mail: lauri.eronen@cs.helsinki.fi

9. Version history
------------------

2.3 (February 2008)
  Current version.
  Introduces an improved Markov model, where the conditional
  probabilities are smoothed over several different context lengths.
  This is now the default model. More efficient splitting and
  combining of files in the windowing version intended for
  chromosome-wide data sets. Minor improvements in memory management.

2.2 (May 2007)
  Conceptually identical to version 2.1. The extension program which
  enables handling of chromosome-wide data sets is now implemented in
  Java instead of Perl, improving platform independence.

2.1 (September 2006)
  The version described in [1]. Introduces an improved segmentation
  model that gives more accurate results than the segmentation model
  used in version 2.0.

2.0 (June 2005)
  This version introduced an EM-based algorithm and a
  segmentation-based haplotype probability model, and featured several
  computational improvements, allowing the use of larger data sets.

1.0 (January 2004)
  Original implementation of the algorithm introduced in [2]. Used the
  VMM model. Used a simple "optimistic matching" strategy for fragment
  frequency estimation (instead of the EM algorithm used by the later
  versions).

10. References
--------------

[1] L. Eronen, F. Geerts, H. Toivonen. HaploRec: Efficient and
    accurate large-scale reconstruction of haplotypes. BMC
    Bioinformatics 7:542, 2006.

[2] L. Eronen, F. Geerts, H. Toivonen. A Markov Chain Approach to
    Reconstruction of Long Haplotypes. In Proceedings of the 9th
    Pacific Symposium on Biocomputing (PSB), 104-115, January 2004.
    World Scientific.