--------------------------------------
-        HaploRec version 2.3        -
-           Documentation            -
--------------------------------------

1. Introduction
---------------

HaploRec is a statistical haplotype reconstruction algorithm targeted
for large-scale disease association studies. Its most detailed and
up-to-date description is reference [1].

1.1 Platform requirements

The program is implemented in Java, JDK version 1.5. It should work on
all platforms with a Java 1.5 compatible virtual machine. To see if
your platform has Java 1.5, type "java -version" on the command line.
Any version higher than or equal to 1.5.0 should be OK. A Java virtual
machine can be freely downloaded from http://www.java.com/en/download/.

2. Installing the program
-------------------------

The program is bundled as an executable java archive (jar) file
"HaploRec.jar". No specific installation actions are required; once
copied to a local directory, the software can be run with the following
command line:

  java -Xmx<mem>m -jar HaploRec.jar <options> <genotype file>

Here, <mem> is the amount of memory (in megabytes) allocated to the
Java virtual machine. For example, use -Xmx1024m to allocate 1 GB of
memory. <options> may be left empty (see section 4 for details on the
options).

For convenience, a wrapper script "haplorec" is provided (works only on
Linux and Unix). Running HaploRec using the script can be done more
shortly as:

  haplorec <options> <genotype file>

In Unix-based systems, it may be necessary to give execution rights to
the wrapper script. This can be done with the following command:

  chmod u+x haplorec

To handle data sets with a large number of markers, the program needs
to be run in "windowing mode", where the data set is partitioned into
smaller, partially overlapping windows that are haplotyped separately
and then combined into complete haplotypes. This is described in
section 5 of these instructions.

3. File Formats
---------------

3.1 Input file

The program takes as an argument a single file containing a
(multilocus) genotype for each subject.
The first line of the input file must be a header, which contains the
following items:
 - the text "Id" (column title for subject id)
 - the names of the markers (note that markers must be ordered
   according to their order on the chromosome)

The rest of the file contains the genotype data. Each genotype is
divided into two lines of the input file, with the first field denoting
the subject id and the rest of the fields denoting the alleles at each
marker. The allele pair at each locus is thus divided between the two
lines. Alleles smaller than 10 are given as integers starting from 1
(0 is not permitted). Alleles starting from 10 are represented with
letters starting from A (A for the 10th allele, B for the 11th allele,
and so on). The maximum number of different alleles is thus 35: 9 for
numeric characters and 26 for letters of the alphabet. Note that memory
usage is more efficient when using only numbers starting from 1 (e.g.
{1,2,3,4} instead of {A,B,C,D}). Missing alleles are denoted by a zero.
Fields can be separated by an arbitrary amount of white space.

A simple example of a genotype file follows:

  Id M1 M2 M3
  1  1  1  4
  1  2  1  3
  2  2  1  1
  2  2  2  2
  3  3  0  3
  3  1  0  3

Any strings of non-white space characters can be used as names of
markers and subject ids.

3.2 Output files

There are 3 types of output files. The main output file
"XXX.reconstructed" is a list of best reconstructions (1 haplotype pair
for each genotype). File "XXX.pairs" lists a number (specified as a
parameter) of most probable reconstructions for each genotype. File
"XXX.summary" gives summary information about the results. The pairs
and summary files are only output if specified with the corresponding
options.

The output file format is almost identical to the input file format.
Again, there are two lines for each subject. This time, each line
denotes one haplotype (in the genotype file format the alleles were
arbitrarily divided between the two lines of the subject).
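The allele coding used by the input and output files (digits 1-9,
letters A-Z for alleles 10-35, 0 for a missing allele) can be sketched
with a small helper. This is an illustration only, not code from
HaploRec:

```java
// Sketch of the allele coding described above: alleles 1-9 are written
// as digits, alleles 10-35 as the letters A-Z, and 0 marks a missing
// allele. Illustrative only; this class is not part of HaploRec.
public class AlleleCode {

    // Convert an allele number (0-35) to its single-character
    // representation in the genotype/haplotype files.
    static char encode(int allele) {
        if (allele < 0 || allele > 35)
            throw new IllegalArgumentException("allele out of range: " + allele);
        if (allele <= 9)
            return (char) ('0' + allele);     // 0 = missing, 1-9 as digits
        return (char) ('A' + (allele - 10));  // 10 -> A, 11 -> B, ..., 35 -> Z
    }

    // Convert a file character back to the allele number.
    static int decode(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'A' && c <= 'Z') return c - 'A' + 10;
        throw new IllegalArgumentException("bad allele character: " + c);
    }
}
```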
There are the following additional fields:

 * CONFIDENCE = probability of the reconstructed haplotype
   configuration for this subject, according to the learned model.
 * CONFID_ID = index of the reconstructed configuration; 1 for the best
   configuration, 2 for the second best, and so on. This field only
   occurs in the ".pairs" file.
 * CONFIG_PRIOR_LOG_PROB = logarithm of the probability of drawing the
   reconstructed haplotype configuration from the distribution
   specified by the learned model. Note that this field is only output
   when using option -od.
 * HAPLO_PRIOR_LOG_PROB = logarithm of the probability of drawing the
   (reconstructed) individual haplotype from the distribution specified
   by the learned model. Note that this field is only output when using
   option -od.
 * GENOTYPE_LOG_PROB = logarithm of the probability of drawing the
   genotype from the distribution specified by the learned model. Note
   that this field is only output when using option -od.

The summary file contains three fields:

 * LOG_LIKELIHOOD_OF_DATA = logarithm of the probability of drawing the
   sample of genotypes from the distribution of haplotypes specified by
   the current model. It can be interpreted as a "goodness of fit" for
   the model. It is obtained by multiplying the probabilities of the
   individual genotypes. The probability of a genotype is just the sum
   of the probabilities of all haplotype configurations consistent with
   the genotype; however, as it is usually computationally infeasible
   to check all compatible configurations, the summation in practice
   only includes the configurations evaluated by the used pruning
   strategy.
 * LOG_LIKELIHOOD_OF_RECONSTRUCTED_CONFIGURATIONS = logarithm of the
   probability of obtaining the reconstructed set of haplotype
   configurations by a random draw from the haplotype distribution
   specified by the current model.
 * MODEL_SIZE = size of the data structure used to store the model
   (fragments and their frequencies).

The summary file is only output when using option -s or -summaryfile.
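The definition of LOG_LIKELIHOOD_OF_DATA above can be sketched as
follows. The class name and input representation are hypothetical, not
HaploRec's internal API:

```java
// Sketch of LOG_LIKELIHOOD_OF_DATA as described above: for each
// subject, P(genotype) is the sum of the probabilities of the
// evaluated haplotype configurations compatible with the genotype;
// the product over subjects becomes a sum of logarithms.
// Illustrative only; this class is not part of HaploRec.
public class DataLogLikelihood {

    // configProbs[i] holds the probabilities of the configurations
    // evaluated for subject i (in practice, only those kept by the
    // pruning strategy).
    static double logLikelihood(double[][] configProbs) {
        double total = 0.0;
        for (double[] probs : configProbs) {
            double genotypeProb = 0.0;
            for (double p : probs)
                genotypeProb += p;           // sum over evaluated configurations
            total += Math.log(genotypeProb); // product over subjects = sum of logs
        }
        return total;
    }
}
```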
4. Running the program
----------------------

The program can be run without any additional arguments besides the
input file; in this case default parameters are used. It is, however,
advisable to use options to tune the algorithm, depending on the
properties of the data set. For large data sets, this may be necessary
to obtain a reasonable trade-off between the quality of results and
running time, or to prevent running out of memory.

If other arguments are specified, they are specified using command-line
options. For each option there is a long name and possibly a short name
and a default value. For some options there is a fixed set of possible
values [val1, val2, val3, ...]. If this is not the case, the option
requires either a numeric or file name argument, or no argument at all.
In the following list, each option is described as follows:

  -longname (-shortname) [val1, val2, val3, ...] defaultval

4.1 Modeling options

-haplotype_probability_model (-p) [VMM, VMM_AVG, S], default=VMM_AVG
  Selects the model that is used to combine haplotype fragment
  frequencies into a probability for a complete haplotype. Possible
  values:
   * VMM = variable order Markov model
   * VMM_AVG = variable order Markov model, probabilities smoothed over
     several different context lengths
   * S = segmentation model

The set of fragments included in the model is governed by the following
parameters:

-minfreq_abs (-mfa), default=2.0 for the VMM models, 1.2 for the
segmentation model
  Minimum absolute frequency for fragments included in the model.
  Cannot be used together with -minfreq_rel.

-minfreq_rel (-mfr)
  Minimum relative frequency for fragments included in the model.
  Cannot be used together with -minfreq_abs.

-max_pattern_length (-maxpl), default=30 with the VMM models, 50 with
the segmentation model
  Maximum length of fragments to be included in the model.

-max_num_missing_alleles (-m), default=3
  Maximum number of missing alleles in a fragment.
  Allowing missing alleles significantly increases memory usage, so
  reducing this further from the default value of 3 may be necessary
  for large data sets with many missing alleles. On the other hand,
  for smaller data sets, increasing this may slightly increase
  accuracy.

4.2 Reconstruction algorithm options

-num_iters (-n), default=-1
  Number of iterations to perform. Each iteration consists of learning
  a model, according to the current haplotype estimates, and
  reconstructing the haplotypes, according to the newly learned model.
  If -1 is given, the number of iterations is not predefined; instead,
  the iteration terminates when the likelihood of the data no longer
  improves. The first iteration is special in that there is not yet
  any information about the haplotypes; the model is learned directly
  from the genotype data, as described in [2]. Note that when the
  number of iterations is 1 and pruning_strategy=pl, the program works
  as the algorithm described in [2].

-pruning_strategy (-ps) [sequential, pl], default=sequential
  Method for restricting the search of the most probable
  configurations for a single genotype. Two options are provided:
   * sequential = building the configuration from left to right, one
     marker at a time
   * pl = hierarchical divide and conquer approach

-max_atomic_segment_length (-a), default=8
  Maximum length for a partial configuration that is reconstructed by
  exhaustive search. For the pl strategy, the whole map is broken into
  atomic segments. For the sequential strategy, this gives the number
  of leftmost markers that are reconstructed exhaustively; the rest of
  the markers are ligated one by one.

-num_best_partial_configurations_to_ligate (-B), default=25
  For the pl pruning strategy, the maximum number of partial
  configurations to ligate with the neighboring segment. For the
  sequential strategy, the maximum number of already constructed
  configurations to ligate with each new marker. Specify "all" for no
  restriction.
  Increasing this may lead to slightly more accurate results.

-num_best_configurations_one_iter (-C), default=10
  Number of best configurations of each genotype to use for learning
  the fragment frequencies for the next iteration. Specify "all" for
  no restriction. Increasing this may decrease overfitting and lead to
  slightly more accurate results, especially for data sets having a
  small number of subjects.

4.3 Output options

-num_best_configurations_to_output (-o), default=1
  If a value greater than 1 is given, then another output file,
  "basename.pairs", is produced, listing the given number of best
  configurations for each genotype.

-output_files_prefix (-op), default=<input file name>
  By default, the output file name is just the input file name with
  the corresponding suffix (".reconstructed", ".pairs", or ".summary")
  appended.

-pattern_format [short, long], default=long
  Format to use when outputting fragment files. "long" is more
  readable than "short", but takes more space.

-output_intermediate_results (-i)
  Output result files for each iteration, instead of just the final
  reconstructed haplotypes. Note that this is the only way to output
  the pattern frequencies.

-compute_patterns_for_last_iter (-cp)
  Usually the last step of the algorithm is reconstructing haplotypes.
  This option will make the program learn the fragment frequencies one
  more time, from the final reconstructed haplotypes.

-output_detailed_probabilities (-od)
  Also write the following fields to the output file:
  CONFIG_PRIOR_LOG_PROB, HAPLO_PRIOR_LOG_PROB, GENOTYPE_LOG_PROB.

-summaryfile
  Write summary information to the given file.

-output_summary_info (-s)
  Write summary information to file <basename>.summary.

4.4 Misc. options

-loglevel [1, 2, 3, 4], default=2
  1=debug, 2=info, 3=warning, 4=error
  It is advisable to use the default "info" log level. The debug level
  is mainly used for software development; it is possible that using
  this level will provide more information in case of an unexpected
  error.
  Using 3 or 4 is not advised, as then no normal progress output will
  be produced; only warnings/errors will be shown.

-help (-h)
  Output short help.

5. Handling data sets with a large number of markers
----------------------------------------------------

The HaploRec program is directly applicable to data files containing
up to 500--1500 markers. This practical limit depends on the number of
genotypes, the amount of missing data, the complexity of the haplotype
distribution, the program parameters, and the amount of available main
memory.

We have extended the basic algorithm with a simple but efficient
feature that allows it to handle an arbitrarily large number of
markers. Briefly, the idea is to sequentially haplotype a window of w
markers at a time. In each window except the first one, phased
haplotypes for the first o markers are obtained from the haplotypes in
the previous, overlapping window. After a window is haplotyped, r of
its last markers are discarded; they are only used to get better
estimates for the w-o-r markers in the middle of the window. That is,
each window has length w=o+u+o+r, where u is the number of markers
unique to this window, the first o represents the haplotypes obtained
from the previous window, and the second o those that are used by the
next window. The window is moved o+u markers at a time. For example,
with w=100, o=10 and r=10, each window contains u=70 unique markers
and the window is advanced o+u=80 markers at a time.

This "windowing" version of HaploRec is used by giving three
additional parameters on the command line:
 - w, the maximum window size (number of markers haplotyped by one run
   of the algorithm)
 - o, the number of markers obtained from the previous window
 - r, the number of "extra" markers in each window

The program is run in the windowing mode with the following command
line:

  haplorec <genotype file> <w> <o> <r> <options>

Here, <options> are options for the basic HaploRec algorithm (these
may be left empty). Reasonable default values for the windowing
parameters are w=100, o=10 and r=10. An example command line (using
default options for the basic algorithm):

  haplorec data.hpm 100 10 10
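The window arithmetic above can be sketched with a small helper; this
is an illustration only, not HaploRec's actual windowing code:

```java
// Sketch of the windowing arithmetic described above: each window of
// w markers starts with o markers phased by the previous window, ends
// with r throw-away markers, and contributes u = w - 2*o - r markers
// unique to it. Illustrative only; this class is not part of HaploRec.
public class Windowing {

    // Number of markers unique to each (non-boundary) window.
    static int uniqueMarkers(int w, int o, int r) {
        return w - 2 * o - r; // w = o + u + o + r  =>  u = w - 2o - r
    }

    // Step by which the window is advanced: o + u markers at a time.
    static int step(int w, int o, int r) {
        return o + uniqueMarkers(w, o, r);
    }
}
```

With the suggested defaults w=100, o=10, r=10, each window contains 70
unique markers and the window advances 80 markers at a time.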
6. Additional Notes
-------------------

 - Although the program allows missing genotypes in the input data, it
   does not perform inference of missing genotypes; markers missing in
   the input file will also be missing in the result.

 - The software uses the term "pattern" instead of the term "fragment"
   used in [1].

 - If the program seems to halt in the pattern search stage with no
   sign of progress, the problem is probably that the set of patterns
   used in the model is not restricted enough. This is governed by the
   following options:
     -num_patterns (-np)
     -minfreq_abs (-mfa)
     -minfreq_rel (-mfr)
     -max_pattern_length (-maxpl)

 - If the program seems to halt while reconstructing an atomic
   segment, decrease -max_atomic_segment_length (-a).

 - If the program seems to halt while ligating configurations,
   decrease -num_best_partial_configurations_to_ligate (-B).

 - If the program fails with an OutOfMemoryException, there are
   several possibilities:
    * Allocate more memory to the Java virtual machine. For example,
      the command "java -Xmx1024m -jar HaploRec.jar infile" runs the
      virtual machine with 1024 megabytes of memory.
    * Use a smaller model by giving a more stringent value for one of:
        -num_patterns (-np)
        -minfreq_abs (-mfa)
        -minfreq_rel (-mfr)
        -max_pattern_length (-maxpl)
        -max_num_missing_alleles (-m)
    * Restrict the number of best configurations considered for each
      subject by reducing one of:
        -num_best_partial_configurations_to_ligate (-B)
        -num_best_configurations_one_iter (-C)

7. License
----------

This version of HaploRec is licensed for academic use only. For the
terms of the license agreement, see the file "academiclicense.pdf".

8. Support
----------

Please send all questions, comments and bug reports to Lauri Eronen at
the Department of Computer Science, University of Helsinki.
E-mail: lauri.eronen@cs.helsinki.fi

9. Version history
------------------

2.3 (February 2008)
  Current version.
  Introduces an improved Markov model, where the conditional
  probabilities are smoothed over several different context lengths.
  This is now the default model. More efficient splitting and
  combining of files in the windowing version intended for
  chromosome-wide data sets. Minor improvements in memory management.

2.2 (May 2007)
  Conceptually identical to version 2.1. The extension program which
  enables handling of chromosome-wide data sets is now implemented in
  Java instead of Perl, improving platform independence.

2.1 (September 2006)
  The version described in [1]. Introduces an improved segmentation
  model that gives more accurate results than the segmentation model
  used in version 2.0.

2.0 (June 2005)
  This version introduced an EM-based algorithm and a
  segmentation-based haplotype probability model, and featured several
  computational improvements, allowing the use of larger data sets.

1.0 (January 2004)
  Original implementation of the algorithm introduced in [2]. Used the
  VMM model. Used a simple "optimistic matching" strategy for fragment
  frequency estimation (instead of the EM algorithm used by the later
  versions).

10. References
--------------

[1] L. Eronen, F. Geerts, H. Toivonen. HaploRec: Efficient and
    accurate large-scale reconstruction of haplotypes. BMC
    Bioinformatics 7:542, 2006.

[2] L. Eronen, F. Geerts, H. Toivonen. A Markov Chain Approach to
    Reconstruction of Long Haplotypes. In Proceedings of the 9th
    Pacific Symposium on Biocomputing (PSB), 104-115, January 2004.
    World Scientific.