System requirements
===================

The program has been tested on x86_64 Linux machines. The source code is shell script and C++, and requires gcc and cmake for compilation.

Requirements for distributed computation
========================================

Python (tested with version 2.6.5)

The program and the validation working directory must be on an NFS file system accessible to all nodes. The local working directory must have the same absolute path on all nodes.

Sadly, the distribution script is not very robust in handling unexpected errors such as a node running out of disk space, so if something like that happens you may have to kill hanging processes manually.

Installation
============

1. Unpack the archive
2. Run make

Usage
=====

The program takes reference and query sequences in FASTA format.

Parameters explained:
====================

validate_local.sh: A script for running validation on a single machine.

Parameter                 (Default value)
--------------------------------------------
-h | --help                                     Display this text
--maxerror VALUE          (=0.05)               Allow error rate of VALUE when doing alignments with swift
--minlen VALUE            (=30)                 Only report alignments with length >= VALUE
--maxgap VALUE            (=0)                  Restrict gaps to at most VALUE when doing colinear chaining (0 = unlimited)
--gram VALUE              (=11)                 Use VALUE-gram for swift
-r | --reference PATH     (=reference.fasta)    Path to reference sequence
-s | --scaffolds PATH     (=scaffolds.fasta)    Path to query (scaffolds)
-w | --work PATH          (=validation_work/)   Where to store intermediary and final results

validate_noswift.sh: A script for running validation based on a precomputed swift alignments file.

Parameter                 (Default value)
--------------------------------------------
-h | --help                                     Display this text
--maxgap VALUE            (=0)                  Restrict gaps to at most VALUE when doing colinear chaining (0 = unlimited)
-t | --swift PATH         (=swift.results)      Path to swift results file
-r | --reference PATH     (=reference.fasta)    Path to reference sequence
-s | --scaffolds PATH     (=scaffolds.fasta)    Path to scaffolds
-w | --work PATH          (=validation_work/)   Where to store intermediary and final results

validate_distributed.sh: A script for running validation distributed on multiple computers.

Parameter                 (Default value)
--------------------------------------------
-h | --help                                     Display this text
--maxerror VALUE          (=0.05)               Allow error rate of VALUE when doing alignments with swift
--minlen VALUE            (=30)                 Only report alignments with length >= VALUE
--maxgap VALUE            (=0)                  Restrict gaps to at most VALUE when doing colinear chaining (0 = unlimited)
--gram VALUE              (=11)                 Use VALUE-gram for swift
-r | --reference PATH     (=reference.fasta)    Path to reference sequence
-s | --scaffolds PATH     (=scaffolds.fasta)    Path to scaffolds
-w | --work PATH          (=validation_work/)   Absolute path to final and intermediary results directory (must be visible to all nodes)
-o | --hosts PATH         (=hosts)              Path to hosts file
-l | --localwork PATH     (=/node/local/path)   Absolute path to local work directory (will be the same on all nodes)
-n | --numjobs NUMJOBS    (=10)                 Number of distributed jobs to be created

Example Usage:
==============

Some example data has been provided under the example/ directory:

psyringae.fasta      -- Reference sequence for Pseudomonas syringae pv. syringae B728a
mip-scaffolds.fasta  -- Scaffolds constructed for Psy B728a by MIP-scaffolder

Validating the scaffolds with unlimited gap:

Local computation:

    ./validate_local.sh -r example/psyringae.fasta -s example/mip-scaffolds.fasta -w psyringae_unlimgap_validation/

(Running swift should take about 5 minutes for this dataset on a typical desktop computer)

Distributed computation:

    ./validate_distributed.sh -r `pwd`/example/psyringae.fasta -s `pwd`/example/mip-scaffolds.fasta -w `pwd`/psyringae_unlimgap_validation/ -o hostfile -l /your/node/work/directory

To use this you need to specify the hostnames of your nodes in 'hostfile'. An example hostfile is found in the example/ directory.

After validating with unlimited gap, you can validate with a maximum gap of 2000 bp without having to recompute the alignments:

    ./validate_noswift.sh -t psyringae_unlimgap_validation/psyringae_mip-scaffolds.results -r example/psyringae.fasta -s psyringae_unlimgap_validation/reordered_scafs.fasta --maxgap 2000 -w psyringae_2000gap_validation/

Each run prints its results to stdout and also writes them to validation_results.txt in its work directory: psyringae_unlimgap_validation/validation_results.txt for the unlimited-gap run and psyringae_2000gap_validation/validation_results.txt for the 2000 bp run.

Changes in version 0.4
======================

-Fixed a bug where long sequence headers containing special symbols were being clipped by swift

Changes in version 0.5
======================

-Now (normalized) N50 is computed/defined using the known genome length rather than the overall length of (normalized) scaffolds
 -- For backward compatibility, remove the $GENOME_LENGTH parameter inside the .sh scripts in the line calling calc_norm_N50
 -- The overall length of (normalized) scaffolds is then used instead, as in the previous version

Changes in version 0.6
======================

-Fixed a bug (fix by Michael Vyverman) that caused some alignments to be missed in colinear chaining. As a result, the normalized N50 values can be larger in some cases than with earlier versions of this software.

Known issues
======================

- Does not work correctly with circular chromosomes!
 -- Manual fix:
    a) Concatenate each reference FASTA entry with itself
    b) Change $GENOME_LENGTH to $GENOME_LENGTH/2 inside the .sh scripts that call calc_norm_N50
    c) Genome coverage % will be calculated incorrectly, but scaffold coverage % and normN50 are correct
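
Step a) of the manual fix for circular chromosomes could be done with a small Python helper along the following lines. This is only a sketch and is not part of the package: the function names double_fasta_entries and double_fasta_file, the 60-column line wrapping, and the use of Python 3 are all assumptions made for illustration. Step b) still has to be done by hand in the .sh scripts.

```python
def _format_entry(header, chunks, width):
    """Return the header followed by the doubled sequence, wrapped to `width` columns."""
    doubled = "".join(chunks) * 2  # sequence concatenated with itself
    wrapped = [doubled[i:i + width] for i in range(0, len(doubled), width)]
    return [header] + wrapped


def double_fasta_entries(lines, width=60):
    """Yield FASTA lines in which every entry's sequence is concatenated with itself.

    Hypothetical helper: `lines` is any iterable of text lines (for example an
    open file). Blank lines and any text before the first '>' header are skipped.
    """
    header = None  # header line of the entry currently being read
    chunks = []    # sequence lines collected for that entry
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                for out in _format_entry(header, chunks, width):
                    yield out
            header, chunks = line, []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        for out in _format_entry(header, chunks, width):
            yield out


def double_fasta_file(in_path, out_path, width=60):
    """Write a copy of the FASTA file `in_path` with every entry doubled to `out_path`."""
    with open(in_path) as src:
        with open(out_path, "w") as dst:
            for out in double_fasta_entries(src, width):
                dst.write(out + "\n")
```

For example, calling double_fasta_file("example/psyringae.fasta", "psyringae_doubled.fasta") would write a doubled reference (the output file name is made up here) that can then be passed with -r to the validation scripts.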