MIP Scaffolder 0.6

Contact: leena.salmela@cs.helsinki.fi

--------
Overview
--------

MIP Scaffolder is a program for scaffolding contigs produced by fragment
assemblers using mate pair data such as those generated by ABI SOLiD or
Illumina Genome Analyzer.

---------
Reference
---------

L. Salmela, V. Mäkinen, N. Välimäki, J. Ylinen, and E. Ukkonen:
"Fast Scaffolding with Small Independent Mixed Integer Programs",
Bioinformatics 27(23):3259-3265, 2011.

-------------------
System Requirements
-------------------

MIP Scaffolder has been tested on systems running Linux on an x86_64
architecture.

MIP Scaffolder consists of several programs and scripts written in Perl,
shell script and C++. Compiling MIP Scaffolder requires gcc.

MIP Scaffolder uses the lp_solve library for solving MIP problems and the
lemon library for representing graphs. These can be downloaded from:

    http://sourceforge.net/projects/lpsolve/
    http://lemon.cs.elte.hu/

Note that you need the "dev" package of lp_solve, which contains the
precompiled library.

------------
Installation
------------

Unpack mip-scaffolder-0.6.tar.gz. Set the paths to mip-scaffolder, lp_solve
and lemon correctly in the Makefile. Use absolute paths. Run make.

-----
Usage
-----

All scripts mentioned below are found in the scripts directory.

1. Input data

We use the naming convention and the orientation of SOLiD reads. Therefore
each mate pair library should have two files, one for F3 reads and one for
R3 reads. The reads should be named so that each mate pair has a unique
identifier: the name of the F3 end is the identifier concatenated with _F3,
and the name of the R3 end is the identifier concatenated with _R3. The
orientation of the mate pair should be as follows:

      R3           F3
    ------>     ------->

MIP Scaffolder can also handle Illumina mate pairs / paired end reads if the
reads are renamed and the appropriate end is reversed so that the above
orientation is obtained.
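The renaming and reversal described above could be sketched as follows. This
is only an illustration, not part of MIP Scaffolder: the helper names
reverse_complement and to_solid_convention are hypothetical, and which end
has to be reversed depends on the library protocol, so it is left as a
parameter (the default of "R3" is an arbitrary assumption).

```python
# Sketch only: helper names and the default choice of which end to reverse
# are assumptions, not part of MIP Scaffolder.

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def to_solid_convention(pair_id, f3_seq, r3_seq, reverse_end="R3"):
    """Name the two ends <pair_id>_F3 and <pair_id>_R3 and reverse-complement
    the end selected by reverse_end ("F3" or "R3").

    Returns a list of (name, sequence) tuples, F3 end first.
    """
    if reverse_end not in ("F3", "R3"):
        raise ValueError("reverse_end must be 'F3' or 'R3'")
    if reverse_end == "F3":
        f3_seq = reverse_complement(f3_seq)
    else:
        r3_seq = reverse_complement(r3_seq)
    return [(pair_id + "_F3", f3_seq), (pair_id + "_R3", r3_seq)]


if __name__ == "__main__":
    # Print one converted pair as FASTA records.
    for name, seq in to_solid_convention("pair0001", "ACGTT", "GGCAT"):
        print(">" + name)
        print(seq)
```

If the reads are stored with base qualities (for example in FASTQ), the
quality string of the reversed end has to be reversed as well.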

Additionally you should have the contigs in a fasta file and the coverage
statistics of the contigs in a separate file. The statistics should be given
in the same order as the contigs are in the fasta file. For each contig the
coverage file should contain a line in the following format:

The first entry gives the index of the contig starting from 1.

2. Mapping mate pairs to contigs

Each mate pair read file should be mapped to the contigs using your favorite
read mapper. The output of the mapper should be in SAM format. It is a good
idea to tell the read mapper to produce only unique mappings.

Use the merge.sh script to merge the mappings of the mate pair ends
together:

    merge.sh

This script produces two files, .sorted1 and .sorted2, which are needed as
input for filtering consistent mappings.

Use filter-mappings.sh to filter consistent mappings:

    filter-mappings.sh [-w -p ] .sorted1 .sorted2

This script produces one file which contains (w,p)-consistent mappings.

3. Run the scaffolder

Produce a configuration file according to the example below. You can have as
many stages as you wish and each stage can contain as many libraries as you
wish. Generally, using a new stage for each library produces longer
scaffolds. However, if the coverage of a library is low, using it with
another library in the same stage may give more reliable results.

Finally run the scaffolder:

    mip-scaffolder.pl

--------------------------
Example Configuration File
--------------------------

# Upper bound for genome length (required)
genome_length=500000000

# Parameter specifications for the first stage
[STAGE]
# Maximum biconnected component size. (optional)
maximum_biconnected_component=50
# Maximum allowed degree in scaffolding graph. (optional)
maximum_degree=50
# Maximum coverage for nonrepetitive contig. (optional)
#maximum_coverage=20
# The maximum overlap between contigs that is allowed without checking for
# sequence similarity. By default this is set based on the variability in
# insert size lengths of each library. (optional)
#maximum_overlap=100
# The minimum support for an edge. (optional)
minimum_support=2
# Should edges with negative estimated distance be checked for sequence
# similarity or removed automatically? (optional)
check_negative_edges=1
# The maximum allowed error level when checking for sequence similarity
# (optional)
alignment_error=0.1

# Library specification for the first stage
[LIBRARY]
# File in SAM format containing mappings for the mate pair reads
# to the contigs
mappings=pairs.sam
# Orientation of the mate pairs (in current version must be SOLID)
orientation=SOLID
# Insert length
insert_length=600
# Minimum insert length
min_insert_length=500
# Maximum insert length
max_insert_length=700

# Parameter specifications for the second stage
[STAGE]
# Maximum biconnected component size. (optional)
maximum_biconnected_component=50
# Maximum allowed degree in scaffolding graph. (optional)
maximum_degree=50
# Maximum coverage for nonrepetitive contig. (optional)
#maximum_coverage=20
# The maximum overlap between contigs that is allowed without checking for
# sequence similarity. By default this is set based on the variability in
# insert size lengths of each library. (optional)
#maximum_overlap=300
# The minimum support for an edge. (optional)
minimum_support=2
# Should edges with negative estimated distance be checked for sequence
# similarity or removed automatically? (optional)
check_negative_edges=1
# The maximum allowed error level when checking for sequence similarity
# (optional)
alignment_error=0.1
# Marker information (i.e. protein links) to be used in this stage
markers=markers.txt

# First library of the second stage
[LIBRARY]
# File in SAM format containing mappings for the mate pair reads
# to the contigs
mappings=pairs.sam
# Orientation of the mate pairs (in current version must be SOLID)
orientation=SOLID
# Insert length
insert_length=2000
# Minimum insert length
min_insert_length=1500
# Maximum insert length
max_insert_length=2500

# Second library of the second stage
[LIBRARY]
# File in SAM format containing mappings for the mate pair reads
# to the contigs
mappings=pairs.sam
# Orientation of the mate pairs (in current version must be SOLID)
orientation=SOLID
# Insert length
insert_length=2500
# Minimum insert length
min_insert_length=2100
# Maximum insert length
max_insert_length=3000

------------------
New in Version 0.6
------------------

Bug fixes:
* In some cases the precision of floating point numbers was lost. This has
  now been fixed.

------------------
New in Version 0.5
------------------

The sequence similarity threshold can now be set by the user by adjusting
alignment_error in the configuration file.

Fixed some bugs:
* integer type bugs related to scaffolding large genomes
* computing lengths of alignments for figuring out sequence similarity

------------------
New in Version 0.4
------------------

Fixed a bug in reading SAM files.

Added many safety checks and more informative error messages.

Changed the example configuration file to use more sensible defaults.

Fixed a bug in handling paths in filenames and directories.