MIP Scaffolder 0.5
Contact: leena.salmela@cs.helsinki.fi

--------
Overview
--------

MIP Scaffolder is a program for scaffolding contigs produced by
fragment assemblers using mate pair data such as those generated by
ABI SOLiD or Illumina Genome Analyzer.

---------
Reference
---------

L. Salmela, V. Mäkinen, E. Ukkonen, N. Välimäki, and J. Ylinen: "Fast
Scaffolding with Small Independent Mixed Integer Programs". To appear
in Bioinformatics.

-------------------
System Requirements
-------------------

MIP Scaffolder has been tested on Linux systems with an x86_64
architecture. It consists of several programs and scripts written in
Perl, shell script and C++. Compiling MIP Scaffolder requires gcc.
MIP Scaffolder uses the lp_solve library for solving MIP problems and
the lemon library for representing graphs. These can be downloaded
from:

http://sourceforge.net/projects/lpsolve/
http://lemon.cs.elte.hu/

Note that you need the "dev" package of lp_solve which contains the
precompiled library.

------------
Installation
------------

Unpack mip-scaffolder-0.5.tar.gz.
Set the paths to mip-scaffolder, lp_solve and lemon correctly in the Makefile.
Use absolute paths. Run make.

-----
Usage
-----

All scripts mentioned below are found in the scripts directory.

1. Input data

We use the naming convention and the orientation of SOLiD reads.
Therefore each mate pair library should have two files, one for F3
reads and one for R3 reads. The reads should be named so that each
mate pair has a unique identifier and the name of the F3 end is the
unique identifier concatenated with _F3 and the name of the R3 end is
the unique identifier concatenated with _R3. The orientation of the
mate pair should be as follows:

   R3           F3
 ------>     ------->

MIP Scaffolder can also handle Illumina mate pairs / paired end reads
if the reads are renamed and the appropriate end is reversed so that the
above orientation is obtained.
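
The renaming step can be sketched in Python as follows. This is only
an illustration and not part of MIP Scaffolder: the function names are
hypothetical, and it assumes that "reversed" means reverse-complemented.
Which end becomes F3 or R3, and which end needs reversing, depends on
your library, so the caller has to choose.

```python
# Hypothetical helpers for illustration; not shipped with MIP Scaffolder.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def solid_style_pair(pair_id, seq_f3, seq_r3,
                     reverse_f3=False, reverse_r3=False):
    """Name the two ends <id>_F3 and <id>_R3.

    Either end can be reverse-complemented on request. Which end to
    reverse is an assumption the caller must supply.
    """
    if reverse_f3:
        seq_f3 = reverse_complement(seq_f3)
    if reverse_r3:
        seq_r3 = reverse_complement(seq_r3)
    return (pair_id + "_F3", seq_f3), (pair_id + "_R3", seq_r3)
```

For example, solid_style_pair("pair1", "AACG", "TTGA", reverse_r3=True)
names the ends pair1_F3 and pair1_R3 and reverse-complements only the
R3 end.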

Additionally you need the contigs in a fasta file and the coverage
statistics of the contigs in a separate file. The statistics should be
given in the same order as the contigs are in the fasta file. For each
contig the coverage file should contain a line in the following format:

<id>    <name>        <length>        <coverage>

The first entry gives the index of the contig starting from 1.
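
A coverage file in this layout could be produced with a sketch like
the one below. It is not part of MIP Scaffolder. It assumes that the
fields are separated by tabs, that the contig name is the first word
of the fasta header, and that you already have a coverage value for
each contig (here a hypothetical dictionary coverage_by_name).

```python
def read_fasta_lengths(lines):
    """Yield (name, length) for each fasta record, in file order."""
    name, length = None, 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, length
            # Assumption: the contig name is the first word of the header.
            words = line[1:].split()
            name, length = (words[0] if words else ""), 0
        elif name is not None:
            length += len(line)
    if name is not None:
        yield name, length


def coverage_lines(fasta_lines, coverage_by_name):
    """Build one '<id> <name> <length> <coverage>' line per contig.

    Ids start from 1 and follow the order of the fasta file.
    Assumption: fields are tab-separated.
    """
    out = []
    records = read_fasta_lengths(fasta_lines)
    for index, (name, length) in enumerate(records, start=1):
        out.append("%d\t%s\t%d\t%s"
                   % (index, name, length, coverage_by_name[name]))
    return out
```

The fasta_lines argument can be an open file object, and the returned
lines can be written to the coverage file one per line.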

2. Mapping mate pairs to contigs

Each mate pair read file should be mapped to the contigs using your
favorite read mapper. The output of the mapper should be in SAM
format. It is a good idea to tell the read mapper to produce only
unique mappings.

Use the merge.sh script to merge the mappings of the mate pair ends
together:

merge.sh <mappings-F3> <mappings-R3> <merged-mappings>

This script produces two files <merged-mappings>.sorted1 and
<merged-mappings>.sorted2 which are needed as input for filtering
consistent mappings.

Use filter-mappings.sh to filter consistent mappings:

filter-mappings.sh [-w <w> -p <p>] <merged-mappings>.sorted1 <merged-mappings>.sorted2 <filtered mappings>

This script produces one file <filtered mappings> which contains
(w,p)-consistent mappings.

3. Run the scaffolder

Produce a configuration file according to the example below. You can
have as many stages as you wish and each stage can contain as many
libraries as you wish. Generally using a new stage for each library
produces longer scaffolds. However, if the coverage of a library is
low, using it with another library in the same stage may give more
reliable results. Finally run the scaffolder:

mip-scaffolder.pl <configuration file> <contigs.fasta> <contig coverage> <working directory>
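
The three commands of this section can be chained from a small driver.
The sketch below only assembles the argument lists from the synopses
given above. The function name and all file names are hypothetical,
the scripts are assumed to be on the PATH, and the example covers a
single library (run the merge and filter steps once per library).

```python
def pipeline_commands(f3_sam, r3_sam, merged, filtered,
                      config, contigs, coverage, workdir,
                      w=None, p=None):
    """Return the merge, filter and scaffolding commands as argv lists.

    The optional w and p values are passed to filter-mappings.sh only
    when both are given, following the documented synopsis.
    """
    filter_cmd = ["filter-mappings.sh"]
    if w is not None and p is not None:
        filter_cmd += ["-w", str(w), "-p", str(p)]
    # merge.sh writes <merged>.sorted1 and <merged>.sorted2.
    filter_cmd += [merged + ".sorted1", merged + ".sorted2", filtered]
    return [
        ["merge.sh", f3_sam, r3_sam, merged],
        filter_cmd,
        ["mip-scaffolder.pl", config, contigs, coverage, workdir],
    ]
```

Each list can then be run in order, for instance with
subprocess.run(cmd, check=True), so that a failing step stops the
pipeline.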

--------------------------
Example Configuration File
--------------------------

# Upper bound for genome length (required)
genome_length=500000000

#parameter specifications for the first stage
[STAGE]
# Maximum biconnected component size. (optional)
maximum_biconnected_component=50
# Maximum allowed degree in scaffolding graph. (optional)
maximum_degree=50
# Maximum coverage for nonrepetitive contig. (optional)
#maximum_coverage=20
# The maximum overlap between contigs that is allowed without checking for
# sequence similarity. By default this is set based on the variability in
# insert size lengths of each library. (optional)
#maximum_overlap=100
# The minimum support for an edge. (optional)
minimum_support=2
# Should edges with negative estimated distance be checked for sequence 
# similarity or removed automatically? (optional)
check_negative_edges=1
# The maximum allowed error level when checking for sequence similarity 
# (optional)
alignment_error=0.1

# library specification for the first stage
[LIBRARY]
# File in SAM format containing mappings for the mate pair reads 
# to the contigs
mappings=pairs.sam
# Orientation of the mate pairs (in current version must be SOLID)
orientation=SOLID
# Insert length
insert_length=600
# Minimum insert length
min_insert_length=500
# Maximum insert length
max_insert_length=700

# parameter specifications for the second stage
[STAGE]
# Maximum biconnected component size. (optional)
maximum_biconnected_component=50
# Maximum allowed degree in scaffolding graph. (optional)
maximum_degree=50
# Maximum coverage for nonrepetitive contig. (optional)
#maximum_coverage=20
# The maximum overlap between contigs that is allowed without checking for
# sequence similarity. By default this is set based on the variability in
# insert size lengths of each library. (optional)
#maximum_overlap=300
# The minimum support for an edge. (optional)
minimum_support=2
# Should edges with negative estimated distance be checked for sequence 
# similarity or removed automatically? (optional)
check_negative_edges=1
# The maximum allowed error level when checking for sequence similarity 
# (optional)
alignment_error=0.1

# First library of the second stage
[LIBRARY]
# File in SAM format containing mappings for the mate pair reads 
# to the contigs
mappings=pairs.sam
# Orientation of the mate pairs (in current version must be SOLID)
orientation=SOLID
# Insert length
insert_length=2000
# Minimum insert length
min_insert_length=1500
# Maximum insert length
max_insert_length=2500

# Second library of the second stage
[LIBRARY]
# File in SAM format containing mappings for the mate pair reads 
# to the contigs
mappings=pairs.sam
# Orientation of the mate pairs (in current version must be SOLID)
orientation=SOLID
# Insert length
insert_length=2500
# Minimum insert length
min_insert_length=2100
# Maximum insert length
max_insert_length=3000
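
Because [STAGE] and [LIBRARY] headers repeat, a generic INI parser that
rejects duplicate sections is not a good fit for this layout. The
sketch below is not the parser used by mip-scaffolder.pl. It is a
minimal illustration of how the layout nests (global parameters, then
stages, each with its libraries), plus a sanity check that assumes
min_insert_length <= insert_length <= max_insert_length is intended.

```python
def parse_config(text):
    """Parse the layout into {'global': {...}, 'stages': [...]}.

    Each stage is {'params': {...}, 'libraries': [{...}, ...]}.
    All values are kept as strings.
    """
    config = {"global": {}, "stages": []}
    current = config["global"]
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line == "[STAGE]":
            stage = {"params": {}, "libraries": []}
            config["stages"].append(stage)
            current = stage["params"]
        elif line == "[LIBRARY]":
            if not config["stages"]:
                raise ValueError("[LIBRARY] before any [STAGE]")
            library = {}
            config["stages"][-1]["libraries"].append(library)
            current = library
        else:
            key, sep, value = line.partition("=")
            if not sep:
                raise ValueError("expected key=value, got: " + line)
            current[key.strip()] = value.strip()
    return config


def check_insert_bounds(config):
    """Return (stage, library) numbers whose insert lengths look wrong.

    Assumption: min_insert_length <= insert_length <= max_insert_length.
    """
    problems = []
    for s, stage in enumerate(config["stages"], start=1):
        for n, lib in enumerate(stage["libraries"], start=1):
            low = int(lib["min_insert_length"])
            mid = int(lib["insert_length"])
            high = int(lib["max_insert_length"])
            if not low <= mid <= high:
                problems.append((s, n))
    return problems
```

Running parse_config on the example above gives two stages, the first
with one library and the second with two, and check_insert_bounds
reports no problems for it.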

------------------
New in Version 0.5
------------------

The sequence similarity threshold can now be set by the user
by adjusting alignment_error in the configuration file.

Fixed some bugs:
   * integer type bugs related to scaffolding large genomes
   * computing lengths of alignments for figuring out sequence similarity

------------------
New in Version 0.4
------------------

Fixed a bug in reading SAM files.

Added many safety checks and more informative error messages.

Changed example configuration file to use more sensible defaults.

Fixed a bug in handling paths in filenames and directories.
