University homepage Suomenkielinen versio puuttuu Inte på svenska In english
University of Helsinki Department of Computer Science
 

Department of Computer Science

Run-Length Compressed Suffix Array

RLCSA [4, 3, 7] is a compressed suffix array implementation that has been optimized for highly repetitive text collections. Examples of such collections include version control data and individual genomes. This implementation also serves as a testbed for many techniques used with compressed suffix arrays.

The most up-to-date description of RLCSA can be found in [7].

The current version includes experimental support for:

  • LCP information [5]
  • Distribution-aware sampling [6, 9]
  • Adaptive distribution-aware sampling
  • Space-efficient document listing [8]

See README in the package for further information.

The implementation is available for download under the MIT / X11 License.

News

  • 2013-01-21 A new version with parallel sampling.
  • 2012-12-07 A new version with various changes accumulated over time.
  • 2012-11-05 More test data is now available online.
  • 2012-10-12 More test data is now available online.
  • 2012-02-14 More space-efficient construction. Option to use a succinct bit vector to mark sampled positions.
  • 2011-08-23 A minor update to support the August 2011 version of GCSA.
  • 2011-05-17 Support for distribution-aware sampling and an improved low-level interface for external modules.
  • 2011-01-17 A new version that supports the interface used in recent GCSA experiments.
  • 2010-11-25 Added a list of the data sets used in the experiments.
  • 2010-10-13 A new version with libstdc++ parallel mode support and performance improvements.
  • 2010-03-29 A new version with some compatibility and interface updates.
  • 2010-01-11 A new version with experimental support for LCP information.
  • 2009-11-25 A new version of RLCSA is available. This version includes bug fixes, more functionality and a bit cleaner interface.
  • 2009-06-15 The implementation of RLCSA is now available.

Downloads

  • Current version (January 2013).
  • January 2013. This version includes a multi-threaded sampling algorithm for finding (almost) optimal distribution-aware samples quickly. There is also some undocumented work on document listing with highly repetitive data. This version was used in the experiments in [9].
  • December 2012. This version has many technical changes from the earlier versions. This version is required by the December 2012 version of GCSA.
  • February 2012. Construction algorithm for partial indexes changed from multiple Larsson-Sadakane algorithms to a parallel prefix-doubling algorithm. Succinct bit vectors can be used instead of gap encoded ones to e.g. mark sampled positions. This version was used in the experiments in [7].
  • August 2011. Minor updates to some library functions. This version is required by the August 2011 version of GCSA.
  • May 2011. The low-level interface has been improved in this version. Weighted / distribution-aware sampling is now also supported. The test program has also been extended.
  • January 2011. This version has a low-level interface compatible with that of GCSA. Also, the default sample rate was changed to more realistic 128, and the test program supports patterns in Pizza&Chili format.
  • October 2010. This version supports libstdc++ parallel mode as an alternative to MCSTL. There are also some speed optimizations (an alternative encoding for run-length encoded bit vectors and improved display()).
  • March 2010. Fixed some problems when compiling with a new version of g++. RLCSA now uses iterators and const operations to make parallelization easier.
  • January 2010. Includes experimental support for LCP information. This version was used in the experiments reported in [5].
  • November 2009. A cleaned up version with more functionality.
  • June 2009. This version was used in the experiments reported in [3]. There are some bugs that were introduced when cleaning up the code. You should use the November 2009 version instead.

Data

The following data sets were used in experiments with RLCSA. Some of them are available from the Pizza&Chili Corpus (Chile, Italy).

References

  1. Jouni Sirén, Niko Välimäki, Veli Mäkinen, and Gonzalo Navarro: Run-Length Compressed Indexes Are Superior for Highly Repetitive Sequence Collections.
    Proc. SPIRE 2008, Springer LNCS 5280, pp. 164-175, Melbourne, Australia, November 10-12, 2008.
  2. Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki: Storage and Retrieval of Individual Genomes.
    Proc. RECOMB 2009, Springer LNCS 5541, pp. 121-137, Tucson, Arizona, USA, May 18-21, 2009.
  3. Jouni Sirén: Compressed Suffix Arrays for Massive Data.
    Proc. SPIRE 2009, Springer LNCS 5721, pp. 63-74, Saariselkä, Finland, August 25-27, 2009.
  4. Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki: Storage and Retrieval of Highly Repetitive Sequence Collections.
    Journal of Computational Biology 17(3):281-308, 2010.
  5. Jouni Sirén: Sampled Longest Common Prefix Array.
    Proc. CPM 2010, Springer LNCS 6129, pp. 227-237, New York, USA, June 21-23, 2010.
  6. Paolo Ferragina, Jouni Sirén, and Rossano Venturini: Distribution-Aware Compressed Full-Text Indexes.
    Proc. ESA 2011, Springer LNCS 6942, pp. 760-771, Saarbrücken, Germany, September 5-7, 2011.
  7. Jouni Sirén: Compressed Full-Text Indexes for Highly Repetitive Collections (PhD Thesis).
    Department of Computer Science, Series of Publications A, Report A-2012-5, University of Helsinki, June 2012.
  8. Travis Gagie, Kalle Karhu, Gonzalo Navarro, Simon J. Puglisi, and Jouni Sirén: Document Listing on Repetitive Collections.
    Accepted to CPM 2013, Bad Herrenalb, Germany, June 17-19, 2013.
  9. Paolo Ferragina, Jouni Sirén, and Rossano Venturini: Distribution-aware compressed full-text indexes.
    Accepted to Algorithmica, 2013.

Jouni.Siren@cs.helsinki.fi