SuDS project / Compressed suffix tree implementation
Our C++-implementation of compressed suffix tree is available for download (version 1.1).
- Note: If you have access to a 64-bit server machine with 32 GB memory, you can construct the compressed suffix tree of the complete Human Genome. The final memory requirement for HG is 8.8 GB, but during construction the memory peak is at 24 GB.
- This version supports storing the structure to disk, so that the heavy construction for one sequence is required only once.
- News: To see what the structure can do, try out SUDS Genome Browser: A server machine with 128 GB memory is running the compressed suffix tree of Human Genome. With the client, you can query the structure, and it will show the corresponding path in the tree and a list of found occurrences. It is also possible to browse the tree manually and click the occurrences to follow a link to Ensembl Genome Browser.
See the preprint of a Bioinformatics application note for a brief overview and some experimental results.
Our implementation of compressed suffix tree follows closely a recent article by Sadakane (Compressed Suffix Trees with Full Functionality, Theory of Computing Systems, in press). The implementation supports all typical suffix tree operations, including suffix links. In addition, it supports lowest common ancestor (lca) queries. Some operations work in constant time (traversing in the tree, lca), others are slower than normal suffix tree operations by about a logarithm factor. The space-savings are significant: on a 10MB DNA sequence, the compressed suffix tree was taking about 33MB. A normal suffix tree takes easily 160MB (and this does not yet support lca-queries). Even a plain suffix array takes more space than the compressed suffix tree. For more details on the implementation, see the manuscript describing the algorithm engineering challenges faced during the project. For a brief overview, see the poster, and for some preliminary experiments, see the WEA 2007 article.
Version 1.1 - Fixed some compilation errors when using a more recent GNU C/C++.
Back to SuDS project page
Previous update: 21.10.2013 - Veli Mäkinen