First Huffman, then Burrows-Wheeler: A Simple Alphabet-Independent FM-Index

Szymon Grabowski, Veli Mäkinen, and Gonzalo Navarro

Abstract. We design a succinct full-text index based on the idea of Huffman-compressing the text and then applying the Burrows-Wheeler transform over it. The resulting structure can be searched as an FM-index, with the benefit of removing the sharp dependence on the alphabet size, c, present in that structure. On a text of length n with zero-order entropy H_0, our index needs O(n(H_0+1)) bits of space, without any dependence on c. The average search time for a pattern of length m is O(m(H_0+1)), under reasonable assumptions. Each position of a text occurrence can be reported in worst case time O((H_0+1) log n), while any text substring of length L can be retrieved in O((H_0+1)L) average time in addition to the previous worst case time. By paying 2n additional bits of space, it is possible to maintain the average complexities and ensure that H_0+1 becomes log c in the worst case for all the time complexities. Our index provides a relevant space/time tradeoff between existing succinct data structures, with the additional interest of being easy to implement.