Parallelization of string matching algorithms extended to graphs

Current research in algorithmic bioinformatics focuses on extending string algorithms to work on a string and a labeled directed graph (instead of just two strings). This is motivated, for example, by the rise of pan-genomics, where a "pan-genome" is a labeled graph encoding all genomes observed in a population. However, several such problems, for example exact string matching, can only be solved in quadratic time on graphs, as opposed to linear time on strings. This motivates the search for alternative strategies that work in practice, on inputs of billions of characters.

In this thesis you will develop methods to "decompose" a graph in several components such that these algorithms can be run in parallel on each such "component", and then combine the results for each "component". The focus is on aligning a string in a labeled directed acyclic graph (DAG) under the models of exact matching and edit distance. You will review (serial) algorithms for a string and a graph, review parallelization strategies for computing the edit distance between two strings, develop such decomposition strategies, prove their correctness, implement them and then perform experimental evaluations.

Ideally, you are already familiar with string algorithms (e.g. through the String Processing Algorithms course) and have experience with parallel programming. Moreover, ideally you also have some experience with GPU programming, in case the algorithms can also be implemented there.