# Analysis of the subsequence composition of biosequences

In this talk we assess the quality of phylogenies reconstructed from the composition of sparse structures in proteomes. Specifically, we use a set of repetitive gapped patterns, called motifs, whose length and sparsity have never been considered before. We find that extremely sparse motifs in mitochondrial proteomes support phylogenies of comparable quality to state-of-the-art string-based algorithms. Moving from maximal motifs -- motifs that cannot be made more specific without losing support -- to a set of generators with decreasing size and redundancy, generally degrades classification, suggesting that redundancy itself is a key factor for the efficient reconstruction of phylogenies. This is perhaps the first time in which the composition of all motifs of a proteome is systematically used in phylogeny reconstruction on a large scale.

Motivated by the finding that long and sparse motifs carry phylogenetic information, we study ways to speed up the computation of the expectation and variance of the number of occurrences of a pattern with rigid gaps in a random string. First, we focus on patterns in which groups of characters from an alphabet $\Sigma$ can occur at each position. We describe a way to compute the exact expectation and variance of the number of occurrences of a pattern $w$ in a random string generated by a Markov chain in $O(|w|^2)$ time, improving a previous result that required $O(2^{|w|})$ time. We then consider the problem of computing expectation and variance of the motifs of a string $s$ in an IID text. We study the case in which $s$ is given offline, and an arbitrary motif $w$ of $s$ is queried online. We relate computational complexity to the structure of $w$ and $s$, identifying sets of motifs that are amenable to $o(|w|\log|w|)$ time online computation after $O(|s|^3)$ preprocessing of $s$.

BRIEF BIO

Fabio Cunial received Bachelor's and Master's degrees in Computer Engineering from the University of Padova (Italy), and Master's and PhD degrees in Computational Science and Engineering from GeorgiaTech. During his PhD, he worked as a summer intern in the Symbiose bioinformatics team at INRIA Rennes, and in the Department of Sociology at Emory University; he also collaborated with the National Biodiversity Institute of Costa Rica (INBio) for one year. He is interested in algorithms for string statistics, alignment-free sequence comparison, compositional notions of information, large-scale applications of string pattern discovery in molecular biology and sociology, and in algorithms and applications in biodiversity informatics.