Erik Aurell, Professor of Biological Physics, KTH-Royal Institute of Technology

**Correlation-Compressed Direct Coupling Analysis**

**Abstract:** Direct Coupling Analysis (DCA) is a powerful tool for finding pairwise dependencies in large biological data sets. It amounts to inferring the coefficients of a probabilistic model in an exponential family, and then using the largest inferred coefficients as predictors for the dependencies of interest. The main computational bottleneck is the inference. As described recently by Jukka Corander in this seminar series, DCA has been done on bacterial whole-genome data, at the price of significant compute time and investment in code optimization. We have investigated whether DCA can be sped up by first filtering the data on correlations, an approach we call Correlation-Compressed Direct Coupling Analysis (CC-DCA). The computational bottleneck then moves from DCA to the more standard task of finding a subset of the most strongly correlated vectors in large data sets. I will describe results obtained so far, and outline what it would take to do CC-DCA on whole-genome data in humans and other higher organisms.

This is joint work with Chen-Yi Gao and Hai-Jun Zhou, available as arXiv:1710.04819.
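
For readers unfamiliar with the idea, the filtering step described in the abstract can be pictured roughly as follows. This is only a minimal sketch under stated assumptions, not the procedure of the paper: the function name `correlation_filter`, the use of Pearson correlation, and the `n_pairs` cutoff are all illustrative choices. It keeps the columns (loci) that take part in the most strongly correlated pairs, so that a DCA inference, not shown here, would only need to run on that smaller subset.

```python
import numpy as np


def correlation_filter(X, n_pairs):
    """Return the sorted indices of the columns of X that appear in the
    n_pairs most strongly correlated column pairs.

    Illustrative assumptions: X is a (samples x loci) numeric array and
    correlation strength is the absolute Pearson coefficient.
    """
    X = np.asarray(X, dtype=float)
    # Pearson correlation between columns (rowvar=False: columns are variables).
    C = np.corrcoef(X, rowvar=False)
    # Constant columns produce NaN correlations; treat them as uncorrelated.
    C = np.nan_to_num(C)
    # Look at each unordered pair once (strict upper triangle).
    iu, ju = np.triu_indices(C.shape[0], k=1)
    strength = np.abs(C[iu, ju])
    # Positions of the n_pairs strongest pairs, strongest first.
    top = np.argsort(strength)[::-1][:n_pairs]
    # Union of the columns involved in those pairs.
    return np.union1d(iu[top], ju[top])
```

Note that this toy version builds the full correlation matrix, which is exactly what becomes expensive at whole-genome scale; as the abstract says, finding the strongly correlated subset is then the bottleneck.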

Machine Learning Coffee seminars are weekly seminars held jointly by Aalto University and the University of Helsinki. The seminars aim to gather people from different fields of science with an interest in machine learning. Talks begin at 9:15 am, and porridge and coffee are served from 9:00 am.

Note that we will have no talk on December 4th, 2017 (due to the NIPS conference). The following programme will be announced soon.

Welcome!