This HTML version of the exercises is only intended to give you a quick view of what the exercises are about. Use the .ipynb file to solve the exercises!

Exercise 3: Gene regulatory network inference (DL 27 November)

Lecturer: Antti Honkela

Your name:

Name of your pair:

In this exercise you will experiment with methods for inferring regulatory interactions.

If you have any questions, please post them to the Moodle discussion area!

Task 1: Regulatory interaction inference with Gaussian Markov random field (2 pts)

In this task, you will generate a sparse Erdős–Rényi random graph, form the corresponding Gaussian Markov random field and study its covariance.

Generate an undirected ER random graph for 12 genes using p=0.4. Construct a corresponding Gaussian Markov random field by forming the precision matrix which has non-zero elements corresponding to the edges of the graph. Make sure the precision matrix is positive definite by making the diagonal elements positive and sufficiently large compared to the other elements of the matrix.
Compute the correlation matrix among the 12 genes from the precision matrix. Visualise the matrix. Can you spot the true connections?
Try setting a threshold for absolute values of the correlation matrix elements to distinguish between correct and false connections. Use the Hamming distance between the true and estimated connectivity matrices as a measure of accuracy. How small a distance can you achieve?

Hints:

You can check if the precision matrix is positive definite using scipy.linalg.eig(). The first argument contains the eigenvalues of the matrix which should all be positive. Eigenvalues of a symmetric matrix are always real. If you get complex eigenvalues, make sure the matrix is symmetric. (Precision and covariance matrices of Gaussian distributions are always symmetric.)
You can force a matrix \(\mathbf{A}\) to be positive definite by adding a diagonal term. If eigenvalues of \(\mathbf{A}\) are \((\lambda_1, \dots, \lambda_n)\), eigenvalues of \(\mathbf{A} + \alpha \mathbf{I}\) are \((\lambda_1+\alpha, \dots, \lambda_n+\alpha)\).
plt.spy() function can be useful for exploring the sparsity pattern of matrices.
Hamming distance of the binary connectivity matrices is the number of positions in which their values are different.

In []:

%matplotlib inline
# Import some useful modules
import matplotlib.pyplot as plt
import numpy as np
import numpy.random as npr

Answers to questions:

Task 2: Inferring interactions from data (3 pts)

In this task, you will simulate observations from the Gaussian Markov random field model constructed in Task 1 and attempt to infer the network structure using those observations.

Simulate three multivariate normal observation data sets of sizes [25, 50, 100] with 0 mean using the covariance from Task 1.
Compute and visualise the correlation matrix among the 12 genes simulated. How small Hamming distance can you obtain with the different data sets using a suitable threshold?
Compute and visualise the inverse of the correlation matrix. How small Hamming distance can you now obtain with the different data sets using a suitable threshold?
Learn the structure using graphical lasso (sklearn.covariance.graph_lasso()). Try different values of the regularisation parameter. How good results in terms of the Hamming distance can you now obtain?

Hints:

See hints for Task 1 about positive definiteness of the covariance matrix.
Graphical lasso is implemented in scikit-learn. You can install the package using the command "pip3 install sklearn --user" on the command line. If this does not work, more instructions are at: http://scikit-learn.org/

Task 3: Inferring interactions from data II (2 pts)

In this task, you will simulate observations using a linear generative model: \[ \mathbf{y_i} = \mathbf{A} \mathbf{x_i} + \epsilon_i. \] Here \(\mathbf{x_i}\) are TF expression levels (assumed to be random, following a Gaussian distribution), \(\mathbf{A}\) is the connectivity matrix from TFs to target genes, \(\mathbf{y_i}\) are the target gene expression levels and \(\epsilon_i\) is Gaussian observation noise. Based on these data, you will attempt to infer the network structure.

Generate a sparse Erdős–Rényi random connectivity graph from two TFs to 10 target genes and sample the random connection weigths having this sparsity pattern.
Simulate 3 data sets following the above connectivity graph and model with 2 TFs and 10 target genes, having [25, 50, 100] samples. You can simulate \(\mathbf{x_i}\) from \(\mathcal{N}(0, 1)\) and \(\epsilon_i\) from \(\mathcal{N}(0, 0.1^2)\).
Form a data set by concatenating the TFs and targets. Compute and visualise the correlation matrix among the 12 genes. How small Hamming distance between the true network and the estimated network can you obtain?
Compute and visualise the inverse of the correlation matrix. How well can you now find the true connections in terms of the Hamming distance?

Hints:

When computing covariances, it can be useful to check the shape of the result that you are computing it the right way. You can then transpose the data (X -> X.T) if necessary.

Task 4: Inferring interactions from dynamic simulations (3 pts)

In this task, you will simulate observations using the following dynamical model: \[ \frac{d \mathbf{x}(t)}{dt} = \sigma(\mathbf{A} \mathbf{x}) - \delta \mathbf{x} + \gamma, \] where \(\mathbf{x}\) are the gene expression levels, \(\mathbf{A}\) is the connectivity matrix (connecting only TFs to TFs and TFs to targets), \(\sigma(x) = 1/(1+e^{-x})\) is a sigmoid function, \(\delta\) are the mRNA degradation rates and \(\gamma\) are constant external inputs (affecting only TFs).

Implement a function for simulating from the above system. You can use scipy.integrate.odeint() to numerically solve the differential equation given a suitable non-negative initial state.
Simulate the system with a constant non-negative external input and \(\delta = 1\) and visualise the results.
Simulate the system with number of non-negative random external inputs until it reaches steady state. What can you see?
Compute the correlations and inverse covariance for the combined steady-state measurements from [10, 20, 40] simulations. How well can you recover the true connections?

Background on Task 4

The use of sigmoid function in the dynamics as in this task is not very common. It has clear advantages over plain linear model (saturation of transcription rate to a fixed maximum, possibly easier stability guarantees, possibility to model repression), but the presented formulation still cannot realistically model combinatorial regulation, and it makes model inference much more difficult than linear differential equations.