This HTML version of the exercises is only intended to give you a quick view of what the exercises are about. Use the .ipynb file and the IPython Notebook environment to solve the exercises!

Exercise 1: Global network models (DL 5.11.)

Your name:

Name of your pair:

1. An ER graph and its degree distribution (2 points)

In this exercise we will eventually create a random ER graph and plot its degree distribution. However, we will start by introducing some basic python concepts and the iPython notebook environment to get you started.

The code below shows how to generate a random number in python. Run the code. Modify the code to print 10 random integers.

Your solution:

In []:

from random import random

print(random())

Graphs can be represented in python using dictionaries and lists. Below is an example of python code representing the following directed graph:

A \(\rightarrow\) B

A \(\rightarrow\) C

B \(\rightarrow\) E

C \(\rightarrow\) B

C \(\rightarrow\) D

D \(\rightarrow\) A

In []:

graph = {'A' : ['B', 'C'],
         'B' : ['E'],
         'C' : ['B', 'D'],
         'D' : ['A'],
         'E' : []}
print(graph)

Create a graph with 1000 vertices labeled {1,...,1000} where each directed pair of vertices is connected by an edge with probability 0.05. Use the random function from above and the graph data structure described above.

Your solution:

The code below produces 100 random integers in range [0,9] and draws a histogram of them.

In []:

from random import randint
import matplotlib.pyplot as plt

# Make plots appear inline
%matplotlib inline

# Generate the random numbers
numbers = [randint(0,9) for i in range(0,100)]
# Plot the histogram
plt.hist(numbers, bins=10)

Plot the degree histogram of the random graph you created in the previous step.

Your solution:

2. Degree distribution of E. coli regulatory network (3 points)

In this exercise we will read the E. coli regulatory network and plot its degree distribution.

The E. coli regulatory network can be downloaded from RegulonDB:

http://regulondb.ccg.unam.mx/menu/download/datasets/files/network_tf_gene.txt

The header of the file contains general information on several commented lines starting with #. This is followed by the description of the network where each lines describes one edge of the network: the first column contains the transcription factor (TF) that regulates the gene in the second column thus describing an edge pointing from the TF to the gene. Note that the TF names start with upper case letters and the gene names with lower case letters. For example accB is the gene coding for the TF AccB. Thus when constructing the regulatory network treat accB and AccB as the same vertex. For this exercise you can ignore the rest of the columns.

Write code that reads the E. coli regulatory network into a graph data structure similar to that used in the first exercise. You can use the gene names as vertex labels. You might find the python string method split helpful in parsing the file.

Your solution:

Plot a histogram of the degree distribution of the E. coli regulatory network.

Your solution:

In []:

import urllib.request  # library handling URLs

ecoli_network = {}

# E. coli network file
networkfile="http://regulondb.ccg.unam.mx/menu/download/datasets/files/network_tf_gene.txt"
# Open the file. This will work exactly like a regular file.
data = urllib.request.urlopen(networkfile)
for line in data:
    line = line.decode('utf-8')  # decode the byte array into a string
    # Add code to load the data here

Plot the degree distribution also as a scatter plot where x axis is the degree and y axis gives the number of vertices having the given degree. Use logarithmic scale on both axis. The code below shows an example of drawing a scatter plot (of random data) and manipulating the axis of a plot.

Does the degree distribution of the E. coli regulatory network follow a power law? Why? / Why not? Try to fit a power law distribution to the data by plotting a curve \(y=a*x^{-k}\) with suitable values of \(a\) and \(k\).

In []:

from random import randint
import matplotlib.pyplot as plt

# Make plots appear inline
%matplotlib inline

# Generate random data
x = [2**i for i in range(0,10)]
y = [2**randint(0,9) for i in range(0,10)]
# Get the current axis
axis = plt.gca()
# Set log scale on x axis
axis.set_xscale('log')
# Set log scale on y axis
axis.set_yscale('log')
# Plot the data ('o' makes it a scatter plot)
axis.plot(x,y,'o')

Write your answer to the questions here.

3. Statistical testing of a network property I (3 points)

In this exercise we will investigate the clustering coefficient of the E. coli regulatory network and compare it to the clustering coefficient of ER graphs with the same number of vertices and with a similar probability for an edge as in the E. coli regulatory network.

Define a function for computing the clustering coefficient of a graph. For a directed graph allowing self loops the clustering coefficient of a vertex \(v\) is defined as \[ \frac{E_v}{k_v\cdot k_v} \] where \(E_v\) is the number of edges between the outneighbors of \(v\) and \(k_v\) is the outdegree of \(v\). The global clustering coefficient is then the average of the clustering coefficients of all vertices. Set the clustering coefficient of vertices of degree 0 to 0 in your analysis. What is the clustering coefficient of the E. coli regulatory network?

Your solution:

In []:

def clustering_coefficient(graph):
    # Compute the clustering coefficient
    
    # Return the clustering coefficient
    return 0.0

print(clustering_coefficient(ecoli_network))

What is the number of vertices and edges of the E. coli regulatory network? For an ER graph with the same number of vertices and edges, what would be the probability of having an edge between two vertices?

Your solution:

Write a function that creates an ER graph with the given number of vertices and probability for an edge to exist. Create an ensemble of 100 ER graphs with the same amount of vertices as the E. coli regulatory network using the probability computed above for the existence of an edge. Compute the clustering coefficient of each network and plot these as a histogram. As compared to the ER graphs, is the E. coli regulatory network highly clustered? Why? / Why not?

Your solution:

In []:

def ergraph(n, p):
    # Create an ER graph with n vertices and probability p for an edge to exist
    graph = {}
    
    return graph

# Create an ensemble of ER graphs and compute their clustering coefficients

# Plot the histogram of the clustering coefficients

Write your answer to the questions here.

4. Statistical testing of a network property II (2 points)

ER graphs have a very different degree distribution as compared to the E. coli regulatory network. Therefore the result of the previous exercise might just be a consequence of having a particular degree distribution. In this exercise we will investigate if the clustering coefficient of the E. coli regulatory network is exceptional for graphs with the same degree distribution.

First we will create random networks with the same degree distribution as the E. coli regulatory network. One way to do this is to start with the E. coli regulatory network and repeatedly choose two edges \(a\rightarrow b\) and \(c\rightarrow d\) and switch them by removing the original edges and inserting the edges \(a\rightarrow d\) and \(c\rightarrow b\). Note that the degrees of the vertices stay the same. Your first task is to write a function that takes as input a graph, creates a copy of it and performs \(m\) random switches on it where \(m\) is the number of edges in the graph. Create one randomly switched graph starting with the E. coli regulatory network using your new function and plot its degree distribution to make sure it looks the same as for the original graph.

Your solution:

In []:

def switchgraph(graph):
    # Create a copy of the original graph
    switchedgraph = {}
 
    # Perform m random switches

    # Return the graph
    return switchedgraph

# Create a random graph and plot its degree distribution

Create 100 random graphs with the function you wrote and compute the clustering coefficient of each graph. Plot a histogram of the clustering coefficients. Is the clustering coefficient of the E. coli regulatory network exceptional as compared to this ensemble of graphs? Why? / Why not?

Your solution: