Exercise 1: Global network models (DL 29.9.)

Lecturer: Leena Salmela

Your name:

Name of your pair:

Task 1: An ER graph and its degree distribution (2 points)

In this exercise we will create an ER graph and plot its degree distribution. We will use the networkx library to represent graphs. The code below creates two graphs using networkx.

In [ ]:
import networkx as nx
import matplotlib.pyplot as plt

# Define a graph by creating an empty graph and adding vertices and edges one by one
G1 = nx.DiGraph()
G1.add_node('A')
G1.add_node('B')
G1.add_edge('A','B')
G1.add_edge('A','A')

# Define another graph by giving a list of edges
G2 = nx.DiGraph([('C','D'), ('D','E'), ('C','E')])

# Print the edge lists
print(G1.edges())
print(G2.edges())

a) Creating an ER graph

Write a function that creates an ER graph with the given number of vertices and probability for an edge to exist. Remember to add also self loops with the same probability. Create a graph with 1000 vertices labeled {1,...,1000} where each directed pair of vertices is connected by an edge with probability 0.05. Create the graph one edge at a time. Do not use the library functions for creating random graphs available in the networkx library.

In [ ]:
# Create an ER graph with n vertices and probability p for an edge to exist between any two vertices
def er_graph(n,p):
    G = nx.DiGraph()
    # Add your code here
    
    return G

# Create the graph
G = er_graph(1000,0.05)
print(G.edges())

b) Degree histogram

Plot the degree histogram of the random graph you created in the previous step. Check the degree function provided by networkx.

In [ ]:
# Write your code here

Task 2: Degree distribution of E. coli regulatory network (4 points)

In this exercise we will read the E. coli regulatory network and plot its degree distribution.

a) Loading the network

The E. coli regulatory network can be downloaded from RegulonDB:

http://regulondb.ccg.unam.mx/menu/download/datasets/files/network_tf_gene.txt

The header of the file contains general information on several commented lines starting with #. This is followed by the description of the network where each line describes one edge of the network: the first column contains the transcription factor (TF) that regulates the gene in the second column thus describing an edge pointing from the TF to the gene. Note that the TF names start with upper case letters and the gene names with lower case letters. For example accB is the gene coding for the TF AccB. Thus when constructing the regulatory network treat accB and AccB as the same vertex. For this exercise you can ignore the rest of the columns. Note that the same edge can be present several times in the file.

Write code that reads the E. coli regulatory network into a graph data structure similar to that used in the first exercise. You can use the gene names as vertex labels. You might find the python string method split helpful in parsing the file.

In [ ]:
import urllib.request  # library handling URLs

ecoli_network = nx.DiGraph()

# E. coli network file
networkfile="http://regulondb.ccg.unam.mx/menu/download/datasets/files/network_tf_gene.txt"
# Open the file. This will work exactly like a regular file.
data = urllib.request.urlopen(networkfile)
for line in data:
    line = line.decode('utf-8')  # decode the byte array into a string
    # Add code to load the data here

b) Degree histogram

Plot a histogram of the degree distribution of the E. coli regulatory network.

In [ ]:
# Write your code here

c) Degree distribution as a scatter plot

Plot the degree distribution also as a scatter plot where x axis is the degree and y axis gives the number of vertices having the given degree. Use logarithmic scale on both axis. Try to fit a power law distribution to the data by plotting a curve $y=a*x^{-k}$ with suitable values of $a$ and $k$. The code below shows an example of drawing a scatter plot (of random data) and manipulating the axis of a plot.

In [ ]:
from random import randint
import matplotlib.pyplot as plt

# Generate random data
x = [2**i for i in range(0,10)]
y = [2**randint(0,9) for i in range(0,10)]
# Get the current axis
axis = plt.gca()
# Set log scale on x axis
axis.set_xscale('log')
# Set log scale on y axis
axis.set_yscale('log')
# Plot the data ('o' makes it a scatter plot)
axis.plot(x,y,'o')
In [ ]:
# Write your code here

d) Conclusions

Does the degree distribution of the E. coli regulatory network follow a power law? Why? / Why not?

Write your answer to the questions here.

Task 3: Statistical testing of a network property (4 points)

In this exercise we will investigate the clustering coefficient of the E. coli regulatory network and compare it to the clustering coefficient of ER graphs with the same number of vertices and with a similar probability for an edge as in the E. coli regulatory network.

a) Clustering coefficient of a directed graph

Define a function for computing the clustering coefficient of a directed graph. For a directed graph allowing self loops the clustering coefficient of a vertex $v$ is defined as $$ \frac{E_v}{k_v\cdot k_v} $$ where $E_v$ is the number of edges between the neighbors of $v$ and $k_v$ is the degree of $v$. Note that both in and out neighbors of the vertex are taken into account. The global clustering coefficient is then the average of the clustering coefficients of all vertices. Set the clustering coefficient of vertices of degree 0 to 0 in your analysis. What is the clustering coefficient of the E. coli regulatory network? You might find the predecessors and successors functions of networkx helpful.

In [ ]:
def clustering_coefficient(graph):
    # Add code to compute the clustering coefficient
    
    # Return the clustering coefficient
    return 0.0

print(clustering_coefficient(ecoli_network))

b) ER graph

What is the number of vertices and edges of the E. coli regulatory network? For an ER graph with the same number of vertices and edges, what would be the probability of having an edge between two vertices?

In [ ]:
# Write your code here

c) Clustering coefficients of ER graphs

Use the function from Task 1a to create an ensemble of 100 ER graphs with the same amount of vertices as the E. coli regulatory network using the probability computed above for the existence of an edge. Compute the clustering coefficient of each network and plot these as a histogram.

In [ ]:
# Create an ensemble of ER graphs and compute their clustering coefficients

# Plot the histogram of the clustering coefficients

d) Conclusions

As compared to the ER graphs, is the E. coli regulatory network highly clustered? Why? / Why not?

Write your answer to the questions here.