Exercise 1: k-mer statistics (DL: 15.9.)

Lecturer: Juha Kärkkäinen

Your name:

Name of your pair:

Task 1: Generating Bernoulli sequences (2 points)

In this task, we learn to generate random sequences using the Bernoulli model. This is also an introduction to Jupyter and python.

The code below generates and prints a random DNA sequence of length 50 using the uniform distribution. Run the code to produce the output.

In [ ]:
import random

dna_bases = ['A', 'C', 'G', 'T']

# generate a random sequence
def uniform_bernoulli_sequence(symbols, length):
    return ''.join(random.choice(symbols) for i in range(length))

seq = uniform_bernoulli_sequence(dna_bases, 50)
print(seq)

a) Arbitrary distribution

Next we want to use a specified distribution instead of a uniform distribution. Complete the following code to accomplish this.

In [ ]:
import random

base_distribution = {'A' : 0.345, 'C' : 0.158, 'G' : 0.059, 'T' : 0.437}

# choose a random symbol according to a given distribution 
def weighted_choice(dist):
    # generate a random number in [0,1)
    r = random.random()
    # make a choise based on r
    choices = dist.keys()
    # ...
    

# generate a random sequence
def bernoulli_sequence(symbol_distribution, length):
    return ''.join(weighted_choice(symbol_distribution) for i in range(length))
    
seq = bernoulli_sequence(base_distribution, 50)
print(seq)

b) Arbitrary weights

Modify the function weighted_choice() so that instead of probabilities it uses arbitrary weights that need not sum up to one. The probability of generating a given base should be its weight divided by the sum of all weights.

In [ ]:
import random

base_counts = {'A' : 30, 'C' : 11, 'G' : 5, 'T' : 44}

# choose a random symbol according to given weights
def weighted_choice(weights):
    total = sum(weights.values())
    # ...


# generate a random sequence
def bernoulli_sequence(symbol_weights, length):
    return ''.join(weighted_choice(symbol_weights) for i in range(length))

seq = bernoulli_sequence(base_counts, 50)
print(seq)

Task 2: First order Markov model (2 points)

Generate a sequence of length 50 using the first order Markov model given below. Note that transition matrix contains weights instead of probabilities, but you can use each row of the matrix as an argument to weighted_choice() in exercise 1(b).

In [ ]:
import random

base_counts = {'A' : 21, 'C' : 12, 'G' : 7, 'T' : 9}

transition_matrix = {
    'A': {'A': 7, 'C': 7, 'G': 4, 'T': 3}, 
    'C': {'A': 6, 'C': 3, 'G': 2, 'T': 1}, 
    'G': {'A': 4, 'C': 2, 'G': 1, 'T': 0}, 
    'T': {'A': 5, 'C': 3, 'G': 1, 'T': 0}
}

# choose a random symbol according to given weights
def weighted_choice(weights):
    # Copy here from 1(b)

# generate a random sequence
def markov1_sequence(symbol_weights, transition_matrix, length):
    # ...


seq = markov1_sequence(base_counts, transition_matrix, 50)
print(seq)

Task 3: Computing k-mer frequencies and transition probabilities (3 points)

In this task, we learn to compute k-mer frequencies and transition probabilities.

a) 1-mer frequencies

Compute and print the base (1-mer) counts/frequencies of the sequence given below. The result should be in a form that can be used for generating a Bernoulli sequence using the code in Task 1. Verify this by generating a short sequence.

In [ ]:
import random
from collections import Counter  # This is a hint

seq = 'TGATCTTAAGTCACTTGCTTTGAAATGTTTGACGTATCATGATCTGATTA'

# Compute 1-mer counts in sequence
def symbol_counts(sequence):
    # ...
    

base_counts = symbol_counts(seq)
print(base_counts)

# Copy here weighted_choice and bernoulli_sequence from 1(b)

# print a random sequence
print(bernoulli_sequence(base_counts, 50))

b) 2-mer frequencies

Generate and print the 2-mer counts/frequencies in the sequence given below.

In [ ]:
from collections import Counter

seq = 'TGATCTTAAGTCACTTGCTTTGAAATGTTTGACGTATCATGATCTGATTA'

# Compute 2-mer frequencies
def twomer_counts(sequence):
    # ...
    

counts = twomer_counts(seq)
print(counts)
    

c) Computing first order Markov models

Compute and print the 1st order Markov model transition probabilities/weights from the 2-mer counts of part (b). The result should be in a form that can be used for generating a random sequence using the code in Task 2. Verify this by generating a short sequence.

In [ ]:
import random
from collections import Counter

seq = 'TGATCTTAAGTCACTTGCTTTGAAATGTTTGACGTATCATGATCTGATTA'

# Compute 1-mer frequencies
def symbol_counts(sequence):
    # Copy from 3(a)

base_counts = symbol_counts(seq)

# Compute 2-mer counts
def twomer_counts(sequence):
    # Copy from 3(b)

counts = twomer_counts(seq)

def transition_matrix(base_counts, twomer_counts):
    # ...

print(transition_matrix(base_counts, counts))

Task 4: k-mer spectra (3 points)

In this task, we learn to plot k-mer spectra.

a) k-mer counts/frequencies

Write a function that, given k and a sequence, computes the k-mer counts/frequencies in the sequence. Test the function with some example sequence.

In [ ]:
def kmer_counts(k, sequence):
    # ...

b) k-mer spectra

The code below produces 100 random integers in range [0,9] and draws a histogram of them.

In [ ]:
from random import randint
import matplotlib.pyplot as plt

# Make plots appear inline
%matplotlib inline

# Generate the random numbers
numbers = [randint(1,9) for i in range(0,100)]
print(numbers)

# Plot the histogram
plt.hist(numbers, bins=max(numbers))

Write code that plots the histogram of the k-mer count values computed in part (a). This is the k-mer spectrum. Test it with a small example.

In [ ]:
# Your code here

c) k-mer spectrum of Human genome

The following code reads a 128kb segment of Human genome. Plot its 5-mer spectrum.

In [ ]:
import urllib.request

url="https://www.cs.helsinki.fi/u/tpkarkka/teach/16-17/MAiB/dna.128k"
# Open the file. This will work exactly like a regular file.
file = urllib.request.urlopen(url)
seq = file.read()

# plot 5-mer spectrum of seq
# ...