In this task, we learn to generate random sequences using the Bernoulli model. This is also an introduction to Jupyter and python.
The code below generates and prints a random DNA sequence of length 50 using the uniform distribution. Run the code to produce the output.
import random
dna_bases = ['A', 'C', 'G', 'T']
# generate a random sequence
def uniform_bernoulli_sequence(symbols, length):
return ''.join(random.choice(symbols) for i in range(length))
seq = uniform_bernoulli_sequence(dna_bases, 50)
print(seq)
Next we want to use a specified distribution instead of a uniform distribution. Complete the following code to accomplish this.
import random
base_distribution = {'A' : 0.345, 'C' : 0.158, 'G' : 0.059, 'T' : 0.437}
# choose a random symbol according to a given distribution
def weighted_choice(dist):
# generate a random number in [0,1)
r = random.random()
# make a choise based on r
choices = dist.keys()
# ...
# generate a random sequence
def bernoulli_sequence(symbol_distribution, length):
return ''.join(weighted_choice(symbol_distribution) for i in range(length))
seq = bernoulli_sequence(base_distribution, 50)
print(seq)
Modify the function weighted_choice()
so that instead of probabilities it uses arbitrary weights that need
not sum up to one. The probability of generating a given base should be
its weight divided by the sum of all weights.
import random
base_counts = {'A' : 30, 'C' : 11, 'G' : 5, 'T' : 44}
# choose a random symbol according to given weights
def weighted_choice(weights):
total = sum(weights.values())
# ...
# generate a random sequence
def bernoulli_sequence(symbol_weights, length):
return ''.join(weighted_choice(symbol_weights) for i in range(length))
seq = bernoulli_sequence(base_counts, 50)
print(seq)
Generate
a sequence of length 50 using the first order Markov model given below.
Note that transition matrix contains weights instead of probabilities,
but you can use each row of the matrix as an argument to weighted_choice()
in exercise 1(b).
import random
base_counts = {'A' : 21, 'C' : 12, 'G' : 7, 'T' : 9}
transition_matrix = {
'A': {'A': 7, 'C': 7, 'G': 4, 'T': 3},
'C': {'A': 6, 'C': 3, 'G': 2, 'T': 1},
'G': {'A': 4, 'C': 2, 'G': 1, 'T': 0},
'T': {'A': 5, 'C': 3, 'G': 1, 'T': 0}
}
# choose a random symbol according to given weights
def weighted_choice(weights):
# Copy here from 1(b)
# generate a random sequence
def markov1_sequence(symbol_weights, transition_matrix, length):
# ...
seq = markov1_sequence(base_counts, transition_matrix, 50)
print(seq)
In this task, we learn to compute k-mer frequencies and transition probabilities.
Compute and print the base (1-mer) counts/frequencies of the sequence given below. The result should be in a form that can be used for generating a Bernoulli sequence using the code in Task 1. Verify this by generating a short sequence.
import random
from collections import Counter # This is a hint
seq = 'TGATCTTAAGTCACTTGCTTTGAAATGTTTGACGTATCATGATCTGATTA'
# Compute 1-mer counts in sequence
def symbol_counts(sequence):
# ...
base_counts = symbol_counts(seq)
print(base_counts)
# Copy here weighted_choice and bernoulli_sequence from 1(b)
# print a random sequence
print(bernoulli_sequence(base_counts, 50))
Generate and print the 2-mer counts/frequencies in the sequence given below.
from collections import Counter
seq = 'TGATCTTAAGTCACTTGCTTTGAAATGTTTGACGTATCATGATCTGATTA'
# Compute 2-mer frequencies
def twomer_counts(sequence):
# ...
counts = twomer_counts(seq)
print(counts)
Compute and print the 1st order Markov model transition probabilities/weights from the 2-mer counts of part (b). The result should be in a form that can be used for generating a random sequence using the code in Task 2. Verify this by generating a short sequence.
import random
from collections import Counter
seq = 'TGATCTTAAGTCACTTGCTTTGAAATGTTTGACGTATCATGATCTGATTA'
# Compute 1-mer frequencies
def symbol_counts(sequence):
# Copy from 3(a)
base_counts = symbol_counts(seq)
# Compute 2-mer counts
def twomer_counts(sequence):
# Copy from 3(b)
counts = twomer_counts(seq)
def transition_matrix(base_counts, twomer_counts):
# ...
print(transition_matrix(base_counts, counts))
def kmer_counts(k, sequence):
# ...
The code below produces 100 random integers in range [0,9] and draws a histogram of them.
from random import randint
import matplotlib.pyplot as plt
# Make plots appear inline
%matplotlib inline
# Generate the random numbers
numbers = [randint(1,9) for i in range(0,100)]
print(numbers)
# Plot the histogram
plt.hist(numbers, bins=max(numbers))
Write code that plots the histogram of the k-mer count values computed in part (a). This is the k-mer spectrum. Test it with a small example.
# Your code here
The following code reads a 128kb segment of Human genome. Plot its 5-mer spectrum.
import urllib.request
url="https://www.cs.helsinki.fi/u/tpkarkka/teach/16-17/MAiB/dna.128k"
# Open the file. This will work exactly like a regular file.
file = urllib.request.urlopen(url)
seq = file.read()
# plot 5-mer spectrum of seq
# ...