Lecturer: Juha Kärkkäinen
Your name:
Name of your pair:
In this task, we learn to generate an instance of the Planted Motif Problem.
In Python, strings are immutable and you cannot change individual
symbols, which makes random mutations tricky. A mutable alternative is bytearray
, which is demonstrated in the following code.
# transform string to bytearray using encode
s = 'AAAAAA'
seq = bytearray(s.encode())
print(seq)
# for literal strings a more direct way is:
seq = bytearray(b'AAAAAA')
print(seq)
# transform bytearray to string using decode
print(seq.decode())
# transform a single symbol using ord
seq[1] = ord('G')
print(seq.decode())
# seq[1] = b'G' does not work because ord('G') is a single integer
# but b'G' or 'G'.encode() is a sequence of integers (of length 1)
# bytearrays support assigning to slices too
seq[2:4] = b'TT'
print(seq.decode())
seq[5:6] = b'C'
print(seq.decode())
seq[4:5] = seq[1:4]
print(seq.decode())
Your task is to write a function that generates an instance of the Planted Motif Problem. The problem and its parameters are described in the lecture notes. One difference to lecture notes is that the unmutated motif is given as an argument, which helps in checking if the correct motif was found. The output should be a list of strings. Some helpful functions are already provided in the code.
import random
dna_bases = 'ACGT'
# dna_bases.encode() for byte encoding
# dna_bases.lower() for lower case bases
# Mixing upper and lower case symbols may help in debugging
def uniform_bernoulli_sequence(symbols, length):
return ''.join(random.choice(symbols) for i in range(length))
def mutate(symbol, symbols):
new_symbol = random.choice(symbols)
while (new_symbol == symbol):
new_symbol = random.choice(symbols)
return new_symbol
def mutate_random_positions(seq, d, symbols):
pos = random.sample(range(len(seq)), d)
for p in pos:
seq[p] = mutate(seq[p], symbols)
def planted_motif_instance(t, n, motif, d):
l = len(motif)
instance = []
# ...
# your code here
# ...
return instance
print(planted_motif_instance(5, 20, 'AAAAAAAAA', 2))
In this task, we implement the Gibbs Sampler algorithm described in the lectures.
First, complete the following function that computes the consensus of a motif, which is given as a list of l-mers.
from collections import Counter
def consensus(motif):
t = len(motif)
l = len(motif[0])
consensus_list = []
for i in range(l):
# ...
# your code here
# ...
consensus_list.append(consensus_symbol)
return ''.join(consensus_list)
motif = ['ACGTT', 'AGGTA', 'ACCAT']
print(consensus(motif))
We also need to compute a profile from a motif. The code is given here in full.
from math import log
# generate a profile matrix from a motif given as a list of l-mers
# the profile entries are logarithms of pseudocounts
def profile_matrix(motif):
t = len(motif)
l = len(motif[0])
profile = []
for col in range(l):
counts = Counter({'A':1, 'C':1, 'G':1, 'T':1})
for i in range(t):
counts[motif[i][col]] += 1
logfreqs = {}
for s in counts.keys():
logfreqs[s] = log(counts[s])
profile.append(logfreqs)
return profile
print(motif)
profile = profile_matrix(motif)
print(profile)
During its operation Gibbs Sampler needs to compute l-mer likelihoods or weights using a given profile. Below is a function for that. Also provided as a first application is a function for computing a motif score.
# compute the logweight of an l-mer according to a profile
# the output is the sum of the logarithms of the pseudoweights, which can be used for comparison
def lmer_logweight(lmer, profile):
l = len(lmer)
logweight = 0.0
for i in range(l):
logweight += profile[i][lmer[i]]
return logweight
print(lmer_logweight('ACGTT', profile))
def motif_score(motif):
profile = profile_matrix(motif)
score = 0.0
for lmer in motif:
score += lmer_logweight(lmer, profile)
return score
print(motif_score(motif))
Another place, where Gibbs sampler needs the l-mer weights, is in
choosing a random l-mer from a sequence according to the l-mer weights.
Complete the function below that does this. Here we need the actual
weights instead of their logarithms. You can use the function weighted_choice()
from Exercise 1(b) to make the choice once you have computed the weights.
from math import exp
# the actual weight for random selection is obtained as exp(logweight)
logweight = lmer_logweight('ACGTT', profile)
weight = exp(logweight)
print(weight)
def weighted_choice(weights):
# copy from exercise 1(b)
# returns a randomly chosen lmer from seq
def lmer_choice(seq, profile):
n = len(seq)
l = len(profile)
# ...
# your code here
# ...
seq = 'ACGTTGCACGTA'
lmer = lmer_choice(seq, profile)
print(lmer)
Complete the following code to implement Gibbs Sampler. Note that the initial motif is chosen according to a given profile. Here the uniform profile is used but later we can use a profile obtained by random projection.
def uniform_profile(l):
return [{'A':1, 'C':1, 'G':1, 'T':1}] * l
def gibbs_sampler(seqs, l, initial_profile, no_of_rounds):
t = len(seqs)
n = len(seqs[0])
# generate a random motif from the initial profile
motif = []
for i in range(t):
motif.append(lmer_choice(seqs[i], initial_profile))
best_motif = motif
best_score = motif_score(motif)
time_from_last_improvement = 0
# stop when there is no improvement for long enough
while time_from_last_improvement < no_of_rounds:
time_from_last_improvement += 1
# ...
# Your code here
# ...
if score > best_score:
best_motif = motif
best_score = score
time_from_last_improvement = 0
return best_motif
sequences = planted_motif_instance(10, 20, 'ACGTTGAG', 1)
print(sequences)
initial_profile = uniform_profile(8)
motif = gibbs_sampler(sequences, 8, initial_profile, 50)
print(motif)
print(consensus(motif))
In this task, we implement the random projection algorithm.
The code below implements random projection in its simplest form: It distributes the input l-mers into buckets based on a random projection and returns the contents of the largest bucket as the motif.
from collections import defaultdict
def projection(lmer, pos):
l = len(lmer)
k = len(pos)
return ''.join(lmer[pos[i]] for i in range(k))
print(projection('ACGTTGAG', [1, 3, 7]))
def simple_random_projection(seqs, l, k):
t = len(seqs)
n = len(seqs[0])
pos = random.sample(range(l), k)
buckets = defaultdict(list)
for i in range(t):
for j in range(n-l+1):
lmer = seqs[i][j:j+l]
kmer = projection(lmer, pos)
buckets[kmer].append(lmer)
max_size = 0
for kmer in buckets.keys():
if len(buckets[kmer]) > max_size:
motif = buckets[kmer]
max_size = len(buckets[kmer])
return motif
print(motif)
print(consensus(motif))
Write the full random projection algorithm including the following features (1 point for each feature):
def random_projection(seqs, l, k, no_of_rounds)
# ...
Try to solve the Planted (15,4)-Motif Problem using the algorithms implemented above. Try different parameter settings to find one that works well. If successful, try the more challenging (14,4)-motif problem. Can you even solve the (13,4)-motif problem?
sequences = planted_motif_instance(20, 600, 'ACGTTGAGACCATGG', 4)
# ...
sequences = planted_motif_instance(20, 600, 'ACGTTGAGACCATG', 4)