Exercise 1: Randomized Motif Finding (DL: 22.9. at 23:55)¶

Lecturer: Juha Kärkkäinen

Your name:

Name of your pair:

Task 1: Generate Planted Motif Problem Instance (2 points)¶

In this task, we learn to generate an instance of the Planted Motif Problem.

In Python, strings are immutable and you cannot change individual symbols, which makes random mutations tricky. A mutable alternative is bytearray, which is demonstrated in the following code.

# transform string to bytearray using encode
s = 'AAAAAA'
seq = bytearray(s.encode())
print(seq)

# for literal strings a more direct way is:
seq = bytearray(b'AAAAAA')
print(seq)

# transform bytearray to string using decode
print(seq.decode())

# transform a single symbol using ord
seq[1] = ord('G')
print(seq.decode())
# seq[1] = b'G' does not work because ord('G') is a single integer
# but b'G' or 'G'.encode() is a sequence of integers (of length 1)

# bytearrays support assigning to slices too
seq[2:4] = b'TT'
print(seq.decode())

seq[5:6] = b'C'
print(seq.decode())

seq[4:5] = seq[1:4]
print(seq.decode())

Your task is to write a function that generates an instance of the Planted Motif Problem. The problem and its parameters are described in the lecture notes. One difference to lecture notes is that the unmutated motif is given as an argument, which helps in checking if the correct motif was found. The output should be a list of strings. Some helpful functions are already provided in the code.

import random

dna_bases = 'ACGT'
# dna_bases.encode() for byte encoding
# dna_bases.lower() for lower case bases
# Mixing upper and lower case symbols may help in debugging

def uniform_bernoulli_sequence(symbols, length):
    return ''.join(random.choice(symbols) for i in range(length))

def mutate(symbol, symbols):
    new_symbol = random.choice(symbols)
    while (new_symbol == symbol):
        new_symbol = random.choice(symbols)
    return new_symbol

def mutate_random_positions(seq, d, symbols):
    pos = random.sample(range(len(seq)), d)
    for p in pos:
        seq[p] = mutate(seq[p], symbols)

def planted_motif_instance(t, n, motif, d):
    l = len(motif)
    instance = []
    # ...
    # your code here
    # ...
    return instance
    
print(planted_motif_instance(5, 20, 'AAAAAAAAA', 2))

Task 2: Gibbs Sampler (3 points)¶

In this task, we implement the Gibbs Sampler algorithm described in the lectures.

a) Motif consensus (and profile)¶

First, complete the following function that computes the consensus of a motif, which is given as a list of l-mers.

from collections import Counter

def consensus(motif):
    t = len(motif)
    l = len(motif[0])
    consensus_list = []
    for i in range(l):
        # ...
        # your code here
        # ...
        consensus_list.append(consensus_symbol)
    return ''.join(consensus_list)

motif = ['ACGTT', 'AGGTA', 'ACCAT']
print(consensus(motif))

We also need to compute a profile from a motif. The code is given here in full.

from math import log

# generate a profile matrix from a motif given as a list of l-mers
# the profile entries are logarithms of pseudocounts
def profile_matrix(motif):
    t = len(motif)
    l = len(motif[0])
    profile = []
    for col in range(l):
        counts = Counter({'A':1, 'C':1, 'G':1, 'T':1})
        for i in range(t):
            counts[motif[i][col]] += 1
        logfreqs = {}
        for s in counts.keys():
            logfreqs[s] = log(counts[s])
        profile.append(logfreqs)
    return profile

print(motif)
profile = profile_matrix(motif)
print(profile)

b) l-mer weights¶

During its operation Gibbs Sampler needs to compute l-mer likelihoods or weights using a given profile. Below is a function for that. Also provided as a first application is a function for computing a motif score.

# compute the logweight of an l-mer according to a profile
# the output is the sum of the logarithms of the pseudoweights, which can be used for comparison
def lmer_logweight(lmer, profile):
    l = len(lmer)
    logweight = 0.0
    for i in range(l):
        logweight += profile[i][lmer[i]]
    return logweight

print(lmer_logweight('ACGTT', profile))

def motif_score(motif):
    profile = profile_matrix(motif)
    score = 0.0
    for lmer in motif:
        score += lmer_logweight(lmer, profile)
    return score

print(motif_score(motif))

Another place, where Gibbs sampler needs the l-mer weights, is in choosing a random l-mer from a sequence according to the l-mer weights. Complete the function below that does this. Here we need the actual weights instead of their logarithms. You can use the function weighted_choice() from Exercise 1(b) to make the choice once you have computed the weights.

from math import exp

# the actual weight for random selection is obtained as exp(logweight)
logweight = lmer_logweight('ACGTT', profile)
weight = exp(logweight)
print(weight)

def weighted_choice(weights):
    # copy from exercise 1(b)

    
# returns a randomly chosen lmer from seq
def lmer_choice(seq, profile):
    n = len(seq)
    l = len(profile)
    # ...
    # your code here
    # ...

seq = 'ACGTTGCACGTA'
lmer = lmer_choice(seq, profile)
print(lmer)

c) Gibbs Sampler¶

Complete the following code to implement Gibbs Sampler. Note that the initial motif is chosen according to a given profile. Here the uniform profile is used but later we can use a profile obtained by random projection.

def uniform_profile(l):
    return [{'A':1, 'C':1, 'G':1, 'T':1}] * l

def gibbs_sampler(seqs, l, initial_profile, no_of_rounds):
    t = len(seqs)
    n = len(seqs[0])
    # generate a random motif from the initial profile
    motif = []
    for i in range(t):
        motif.append(lmer_choice(seqs[i], initial_profile))
    best_motif = motif
    best_score = motif_score(motif)
    time_from_last_improvement = 0
    # stop when there is no improvement for long enough
    while time_from_last_improvement < no_of_rounds:
        time_from_last_improvement += 1
        # ...
        # Your code here
        # ...
        if score > best_score:
            best_motif = motif
            best_score = score
            time_from_last_improvement = 0
    return best_motif

sequences = planted_motif_instance(10, 20, 'ACGTTGAG', 1)
print(sequences)
initial_profile = uniform_profile(8)
motif = gibbs_sampler(sequences, 8, initial_profile, 50)
print(motif)
print(consensus(motif))

Task 3: Random Projection (3 points)¶

In this task, we implement the random projection algorithm.

The code below implements random projection in its simplest form: It distributes the input l-mers into buckets based on a random projection and returns the contents of the largest bucket as the motif.

from collections import defaultdict

def projection(lmer, pos):
    l = len(lmer)
    k = len(pos)
    return ''.join(lmer[pos[i]] for i in range(k))

print(projection('ACGTTGAG', [1, 3, 7]))

def simple_random_projection(seqs, l, k):
    t = len(seqs)
    n = len(seqs[0])
    pos = random.sample(range(l), k)
    buckets = defaultdict(list)
    for i in range(t):
        for j in range(n-l+1):
            lmer = seqs[i][j:j+l]
            kmer = projection(lmer, pos)
            buckets[kmer].append(lmer)
    max_size = 0
    for kmer in buckets.keys():
        if len(buckets[kmer]) > max_size:
            motif = buckets[kmer]
            max_size = len(buckets[kmer])
    return motif

print(motif)
print(consensus(motif))

Write the full random projection algorithm including the following features (1 point for each feature):

Use Gibbs Sampler to refine the motif.
If there are multiple buckets with size over a given threshold, try refining them all.
Repeat until no improvement is seen for a given number of rounds.

def random_projection(seqs, l, k, no_of_rounds)
    # ...

Task 4: Test the algorithms (2 points)¶

Try to solve the Planted (15,4)-Motif Problem using the algorithms implemented above. Try different parameter settings to find one that works well. If successful, try the more challenging (14,4)-motif problem. Can you even solve the (13,4)-motif problem?

sequences = planted_motif_instance(20, 600, 'ACGTTGAGACCATGG', 4)

# ...

sequences = planted_motif_instance(20, 600, 'ACGTTGAGACCATG', 4)