Exercise session 3

Introduction to bioinformatics, Autumn 2009

Group 1: Thursday 1.10 12-14 Exactum BK106
Group 2: Thursday 1.10 16-18 Exactum BK106.

General instructions:

Problems for each exercise session will be distributed approximately one week before the session. You are expected to be prepared to present your solutions in the exercise session.

In addition, you need to send notes of the assignments you are going to mark to Laura Langohr by email before exercises (Thursdays 12.15).

These exercise notes should contain a brief description of the steps you took to solve the assignment, as well as the results. Important: When sending email, use subject of form "ITB exercise X, where X is the exercise session number (1/2). Send your notes in email text body. If you need to include a figure, send it as an attachment.

Assignments

  1. (a) What does Mummer do? What techniques it uses?
    (b) Reverse engineer SuDS Genome Browser by testing it with various inputs. What techniques it appears to be implementing from the course lectures?

  2. A typical day at biochemistry laboratory. You are the only bioinformatician in the lab, and your boss comes with an urgent task: "We need to replicate this piece of DNA by PCR, but we do not have primers designed for it". After some discussion, you figure out that the problem is essentially as follows: In both ends of the DNA to be replicated, you have regions where to select short fragments (10-20 bp) that have two properties: (1) their melting temperatures should be similar; and (2) the primers created for the selected fragments should not bind anywhere else in the DNA. How would you go for selecting the two fragments if you needed to do the primer design just based on the course lectures? You may assume that the full DNA sequence of the organism in question is available and you may ignore property (1) for this exercise.

  3. In sequence assembly methods like Overlap-Layout-Consensus, the first part of finding the overlaps between reads is a heavy procedure, because every pair of reads needs to be considered. This becomes impossible to do with the brute force pairwise alignments when applied to the next-generation short-read data (454, Solexa, SOLiD). Show that under the assumption that the data is free of measurement errors and SNPs, one can find all the overlaps in time O(n+#overlaps), where n is the total length of all reads and #overlaps is the number of overlaps longer than a given threshold. Can you somehow extend the method to allow measurent errors and SNPs? Hint: Exploit generalized suffix tree.

  4. Perform global alignment of the sequences

        
        s = ACGTCCCTAGT
        t = ATCACGCTTA
      
    with mismatch penalty -1, indel penalty -1 and uniform match score 1 by constructing the dynamic programming matrix. What is the optimal alignment or alignments and corresponding score?

  5. Perform local alignment of the sequences

        
      s = GTGTATGAGTGCGACTCA
      t = CCGCTACTATAGACACG
      
    with mismatch penalty -1, indel penalty -2 and uniform match score 1 by constructing the dynamic programming matrix. What is the optimal alignment or alignments and corresponding score?

  6. Reverse translation. Assume that you have the complete DNA sequence of some eukaryotic organism and the amino acid sequence of some protein coded by a gene in the organism DNA. The gene coding the protein is unknown and your task is to locate it from the DNA. Give a dynamic programming algorithm to locate a gene (a continuous region in DNA consisting of exons and introns) containing fewest number of introns, such that the concatenation of exons can be translated into the given protein sequence. What is the running time of your algorithm? Can you plug in other constrainst to the gene (overall span in DNA, maximum/minimum distance between consecutive exons, etc.?).