1. (a) What does Mummer do? What techniques it uses? * MUMmer aligns whole genome sequences. * It uses suffix trees. * see http://mummer.sourceforge.net/MUMmer.pdf 1. (b) Reverse engineer SuDS Genome Browser by testing it with various inputs. What techniques it appears to be implementing from the course lectures? * SuDS (Succinct Data Structures) is a visual interface to the suffix tree of Human Genome running on a server machine. It demonstrates the capabilities of suffix tree concept for analyzing and searching genome sequences. * see http://www.cs.helsinki.fi/group/suds/ ________________________________________________________________________________ 2. A typical day at biochemistry laboratory. You are the only bioinformatician in the lab, and your boss comes with an urgent task: "We need to replicate this piece of DNA by PCR, but we do not have primers designed for it". After some discussion, you figure out that the problem is essentially as follows: In both ends of the DNA to be replicated, you have regions where to select short fragments (10-20 bp) that have two properties: (1) their melting temperatures should be similar; and (2) the primers created for the selected fragments should not bind anywhere else in the DNA. How would you go for selecting the two fragments if you needed to do the primer design just based on the course lectures? You may assume that the full DNA sequence of the organism in question is available and you may ignore property (1) for this exercise. * Find in both ends of the given DNA a primer region, i.e., short fragments (10-20bp) which do not exist anywhere else in the given DNA. * for each potential primer: check with local alignment that there is no such sequence anywhere else in the sequence; problem: running time is high * build suffix tree and backtrack it for each potential primer to check that there is no such sequence anywhere else in the sequence ________________________________________________________________________________ 3. In sequence assembly methods like Overlap-Layout-Consensus, the first part of finding the overlaps between reads is a heavy procedure, because every pair of reads needs to be considered. This becomes impossible to do with the brute force pairwise alignments when applied to the next-generation short-read data (454, Solexa, SOLiD). Show that under the assumption that the data is free of measurement errors and SNPs, one can find all the overlaps in time O(n+#overlaps), where n is the total length of all reads and #overlaps is the number of overlaps longer than a given threshold. Can you somehow extend the method to allow measurent errors and SNPs? Hint: Exploit generalized suffix tree. * built generalized suffix tree of all reads: this tree has n leaves and can be constructed in O(n) time * for each read: - search it in suffix tree - go back towards the root as long as threshold is reached - for each node on the way back: follow other paths starting with symbol # and save overlaps * For details, see the book Gusfield: Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, page 137 ________________________________________________________________________________ 4. Perform global alignment of the sequences s = ACGTCCCTAGT t = ATCACGCTTA with mismatch penalty -1, indel penalty -1 and uniform match score 1 by constructing the dynamic programming matrix. What is the optimal alignment or alignments and corresponding score? see ex3_4.pdf ________________________________________________________________________________ 5. Perform local alignment of the sequences s = GTGTATGAGTGCGACTCA t = CCGCTACTATAGACACG with mismatch penalty -1, indel penalty -2 and uniform match score 1 by constructing the dynamic programming matrix. What is the optimal alignment or alignments and corresponding score? see ex3_5.pdf ________________________________________________________________________________ 6. Reverse translation. Assume that you have the complete DNA sequence of some eukaryotic organism and the amino acid sequence of some protein coded by a gene in the organism DNA. The gene coding the protein is unknown and your task is to locate it from the DNA. Give a dynamic programming algorithm to locate a gene (a continuous region in DNA consisting of exons and introns) containing fewest number of introns, such that the concatenation of exons can be translated into the given protein sequence. What is the running time of your algorithm? Can you plug in other constrainst to the gene (overall span in DNA, maximum/minimum distance between consecutive exons, etc.?). * Let P[1...m] be the protein sequence and T[1...n] the DNA sequence * Let match := "codon T[j-2,j] codes amino acid P[i]" * Let I[i,j] be "minimum amount of introns needed to align P[1...i] with in T[1...j]" * I(i,j) = min { I(i-1,j-3) if match, I(i,j')+1 for 0<=j'<=j-1 } I(0,j) = 0 I(i,0) = inf for i!=0 * When filling I row-by-row, update the i' giving the minimum at each step - Then min{} can be computed in constant time - Total running time O(mn) * Limit on maximum intron length can be enforced by j-R<=j'<=j-1 - O(mnR) running time when computed brute-force - Can be improved to O(mn) by keeping so-called *left-most minima* of I(i,j') values on a sliding window of length R; details omitted here - Limit on maximum gene length requires higher-dimensional dynamic programming, e.g. by finding a recursion to compute table such that I[i,j,k] = "minimum j' such that P[1..i] is aligned to T[j'..j] with exactly k introns".