1. For I = GCTGCTATGCTTGGC and J = CGCGGCTATG, make a 2-mer list for J. Compute diagonal common word sums for I and J using the algorithm presented in the lectures and in the course book.
   * see ex4_1.pdf
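
   The procedure can also be sketched in code. This is only an illustration and not the contents of ex4_1.pdf. It assumes 1-based positions and the diagonal index l = i - j (i in I, j in J); the function names are made up for this example.

```python
from collections import defaultdict

def kmer_positions(seq, k=2):
    """Map each k-mer of seq to the list of its 1-based start positions."""
    positions = defaultdict(list)
    for start in range(len(seq) - k + 1):
        positions[seq[start:start + k]].append(start + 1)
    return dict(positions)

def diagonal_sums(seq_i, seq_j, k=2):
    """Count shared k-mers on each diagonal l = i - j (assumed convention)."""
    table = kmer_positions(seq_j, k)          # k-mer list for J
    sums = defaultdict(int)
    for start in range(len(seq_i) - k + 1):
        i = start + 1                         # 1-based position in I
        for j in table.get(seq_i[start:start + k], []):
            sums[i - j] += 1                  # one common word on diagonal i - j
    return dict(sums)

I = "GCTGCTATGCTTGGC"
J = "CGCGGCTATG"

print(kmer_positions(J))
for diagonal, total in sorted(diagonal_sums(I, J).items()):
    print(diagonal, total)
```

   The diagonal with the largest sum marks the offset at which I and J share the most 2-mers.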
________________________________________________________________________________

2. Run FASTA-Nucleotide tool at EBI (Tools -> Similarity & Homology -> FASTA) against EMBL Coding Sequence database using this sequence as query sequence. Choose "interactive" as the parameter Results. Otherwise use default parameters.

   Explain the contents of the result page in your own words. How many matches did you get? How similar were the best matches to the query sequence? How long did the query take?
   * 50 results, but this is set on the query page (number of alignments)
   * took approx. 4 min
   * best result had a similarity of 100% and a length of 2409 bp
   * see also MView and VisualFasta at the result page
 
________________________________________________________________________________

3.a) Run nucleotide BLAST tool at NCBI against Reference mRNA sequence database using this sequence as the query sequence. Choose to Optimize for Somewhat similar sequences (blastn). Otherwise use default parameters.

     Explain the contents of the result page in your own words. How many matches did you get? How similar were the best matches to the query sequence? How long did the query take?
     * 153 results
     * took a few seconds
     * best result had 98% identities

3.b) Run protein BLAST tool against Non-redundant protein sequences (nr) database using this sequence as the query sequence. Discuss the results as in 3 a).
     * 117 results
     * took a few seconds
     * best result had 100% identities

     (This assignment uses the same query sequence as a BLAST tutorial at NCBI, which is useful to go through)
     
________________________________________________________________________________

4. Some binding sites for hematopoietic transcription factor GATA-1 from H. sapiens are listed below:

      AGATAA
      TGATAA
      AGATAG
      TGATAG
      TGATCA
      TTATCA

   Compute the consensus sequence, positional weight matrix (PWM), and position-specific scoring matrix (PSSM) for the sites as described at the lecture (using pseudocounts for the latter). Compute also the sequence logo heights for the letters at each position.
   * see ex4_4.pdf
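
   The same quantities can be computed with a short script. This is a sketch under assumptions that may differ from the lecture: a pseudocount of 1 per base, a uniform background of 0.25, and logo heights without small-sample correction. It is not the contents of ex4_4.pdf.

```python
import math

SITES = ["AGATAA", "TGATAA", "AGATAG", "TGATAG", "TGATCA", "TTATCA"]
ALPHABET = "ACGT"

def count_matrix(sites):
    """One dict of base counts per alignment column."""
    return [{b: sum(s[p] == b for s in sites) for b in ALPHABET}
            for p in range(len(sites[0]))]

def consensus(counts):
    """Most frequent base per column (ties go to the first base in ACGT order)."""
    return "".join(max(ALPHABET, key=col.get) for col in counts)

def pwm(counts):
    """Relative frequencies per column."""
    return [{b: col[b] / sum(col.values()) for b in ALPHABET} for col in counts]

def pssm(counts, pseudocount=1.0, background=0.25):
    """Log2-odds scores; pseudocount and background are assumed values."""
    scores = []
    for col in counts:
        total = sum(col.values()) + pseudocount * len(ALPHABET)
        scores.append({b: math.log2((col[b] + pseudocount) / total / background)
                       for b in ALPHABET})
    return scores

def logo_heights(counts):
    """Letter height = frequency * (2 - column entropy), no small-sample correction."""
    heights = []
    for col in pwm(counts):
        entropy = -sum(f * math.log2(f) for f in col.values() if f > 0)
        information = math.log2(len(ALPHABET)) - entropy
        heights.append({b: f * information for b, f in col.items()})
    return heights

counts = count_matrix(SITES)
print("consensus:", consensus(counts))
for position, (scores, heights) in enumerate(zip(pssm(counts), logo_heights(counts)), 1):
    print(position,
          {b: round(v, 2) for b, v in scores.items()},
          {b: round(v, 2) for b, v in heights.items()})
```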

________________________________________________________________________________

5. Familiarize yourself with the motif finding program called Weeder.
   a) Test it in the following manner: Take a suitable PWM from Jaspar database, select the corresponding consensus sequence as a basis, and hide it in several copies of random DNA. Alter the hidden copies of consensus sequence slightly according to the PWM. Does Weeder find your hidden motif?
   * MA001 frequency matrix obtained from JASPAR_CORE database:
     A  [ 0  3 79 40 66 48 65 11 65  0 ]
     C  [94 75  4  3  1  2  5  2  3  3 ]
     G  [ 1  0  3  4  1  0  5  3 28 88 ]
     T  [ 2 19 11 50 29 47 22 81  1  6 ]
   * consensus sequence: CCATAAATAG -> will be my motif
   * random DNA with slightly altered motif hidden into it: see ex4_5.faa
     the < and > signs indicate where I hid it
   * paste it into Weeder (after removing < and > signs)
   * result can be found at http://159.149.109.9:8080/weederweb2006/OutputFiles/laura.langohr@cs.helsinki.fi_1254911347237.txt.html for approx 1 month
   * pattern of length 10bp that Weeder found is: CCAAATATAG
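
   One way to generate such test data is sketched here. It is an illustration only and not how ex4_5.faa was produced: it assumes that "altering according to the PWM" means drawing each motif base from the column frequencies, and that the background is uniform random DNA. The function names and the seed are arbitrary.

```python
import random

# Frequency matrix from above, one list of counts per base.
COUNTS = {
    "A": [0, 3, 79, 40, 66, 48, 65, 11, 65, 0],
    "C": [94, 75, 4, 3, 1, 2, 5, 2, 3, 3],
    "G": [1, 0, 3, 4, 1, 0, 5, 3, 28, 88],
    "T": [2, 19, 11, 50, 29, 47, 22, 81, 1, 6],
}
BASES = "ACGT"
WIDTH = len(COUNTS["A"])

def consensus(counts):
    """Most frequent base in each column."""
    return "".join(max(BASES, key=lambda b: counts[b][p]) for p in range(WIDTH))

def sample_site(counts, rng):
    """Draw one motif instance, each base weighted by its column count."""
    return "".join(
        rng.choices(BASES, weights=[counts[b][p] for b in BASES])[0]
        for p in range(WIDTH)
    )

def plant(counts, length, rng):
    """Return (sequence, start, site): uniform random DNA with one site hidden in it."""
    site = sample_site(counts, rng)
    start = rng.randrange(length - WIDTH + 1)       # 0-based start of the hidden site
    background = "".join(rng.choice(BASES) for _ in range(length))
    sequence = background[:start] + site + background[start + WIDTH:]
    return sequence, start, site

rng = random.Random(1)                              # fixed seed for repeatability
print("consensus:", consensus(COUNTS))
for number in range(1, 6):
    sequence, start, site = plant(COUNTS, 100, rng)
    print(f">seq{number} site={site} start={start}")
    print(sequence)
```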

   b) Find the article describing the algorithm behind Weeder. Which familiar techniques from the course does the algorithm use?
   * Pavesi,G., Mauri,G. and Pesole,G. (2001) An algorithm for finding signals of unknown length in DNA sequences. Bioinformatics, 17 (Suppl. 1), S207-214.
   * pattern discovery with suffix trees
   * finding approximate occurrences of a pattern, with the error threshold determined dynamically according to the pattern length

________________________________________________________________________________

6. Note: this assignment gives you two marks.

   Write a program implementing the Needleman-Wunsch global alignment algorithm capable of reporting the optimal global alignment score and corresponding alignment.
   * see ex4_6.py

   Test your program with two sequences (first, second) varying parameter values for mismatch and indel penalty while keeping match score constant. For example, use values -20,-10,-5,0 for both penalties, and 10 for match score.

   Report the number of matches, mismatches and indels in the optimal alignment for each parameter combination. What conclusions can you draw about the effects of different parameter values on the alignment result?
   * penalties=  0: 69 matches, 14 mismatches, 39 indels
   * penalties= -5: 69 matches, 16 mismatches, 35 indels
   * penalties=-10: 67 matches, 22 mismatches, 27 indels
   * penalties=-20: 67 matches, 22 mismatches, 27 indels
   * alignments obtained with larger penalties have fewer matches, more mismatches, and fewer indels (fewer gaps)
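
   For reference, the algorithm and the column counting can be sketched as follows. This is a minimal illustration and not the contents of ex4_6.py: it uses a linear indel penalty, reports a single optimal alignment (preferring the diagonal move on ties), and runs on two short made-up sequences instead of the assignment sequences.

```python
def needleman_wunsch(a, b, match=10, mismatch=-5, indel=-5):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * indel
    for j in range(1, m + 1):
        score[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + indel,
                              score[i][j - 1] + indel)

    # Traceback from the bottom-right corner; ties prefer the diagonal move.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + indel:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))

def count_columns(top, bottom):
    """Return (matches, mismatches, indels) of an alignment."""
    indels = sum(x == "-" or y == "-" for x, y in zip(top, bottom))
    matches = sum(x == y for x, y in zip(top, bottom))
    return matches, len(top) - matches - indels, indels

# Hypothetical toy sequences, not the ones from the assignment.
first, second = "GCTGCTATGCTTGGC", "CGCGGCTATG"
for penalty in (0, -5, -10, -20):
    total, top, bottom = needleman_wunsch(first, second, 10, penalty, penalty)
    print(penalty, total, count_columns(top, bottom))
    print(top)
    print(bottom)
```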