Exercise session 4

Introduction to bioinformatics, Autumn 2009

Group 1: Thursday 8.10 12-14 Exactum BK106
Group 2: Thursday 8.10 16-18 Exactum BK106

General instructions:

Problems for each exercise session will be distributed approximately one week before the session. You are expected to be prepared to present your solutions in the exercise session.

In addition, send notes on the assignments you are going to mark to Laura Langohr by email before the exercise session (Thursdays 12.15).

These exercise notes should contain a brief description of the steps you took to solve the assignment, as well as the results. Important: When sending email, use a subject of the form "ITB exercise X", where X is the exercise session number (1/2). Send your notes in the email text body. If you need to include a figure, send it as an attachment.

Assignments

For I = GCTGCTATGCTTGGC and J = CGCGGCTATG, make a 2-mer list for J. Compute the diagonal common word sums for I and J using the algorithm presented in the lectures and in the course book.
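The word-list approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: positions are 1-based, the diagonal index is taken as l = i - j, and the function names `kmer_positions` and `diagonal_sums` are made up for this sketch. Check the conventions against the lecture notes before comparing numbers.

```python
from collections import defaultdict


def kmer_positions(seq, k):
    """Map each k-mer of seq to the list of 1-based start positions."""
    positions = defaultdict(list)
    for start in range(len(seq) - k + 1):
        positions[seq[start:start + k]].append(start + 1)
    return dict(positions)


def diagonal_sums(i_seq, j_seq, k):
    """Count common k-words on each diagonal l = i - j (assumed convention)."""
    word_list = kmer_positions(j_seq, k)
    sums = defaultdict(int)
    for start in range(len(i_seq) - k + 1):
        word = i_seq[start:start + k]
        # Every occurrence of the same word in J adds one hit to its diagonal.
        for j_pos in word_list.get(word, []):
            sums[(start + 1) - j_pos] += 1
    return dict(sums)


if __name__ == "__main__":
    # Toy sequences, not the ones from the assignment.
    print(kmer_positions("ACAC", 2))
    print(diagonal_sums("ACGT", "CGT", 2))
```

Diagonals with no common word are simply absent from the returned dictionary, so a missing key should be read as a sum of zero.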
Run the FASTA-Nucleotide tool at EBI (Tools -> Similarity & Homology -> FASTA) against the EMBL Coding Sequence database using this sequence as the query sequence. Choose "interactive" as the value of the Results parameter. Otherwise use default parameters.

Explain the contents of the result page in your own words. How many matches did you get? How similar were the best matches to the query sequence? How long did the query take?
1. Run the nucleotide BLAST tool at NCBI against the Reference mRNA sequence database using this sequence as the query sequence. Choose to optimize for "Somewhat similar sequences (blastn)". Otherwise use default parameters.
  
  Explain the contents of the result page in your own words. How many matches did you get? How similar were the best matches to the query sequence? How long did the query take?
2. Run the protein BLAST tool against the Non-redundant protein sequences (nr) database using this sequence as the query sequence. Discuss the results as in 3 a).
  
  (This assignment uses the same query sequence as a BLAST tutorial at NCBI, which is useful to go through)
Some binding sites for hematopoietic transcription factor GATA-1 from H. sapiens are listed below:
```
AGATAA
TGATAA
AGATAG
TGATAG
TGATCA
TTATCA
```
Compute the consensus sequence, positional weight matrix (PWM), and position-specific scoring matrix (PSSM) for the sites as described in the lecture (using pseudocounts for the latter). Also compute the sequence logo heights for the letters at each position.
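The quantities asked for above can be sketched in Python as follows. The lecture's exact definitions are not reproduced here, so several choices are assumptions: the PWM is taken to be the column-wise relative frequencies, the PSSM is a log2 odds score against a uniform background of 0.25 with a pseudocount of 1 per base, and logo heights are frequency times information content (2 minus the column entropy) without a small-sample correction. The function names are made up for the sketch, and the example uses toy sites rather than the GATA-1 sites.

```python
import math

BASES = "ACGT"


def count_matrix(sites):
    """Per-position base counts for equal-length sites."""
    return [{b: sum(s[pos] == b for s in sites) for b in BASES}
            for pos in range(len(sites[0]))]


def frequency_matrix(counts, pseudocount=0.0):
    """Column-wise relative frequencies, optionally with a pseudocount per base."""
    freqs = []
    for column in counts:
        total = sum(column.values()) + pseudocount * len(BASES)
        freqs.append({b: (column[b] + pseudocount) / total for b in BASES})
    return freqs


def pssm(counts, pseudocount=1.0, background=0.25):
    """Log2 odds scores; pseudocount and background are assumed values."""
    return [{b: math.log2(p / background) for b, p in column.items()}
            for column in frequency_matrix(counts, pseudocount)]


def logo_heights(counts):
    """Letter height = frequency * (2 - entropy), no small-sample correction."""
    heights = []
    for column in frequency_matrix(counts):
        entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
        heights.append({b: p * (2.0 - entropy) for b, p in column.items()})
    return heights


def consensus(counts):
    """Most frequent base per position; ties go to the first base in ACGT order."""
    return "".join(max(BASES, key=lambda b: column[b]) for column in counts)


if __name__ == "__main__":
    toy_sites = ["ACG", "ACT", "AAG", "ACG"]  # toy data, not the assignment sites
    counts = count_matrix(toy_sites)
    print(consensus(counts))
    print(pssm(counts)[0])
    print(logo_heights(counts)[0])
```

If the lecture uses a different pseudocount, background distribution, or logo correction term, only the default arguments and the `2.0 - entropy` line need to change.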

Familiarize yourself with the motif finding program called Weeder.
1. Test it in the following manner: Take a suitable PWM from the JASPAR database, select the corresponding consensus sequence as a basis, and hide it in several copies of random DNA. Alter the hidden copies of the consensus sequence slightly according to the PWM. Does Weeder find your hidden motif?
2. Find the article describing the algorithm behind Weeder. Which familiar techniques from the course does the algorithm use?
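One way to prepare the test data for the first part is sketched below in Python. It is a simplification, not a prescribed method: instead of editing the consensus by hand, each hidden copy is sampled position by position from the PWM, which gives consensus-like sites with PWM-driven variation. The background is uniform random DNA, the PWM is assumed to be a list of per-position probability dictionaries, and the function names and FASTA header format are made up for the sketch.

```python
import random

BASES = "ACGT"


def sample_site(pwm, rng):
    """Draw one site, choosing each base with its PWM column probability."""
    return "".join(
        rng.choices(BASES, weights=[column[b] for b in BASES])[0]
        for column in pwm
    )


def plant_motif(pwm, n_seqs, seq_len, rng):
    """Return (start, sequence) pairs: uniform random DNA with one planted site."""
    planted = []
    for _ in range(n_seqs):
        background = "".join(rng.choice(BASES) for _ in range(seq_len))
        site = sample_site(pwm, rng)
        start = rng.randrange(seq_len - len(site) + 1)  # 0-based start
        planted.append((start, background[:start] + site
                        + background[start + len(site):]))
    return planted


def to_fasta(planted):
    """Format the sequences as FASTA text; headers record the planted position."""
    lines = []
    for index, (start, seq) in enumerate(planted, 1):
        lines.append(f">random_{index} planted_at={start}")
        lines.append(seq)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical 3-column PWM for illustration; use a real JASPAR matrix instead.
    toy_pwm = [
        {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
        {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
        {"A": 0.2, "C": 0.2, "G": 0.2, "T": 0.4},
    ]
    print(to_fasta(plant_motif(toy_pwm, 5, 60, random.Random(1))))
```

Recording the planted positions makes it easy to check afterwards whether the reported motif occurrences coincide with the hidden ones.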

Note: this assignment gives you two marks.

Write a program implementing the Needleman-Wunsch global alignment algorithm capable of reporting the optimal global alignment score and corresponding alignment.

Test your program with two sequences (first, second) varying parameter values for mismatch and indel penalty while keeping match score constant. For example, use values -20,-10,-5,0 for both penalties, and 10 for match score.

Report the number of matches, mismatches and indels in the optimal alignment for each parameter combination. What conclusions can you draw about the effects of different parameter values on the alignment result?
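The reporting step can be sketched separately from the alignment algorithm itself. The Python helper below assumes the alignment is given as two equal-length strings with "-" marking a gap and that indels are scored linearly (one penalty per gap column); the function names are made up for the sketch. The Needleman-Wunsch implementation is left to your own program.

```python
from itertools import product

PENALTIES = (-20, -10, -5, 0)  # example values from the assignment text
MATCH_SCORE = 10


def count_columns(aligned_a, aligned_b, gap="-"):
    """Count matches, mismatches and indels in a pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    counts = {"matches": 0, "mismatches": 0, "indels": 0}
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            counts["indels"] += 1
        elif a == b:
            counts["matches"] += 1
        else:
            counts["mismatches"] += 1
    return counts


def alignment_score(counts, match, mismatch, indel):
    """Score implied by the column counts under a linear gap penalty."""
    return (counts["matches"] * match
            + counts["mismatches"] * mismatch
            + counts["indels"] * indel)


def parameter_grid(penalties=PENALTIES, match=MATCH_SCORE):
    """All (match, mismatch, indel) combinations to test."""
    return [(match, mismatch, indel)
            for mismatch, indel in product(penalties, repeat=2)]


if __name__ == "__main__":
    counts = count_columns("GAT-ACA", "GCTTA-A")  # toy alignment
    print(counts, alignment_score(counts, 10, -5, -10))
    print(len(parameter_grid()), "parameter combinations")
```

Recomputing the score from the column counts and comparing it with the value in the last cell of the dynamic programming matrix is a cheap consistency check on the traceback.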