Exercise session 2

Introduction to bioinformatics, Autumn 2008

Exercise session: Tuesday 16 September 16.15-18.00 Exactum C221.

Remember to send your exercise notes to Lauri Eronen before the exercise session begins!

Assignments

Calculate the Codon Adaptation Index (CAI) and the log-odds CAI for the two DNA sequences r and s, using the codon usage data below, which gives the probability p_k of each codon. You can perform the calculation by hand or write a program for the task. Hint: translate the DNA sequences into amino acid sequences using the genetic code table presented in lectures.

    r: atgcagccgagaagtgtaattatgtcatttttccaaaca
    s: atggccttccagcagctgtcgggggtcaacgccgtcatgttc

Codon usage table (same data in a text file). Note that "*" denotes a stop codon.

    ATT I 0.32  ACT T 0.24  AAT N 0.39  AGT S 0.14
    ATC I 0.56  ACC T 0.40  AAC N 0.61  AGC S 0.24
    ATA I 0.12  ACA T 0.25  AAA K 0.38  AGA R 0.18
    ATG M 1.00  ACG T 0.11  AAG K 0.62  AGG R 0.20
    
    CTT L 0.12  CCT P 0.30  CAT H 0.35  CGT R 0.10
    CTC L 0.21  CCC P 0.33  CAC H 0.65  CGC R 0.21
    CTA L 0.07  CCA P 0.26  CAA Q 0.24  CGA R 0.13
    CTG L 0.42  CCG P 0.11  CAG Q 0.76  CGG R 0.18
    
    GTT V 0.16  GCT A 0.30  GAT D 0.42  GGT G 0.18
    GTC V 0.27  GCC A 0.41  GAC D 0.58  GGC G 0.35
    GTA V 0.10  GCA A 0.20  GAA E 0.39  GGA G 0.24
    GTG V 0.47  GCG A 0.10  GAG E 0.61  GGG G 0.22

    TTT F 0.42  TCT S 0.20  TAT Y 0.39  TGT C 0.43
    TTC F 0.58  TCC S 0.25  TAC Y 0.61  TGC C 0.57
    TTA L 0.05  TCA S 0.12  TAA * 0.34  TGA * 0.47
    TTG L 0.13  TCG S 0.06  TAG * 0.19  TGG W 1.00
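If you choose to write a program, the calculation could be sketched roughly as follows. This sketch assumes that CAI is the geometric mean of p_k / q_k over the codons of the sequence, where q_k is the largest probability among the codons that encode the same amino acid, and that the log-odds CAI is the mean of log(p_k / q_k) using the natural logarithm. Check these assumptions against the definitions given in lectures. The function names are illustrative only.

```python
import math

# Codon usage table copied from the exercise: codon, amino acid, probability p_k.
CODON_TABLE = """
ATT I 0.32  ACT T 0.24  AAT N 0.39  AGT S 0.14
ATC I 0.56  ACC T 0.40  AAC N 0.61  AGC S 0.24
ATA I 0.12  ACA T 0.25  AAA K 0.38  AGA R 0.18
ATG M 1.00  ACG T 0.11  AAG K 0.62  AGG R 0.20
CTT L 0.12  CCT P 0.30  CAT H 0.35  CGT R 0.10
CTC L 0.21  CCC P 0.33  CAC H 0.65  CGC R 0.21
CTA L 0.07  CCA P 0.26  CAA Q 0.24  CGA R 0.13
CTG L 0.42  CCG P 0.11  CAG Q 0.76  CGG R 0.18
GTT V 0.16  GCT A 0.30  GAT D 0.42  GGT G 0.18
GTC V 0.27  GCC A 0.41  GAC D 0.58  GGC G 0.35
GTA V 0.10  GCA A 0.20  GAA E 0.39  GGA G 0.24
GTG V 0.47  GCG A 0.10  GAG E 0.61  GGG G 0.22
TTT F 0.42  TCT S 0.20  TAT Y 0.39  TGT C 0.43
TTC F 0.58  TCC S 0.25  TAC Y 0.61  TGC C 0.57
TTA L 0.05  TCA S 0.12  TAA * 0.34  TGA * 0.47
TTG L 0.13  TCG S 0.06  TAG * 0.19  TGG W 1.00
"""


def parse_codon_table(text):
    """Return {codon: (amino_acid, p_k)} from whitespace-separated triples."""
    tokens = text.split()
    return {tokens[i]: (tokens[i + 1], float(tokens[i + 2]))
            for i in range(0, len(tokens), 3)}


def cai(dna, usage):
    """Return (CAI, log-odds CAI) under the assumed definitions."""
    dna = dna.upper()
    # q: the largest probability among the codons of each amino acid.
    q = {}
    for amino_acid, p in usage.values():
        q[amino_acid] = max(q.get(amino_acid, 0.0), p)
    # Split into complete codons; trailing bases that do not fill a codon are ignored.
    codons = [dna[i:i + 3] for i in range(0, len(dna) - len(dna) % 3, 3)]
    # Mean of log(p_k / q_k); its exponential is the geometric mean of p_k / q_k.
    log_odds = sum(math.log(usage[c][1] / q[usage[c][0]]) for c in codons) / len(codons)
    return math.exp(log_odds), log_odds


USAGE = parse_codon_table(CODON_TABLE)
# Toy sequence (not r or s): ATG (M) followed by TTT (F).
value, log_value = cai("atgttt", USAGE)
print(f"CAI = {value:.3f}, log-odds CAI = {log_value:.3f}")
```

Working in log space avoids multiplying many small ratios together, which matters little for sequences this short but is the usual habit for longer genes.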

Simulate the Overlap-Layout-Consensus algorithm given in lectures with the following data and answer the questions below.

What contigs did you obtain with the method?
What was the total length of contigs?

All sequence reads are in the same orientation (you do not have to consider reverse complements). The overlap matrix has been computed for you assuming no sequencing errors. Disregard overlaps shorter than five bases.

Sequence reads:

0 cgaccacttc
1 cgttaatggc
2 gttaaaccaa
3 gcccgttaat
4 accacttcac
5 taaaccaaag
6 ttaaaccaaa
7 ggactctacc
8 tctaccgcga
9 aagtaaaccg
10 ccacttcact
11 agatatccaa
12 ggttaaacca
13 ttaatggcca

Overlap matrix for the sequences. Note that the first row and the first column give the sequence indices.

      0   1   2   3   4   5   6   7   8   9  10  11  12  13
  0  10   1   0   0   8   0   0   1   3   2   7   0   0   0
  1   1  10   0   7   1   0   0   1   0   2   1   0   0   8
  2   0   0  10   0   1   8   9   0   0   2   0   1   9   0
  3   0   7   0  10   0   1   1   0   1   1   0   0   0   5
  4   8   1   1   0  10   0   1   3   1   0   9   1   4   1
  5   0   0   8   1   0  10   9   1   0   3   1   2   7   0
  6   0   0   9   1   1   9  10   0   0   2   1   1   8   0
  7   1   1   0   0   3   1   0  10   6   1   2   0   0   0
  8   3   0   0   1   1   0   0   6  10   1   1   1   0   0
  9   2   2   2   1   0   3   2   1   1  10   0   2   1   1
 10   7   1   0   0   9   1   1   2   1   0  10   0   3   3
 11   0   0   1   0   1   2   1   0   1   2   0  10   1   1
 12   0   0   9   0   4   7   8   0   0   1   3   1  10   0
 13   0   8   0   5   1   0   0   0   0   1   3   1   0  10

Optional: Simulate the algorithm for the data again, but instead of disregarding overlaps shorter than five bases, consider all sequence pairs with at least one overlapping base. How does the result change from the previous scenario?
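To get a feel for how the minimum overlap threshold changes the outcome, the following sketch merges reads greedily: it repeatedly joins the pair with the longest suffix-prefix overlap until no overlap reaches the threshold. This greedy merging is a simplification and is not necessarily the layout procedure given in lectures. The toy reads and the function names are made up for illustration.

```python
def suffix_prefix_overlap(a, b):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_contigs(reads, min_overlap):
    """Merge the best-overlapping pair repeatedly; return the remaining contigs."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i == j:
                    continue
                k = suffix_prefix_overlap(a, b)
                # Keep the first pair found with the longest acceptable overlap.
                if k >= min_overlap and k > best_len:
                    best_len, best_i, best_j = k, i, j
        if best_i is None:
            break  # no remaining overlap reaches the threshold
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


toy_reads = ["acgtacc", "taccgga", "gattt"]  # hypothetical reads, not the exercise data
for threshold in (3, 1):
    result = greedy_contigs(toy_reads, threshold)
    print(threshold, result, sum(len(c) for c in result))
```

With the toy reads, lowering the threshold lets a two-base overlap join what were previously separate contigs. Very short overlaps are likely to occur by chance, which is why a threshold is normally applied.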

In this assignment, you run the CAP3 sequence assembler to assemble a set of reads into one or more contigs.

To complete this assignment, you need access to CS computers. You can either get a minoring student account (see the instructions for obtaining one under the section "Minoring students") or use the computers with your University of Helsinki account. In the latter case, first log in with username 'csguest' and password 'csguest', and then enter your University of Helsinki account information when the system asks for it. You should now be able to use the computer. Note: in the latter case, all data stored on the file system is deleted after the session! Remember to store your data elsewhere, for example on a USB memory stick.

The CAP3 program and the data for this exercise are in the CS filesystem directory /group/home/bioinfo/itb/cap. To use them, copy them to a directory under your own account. This can be done, for example, by entering the following commands in a CS Linux terminal:
```
cd
mkdir -p tmp/cap
cp /group/home/bioinfo/itb/cap/* tmp/cap
```
Now you should have a personal copy of the program and data (reads.txt, readswitherrors.txt) in directory ~/tmp/cap.

Both data files contain 1000 reads sampled from a DNA sequence of unknown length. The reads can be from either strand of the DNA. The file 'reads.txt' contains reads without errors. The file 'readswitherrors.txt' contains 1000 reads (not necessarily the same ones as in reads.txt) with some errors, which are marked with the letter 'n' in the sequence.

You can get help about CAP3 command line parameters by invoking cap3 without any parameters.

Run the assembler for 'reads.txt'. Describe your results:
- How many different contigs did you get?
- How many gaps were left in the sequence?
- How long were the contigs?
- How much time did the assembly take?
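Counting contigs and their lengths can be automated with a small script. The sketch below assumes that the contigs are available as a FASTA file. CAP3 typically writes them to a file named after the input with the suffix .cap.contigs, but check which files actually appear in your directory. The function name and the example records are illustrative only.

```python
def fasta_lengths(lines):
    """Return {record name: sequence length} for FASTA-formatted lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else "unnamed"
            lengths[name] = 0
        elif line and name is not None:
            lengths[name] += len(line)
    return lengths


# Made-up records standing in for an assembler's contig file.
example = [">Contig1", "acgtacgt", "acg", ">Contig2", "ttga"]
for name, length in fasta_lengths(example).items():
    print(name, length)

# With real output (file name assumed, see above):
# with open("reads.txt.cap.contigs") as handle:
#     print(fasta_lengths(handle))
```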

Continuing from the previous assignment, run the CAP3 assembler for reads containing errors (readswitherrors.txt). Describe the results: how did the number of contigs change? What about contig lengths?

Lastly, investigate how the number of available reads affects the assembly outcome. Run the assembler with only the first 100 reads from the file reads.txt. Describe the results as before.
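One way to extract the first 100 reads is sketched below. It assumes that reads.txt is in FASTA format, with each read introduced by a header line starting with '>'. The handout does not state the file format, so inspect the file first. The output file name reads100.txt is hypothetical.

```python
def first_records(lines, n):
    """Return the lines belonging to the first n FASTA records."""
    kept, seen = [], 0
    for line in lines:
        if line.startswith(">"):
            seen += 1
            if seen > n:
                break
        if seen > 0:
            kept.append(line)
    return kept


# Small in-memory example with made-up reads.
example = [">read1\n", "acgt\n", ">read2\n", "ttga\n", ">read3\n", "ccag\n"]
print("".join(first_records(example, 2)), end="")

# With the exercise data (format and output name assumed, see above):
# with open("reads.txt") as source, open("reads100.txt", "w") as target:
#     target.writelines(first_records(source, 100))
```

Simply taking the first 200 lines of the file would work only if every read occupies exactly one header line and one sequence line, which is why the sketch counts header lines instead.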

Read the article "The Sequence of the Human Genome", published in Science on 16 February 2001 (Vol 291, Issue 5507), up to section 2.2 (Assembly strategies), and answer the following questions.
1. How many reads were used to assemble the genome sequence?
2. What coverage was achieved?
3. How many individuals were DNA samples collected from?
4. How was data from the public Human Genome Project utilised in this work?
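As a reminder for the coverage question, coverage is commonly defined as the total number of sequenced bases divided by the genome length, that is, c = N * L / G for N reads of mean length L and a genome of length G. The numbers in the sketch below are made up for illustration and are not taken from the article.

```python
def coverage(num_reads, mean_read_length, genome_length):
    """Average number of times each base of the genome is covered by a read."""
    return num_reads * mean_read_length / genome_length


# Hypothetical example: 1000 reads of 500 bases from a 100 000 base sequence.
print(coverage(1000, 500, 100_000))  # 5.0
```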

Have a look at the resources on Ornithorhynchus anatinus (the platypus) provided at NCBI, particularly the genome map viewer. Answer the following questions:
1. What is the coverage of the genome assembly presented?
2. How many chromosomes does this organism have? How many of these have been sequenced?
3. Explore chromosomes 4 and 10. How large are the chromosomes? How many contigs do they have in the assembly? How large are the contigs? How are they shown in the map viewer?