1. In PubMed, search for review articles published in the last year
discussing the gene HbA1 in humans.

  * Describe briefly what PubMed is.
  see http://www.ncbi.nlm.nih.gov/pubmed/

  * How many articles does the query return?
  50

  * Which disease or diseases are mentioned in article titles?
  Haemoglobin variant, type 2 and type 1 diabetes mellitus

________________________________________________________________________________

2. Search for gene HbA1 in OMIM.

  * Describe briefly what OMIM is.
  see http://www.ncbi.nlm.nih.gov/sites/entrez?db=omim

  * How many results do you get?
  14

  * Choose one result entry from each of the categories denoted by symbols +, *, # and %, and describe in your own words what is being described by each entry.

  * What is the meaning of the four symbols here?
  see http://www.ncbi.nlm.nih.gov/Omim/omimfaq.html

________________________________________________________________________________

3. Search for HbA1 in NCBI RefSeq using Entrez. Hint: Choose the Nucleotide
option from the Search list and set the options in the Limits tab accordingly.

  * Describe briefly what RefSeq is.
  see http://www.ncbi.nlm.nih.gov/RefSeq/

  * How many results did you get?
  24

  * How can you separate your RefSeq results from other results?
  Results are separated by tabs.

Access the entry for human HBA1 in NCBI RefSeq and answer the following questions.

  * How long is the RNA sequence corresponding to the gene?
  576 bp

  * How many exons have been annotated in this sequence?
  3

  * On which chromosome is this gene located?
  16

  * When was the entry last updated?
  Sept 13, 2009

  * How can you easily download the sequence corresponding to a nucleotide entry in NCBI?
  Click on "Download" and select a file format.

________________________________________________________________________________

4. Find entries related to gene HbA in UniProt.

  * Describe briefly what UniProt is.
  see http://www.uniprot.org/help/about

  * What are the two sections of UniProt, and how do they differ from each other? How can you distinguish between the two sections in search results?
  Swiss-Prot: manual annotation, reviewed
  TrEMBL: automatic annotation, unreviewed
  Separate them after searching via the filter links, or type "HbA AND reviewed:yes" (or "HbA AND reviewed:no") in the search field.

  * Describe your results for the query: how many results in the two sections did you get?
  438 in total, 301 reviewed, 137 unreviewed

  * Access the entry HBA_HUMAN. What does the entry say about evidence for this protein? How is this protein's function being characterised? Hint: see the Ontologies section.
  Evidence at protein level. The protein is involved in oxygen transport from the lung to the various peripheral tissues.

  * Access the entry Q86YQ5_HUMAN and describe it in the same fashion as HBA_HUMAN.
  Evidence is inferred from homology. The protein functions in oxygen transport.
  
________________________________________________________________________________

5. Download genome sequences of Escherichia coli (GenBank ID NC_000913) and Thermoplasma volcanium (NC_002689) from NCBI. 

  1. Find out, using your favourite programming language (notes on programming languages below) or another method, the nucleotide, dinucleotide and trinucleotide frequencies.
  see ex1_computefreqs.py

  2. What is the G-C content of the sequences?
  see slides 9-11 of Lecture_100909.pdf
  E. coli: fr(G+C) = (#G + #C) / #N   (N = all nucleotides)
                   = (1176923 + 1179554) / 4639675
                   = 0.5079
                   i.e. 50.79%
  T. volcanium: fr(G+C) = (#G + #C) / #N   (N = all nucleotides)
                        = (317147 + 315483) / 1584804
                        = 0.3992
                        i.e. 39.92%

  3. Draw a diagram of 2-word and 3-word distributions in both sequences (you can use any software available).
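
  For items 1 and 2, the counting can be sketched roughly as follows. This is
  an illustration only and not the contents of ex1_computefreqs.py; the
  function names (read_fasta, kmer_frequencies, gc_content) are hypothetical,
  and the sketch assumes a single-record FASTA file and overlapping k-mers.

```python
from collections import Counter


def read_fasta(path):
    """Read a single-record FASTA file and return the sequence as one string.

    Assumption: the file holds exactly one record (one '>' header line).
    """
    with open(path) as handle:
        return "".join(line.strip() for line in handle if not line.startswith(">"))


def kmer_frequencies(seq, k):
    """Return relative frequencies of all overlapping k-mers in seq.

    k = 1, 2, 3 gives nucleotide, dinucleotide and trinucleotide frequencies.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def gc_content(seq):
    """Return fr(G+C) = (#G + #C) / #N, where #N is the sequence length."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)
```

  With a downloaded genome file, gc_content(read_fasta(path)) would give the
  G-C fraction, and the dictionaries returned by kmer_frequencies for k = 2
  and k = 3 could be passed to any plotting tool for item 3.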

________________________________________________________________________________

6. Write a program in your favourite language that tries to find gene coding regions with the following method.

  * Scan the given sequence for start (ATG) and stop codons (TAA, TAG, TGA).
  * Report the regions that begin with the start codon and end in a stop codon. Note: remember that within a coding region, ATG codes for methionine and does not "restart" the coding region. 
  * Take into account frame shifts, considering codons starting from the first, second and third position of the input sequence.
  see ex1_findcodingregions.py
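
  The method described in the three points above can be sketched as follows.
  This is a minimal illustration and not the contents of
  ex1_findcodingregions.py; the function name find_coding_regions and the
  (start, end, frame) output convention are assumptions made here. Only the
  given strand is scanned; the reverse complement is not considered.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_coding_regions(seq):
    """Return (start, end, frame) tuples for ATG...stop regions.

    start is 0-based; end is exclusive and includes the stop codon;
    frame is 0, 1 or 2 (the offset of the reading frame in the input).
    """
    seq = seq.upper()
    regions = []
    for frame in range(3):
        start = None  # position of the ATG that opened the current region
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon == "ATG":
                    start = i
            elif codon in STOP_CODONS:
                regions.append((start, i + 3, frame))
                start = None
            # An ATG inside an open region codes for methionine and is
            # deliberately ignored, so it does not restart the region.
    return regions


for start, end, frame in find_coding_regions("CATGCCCTGAA"):
    print(start, end, frame, end - start)  # prints: 1 10 1 9
```

  A region that opens with ATG but never reaches an in-frame stop codon is not
  reported in this sketch.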

Test your program with this DNA sequence.

  * How many candidate coding regions can you find? Where can you find them? How long are the regions?
  37

  * Discuss how you could investigate your findings further.