1. In PubMed, search for review articles published in the last year discussing the gene HbA1 in humans.
* Describe briefly what PubMed is.
  See http://www.ncbi.nlm.nih.gov/pubmed/
* How many articles does the query return?
  50
* Which disease or diseases are mentioned in article titles?
  Haemoglobin variants, type 1 and type 2 diabetes mellitus
________________________________________________________________________________
2. Search for gene HbA1 in OMIM.
* Describe briefly what OMIM is.
  See http://www.ncbi.nlm.nih.gov/sites/entrez?db=omim
* How many results do you get?
  14
* Choose one result entry from each of the categories denoted by symbols +, *, # and %, and describe in your own words what is being described by each entry.
* What is the meaning of the four symbols here?
  See http://www.ncbi.nlm.nih.gov/Omim/omimfaq.html
________________________________________________________________________________
3. Search for HbA1 in NCBI RefSeq using Entrez. Hint: Choose the Nucleotide option from the Search list and set the options in the Limits tab accordingly.
* Describe briefly what RefSeq is.
  See http://www.ncbi.nlm.nih.gov/RefSeq/
* How many results did you get?
  24
* How can you separate your RefSeq results from other results?
  Results are separated by tabs.
Access the entry for human HBA1 in NCBI RefSeq and answer the following questions.
* How long is the RNA sequence corresponding to the gene?
  576 bp
* How many exons have been annotated in this sequence?
  3
* On which chromosome is this gene located?
  16
* When was the entry last updated?
  Sept 13, 2009
* How can you easily download the sequence corresponding to a nucleotide entry in NCBI?
  Click on "Download" and select a file format.
________________________________________________________________________________
4. Find entries related to gene HbA in UniProt.
* Describe briefly what UniProt is.
  See http://www.uniprot.org/help/about
* What are the two sections of UniProt, and how do they differ from each other?
  How can you distinguish between the two sections in search results?
  Swiss-Prot: manual annotation, reviewed.
  TrEMBL: automatic annotation, unreviewed.
  Separate the sections after searching via the link, or type "HbA AND reviewed:[yes,no]" in the search field.
* Describe your results for the query: how many results in the two sections did you get?
  438 in total: 301 reviewed, 137 unreviewed
* Access the entry HBA_HUMAN. What does the entry say about evidence for this protein? How is this protein's function being characterised? Hint: see the Ontologies section.
  Evidence at protein level. The protein is involved in oxygen transport from the lung to the various peripheral tissues.
* Access the entry Q86YQ5_HUMAN and describe it in the same fashion as HBA_HUMAN.
  Evidence is inferred from homology. The protein functions in oxygen transport.
________________________________________________________________________________
5. Download genome sequences of Escherichia coli (GenBank ID NC_000913) and Thermoplasma volcanium (NC_002689) from NCBI.
  1. Find out, using your favourite programming language (notes on programming languages below) or other method, the nucleotide, dinucleotide and trinucleotide frequencies.
     See ex1_computefreqs.py
  2. What is the G-C content of the sequences?
     See slides 9-11 of Lecture_100909.pdf
     fr(G+C) = (#G + #C) / #N, where #N is the total number of nucleotides
     E. coli: fr(G+C) = (1176923 + 1179554) / 4639675 = 0.5079, i.e. 50.79%
     T. volcanium: fr(G+C) = (317147 + 315483) / 1584804 = 0.3992, i.e. 39.92%
  3. Draw a diagram of 2-word and 3-word distributions in both sequences (you can use any software available).
________________________________________________________________________________
6. Write a program in your favourite language that tries to find gene coding regions with the following method.
* Scan the given sequence for start (ATG) and stop codons (TAA, TAG, TGA).
* Report the regions that begin with the start codon and end in a stop codon.
  Note: remember that within a coding region, ATG codes for methionine and does not "restart" the coding region.
* Take into account frame shifts, considering codons starting from the first, second and third position of the input sequence.
  See ex1_findcodingregions.py
Test your program with this DNA sequence.
* How many candidate coding regions can you find? Where can you find them? How long are the regions?
  37
* Discuss how you could investigate your findings further.
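
The scanning method of exercise 6 can be sketched as follows. This is a minimal illustration written for this sheet, not the contents of ex1_findcodingregions.py; the function name find_coding_regions and the tuple layout are assumptions, and only the forward strand is scanned (the reverse complement is ignored).

```python
START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_coding_regions(seq):
    """Return candidate coding regions as (frame, start, end, length) tuples.

    frame is 1, 2 or 3; start is a 0-based index and end is exclusive,
    so seq[start:end] runs from the ATG through the stop codon.
    """
    seq = seq.upper()
    regions = []
    for frame in range(3):
        start = None  # position of the ATG that opened the current region
        # Step through the sequence codon by codon in this reading frame.
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None:
                if codon == START_CODON:
                    start = pos
            elif codon in STOP_CODONS:
                end = pos + 3
                regions.append((frame + 1, start, end, end - start))
                start = None
            # An ATG seen while a region is open is an ordinary methionine
            # codon, so it neither closes nor restarts the region.
    return regions
```

For example, find_coding_regions("CATGATGTAGC") returns [(2, 1, 10, 9)]: one region in the second frame, where the inner ATG is read as methionine. A region that is still open when the sequence ends is not reported, because it has no stop codon.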
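
For exercise 5, the frequency counts and the G-C content can be sketched as below. This is an illustration under stated assumptions, not the contents of ex1_computefreqs.py: the function names are hypothetical, the genomes are assumed to be saved as single-record FASTA files, and k-words are counted with overlapping windows (the lecture may define them differently).

```python
from collections import Counter


def read_fasta(path):
    """Read a single-record FASTA file and return the sequence as one string."""
    with open(path) as handle:
        return "".join(line.strip() for line in handle if not line.startswith(">"))


def kmer_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_frequencies(seq, k):
    """Return the relative frequency of each k-word (counts divided by the total)."""
    counts = kmer_counts(seq, k)
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def gc_content(seq):
    """Return (#G + #C) / #N, where #N is the sequence length."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)
```

Calling kmer_frequencies(seq, 1), kmer_frequencies(seq, 2) and kmer_frequencies(seq, 3) gives the nucleotide, dinucleotide and trinucleotide frequencies. Note that gc_content divides by the full sequence length, so any ambiguous bases such as N count towards the denominator. As a check on the arithmetic in exercise 5, round((1176923 + 1179554) / 4639675, 4) evaluates to 0.5079.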