University of Helsinki Department of Computer Science
 


Seminar: Machine Learning in Bioinformatics

58309106 Seminar: Machine Learning in Bioinformatics (3 cr)
Time: Mondays 14-16, I period: 6.09.-11.10.2010, II period: 01.11.-29.11.2010
Place: room C220

First session: Monday 6. September 2010 14.15-16, room C220

Prerequisites and enrolling in the course

The courses Introduction to Bioinformatics and Introduction to Machine Learning, or equivalent background knowledge, are required.

Enroll in the seminar using the registration system.

Seminar goals

Machine learning is one of the key technologies in bioinformatics, making it possible to automatically generate predictive models from data. In this seminar we will get an overview of how machine learning techniques are used in bioinformatics. We will look at various prediction problems (see the topics listed under Seminar material and topics) and at representative machine learning techniques in the context of these biological problems.

Completing the seminar

The language of the seminar is English. To pass the seminar, you need to complete four tasks. During Period I all students write their papers in English. The paper should be 6-10 pages long, in the format given below. The oral presentations, held during Period II, should last about 30-40 minutes, leaving some time for questions.

Grading

Students will be graded based on i) their written paper (40%), ii) their oral presentation (40%), and iii) their activity in commenting on other students' work and participating in the discussion (20%). To pass the course, the student must write the paper on the agreed subject and present their work. In addition, each student is required to attend at least 80% of the seminar presentations.

Grading will be on the scale 0-5 (0 = fail, 5 = excellent).
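
The weighting above can be illustrated with a small sketch. It assumes that each of the three components is itself scored on the 0-5 scale and that the weighted result is not rounded; neither assumption is stated on this page. The function names are hypothetical and exist only for this illustration.

```python
# Illustrative only: assumes each component is scored on the 0-5 scale.
# How the final grade is rounded is not specified, so no rounding is done here.

def weighted_score(paper, presentation, activity):
    """Combine the component scores with the 40/40/20 weights."""
    for score in (paper, presentation, activity):
        if not 0 <= score <= 5:
            raise ValueError("component scores must lie in the range 0-5")
    return (40 * paper + 40 * presentation + 20 * activity) / 100


def meets_attendance(attended, total):
    """Check the requirement of attending at least 80% of the presentations."""
    if total <= 0:
        raise ValueError("total number of presentations must be positive")
    return attended * 100 >= 80 * total


if __name__ == "__main__":
    print(weighted_score(paper=5, presentation=4, activity=3))
    print(meets_attendance(attended=8, total=10))
```

For example, a paper scored 5, a presentation scored 4 and an activity score of 3 give 0.4 * 5 + 0.4 * 4 + 0.2 * 3 = 4.2.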


Guidelines

Some additional guidelines for this seminar are given below. Further helpful material can be found on the home page of the scientific writing course of the Department of Computer Science.

Layout of the seminar paper

Using literature

Sources of information

Categories of information sources for the seminar, in order of preference:
  1. High-quality journals in bioinformatics, computer science, statistics, as well as biological and medical sciences. These are the preferred source of seminar material. A non-exhaustive list of suitable journals: Bioinformatics, BMC Bioinformatics, Data Mining and Knowledge Discovery, Journal of Computational Biology, Journal of Machine Learning Research, Machine Learning.
  2. Proceedings of high-quality conferences in computational biology and machine learning: Intelligent Systems for Molecular Biology (ISMB), Research in Computational Molecular Biology (RECOMB), International Conference on Machine Learning (ICML), Neural Information Processing Systems (NIPS).
  3. Textbooks contain high-quality information. However, as the publication process of books takes a long time, the information in textbooks is rarely the latest in science. Textbooks can be used as sources of information, but they should always be accompanied by journal and conference papers.
  4. Wikipedia contains a lot of information and is sometimes a good source for getting an overview of the seminar topic. However, the quality of Wikipedia articles varies. In particular, the peer-review process behind a Wikipedia article is not always at the same level as that of high-quality scientific journals and conferences. As a consequence, Wikipedia sometimes contains opinions of small groups of scientists that are not shared by the research community. Guideline: You may use Wikipedia as a means to learn about a topic. However, avoid using Wikipedia as the only source of information. Always verify the facts using other sources of information. Whenever possible, rely on journal and conference articles.
  5. Online course material is widely available on the web. It should be used with even more caution than Wikipedia. Some courses are very good and some are not, and there is no peer-review process behind the material. Online courses should not be used as references in your seminar paper.
  6. The rest of the WWW. A random web page of some individual, organization or group about some subject typically has very little quality control behind it. This material is not suitable as seminar paper material.

Finding information


A combination of two search strategies will lead to the best results

Oral presentation

Giving and receiving feedback

Two golden rules: A matrix of factors affecting grading in scientific writing may be used as a basis for feedback. However, you are NOT supposed to give grades with your feedback, only suggestions and comments.

Seminar material and topics

The seminar will be based on recent scientific articles and textbooks. The following survey article will be a useful starting point:
The following are preselected topics, each with associated seed articles. A seed article is meant to be used as a starting point for a literature search, not as the only or the best reference on a topic.
In addition to the preselected topics, you may suggest your own topic.

Gene prediction

  1. Axel Bernal, Koby Crammer, Artemis Hatzigeorgiou, Fernando Pereira. Global Discriminative Learning for Higher Accuracy Computational Gene Prediction. PLoS Computational Biology 3(3): e54
  2. Katharina J Hoff, Maike Tech, Thomas Lingner, Rolf Daniel, Burkhard Morgenstern and Peter Meinicke. Gene prediction in metagenomic fragments: A large scale machine learning approach. BMC Bioinformatics 2008, 9:217

Protein-DNA binding

  1. Nitin Bhardwaj, Robert E. Langlois, Guijun Zhao, and Hui Lu. Kernel-based machine learning protocol for predicting DNA-binding proteins. Nucleic Acids Res. 2005; 33(20): 6486-6493.
  2. Pengyu Hong, X. Shirley Liu, Qing Zhou, Xin Lu, Jun S. Liu, and Wing H. Wong. A boosting approach for motif modeling using ChIP-chip data. Bioinformatics 2005 21: 2636-2643

SNPs and haplotyping

  1. Eric P. Xing, Michael I. Jordan, Roded Sharan. Bayesian Haplotype Inference via the Dirichlet Process. Journal of Computational Biology. April 1, 2007, 14(3): 267-284.
  2. Lakshmi Matukumalli, John Grefenstette, David Hyten, Ik-Young Choi, Perry B Cregan and Curtis P Van Tassell. Application of machine learning in SNP discovery. BMC Bioinformatics 2006, 7:4

Protein structural classification and structure prediction

  1. Iain Melvin, Eugene Ie, Jason Weston, William Stafford Noble, Christina Leslie. Multi-class Protein Classification Using Adaptive Codes. Journal of Machine Learning Research 8 (2007), 1557-1581
  2. Scott Montgomerie, Joseph Cruz, Savita Shrivastava, David Arndt, Mark Berjanskii, David Wishart. PROTEUS2: a web server for comprehensive protein structure prediction and structure-based annotation. Nucleic Acids Research, 2008, Vol. 36, No. suppl_2 W202-W209
  3. Theodoros Damoulas and Mark Girolami. Probabilistic multi-class multi-kernel learning: on protein fold recognition and remote homology detection. Bioinformatics 24, 10 (2008):1264-1270

Protein identification

  1. Joshua Elias, Francis Gibbons, Oliver King, Frederick Roth and Steven Gygi. Intensity-based protein identification by machine learning from a library of tandem mass spectra. Nature biotechnology 22, 2 (2004), 214-

Protein function prediction

  1. Iddo Friedberg. Automated protein function prediction - the genomic challenge. Briefings in Bioinformatics 2006 7(3):225-242
  2. Igor V. Tetko, Igor V. Rodchenkov, Mathias C. Walter, Thomas Rattei and Hans-Werner Mewes. Beyond the 'best' match: machine learning annotation of protein sequences by integration of different sources of information. Bioinformatics 2008 24(5):621-628

Gene expression profiling

  1. Alexander Statnikov, Constantin F. Aliferis, Ioannis Tsamardinos, Douglas Hardin and Shawn Levy. A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics 2005 21(5):631-643
  2. Daxin Jiang, Chun Tang, Aidong Zhang. Cluster analysis for gene expression data: a survey. IEEE Transactions on Knowledge and Data Engineering 16, 11 (2004), 1370-1386

Biological network inference

  1. Florian Markowetz and Rainer Spang: Inferring cellular networks - a review. BMC Bioinformatics 8, 6 (2007), S5
  2. Jean-Philippe Vert: Reconstruction of biological networks by supervised machine learning approaches. Technical Report HAL-00283945, June, 2008.
  3. Ashwin Srinivasan and Ross D. King. Incremental Identification of Qualitative Models of Biological Systems using Inductive Logic Programming.

Gene (regulatory) networks

  1. Robert Castelo and Alberto Roverato: A Robust Procedure for Gaussian Graphical Model Search From Microarray Data With p Larger Than n. Journal of Machine Learning Research 7 (2006), 2621-2650.
  2. Jason Ernst, Oded Vainas, Christopher Harbison, Itamar Simon and Ziv Bar-Joseph. Reconstructing Dynamic Regulatory Maps. Molecular Systems Biology 4 (2007):74

Protein-protein interaction networks

  1. Krogan et al.: Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 440 (2006), 637-643
  2. Yanjun Qi, Ziv Bar-Joseph, Judith Klein-Seetharaman. Evaluation of different biological data and computational classification methods for use in protein interaction prediction. Proteins: Structure, Function, and Bioinformatics 63, 3 (2006), 490-500