Seminar: Machine Learning in Bioinformatics

58309106
3
Bioinformatics
Advanced studies
Year Semester Date Period Language In charge
2010 autumn 06.09-29.11. 1-2 English

Lectures

Time Room Lecturer Date
Mon 14-16 C220 Juho Rousu 06.09.2010-11.10.2010
Mon 14-16 C220 Juho Rousu 01.11.2010-29.11.2010

Information for international students

 The seminar is given in English.

General

The seminar has been graded. If you have questions on grading, please email Juho.

Machine learning is one of the key technologies in bioinformatics, making it possible to automatically generate predictive models from data. In this seminar we will get an overview of how machine learning techniques are used in bioinformatics. We will look at various prediction problems, including

  • prediction problems in biosequences (gene prediction, splice site prediction)
  • gene expression analysis and gene prioritization
  • structure prediction (RNA, protein)
  • protein function prediction
  • interaction networks prediction (protein-protein, gene regulation)

We will look at machine learning techniques in the context of the above-mentioned biological problems, including representative approaches to

  • classification methods (classification trees, nearest neighbor, neural networks, support vector machines)
  • clustering (partition-based, hierarchical, mixture models)
  • probabilistic graphical models (HMMs, Bayesian networks)

Completing the course

Recommended background: the courses Introduction to bioinformatics and Introduction to machine learning or equivalent background knowledge.

Enroll in the seminar through the registration system.

The language of the seminar is English. To pass the seminar, you need to do the following four tasks:

  • Write a paper on a topic agreed on during the first meetings,
  • Review two papers written by other students,
  • Prepare a presentation and discuss it with the other students, and
  • Participate in the seminar by asking questions, raising discussions on the topic, and reviewing other students' work. Attending at least 80% of the seminar presentations is required for passing the course.

During Period I all students write their papers in English. The length of the paper is 6-10 pages formatted according to the format given below. The oral presentations, during Period II, should last for about 25 minutes plus 5 minutes for questions.

The seminar will be graded based on i) the written paper (40%), ii) the oral presentation (40%), and iii) review work and participation in discussion (20%).

Grading will be on the scale 0-5 (0 = fail, 5 = excellent).
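
As an illustration of how the weights above could combine, here is a minimal sketch. It assumes each component is scored on the same 0-5 scale and that the components are combined as a plain weighted average. The page does not state the exact combination or rounding rule, so the function name and the example scores are hypothetical.

```python
# Weights in percent, as stated in the grading description.
WEIGHTS = {"paper": 40, "presentation": 40, "participation": 20}


def weighted_grade(scores):
    """Return the weighted average of component scores (each on the 0-5 scale).

    Assumption: a linear weighted average; the actual rule used for the
    final grade (including any rounding) is not specified in the source.
    """
    for name in WEIGHTS:
        if name not in scores:
            raise KeyError(f"missing component score: {name}")
        if not 0 <= scores[name] <= 5:
            raise ValueError(f"score for {name} must be between 0 and 5")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical example scores for one student.
example = {"paper": 4, "presentation": 5, "participation": 3}
print(weighted_grade(example))
```

Under these assumptions, a strong presentation can offset a weaker paper one-to-one, while review work and participation count for half as much as either.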

Schedule

Presenter | Topic | Reviewers | Pres. Date
Chengyu Liu | Biclustering of gene expression data by non-smooth non-negative matrix factorization | Y. Zhao | 1.11
Yan Zhao | Intensity-based protein identification by machine learning from a library of tandem mass spectra | C. Liu | 1.11
Markus Heinonen | Static gene networks | M. Al-Hello, M. Islam | 8.11
Muhammed Al-Hello | Gene regulation networks | M. Heinonen, M. Islam | 8.11
Oskar Gross | Finding genotype and phenotype associations | K. Hyytiäinen, J. Xiong | 15.11
Kirsi Hyytiäinen | SNP discovery & haplotyping | O. Gross, P. Korhonen | 15.11
Anni Nevanlinna | Metabolic pathway prediction | J. Liu, M. Du | 15.11
Daniel Blande | Protein function prediction | P. Korhonen, J. Xiong | 22.11
Jie Xiong | Transporter prediction | D. Blande, K. Hyytiäinen | 22.11
Pasi Korhonen | Enzyme function prediction | D. Blande, O. Gross | 22.11
Jia Liu | Supervised inference and reconstruction of biological networks | M. Du, A. Nevanlinna | 29.11
Mian Du | Text mining for protein-protein interactions | J. Liu, A. Nevanlinna | 29.11
Mohammad Islam | DNA-binding proteins | M. Heinonen, M. Al-Hello | 29.11

 

 

 

Literature and material

Some additional guidelines for this seminar are given below. Further helpful material can be found on the home page of the scientific writing course of the Department of Computer Science.

Layout of the seminar paper

Using literature

  • The seminar paper should cover the biological problem and the computational methods used to solve the problem. You may need to use separate sources for the application and for the method.
  • Try to locate the best papers about the topic. You will probably end up reading more papers than you will eventually use.
  • How many references should you use and cite? A rule of thumb is "as many references as there are pages in the paper". This does not mean that you will write exactly one page about each reference; some require more than others.
  • Try to make a synthesis of the literature. What is the main message of the papers about some topic? How do the individual papers relate to or deviate from this main message?

Sources of information

 

Categories of information sources for the seminar, in the order of preference:

  1. High-quality journals in bioinformatics, computer science, statistics, as well as biological and medical sciences. These are the preferred source of seminar material. A non-exhaustive list of suitable journals: Bioinformatics, BMC Bioinformatics, Data Mining and Knowledge Discovery, Journal of Computational Biology, Journal of Machine Learning Research, Machine Learning.

  2. Proceedings of high-quality conferences in computational biology and machine learning: Intelligent Systems for Molecular Biology (ISMB), Research in Computational Molecular Biology (RECOMB), International Conference on Machine Learning (ICML), Neural Information Processing Systems (NIPS).
  3. Textbooks contain high-quality information. However, as the publication process of books takes a long time, the information in textbooks is rarely the latest in science. Textbooks can be used as sources of information, but they should always be accompanied by journal and conference papers.
  4. Wikipedia contains a lot of information and is sometimes a good source for getting an overview of the seminar topic. However, the quality of Wikipedia articles varies. In particular, the peer-review process behind a Wikipedia article is not always at the same level as that of high-quality scientific journals and conferences. As a consequence, Wikipedia sometimes contains opinions of small groups of scientists that are not shared by the research community. Guideline: You may use Wikipedia as a means to learn about some topic. However, avoid using Wikipedia as the only source of information. Always verify the facts using other sources of information. Whenever possible, rely on journal and conference articles.
  5. Online course material is widely available on the web. It should be used with even more caution than Wikipedia. Some courses are very good, some are not, and there is no peer-review process behind the material. Online courses should not be used as references in your seminar paper.
  6. The rest of the web. A random web page of some individual/organization/group about some subject typically has very little quality control behind it. This material is not suitable as seminar paper material.

Finding information

  • Google Scholar is perhaps the best search engine for finding literature on a given topic.

  • The University of Helsinki has subscriptions to a wide range of electronic journals; you can access these from the university computers. (To access these via Google Scholar, remember to enable "Library Links" for University of Helsinki in the Google Scholar preferences.)

A combination of two search strategies will lead to the best results:

  • Google Scholar will give you well-referenced articles that match the keywords. These are often a bit older.
  • A systematic search through the tables of contents of the latest issues of good journals will give you the very latest work on the topic.

Oral presentation

  • The oral presentation should not be a mirror image of the written paper. You should concentrate on getting the main message through and leave minor details to the seminar paper.
  • Explain both the biological problem and the machine learning method(s).
  • Allocate enough time for each slide so that the audience have time to understand the contents. 2 minutes per slide is a good rule of thumb.

Giving and receiving feedback

Two golden rules:

  • When giving feedback, be constructive, suggest improvements rather than just criticizing.
  • When receiving feedback, try to look at your paper through the reviewer's eyes. Why did this particular comment/suggestion/criticism arise? Usually every bit of feedback contains something useful you can use to improve your paper.

A matrix of factors affecting grading in scientific writing may be used as a basis for feedback. However, you are NOT supposed to give grades with your feedback, only suggestions and comments.


Seminar material and topics

The seminar will be based on recent scientific articles and textbooks.

      The following are preselected topics, each with associated seed articles. A seed article is meant to be used as a starting point for the literature search, not as the only or the best reference on the topic. In addition to the preselected topics, you may suggest your own topic.

    Survey on Machine Learning and Bioinformatics

    The following survey article is a useful reference for all presentations. It is also available as a topic for the seminar paper/presentation:

    Pedro Larrañaga, Borja Calvo, Roberto Santana, Concha Bielza, Josu Galdiano, Iñaki Inza, José A. Lozano, Rubén Armañanzas, Guzmán Santafé, Aritz Pérez, and Victor Robles (2006): Machine learning in bioinformatics. Brief Bioinform 7: 86-112. http://bib.oxfordjournals.org/cgi/content/abstract/7/1/86

 

Gene prediction

Axel Bernal, Koby Crammer, Artemis Hatzigeorgiou, Fernando Pereira. Global Discriminative Learning for Higher Accuracy Computational Gene Prediction. PLoS Computational Biology 3(3): e54

Katharina J Hoff, Maike Tech, Thomas Lingner, Rolf Daniel, Burkhard Morgenstern and Peter Meinicke. Gene prediction in metagenomic fragments: A large scale machine learning approach. BMC Bioinformatics 2008, 9:217

Ter-Hovhannisyan, V. and Lomsadze, A. and Chernoff, Y.O. and Borodovsky, M. Gene prediction in novel fungal genomes using an ab initio algorithm with unsupervised training. Genome research 18, 12 (2008), 1979

Protein-DNA binding

Nitin Bhardwaj, Robert E. Langlois, Guijun Zhao, and Hui Lu. Kernel-based machine learning protocol for predicting DNA-binding proteins. Nucleic Acids Res. 2005; 33(20): 6486-6493.

Pengyu Hong, X. Shirley Liu, Qing Zhou, Xin Lu, Jun S. Liu, and Wing H. Wong: A boosting approach for motif modeling using ChIP-chip data. Bioinformatics 2005 21: 2636-2643

Fang, Y. and Guo, Y. and Feng, Y. and Li, M. Predicting DNA-binding proteins: approached from Chou’s pseudo amino acid composition and other specific sequence features. Amino acids 34, 1 (2008), 103-109

SNPs and haplotyping

Eric P. Xing, Michael I. Jordan, Roded Sharan. Bayesian Haplotype Inference via the Dirichlet Process. Journal of Computational Biology. April 1, 2007, 14(3): 267-284.

Lakshmi Matukumalli, John Grefenstette, David Hyten, Ik-Young Choi, Perry B Cregan and Curtis P Van Tassell. Application of machine learning in SNP discovery. BMC Bioinformatics 2006, 7:4

Protein structural classification and structure prediction

Iain Melvin, Eugene Ie, Jason Weston, William Stafford Noble, Christina Leslie. Multi-class Protein Classification Using Adaptive Codes. Journal of Machine Learning Research 8 (2007), 1557-1581

Scott Montgomerie, Joseph Cruz, Savita Shrivastava, David Arndt, Mark Berjanskii, David Wishart. PROTEUS2: a web server for comprehensive protein structure prediction and structure-based annotation. Nucleic Acids Research, 2008, Vol. 36, No. suppl_2 W202-W209

Theodoros Damoulas and Mark Girolami. Probabilistic multi-class multi-kernel learning: on protein fold recognition and remote homology detection. Bioinformatics 24, 10 (2008): 1264-1270

Protein identification

Joshua Elias, Francis Gibbons, Oliver King, Frederick Roth and Steven Gygi. Intensity-based protein identification by machine learning from a library of tandem mass spectra. Nature biotechnology 22, 2 (2004), 214-

Spivak, M. and Weston, J. and Bottou, L. and Käll, L. and Noble, W.S. Improvements to the percolator algorithm for peptide identification from shotgun proteomics data sets. J. Proteome Research 8, 7 (2009), 3737-3745

Protein function prediction: in general

Iddo Friedberg. Automated protein function prediction - the genomic challenge. Briefings in Bioinformatics 2006 7(3):225-242

Igor V. Tetko, Igor V. Rodchenkov, Mathias C. Walter, Thomas Rattei and Hans-Werner Mewes. Beyond the 'best' match: machine learning annotation of protein sequences by integration of different sources of information. Bioinformatics 2008 24(5):621-628

Enzyme function prediction

Arakaki, A.K. and Huang, Y. and Skolnick, J. EFICAz 2: enzyme function inference by a combined approach enhanced by machine learning. BMC Bioinformatics 10, 1 (2009),  107

Astikainen, K. and Holm, L. and Pitkänen, E. and Szedmak, S. and Rousu, J. Towards structured output prediction of enzyme function. BMC proceedings 2, Suppl 4, 2008, S2  

Transporter protein prediction

Li, H. and Dai, X. and Zhao, X. A nearest neighbor approach for automated transporter prediction and categorization from protein sequences. Bioinformatics 24, 9 (2008), 1129

Gromiha, M.M. and Yabuki, Y. Functional discrimination of membrane proteins using machine learning techniques. BMC Bioinformatics 9, 1 (2008), 135

Biological network inference

Florian Markowetz and Rainer Spang: Inferring cellular networks - a review. BMC Bioinformatics 8, 6 (2007), S5

Jean-Philippe Vert: Reconstruction of biological networks by supervised machine learning approaches. Technical Report HAL-00283945, June, 2008. 

Metabolic pathways

Joseph M Dale, Liviu Popescu and Peter D Karp. Machine learning methods for metabolic pathway prediction. BMC Bioinformatics 2010, 11:15

Kashima, H. and Yamanishi, Y. and Kato, T. and Sugiyama, M. and Tsuda, K. Simultaneous inference of biological networks of multiple species from genome-wide data and evolutionary information: a semi-supervised approach. Bioinformatics 25, 22 (2009), 2962

Gene networks

Robert Castelo and Alberto Roverato: A Robust Procedure for Gaussian Graphical Model Search From Microarray Data With p Larger Than n. Journal of Machine Learning Research 7 (2006), 2621-2650.

Jason Ernst, Oded Vainas, Christopher Harbison, Itamar Simon and Ziv Bar-Joseph. Reconstructing Dynamic Regulatory Maps. Molecular Systems Biology 4 (2007):74

Liu, B. and De La Fuente, A. and Hoeschele, I. Gene Network Inference via Structural Equation Modeling in Genetical Genomics Experiments. Genetics 178, 3 (2008), 1763

Protein-protein interaction networks

Krogan et al.: Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 440 (2006), 637-643

Yanjun Qi, Ziv Bar-Joseph, Judith Klein-Seetharaman. Evaluation of different biological data and computational classification methods for use in protein interaction prediction. Proteins: Structure, Function, and Bioinformatics 63, 3 (2006), 490-500

Gene prioritization

Stein Aerts, Diether Lambrechts, Sunit Maity, Peter Van Loo, Bert Coessens, Frederik De Smet, Leon-Charles Tranchevent, Bart De Moor, Peter Marynen, Bassem Hassan, Peter Carmeliet, Yves Moreau. Gene prioritization through genomic data fusion. Nature Biotechnology 24, 5 (2006), 537-544

Tijl De Bie, Leon-Charles Tranchevent, Liesbeth M. M. van Oeffelen, Yves Moreau. Kernel-based data fusion for gene prioritization. Bioinformatics 23 (2007), i125-i132

Chen, J. and Aronow, B.J. and Jegga, A.G. Disease candidate gene identification and prioritization using protein interaction networks. BMC bioinformatics 10, 1 (2009), 73

Drug Target Prediction

Jean-Loup Faulon, Milind Misra, Shawn Martin, Ken Sale, and Rajat Sapra. Genome scale enzyme–metabolite and drug–target interaction predictions using the signature molecular descriptor. Bioinformatics (2008) 24(2): 225-233

Yoshihiro Yamanishi, Masaaki Kotera, Minoru Kanehisa, and Susumu Goto. Drug-target interaction prediction from chemical, genomic and pharmacological data in an integrated framework. Bioinformatics (2010) 26(12): i246-i254

Genotype-phenotype prediction

Norman J. MacDonald and Robert G. Beiko. Efficient learning of microbial genotype–phenotype association rules. Bioinformatics (2010) 26(15): 1834-1840 

Seyoung Kim and Eric P. Xing. Statistical Estimation of Correlated Genome Associations to a Quantitative Trait Network. PLoS Genet. 2009 August; 5(8): e1000587

Gene expression biclustering

Pedro Carmona-Saez, Roberto D Pascual-Marqui, F Tirado, Jose M Carazo and Alberto Pascual-Montano. Biclustering of gene expression data by non-smooth non-negative matrix factorization. BMC Bioinformatics 2006, 7:78

Amela Prelić, Stefan Bleuler, Philip Zimmermann, Anja Wille, Peter Bühlmann, Wilhelm Gruissem, Lars Hennig, Lothar Thiele, and Eckart Zitzler. A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics (2006) 22(9): 1122-1129 

Or...

Suggest your own topic!