INFO-F-528 Machine learning methods for bioinformatics  (1-1)


Master in Bioinformatics and Modelling


Teacher: Gianluca Bontempi

Assistant: Claudio Reggiani




The course presents both conventional and recent techniques for building data models and learning predictive models from biological and medical data. It focuses on advanced statistical techniques for learning predictive models (classification, regression). All techniques are illustrated with practical applications to relevant problems in bioinformatics.

The course combines theoretical analysis of the methods with practical applications, using the R environment for statistics and visualization.

The target audience is computer science and biology students with basic knowledge of databases, statistics and programming.





1. Basic notions of biology
2. Data analysis and machine learning for bioinformatics
3. Classification: basic notions
   o Conditional distribution
   o Bayes risk
   o Classification strategies
   o Generalization and the bias/variance dilemma
   o Validation
   o Model selection
      - Winner-takes-all
      - Averaging
   o Performance assessment
      - Confusion matrix
      - ROC curves
4. Classification algorithms
5. Feature selection
   o PCA

Exercises (computer hands-on)



·      Introduction to R (part 1, part 2)

·      Supervised learning: parametric identification and model selection

·      Classification

·      Averaging methods

·      Dimensionality reduction in microarray data






Research articles

·      Javed Khan, Jun S. Wei, Markus Ringnér, Lao H. Saal, Marc Ladanyi, Frank Westermann, Frank Berthold, Manfred Schwab, Cristina R. Antonescu, Carsten Peterson and Paul S. Meltzer. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks.

·      Michael P.S. Brown, William Noble Grundy, David Lin, Nello Cristianini, Charles Sugnet, Manuel Ares, Jr., David Haussler. Support Vector Machine Classification of Microarray Gene Expression Data.

·      Jorge Lepre, J. Jeremy Rice, Yuhai Tu and Gustavo Stolovitzky. Genes@Work: an efficient algorithm for pattern discovery and multivariate feature selection in gene expression data.

·      Olof Emanuelsson, Henrik Nielsen, Søren Brunak, Gunnar von Heijne. Predicting Subcellular Localization of Proteins Based on their N-terminal Amino Acid Sequence.

·      Tao Li, Chengliang Zhang and Mitsunori Ogihara. A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression.

·      Jochen Jaeger, Rimli Sengupta and Walter L. Ruzzo. Improved Gene Selection For Classification Of Microarrays.

·      Sandrine Dudoit, Jane Fridlyand, and Terence P. Speed. Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data.

·      Sven Degroeve, Bernard De Baets, Yves Van de Peer and Pierre Rouze. Feature subset selection for splice site prediction.

·      M. Bagirov, B. Ferguson, S. Ivkovic, G. Saunders and J. Yearwood. New algorithms for multi-class cancer diagnosis using tumor gene expression signatures.

·      Keleş S, van der Laan M, Eisen M.B. Identification of regulatory elements using a feature selection method.


The exam includes a discussion of one of the research articles listed above.

Students must tell Prof. Gianluca Bontempi by email which article they will discuss.

The discussion (max. 20 minutes) should be supported by an electronic presentation (PDF or PowerPoint).




Syllabus: “Statistical foundations of machine learning”




·      W.J. Ewens, G.R. Grant (2002) Statistical Methods in Bioinformatics: An Introduction. Springer.

·      T. Hastie, R. Tibshirani, J. Friedman (2002) The Elements of Statistical Learning. Springer.

·      D.M. Dziuda (2010) Data Mining for Genomics and Proteomics. Wiley.

·      R Project for Statistical Computing



R references

The Comprehensive R Archive Network

R packages

R for beginners

Introduction to R

An introduction to R

CRAN: R News

ESS -- Emacs Speaks Statistics

CRAN: books Contributed Documentation




Kent Ridge Bio-medical Dataset




Transcription and Translation

DNA microarrays