The course combines theoretical analysis of the methods with practical applications, making use of a statistical and visualization package (R).
The target audience is computer science and biology students with a basic knowledge of databases, statistics and programming.
o Conditional distribution
o Bayes risk
o Classification strategies
o Generalization and the bias/variance dilemma
o Model selection
o Performance assessment
§ Confusion matrix
§ ROC curves
Exercises (computer hands-on)
· Javed Khan, Jun S. Wei, Markus Ringnér, Lao H. Saal, Marc Ladanyi, Frank Westermann, Frank Berthold, Manfred Schwab, Cristina R. Antonescu, Carsten Peterson & Paul S. Meltzer. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks
· Michael P.S. Brown, William Noble Grundy, David Lin, Nello Cristianini, Charles Sugnet, Manuel Ares, Jr., David Haussler. Support Vector Machine Classification of Microarray Gene Expression Data
· Jorge Lepre, J. Jeremy Rice, Yuhai Tu and Gustavo Stolovitzky. Genes@Work: an efficient algorithm for pattern discovery and multivariate feature selection in gene expression data
· Olof Emanuelsson, Henrik Nielsen, Søren Brunak, Gunnar von Heijne. Predicting Subcellular Localization of Proteins Based on their N-terminal Amino Acid Sequence
· Tao Li, Chengliang Zhang and Mitsunori Ogihara. A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression
· Sandrine Dudoit, Jane Fridlyand, and Terence P. Speed. Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data
· Sven Degroeve, Bernard De Baets, Yves Van de Peer and Pierre Rouzé. Feature subset selection for splice site prediction.
· M. Bagirov, B. Ferguson, S. Ivkovic, G. Saunders and J. Yearwood. New algorithms for multi-class cancer diagnosis using tumor gene expression signatures
· Keles S, van der Laan M, Eisen M.B. Identification of regulatory elements using a feature selection method
The exam consists of the discussion of one of the articles listed above.
Students must inform Prof. Gianluca Bontempi which article they will discuss by sending an email to firstname.lastname@example.org.
The discussion (max. 20 minutes) should be supported by an electronic presentation (PDF or PowerPoint).
· W. J. Ewens, G. R. Grant (2002) Statistical Methods in Bioinformatics: An Introduction. Springer.
· T. Hastie, R. Tibshirani, J. Friedman (2002) The Elements of Statistical Learning. Springer.
· D. M. Dziuda (2010) Data Mining for Genomics and Proteomics. Wiley.
· R Project for Statistical Computing: www.r-project.org