2014-15

*The course combines theoretical analysis of the methods with practical applications, using statistical and visualization packages (R).*

The target audience is computer science and biology students with basic knowledge of databases, statistics and programming.

3. Classification: basic notions

o Conditional distribution

- Bayes theorem
- Bayes classifier

o Bayes risk

o Classification strategies

o Generalization and the bias/variance dilemma

o Validation

o Model selection

§ Winner-takes-all

§ Averaging

o Performance assessment

§ Confusion matrix

§ ROC curves

- Multi-class problems
- Classification algorithms
- Discriminant analysis
- Linear discriminant analysis
- Perceptrons
- SVM
- Classification trees
- Density estimation
- Naive-Bayes
- Regression based techniques
- KNN
- Ensemble methods

o PCA

- Hierarchical clustering
- Filters
- Wrappers

**Exercises** *(computer hands-on)*

· Introduction to R (part 1, part 2)

· Supervised learning: parametric identification and model selection

· Dimensionality reduction in microarray data

**Research articles**

· Javed Khan, Jun S. Wei, Markus Ringnér, Lao H. Saal, Marc Ladanyi, Frank Westermann, Frank Berthold, Manfred Schwab, Cristina R. Antonescu, Carsten Peterson & Paul S. Meltzer. **Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks**

· Michael P.S. Brown, William Noble Grundy, David Lin, Nello Cristianini, Charles Sugnet, Manuel Ares, Jr., David Haussler. **Support Vector Machine Classification of Microarray Gene Expression Data**

· Jorge Lepre, J. Jeremy Rice, Yuhai Tu and Gustavo Stolovitzky. **Genes@Work: an efficient algorithm for pattern discovery and multivariate feature selection in gene expression data**

· Olof Emanuelsson, Henrik Nielsen, Søren Brunak, Gunnar von Heijne. **Predicting Subcellular Localization of Proteins Based on their N-terminal Amino Acid Sequence**

· Tao Li, Chengliang Zhang and Mitsunori Ogihara. **A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression**

· Jochen Jaeger, Rimli Sengupta and Walter L. Ruzzo. **Improved Gene Selection For Classification Of Microarrays**

· Sandrine Dudoit, Jane Fridlyand, and Terence P. Speed. **Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data**

· Sven Degroeve, Bernard De Baets, Yves Van de Peer and Pierre Rouze. **Feature subset selection for splice site prediction**

· M. Bagirov, B. Ferguson, S. Ivkovic, G. Saunders and J. Yearwood. **New algorithms for multi-class cancer diagnosis using tumor gene expression signatures**

· Keles S, van der Laan M, Eisen M.B. **Identification of regulatory elements using a feature selection method**

**Exam**

The exam consists of:

- Carrying out a data analysis project using the R language. **PROJECT 2014-15. Data set.**
- An oral discussion of the project results.

**Article discussion:** The discussion should deal with the following aspects of the article (see list above):

- Which biological problem was addressed?
- How were the data collected or retrieved?
- Are the data publicly available? Did you try to download them?
- How many samples and variables were used?
- In which form (e.g. supervised, unsupervised, classification, regression) was the data analysis problem posed?
- Which learning or modelling technique was used?
- Which parametric identification technique was used?
- How was the problem of structural identification (e.g. model or variable selection) addressed?
- How was the accuracy of the approach validated?
- Could you suggest some techniques which could also have been used in the same context or problem?

Students must inform Prof. Gianluca Bontempi of the article they will discuss by sending an email to gbonte@ulb.ac.be.

The discussion (max. 20 minutes) should be supported by an __electronic presentation__ (PDF or PowerPoint).

· W.J. Ewens, G.R. Grant (2002) *Statistical methods in Bioinformatics: an introduction.* Springer.

· T. Hastie, R. Tibshirani, J. Friedman (2002) *The Elements of Statistical Learning.* Springer.

· D. M. Dziuda (2010) *Data mining for genomics and proteomics.* Wiley.

· R Project for Statistical Computing: www.r-project.org

The Comprehensive R Archive Network

ESS -- Emacs Speaks Statistics

CRAN: books Contributed Documentation

**Datasets**

Kent Ridge Bio-medical Dataset

**Videos**