Typical microarray data sets contain only a handful of observations
but several thousand predictor variables. Transforming the high-dimensional
predictor space to make classification (for instance cancer diagnosis) possible
is a major challenge. This thesis deals with various dimension reduction
approaches that can handle such data.
Chapter 2 gives an introduction to classification with microarray
data as well as an overview of a few specific problems such as variable selection and
the comparison of classification methods.
In Chapter 3, I discuss a particular class of interaction structures
in the classification framework: "emerging patterns". I propose a new
and more general definition in terms of the underlying
probabilities and present a new, simple method based on
the CART algorithm to find the corresponding empirical patterns in concrete data sets. The detected patterns
can also be used to define new variables for classification, and I propose a simple scheme
that uses them to improve the performance of classification
procedures. I implemented the search algorithm as well as the classification
procedure in the language R. Some of these programs are publicly available from my
homepage.
Chapter 4 deals with classical linear dimension reduction methods.
In the context of binary classification with continuous predictors, I prove
two properties: one concerning the connection between Partial Least Squares (PLS)
dimension reduction and between-group PCA, the other concerning the connection
between linear discriminant analysis and between-group PCA.
PLS dimension reduction for classification is
examined thoroughly in Chapter 5. The classification procedure consisting of
PLS dimension reduction followed by linear discriminant analysis on the new components compares
favorably with some of the best state-of-the-art classification methods on nine real microarray cancer data sets.
Moreover, I apply a boosting algorithm to this classification method, which is a novel
approach.
In addition, I suggest a simple procedure to choose the number of PLS
components. Finally, I examine the connection between PLS dimension reduction and variable
selection and prove an equivalence between a common
univariate selection criterion and a variable selection approach based on the
first PLS component.
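
The link between the first PLS component and univariate selection can be
illustrated with a small numerical sketch. This summary does not name the
univariate criterion, so the sketch assumes it is the pooled-variance two-sample
t-statistic, and it assumes the predictors are standardized before the first PLS
weight vector is computed. The function names (first_pls_weights, t_statistics,
ranking) and the toy data are hypothetical and are not taken from the R programs
mentioned above. Under these assumptions the first PLS weight of a variable is
proportional to its correlation with the class label, and the t-statistic is a
monotone function of that correlation, so both criteria should order the
variables identically.

```python
import math
import random


def column(X, j):
    return [row[j] for row in X]


def mean(v):
    return sum(v) / len(v)


def sample_sd(v):
    m = mean(v)
    return math.sqrt(sum((a - m) ** 2 for a in v) / (len(v) - 1))


def first_pls_weights(X, y):
    """First PLS weight per variable (assumption: standardized predictors).

    Each weight is the inner product of the standardized column with the
    centered 0/1 class label; the vector is not rescaled to unit length.
    """
    y_bar = mean(y)
    weights = []
    for j in range(len(X[0])):
        x = column(X, j)
        m, s = mean(x), sample_sd(x)
        weights.append(sum((a - m) / s * (b - y_bar) for a, b in zip(x, y)))
    return weights


def t_statistics(X, y):
    """Pooled-variance two-sample t-statistic per variable (class 1 minus class 0)."""
    stats = []
    for j in range(len(X[0])):
        x = column(X, j)
        g1 = [a for a, b in zip(x, y) if b == 1]
        g0 = [a for a, b in zip(x, y) if b == 0]
        n1, n0 = len(g1), len(g0)
        pooled_var = ((n1 - 1) * sample_sd(g1) ** 2
                      + (n0 - 1) * sample_sd(g0) ** 2) / (n1 + n0 - 2)
        stats.append((mean(g1) - mean(g0))
                     / math.sqrt(pooled_var * (1 / n1 + 1 / n0)))
    return stats


def ranking(scores):
    """Variable indices ordered by decreasing absolute score."""
    return sorted(range(len(scores)), key=lambda j: -abs(scores[j]))


# Toy data (made up): 12 observations, 50 variables, the first 5 shifted in class 1.
random.seed(0)
y = [0] * 6 + [1] * 6
X = [[random.gauss(1.5 * label if j < 5 else 0.0, 1.0) for j in range(50)]
     for label in y]

pls_ranking = ranking(first_pls_weights(X, y))
t_ranking = ranking(t_statistics(X, y))
print(pls_ranking == t_ranking)
```

Selecting the top-ranked variables from either list then yields the same subset
in this toy setting. With unstandardized predictors the weights would depend on
the scale of each variable, and the two orderings would generally differ.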