Although the measurement of gene expression using microarrays has become a standard
high-throughput method in molecular biology, the analysis of gene expression data is still
a very active area of research in bioinformatics and statistics. Despite some issues with
the quality and reproducibility of microarray data and the data derived from them,
microarrays are still considered one of the most promising experimental techniques for
understanding complex molecular mechanisms.
This work approaches the problem of expression data analysis using contextual information.
While all analyses must be based on sound statistical data processing, it is also
important to include biological knowledge to arrive at biologically interpretable results.
After an introduction and some biological background, chapter 2 reviews some standard
methods for the analysis of microarray data, including normalization, computation
of differentially expressed genes, and clustering. The first source of context
information used to aid in the interpretation of the data is the functional annotation
of genes. Such information is often represented using ontologies such as the Gene Ontology
(GO). GO annotations are provided by many gene and
protein databases and have been used to find functional groups that are significantly enriched
in differentially expressed or otherwise conspicuous genes. In gene clustering approaches,
functional annotations have been used to find enriched functional classes within
each cluster. In chapter 3, a clustering method for the samples of an expression data set
is described that uses GO annotations during the clustering process in order to find functional
classes that imply a particularly strong separation of the samples. The resulting
clusters can be interpreted more easily in terms of GO classes. The clustering method was
developed in joint work with Henning Redestig.
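
To make the notion of a significantly enriched functional group concrete, the following minimal sketch computes an enrichment p-value for a single GO class. It assumes a one-sided hypergeometric test, which is a common choice for this purpose but is not stated here to be the test used in this work; the function name and its parameters are hypothetical.

```python
from math import comb


def enrichment_p_value(n_genes, n_annotated, n_selected, n_overlap):
    """Upper-tail hypergeometric p-value P(X >= n_overlap).

    n_genes:     number of genes measured on the array
    n_annotated: number of those genes annotated with the GO class
    n_selected:  number of differentially expressed (selected) genes
    n_overlap:   number of selected genes annotated with the GO class
    """
    max_overlap = min(n_annotated, n_selected)
    # Number of ways to draw the selected set from all measured genes.
    total = comb(n_genes, n_selected)
    # Number of draws that contain at least n_overlap annotated genes.
    tail = sum(
        comb(n_annotated, k) * comb(n_genes - n_annotated, n_selected - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical numbers: 1000 genes, 40 in the GO class, 50 selected, 8 in both.
p = enrichment_p_value(1000, 40, 50, 8)
```

A small p-value suggests that the GO class is over-represented among the selected genes. When many GO classes are tested, a correction for multiple testing would be needed in addition.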
More complex biological information that covers interactions between biological objects
is contained in networks. Such networks can be obtained from public databases of metabolic
pathways, signaling cascades, transcription factor binding sites, or high-throughput measurements
for the detection of protein-protein interactions such as yeast two-hybrid experiments.
Furthermore, networks can be inferred using literature mining approaches or
network inference from expression data. The information contained in such networks is
very heterogeneous with respect to the type, quality, and completeness of the contained
data. ToPNet, a software tool for the interactive analysis of networks and gene
expression data, has been developed in cooperation with Daniel Hanisch. The basic analysis
and visualization methods as well as some important concepts
of this tool are described in chapter 4.
In order to access the heterogeneous data represented as networks with annotated experimental
data and functions, it is important to provide advanced querying functionality.
Pathway queries allow the formulation of
network templates that can include functional annotations as well as expression data. The
pathway search algorithm finds all instances of the template in a given network. In order
to do so, a special case of the well-known subgraph isomorphism problem has to be
solved. Although the algorithm has exponential running time in the worst case, some implementation
tricks make it run fast enough for practical purposes. Often, a pathway query
has many matching instances, and it is important to assess the statistical significance of
the individual instances with respect to expression data or other criteria. In chapter 5
the pathway query language and the pathway search algorithm are described in detail and
some theoretical properties are derived. Furthermore, some scoring methods that have
been implemented are described. Combining different scoring schemes for different parts
of the query allows for very flexible scoring.
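
The idea of matching a network template against a network can be illustrated with a small backtracking search. The sketch below is not the pathway search algorithm of chapter 5; it is a plain enumeration of non-induced subgraph matches with per-node predicates, and all names (find_instances, node_predicates, the toy network and expression values) are hypothetical. Like the general problem, it has exponential running time in the worst case.

```python
def find_instances(template_edges, node_predicates, network):
    """Enumerate injective mappings of template nodes to network nodes.

    template_edges:  list of directed template edges (u, v)
    node_predicates: dict template node -> predicate on a network node
    network:         dict network node -> set of successor nodes
    """
    order = list(node_predicates)  # fixed order in which template nodes are mapped
    results = []

    def consistent(mapping):
        # Every template edge with both endpoints mapped must exist in the network.
        return all(
            mapping[v] in network.get(mapping[u], set())
            for u, v in template_edges
            if u in mapping and v in mapping
        )

    def extend(mapping):
        if len(mapping) == len(order):
            results.append(dict(mapping))
            return
        t = order[len(mapping)]
        for candidate in network:
            if candidate in mapping.values():
                continue  # keep the mapping injective
            if not node_predicates[t](candidate):
                continue  # annotation or expression constraint not met
            mapping[t] = candidate
            if consistent(mapping):
                extend(mapping)
            del mapping[t]  # backtrack

    extend({})
    return results


# Toy data (hypothetical): a small directed network and expression values.
network = {"A": {"B"}, "B": {"C"}, "C": set(), "D": {"B"}}
expression = {"A": 2.1, "B": 0.3, "C": 1.8, "D": -0.2}

# Template: a chain x -> y -> z in which x must be up-regulated.
template_edges = [("x", "y"), ("y", "z")]
node_predicates = {
    "x": lambda n: expression[n] > 1.0,
    "y": lambda n: True,
    "z": lambda n: True,
}
instances = find_instances(template_edges, node_predicates, network)
```

In this toy example the only instance maps x, y, z to A, B, C; the chain starting at D is rejected because D does not satisfy the expression constraint. Each instance found in this way could then be passed to a scoring function to assess its significance.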
In chapter 6, some applications of the methods are described, us