The goal of bioinformatics is to develop innovative and practical methods and algorithms for
biological questions. In many cases, these questions are driven by new biotechnological
techniques, especially by genome-wide and cell-wide high-throughput experiments.
In principle, there are two approaches:
1. Reduction and abstraction of the question to a clearly defined optimization problem, which
can be solved with appropriate and efficient algorithms.
2. Development of context-based methods, incorporating as much contextual knowledge as
possible in the algorithms, and derivation of practical solutions for relevant biological
questions on the high-throughput data. These methods can often be supported by appropriate
software tools and visualizations, allowing for interactive evaluation of the results by
experts.
Context-based methods are often much more complex and require more involved algorithmic
techniques to obtain practically relevant and efficient solutions for real-world problems, as
in many cases even the simplified abstraction of a problem results in NP-hard problem
instances. To solve these complex problems, one often needs to combine efficient data
structures and heuristic search methods with efficient (polynomial) optimization of clearly
defined sub-problems (such as dynamic programming, greedy, path or tree algorithms).
In this thesis, we present new methods and analyses addressing open questions of bioinformatics
from different contexts by incorporating the corresponding contextual knowledge.
The two main contexts in this thesis are the protein structure similarity context (Part I) and
the network-based interpretation of high-throughput data (Part II).
For the protein structure similarity context (Part I), we analyze the consistency of
gold-standard structure classification systems and derive a consistent benchmark set usable
for different applications. We introduce two methods (Vorolign, PPM) for the protein structure
similarity recognition problem, based on different features of the structures.
Derived from the idea and results of Vorolign, we introduce the concept of contact neighbor-
hood potential, aiming to improve the results of protein fold recognition and threading.
For the re-scoring problem of predicted structure models, we introduce the method Vorescore,
which clearly improves the fold recognition performance and enables the evaluation of the
contact neighborhood potential for structure prediction methods in general.
We introduce a contact-consistent Vorolign variant, ccVorolign, which further improves the
structure-based fold recognition performance and enables direct optimization of the
neighborhood potential in the future. Due to the enforcement of contact consistency, the
ccVorolign method has much higher computational complexity than the polynomial Vorolign
method, which is the cost of computing interpretable and consistent alignments.
Finally, we introduce a novel structural alignment method (PPM) enabling the explicit modeling
and handling of phenotypic plasticity in protein structures. We employ PPM for the analysis of
effects of alternative splicing on protein structures. With the help of PPM, we test the
hypothesis that splice isoforms of the same protein can lead to protein structures with
different folds (fold transitions).
In Part II of the thesis, we present methods that generate and use context information for the
interpretation of high-throughput experiments.
For the generation of context information on molecular regulation, we introduce novel text
mining approaches that extract relations automatically from scientific publications.
In addition to syngrep, a fast named entity recognition (NER) method, we present a novel,
fully ontology-based, context-sensitive method (SynTree), which allows for the
context-specific disambiguation of ambiguous synonyms and results in much better
identification performance.
This context information is important f