Modern database applications are characterized by two major aspects: the use of complex data types
with internal structure and the need for new data analysis methods. The focus of database users has shifted from
simple queries to complex analyses of the data, known as knowledge discovery in databases. Important
tasks in this area are the grouping of data objects (clustering), the classification of new data objects or the
detection of exceptional data objects (outlier detection). Most algorithms for solving those problems are based on
similarity search in databases. This makes efficient similarity search in large databases of structured objects
an important basic operation for modern database applications.
In this thesis we develop efficient methods for similarity search in large databases of
structured data and improve the efficiency of existing query processing techniques.
For the data objects, only a tree or graph structure is assumed which can be extended with arbitrary attribute
information.
Starting with an analysis of the demands from two example applications, several important requirements for
similarity measures are identified. One aspect is the adaptability of the similarity search method to the
requirements of the user and the application domain. This can even imply a change of the similarity measure
between two successive queries of the same user. An explanation component which makes clear why objects are considered
similar by the system is a necessary precondition for a purposeful adaption of the measure. Consequently,
the edit distance, well-known from string processing, is a common similarity measure for graph structured
objects. Its feature to allow a visualization of corresponding substructures and the possibility to weight
single operations are the reason for this popularity.
But it turns out that the edit distance and similar measures for tree structures are computationally extremely
complex which makes them unsuitable for today's large and even growing databases. Therefore, we develop a
multi-step query processing architecture which reduces the number of necessary distance calculations significantly.
This is achieved by employing suitable filter methods.
Furthermore, we show that by easing certain restrictions
on the similarity measure, a significant performance gain can be obtained without reducing the quality of the
measure. To achieve this, matchings of substructures (vertices or edges) of the data objects are determined.
An additional cost function for those matchings allows to derive a similarity measure for structured data, called
the edge matching distance, from
the cost optimal matching of the substructures. But even for this new similarity measure, efficiency can be improved
significantly by using a multi-step query processing approach. This allows the use of the edge matching distance for
knowledge discovery applications in large databases. Within the thesis, the properties of our new similarity search methods
are proved both theoretically and through experiments.