January 12, 2010

Data Integration on the (Semantic) Web with Rules and Rich Unification

For the last decade a multitude of new data formats for the World Wide Web

have been developed, and a huge amount of heterogeneous semi-structured data

is flourishing online. With the ever increasing number of documents on the

Web, rules have been identified as the means of choice for reasoning about

this

data, transforming and integrating it. Query languages such as SPARQL and rule

languages such as Xcerpt use compound queries that are matched or unified with

semi-structured data. This notion of unification is different from the one

that is known from logic programming engines in that it (i) provides

constructs that allow queries to be incomplete in several ways (ii) in that

variables may have different types, (iii) in that it results in sets of

substitutions for the variables in the query instead of a single substitution

and (iv) in that subsumption between queries is much harder to decide than in

logic programming.

This thesis abstracts from Xcerpt query term simulation, SPARQL graph pattern

matching and XPath XML document matching, and shows that all of them can be

considered as a form of rich unification. Given a set of mappings between

substitution sets of different languages, this abstraction opens up the

possibility for format-versatile querying, i.e. combination of queries in

different formats, or transformation of one format into another format within

a single rule.

To show the superiority of this approach, this thesis introduces an extension

of Xcerpt called Xcrdf, and describes use-cases for the combined querying

and integration of RDF and XML data. With XML being the predominant Web

format, and RDF the predominant Semantic Web format, Xcrdf extends Xcerpt

by a set of RDF query terms and construct terms, including query primitives

for RDF containers collections and reifications. Moreover, Xcrdf includes

an RDF path query language called RPL that is more expressive than previously

proposed polynomial-time RDF path query languages, but can still be evaluated

in polynomial time combined complexity.

Besides the introduction of this framework for data integration based on rich

unification, this thesis extends the theoretical knowledge about Xcerpt in

several ways: We show that Xcerpt simulation unification is decidable, and

give complexity bounds for subsumption in several fragments of Xcerpt query

terms. The proof is based on a set of subsumption monotone query term

transformations, and is only feasible because of the injectivity requirement

on subterms of Xcerpt queries. The proof gives rise to an algorithm for

deciding Xcerpt query term simulation. Moreover, we give a semantics to

locally and weakly stratified Xcerpt programs, but this semantics is

applicable not only to Xcerpt, but to any rule language with rich unification,

including multi-rule SPARQL programs. Finally, we show how Xcerpt grouping

stratification can be reduced to Xcerpt negation stratification, thereby also

introducing the notion of local grouping stratification and weak grouping

stratification.

...more

View all episodes

By Ludwig-Maximilians-Universität München

11 ratings