Ph.D. Theses

Planning and Evaluation of Federated Queries on the Web

By Gregory Williams
Advisor: James Hendler
February 19, 2013

The Web of Data continues to increase in size and diversity, providing access to large amounts of structured, linked data. However, existing approaches to querying this data often fail to make use of existing database access points and must resort to web crawling to collect data of interest. Furthermore, to answer queries over this data efficiently, existing systems are forced to construct centralized database indexes, which makes it difficult to keep the data up to date. Approaches that do use existing databases often disregard fundamental design principles of the Web, resulting in query systems that lack some basic features of their web crawling counterparts. If an efficient query answering system can be provided that does not require centralized indexing and that leverages both existing databases and static web content, users may benefit from up-to-date access to structured, disparate data.

In this dissertation, we develop a federated query planning framework based on the RDF data model and the SPARQL query language. This framework is able to leverage the high performance of existing SPARQL databases while also providing access to linked data available as RDF documents on the web. Together, these two access methods provide a single interface for querying semantic data.

The primary challenge in evaluating queries over both SPARQL databases and linked data is finding an efficient execution plan. Such a plan must perform better than the naive approaches of completely decomposing the query and executing each subquery against each data source, or of traversing linked data by web crawling. Moreover, it must allow metadata discovered during query execution to be incorporated into the existing plan.

To address this challenge, we develop three techniques that improve the performance and flexibility of federated query evaluation: a federated query planning algorithm that prioritizes the execution of subqueries with high expected value (that is, subqueries expected to return relevant results with low latency); a re-planning algorithm that can augment an existing query plan with newly discovered data sources, together with a mechanism for discovering such sources; and a server-side technique that greatly enhances the web cacheability of SPARQL query results.

Finally, the developed framework is designed using a traditional query planner, allowing it to integrate with and benefit from existing work on query planning and optimization.
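
As a rough illustration of the first technique, prioritizing subqueries by expected value, the following Python sketch orders candidate subqueries so that those expected to return many relevant results quickly are executed first. It is not the dissertation's algorithm: the scoring function (estimated results divided by estimated latency), the Subquery type, the example endpoints, and all numeric estimates are assumptions made only for this example.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Subquery:
    """Hypothetical description of one subquery sent to one data source."""
    pattern: str             # triple pattern, written informally
    source: str              # SPARQL endpoint or RDF document URL (made up)
    expected_results: float  # estimated number of relevant results
    expected_latency: float  # estimated seconds until results arrive


def expected_value(sq: Subquery) -> float:
    """Assumed score: relevant results per second of latency."""
    if sq.expected_latency <= 0:
        raise ValueError("expected_latency must be positive")
    return sq.expected_results / sq.expected_latency


def prioritize(subqueries: Iterable[Subquery]) -> List[Subquery]:
    """Return subqueries ordered from highest to lowest expected value."""
    return sorted(subqueries, key=expected_value, reverse=True)


# Illustrative candidates; the estimates are invented for this sketch.
candidates = [
    Subquery("?author foaf:name ?name", "http://example.org/people.rdf", 20, 1.5),
    Subquery("?paper dc:title ?title", "http://example.org/sparql", 500, 0.4),
    Subquery("?paper dc:creator ?author", "http://example.org/sparql", 500, 0.2),
]

for sq in prioritize(candidates):
    print(f"{expected_value(sq):8.1f}  {sq.pattern}  @ {sq.source}")
```

A re-planning step could, under the same assumptions, add subqueries for newly discovered sources to the candidate list and call prioritize again; a real planner would also need to account for join order and dependencies between subqueries, which this sketch ignores.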

To demonstrate the practicality of this federated query planning framework, we present the results of an empirical evaluation of the framework components over a real-world dataset of bibliographic data. These results show that the federated query planning, evaluation, and caching techniques are able to produce query results quickly and efficiently. The effects of several optimizations on the execution of federated queries are discussed, and their impact on performance is evaluated.
