Efficient processing of complex join queries on the cloud

by Xiaofei Zhang

THESIS 2013

Ph.D. Computer Science and Engineering

xii, 125 pages : illustrations ; 30 cm

Abstract

The join operation is one of the most expressive and expensive data analysis tools in traditional database systems. With the exponential growth of data collections, NoSQL data storage has become the prevailing solution for Big Data. However, without the support of heavy indexing, the join operator becomes even more crucial and challenging for querying or mining massive data. Different types of join operations over distributed data have been studied intensively, e.g., similarity join, set join, and fuzzy join, all of which focus on efficient join query evaluation by exploiting the massive parallelism of the MapReduce computing framework on the Cloud platform. However, the multi-way generalized join problem, referred to as the “complex join” in this thesis, has not yet been thoroughly explored. The major challenge of the “complex join” is, given a number of processing units, to map a complex join query to a number of parallel tasks and execute them in a well-scheduled sequence such that the total processing time span is minimized. In this thesis, we demonstrate how our “complex join” solution can be applied to query processing in various data analytic scenarios, e.g., querying RDF data and pattern matching over graph data. To summarize, our study covers four aspects:

1) We propose a cost-model-based RDF join processing solution using MapReduce and a general-purpose optimization strategy;

2) We propose a novel representation of RDF data on Cloud platforms, based on which we propose an I/O-efficient strategy to evaluate SPARQL queries as quickly as possible;

3) We study the problem of efficient processing of multi-way Theta-join queries using MapReduce from a cost-effective perspective; and

4) We develop a complete solution framework for join-based efficient analysis over distributed graphs using the distance join query as an example.

We validate our solutions through extensive experiments and discuss several interesting research directions for complex join processing on the Cloud.

Copyrighted to the author. Reproduction is prohibited without the author’s prior written consent.

Details

Collection: HKUST Electronic Theses
Degree: Ph.D.
Department: Computer Science and Engineering
Supervisors: Chen, Lei
Authors: Zhang, Xiaofei
Subjects: Information storage and retrieval systems; Data mining; Querying (Computer science); Cloud computing
Language: English
Call number: Thesis CSED 2013 ZhangX
DOI: 10.14711/thesis-b1250336
