THESIS
2013
xii, 125 pages : illustrations ; 30 cm
Abstract
The join operation is one of the most expressive and expensive data-analytic tools
in traditional database systems. With the exponential growth of data collections,
NoSQL data storage has risen as the prevailing solution for Big Data. However,
without the strong support of heavy indexing, the join operator becomes even more
crucial and challenging for querying or mining massive data. Different types of
join operations over distributed data, e.g., similarity join, set join, and fuzzy
join, have been studied intensively; all of these studies focus on efficient join
query evaluation by exploiting the massive parallelism of the MapReduce computing
framework on the Cloud platform. However, the multi-way generalized join problem,
which is summarized as the “complex join” in this thesis, has not yet been
thoroughly explored. Given a number of processing units, the major challenge of
the “complex join” lies in mapping a complex join query to a number of parallel
tasks and executing them in a well-scheduled sequence, such that the total
processing time span is minimized. In this thesis, we demonstrate how our
“complex join” solution can be applied to query processing in various
data-analytic scenarios, e.g., querying RDF data and pattern matching over graph
data. To summarize, our study covers four aspects:
1) We propose a cost-model-based RDF join processing solution using MapReduce
and a general-purpose optimization strategy;
2) We propose a novel representation of RDF data on Cloud platforms, based
on which we propose an I/O-efficient strategy to evaluate SPARQL queries as
quickly as possible;
3) We study the problem of efficient processing of multi-way Theta-join queries
using MapReduce from a cost-effective perspective; and
4) We develop a complete solution framework for join-based efficient analysis over
distributed graphs using the distance join query as an example.
We validate our solutions through extensive experiments and discuss several
interesting research directions for complex join processing on the Cloud.