THESIS
2019
xi, 107 pages : illustrations ; 30 cm
Abstract
Due to the rapid development of massively parallel data processing systems such as MapReduce
and Spark, there have been revived interests in the theoretical computer science community to study
algorithms in a massively parallel computational model. Join evaluation, as one of the central algorithmic
problems in database theory, has received much more attention. In this thesis, we study how to compute
join queries efficiently (with theoretical guarantees) in today’s massively parallel systems.
We give an almost complete characterization of acyclic join queries on which instance-optimality,
output-optimality and worst-case optimality can be achieved respectively. We first give an instance-optimal algorithm for r-hierarchical joins and prove that instance-optimality cannot be achieved...[
Read more ]
Due to the rapid development of massively parallel data processing systems such as MapReduce
and Spark, there have been revived interests in the theoretical computer science community to study
algorithms in a massively parallel computational model. Join evaluation, as one of the central algorithmic
problems in database theory, has received much more attention. In this thesis, we study how to compute
join queries efficiently (with theoretical guarantees) in today’s massively parallel systems.
We give an almost complete characterization of acyclic join queries on which instance-optimality,
output-optimality and worst-case optimality can be achieved respectively. We first give an instance-optimal algorithm for r-hierarchical joins and prove that instance-optimality cannot be achieved for non-r-hierarchical joins. Then we give a new output-sensitive algorithm for arbitrary acyclic joins, which
is optimal if the output size is not too large. At last, we present a worst-case optimal algorithm for
arbitrary acyclic joins, whose complexity is characterized by the fractional edge cover number of the join hypergraph. On the other hand, we give evidence that cyclic joins are inherently more difficult than acyclic ones. In particular, we give the first output-sensitive lower bound for the triangle join, as well as a worst-case lower bound for the so-called ⊟-join.
Beyond natural joins, we also investigate similarity joins under l
1/l
2/l
∞ distance metrics by designing three algorithms: (1) a deterministic output-optimal algorithm for l
1/l
∞ metric in constant dimensions; (2) a randomized output-sensitive algorithm for l
2 metric in constant dimensions; (3) an LSH-based approximation algorithm for arbitrary metric in high dimensions.
Although this thesis is theoretical in nature, some of the developed algorithms are also quite practical.
We have implemented them in Spark and the experimental results have demonstrated their efficiency in
practice.
Post a Comment