THESIS
2014
ix, 52 pages : illustrations ; 30 cm
Abstract
Hash ripple join is an online aggregation algorithm that rapidly gives good approximate join results. The accuracy of the approximation increases as the join progresses, and the estimate converges to the
exact result when the join finishes. Luo et al. proposed a parallel hash ripple join (PHRJ)
that runs in a distributed setting. However, PHRJ has two drawbacks when handling
large-scale data: 1) PHRJ updates approximate results at a fine granularity, which induces extra
communication cost in a distributed environment; and 2) when the data does not fit in memory, PHRJ
cannot provide unbiased approximate results.
In this thesis, a scalable hash ripple join is proposed that 1) runs on a distributed framework and processes distributed data at a coarse granularity to speed up the join;
2) continuously gives unbiased and consistent approximate join results even in the presence of memory overflow; and 3) scales well with growing amounts of data.
We have implemented a prototype of the scalable hash ripple join (SHRJ) algorithm on
Spark. Experimental results show that SHRJ can give good approximate join results while
taking less than 10% of the time of Spark's own join operator.
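
To make the idea of an online, converging join estimate concrete, the following is a minimal single-machine sketch of a hash ripple join that estimates the size of an equi-join (a COUNT aggregate). It is not the PHRJ or SHRJ algorithm from the thesis: the function name `hash_ripple_join_count`, the random shuffling of the inputs, and the scale-up estimator (matches found so far, scaled by the inverse sampling fractions of both inputs) are assumptions made for illustration, and both inputs are assumed to be non-empty and to fit in memory.

```python
import random
from collections import defaultdict


def hash_ripple_join_count(r_keys, s_keys, seed=0):
    """Yield (tuples_seen, estimated_join_size) as the join progresses.

    Hypothetical illustration: an in-memory hash ripple join over two
    lists of join keys, producing a running estimate of the join size.
    """
    rng = random.Random(seed)
    r, s = list(r_keys), list(s_keys)
    # Process both inputs in random order so that the prefix seen so far
    # behaves like a random sample of each relation.
    rng.shuffle(r)
    rng.shuffle(s)

    r_table = defaultdict(int)  # join key -> number of R tuples seen
    s_table = defaultdict(int)  # join key -> number of S tuples seen
    n_r = n_s = 0               # tuples consumed from R and S
    matches = 0                 # matching (r, s) pairs found so far

    for i in range(max(len(r), len(s))):
        if i < len(r):
            key = r[i]
            matches += s_table.get(key, 0)  # probe S's hash table
            r_table[key] += 1               # then insert into R's table
            n_r += 1
        if i < len(s):
            key = s[i]
            matches += r_table.get(key, 0)  # probe R's hash table
            s_table[key] += 1               # then insert into S's table
            n_s += 1
        if n_r and n_s:
            # Scale the matches seen so far by the inverse sampling
            # fraction of each input.
            estimate = matches * (len(r) / n_r) * (len(s) / n_s)
            yield n_r + n_s, estimate


if __name__ == "__main__":
    r = [1, 2, 2, 3]
    s = [2, 2, 3, 4]
    for seen, est in hash_ripple_join_count(r, s):
        print(f"after {seen} tuples: estimated join size = {est:.2f}")
```

Each new tuple probes the other input's hash table before being inserted into its own, so every matching pair is counted exactly once. Once both inputs are fully consumed the scaling factors equal 1 and the estimate equals the exact join size, which mirrors the convergence property described in the abstract.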