Random sampling and size estimation over relational joins

HKUST Electronic Theses

Random sampling and size estimation over relational joins

by Yu Chen

THESIS 2019

Ph.D. Computer Science and Engineering

x, 97 pages : illustrations ; 30 cm

Abstract

Join is one of the central problems in database theory, which has been extensively studied in both theory and practice. In this thesis, we focus on estimating join size and sampling from joins.

We present three results in this area. The first result is a new algorithm for estimating the size of cycle join of k binary relations in the property-testing model. The algorithm gives an unbiased estimator for the join size and it achieves running time of O (IN^k/2/OUT) in expectation to obtain a constant-factor approximation with constant probability, where IN is total number of tuples in the input, and OUT is the join size. Moreover, for cycle join of k binary relations, IN^k/2 is a tight upper bound for OUT.

The second result is an algorithm for generating uniform samples from the join results without computing the full join in a model that allows a linear-time preprocessing step. For the class of sequenceable queries, a sample can be drawn uniformly at random from the join results in O(IN^p/OUT) time in expectation where p is fractional edge cover number of the hypergraph. For non-sequenceable queries, the running time for drawing a sample is O(IN^p+1/OUT). These results hold for both full join queries and join-project queries. Prior to this work, the only solution to this problem with formal guarantees is to either precompute the full join results, which has O(IN^w) preprocessing and storage cost where w is the fractional hypertree width of the query and O(1) sampling cost, or compute the full join at sampling time, which has no preprocessing cost but O(IN^w) sampling cost. We also adapt our algorithm to solve the join size estimation problem. We show that after drawing a constant number of samples, a constant-factor approximation to the join size can be obtained with constant probability.

The previous two results for join size estimation all collect samples after the query is given. In the third work, we present an algorithm for join size estimation using synopses collected separately for each relation before the query is given. A major challenge here is that join queries often come with ad hoc selection predicates that are only known at query time, thus there is no way to take the predicates into account when the statistics are collected. Our new sampling method, called two-level sampling, combines the advantages of Bernoulli sampling, Correlated sampling and End-biased sampling while making further improvements. Two-level sampling gives unbiased estimator for join size and provides error guarantees on its estimates. It is also very easy to implement, requiring just one pass over the data. It only relies on some basic statistical information about the data, such as the l_k-norms and the heavy hitters. The algorithm can also be extended to support estimating other aggregates like SUM and AVG. Both analytical and empirical comparisons show that the new algorithm outperforms all the previous algorithms on a variety of joins, including primary key-foreign key joins, many-to-many joins, and multi-table joins.

[ Hide abstract ]

View Copyrighted to the author. Reproduction is prohibited without the author’s prior written consent.

Details

Collection HKUST Electronic Theses Degree Ph.D. Department Computer Science and Engineering Supervisors Yi, Ke Authors Chen, Yu Subjects Database searching Mathematical models Sampling (Statistics) Database management Estimation theory Language English Call number Thesis CSED 2019 ChenY DOI 10.14711/thesis-991012757468603412

Full record

Random sampling and size estimation over relational joins

by Yu Chen

Post a Comment Cancel reply