THESIS
2015
x, 121 pages : illustrations ; 30 cm
Abstract
Living in the era of big data, we often need to process and analyze data sets that have never been so large and fast-growing. Random sampling has thus received much attention as an effective tool for turning big data “small”. It allows us to significantly reduce the size of input while maintaining the main features of the original data set we need. It is also easy to trade off between the computation complexity and the accuracy of the result, by tweaking the sample size.
Although random sampling is a classical problem with a long history, it has received revived attention lately motivated by new applications as well as new constraints in the big data era. This thesis presents several new techniques and applications of random sampling: (1) a new randomized streaming algorithm for findin...[
Read more ]
Living in the era of big data, we often need to process and analyze data sets that have never been so large and fast-growing. Random sampling has thus received much attention as an effective tool for turning big data “small”. It allows us to significantly reduce the size of input while maintaining the main features of the original data set we need. It is also easy to trade off between the computation complexity and the accuracy of the result, by tweaking the sample size.
Although random sampling is a classical problem with a long history, it has received revived attention lately motivated by new applications as well as new constraints in the big data era. This thesis presents several new techniques and applications of random sampling: (1) a new randomized streaming algorithm for finding approximate quantiles in a data stream, which achieves the smallest space complexity of all such algorithms; (2) an augmented B-tree index that, for any given range query, returns a sampling-based summary containing the quantiles and heavy hitters of all tuples in the query range; (3) a sample-augmented R-tree that, given any range query, returns random samples from the query range in an online fashion. Apart from the description and analysis of each algorithm we propose, experimental results are also provided, confirming the advantages of the new algorithms. Finally, we showcase a system for large-scale spatio-temporal data analysis using the developed techniques.
Post a Comment