THESIS
2017
x, 107 pages : illustrations ; 30 cm
Abstract
Sublinear algorithms address the rapid growth in data volume with a simple yet powerful premise,
that useful tasks can be performed with even fewer resources than required to simply store the data.
This thesis studies randomized algorithms for massive data. We either devise new algorithms,
or improve analysis on existing algorithms, resulting in meaningful theoretical guarantees with sublinear space or communication.
First, we present an algorithm that uses sublinear communication to perform set reconciliation
under a ‘noisy data’ model, where two data points shall be considered ‘the same’ when the distance
between them is small, modelling tiny perturbations caused in data due to some form of noise.
The second is a 1-pass streaming algorithm for estimating the number of distinct...[
Read more ]
Sublinear algorithms address the rapid growth in data volume with a simple yet powerful premise,
that useful tasks can be performed with even fewer resources than required to simply store the data.
This thesis studies randomized algorithms for massive data. We either devise new algorithms,
or improve analysis on existing algorithms, resulting in meaningful theoretical guarantees with sublinear space or communication.
First, we present an algorithm that uses sublinear communication to perform set reconciliation
under a ‘noisy data’ model, where two data points shall be considered ‘the same’ when the distance
between them is small, modelling tiny perturbations caused in data due to some form of noise.
The second is a 1-pass streaming algorithm for estimating the number of distinct entities in the
same noisy data model. Geometrically, this may be seen as counting sparsely placed clusters of
small diameter, using space that is sublinear in the number of clusters.
Finally, we give an improved analysis of random Fourier features (RFF) for the Gaussian kernel,
a popular data-oblivious embedding of kernel functions. In contrast to competing techniques that
are typically data-dependent, RFF can be used under the data stream setting. Our work justifies the
use of RFF in a variety of learning problems under the Gaussian kernel distance.
Post a Comment