THESIS
2015
xiii, 132 pages : illustrations ; 30 cm
Abstract
Many of today’s applications generate huge amounts of data in a streaming fashion. The
data often has to be processed in real time with limited memory space. On the other
hand, it has been observed that in most cases, the value density of these data is rather
low, and they can usually be compacted in a small summary that can be used to serve
queries on the original data without losing much accuracy. Therefore, building various
small summaries over streaming data has received much attention in both the algorithms
and database communities in recent years.
This thesis makes four contributions to this area. First, we continue the study of
building quantile summaries over streaming data. This is a fundamental problem that
has been studied for over 30 years, yet there has been li...[
Read more ]
Many of today’s applications generate huge amounts of data in a streaming fashion. The
data often has to be processed in real time with limited memory space. On the other
hand, it has been observed that in most cases, the value density of these data is rather
low, and they can usually be compacted in a small summary that can be used to serve
queries on the original data without losing much accuracy. Therefore, building various
small summaries over streaming data has received much attention in both the algorithms
and database communities in recent years.
This thesis makes four contributions to this area. First, we continue the study of
building quantile summaries over streaming data. This is a fundamental problem that
has been studied for over 30 years, yet there has been limited formal comparison of the
competing methods, and no comprehensive study of their performance. In this thesis, we
remedy this deficit by providing a taxonomy of different methods, and describe efficient
implementations. In doing so, we also propose and analyze variations that have not been
explicitly studied before, yet which turn out to perform the best.
Second, we revisit another classical problem, which is to build a piecewise linear application
(PLA) over streaming time series data. Prior work addressed two versions of
the problem, where either the PLA consists of disjoint segments, or it is required to be
a continuous piecewise linear function. However, we observe that neither minimizes the
true representation size of the PLA, i.e., the number of parameters required to represent
it. In this thesis, we design an algorithm that generates the optimal PLA in terms of representation size while meeting the prescribed max-error guarantee.
Thirdly, we propose a new type of summaries, called persistent data sketching. All
existing data summaries are updated upon the arrival of every element in the stream, thus
destroying its previous version. A persistent sketch, on the other hand, preserves all its
previous versions as it is updated over time. Every update creates a new version, while
all the versions are kept compactly so that any previous version can still be queried. This
gives the summary the ability to answer queries about the stream at any prior time.
Finally we study a simplified version of above persistent data sketching problem which
only needs to answer historical queries in the cache-register model. We will first show a
theoretical reduction between sliding window summaries and persistent summaries, and
then propose a more practical persistent summary for this model.
Post a Comment