Building summaries over streaming data

HKUST Electronic Theses

Building summaries over streaming data

by Ge Luo

THESIS 2015

Ph.D. Computer Science and Engineering

xiii, 132 pages : illustrations ; 30 cm

Abstract

Many of today’s applications generate huge amounts of data in a streaming fashion. The data often has to be processed in real time with limited memory space. On the other hand, it has been observed that in most cases, the value density of these data is rather low, and they can usually be compacted in a small summary that can be used to serve queries on the original data without losing much accuracy. Therefore, building various small summaries over streaming data has received much attention in both the algorithms and database communities in recent years.

This thesis makes four contributions to this area. First, we continue the study of building quantile summaries over streaming data. This is a fundamental problem that has been studied for over 30 years, yet there has been limited formal comparison of the competing methods, and no comprehensive study of their performance. In this thesis, we remedy this deficit by providing a taxonomy of different methods, and describe efficient implementations. In doing so, we also propose and analyze variations that have not been explicitly studied before, yet which turn out to perform the best.

Second, we revisit another classical problem, which is to build a piecewise linear application (PLA) over streaming time series data. Prior work addressed two versions of the problem, where either the PLA consists of disjoint segments, or it is required to be a continuous piecewise linear function. However, we observe that neither minimizes the true representation size of the PLA, i.e., the number of parameters required to represent it. In this thesis, we design an algorithm that generates the optimal PLA in terms of representation size while meeting the prescribed max-error guarantee.

Thirdly, we propose a new type of summaries, called persistent data sketching. All existing data summaries are updated upon the arrival of every element in the stream, thus destroying its previous version. A persistent sketch, on the other hand, preserves all its previous versions as it is updated over time. Every update creates a new version, while all the versions are kept compactly so that any previous version can still be queried. This gives the summary the ability to answer queries about the stream at any prior time.

Finally we study a simplified version of above persistent data sketching problem which only needs to answer historical queries in the cache-register model. We will first show a theoretical reduction between sliding window summaries and persistent summaries, and then propose a more practical persistent summary for this model.

[ Hide abstract ]

View Copyrighted to the author. Reproduction is prohibited without the author’s prior written consent.

Details

Collection HKUST Electronic Theses Degree Ph.D. Department Computer Science and Engineering Supervisors Yi, Ke Authors Luo, Ge Subjects Streaming technology (Telecommunications) Mathematical models Data processing Language English Call number Thesis CSED 2015 Luo DOI 10.14711/thesis-b1584876

Full record

Building summaries over streaming data

by Ge Luo

Post a Comment Cancel reply