THESIS
2018
ix, 66 pages : illustrations ; 30 cm
Abstract
The advent of big data caused huge, rapid and volatile data streams to emerge, pushing
the research community to design both real-time Distributed Stream Processing Systems
(DSPSs) and streaming algorithms that run on top of those systems. DSPSs must
exhibit a variety of features, such as high-throughput and low-latency processing of data
streams. In the first part of this thesis, we present the state-of-the-art DSPSs and describe
certain features that make them unique. In the second part, we focus on the problem
of join processing in the streaming context. Specifically, we present the first output-optimal
join algorithm for stream join processing, called Streaming Randomized HyperCube
(SRHC). The algorithm operates optimally in the presence of high skew, considering both
the input and the output sizes of the join, a feature that makes it quite suitable for
many-to-many joins. Finally, we implement SRHC on top of Flink [2] and evaluate its
efficiency against state-of-the-art join algorithms through experiments on
both synthetic and real datasets.