THESIS
2017
xv, 109 pages : illustrations ; 30 cm
Abstract
Data-parallel computing frameworks are designed to support the processing of
large volumes of data in computing clusters for big data analytics applications,
such as search engines, personalized recommendation, video analytics and graph
processing. Due to the distributed nature of big data analytics, computation and
network resources are both critical to individual job performance and overall
system throughput. There is therefore a pressing need to coordinate the allocation
of network bandwidth and the scheduling of computation tasks.
This thesis addresses the allocation of both network and computation resources
through delay-aware bandwidth allocation schemes and network-aware task scheduling
frameworks. Specifically, we make the following three contributions.
First, we design Tailor, a dynamic monitoring and routing system that reduces
network transfer times between successive computation stages of a job (captured
as the coflow completion time). Tailor is transparent to data-parallel applications
and requires minimal modifications to end hosts. For clusters where only edge
networks experience severe and persistent congestion, we identify a non-trivial
tradeoff between coflow performance and network utilization. Through in-depth
analysis, we show that achieving work conservation is not sufficient to maximize
the utilization of access links. We then propose a hierarchical bandwidth allocation
framework, Adia, that maximizes link utilization while achieving near-optimal
coflow performance.
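As background for the coflow abstraction used throughout this work, a coflow
finishes only when its slowest constituent flow finishes. The following is a
minimal illustrative sketch (not the Tailor or Adia implementation; the flow
sizes and rates are hypothetical) that computes the coflow completion time under
a given bandwidth allocation:

    # Illustrative sketch: the coflow completion time (CCT) is the maximum
    # of the per-flow finish times, since a coflow ends with its last flow.
    def coflow_completion_time(flow_sizes_mb, allocated_rates_mb_per_s):
        # flow_sizes_mb: data volume of each flow in the coflow (MB)
        # allocated_rates_mb_per_s: bandwidth allocated to each flow (MB/s)
        finish_times = [size / rate
                        for size, rate in zip(flow_sizes_mb, allocated_rates_mb_per_s)]
        return max(finish_times)

    # Example (hypothetical values): three shuffle flows of one coflow under
    # an equal-share allocation; the largest flow dominates the CCT.
    sizes = [400.0, 100.0, 50.0]
    rates = [100.0, 100.0, 100.0]
    print(coflow_completion_time(sizes, rates))  # 4.0 seconds

This also hints at why allocating leftover bandwidth to a coflow's smaller flows
(work conservation alone) does not necessarily shorten the coflow itself.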
Second, we propose to incorporate network-awareness into task scheduling, since
network communication remains a determining factor for job performance even with
state-of-the-art bandwidth allocation schemes. By introducing a novel network-aware
queueing model, we decouple the usage of network and computation resources and
thus accurately capture the total processing time of each task. We then propose
a network-aware scheduling algorithm, Adrestia, and prove that it is
throughput-optimal when the demands for network and computation resources are
known a priori.
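To make the decoupling concrete, the sketch below (an illustrative simplification,
not Adrestia's actual model; all rate parameters are hypothetical) estimates a
task's total processing time as a network-transfer phase followed by a
computation phase, each served by a distinct resource:

    # Illustrative sketch: decouple a task's time into a network phase and a
    # computation phase, so each resource can be tracked and scheduled separately.
    def task_processing_time(input_mb, link_rate_mb_per_s,
                             cpu_work_units, cpu_rate_units_per_s):
        network_time = input_mb / link_rate_mb_per_s            # fetch remote input
        compute_time = cpu_work_units / cpu_rate_units_per_s    # process the input
        return network_time + compute_time

    # Example (hypothetical values): a reduce task fetching 800 MB over a
    # 100 MB/s link, then performing 50 units of CPU work at 25 units/s.
    print(task_processing_time(800.0, 100.0, 50.0, 25.0))  # 8.0 + 2.0 = 10.0 s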
Finally, we propose an online scheduling framework, Symbiosis, that identifies
resource imbalance and coordinates computation-bound and network-bound tasks in
a large cluster, with the objective of fully utilizing all types of resources in
the cluster and maximizing system throughput. Symbiosis provides both a substrate
and an application programming interface (API) to support existing task schedulers
in data analytics frameworks. With network-awareness, our framework fully
considers network and computation resources, making task scheduling and bandwidth
allocation decisions based on live analysis of cluster state. We have implemented
Symbiosis on top of Spark and demonstrated that it improves both delay and
throughput on a real-world cloud testbed with diverse analytics workloads.
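The coordination idea can be illustrated with a minimal sketch (not the actual
Symbiosis scheduler or its Spark API; the task classification and machine model
are hypothetical): when choosing the next task for a machine, prefer the kind of
task whose bottleneck resource is currently under-utilized, so network-bound and
computation-bound tasks overlap rather than contend.

    # Illustrative sketch: launch the task whose bottleneck resource (CPU or
    # network) is currently the less utilized one on this machine.
    def pick_next_task(pending_tasks, cpu_util, net_util):
        # pending_tasks: list of (task_id, kind), kind in {"compute", "network"}
        # cpu_util, net_util: current CPU and NIC utilization, in [0, 1]
        preferred = "compute" if cpu_util <= net_util else "network"
        for task_id, kind in pending_tasks:
            if kind == preferred:
                return task_id
        # Fall back to any pending task if none of the preferred kind is waiting.
        return pending_tasks[0][0] if pending_tasks else None

    # Example (hypothetical values): the NIC is busy (0.9) but the CPU is mostly
    # idle (0.2), so a computation-bound task overlaps with ongoing transfers.
    tasks = [("t1", "network"), ("t2", "compute")]
    print(pick_next_task(tasks, cpu_util=0.2, net_util=0.9))  # -> "t2"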