THESIS
2018
xvi, 116 pages : illustrations ; 30 cm
Abstract
With the explosive growth of data volume and application complexity, it has become prevalent to host large-scale
computations in clusters of distributed servers. In shared production clusters, job scheduling
is of paramount importance to cluster performance. The two basic scheduling objectives are
efficiency and fairness: an ideal scheduler should facilitate fast job response while avoiding
starvation by guaranteeing worst-case service quality.
For inter-job scheduling, efficiency and fairness conflict with each other, leading to
a dilemma between predictable performance at the expense of long response times and minimum
mean response time at the risk of starvation. It is therefore critical to develop resource scheduling
strategies that do well in both worlds. In this regard, we make the following contributions.
First, we present Cluster Fair Queuing (CFQ), a scheduling mechanism that minimizes the mean
job response time while ensuring predictable performance. It works by preferentially offering
resources to the jobs that finish earliest under an idealized fair sharing policy.
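As a minimal illustrative sketch only (not the thesis's actual implementation), the Python snippet below picks the next job to serve by estimating each job's finish time under an idealized equal-share policy; the Job fields, the known remaining_work value, and the equal-share estimate are simplifying assumptions.

from dataclasses import dataclass

@dataclass
class Job:
    name: str
    remaining_work: float   # outstanding work in resource-seconds (assumed known)

def fair_share_finish_time(job: Job, active_jobs: list, capacity: float) -> float:
    # Idealized fair sharing: every active job gets an equal slice of the
    # cluster, so a job's virtual finish time is its remaining work divided
    # by that equal share (a deliberate simplification).
    share = capacity / len(active_jobs)
    return job.remaining_work / share

def next_job_to_serve(active_jobs: list, capacity: float) -> Job:
    # CFQ-style rule (sketch): offer freed resources to the job that would
    # finish earliest under the idealized fair-sharing policy above.
    return min(active_jobs, key=lambda j: fair_share_finish_time(j, active_jobs, capacity))

if __name__ == "__main__":
    jobs = [Job("A", 120.0), Job("B", 30.0), Job("C", 300.0)]
    print(next_job_to_serve(jobs, capacity=10.0).name)  # -> "B"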
Second, we reveal that service isolation is crucial for both fairness and efficiency, yet it is not
guaranteed even when jobs are assigned high priorities. We identify the underlying reasons and propose
Speculative Slot Reservation to achieve service isolation; it reserves slots only when doing so is
appropriate according to the jobs' internal dependencies.
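The abstract does not detail the reservation criterion, so the following is a purely hypothetical sketch: it reserves a slot for a job's downstream stage only when the upstream stage it depends on is nearly finished, so the reserved slot does not sit idle. The Stage fields and the threshold are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Stage:
    total_tasks: int
    finished_tasks: int

def should_reserve_slot(upstream: Stage, threshold: float = 0.9) -> bool:
    # Hypothetical criterion: reserve a slot for the downstream stage only
    # when its upstream dependency is almost done, so the reservation does
    # not waste capacity while the dependency is still far from ready.
    progress = upstream.finished_tasks / upstream.total_tasks
    return progress >= threshold

if __name__ == "__main__":
    print(should_reserve_slot(Stage(total_tasks=100, finished_tasks=95)))  # True
    print(should_reserve_slot(Stage(total_tasks=100, finished_tasks=40)))  # False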
Third, we observe that the marginal benefit of additional resources varies significantly across jobs,
and we propose Performance-Aware Fair (PAF) scheduling, which reallocates certain resources for better
overall efficiency while ensuring near-optimal fairness.
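As an illustration of the idea rather than the thesis's algorithm, the sketch below shifts small chunks of resource from the job with the lowest marginal benefit to the job with the highest one, while keeping every allocation within a bounded deviation from its fair share; the per-job marginal_benefit values and the max_deviation bound are assumed inputs.

def paf_reallocate(fair_share, marginal_benefit, max_deviation, step=1.0):
    # Illustrative reallocation pass: move resource from low- to
    # high-marginal-benefit jobs, but never let any job drift more than
    # max_deviation away from its fair share, keeping fairness near-optimal.
    alloc = dict(fair_share)
    while True:
        donor = min(alloc, key=lambda j: marginal_benefit[j])
        recipient = max(alloc, key=lambda j: marginal_benefit[j])
        if marginal_benefit[recipient] <= marginal_benefit[donor]:
            break  # no efficiency gain left
        # stop once either job would leave its allowed band around fair share
        if (fair_share[donor] - (alloc[donor] - step) > max_deviation or
                (alloc[recipient] + step) - fair_share[recipient] > max_deviation):
            break
        alloc[donor] -= step
        alloc[recipient] += step
    return alloc

if __name__ == "__main__":
    shares = {"A": 10.0, "B": 10.0, "C": 10.0}
    benefit = {"A": 0.2, "B": 1.5, "C": 0.6}   # assumed per-unit speedups
    print(paf_reallocate(shares, benefit, max_deviation=3.0))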
For intra-job scheduling, in contrast, fairness in workload allocation across distributed workers,
i.e., load balancing, helps improve efficiency. We apply this insight to distributed
deep learning applications, which can suffer significant performance degradation when running in
heterogeneous clusters. Specifically, we propose a new worker-coordination scheme, called Load-balanced
Bulk Synchronous Parallel (LB-BSP), which adaptively adjusts workers' loads according to
their processing capabilities for fast distributed deep learning.
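A minimal sketch of this load-balancing idea, assuming per-worker throughput (samples/sec) has been measured in earlier iterations: batch sizes are set proportional to throughput so that per-iteration compute time is roughly equal across heterogeneous workers. The function and worker names are illustrative, not the thesis's exact scheme.

def balanced_batch_sizes(throughputs, global_batch):
    # Split a fixed global batch across workers in proportion to each
    # worker's measured throughput, so faster workers receive more samples
    # and all workers finish an iteration at about the same time.
    total = sum(throughputs.values())
    sizes = {w: int(round(global_batch * t / total)) for w, t in throughputs.items()}
    # fix rounding drift so the sizes still sum to the global batch
    drift = global_batch - sum(sizes.values())
    fastest = max(sizes, key=sizes.get)
    sizes[fastest] += drift
    return sizes

if __name__ == "__main__":
    # assumed throughputs measured during previous iterations (samples/sec)
    speeds = {"gpu-node": 400.0, "cpu-node-1": 100.0, "cpu-node-2": 100.0}
    print(balanced_batch_sizes(speeds, global_batch=512))
    # -> {'gpu-node': 342, 'cpu-node-1': 85, 'cpu-node-2': 85}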