THESIS
2018
xvi, 116 pages : illustrations ; 30 cm
Abstract
With the explosive growth of data volume and application complexity, it has become prevalent to host large-scale
computations in clusters of distributed servers. In shared production clusters, job scheduling
is of paramount importance to cluster performance. The two basic scheduling objectives are
efficiency and fairness: an ideal scheduler should facilitate fast job response while avoiding
starvation by guaranteeing worst-case service quality.
For inter-job scheduling, efficiency and fairness conflict with each other, leading to
a dilemma between predictable performance at the expense of long response times and minimum
mean response time at the risk of starvation. It is therefore critical to develop resource scheduling
strategies that do well in both worlds. In this regard, we make the following contributions.
First, we present Cluster Fair Queuing (CFQ), a scheduling mechanism that minimizes the mean
job response time while ensuring predictable performance. It works by preferentially offering
resources to the jobs that finish earliest under an idealized fair sharing policy.
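As a minimal illustrative sketch only (not the thesis's actual implementation), the Python snippet below picks the next job to serve by estimating each job's finish time under an idealized equal-share policy; the Job fields, the known remaining_work value, and the equal-share estimate are simplifying assumptions.

from dataclasses import dataclass

@dataclass
class Job:
    name: str
    remaining_work: float   # outstanding work in resource-seconds (assumed known)

def fair_share_finish_time(job: Job, active_jobs: list, capacity: float) -> float:
    # Idealized fair sharing: every active job gets an equal slice of the
    # cluster, so a job's virtual finish time is its remaining work divided
    # by that equal share (a deliberate simplification).
    share = capacity / len(active_jobs)
    return job.remaining_work / share

def next_job_to_serve(active_jobs: list, capacity: float) -> Job:
    # CFQ-style rule (sketch): offer freed resources to the job that would
    # finish earliest under the idealized fair-sharing policy above.
    return min(active_jobs, key=lambda j: fair_share_finish_time(j, active_jobs, capacity))

if __name__ == "__main__":
    jobs = [Job("A", 120.0), Job("B", 30.0), Job("C", 300.0)]
    print(next_job_to_serve(jobs, capacity=10.0).name)  # -> "B"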
Second, we reveal that service isolation is crucial for both fairness and efficiency, yet it is not
guaranteed even when jobs are assigned high priorities. We identify the underlying reasons and propose
Speculative Slot Reservation to achieve service isolation; it reserves slots only when doing so is
appropriate according to the jobs' internal dependencies.
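The abstract does not detail the reservation criterion, so the following is a purely hypothetical sketch: it reserves a slot for a job's downstream stage only when the upstream stage it depends on is nearly finished, so the reserved slot does not sit idle. The Stage fields and the threshold are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Stage:
    total_tasks: int
    finished_tasks: int

def should_reserve_slot(upstream: Stage, threshold: float = 0.9) -> bool:
    # Hypothetical criterion: reserve a slot for the downstream stage only
    # when its upstream dependency is almost done, so the reservation does
    # not waste capacity while the dependency is still far from ready.
    progress = upstream.finished_tasks / upstream.total_tasks
    return progress >= threshold

if __name__ == "__main__":
    print(should_reserve_slot(Stage(total_tasks=100, finished_tasks=95)))  # True
    print(should_reserve_slot(Stage(total_tasks=100, finished_tasks=40)))  # False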
Third, we observe that the marginal benefit of additional resources varies significantly across jobs,
and we propose Performance-Aware Fair (PAF) scheduling, which reallocates certain resources for better
overall efficiency while ensuring near-optimal fairness.
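As an illustration of the idea rather than the thesis's algorithm, the sketch below shifts small chunks of resource from the job with the lowest marginal benefit to the job with the highest one, while keeping every allocation within a bounded deviation from its fair share; the per-job marginal_benefit values and the max_deviation bound are assumed inputs.

def paf_reallocate(fair_share, marginal_benefit, max_deviation, step=1.0):
    # Illustrative reallocation pass: move resource from low- to
    # high-marginal-benefit jobs, but never let any job drift more than
    # max_deviation away from its fair share, keeping fairness near-optimal.
    alloc = dict(fair_share)
    while True:
        donor = min(alloc, key=lambda j: marginal_benefit[j])
        recipient = max(alloc, key=lambda j: marginal_benefit[j])
        if marginal_benefit[recipient] <= marginal_benefit[donor]:
            break  # no efficiency gain left
        # stop once either job would leave its allowed band around fair share
        if (fair_share[donor] - (alloc[donor] - step) > max_deviation or
                (alloc[recipient] + step) - fair_share[recipient] > max_deviation):
            break
        alloc[donor] -= step
        alloc[recipient] += step
    return alloc

if __name__ == "__main__":
    shares = {"A": 10.0, "B": 10.0, "C": 10.0}
    benefit = {"A": 0.2, "B": 1.5, "C": 0.6}   # assumed per-unit speedups
    print(paf_reallocate(shares, benefit, max_deviation=3.0))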
For intra-job scheduling, in contrast, fairness in workload allocation across distributed workers,
i.e., load balancing, helps improve efficiency. We apply this insight to distributed
deep learning applications, which can suffer significant performance degradation when running in
heterogeneous clusters. Specifically, we propose a new worker-coordination scheme, called Load-balanced
Bulk Synchronous Parallel (LB-BSP), which adaptively adjusts workers' loads according to
their processing capabilities for fast distributed deep learning.
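A minimal sketch of this load-balancing idea, assuming per-worker throughput (samples/sec) has been measured in earlier iterations: batch sizes are set proportional to throughput so that per-iteration compute time is roughly equal across heterogeneous workers. The function and worker names are illustrative, not the thesis's exact scheme.

def balanced_batch_sizes(throughputs, global_batch):
    # Split a fixed global batch across workers in proportion to each
    # worker's measured throughput, so faster workers receive more samples
    # and all workers finish an iteration at about the same time.
    total = sum(throughputs.values())
    sizes = {w: int(round(global_batch * t / total)) for w, t in throughputs.items()}
    # fix rounding drift so the sizes still sum to the global batch
    drift = global_batch - sum(sizes.values())
    fastest = max(sizes, key=sizes.get)
    sizes[fastest] += drift
    return sizes

if __name__ == "__main__":
    # assumed throughputs measured during previous iterations (samples/sec)
    speeds = {"gpu-node": 400.0, "cpu-node-1": 100.0, "cpu-node-2": 100.0}
    print(balanced_batch_sizes(speeds, global_batch=512))
    # -> {'gpu-node': 342, 'cpu-node-1': 85, 'cpu-node-2': 85}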