THESIS
2018
xii, 125 pages : illustrations ; 30 cm
Abstract
Distributed and parallel computing systems are cornerstones of this era of Big Data, machine learning,
and artificial intelligence. Such systems span hundreds or thousands of machines
in one or more datacenters, so as to cope with ever-expanding data volumes and the increasing complexity
of models and problems. Most recent and important applications, such as web search, business analytics,
recommendation systems, and deep neural networks, run on thousands of machines at both small
companies and large enterprises. At such scale, communication between machines becomes a bottleneck,
and the scheduling of communication sessions (i.e., network flows) is key to accelerating these
applications.
This thesis focuses on optimizing flow-level scheduling in datacenters, and specifically on its three essential
aspects: information collection, the scheduling algorithm (decision making), and enforcement of scheduling decisions.
We begin with the design of flow-level information collection systems for parallel computing
applications. We then study three important but previously ignored scheduling problems in real-world
datacenter applications:
• Scheduling with incomplete information: Flows without complete information, such as database
query/response, cannot be handled by existing flow schedulers. We adapt the multilevel feedback
queues used in operating systems to flow scheduling, and develop a queueing-theory model to determine
the optimal parameter settings.
• Scheduling heterogeneous flows with diverging objectives: Flows from user-facing applications
have completion time constraints (deadlines), and they coexist with flows without such constraints.
We identify and abstract this type of problem as mix-flow scheduling, and find
that state-of-the-art flow schedulers cannot achieve the objectives of the different types of flows at the
same time. We approach this problem with a systematic formulation, and derive a control-theoretic
solution using Lyapunov optimization techniques.
• Scheduling with erroneous information: Machine learning techniques are increasingly popular for
inferring flow information. However, machine learning results are not always accurate. Thus,
we design an error-tolerant scheduling algorithm to mitigate the impact of prediction errors.
We present the proposed solutions for each problem, and demonstrate their effectiveness via simulations
and experiments using the flow-level enforcement mechanisms. Our work is integrated into Chukonu,
a comprehensive flow scheduling toolkit for distributed and parallel computing applications.
Chukonu is composed of a flow information collection framework, a programmable flow scheduler,
and a distributed enforcement system for scheduling decisions. It has been deployed at small scale in
production datacenters of large Internet service companies, such as Tencent and Huawei.
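
The multilevel feedback queue idea named under scheduling with incomplete information can be illustrated with a toy simulation. This is a minimal sketch and not the thesis's implementation: it assumes a single link, flows that are all present from the start, strict priority across levels, and first-come-first-served order within a level. The names Flow, queue_index and run_mlfq, and the threshold values used below, are hypothetical; the abstract says only that a queueing-theory model is used to choose the parameter settings.

```python
from dataclasses import dataclass


@dataclass(eq=False)  # compare flows by identity, not by field values
class Flow:
    name: str
    size: int      # total bytes; hidden from the scheduler
    sent: int = 0  # bytes sent so far; the only thing the scheduler observes


def queue_index(sent, thresholds):
    """Priority level (0 = highest) of a flow that has sent `sent` bytes.

    `thresholds` is an increasing list of byte counts (hypothetical values);
    a flow is demoted one level each time it crosses a threshold.
    """
    for level, limit in enumerate(thresholds):
        if sent < limit:
            return level
    return len(thresholds)


def run_mlfq(flows, thresholds, quantum=1):
    """Serve flows on one link and return their names in completion order."""
    active = list(flows)
    finished = []
    while active:
        # min() returns the first minimal element, so ties within a level
        # are broken by the original list order.
        flow = min(active, key=lambda f: queue_index(f.sent, thresholds))
        flow.sent += min(quantum, flow.size - flow.sent)
        if flow.sent >= flow.size:
            active.remove(flow)
            finished.append(flow.name)
    return finished


if __name__ == "__main__":
    flows = [Flow("large", 100), Flow("small", 5), Flow("medium", 30)]
    print(run_mlfq(flows, thresholds=[10, 50]))
```

Because every flow starts at the highest level and is demoted only as it sends more bytes, short flows tend to finish ahead of long ones even though the scheduler never sees a flow's size.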