Flow scheduling for parallel computing applications in datacenters

HKUST Electronic Theses

by Li Chen

THESIS 2018

Ph.D. Computer Science and Engineering

xii, 125 pages : illustrations ; 30 cm

Abstract

Distributed and parallel computing systems are cornerstones of this era of Big Data, machine learning, and artificial intelligence. Such systems span hundreds or thousands of machines in one or more datacenters to cope with ever-expanding data volumes and the increasing complexity of models and problems. Most recent and important applications, such as web search, business analytics, recommendation systems, and deep neural networks, run on thousands of machines at both small companies and large enterprises. At such scale, communication between machines becomes a bottleneck, and the scheduling of communication sessions (i.e., network flows) is key to accelerating these applications.

This thesis focuses on optimizing flow-level scheduling in datacenters, specifically its three essential aspects: information collection, the scheduling algorithm (decision making), and enforcement of scheduling decisions. We begin with the design of flow-level information collection systems for parallel computing applications. We then study three important but previously ignored scheduling problems in real-world datacenter applications:

• Scheduling with incomplete information: Flows without complete information, such as database queries and responses, cannot be handled by existing flow schedulers. We adapt the multilevel feedback queues used in operating systems to flow scheduling, and develop a queueing-theory model to determine the optimal parameter settings.

• Scheduling heterogeneous flows with diverging objectives: Flows from user-facing applications have completion-time constraints (deadlines), and they coexist with flows that have no such constraints. We identify and abstract this class of problems as mix-flow scheduling, and find that state-of-the-art flow schedulers cannot meet the objectives of both types of flows at the same time. We approach the problem with a systematic formulation and derive a control-theoretic solution using Lyapunov optimization techniques.

• Scheduling with erroneous information: Machine learning techniques are increasingly popular for inferring flow information. However, machine learning results are not always accurate. We therefore design an error-tolerant scheduling algorithm to mitigate the impact of prediction errors.
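As a rough illustration of the first problem, the sketch below shows how a multilevel feedback queue can prioritize flows without knowing their sizes in advance: a flow starts in the highest-priority queue and is demoted as the number of bytes it has sent crosses each threshold. This is a minimal sketch, not the thesis's scheduler. The threshold values, the flow record fields (`sent`, `arrival`), and the function names are all hypothetical; the thesis derives its parameter settings from a queueing-theory model rather than fixing them by hand.

```python
from bisect import bisect_right

# Hypothetical demotion thresholds in bytes (illustrative only; the thesis
# determines its settings with a queueing-theory model).
THRESHOLDS = [100_000, 1_000_000, 10_000_000]


def priority(bytes_sent, thresholds=THRESHOLDS):
    """Return the queue index for a flow (0 is the highest priority).

    A flow moves down one queue each time its bytes sent reach a threshold.
    """
    return bisect_right(thresholds, bytes_sent)


def next_flow(flows):
    """Pick the flow to serve next.

    Strict priority across queues, first-come-first-served within a queue.
    Each flow is a dict with hypothetical keys "sent" and "arrival".
    """
    return min(flows, key=lambda f: (priority(f["sent"]), f["arrival"]))
```

Under this assumed policy, short flows finish while still in the top queue, and long flows gradually yield to newer ones, which approximates shortest-job-first behavior without requiring flow sizes up front.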

We present the proposed solutions for each problem, and demonstrate their effectiveness via simulations and experiments using the flow-level enforcement mechanisms. Our work is integrated into Chukonu, a comprehensive flow scheduling toolkit for distributed and parallel computing applications. Chukonu is composed of a flow information collection framework, a programmable flow scheduler, and a distributed enforcement system for scheduling decisions. It has been deployed at small scale in production datacenters of large Internet service companies, such as Tencent and Huawei.

Copyrighted to the author. Reproduction is prohibited without the author’s prior written consent.

Details

Collection: HKUST Electronic Theses
Degree: Ph.D.
Department: Computer Science and Engineering
Supervisors: Chen, Kai
Authors: Chen, Li
Subjects: Data libraries Data processing Mathematical models Database management Computer scheduling Parallel scheduling (Computer scheduling)
Language: English
Call number: Thesis CSED 2018 Chen
DOI: 10.14711/thesis-991012615464503412
