THESIS
2020
Abstract
The performance bottleneck of distributed deep learning training (DLT) is shifting
from computation to communication as GPUs get faster and model sizes grow larger.
Despite continuous efforts in communication optimization, prior research focuses
mostly on a single job. Such practice ignores the diverse network demands across
different DLT jobs and the heterogeneous computing resource demands of workers and
aggregators; it may ultimately double the communication time, waste significant network
resources, and offer no guarantee on performance objectives.
Our goal is to design a novel framework that enables efficient network resource sharing
and minimizes the average completion time of DLT jobs. We present DEEPSCHEDULER
to achieve this goal. At its core, a dedicated communication layer, composed of
aggregators spanning all machines in the cluster, allows a job to “borrow” network
resources from other jobs. Furthermore, DEEPSCHEDULER makes several algorithmic
innovations in inter-job interference minimization and job prioritization, de-colocating
aggregators and workers to optimize the average DLT job completion time. We have
implemented DEEPSCHEDULER and evaluated it on a small-scale testbed with NVIDIA
V100 GPUs and a 40G RDMA network. Testbed experiments show that DEEPSCHEDULER
speeds up DLT jobs by 1.72x through de-colocation and outperforms NCCL by up to 1.8x.
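
To make the placement idea concrete, below is a minimal Python sketch of how an
aggregator/worker de-colocation policy might look. The Job class, the
place_aggregators function, and the greedy shortest-remaining-time ordering are
illustrative assumptions for exposition only, not DEEPSCHEDULER's actual algorithm.

```python
# Hypothetical sketch: de-colocating each job's aggregator from its own
# workers so it can "borrow" spare network capacity on other machines.
# All names and the greedy policy are assumptions, not the thesis's design.
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    worker_machines: set      # machines hosting this job's workers
    remaining_time: float     # estimated remaining training time


def place_aggregators(jobs, machines):
    """Place each job's aggregator on a machine that does NOT host its
    own workers (de-colocation), preferring the machine with the fewest
    aggregators so far to limit inter-job interference."""
    load = {m: 0 for m in machines}   # aggregators assigned per machine
    placement = {}
    # Schedule short jobs first: a shortest-remaining-time heuristic
    # aimed at reducing the *average* job completion time.
    for job in sorted(jobs, key=lambda j: j.remaining_time):
        candidates = [m for m in machines if m not in job.worker_machines]
        if not candidates:            # no free machine: fall back to colocation
            candidates = list(machines)
        target = min(candidates, key=lambda m: load[m])
        placement[job.name] = target
        load[target] += 1
    return placement


if __name__ == "__main__":
    machines = {"m0", "m1", "m2", "m3"}
    jobs = [
        Job("resnet", {"m0", "m1"}, remaining_time=120.0),
        Job("bert", {"m2", "m3"}, remaining_time=45.0),
    ]
    # Each aggregator lands outside its own job's worker set, using
    # network capacity on machines that run other jobs' workers.
    print(place_aggregators(jobs, machines))
```

In this toy setting the short job ("bert") is placed first and each job's
aggregator ends up on a machine running the other job's workers, which is the
intuition behind de-colocation; the real system would additionally account for
bandwidth, GPU availability, and interference among co-located aggregators.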