THESIS
2023
1 online resource (xvi, 123 pages) : color illustrations
Abstract
With the rapid growth of data and model sizes, it has become common practice to parallelize the training of deep neural networks (DNNs) across a cluster of distributed devices, which however introduces extensive communication overheads. In this thesis, we study both system-level and algorithm-level communication optimization techniques to improve training efficiency.

First, existing data-parallel training systems rely on the all-reduce primitive for gradient aggregation, which leads to sub-optimal training performance. We present DeAR, which decouples the all-reduce primitive into two operators to enable fine-grained communication scheduling, and then applies dynamic tensor fusion to derive an optimal scheduling solution.
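As an illustration of the decomposition (a minimal sketch under assumptions, not the DeAR implementation), the snippet below splits one all-reduce into a reduce-scatter followed by an all-gather with torch.distributed (PyTorch 1.13+), so the two halves can be scheduled independently around the backward and forward passes. The function name and the scheduling comments are illustrative.

    # Minimal sketch (assumed decomposition, not the DeAR implementation):
    # splitting one all-reduce into reduce-scatter + all-gather so the two
    # halves can be scheduled independently around backward/forward passes.
    import torch
    import torch.distributed as dist

    def decoupled_allreduce(grad: torch.Tensor) -> torch.Tensor:
        world_size = dist.get_world_size()
        # Pad so the flattened gradient splits evenly across workers.
        numel = grad.numel()
        chunk = (numel + world_size - 1) // world_size
        flat = torch.zeros(chunk * world_size, device=grad.device, dtype=grad.dtype)
        flat[:numel] = grad.flatten()

        # Stage 1: reduce-scatter can be launched right after this layer's backward.
        shard = torch.empty(chunk, device=grad.device, dtype=grad.dtype)
        dist.reduce_scatter_tensor(shard, flat, op=dist.ReduceOp.SUM)

        # Stage 2: all-gather can be deferred until the parameters are needed
        # again, enabling finer-grained overlap with computation.
        dist.all_gather_into_tensor(flat, shard)
        return flat[:numel].view_as(grad) / world_size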
Second, many gradient compression algorithms have been proposed to compress the communicated data in synchronous stochastic gradient descent (S-SGD) to accelerate distributed training, but we find that they fail to outperform S-SGD in most cases. To this end, we propose ACP-SGD, which greatly reduces the compression and communication overheads while enjoying three system optimizations: all-reduce, pipelining, and tensor fusion.
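To illustrate why compatibility with all-reduce matters, the sketch below shows one way a compressor can keep aggregation as a single all-reduce: every worker selects the same coordinates from a shared seed, so compressed values sum directly without exchanging indices. This is purely illustrative and is not the ACP-SGD algorithm; the function name and sampling scheme are assumptions.

    # Illustrative only (not the ACP-SGD algorithm): shared-seed coordinate
    # selection keeps compressed gradients summable by a single all-reduce.
    import torch
    import torch.distributed as dist

    def shared_index_compress_allreduce(grad: torch.Tensor, step: int, ratio: float = 0.01):
        flat = grad.flatten()
        k = max(1, int(flat.numel() * ratio))
        # Same seed on every worker -> identical index set on all ranks.
        gen = torch.Generator().manual_seed(step)
        idx = torch.randperm(flat.numel(), generator=gen)[:k].to(flat.device)

        values = flat[idx].contiguous()
        dist.all_reduce(values, op=dist.ReduceOp.SUM)  # sums align across workers

        out = torch.zeros_like(flat)
        out[idx] = values / dist.get_world_size()
        return out.view_as(grad)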
Third, we study second-order methods such as distributed K-FAC (D-KFAC) for training DNNs, which exploit curvature information to accelerate the training process. However, D-KFAC incurs extensive computation and communication to construct and exchange this curvature information. We present smart parallel D-KFAC (SPD-KFAC) and placement-aware D-KFAC (PAD-KFAC), which accelerate D-KFAC with efficient pipelining and optimal tensor placement scheduling techniques, respectively.
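For context, the sketch below shows standard single-layer K-FAC preconditioning (the textbook formulation, not SPD-KFAC or PAD-KFAC themselves): each layer maintains two Kronecker factors that must be computed, inverted, and communicated, which is the overhead these systems target. Function names and the damping value are illustrative.

    # Standard single-layer K-FAC preconditioning (textbook formulation):
    # the gradient is preconditioned by the inverses of two Kronecker factors
    # built from input activations a and output-side gradients g.
    import torch

    def kfac_precondition(grad_w, a, g, damping=1e-3):
        # grad_w: (out_dim, in_dim), a: (batch, in_dim), g: (batch, out_dim)
        A = a.t() @ a / a.shape[0]   # input-side Kronecker factor
        G = g.t() @ g / g.shape[0]   # output-side Kronecker factor
        A_inv = torch.linalg.inv(A + damping * torch.eye(A.shape[0], device=A.device))
        G_inv = torch.linalg.inv(G + damping * torch.eye(G.shape[0], device=G.device))
        # Preconditioned update: G^{-1} @ grad_w @ A^{-1}
        return G_inv @ grad_w @ A_inv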
Fourth, we present a memory- and time-efficient second-order algorithm named Eva, with two novel techniques: 1) we approximate the curvature information with two small stochastic vectors to reduce memory and communication consumption, and 2) we derive an efficient update formula that avoids explicitly computing matrix inverses, addressing the high computation overhead.
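The sketch below illustrates the general idea behind such a vector-based approximation, assuming the two vectors stand in for rank-one Kronecker factors of the form v v^T + damping * I: the Sherman-Morrison identity then yields the preconditioned gradient from vector products alone, without forming or inverting any matrix. This is an illustration of the technique, not necessarily the exact Eva update rule.

    # Hedged illustration (not necessarily the exact Eva update): with a
    # rank-one factor v v^T + damping*I, Sherman-Morrison gives the inverse
    # in closed form, so no matrix is ever materialized or inverted.
    import torch

    def left_solve(g_vec, X, damping=1e-3):
        # (g g^T + damping*I)^{-1} X  via Sherman-Morrison
        coeff = (g_vec @ X) / (damping + g_vec @ g_vec)   # shape (in_dim,)
        return (X - torch.outer(g_vec, coeff)) / damping

    def right_solve(a_vec, X, damping=1e-3):
        # X (a a^T + damping*I)^{-1}  via Sherman-Morrison
        coeff = (X @ a_vec) / (damping + a_vec @ a_vec)   # shape (out_dim,)
        return (X - torch.outer(coeff, a_vec)) / damping

    def vector_precondition(grad_w, a_vec, g_vec, damping=1e-3):
        # grad_w: (out_dim, in_dim), a_vec: (in_dim,), g_vec: (out_dim,)
        return right_solve(a_vec, left_solve(g_vec, grad_w, damping), damping)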