THESIS
2023
1 online resource (xvi, 123 pages) : color illustrations
Abstract
With the rapid growth of data and model sizes, it has become common practice to parallelize the training of deep neural networks (DNNs) across a cluster of distributed devices, which however introduces extensive communication overheads. In this thesis, we study both system-level and algorithm-level communication optimization techniques to improve training efficiency.

First, existing data-parallel training systems rely on the all-reduce primitive for gradient aggregation, which leads to sub-optimal training performance. We present DeAR, which decouples the all-reduce primitive into two operators to enable fine-grained communication scheduling, and then applies dynamic tensor fusion to derive an optimal scheduling solution.
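As an illustration of the decomposition (a minimal sketch under assumptions, not the DeAR implementation), the snippet below splits one all-reduce into a reduce-scatter followed by an all-gather with torch.distributed (PyTorch 1.13+), so the two halves can be scheduled independently around the backward and forward passes. The function name and the scheduling comments are illustrative.

    # Minimal sketch (assumed decomposition, not the DeAR implementation):
    # splitting one all-reduce into reduce-scatter + all-gather so the two
    # halves can be scheduled independently around backward/forward passes.
    import torch
    import torch.distributed as dist

    def decoupled_allreduce(grad: torch.Tensor) -> torch.Tensor:
        world_size = dist.get_world_size()
        # Pad so the flattened gradient splits evenly across workers.
        numel = grad.numel()
        chunk = (numel + world_size - 1) // world_size
        flat = torch.zeros(chunk * world_size, device=grad.device, dtype=grad.dtype)
        flat[:numel] = grad.flatten()

        # Stage 1: reduce-scatter can be launched right after this layer's backward.
        shard = torch.empty(chunk, device=grad.device, dtype=grad.dtype)
        dist.reduce_scatter_tensor(shard, flat, op=dist.ReduceOp.SUM)

        # Stage 2: all-gather can be deferred until the parameters are needed
        # again, enabling finer-grained overlap with computation.
        dist.all_gather_into_tensor(flat, shard)
        return flat[:numel].view_as(grad) / world_size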
Second, many gradient compression algorithms have been proposed to compress the communicated data in synchronous stochastic gradient descent (S-SGD) to accelerate distributed training, but we find that they fail to outperform S-SGD in most cases. To this end, we propose ACP-SGD, which greatly reduces the compression and communication overheads while enjoying three system optimizations: all-reduce, pipelining, and tensor fusion.
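To illustrate why compatibility with all-reduce matters, the sketch below shows one way a compressor can keep aggregation as a single all-reduce: every worker selects the same coordinates from a shared seed, so compressed values sum directly without exchanging indices. This is purely illustrative and is not the ACP-SGD algorithm; the function name and sampling scheme are assumptions.

    # Illustrative only (not the ACP-SGD algorithm): shared-seed coordinate
    # selection keeps compressed gradients summable by a single all-reduce.
    import torch
    import torch.distributed as dist

    def shared_index_compress_allreduce(grad: torch.Tensor, step: int, ratio: float = 0.01):
        flat = grad.flatten()
        k = max(1, int(flat.numel() * ratio))
        # Same seed on every worker -> identical index set on all ranks.
        gen = torch.Generator().manual_seed(step)
        idx = torch.randperm(flat.numel(), generator=gen)[:k].to(flat.device)

        values = flat[idx].contiguous()
        dist.all_reduce(values, op=dist.ReduceOp.SUM)  # sums align across workers

        out = torch.zeros_like(flat)
        out[idx] = values / dist.get_world_size()
        return out.view_as(grad)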
Third, we study second-order methods such as distributed K-FAC (D-KFAC) for training DNNs, which exploit curvature information to accelerate the training process. However, D-KFAC incurs extensive computation and communication to construct and exchange this curvature information. We present smart parallel D-KFAC (SPD-KFAC) and placement-aware D-KFAC (PAD-KFAC), which accelerate D-KFAC with efficient pipelining and optimal tensor placement scheduling techniques, respectively.
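For context, the sketch below shows standard single-layer K-FAC preconditioning (the textbook formulation, not SPD-KFAC or PAD-KFAC themselves): each layer maintains two Kronecker factors that must be computed, inverted, and communicated, which is the overhead these systems target. Function names and the damping value are illustrative.

    # Standard single-layer K-FAC preconditioning (textbook formulation):
    # the gradient is preconditioned by the inverses of two Kronecker factors
    # built from input activations a and output-side gradients g.
    import torch

    def kfac_precondition(grad_w, a, g, damping=1e-3):
        # grad_w: (out_dim, in_dim), a: (batch, in_dim), g: (batch, out_dim)
        A = a.t() @ a / a.shape[0]   # input-side Kronecker factor
        G = g.t() @ g / g.shape[0]   # output-side Kronecker factor
        A_inv = torch.linalg.inv(A + damping * torch.eye(A.shape[0], device=A.device))
        G_inv = torch.linalg.inv(G + damping * torch.eye(G.shape[0], device=G.device))
        # Preconditioned update: G^{-1} @ grad_w @ A^{-1}
        return G_inv @ grad_w @ A_inv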
Fourth, we present a memory- and time-efficient second-order algorithm named Eva, with two novel techniques: 1) we approximate the curvature information with two small stochastic vectors to reduce memory and communication consumption, and 2) we derive an efficient update formula that avoids explicitly computing matrix inverses, addressing the high computation overhead.
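The sketch below illustrates the general idea behind such a vector-based approximation, assuming the two vectors stand in for rank-one Kronecker factors of the form v v^T + damping * I: the Sherman-Morrison identity then yields the preconditioned gradient from vector products alone, without forming or inverting any matrix. This is an illustration of the technique, not necessarily the exact Eva update rule.

    # Hedged illustration (not necessarily the exact Eva update): with a
    # rank-one factor v v^T + damping*I, Sherman-Morrison gives the inverse
    # in closed form, so no matrix is ever materialized or inverted.
    import torch

    def left_solve(g_vec, X, damping=1e-3):
        # (g g^T + damping*I)^{-1} X  via Sherman-Morrison
        coeff = (g_vec @ X) / (damping + g_vec @ g_vec)   # shape (in_dim,)
        return (X - torch.outer(g_vec, coeff)) / damping

    def right_solve(a_vec, X, damping=1e-3):
        # X (a a^T + damping*I)^{-1}  via Sherman-Morrison
        coeff = (X @ a_vec) / (damping + a_vec @ a_vec)   # shape (out_dim,)
        return (X - torch.outer(coeff, a_vec)) / damping

    def vector_precondition(grad_w, a_vec, g_vec, damping=1e-3):
        # grad_w: (out_dim, in_dim), a_vec: (in_dim,), g_vec: (out_dim,)
        return right_solve(a_vec, left_solve(g_vec, grad_w, damping), damping)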