THESIS
2020
Abstract
The performance bottleneck of distributed deep learning training (DLT) is shifting
from computation to communication as GPUs get faster and model sizes grow larger.
Despite continuous efforts in communication optimization, prior research focuses
mostly on a single job. Such practice ignores the diverse network demands across
different DLT jobs and the heterogeneous computing resource demands of workers and
aggregators; it may ultimately double the communication time, waste significant network
resources, and offer no guarantee on performance objectives.
Our goal is to design a novel framework that enables efficient network resource sharing
and minimizes the average completion time of DLT jobs. We present DEEPSCHEDULER
to achieve this goal. At its core, a dedicated communication layer, composed of
aggregators spanning all machines in the cluster, allows a job to “borrow” network
resources from other jobs. Furthermore, DEEPSCHEDULER makes several algorithmic
innovations in inter-job interference minimization and job prioritization, de-colocating
aggregators and workers to optimize the average DLT job completion time. We have
implemented DEEPSCHEDULER and evaluated it on a small-scale testbed with NVIDIA
V100 GPUs and a 40G RDMA network. Testbed experiments show that DEEPSCHEDULER
speeds up DLT jobs by 1.72x through de-colocation and outperforms NCCL by up to 1.8x.
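
To make the placement idea concrete, below is a minimal Python sketch of how an
aggregator/worker de-colocation policy might look. The Job class, the
place_aggregators function, and the greedy shortest-remaining-time ordering are
illustrative assumptions for exposition only, not DEEPSCHEDULER's actual algorithm.

```python
# Hypothetical sketch: de-colocating each job's aggregator from its own
# workers so it can "borrow" spare network capacity on other machines.
# All names and the greedy policy are assumptions, not the thesis's design.
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    worker_machines: set      # machines hosting this job's workers
    remaining_time: float     # estimated remaining training time


def place_aggregators(jobs, machines):
    """Place each job's aggregator on a machine that does NOT host its
    own workers (de-colocation), preferring the machine with the fewest
    aggregators so far to limit inter-job interference."""
    load = {m: 0 for m in machines}   # aggregators assigned per machine
    placement = {}
    # Schedule short jobs first: a shortest-remaining-time heuristic
    # aimed at reducing the *average* job completion time.
    for job in sorted(jobs, key=lambda j: j.remaining_time):
        candidates = [m for m in machines if m not in job.worker_machines]
        if not candidates:            # no free machine: fall back to colocation
            candidates = list(machines)
        target = min(candidates, key=lambda m: load[m])
        placement[job.name] = target
        load[target] += 1
    return placement


if __name__ == "__main__":
    machines = {"m0", "m1", "m2", "m3"}
    jobs = [
        Job("resnet", {"m0", "m1"}, remaining_time=120.0),
        Job("bert", {"m2", "m3"}, remaining_time=45.0),
    ]
    # Each aggregator lands outside its own job's worker set, using
    # network capacity on machines that run other jobs' workers.
    print(place_aggregators(jobs, machines))
```

In this toy setting the short job ("bert") is placed first and each job's
aggregator ends up on a machine running the other job's workers, which is the
intuition behind de-colocation; the real system would additionally account for
bandwidth, GPU availability, and interference among co-located aggregators.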