THESIS
2019
xii, 60 pages : illustrations ; 30 cm
Abstract
Distributed machine learning (DML) is of growing importance. Due to the increasing scale of data and complexity of models, many important machine learning problems cannot be effectively solved by a single machine. Existing scheduling algorithms are insufficient for the complex computation-communication pattern of DML. In the training stage of DML, the network becomes a bottleneck, as the models trained on different machines require frequent synchronization and updates, transmitting parameters at MB-to-GB scale on a second to sub-second timescale. In this thesis, we focus on the network scheduling problems of DML.
Firstly, we propose SaSP, an intra-job scheduler for allocating resources to the processes of the same DML job running on different servers. We show that DML trains faster when the computation and communication processes are decoupled in the scheduler design. Our prototype achieves a 25% to 50% speedup compared with different parameter synchronization schemes on various DML applications.
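To make the decoupling idea concrete, the following is a minimal, illustrative Python sketch, not SaSP's actual design: it contrasts an iteration that serializes gradient computation and parameter transmission with one that starts transmitting each layer's gradients as soon as they are ready. The layer sizes, compute times, and link bandwidth are assumed values chosen only for illustration.

import threading
import time

# Hypothetical per-layer gradient sizes (MB) and backward-pass times (s);
# purely illustrative, not figures from the thesis.
LAYERS = [("fc2", 40, 0.02), ("fc1", 400, 0.05), ("conv", 10, 0.10)]
NETWORK_MBPS = 1000  # assumed 1 GB/s link

def push_gradients(name, size_mb):
    # Stand-in for a parameter-server push: sleep for the transfer time
    # the assumed link would need.
    time.sleep(size_mb / NETWORK_MBPS)

def backward(name, compute_s):
    # Stand-in for computing this layer's gradients.
    time.sleep(compute_s)

def coupled_iteration():
    # Compute all gradients first, then communicate: no overlap.
    for name, size, comp in LAYERS:
        backward(name, comp)
    for name, size, comp in LAYERS:
        push_gradients(name, size)

def decoupled_iteration():
    # Push each layer's gradients as soon as they are ready, overlapping
    # communication with the remaining computation.
    threads = []
    for name, size, comp in LAYERS:
        backward(name, comp)
        t = threading.Thread(target=push_gradients, args=(name, size))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()

if __name__ == "__main__":
    for fn in (coupled_iteration, decoupled_iteration):
        start = time.time()
        fn()
        print(f"{fn.__name__}: {time.time() - start:.3f}s")

Under these assumed numbers, the decoupled iteration finishes earlier because the large fc1 transfer is hidden behind the remaining backward computation.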
Secondly, we present DeepProphet, a tool that estimates the computation and network resource requirements offline by analyzing the dataflow graph representing the DML application. Given the hardware configuration, DeepProphet predicts the iteration completion time with below 10% average error. We demonstrate that the resource requirements of DML can be estimated accurately via offline analysis, a feature that benefits later inter-job scheduler designs.
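As a rough illustration of offline prediction from a dataflow graph, here is a small Python sketch, not DeepProphet's actual cost model: each node carries an assumed compute-time estimate and parameter volume, and the iteration time is predicted from the totals under a simple overlap assumption. All op names, times, sizes, and the bandwidth are hypothetical.

# Toy dataflow graph: each node has an estimated compute time and the
# parameter bytes it exchanges with the parameter servers per iteration.
graph = [
    {"op": "conv1", "compute_s": 0.030, "params_mb": 2},
    {"op": "conv2", "compute_s": 0.045, "params_mb": 8},
    {"op": "fc",    "compute_s": 0.010, "params_mb": 400},
]

def predict_iteration_time(graph, bandwidth_mbps=1000, overlap=True):
    """Predict one iteration's completion time from per-op estimates.

    With overlap=True, communication is assumed to hide behind computation,
    so the iteration takes max(compute, communication); otherwise the two
    phases are serialized.
    """
    compute = sum(node["compute_s"] for node in graph)
    comm = sum(node["params_mb"] for node in graph) / bandwidth_mbps
    return max(compute, comm) if overlap else compute + comm

print(f"overlapped: {predict_iteration_time(graph):.3f}s")
print(f"serialized: {predict_iteration_time(graph, overlap=False):.3f}s")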