THESIS
2021
1 online resource (xii, 95 pages) : illustrations (chiefly color)
Abstract
As Deep Neural Networks (DNNs) become increasingly popular, there is a growing trend to accelerate DNN applications on hardware platforms such as GPUs and FPGAs to gain higher performance and efficiency. However, designing hardware-efficient architectures and tuning their performance is time-consuming, owing to the strong background required in hardware details, the large design space, and the expensive cost of evaluating each design point. In this work, we propose comprehensive frameworks for DNN design optimization on FPGA and GPU, respectively. For FPGA, a novel data structure, LoopTree, is proposed to provide a high-level abstraction of the OpenCL-based DNN design. A coarse-grained model and a fine-grained model are developed to predict the performance of a LoopTree so that its design space can be explored efficiently. Average estimation errors of 8.87% and 4.8% are observed for the coarse-grained and fine-grained models, respectively, which are much smaller than the widely used estimation based on operation statistics. For the GPU counterpart, we automatically generate designs from templates and the corresponding parameter combinations. A novel framework based on transfer learning and a Guided Genetic Algorithm (GGA) is proposed to speed up the hardware tuning process. Our experiments show that we achieve performance superior to state-of-the-art work, such as the auto-tuning framework TVM and the hand-optimized library cuDNN, while reducing the search time by 8.96x and 4.58x compared with the XGBoost tuner and the GA tuner in TVM.
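The tuning loop can be pictured roughly as in the minimal Python sketch below: a genetic search over GPU kernel template parameters whose selection step is steered by a learned cost model instead of measuring every candidate on hardware. The parameter space, the toy cost proxy, and all names are illustrative assumptions, not the thesis's actual GGA or transfer-learning implementation.

    # Illustrative sketch only: genetic search over hypothetical template knobs,
    # ranked by a stand-in cost model rather than on-device measurement.
    import random

    PARAM_SPACE = {
        "tile_x": [4, 8, 16, 32],
        "tile_y": [4, 8, 16, 32],
        "unroll": [1, 2, 4],
        "vectorize": [1, 2, 4, 8],
    }

    def predicted_cost(config):
        # Stand-in for a transfer-learned cost model; a toy analytic proxy here.
        work = config["tile_x"] * config["tile_y"]
        return abs(work - 256) + 10.0 / (config["unroll"] * config["vectorize"])

    def random_config():
        return {k: random.choice(v) for k, v in PARAM_SPACE.items()}

    def mutate(config):
        child = dict(config)
        key = random.choice(list(PARAM_SPACE))
        child[key] = random.choice(PARAM_SPACE[key])
        return child

    def crossover(a, b):
        return {k: random.choice([a[k], b[k]]) for k in PARAM_SPACE}

    def guided_ga(pop_size=16, generations=20):
        population = [random_config() for _ in range(pop_size)]
        for _ in range(generations):
            # Rank candidates by the cost model to avoid measuring each one.
            population.sort(key=predicted_cost)
            parents = population[: pop_size // 2]
            children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                        for _ in range(pop_size - len(parents))]
            population = parents + children
        return min(population, key=predicted_cost)

    print(guided_ga())

In this kind of setup, only the few survivors of each generation would need real hardware measurements, which is what makes a model-guided GA cheaper than exhaustive or purely random search.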
We also investigate auto-pruning to obtain more compact neural networks that ease hardware constraints. We observe that the widely used reinforcement learning (RL) algorithm has become the timing bottleneck of the auto-pruning process. Therefore, we propose a framework that significantly accelerates the RL algorithm by taking advantage of the pruning history from other pruning scenarios. Experiments show that our framework accelerates the auto-pruning process by 1.5x to 2.5x for ResNet-20, and by 1.81x to 2.375x for other neural networks such as ResNet-56, ResNet-18, and MobileNet v1.
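The idea of reusing pruning history can be sketched as warm-starting the per-layer pruning-ratio search from ratios found on a previously pruned network, then refining them. The reward proxy, layer count, and simple perturbation loop below are assumptions for illustration and do not reproduce the thesis's RL agent or its acceleration scheme.

    # Illustrative sketch only: warm-start a pruning-ratio search from history
    # found on a similar network, then refine with a simple accept/reject loop.
    import random

    # Hypothetical per-layer pruning ratios from an earlier pruning scenario.
    HISTORY_RATIOS = [0.3, 0.5, 0.5, 0.6, 0.4]

    def reward(ratios):
        # Toy proxy: reward overall sparsity, penalize overly aggressive layers.
        sparsity = sum(ratios) / len(ratios)
        penalty = sum(max(0.0, r - 0.7) for r in ratios)
        return sparsity - 2.0 * penalty

    def refine(init_ratios, steps=200, sigma=0.05):
        best = list(init_ratios)
        best_r = reward(best)
        for _ in range(steps):
            cand = [min(0.9, max(0.0, r + random.gauss(0.0, sigma))) for r in best]
            r = reward(cand)
            if r > best_r:
                best, best_r = cand, r
        return best, best_r

    # Compare warm start from history against starting from scratch.
    warm, warm_r = refine(HISTORY_RATIOS)
    cold, cold_r = refine([0.0] * len(HISTORY_RATIOS))
    print("warm-start reward:", round(warm_r, 3), "cold-start reward:", round(cold_r, 3))

The warm-started search begins near a known-good configuration, so it typically needs far fewer expensive evaluation episodes to reach a comparable result, which is the intuition behind reusing history across pruning scenarios.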