THESIS
2018
xvi, 136 pages : illustrations ; 30 cm
Abstract
With the development of artificial intelligence (AI), deep learning has become the mainstream approach for a wide range of machine learning applications, including computer vision (CV), natural language processing (NLP), and robotics. While deep learning outperforms many other state-of-the-art algorithms in the AI domain, such as the support vector machine (SVM) and the scale-invariant feature transform (SIFT), its high computational and storage complexity poses a challenge to widespread deployment on energy-constrained embedded platforms. Accordingly, this thesis focuses on exploring energy-efficient solutions for deep learning accelerators. A deep learning algorithm typically consists of two phases: the training phase and the inference phase. From the consumer's point of view, energy efficiency plays a critical role in the deployment of deep learning; as a result, only the inference phase is considered for hardware acceleration in this thesis. Conventional computer architecture design focuses solely on hardware exploration, while the algorithms are kept untouched and regarded as standard benchmarks for verifying the performance of the architecture. Deep learning algorithms, however, are highly resilient and tolerate a large amount of approximate computing. Therefore, to achieve optimal energy efficiency in the hardware implementation, co-optimization at both the algorithmic level and the architectural level becomes necessary. This thesis bridges the two: the algorithm-level optimizations are targeted at improving the energy efficiency of hardware execution, while the specialized hardware architectures are designed to better exploit the properties of the optimized algorithms.
Firstly, we present a general high-performance Network-on-Chip (NoC) router design, namely BiLink. Contemporary deep learning accelerators commonly adopt a tile-based architecture to meet the high internal bandwidth requirements of deep learning applications, so a high-performance on-chip router is critical for transferring activations between the central memory and the processing tiles. In BiLink, an extra link stage is inserted between two neighboring routers, allowing up to four flits to be transferred over the link stage in each clock cycle under both even and uneven traffic patterns.
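As a rough, illustrative sketch of this throughput claim (not the actual BiLink microarchitecture, whose details are given in the thesis itself), the Python snippet below models a link stage with four flit slots shared between the two directions of a bidirectional channel, so that four flits can be forwarded per cycle whether the traffic splits evenly (2+2) or unevenly (4+0); the slot count and the round-robin arbitration are assumptions made only for this example.

    from collections import deque

    def link_stage_cycle(eastbound, westbound, slots=4):
        # Forward up to `slots` flits in one cycle, sharing the slots between
        # the two directions of the link stage (illustrative policy only).
        queues = [deque(eastbound), deque(westbound)]
        forwarded, turn = [], 0
        while slots > 0 and any(queues):
            if queues[turn % 2]:
                forwarded.append(queues[turn % 2].popleft())
                slots -= 1
            turn += 1
        return forwarded

    # Even traffic: two flits per direction, all four forwarded this cycle.
    print(link_stage_cycle(["e0", "e1"], ["w0", "w1"]))
    # Uneven traffic: one busy direction can still use all four slots.
    print(link_stage_cycle(["e0", "e1", "e2", "e3"], []))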
We then focus on domain-specific architectures for deep learning applications. Concretely, we first address the high computational complexity of deep learning algorithms: conventional low-rank approximation (LRA) is adopted to bypass redundant operations in the inference phase, and a single-instruction-multiple-data (SIMD) architecture is designed to exploit the reduced operation count to improve both throughput and energy efficiency.
Next, we propose a novel end-to-end training algorithm that improves on the conventional LRA approach with better computational efficiency, i.e., less computation overhead and faster training. To address hardware scalability and the inherent sparsity of deep learning algorithms, we then propose SparseNN, a fully distributed architecture built around a dedicated on-chip network.
Thirdly, we address the large memory footprint of deep learning algorithms by compressing the weights of the neural network. To improve hardware efficiency, we enhance the conventional hashing-based compression technique by introducing an additional level of spatial locality. An FPGA prototype of the SIMD architecture demonstrates that, with this compression technique, a higher inference throughput can be achieved than with direct implementations on the CPU and GPU.
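As one plausible reading of this scheme (the block size, hash functions, and indexing below are illustrative assumptions, not the exact technique evaluated on the FPGA), a hashing-compressed layer stores only a small shared parameter array and maps each virtual weight position into it; adding spatial locality means hashing a whole block of neighboring positions to one contiguous region, so that weights fetched together by the SIMD lanes also sit together in memory:

    import numpy as np

    def hashed_weight(i, j, shared, seed=0):
        # Baseline hashing compression: each position (i, j) is mapped
        # independently into the shared parameter array.
        return shared[hash((i, j, seed)) % shared.size]

    def block_hashed_weight(i, j, shared, block=16, seed=0):
        # Spatially local variant: all positions in the same block of columns
        # hash to one contiguous region, preserving locality for SIMD fetches.
        regions = shared.size // block
        base = (hash((i, j // block, seed)) % regions) * block
        return shared[base + (j % block)]

    shared = np.random.randn(4096)            # compressed parameter budget
    w = block_hashed_weight(3, 41, shared)    # virtual weight W[3, 41]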
Fourthly, we enhance SparseNN by combining three levels of sparsity: input activation sparsity, output activation sparsity, and weight sparsity. A new compression algorithm, SparserNN, is presented to further reduce the computational complexity of neural networks. The algorithm is also evaluated on a more realistic and challenging benchmark, namely AlexNet on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC).
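As a schematic sketch of how the three levels of sparsity remove work in a single layer (the zero-skipping policy shown here is an illustration of the idea, not the SparserNN algorithm itself), only non-zero input activations are read, only non-zero weights are multiplied, and outputs already known to be masked out, for example by a predicted ReLU zero, are never computed:

    import numpy as np

    def sparse_layer(x, W, out_mask):
        # Compute y = relu(W @ x) while skipping zero inputs, zero weights,
        # and outputs that the mask marks as not needed.
        y = np.zeros(W.shape[0])
        nz_in = np.nonzero(x)[0]                  # input activation sparsity
        for o in np.nonzero(out_mask)[0]:         # output activation sparsity
            acc = 0.0
            for i in nz_in:
                if W[o, i] != 0.0:                # weight sparsity
                    acc += W[o, i] * x[i]
            y[o] = max(acc, 0.0)                  # ReLU
        return y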
Finally, we discuss several possible directions for future work, including the microarchitecture design of SparserNN and the architectural exploration of advanced neural network models.