THESIS
2021
1 online resource (xiv, 119 pages) : illustrations (some color)
Abstract
Deep learning (DL) algorithms have attracted great attention in recent years due to their success
in various fields, such as visual recognition, object detection, semantic segmentation, speech
analysis, and text recognition. At the same time, however, the superiority of DL models usually
comes at the cost of intensive computation and large storage requirements. As a result, it is
challenging to deploy contemporary DL models on embedded systems that have limited
hardware resources and strict constraints on power consumption. Dedicated hardware accelerators
are required to tackle this problem. In this thesis, we focus on hardware-software
codesign to achieve high-throughput and energy-efficient implementations of DL models.
The codesign is twofold. From the algorithmic perspective, DL models are optimized
to reduce both computational complexity and model size. Several techniques
are explored, including dynamic quantization, model compression, and the runtime detection
and elimination of redundant computations. From the hardware perspective, dedicated architectures
based on conventional CMOS technology and the emerging processing-in-memory
technology are designed to fully exploit these algorithmic optimizations for energy saving
and speedup.
In this thesis, we will first address the issue of high computational complexity in DL and
propose an efficient accelerator, named SubMac, based on resistive random-access memory
(RRAM). Unlike conventional accelerators, SubMac slices each weight and
activation into multiple subwords and breaks the multiply-accumulate (MAC) operation into
a series of subword computations. A fine-grained dynamic quantization method is proposed:
the effective bit-length of each weight and activation changes dynamically throughout the serial
computation, so that only the most significant subword products are computed and accumulated.
The proposed method saves a large portion of the computations, leading to significant
improvements in energy efficiency and throughput.
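As an illustration of the subword-serial MAC, the following Python sketch slices each operand
into subwords and accumulates only the most significant nonzero subword products; the 8-bit
operands, 2-bit subwords, and the number of retained subwords are illustrative assumptions
rather than parameters fixed by the thesis.

    def slice_subwords(x, total_bits=8, sub_bits=2):
        # Split an unsigned integer into subwords, most significant first.
        n = total_bits // sub_bits
        mask = (1 << sub_bits) - 1
        return [(x >> (sub_bits * (n - 1 - i))) & mask for i in range(n)]

    def leading_subwords(subs, keep):
        # Dynamic effective bit-length: start at the first nonzero subword
        # and keep at most `keep` subwords from there.
        start = next((i for i, s in enumerate(subs) if s != 0), len(subs))
        return range(start, min(start + keep, len(subs)))

    def submac(weights, acts, keep=2, total_bits=8, sub_bits=2):
        # Approximate the MAC by accumulating only the most significant
        # subword products of each weight/activation pair.
        n = total_bits // sub_bits
        acc = 0
        for w, a in zip(weights, acts):
            ws = slice_subwords(w, total_bits, sub_bits)
            As = slice_subwords(a, total_bits, sub_bits)
            for i in leading_subwords(ws, keep):
                for j in leading_subwords(As, keep):
                    acc += (ws[i] * As[j]) << (sub_bits * (2 * n - 2 - i - j))
        return acc

Setting keep=1 corresponds to the most aggressive quantization, where only the single most
significant nonzero subword of each operand contributes to the result.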
Then, we will present an algorithm for detecting redundant computations at runtime and
terminating the corresponding MAC operations early. Specifically, we focus on two types of
redundancy. The first is related to the widely used activation function, ReLU. Since
negative MAC results are clamped to zero by ReLU, their exact values are irrelevant to
the following layers; the MAC operation can therefore be terminated as soon as a negative
sign is detected. The second source of redundancy arises when the partially accumulated result
is already significant enough to dictate the final MAC result. In this case, the remaining
computations can be skipped to save execution time and energy. A dedicated RRAM-based
architecture is designed to exploit this early-termination method.
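A minimal sketch of the two early-exit tests is given below, assuming subword-serial processing
with non-negative (post-ReLU) activations and an assumed saturation level sat; these specific
choices are illustrative and not taken from the thesis.

    import numpy as np

    def relu_mac_early_exit(weights, acts, total_bits=8, sub_bits=2, sat=127):
        # Process activation subwords from most to least significant and
        # stop as soon as the clamped output is already decided.
        n_rounds = total_bits // sub_bits
        mask = (1 << sub_bits) - 1
        acc = 0
        for r in range(n_rounds):
            shift = sub_bits * (n_rounds - 1 - r)
            a_sub = (acts >> shift) & mask            # current subwords
            acc += int(np.dot(weights, a_sub)) << shift
            # Bound what the remaining (less significant) bits can add.
            remaining = (1 << shift) - 1
            pos_bound = int(np.sum(np.maximum(weights, 0))) * remaining
            neg_bound = int(np.sum(np.minimum(weights, 0))) * remaining
            if acc + pos_bound < 0:    # ReLU will clamp the result to zero
                return 0
            if acc + neg_bound > sat:  # the output is already saturated
                return sat
        return min(max(acc, 0), sat)

The first test captures the ReLU redundancy: once the running sum plus the largest possible
remaining positive contribution is still negative, the output is provably zero. The second test
captures the dominance case, where the partial sum already dictates the saturated final result.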
In the third part, we will propose a novel model compression method to improve the throughput
and energy efficiency of DL accelerators. The unstructured sparsity that results from pruning
poses a challenge to the efficient implementation of pruned models on regular architectures
such as systolic arrays. A weight permutation scheme is proposed to tackle this problem. Through
permutation, the sparse weight matrix is further compressed into a small, dense format that
makes full use of the hardware resources. Experimental results show that the proposed method
effectively improves the matrix compression rate.
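To convey the idea of compressing a sparse matrix into a denser one, here is a simplified greedy
column-packing sketch in Python; the actual permutation scheme in the thesis is more elaborate,
so this is only an illustration of the principle.

    import numpy as np

    def pack_columns(W):
        # Greedily merge columns whose nonzero rows do not collide, so
        # several sparse columns share one dense column of the packed
        # matrix. Returns the packed matrix and, for each packed column,
        # the original column indices it holds (needed to route results).
        packed, groups = [], []
        for c in range(W.shape[1]):
            col = W[:, c]
            nz = col != 0
            for p, grp in zip(packed, groups):
                if not np.any((p != 0) & nz):   # no row collision
                    p[nz] = col[nz]
                    grp.append(c)
                    break
            else:
                packed.append(col.copy())
                groups.append([c])
        return np.stack(packed, axis=1), groups

The fewer columns the packed matrix has, the better the hardware utilization; permuting the
rows before packing increases the chance that columns can be merged without collisions.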
In the fourth part, we will further improve the throughput and energy efficiency of DL
accelerators with a flexible subword-level model compression method. The weight matrix is
quantized and pruned at the subword level; after pruning, most nonzero weights retain only one
nonzero subword. Two weights can then be merged and mapped to the same MAC unit as
long as their nonzero subwords do not occupy the same positions. The proposed method enhances
the flexibility of column packing and facilitates matrix compression, effectively improving the
compression performance.
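A sketch of the subword-level merging rule follows, with the bit widths assumed for
illustration:

    def subword_positions(w, total_bits=8, sub_bits=2):
        # Indices of the nonzero subwords of w (0 = most significant).
        n = total_bits // sub_bits
        mask = (1 << sub_bits) - 1
        return {i for i in range(n) if (w >> (sub_bits * (n - 1 - i))) & mask}

    def can_merge(w1, w2, total_bits=8, sub_bits=2):
        # Two weights may share one MAC unit if their nonzero subwords
        # never occupy the same position.
        return subword_positions(w1, total_bits, sub_bits).isdisjoint(
            subword_positions(w2, total_bits, sub_bits))

For example, can_merge(0b11000000, 0b00000011) returns True: the two weights keep their
nonzero subwords at opposite ends, so they can be packed into a single 8-bit MAC operand.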
Finally, we will discuss several possible directions for future work, including
the elimination of redundant computations in compressed DL models and data-independent compression.