THESIS
2021
1 online resource (xiv, 119 pages) : illustrations (some color)
Abstract
Deep learning (DL) algorithms have attracted great attention in recent years due to their success
in various fields, such as visual recognition, object detection, semantic segmentation, speech
analysis, and text recognition. At the same time, however, the superiority of DL models usually
comes at the cost of intensive computation and large storage requirements. As a result, it is
challenging to deploy contemporary DL models on embedded systems that have limited
hardware resources and strict constraints on power consumption. Dedicated hardware accelerators
are required to tackle this problem. In this thesis, we focus on hardware-software
codesign to achieve high-throughput and energy-efficient implementations of DL models.
The codesign is twofold. From the algorithmic perspective, DL models are optimized
to reduce both computational complexity and model size. Several techniques
are explored, including dynamic quantization, model compression, and the runtime detection
and elimination of redundant computations. From the hardware perspective, dedicated architectures
based on conventional CMOS technology and the emerging processing-in-memory
technology are designed to fully exploit these algorithmic optimizations for energy saving
and speedup.
In this thesis, we will first address the issue of high computational complexity in DL and
propose an efficient accelerator, named SubMac, based on resistive random-access memory
(RRAM). Unlike conventional accelerators, SubMac slices each weight and
activation into multiple subwords and breaks the multiply-accumulate (MAC) operation into
a series of subword computations. A fine-grained dynamic quantization method is proposed:
the effective bit-length of each weight and activation changes dynamically throughout the serial
computation, so that only the most significant subword products are computed and accumulated.
The proposed method saves a large portion of the computations, leading to significant
improvements in energy efficiency and throughput.
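As an illustration of the subword-serial MAC, the following Python sketch slices each operand
into subwords and accumulates only the most significant nonzero subword products; the 8-bit
operands, 2-bit subwords, and the number of retained subwords are illustrative assumptions
rather than parameters fixed by the thesis.

    def slice_subwords(x, total_bits=8, sub_bits=2):
        # Split an unsigned integer into subwords, most significant first.
        n = total_bits // sub_bits
        mask = (1 << sub_bits) - 1
        return [(x >> (sub_bits * (n - 1 - i))) & mask for i in range(n)]

    def leading_subwords(subs, keep):
        # Dynamic effective bit-length: start at the first nonzero subword
        # and keep at most `keep` subwords from there.
        start = next((i for i, s in enumerate(subs) if s != 0), len(subs))
        return range(start, min(start + keep, len(subs)))

    def submac(weights, acts, keep=2, total_bits=8, sub_bits=2):
        # Approximate the MAC by accumulating only the most significant
        # subword products of each weight/activation pair.
        n = total_bits // sub_bits
        acc = 0
        for w, a in zip(weights, acts):
            ws = slice_subwords(w, total_bits, sub_bits)
            As = slice_subwords(a, total_bits, sub_bits)
            for i in leading_subwords(ws, keep):
                for j in leading_subwords(As, keep):
                    acc += (ws[i] * As[j]) << (sub_bits * (2 * n - 2 - i - j))
        return acc

Setting keep=1 corresponds to the most aggressive quantization, where only the single most
significant nonzero subword of each operand contributes to the result.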
Then, we will present an algorithm for detecting redundant computations at runtime and
terminating the corresponding MAC operations early. Specifically, we focus on two types of
redundancy. The first is related to the widely used activation function, ReLU. Since
negative MAC results are clamped to zero by ReLU, their exact values are irrelevant to
the following layers; the MAC operation can therefore be terminated as soon as a negative
sign is detected. The second source of redundancy arises when the partially accumulated result
is already significant enough to dictate the final MAC result. In this case, the remaining
computations can be skipped to save execution time and energy. A dedicated RRAM-based
architecture is designed to exploit this early-termination method.
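A minimal sketch of the two early-exit tests is given below, assuming subword-serial processing
with non-negative (post-ReLU) activations and an assumed saturation level sat; these specific
choices are illustrative and not taken from the thesis.

    import numpy as np

    def relu_mac_early_exit(weights, acts, total_bits=8, sub_bits=2, sat=127):
        # Process activation subwords from most to least significant and
        # stop as soon as the clamped output is already decided.
        n_rounds = total_bits // sub_bits
        mask = (1 << sub_bits) - 1
        acc = 0
        for r in range(n_rounds):
            shift = sub_bits * (n_rounds - 1 - r)
            a_sub = (acts >> shift) & mask            # current subwords
            acc += int(np.dot(weights, a_sub)) << shift
            # Bound what the remaining (less significant) bits can add.
            remaining = (1 << shift) - 1
            pos_bound = int(np.sum(np.maximum(weights, 0))) * remaining
            neg_bound = int(np.sum(np.minimum(weights, 0))) * remaining
            if acc + pos_bound < 0:    # ReLU will clamp the result to zero
                return 0
            if acc + neg_bound > sat:  # the output is already saturated
                return sat
        return min(max(acc, 0), sat)

The first test captures the ReLU redundancy: once the running sum plus the largest possible
remaining positive contribution is still negative, the output is provably zero. The second test
captures the dominance case, where the partial sum already dictates the saturated final result.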
In the third part, we will propose a novel model compression method to improve the throughput
and energy efficiency of DL accelerators. The unstructured sparsity that results from pruning
poses a challenge to the efficient implementation of pruned models on regular architectures
such as systolic arrays. A weight permutation scheme is proposed to tackle this problem. Through
permutation, the sparse weight matrix is further compressed into a small, dense format that
makes full use of the hardware resources. Experimental results show that the proposed method
effectively improves the matrix compression rate.
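To convey the idea of compressing a sparse matrix into a denser one, here is a simplified greedy
column-packing sketch in Python; the actual permutation scheme in the thesis is more elaborate,
so this is only an illustration of the principle.

    import numpy as np

    def pack_columns(W):
        # Greedily merge columns whose nonzero rows do not collide, so
        # several sparse columns share one dense column of the packed
        # matrix. Returns the packed matrix and, for each packed column,
        # the original column indices it holds (needed to route results).
        packed, groups = [], []
        for c in range(W.shape[1]):
            col = W[:, c]
            nz = col != 0
            for p, grp in zip(packed, groups):
                if not np.any((p != 0) & nz):   # no row collision
                    p[nz] = col[nz]
                    grp.append(c)
                    break
            else:
                packed.append(col.copy())
                groups.append([c])
        return np.stack(packed, axis=1), groups

The fewer columns the packed matrix has, the better the hardware utilization; permuting the
rows before packing increases the chance that columns can be merged without collisions.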
In the fourth part, we will further improve the throughput and energy efficiency of DL
accelerators with a flexible subword-level model compression method. The weight matrix is
quantized and pruned at the subword level; after pruning, most nonzero weights retain only one
nonzero subword. Two weights can then be merged and mapped to the same MAC unit as
long as their nonzero subwords do not occupy the same positions. The proposed method enhances
the flexibility of column packing and facilitates matrix compression, effectively improving the
compression performance.
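A sketch of the subword-level merging rule follows, with the bit widths assumed for
illustration:

    def subword_positions(w, total_bits=8, sub_bits=2):
        # Indices of the nonzero subwords of w (0 = most significant).
        n = total_bits // sub_bits
        mask = (1 << sub_bits) - 1
        return {i for i in range(n) if (w >> (sub_bits * (n - 1 - i))) & mask}

    def can_merge(w1, w2, total_bits=8, sub_bits=2):
        # Two weights may share one MAC unit if their nonzero subwords
        # never occupy the same position.
        return subword_positions(w1, total_bits, sub_bits).isdisjoint(
            subword_positions(w2, total_bits, sub_bits))

For example, can_merge(0b11000000, 0b00000011) returns True: the two weights keep their
nonzero subwords at opposite ends, so they can be packed into a single 8-bit MAC operand.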
Finally, we will discuss several possible directions for future work, including
the elimination of redundant computations in compressed DL models and data-independent compression.