THESIS
2019
xi, 55 pages : illustrations (some color) ; 30 cm
Abstract
Although convolutional neural networks (CNNs) have been applied to a vast range of applications, deploying CNNs on a portable system is challenging due to the enormous data volume, intensive computation, and frequent memory access.
To reduce the data volume and memory traffic, many approaches have been proposed to lower CNN complexity, such as pruning and quantization. However, existing accelerator designs adopt channel-dimension tiling, which requires a regular channel count; after pruning, the channel count may become highly irregular, incurring heavy zero padding. As for quantization, simple aggressive bit reduction usually results in a large accuracy drop. To address these challenges, row-based tiling in the kernel dimension is adopted, which adapts to different kernel shapes and significantly reduces zero padding. Moreover, a configurable processing-unit (PU) design is developed that can dynamically group or split PUs to enable efficient resource sharing. For quantization, the recently proposed incremental network quantization (INQ) algorithm is adopted, which constrains weights to a low-bit power-of-2 format. We further propose an approximate-shifter-based processing element (PE) design as the building block of the PUs to facilitate the convolution computation. For evaluation, an RTL-based accelerator running INQ-quantized AlexNet is realized on a standalone FPGA. Compared with state-of-the-art designs, our accelerator achieves 1.87x higher performance, which demonstrates the efficiency of the proposed design methods.
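The power-of-2 weight format is what lets a shifter-based PE avoid multipliers: multiplying an activation by a weight ±2^e reduces to a bit shift. A minimal sketch of this idea, assuming an illustrative exponent range (the function names and parameters below are hypothetical, not the thesis's actual PE design):

```python
import numpy as np

def quantize_pow2(w, min_exp=-7, max_exp=0):
    """Round weights to the nearest signed power of 2 (INQ-style format).
    The exponent range [min_exp, max_exp] is an illustrative assumption."""
    sign = np.sign(w)
    mag = np.abs(w)
    exp = np.clip(np.round(np.log2(np.maximum(mag, 2.0 ** (min_exp - 1)))),
                  min_exp, max_exp)
    q = sign * 2.0 ** exp
    q[mag == 0] = 0.0  # exact zeros stay zero (pruned weights)
    return q

def shift_mul(act, exp):
    """Multiply an integer activation by 2**exp using shifts only,
    as a shifter-based PE would in place of a full multiplier."""
    return act << exp if exp >= 0 else act >> -exp
```

For example, a weight of 0.3 quantizes to 2^-2 = 0.25, so the PE computes the product of an activation with it as a right shift by 2.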
Apart from reducing the data volume, reducing the intensive computation of CNNs is also critical for acceleration. Spectral-domain convolution has been proposed to simplify the compute-intensive convolution layers. However, operating in the spectral domain introduces domain incompatibility with the other layers. To address this challenge, a spectral-domain approximate activation is proposed and combined with the recently proposed spectral-domain pooling to resolve the domain incompatibility. Lastly, the proposed activation algorithm is evaluated in TensorFlow on the CIFAR-10 dataset and achieves an approximately 3% accuracy improvement over the latest spectral-domain activation algorithms.
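The appeal of spectral-domain processing comes from the convolution theorem: element-wise multiplication in the frequency domain implements circular convolution in the spatial domain. A minimal NumPy sketch of that equivalence (illustrative only; it ignores the padding needed for linear convolution, and the pooling and activation layers whose incompatibility the proposed method targets):

```python
import numpy as np

def spectral_conv2d(x, k):
    """Circular 2D convolution via the convolution theorem:
    FFT both operands, multiply element-wise, inverse FFT."""
    h, w = x.shape
    return np.real(np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(k, s=(h, w))))

def circular_conv2d_direct(x, k):
    """Reference: direct circular convolution (O(n^2 k^2) per output map)."""
    h, w = x.shape
    y = np.zeros((h, w))
    for m in range(h):
        for n in range(w):
            for i in range(k.shape[0]):
                for j in range(k.shape[1]):
                    y[m, n] += k[i, j] * x[(m - i) % h, (n - j) % w]
    return y
```

The two functions agree to floating-point precision. An element-wise nonlinearity such as ReLU, however, has no equally simple frequency-domain form, which is exactly the domain incompatibility that a spectral-domain approximate activation must work around.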