THESIS
2018
xvi, 136 pages : illustrations ; 30 cm
Abstract
With the development of artificial intelligence (AI), deep learning has become the mainstream approach for a wide range of machine learning applications, including computer vision (CV), natural language processing (NLP), and robotics. While deep learning outperforms many other state-of-the-art algorithms in the AI domain, such as the support vector machine (SVM) and the scale-invariant feature transform (SIFT), its high computational and storage complexity poses a challenge to widespread deployment on energy-constrained embedded platforms. Accordingly, this thesis focuses on exploring energy-efficient solutions for deep learning accelerators. A deep learning algorithm typically consists of two phases: the training phase and the inference phase. From the consumer's point of view, energy efficiency plays a critical role in the deployment of deep learning; as a result, only the inference phase is considered for hardware acceleration in this thesis. Conventional computer architecture design focuses solely on hardware exploration, while the algorithms are kept untouched and regarded as standard benchmarks for verifying the performance of the architecture. Deep learning algorithms, however, are highly resilient and tolerate a large amount of approximate computing. Therefore, to achieve optimal energy efficiency in the hardware implementation, co-optimization at both the algorithmic level and the architectural level becomes necessary. This thesis bridges the two: the algorithm-level optimizations are targeted at improving the energy efficiency of hardware execution, while the specialized hardware architectures are designed to better exploit the properties of the optimized algorithms.
Firstly, we present a general high-performance Network-on-Chip (NoC) router design, namely BiLink. Contemporary deep learning accelerators commonly adopt a tile-based architecture to meet the high internal bandwidth requirements of deep learning applications, so a high-performance on-chip router is critical for transferring activations between the central memory and the processing tiles. In BiLink, an extra link stage is inserted between two neighboring routers, allowing up to four flits to be transferred over the link stage in each clock cycle under both even and uneven traffic patterns.
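As a rough, illustrative sketch of this throughput claim (not the actual BiLink microarchitecture, whose details are given in the thesis itself), the Python snippet below models a link stage with four flit slots shared between the two directions of a bidirectional channel, so that four flits can be forwarded per cycle whether the traffic splits evenly (2+2) or unevenly (4+0); the slot count and the round-robin arbitration are assumptions made only for this example.

    from collections import deque

    def link_stage_cycle(eastbound, westbound, slots=4):
        # Forward up to `slots` flits in one cycle, sharing the slots between
        # the two directions of the link stage (illustrative policy only).
        queues = [deque(eastbound), deque(westbound)]
        forwarded, turn = [], 0
        while slots > 0 and any(queues):
            if queues[turn % 2]:
                forwarded.append(queues[turn % 2].popleft())
                slots -= 1
            turn += 1
        return forwarded

    # Even traffic: two flits per direction, all four forwarded this cycle.
    print(link_stage_cycle(["e0", "e1"], ["w0", "w1"]))
    # Uneven traffic: one busy direction can still use all four slots.
    print(link_stage_cycle(["e0", "e1", "e2", "e3"], []))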
We then focus on domain-specific architectures for deep learning applications. Concretely, we first address the high computational complexity of deep learning algorithms: conventional low-rank approximation (LRA) is adopted to bypass redundant operations in the inference phase, and a single-instruction-multiple-data (SIMD) architecture is designed to exploit the reduced operation count to improve both throughput and energy efficiency.
Next, we propose a novel end-to-end training algorithm that improves on the conventional LRA approach with better computational efficiency, i.e., less computation overhead and faster training. To address hardware scalability and the inherent sparsity of deep learning algorithms, we then propose SparseNN, a fully distributed architecture built around a dedicated on-chip network.
Thirdly, we address the large memory footprint of deep learning algorithms by compressing the weights of the neural network. To improve hardware efficiency, we enhance the conventional hashing-based compression technique by introducing an additional level of spatial locality. An FPGA prototype of the SIMD architecture demonstrates that, with this compression technique, a higher inference throughput can be achieved than with direct implementations on the CPU and GPU.
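As one plausible reading of this scheme (the block size, hash functions, and indexing below are illustrative assumptions, not the exact technique evaluated on the FPGA), a hashing-compressed layer stores only a small shared parameter array and maps each virtual weight position into it; adding spatial locality means hashing a whole block of neighboring positions to one contiguous region, so that weights fetched together by the SIMD lanes also sit together in memory:

    import numpy as np

    def hashed_weight(i, j, shared, seed=0):
        # Baseline hashing compression: each position (i, j) is mapped
        # independently into the shared parameter array.
        return shared[hash((i, j, seed)) % shared.size]

    def block_hashed_weight(i, j, shared, block=16, seed=0):
        # Spatially local variant: all positions in the same block of columns
        # hash to one contiguous region, preserving locality for SIMD fetches.
        regions = shared.size // block
        base = (hash((i, j // block, seed)) % regions) * block
        return shared[base + (j % block)]

    shared = np.random.randn(4096)            # compressed parameter budget
    w = block_hashed_weight(3, 41, shared)    # virtual weight W[3, 41]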
Fourthly, we enhance SparseNN by combining three levels of sparsity: input activation sparsity, output activation sparsity, and weight sparsity. A new compression algorithm, SparserNN, is presented to further reduce the computational complexity of neural networks. The algorithm is also evaluated on a more realistic and challenging benchmark, namely AlexNet on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC).
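As a schematic sketch of how the three levels of sparsity remove work in a single layer (the zero-skipping policy shown here is an illustration of the idea, not the SparserNN algorithm itself), only non-zero input activations are read, only non-zero weights are multiplied, and outputs already known to be masked out, for example by a predicted ReLU zero, are never computed:

    import numpy as np

    def sparse_layer(x, W, out_mask):
        # Compute y = relu(W @ x) while skipping zero inputs, zero weights,
        # and outputs that the mask marks as not needed.
        y = np.zeros(W.shape[0])
        nz_in = np.nonzero(x)[0]                  # input activation sparsity
        for o in np.nonzero(out_mask)[0]:         # output activation sparsity
            acc = 0.0
            for i in nz_in:
                if W[o, i] != 0.0:                # weight sparsity
                    acc += W[o, i] * x[i]
            y[o] = max(acc, 0.0)                  # ReLU
        return y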
Finally, we discuss several possible directions for future work, including the microarchitecture design of SparserNN and the architectural exploration of advanced neural network models.