THESIS
2021
1 online resource (x, 40 pages) : illustrations (some color)
Abstract
Convolutional neural networks (CNNs) have achieved performance near or exceeding that of humans
in computer vision, yet their large computational and memory requirements make
them difficult to deploy in both large data centers and embedded systems. One main
factor behind this computational and memory burden is the large number of floating point
(FP) operations. Fixed-point (FXP) quantization is a promising way to reduce the resource
requirements of CNNs, yet low bitwidth implementations require fine-tuning to
recover accuracy. Another way to reduce the number of arithmetic operations is to apply
fast algorithms such as the Winograd filtering algorithm. However, the numerical errors
of the Winograd filtering algorithm incur a CNN accuracy penalty when combined with low
bitwidth arithmetic.
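To illustrate the kind of fast algorithm referred to above, the sketch below works through the 1-D Winograd minimal filtering case F(2,3), which produces two outputs of a 3-tap filter with 4 multiplications instead of the 6 a direct computation needs; the extra additions and fractional transform constants are also where the numerical error mentioned above enters. The function name winograd_f23 and the test values are illustrative, not taken from the thesis.

    import numpy as np

    def winograd_f23(d, g):
        # F(2,3): two outputs of a 3-tap correlation over a 4-element
        # input tile using 4 multiplications (direct computation uses 6).
        d0, d1, d2, d3 = d
        g0, g1, g2 = g
        m1 = (d0 - d2) * g0
        m2 = (d1 + d2) * (g0 + g1 + g2) / 2
        m3 = (d2 - d1) * (g0 - g1 + g2) / 2
        m4 = (d1 - d3) * g2
        return np.array([m1 + m2 + m3, m2 - m3 - m4])

    # Check against the direct correlation y[i] = sum_k d[i + k] * g[k].
    d = np.array([1.0, 2.0, 3.0, 4.0])
    g = np.array([0.5, -1.0, 0.25])
    direct = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                       d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
    assert np.allclose(winograd_f23(d, g), direct)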
In this thesis, we propose a CNN accelerator that uses a novel block floating point
(BFP) scheme to reduce the bitwidth down to 10 bits while supporting the Winograd filtering
algorithm. First, we derive our block floating point processing element (PE) design from a
fused floating point dot-product unit. Our VLSI synthesis results show that our PE design
reduces area and power by 27.11% and 44.86%, respectively.
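As a rough, software-level sketch of the arithmetic such a PE performs (assuming the usual BFP convention that every value in a block shares one exponent and keeps an integer mantissa; the function bfp_dot and the example operands below are hypothetical, not the thesis's RTL design), a block-level dot product reduces to an integer multiply-accumulate followed by a single exponent addition, in contrast to the per-element alignment a floating point fused dot-product unit must perform:

    import numpy as np

    def bfp_dot(mant_a, exp_a, mant_b, exp_b):
        # Integer multiply-accumulate over the mantissas of two BFP blocks,
        # then one shared scale factor 2^(exp_a + exp_b) for the result.
        acc = int(np.dot(mant_a.astype(np.int64), mant_b.astype(np.int64)))
        return acc * 2.0 ** (exp_a + exp_b)

    # Hypothetical blocks: integer mantissas with shared exponents -8 and -7.
    a_mant, a_exp = np.array([37, -105, 64, 12]), -8
    b_mant, b_exp = np.array([-20, 55, 91, -3]), -7
    print(bfp_dot(a_mant, a_exp, b_mant, b_exp))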
Second, we develop our novel block floating point scheme to combine quantization
with the Winograd filtering algorithm. Our block floating point quantization enables integer
arithmetic for both Winograd encoding and channel accumulation within BFP blocks,
reducing the hardware cost of both operations.
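The sketch below shows one plausible form such a quantization step could take (the helper quantize_block, its 10-bit mantissa width, and the sample tile are assumptions for illustration, not the exact scheme of the thesis): once a block shares a single exponent, the Winograd input encoding, which consists only of additions and subtractions, and the channel-wise accumulation of partial sums can both operate directly on the integer mantissas.

    import numpy as np

    def quantize_block(x, mant_bits=10):
        # Quantize a block of FP values to integer mantissas plus one shared
        # exponent; mant_bits includes the sign bit.
        max_mag = np.max(np.abs(x))
        if max_mag == 0:
            return np.zeros_like(x, dtype=np.int32), 0
        shared_exp = int(np.ceil(np.log2(max_mag))) - (mant_bits - 1)
        mant = np.round(x * 2.0 ** (-shared_exp)).astype(np.int32)
        limit = 2 ** (mant_bits - 1) - 1
        return np.clip(mant, -limit - 1, limit), shared_exp

    # Winograd F(2,3) input encoding B^T d uses only adds/subtracts, so it can
    # run on the integer mantissas; partial sums within the block stay integer.
    tile = np.array([0.12, -0.48, 0.03, 0.31])
    mant, shared_exp = quantize_block(tile)
    encoded = np.array([mant[0] - mant[2], mant[1] + mant[2],
                        mant[2] - mant[1], mant[1] - mant[3]])
    print(mant, shared_exp, encoded)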
Third, we implement our PE design and block floating point scheme in an end-to-end
CNN accelerator on an FPGA. We evaluate three alternative BFP schemes and select BFP10+F2
Winograd as the best balance between accuracy and throughput. Compared with a
baseline FP16 design, our BFP quantization reduces LUT usage by 50.1%, registers by
48.3%, BRAM by 27.3%, and DSPs by 43.8%, while achieving a 32.1% higher clock frequency. Finally,
we perform case studies with different CNNs and show that the accuracy drop is within
1% of the FP32 network.