Design of Convolutional Neural Networks Accelerator Based on Fast Filter Algorithm

WANG Wei; ZHOU Kaili; WANG Yichang; WANG Guang; YUAN Jun

doi:10.11999/JEIT190037

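College of Optoelectronic Engineering/International Semiconductor College, Chongqing University of Posts and Telecommunications, Chongqing 400065, China

Funds: National Natural Science Foundation of China (61404019); Major Theme Special Project of the Integrated Circuit Industry in Chongqing (cstc2018jszx-cyztzx0211, cstc2018jszx-cyztzx0217)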

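About the authors:
WANG Wei: male, born in 1967, postdoctoral researcher, professor; research interest: integrated circuit design

ZHOU Kaili: female, born in 1991, master's student; research interest: digital integrated circuit design

WANG Yichang: male, born in 1996, master's student; research interest: analog integrated circuit design

WANG Guang: male, born in 1994, master's student; research interest: semiconductor optoelectronic device design

YUAN Jun: male, born in 1984, Ph.D., associate professor; research interest: mixed-signal integrated circuit design

Corresponding author:
ZHOU Kaili, 2508005354@qq.com

CLC number: TN432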
Publication history
- Received: 2019-01-15
- Revised: 2019-03-20
- Published online: 2019-05-23
- Issue date: 2019-11-01

Abstract

To reduce the computational load of Convolutional Neural Networks (CNNs), a two-dimensional fast filtering algorithm is introduced into the CNN, and a hardware architecture for layer-by-layer CNN acceleration on an FPGA is proposed. First, a line buffer loop control unit is designed using the loop transformation method. It manages the input feature map data across different convolution windows and different layers, and it raises a flag signal that starts the convolution calculation acceleration unit, which realizes layer-by-layer acceleration. Second, a convolution calculation acceleration unit based on the 4-parallel fast filtering algorithm is designed. It is implemented as a low-complexity parallel filtering structure composed of several small filters. The accelerator circuit is tested on the MNIST handwritten digit set. The results show that on the Xilinx Kintex-7 platform with a 100 MHz input clock, the circuit reaches a computational performance of 20.49 GOPS and a recognition rate of 98.68%. Reducing the amount of computation in the CNN therefore improves the computational performance of the circuit.

Key words: Convolutional Neural Network (CNN); Fast filter algorithm; FPGA; Parallel structure
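The 4-parallel fast filtering unit named in the abstract is not detailed on this page. As a rough illustration of the general idea only, the following Python sketch nests the textbook 2-parallel fast FIR decomposition twice to obtain a 4-parallel form, in which nine sub-filters replace the sixteen products of a direct 4-phase polyphase split. The names `direct_conv`, `ffa_conv`, and `mult_count` are hypothetical, the example is one-dimensional rather than two-dimensional, and it assumes input and kernel lengths divisible by 4. It is a sketch under these assumptions, not the authors' circuit.

```python
def direct_conv(x, h):
    """Reference linear convolution; costs len(x) * len(h) multiplications."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def ffa_conv(x, h, levels):
    """Linear convolution via nested 2-parallel fast FIR decomposition.

    levels=1 gives the 2-parallel form, levels=2 the 4-parallel form.
    len(x) and len(h) must both be divisible by 2**levels.
    """
    if levels == 0:
        return direct_conv(x, h)
    if len(x) % 2 or len(h) % 2:
        raise ValueError("lengths must be divisible by 2**levels")
    # Polyphase split into even and odd samples.
    x0, x1 = x[0::2], x[1::2]
    h0, h1 = h[0::2], h[1::2]
    # Three half-length sub-filters instead of four.
    a = ffa_conv(x0, h0, levels - 1)
    b = ffa_conv(x1, h1, levels - 1)
    xs = [p + q for p, q in zip(x0, x1)]
    hs = [p + q for p, q in zip(h0, h1)]
    c = ffa_conv(xs, hs, levels - 1)
    n = len(a)
    y = [0.0] * (2 * n + 1)
    for k in range(n):
        y[2 * k] += a[k]                 # even phase: X0*H0 ...
        y[2 * k + 2] += b[k]             # ... plus X1*H1 delayed by two samples
        y[2 * k + 1] = c[k] - a[k] - b[k]  # odd phase: X0*H1 + X1*H0
    return y


def mult_count(len_x, len_h, levels):
    """Multiplications used by ffa_conv: 3 sub-problems per level, halved lengths."""
    return 3 ** levels * (len_x >> levels) * (len_h >> levels)


x = [1, 2, 3, 4, 5, 6, 7, 8]
h = [1, 2, 3, 4]
y_fast = ffa_conv(x, h, levels=2)   # 4-parallel form
y_ref = direct_conv(x, h)
print(y_fast == y_ref, mult_count(8, 4, 0), mult_count(8, 4, 2))
```

For this 8-sample input and 4-tap kernel, the nested form needs 18 multiplications where the direct form needs 32, at the cost of extra additions before and after the sub-filters.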

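Figure 1 Structure of the convolutional neural network

Figure 2 Convolution calculation process of the convolutional layer

Figure 3 Layer-by-layer acceleration hardware architecture of the convolutional neural network

Figure 4 Line buffer loop control unit

Figure 5 Structure of the convolution calculation acceleration unit

Figure 6 Detailed circuits of each part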
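The line buffer loop control unit of Figure 4 is not reproduced on this page. To illustrate what a line buffer does in general, the sketch below streams pixels in raster order through k-1 row buffers and a k x k shift-register window, and emits a window as soon as enough pixels have arrived. The function name `stream_windows` and the 4 x 4 test image are hypothetical. This is a generic software model under those assumptions, not the authors' loop-controlled design.

```python
def stream_windows(pixels, width, k):
    """Yield (row, col, window) for each valid k x k window of a raster-order stream.

    row and col are the coordinates of the window's top-left pixel.
    """
    line_buf = [[0] * width for _ in range(k - 1)]  # the k-1 previous rows
    window = [[0] * k for _ in range(k)]            # k x k shift-register window
    for idx, p in enumerate(pixels):
        r, c = divmod(idx, width)
        # New column entering the window: buffered pixels above (r, c), then p.
        column = [line_buf[i][c] for i in range(k - 1)] + [p]
        # Shift the window left by one column and append the new column.
        for i in range(k):
            window[i] = window[i][1:] + [column[i]]
        # Move the buffered values at column c up by one row and store p.
        for i in range(k - 2):
            line_buf[i][c] = line_buf[i + 1][c]
        if k > 1:
            line_buf[k - 2][c] = p
        # A window is valid once k rows and k columns of this row have arrived.
        if r >= k - 1 and c >= k - 1:
            yield r - k + 1, c - k + 1, [row[:] for row in window]


# 4 x 4 image with pixel values 0..15, 3 x 3 windows.
wins = list(stream_windows(range(16), width=4, k=3))
for row, col, win in wins:
    print(row, col, win)
```

Each pixel is read from the input stream once and reused from the buffers for every window that covers it, which is the usual reason for placing a line buffer in front of a convolution unit.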

Table 1 Comparison of the MATLAB and FPGA implementations

Type	Time (ms/frame)	Error rate (bad frames/10000 frames)	Data type
.m file	0.7854	1.19%	Double-precision floating point
.v file	0.01986	1.32%	16-bit fixed point
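The figures in Table 1 can be related to the abstract with a little arithmetic. The sketch below uses only the four numbers in the table; the variable names are hypothetical, and treating the error-rate column as the complement of the recognition rate is an assumption, although it agrees with the 98.68% quoted in the abstract.

```python
# Values copied from Table 1.
matlab_ms_per_frame = 0.7854    # .m file, double precision
fpga_ms_per_frame = 0.01986     # .v file, 16-bit fixed point
matlab_error_pct = 1.19
fpga_error_pct = 1.32

# Speedup of the FPGA implementation over the MATLAB implementation.
speedup = matlab_ms_per_frame / fpga_ms_per_frame

# Assumption: recognition rate = 100% - error rate.
fpga_recognition_pct = 100.0 - fpga_error_pct

# Accuracy cost of moving from double precision to 16-bit fixed point,
# in percentage points.
fixed_point_penalty = fpga_error_pct - matlab_error_pct

print(round(speedup, 1), round(fpga_recognition_pct, 2), round(fixed_point_penalty, 2))
```

On these numbers the FPGA implementation is roughly 39.5 times faster per frame, and the move to 16-bit fixed point costs about 0.13 percentage points of accuracy.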

Table 2 Performance comparison of FPGA implementations of convolutional neural networks

Parameter	Ref. [4]	Ref. [6]	Ref. [7]	This work
FPGA	Virtex-7 xc7vx485t	Zynq zc702	Virtex-7 xc7vx485t	Kintex-7 xc7k325t
Frequency (MHz)	100	166	150	100
Time (ms)	2.6368	0.1510	0.0254	0.0199
BRAM	27	96	0	30
DSP	20	95	638	284
FF	54075	27664	66346	36973
LUT	14832	38836	51125	51748
Recognition rate (%)	98.62	99.01	96.80	98.68
GOPS	1.58	2.70	15.87	20.49
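The page does not state how the GOPS figures in Table 2 were obtained. As a sanity check under the assumption that GOPS equals operations per frame divided by time per frame, the sketch below derives the operation count each row implies and the throughput per DSP slice. The names `table2`, `implied_mops_per_frame`, and `gops_per_dsp` are hypothetical.

```python
# Rows copied from Table 2: (time per frame in ms, DSP slices, GOPS).
table2 = {
    "Ref. [4]": (2.6368, 20, 1.58),
    "Ref. [6]": (0.1510, 95, 2.70),
    "Ref. [7]": (0.0254, 638, 15.87),
    "This work": (0.0199, 284, 20.49),
}


def implied_mops_per_frame(time_ms, gops):
    # Assumption: GOPS = operations per frame / time per frame.
    # (gops * 1e9 op/s) * (time_ms * 1e-3 s) / 1e6 = gops * time_ms,
    # expressed in millions of operations per frame.
    return gops * time_ms


def gops_per_dsp(dsp, gops):
    # Throughput delivered per DSP slice.
    return gops / dsp


summary = {
    name: (implied_mops_per_frame(t, g), gops_per_dsp(d, g))
    for name, (t, d, g) in table2.items()
}
for name, (mops, per_dsp) in summary.items():
    print(f"{name}: {mops:.3f} MOP/frame, {per_dsp:.4f} GOPS/DSP")
```

Under this assumption, the rows for Refs. [6] and [7] and for this work imply roughly 0.40 to 0.41 million operations per frame, while Ref. [4] implies about 4.2 million. That gap suggests a different network or a different way of counting operations, but the page does not say which.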

References (15)

[1] ZHANG Chen, LI Peng, SUN Guangyu, et al. Optimizing FPGA-based accelerator design for deep convolutional neural networks[C]. 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, USA, 2015: 161–170.

[2] KRIZHEVSKY A, SUTSKEVER I, and HINTON G E. ImageNet classification with deep convolutional neural networks[C]. The 25th International Conference on Neural Information Processing Systems, Lake Tahoe, USA, 2012: 1097–1105.

[3] DONG Han, LI Tao, LENG Jiabing, et al. GCN: GPU-based cube CNN framework for hyperspectral image classification[C]. The 2017 46th International Conference on Parallel Processing, Bristol, UK, 2017: 41–49.

[4] GHAFFARI S and SHARIFIAN S. FPGA-based convolutional neural network accelerator design using high level synthesize[C]. The 2016 2nd International Conference of Signal Processing and Intelligent Systems, Tehran, Iran, 2016: 1–6.

[5] CHEN Y H, KRISHNA T, EMER J S, et al. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks[J]. IEEE Journal of Solid-State Circuits, 2017, 52(1): 127–138. doi: 10.1109/JSSC.2016.2616357

[6] FENG Gan, HU Zuyi, CHEN Song, et al. Energy-efficient and high-throughput FPGA-based accelerator for Convolutional Neural Networks[C]. The 2016 13th IEEE International Conference on Solid-State and Integrated Circuit Technology, Hangzhou, China, 2016: 624–626.

[7] ZHOU Yongmei and JIANG Jingfei. An FPGA-based accelerator implementation for deep convolutional neural networks[C]. The 2015 4th International Conference on Computer Science and Network Technology, Harbin, China, 2015: 829–832.

[8] HOSEINI F, SHAHBAHRAMI A, and BAYAT P. An efficient implementation of deep convolutional neural networks for MRI segmentation[J]. Journal of Digital Imaging, 2018, 31(5): 738–747. doi: 10.1007/s10278-018-0062-2

[9] HUANG Jiahao, WANG Tiejun, ZHU Xuhui, et al. A parallel optimization of the fast algorithm of convolution neural network on CPU[C]. The 2018 10th International Conference on Measuring Technology and Mechatronics Automation, Changsha, China, 2018: 5–9.

[10] LAVIN A and GRAY S. Fast algorithms for convolutional neural networks[C]. 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016: 4013–4021.

[11] VINCHURKAR P P, RATHKANTHIWAR S V, and KAKDE S M. HDL implementation of DFT architectures using winograd fast Fourier transform algorithm[C]. The 2015 5th International Conference on Communication Systems and Network Technologies, Gwalior, India, 2015: 397–401.

[12] WANG Xuan, WANG Chao, and ZHOU Xuehai. Work-in-progress: WinoNN: Optimising FPGA-based neural network accelerators using fast winograd algorithm[C]. 2018 International Conference on Hardware/Software Codesign and System Synthesis, Turin, Italy, 2018: 1–2.

[13] NAITO Y, MIYAZAKI T, and KURODA I. A fast full-search motion estimation method for programmable processors with a multiply-accumulator[C]. 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, Atlanta, USA, 1996: 3221–3224.

[14] JIANG Jingfei, HU Rongdong, and LUJÁN M. A flexible memory controller supporting deep belief networks with fixed-point arithmetic[C]. The 2013 IEEE 27th International Symposium on Parallel and Distributed Processing Workshops and PhD Forum, Cambridge, USA, 2013: 144–152.

[15] LI Sicheng, WEN Wei, WANG Yu, et al. An FPGA design framework for CNN sparsification and acceleration[C]. The 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines, Napa, USA, 2017: 28.
