Design of Convolutional Neural Networks Hardware Acceleration Based on FPGA

Huabiao QIN, Qinping CAO

Citation: Huabiao QIN, Qinping CAO. Design of Convolutional Neural Networks Hardware Acceleration Based on FPGA[J]. Journal of Electronics & Information Technology, 2019, 41(11): 2599-2605. doi: 10.11999/JEIT190058


doi: 10.11999/JEIT190058
Funds: The Science and Technology Project of Guangdong Province (2014B090910002)
Article information
    Author biographies:

    Huabiao QIN: Male, born in 1967, professor. His research interests include intelligent information processing, wireless communication networks, embedded systems, and FPGA design

    Qinping CAO: Male, born in 1995, master's student. His research interest is integrated circuit design

    Corresponding author:

    Huabiao QIN, eehbqin@scut.edu.cn

  • CLC number: TP331

  • Abstract: To address the heavy computational load and long computation time of Convolutional Neural Networks (CNN), this paper proposes a CNN hardware accelerator based on a Field-Programmable Gate Array (FPGA). First, by analyzing the forward computation of the convolutional layer in depth and exploring the parallelism of its operations, a hardware architecture with input-channel parallelism, output-channel parallelism, and deeply pipelined convolution windows is designed. Within this architecture, a fully parallel multiply-add tree module is designed to accelerate the convolution operations, together with an efficient window buffer module that pipelines the convolution windows. Finally, experimental results show that the proposed accelerator achieves an energy efficiency of 32.73 GOPS/W, 34% higher than existing solutions, while delivering a performance of 317.86 GOPS.
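The multiply-add tree and the two levels of channel parallelism described in the abstract can be modeled in software. The sketch below is a plain-Python illustration with hypothetical list shapes; the actual design is fixed-point RTL on an FPGA, where every product maps to a DSP multiplier and every reduction to an adder-tree level, not sequential code:

```python
def adder_tree_sum(values):
    """Pairwise (tree) reduction, mirroring a hardware adder tree:
    ceil(log2(n)) adder levels instead of one long sequential chain."""
    values = list(values)
    while len(values) > 1:
        paired = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:            # odd element passes straight through
            paired.append(values[-1])
        values = paired
    return values[0]

def conv_window_all_outputs(window, kernels):
    """One output pixel for every output channel.

    window:  C_in x K x K nested lists - the current convolution window
             over all input channels.
    kernels: C_out x C_in x K x K nested lists.

    In hardware, all products for one kernel are formed in parallel
    (input-channel parallelism) and each kernel has its own multiply-add
    tree (output-channel parallelism); this model evaluates them
    sequentially but computes the same values.
    """
    results = []
    for kernel in kernels:                     # one tree per output channel
        products = [
            window[c][y][x] * kernel[c][y][x]
            for c in range(len(window))
            for y in range(len(window[c]))
            for x in range(len(window[c][y]))
        ]
        results.append(adder_tree_sum(products))
    return results
```

Because addition is associative, the tree reduction yields the same result as sequential accumulation while shortening the critical path from n-1 chained adders to about log2(n) levels.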
  • Figure 1  Computation process of the convolutional layer

    Figure 2  Convolution computation for one input channel

    Figure 3  Parallel computation of convolution windows over N input channels

    Figure 4  Parallel accumulator operation

    Figure 5  Classic adder tree

    Figure 6  Adder tree designed in this paper

    Figure 7  Multiply-add tree module

    Figure 8  Convolution window data reuse

    Figure 9  Window buffer structure

    Figure 10  Window buffer timing

    Figure 11  Output-channel parallel module

    Figure 12  Structure of the parallel acceleration scheme

    Figure 13  Convolution window pipeline

    Figure 14  Performance comparison of FPGA, CPU, and GPU
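The window data reuse and window cache shown in Figures 8-10 amount to a line-buffer scheme: only the last K-1 image rows are kept in on-chip memory, so each overlapping K×K window is produced without re-reading pixels from external memory. A minimal software model, assuming stride 1 and row-by-row pixel arrival (names and shapes are illustrative, not from the paper):

```python
from collections import deque

def sliding_windows(rows, k=3):
    """Line-buffer-style generator of k*k convolution windows.

    A hardware window buffer holds the most recent k rows in on-chip RAM
    and shifts in one new pixel per cycle; here, a deque of the last k
    rows plays the role of those line buffers.
    """
    line_buffer = deque(maxlen=k)          # last k rows, oldest dropped
    for row in rows:
        line_buffer.append(row)
        if len(line_buffer) < k:
            continue                       # pipeline still filling
        for x in range(len(row) - k + 1):  # stride-1 horizontal slide
            yield [r[x:x + k] for r in line_buffer]
```

Adjacent windows share k*(k-1) of their k*k pixels, which is exactly the reuse the window cache exploits to keep the multiply-add trees fed every cycle.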

    Table 1  Structure parameters of the convolutional neural network

    Layer                 | Structure                         | Parameters
    Conv layer 1          | 3×3 kernels, 15 kernels, stride 1 | 150
    Activation layer 1    | –                                 | 0
    Pooling layer 1       | 2×2 pooling, stride 2             | 0
    Conv layer 2          | 6×6 kernels, 20 kernels, stride 1 | 10820
    Activation layer 2    | –                                 | 0
    Pooling layer 2       | 2×2 pooling, stride 2             | 0
    Fully connected layer | 10 output neurons                 | 3210
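The parameter counts in Table 1 follow from kernel size, channel counts, and one bias per output unit. A quick check (the single-channel input image and the 320-element flattened input to the fully connected layer are inferred from the counts, not stated in the table):

```python
def conv_params(k, c_in, c_out):
    """Convolution layer: k*k weights per input/output channel pair,
    plus one bias per output channel."""
    return k * k * c_in * c_out + c_out

def fc_params(n_in, n_out):
    """Fully connected layer: one weight per input/output pair, plus biases."""
    return n_in * n_out + n_out

# Conv layer 1: 3x3 kernels, 15 kernels, single-channel input (assumed)
assert conv_params(3, 1, 15) == 150
# Conv layer 2: 6x6 kernels, 15 -> 20 channels
assert conv_params(6, 15, 20) == 10820
# Fully connected: 320 inputs (inferred) -> 10 outputs
assert fc_params(320, 10) == 3210
```

Activation and pooling layers have no trainable weights, matching their zero entries in the table.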

    Table 2  FPGA resource utilization

    Resource            | Used/Total      | Utilization (%)
    ALMs                | 89423/113560    | 79
    Block memory (bits) | 730151/12492800 | 6
    DSPs                | 342/342         | 100

    Table 3  Comparison with FPGA accelerators in the literature

                               | Ref. [7]     | Ref. [11]    | Ref. [12]       | This work
    FPGA                       | Zynq XC7Z045 | Zynq XC7Z045 | Virtex-7 VX690T | Cyclone V 5CGXF
    Frequency (MHz)            | 150          | 100          | 150             | 100
    DSPs used                  | 780 (86.7%)  | 824 (91.6%)  | 1376 (38%)      | 342 (100%)
    Quantization               | 16-bit fixed | 16-bit fixed | 16-bit fixed    | 16-bit fixed
    Power (W)                  | 9.630        | 9.400        | 25.000          | 9.711
    Performance (GOPS)         | 136.97       | 229.50       | 570.00          | 317.86
    Energy efficiency (GOPS/W) | 14.22        | 24.42        | 22.80           | 32.73
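The abstract's 34% energy-efficiency claim can be reproduced from Table 3: efficiency is performance divided by power, and the comparison is against the best prior design in the table (Ref. [11]):

```python
# (performance in GOPS, power in W) for each design in Table 3
designs = {
    "Ref. [7]":  (136.97,  9.630),
    "Ref. [11]": (229.50,  9.400),
    "Ref. [12]": (570.00, 25.000),
    "This work": (317.86,  9.711),
}

# Energy efficiency (GOPS/W) = performance / power
eff = {name: gops / watt for name, (gops, watt) in designs.items()}

best_prior = max(v for n, v in eff.items() if n != "This work")
improvement = (eff["This work"] - best_prior) / best_prior * 100  # about 34%
```

Note that Ref. [12] has the highest raw performance (570 GOPS) but at 25 W its efficiency (22.80 GOPS/W) falls below both Ref. [11] and this work, which is why the paper reports efficiency rather than throughput as its headline metric.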
  • [1] LIU Weibo, WANG Zidong, LIU Xiaohui, et al. A survey of deep neural network architectures and their applications[J]. Neurocomputing, 2017, 234: 11–26. doi: 10.1016/j.neucom.2016.12.038.
    [2] HAN Song, MAO Huizi, and DALLY W J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding[J]. arXiv preprint arXiv:1510.00149, 2015.
    [3] COATES A, HUVAL B, WANG Tao, et al. Deep learning with COTS HPC systems[C]. Proceedings of the 30th International Conference on Machine Learning, Atlanta, USA, 2013: III-1337–III-1345.
    [4] JOUPPI N P, YOUNG C, PATIL N, et al. In-datacenter performance analysis of a tensor processing unit[C]. Proceedings of the 44th Annual International Symposium on Computer Architecture, Toronto, Canada, 2017: 1–12. doi: 10.1145/3079856.3080246.
    [5] MOTAMEDI M, GYSEL P, AKELLA V, et al. Design space exploration of FPGA-based deep convolutional neural networks[C]. Proceedings of the 21st Asia and South Pacific Design Automation Conference, Macau, China, 2016: 575–580. doi: 10.1109/ASPDAC.2016.7428073.
    [6] ZHANG Jialiang and LI Jing. Improving the performance of OpenCL-based FPGA accelerator for convolutional neural network[C]. Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, USA, 2017: 25–34. doi: 10.1145/3020078.3021698.
    [7] QIU Jiantao, WANG Jie, YAO Song, et al. Going deeper with embedded FPGA platform for convolutional neural network[C]. Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, USA, 2016: 26–35. doi: 10.1145/2847263.2847265.
    [8] YU Qi. Deep learning accelerator design and implementation based on FPGA[D]. [Master dissertation], University of Science and Technology of China, 2016: 30–38.
    [9] LECUN Y, BOTTOU L, BENGIO Y, et al. Gradient-based learning applied to document recognition[J]. Proceedings of the IEEE, 1998, 86(11): 2278–2324. doi: 10.1109/5.726791.
    [10] ABADI M, BARHAM P, CHEN Jianmin, et al. TensorFlow: A system for large-scale machine learning[C]. Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, Savannah, USA, 2016: 265–283.
    [11] XIAO Qingcheng, LIANG Yun, LU Liqiang, et al. Exploring heterogeneous algorithms for accelerating deep convolutional neural networks on FPGAs[C]. Proceedings of the 54th Annual Design Automation Conference, Austin, USA, 2017: 62. doi: 10.1145/3061639.3062244.
    [12] SHEN Junzhong, HUANG You, WANG Zelong, et al. Towards a uniform template-based architecture for accelerating 2D and 3D CNNs on FPGA[C]. Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, USA, 2018: 97–106. doi: 10.1145/3174243.3174257.
Publication history
  • Received: 2019-01-22
  • Revised: 2019-06-10
  • Available online: 2019-06-20
  • Issue date: 2019-11-01
