Efficient Hardware Optimization Strategies for Deep Neural Networks Acceleration Chip

Meng ZHANG, Jingwei ZHANG, Guoqing LI, Ruixia WU, Xiaoyang ZENG

Citation: Meng ZHANG, Jingwei ZHANG, Guoqing LI, Ruixia WU, Xiaoyang ZENG. Efficient Hardware Optimization Strategies for Deep Neural Networks Acceleration Chip[J]. Journal of Electronics & Information Technology, 2021, 43(6): 1510-1517. doi: 10.11999/JEIT210002

doi: 10.11999/JEIT210002
Funds: The National Key R&D Program of China (2018YFB2202703), Natural Science Foundation of Jiangsu Province (BK20201145)
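Author biographies:

    Meng ZHANG: male, born in 1964, research professor. Research interests: digital signal processing, deep learning algorithms, and hardware acceleration.

    Jingwei ZHANG: male, born in 1997, master's student. Research interest: deep learning hardware accelerator design.

    Guoqing LI: male, born in 1991, PhD student. Research interests: computer vision and deep learning hardware accelerator design.

    Ruixia WU: female, born in 1996, master's student. Research interest: deep learning algorithms.

    Xiaoyang ZENG: male, born in 1972, professor. Research interest: energy-efficient systems-on-chip (SoC).

Corresponding author:

    Jingwei ZHANG, zhangjingwei@seu.edu.cn

CLC number: TN79.1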

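Abstract: Deploying lightweight neural networks on low-power platforms is an effective solution for artificial intelligence (AI) and Internet of Things (IoT) applications such as unmanned aerial vehicle (UAV) detection and autonomous driving. However, building a deep neural network (DNN) accelerator that delivers both high accuracy and low latency under limited resources is very challenging. To address this problem, this paper proposes a series of efficient hardware optimization strategies. A stackable shared processing engine (PE) is built to balance the inconsistent data reuse and memory access patterns of different convolutions. An adjustable loop count and a channel enhancement method are proposed to effectively widen the access bandwidth between the accelerator and external memory and to improve the computational efficiency of the shallow layers of the DNN. The preloading workflow is optimized to raise the overall parallelism of the heterogeneous system. Verified on the Xilinx Ultra96 V2 board, the proposed strategies effectively improve DNN accelerator designs of the iSmart3-SkyNet and SkrSkr-SkyNet type. The results show that the optimized accelerator processes 78.576 frames per second with an energy consumption of 0.068 J per image.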
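Figure 1  Roofline model analysis of SkyNet on the iSmart3-SkyNet accelerator

Figure 2  Schematic of the system, computing module, and line buffer structure

Figure 3  Illustration of the channel enhancement flow

Figure 4  Comparison of the three workflows

Figure 5  Roofline model analysis of SkyNet on the optimized accelerator

Figure 6  Performance of the iSmart3 and Skrskr accelerators before and after optimization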
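Figures 1 and 5 rely on the Roofline model, which bounds attainable throughput by the smaller of peak compute and memory bandwidth times arithmetic intensity. The following is a minimal sketch of that bound only. The function names and all numeric values are hypothetical placeholders chosen for illustration; they are not measurements from the paper or specifications of the Ultra96 V2.

```python
def attainable_gops(peak_gops, bandwidth_gb_s, ops_per_byte):
    """Roofline bound: min(peak compute, bandwidth * arithmetic intensity)."""
    return min(peak_gops, bandwidth_gb_s * ops_per_byte)


def ridge_point(peak_gops, bandwidth_gb_s):
    """Arithmetic intensity (ops/byte) at which a layer stops being memory-bound."""
    return peak_gops / bandwidth_gb_s


if __name__ == "__main__":
    # Hypothetical platform figures, NOT taken from the paper.
    peak = 100.0       # GOPS
    bandwidth = 4.0    # GB/s

    # Hypothetical per-layer arithmetic intensities in ops/byte.
    for label, intensity in [("low-intensity layer", 5.0), ("high-intensity layer", 50.0)]:
        bound = attainable_gops(peak, bandwidth, intensity)
        regime = "memory-bound" if intensity < ridge_point(peak, bandwidth) else "compute-bound"
        print(f"{label}: at most {bound:.1f} GOPS ({regime})")
```

In this model, widening the effective bandwidth to external memory raises the slanted part of the roof, which is why it helps layers whose arithmetic intensity lies below the ridge point.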

Table 1  SkyNet architecture and inference speed of each bundle

Bundle  Layer  Input size    Operation                 Computation, share (%)    Latency share (%)
#1      1      3×160×320     DW-Conv3                  119.61M, 20.6             33.90
        2      3×160×320     PW-Conv1
        3      48×160×320    POOLING
#2      4      48×80×160     DW-Conv3                  86.02M, 14.42             16.54
        5      48×80×160     PW-Conv1
        6      96×80×160     POOLING
#3      7      96×40×80      DW-Conv3                  61.75M, 10.36             6.23
        8      96×40×80      PW-Conv1
        9      192×40×80     POOLING
#4      10     192×20×40     DW-Conv3                  60.36M, 10.13             4.92
        11     192×20×40     PW-Conv1
#5      12     384×20×40     DW-Conv3                  160.05M, 26.85            12.43
        13     384×20×40     PW-Conv1
#6      Concatenate with layer-9 output                107.52M, 18.04            20.08
        14     1280×20×40    [Bypass] DW-Conv3
        15     1280×20×40    PW-Conv1
#7      16     96×20×40      PW-Conv1                  0.77M, 0.14               0.10
        17     10×20×40      Bounding-box regression                             0.16
CPU                                                                              5.64
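One way to read Table 1 is to compare each bundle's share of latency with its share of computation. The sketch below does that with the percentages copied from the table. The ratio itself and the names `BUNDLES` and `latency_to_compute_ratio` are illustrative assumptions, not a metric defined in the paper.

```python
# Per-bundle values copied from Table 1: (computation share %, latency share %).
BUNDLES = {
    "#1": (20.6, 33.90),
    "#2": (14.42, 16.54),
    "#3": (10.36, 6.23),
    "#4": (10.13, 4.92),
    "#5": (26.85, 12.43),
    "#6": (18.04, 20.08),
    "#7": (0.14, 0.10),
}


def latency_to_compute_ratio(bundles):
    """Return {bundle: latency share / computation share}.

    A value above 1 means the bundle consumes more of the runtime than its
    share of the arithmetic alone would suggest.
    """
    return {name: lat / comp for name, (comp, lat) in bundles.items()}


ratios = latency_to_compute_ratio(BUNDLES)

if __name__ == "__main__":
    # List bundles from least to most efficient under this rough measure.
    for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name}: {ratio:.2f}")
```

With these figures the first bundle has the largest ratio, which is consistent with the abstract's emphasis on improving the computational efficiency of the shallow layers.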

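Table 2  Comparison of the effects of the optimization strategies

Accelerator             iSmart3 [9]   SEUer A     Skrskr [10]   SEUer B
Network model           SkyNet        SkyNet      SkyNet        SkyNet
Quantization precision  A9/W11        A9/W11      A8/W6         A8/W6
Hardware platform       Ultra96V2     Ultra96V2   Ultra96V2     Ultra96V2
Accuracy (DJI)          0.716         0.724       0.731         0.731
Clock frequency (MHz)   215           215         300           300
DSPs                    329           287         360           360
LUTs (k)                54            54          56            46
FFs (k)                 60            70          68            51
Frame rate (fps)        25.05         37.393      52.429        78.576
GOPS/W                  3.21          5.95        7.22          11.19
Energy/Pic. (J)         0.289         0.135       0.129         0.068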
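The frame-rate and energy rows of Table 2 can be turned into relative gains with a few lines of arithmetic. The sketch below assumes that SEUer A is the optimized counterpart of iSmart3 and SEUer B that of Skrskr, a pairing inferred from the table layout and the abstract rather than stated in the table itself. The "implied power" (frame rate times energy per image) is a derived quantity, not a value reported in the source, and the names `TABLE2`, `PAIRS`, and `compare` are hypothetical.

```python
# Frame rate (fps) and energy per image (J), copied from Table 2.
TABLE2 = {
    "iSmart3": {"fps": 25.05, "energy_j": 0.289},
    "SEUer A": {"fps": 37.393, "energy_j": 0.135},
    "Skrskr": {"fps": 52.429, "energy_j": 0.129},
    "SEUer B": {"fps": 78.576, "energy_j": 0.068},
}

# Assumed baseline/optimized pairing (inferred, not stated in the table).
PAIRS = [("iSmart3", "SEUer A"), ("Skrskr", "SEUer B")]


def compare(baseline, optimized, table=TABLE2):
    """Return speedup, fractional energy saving, and implied average power."""
    b, o = table[baseline], table[optimized]
    return {
        "speedup": o["fps"] / b["fps"],
        "energy_saving": 1.0 - o["energy_j"] / b["energy_j"],
        # J/image * images/s = W; derived here, not reported in the source.
        "implied_power_w": o["fps"] * o["energy_j"],
    }


if __name__ == "__main__":
    for base, opt in PAIRS:
        result = compare(base, opt)
        print(
            f"{base} -> {opt}: "
            f"{result['speedup']:.2f}x fps, "
            f"{result['energy_saving']:.1%} less energy per image, "
            f"about {result['implied_power_w']:.2f} W implied"
        )
```

Under the assumed pairing, both optimized designs come out at roughly 1.5 times the frame rate of their baselines.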
    [1] WANG Wei, ZHOU Kaili, WANG Yichang, et al. Design of convolutional neural networks accelerator based on fast filter algorithm[J]. Journal of Electronics & Information Technology, 2019, 41(11): 2578–2584. doi: 10.11999/JEIT190037
    [2] ZHANG Xiaofan, WANG Junsong, ZHU Chao, et al. DNNBuilder: An automated tool for building high-performance DNN hardware accelerators for FPGAs[C]. 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), San Diego, USA, 2018: 1–8.
    [3] LI Huimin, FAN Xitian, JIAO Li, et al. A high performance FPGA-based accelerator for large-scale convolutional neural networks[C]. The 26th International Conference on Field Programmable Logic and Applications (FPL), Lausanne, Switzerland, 2016: 1–9.
    [4] REDMON J, DIVVALA S, GIRSHICK R, et al. You only look once: Unified, real-time object detection[C]. 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016: 779–788.
    [5] REN Shaoqing, HE Kaiming, GIRSHICK R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137–1149. doi: 10.1109/TPAMI.2016.2577031
    [6] TAN Mingxing, PANG Ruoming, and LE Q V. EfficientDet: Scalable and efficient object detection[C]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 10781–10790.
    [7] YU Yunxuan, WU Chen, ZHAO Tiandong, et al. OPU: An FPGA-based overlay processor for convolutional neural networks[J]. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2020, 28(1): 35–47. doi: 10.1109/TVLSI.2019.2939726
    [8] YU Yunxuan, ZHAO Tiandong, WANG Kun, et al. Light-OPU: An FPGA-based overlay processor for lightweight convolutional neural networks[C]. 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Seaside, USA, 2020: 122–132.
    [9] ZHANG Xiaofan, LU Haoming, HAO Cong, et al. SkyNet: A hardware-efficient method for object detection and tracking on embedded systems[J]. arXiv: 1909.09709, 2019.
    [10] JIANG W, LIU X, SUN H, et al. SkrSkr: DAC-SDC 2020 2nd place winner in FPGA track[EB/OL]. https://github.com/jiangwx/SkrSkr/, 2020.
    [11] ZHANG Chen, LI Peng, SUN Guangyu, et al. Optimizing FPGA-based accelerator design for deep convolutional neural networks[C]. 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, USA, 2015: 161–170.
    [12] HAO Cong, ZHANG Xiaofan, LI Yuhong, et al. FPGA/DNN Co-Design: An efficient design methodology for IoT intelligence on the edge[C]. The 56th ACM/IEEE Design Automation Conference (DAC), Las Vegas, USA, 2019: 1–6.
    [13] MOTAMEDI M, GYSEL P, AKELLA V, et al. Design space exploration of FPGA-based deep convolutional neural networks[C]. The 21st Asia and South Pacific Design Automation Conference (ASP-DAC), Macao, China, 2016: 575–580.
    [14] FAN Hongxiang, LIU Shuanglong, FERIANC M, et al. A real-time object detection accelerator with compressed SSDLite on FPGA[C]. 2018 International Conference on Field-Programmable Technology (FPT), Naha, Japan, 2018: 14–21.
    [15] LI Fanrong, MO Zitao, WANG Peisong, et al. A system-level solution for low-power object detection[C]. 2019 IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Korea (South), 2019: 2461–2468.
    [16] DONG Zhen, WANG Dequan, HUANG Qijing, et al. CoDeNet: Efficient deployment of input-adaptive object detection on embedded FPGAs[J]. arXiv: 2006.08357, 2020.
    [17] WU Di, ZHANG Yu, JIA Xijie, et al. A high-performance CNN processor based on FPGA for MobileNets[C]. The 29th International Conference on Field Programmable Logic and Applications (FPL), Barcelona, Spain, 2019: 136–143.
Publication history
  • Received: 2021-01-04
  • Revised: 2021-04-21
  • Published online: 2021-04-29
  • Issue date: 2021-06-18
