
Convolutional Neural Network Accelerator Architecture Design for Ultimate Edge Computing Scenario

WU Ruidong, LIU Bing, FU Ping, JI Xinglong, LU Wenshuai

Citation: WU Ruidong, LIU Bing, FU Ping, JI Xinglong, LU Wenshuai. Convolutional Neural Network Accelerator Architecture Design for Ultimate Edge Computing Scenario[J]. Journal of Electronics & Information Technology, 2023, 45(6): 1933-1943. doi: 10.11999/JEIT220130


doi: 10.11999/JEIT220130
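Funding: National Natural Science Foundation of China (62171156)
    About the authors:

    WU Ruidong: male, PhD candidate; research interest: high-performance heterogeneous computing

    LIU Bing: male, associate professor; research interests: high-performance computing, computer vision

    FU Ping: male, professor; research interest: automatic testing

    JI Xinglong: male, research fellow; research interest: embedded intelligent computing

    LU Wenshuai: male, research fellow; research interest: intelligent microsystems

    Corresponding author:

    LIU Bing liubing66@hit.edu.cn

  • CLC number: TN929.5; TP331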

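  • Abstract: To meet the performance and power requirements of convolutional neural networks (CNNs) in Ultimate Edge Computing (UEC) scenarios, this paper proposes a CNN accelerator architecture that does not rely on external memory and targets network models quantized to 16 bits. The basic structure is a multi-core, fully pipelined CNN accelerator based on a Field Programmable Gate Array (FPGA). On this basis, intra-layer mapping and inter-layer fusion optimizations are implemented. A resource evaluation model is then built to estimate, in theory, the computing and storage resources of the architecture. Guided by this model, design space exploration maximizes resource utilization and computational efficiency, and thereby exploits the accelerator's peak computing power under the computing-resource constraints. Finally, taking autonomous fast human detection on a nano Unmanned Aerial Vehicle (UAV) as an example UEC scenario, the performance of the architecture is verified and analyzed experimentally. The results show that, for inference of a human-detection network based on the Single Shot MultiBox Detector (SSD), the accelerator reaches 137 and 34 frames per second at clock frequencies of 100 MHz and 25 MHz, with power consumption of 0.514 W and 0.263 W respectively. This meets the real-time image processing performance and power requirements of autonomous computing on nano UAVs, a typical UEC scenario.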
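The design space exploration mentioned in the abstract can be pictured as a balancing problem: in a fully pipelined design the slowest stage sets the frame interval, so each layer only needs enough parallel multipliers to keep pace with it. The sketch below is a hypothetical illustration of that idea, not the authors' exploration procedure. The helper names, the power-of-two restriction, and the choice of cycle budget (the largest cycle count listed in Table 1) are all assumptions made for the example; the layer shapes are taken from Table 1.

```python
def conv_macs(h, w, kernel, c_in, c_out):
    """Multiply-accumulate (MAC) operations needed by one convolution layer."""
    return h * w * kernel * kernel * c_in * c_out


def min_parallelism(layer_macs, cycle_budget):
    """Smallest power-of-two number of parallel multipliers that lets the
    layer finish within cycle_budget cycles (one MAC per multiplier per cycle)."""
    p = 1
    while layer_macs > p * cycle_budget:
        p *= 2
    return p


# Layer shapes as listed in Table 1: (height, width, kernel, c_in, c_out).
LAYERS = {
    0: (160, 120, 3, 1, 32),
    1: (160, 120, 1, 32, 32),
    5: (80, 60, 1, 16, 16),
    8: (40, 30, 3, 16, 16),
    9: (40, 30, 1, 16, 16),
}

# Assumed per-stage budget: the largest cycle count listed in Table 1.
CYCLE_BUDGET = 691_200

parallelism = {
    idx: min_parallelism(conv_macs(*shape), CYCLE_BUDGET)
    for idx, shape in LAYERS.items()
}
print(parallelism)
```

Under these assumptions the sketch assigns 8, 32, 2, 4 and 1 multipliers to layers 0, 1, 5, 8 and 9, which happens to coincide with the DSP counts listed for PE0, PE1, PE5, PE8 and PE9 in Table 2.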
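  • Figure 1  Structure of the nano UAV

    Figure 2  Hardware block diagram of the computing system

    Figure 3  Block diagram of the FPGA functional modules

    Figure 4  Schematic of the general-purpose accelerator

    Figure 5  Schematic of pipelined convolution computation

    Figure 6  Schematic of intra-layer mapping

    Figure 7  Schematic of inter-layer fusion optimization

    Figure 8  Schematic of multi-core CNN acceleration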

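    Table 1  Network structure parameters and parallelism exploration results

    | Layer | Input size | Kernel size | TM | TN | Cycles |
    |-------|------------|-------------|----|----|--------|
    | 0     | 160×120×1  | 3×3×1×32    | 1  | 8  | 691200 |
    | 1     | 160×120×32 | 1×1×32×32   | 32 | 1  | 614400 |
    | 2     | 160×120×32 | 3×3×32      | 1  | 8  | 691200 |
    | 3     | 160×120×32 | 1×1×32×32   | 32 | 1  | 614400 |
    | 4     | 80×60×32   | 3×3×32×16   | 32 | 1  | 691200 |
    | 5     | 80×60×16   | 1×1×16×16   | 2  | 1  | 614400 |
    | 6     | 80×60×16   | 3×3×16      | 1  | 1  | 691200 |
    | 7     | 80×60×16   | 1×1×16×16   | 2  | 1  | 614400 |
    | 8     | 40×30×16   | 3×3×16×16   | 4  | 1  | 691200 |
    | 9     | 40×30×16   | 1×1×16×16   | 1  | 1  | 307200 |
    | 10    | 40×30×16   | 3×3×16      | 1  | 1  | 172800 |
    | 11    | 40×30×16   | 1×1×16×16   | 1  | 1  | 307200 |
    | SSD0  | 40×30×16   |             |    |    |        |
    | 12    | 20×15×16   | 3×3×16×16   | 1  | 1  | 691200 |
    | 13    | 20×15×16   | 1×1×16×16   | 1  | 1  | 196800 |
    | 14    | 20×15×16   | 3×3×16      |    |    |        |
    | 15    | 20×15×16   | 1×1×16×16   |    |    |        |
    | SSD1  | 20×15×16   |             |    |    |        |
    | 16    | 10×7×16    | 3×3×16×16   | 1  | 1  | 207200 |
    | 17    | 10×7×16    | 1×1×16×16   |    |    |        |
    | 18    | 10×7×16    | 3×3×16      |    |    |        |
    | 19    | 10×7×16    | 1×1×16×16   |    |    |        |
    | SSD2  | 10×7×16    |             |    |    |        |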
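The cycle counts in Table 1 are consistent with a simple model in which a pipeline stage retires TM × TN multiply-accumulates per cycle, and layers that share one accelerator (13~15 and 16~19, per Table 2) are counted together. The sketch below checks that reading against a few rows. The formula is inferred from the table values rather than quoted from the paper, `stage_cycles` is a hypothetical helper, and depthwise 3×3 layers are encoded with `c_out = 1` as a convention of this example.

```python
def stage_cycles(layers, tm, tn):
    """Cycles for one pipeline stage that runs `layers` back to back,
    assuming it retires tm * tn MACs per cycle.

    Each layer is (height, width, kernel, c_in, c_out); a depthwise layer
    is written with c_out = 1 because it applies one filter per channel.
    """
    total_macs = sum(h * w * k * k * c_in * c_out
                     for h, w, k, c_in, c_out in layers)
    lanes = tm * tn
    return -(-total_macs // lanes)  # ceiling division


# Single-layer stages (shapes and TM/TN from Table 1).
layer0 = stage_cycles([(160, 120, 3, 1, 32)], tm=1, tn=8)
layer4 = stage_cycles([(80, 60, 3, 32, 16)], tm=32, tn=1)
layer10 = stage_cycles([(40, 30, 3, 16, 1)], tm=1, tn=1)   # depthwise

# Fused stages: layers 13~15 and layers 16~19 each share one accelerator.
x0 = stage_cycles([(20, 15, 1, 16, 16), (20, 15, 3, 16, 1),
                   (20, 15, 1, 16, 16)], tm=1, tn=1)
x1 = stage_cycles([(10, 7, 3, 16, 16), (10, 7, 1, 16, 16),
                   (10, 7, 3, 16, 1), (10, 7, 1, 16, 16)], tm=1, tn=1)

print(layer0, layer4, layer10, x0, x1)
```

With these inputs the model gives 691200, 691200, 172800, 196800 and 207200 cycles, matching the corresponding entries in Table 1.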

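    Table 2  Theoretical resource consumption of the accelerators (count)

    | Layer | Accelerator | DSP | BRAM36K |
    |-------|-------------|-----|---------|
    | 0     | PE0         | 8   | 0.5     |
    | 1     | PE1         | 32  | 8       |
    | 2     | PE2         | 8   | 12      |
    | 3     | PE3         | 32  | 8       |
    | 4     | PE4         | 32  | 8       |
    | 5     | PE5         | 2   | 1.5     |
    | 6     | PE6         | 1   | 3       |
    | 7     | PE7         | 2   | 1.5     |
    | 8     | PE8         | 4   | 2       |
    | 9     | PE9         | 1   | 1       |
    | 10    | PE10        | 1   | 1.5     |
    | 11    | PE11        | 1   | 1       |
    | 12    | PE12        | 1   | 2.5     |
    | 13~15 | X0          | 1   | 10.5    |
    | 16~19 | X1          | 1   | 6       |
    | Total | 15          | 127 | 67      |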

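    Table 3  Resource consumption after deployment

    | Resource | Used (count) | Available on chip (count) | Utilization (%) |
    |----------|--------------|---------------------------|-----------------|
    | LUT      | 24814        | 63400                     | 39.14           |
    | LUTRAM   | 1236         | 19000                     | 6.51            |
    | FF       | 17516        | 126800                    | 13.81           |
    | BRAM36K  | 106          | 135                       | 78.52           |
    | DSP      | 156          | 240                       | 65.00           |
    | IO       | 17           | 210                       | 8.10            |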

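    Table 4  SSD resource consumption (count)

    | Branch | DSP | BRAM36K |
    |--------|-----|---------|
    | SSD0   | 18  | 9       |
    | SSD1   | 6   | 6       |
    | SSD2   | 4   | 6       |
    | Total  | 28  | 21      |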

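    Table 5  Power consumption at different frequencies (W)

    | Category  | Power at 100 MHz | Power at 25 MHz |
    |-----------|------------------|-----------------|
    | Clock     | 0.062            | 0.016           |
    | Logic     | 0.073            | 0.019           |
    | Signal    | 0.098            | 0.022           |
    | BRAM      | 0.040            | 0.010           |
    | DSP       | 0.050            | 0.012           |
    | IO        | <0.001           | <0.001          |
    | Static    | 0.109            | 0.109           |
    | Estimated | 0.432            | 0.188           |
    | Measured  | 0.514            | 0.263           |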
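The estimated totals in Table 5 can be reproduced by summing the listed components, and splitting them into dynamic and static parts shows how the power budget responds to the clock frequency. The sketch below does that arithmetic; the grouping into "dynamic" and "static" and the variable names are assumptions of this example, and the IO entry (below 0.001 W) is left out.

```python
# Estimated power components from Table 5, in watts (IO < 0.001 W is omitted).
DYNAMIC_100MHZ = {"Clock": 0.062, "Logic": 0.073, "Signal": 0.098,
                  "BRAM": 0.040, "DSP": 0.050}
DYNAMIC_25MHZ = {"Clock": 0.016, "Logic": 0.019, "Signal": 0.022,
                 "BRAM": 0.010, "DSP": 0.012}
STATIC = 0.109  # identical at both frequencies

dyn_100 = sum(DYNAMIC_100MHZ.values())
dyn_25 = sum(DYNAMIC_25MHZ.values())

total_100 = dyn_100 + STATIC   # estimated total at 100 MHz
total_25 = dyn_25 + STATIC     # estimated total at 25 MHz
ratio = dyn_100 / dyn_25       # dynamic power ratio for a 4x clock ratio

print(f"{total_100:.3f} W, {total_25:.3f} W, dynamic ratio {ratio:.2f}")
```

The sums come to 0.432 W and 0.188 W, the estimated values in the table. The dynamic part changes by a factor of about 4.09 for a 4× change in clock, in line with the usual expectation that dynamic power in CMOS logic scales roughly linearly with frequency, while the static 0.109 W does not change.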

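    Table 6  Software and hardware processing performance comparison (ms)

    | Stage          | ARM compute time (1.2 GHz) | FPGA acceleration time (100 MHz) |
    |----------------|----------------------------|----------------------------------|
    | PE0~PE12       | 140.424                    | 9.291                            |
    | PE0~X1         | 141.179                    | 13.280                           |
    | SSD0           | 2.782                      | 1.117                            |
    | SSD1           | 1.952                      | 0.939                            |
    | SSD2           | 0.364                      | 0.233                            |
    | Input interval | 146.277                    | 7.287                            |
    | Frame rate     | 7 FPS                      | 137 FPS                          |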
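The frame rates in Table 6 follow from the input interval, i.e. the time between two consecutive frames entering the pipeline. The sketch below reproduces them and the resulting speedup; the helper name is hypothetical, and the 25 MHz inference time of 29.148 ms is taken from Table 7 on the assumption that it plays the same role as the input interval.

```python
def frames_per_second(interval_ms):
    """Frame rate that corresponds to a per-frame interval in milliseconds."""
    return 1000.0 / interval_ms


ARM_INTERVAL_MS = 146.277        # Table 6, ARM at 1.2 GHz
FPGA_100MHZ_INTERVAL_MS = 7.287  # Table 6, FPGA at 100 MHz
FPGA_25MHZ_INTERVAL_MS = 29.148  # Table 7, FPGA at 25 MHz (assumed interval)

arm_fps = frames_per_second(ARM_INTERVAL_MS)
fpga_100_fps = frames_per_second(FPGA_100MHZ_INTERVAL_MS)
fpga_25_fps = frames_per_second(FPGA_25MHZ_INTERVAL_MS)
speedup = ARM_INTERVAL_MS / FPGA_100MHZ_INTERVAL_MS

print(round(arm_fps), round(fpga_100_fps), round(fpga_25_fps),
      f"{speedup:.1f}x")
```

Rounded to whole frames this gives 7, 137 and 34 FPS, the figures reported in the table and the abstract, and a speedup of roughly 20× of the FPGA at 100 MHz over the ARM core. Note that the FPGA input interval (7.287 ms) is shorter than the PE0~X1 time (13.280 ms), which is what one would expect from a pipelined design where a new frame can enter before the previous one has left.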

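    Table 7  Performance comparison across platforms

    | Category                     | ARM (1.2 GHz) | GAP8 (200 MHz) | FPGA (25 MHz) | FPGA (100 MHz) |
    |------------------------------|---------------|----------------|---------------|----------------|
    | Operations per second (GOPS) | 1.101         | 0.482          | 5.528         | 22.111         |
    | Inference time (ms)          | 146.277       | 334.625        | 29.148        | 7.287          |
    | Power (W)                    | 2.617         | 0.196          | 0.263         | 0.514          |
    | Operations per watt (GOPS/W) | 0.421         | 2.459          | 21.019        | 43.018         |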

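    Table 8  Comparison with related work

    | Category                     | Ref. [20]   | Ref. [20]                       | Ref. [22]   | Ref. [23]     | This work      |
    |------------------------------|-------------|---------------------------------|-------------|---------------|----------------|
    | Platform                     | GX1150      | GX1150                          | 10AS066N    | XC7Z045       | XC7A100T       |
    | Network type                 | Convolution | Depthwise separable convolution | MobileNetV2 | Face Detector | Body Detection |
    | Quantization width (bit)     | 16          | 16                              | 16          | 16            | 16             |
    | DSP                          | 760         | 712                             | 1278        | -             | 128            |
    | Frequency (MHz)              | 150         | 180                             | 133         | 150           | 25 / 100       |
    | Throughput (GOPS)            | 87.500      | 98.910                          | 170.600     | 137           | 5.528 / 22.111 |
    | Power (W)                    | 8.69        | 8.52                            | -           | 9.63          | 0.263 / 0.514  |
    | Computational efficiency     | 0.768       | 0.772                           | 1.004       | -             | 1.728 / 1.727  |
    | Operations per watt (GOPS/W) | 10.069      | 11.609                          | -           | 14.226        | 21.019 / 43.018 |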
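The two derived rows of Table 8 can be recomputed from the rows above them. The numbers are consistent with "computational efficiency" being throughput divided by (DSP count × clock in GHz), that is, operations per DSP per cycle, and with "operations per watt" being throughput divided by power. This reading is inferred from the values, not quoted from the paper, and the function names below are hypothetical.

```python
def ops_per_dsp_cycle(gops, dsp, freq_mhz):
    """Throughput normalized by DSP count and clock: operations per DSP per cycle."""
    return gops / (dsp * freq_mhz / 1000.0)


def gops_per_watt(gops, watts):
    """Energy efficiency: throughput divided by power."""
    return gops / watts


# This work, XC7A100T with 128 DSPs (values from Table 8).
eff_25 = ops_per_dsp_cycle(5.528, 128, 25)
eff_100 = ops_per_dsp_cycle(22.111, 128, 100)
epw_25 = gops_per_watt(5.528, 0.263)
epw_100 = gops_per_watt(22.111, 0.514)

# First Ref. [20] column, for comparison.
eff_ref20 = ops_per_dsp_cycle(87.5, 760, 150)
epw_ref20 = gops_per_watt(87.5, 8.69)

print(f"{eff_25:.4f} {eff_100:.4f} {epw_25:.3f} {epw_100:.3f}")
print(f"{eff_ref20:.3f} {epw_ref20:.3f}")
```

Within rounding, these reproduce the tabulated values (about 1.73 operations per DSP per cycle and 21.0 / 43.0 GOPS/W for this work, against 0.768 and 10.069 GOPS/W for the first Ref. [20] design). Because both metrics are normalized, they allow a comparison between a small Artix-7 device and much larger FPGAs despite the large gap in absolute throughput.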
  • [1] BIANCHI V, BASSOLI M, LOMBARDO G, et al. IoT wearable sensor and deep learning: An integrated approach for personalized human activity recognition in a smart home environment[J]. IEEE Internet of Things Journal, 2019, 6(5): 8553–8562. doi: 10.1109/JIOT.2019.2920283
    [2] SHI Weisong, ZHANG Xingzhou, WANG Yifan, et al. Edge computing: State-of-the-art and future directions[J]. Journal of Computer Research and Development, 2019, 56(1): 69–89. doi: 10.7544/issn1000-1239.2019.20180760 (in Chinese)
    [3] KRIZHEVSKY A, SUTSKEVER I, and HINTON G E. ImageNet classification with deep convolutional neural networks[C]. The 25th International Conference on Neural Information Processing Systems, Lake Tahoe, USA, 2012: 1097–1105.
    [4] ROY S K, KRISHNA G, DUBEY S R, et al. HybridSN: Exploring 3-D–2-D CNN feature hierarchy for hyperspectral image classification[J]. IEEE Geoscience and Remote Sensing Letters, 2020, 17(2): 277–281. doi: 10.1109/LGRS.2019.2918719
    [5] LIN T Y, GOYAL P, GIRSHICK R, et al. Focal loss for dense object detection[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(2): 318–327. doi: 10.1109/TPAMI.2018.2858826
    [6] USAMA M, AHMAD B, SONG Enmin, et al. Attention-based sentiment analysis using convolutional and recurrent neural network[J]. Future Generation Computer Systems, 2020, 113: 571–578. doi: 10.1016/j.future.2020.07.022
    [7] WAN Shaohua and GOUDOS S. Faster R-CNN for multi-class fruit detection using a robotic vision system[J]. Computer Networks, 2020, 168: 107036. doi: 10.1016/j.comnet.2019.107036
    [8] ACHARYA J and BASU A. Deep neural network for respiratory sound classification in wearable devices enabled by patient specific model tuning[J]. IEEE Transactions on Biomedical Circuits and Systems, 2020, 14(3): 535–544. doi: 10.1109/TBCAS.2020.2981172
    [9] WANG Yu, YANG Jie, LIU Miao, et al. LightAMC: Lightweight automatic modulation classification via deep learning and compressive sensing[J]. IEEE Transactions on Vehicular Technology, 2020, 69(3): 3491–3495. doi: 10.1109/TVT.2020.2971001
    [10] WU Huaqiang, LYU Feng, ZHOU Conghao, et al. Optimal UAV caching and trajectory in aerial-assisted vehicular networks: A learning-based approach[J]. IEEE Journal on Selected Areas in Communications, 2020, 38(12): 2783–2797. doi: 10.1109/JSAC.2020.3005469
    [11] Bitcraze. Crazyflie 2.1[EB/OL]. https://www.bitcraze.io/products/crazyflie-2-1/, 2022.
    [12] PALOSSI D, LOQUERCIO A, CONTI F, et al. A 64-mW DNN-based visual navigation engine for autonomous nano-drones[J]. IEEE Internet of Things Journal, 2019, 6(5): 8357–8371. doi: 10.1109/JIOT.2019.2917066
    [13] NICULESCU V, LAMBERTI L, CONTI F, et al. Improving autonomous nano-drones performance via automated end-to-end optimization and deployment of DNNs[J]. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 2021, 11(4): 548–562. doi: 10.1109/JETCAS.2021.3126259
    [14] PALOSSI D, ZIMMERMAN N, BURRELLO A, et al. Fully onboard AI-powered human-drone pose estimation on ultralow-power autonomous flying nano-UAVs[J]. IEEE Internet of Things Journal, 2022, 9(3): 1913–1929. doi: 10.1109/JIOT.2021.3091643
    [15] LIU Qinrang and LIU Chongyang. Calculation optimization for convolutional neural networks and FPGA-based accelerator design using the parameters sparsity[J]. Journal of Electronics & Information Technology, 2018, 40(6): 1368–1374. doi: 10.11999/JEIT170819 (in Chinese)
    [16] QIN Huabiao and CAO Qinping. Design of convolutional neural networks hardware acceleration based on FPGA[J]. Journal of Electronics & Information Technology, 2019, 41(11): 2599–2605. doi: 10.11999/JEIT190058 (in Chinese)
    [17] YUAN Tian, LIU Weiqiang, HAN Jie, et al. High performance CNN accelerators based on hardware and algorithm co-optimization[J]. IEEE Transactions on Circuits and Systems I: Regular Papers, 2021, 68(1): 250–263. doi: 10.1109/TCSI.2020.3030663
    [18] GONG Lei, WANG Chao, LI Xi, et al. MALOC: A fully pipelined FPGA accelerator for convolutional neural networks with all layers mapped on chip[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2018, 37(11): 2601–2612. doi: 10.1109/TCAD.2018.2857078
    [19] WANG Chao, GONG Lei, YU Qi, et al. DLAU: A scalable deep learning accelerator unit on FPGA[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2017, 36(3): 513–517. doi: 10.1109/TCAD.2016.2587683
    [20] DING Wei, HUANG Zeyu, HUANG Zunkai, et al. Designing efficient accelerator of depthwise separable convolutional neural network on FPGA[J]. Journal of Systems Architecture, 2019, 97: 278–286. doi: 10.1016/j.sysarc.2018.12.008
    [21] BLOTT M, PREUßER T B, FRASER N J, et al. FINN-R: An end-to-end deep-learning framework for fast exploration of quantized neural networks[J]. ACM Transactions on Reconfigurable Technology and Systems, 2018, 11(3): 16. doi: 10.1145/3242897
    [22] BAI Lin, ZHAO Yiming, and HUANG Xinming. A CNN accelerator on FPGA using depthwise separable convolution[J]. IEEE Transactions on Circuits and Systems II: Express Briefs, 2018, 65(10): 1415–1419. doi: 10.1109/TCSII.2018.2865896
    [23] GUO Kaiyuan, SUI Lingzhi, QIU Jiantao, et al. Angel-eye: A complete design flow for mapping CNN onto embedded FPGA[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2018, 37(1): 35–47. doi: 10.1109/TCAD.2017.2705069
    [24] ZHU Jiang, WANG Lizan, LIU Haolin, et al. An efficient task assignment framework to accelerate DPU-based convolutional neural network inference on FPGAs[J]. IEEE Access, 2020, 8: 83224–83237. doi: 10.1109/ACCESS.2020.2988311
Publication history
  • Received: 2022-02-15
  • Revised: 2022-07-10
  • Published online: 2022-07-15
  • Issue date: 2023-06-10