Design of a CNN Accelerator Based on Systolic Array Collaboration with Inter-Layer Fusion

LU Di, WANG Zhen Fa

Citation: LU Di, WANG Zhen Fa. Design of a CNN Accelerator Based on Systolic Array Collaboration with Inter-Layer Fusion[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT250867


doi: 10.11999/JEIT250867 cstr: 32379.14.JEIT250867
Details
    About the authors:

    LU Di: female, professor, Ph.D.; research interests include data fusion and image processing

    WANG Zhen Fa: male, master's student; research interests include image processing and edge computing

    Corresponding author:

    LU Di, ludizeng@hrbust.edu.cn

  • CLC number: TP331; TN47

Design of a CNN Accelerator Based on Systolic Array Collaboration with Inter-Layer Fusion

  • Abstract: Real-time deployment of convolutional neural networks (CNNs) in edge-computing and embedded applications places severe demands on the performance and energy efficiency of hardware accelerators. To address the core problems common to FPGA-based CNN accelerators, namely data-movement bottlenecks, under-utilized resources, and inefficient compute units, this paper proposes a hybrid CNN accelerator architecture that combines systolic arrays with inter-layer fusion. Adjacent compute-intensive layers are tightly bound so that their consecutive computations complete within the same array stage, reducing the frequent off-chip transfers of intermediate results; this lowers data-movement counts and power while raising computing speed and overall energy efficiency. A dynamically configurable systolic-array method adaptively supports matrix multiplications of multiple dimensions in hardware, avoiding the resource waste of deploying dedicated units for each problem size, reducing overall FPGA logic consumption, and improving hardware adaptability and flexibility. By carefully planning the computation flow and control logic, a streaming systolic-array computing method keeps the compute units continuously busy: data flow through the compute engine in a highly pipelined, parallel manner, raising the utilization of the processing elements inside the array, shrinking idle periods, and increasing overall throughput. Experimental results on a Xilinx Zynq-7100 platform show that VGG16, ResNet50, and YOLOv8n reach 390.25 GOPS, 360.27 GOPS, and 348.08 GOPS respectively on the proposed accelerator, providing an effective FPGA route for deploying high-performance, low-power CNN inference on resource-constrained edge devices.
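The fusion idea in the abstract — binding adjacent compute-intensive layers so the intermediate feature map never travels off chip — can be sketched functionally. This is a hypothetical Python illustration, not the authors' RTL: `matmul` stands in for one systolic-array pass over an im2col-style matrix, and the local variable `y1` plays the role of on-chip storage between the two fused layers.

```python
# Functional sketch of inter-layer fusion (illustrative, not the paper's design):
# two adjacent layers are computed back-to-back so the intermediate result
# stays in local ("on-chip") storage instead of a round trip to external memory.

def matmul(a, b):
    """Plain matrix multiply standing in for one systolic-array pass."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(cols)] for i in range(rows)]

def relu(m):
    """Elementwise activation between the fused layers."""
    return [[max(0, x) for x in row] for row in m]

def fused_two_layers(x, w1, w2):
    """Layer fusion: y1 never leaves this function, mimicking consecutive
    computation inside one systolic-array stage."""
    y1 = relu(matmul(x, w1))   # layer 1 output, held in local storage
    return matmul(y1, w2)      # layer 2 consumes it directly

x  = [[1, 2], [3, 4]]
w1 = [[1, 0], [0, -1]]
w2 = [[2], [1]]
print(fused_two_layers(x, w1, w2))  # prints [[2], [6]]
```

In hardware the benefit is that the off-chip writes and reads of `y1` disappear entirely, which is where the abstract's power and bandwidth savings come from.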
  • Figure 1  Structure of the inner-product stacked matrix unit

    Figure 2  Two-dimensional systolic array architecture

    Figure 3  Systolic array with inter-layer fusion architecture

    Figure 4  Overall architecture of the CNN accelerator

    Figure 5  Computation flow of the systolic-array inter-layer fusion architecture

    Figure 6  Image2col implementation

    Figure 7  Compute engine of the first-stage three-dimensional systolic array

    Figure 8  Trapezoidal line buffer

    Figure 9  Fusion of adjacent convolutional layers

    Figure 10  Dynamically configurable systolic array

    Figure 11  Streaming systolic array design

    Table 1  Accuracy (%) of VGG16 and ResNet50 under different quantization bit widths on CIFAR100, CIFAR10, and MNIST

    Dataset   Model     32-float  12-bit  10-bit  8-bit  6-bit
    CIFAR100  VGG16     70.44     70.45   70.44   69.99  53.33
    CIFAR100  ResNet50  74.98     74.94   74.91   73.98  40.43
    CIFAR10   VGG16     93.78     92.45   91.88   91.23  85.33
    CIFAR10   ResNet50  94.98     93.94   92.91   92.35  86.43
    MNIST     VGG16     98.42     97.98   96.95   96.55  93.33
    MNIST     ResNet50  99.35     98.94   98.90   97.38  95.43
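Table 1 motivates the Int8 choice: accuracy is nearly unchanged down to 8 bits and collapses at 6 bits. As a generic illustration of how such a fixed-point mapping works — the paper's exact quantization scheme is not given on this page, so this sketch is an assumption, not the authors' method — a symmetric linear quantizer can be written as:

```python
# Generic symmetric linear quantization sketch (illustrative only; the
# accelerator's actual scheme is not described here).

def quantize(values, bits=8):
    """Map floats to signed integers of the given bit width, with a
    per-tensor scale derived from the largest magnitude."""
    qmax = (1 << (bits - 1)) - 1               # e.g. 127 for 8-bit
    scale = max(abs(v) for v in values) / qmax or 1.0  # guard all-zero input
    q = [round(v / scale) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from the integers."""
    return [v * scale for v in q]

vals = [-1.0, -0.5, 0.0, 0.25, 1.0]
q, s = quantize(vals, bits=8)
approx = dequantize(q, s)
```

Narrower widths shrink `qmax`, coarsening the grid of representable values, which is why accuracy degrades gracefully to 8 bits and then sharply at 6.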

    Table 2  Deployment resource consumption

    Resource  Used    Available  Utilization (%)
    LUT       245672  277400     88.56
    FF        293028  554800     52.81
    BRAM      297     755        39.33
    DSP       2000    2020       99.00

    Table 3  Performance comparison of multipliers synthesized from DSPs and LUTs

    Synthesis resource  Data type     Max frequency (MHz)
    DSP48E1             8-bit signed  544
    LUT                 8-bit signed  384.61

    Table 4  Resource consumption per operator

    Operator  LUT     FF      BRAM   DSP
    Conv      184293  229438  157.5  1824
    Max_pool  12370   10698   20     0
    Add       2256    2885    12     96
    Cat       2120    2367    16     80
    Upsample  1066    832     8      0

    Table 5  Comparison with related FPGA accelerators

    Design     Model        Platform  Process(nm)  Freq(MHz)  Data type  DSP   GOP    Throughput(GOPS)  Power(W)  Efficiency(GOPS/W)
    Ref. [6]   YOLOv2       Pynq-z2   28           125        Int16      153   29.47  54.62             2.7       20.23
    Ref. [7]   YOLOv4-Tiny  Zynq7020  28           200        Int16      220   6.95   24.25             4.58      5.29
    Ours       YOLOv8       Zynq7100  28           200        Int8       2000  9.16   348.08            7.1       49.02
    Ref. [7]   VGG16        Zynq7020  28           150        Int16      220   15.52  48.36             3.75      12.89
    Ref. [8]   VGG16        ZC706     28           150        Int16      449   15.52  115.21            3.8       30.32
    Ref. [9]   VGG16        ZU15EG    16           300        Int8       3528  15.52  217.62            3.72      58.36
    Ours       VGG16        Zynq7100  28           200        Int8       2000  15.52  390.25            7.1       54.96
    Ref. [14]  ResNet50     ZC706     28           200        Int8       840   4.13   330.22            -         119.64
    Ref. [15]  ResNet50     XCZU19EG  16           100        Int8       -     4.13   92                -         47.81
    Ours       ResNet50     Zynq7100  28           200        Int8       2000  4.13   360.27            7.1       50.74

    Table 6  Comparison with CPU and GPU implementations

    Design  Platform          Process(nm)  Frequency  Data type  Model     GOP    GOPS     Power(W)  Efficiency(GOPS/W)
    CPU     Intel i5-12600kf  7            3.05 GHz   Float32    VGG16     15.52  220.69   74.27     2.97
    CPU     Intel i5-12600kf  7            3.05 GHz   Float32    ResNet50  4.13   90.28    92.14     0.97
    CPU     Intel i5-12600kf  7            3.05 GHz   Float32    Yolov8    9.16   132.87   76.35     1.74
    GPU     Nvidia RTX 4060   5            2.73 GHz   Float32    VGG16     15.52  1974.60  111.55    17.70
    GPU     Nvidia RTX 4060   5            2.73 GHz   Float32    ResNet50  4.13   2027.83  105.88    19.15
    GPU     Nvidia RTX 4060   5            2.73 GHz   Float32    Yolov8    9.16   1832     110.58    16.56
    Ours    Xilinx Zynq-7100  28           200 MHz    Int8       VGG16     15.52  390.25   7.1       54.96
    Ours    Xilinx Zynq-7100  28           200 MHz    Int8       ResNet50  4.13   360.27   7.1       50.74
    Ours    Xilinx Zynq-7100  28           200 MHz    Int8       Yolov8    9.16   348.08   7.1       49.02
  • [1] SHAO Jie and CHENG Qiyu. E-FCNN for tiny facial expression recognition[J]. Applied Intelligence, 2021, 51(1): 549–559. doi: 10.1007/s10489-020-01855-5.
    [2] KRIZHEVSKY A, SUTSKEVER I, and HINTON G E. ImageNet classification with deep convolutional neural networks[C]. Proceedings of the 26th Annual Conference on Neural Information Processing Systems, Lake Tahoe, USA, 2012: 1106–1114.
    [3] SIMONYAN K and ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[C]. 3rd International Conference on Learning Representations, San Diego, USA, 2015.
    [4] HE Kaiming, ZHANG Xiangyu, REN Shaoqing, et al. Identity mappings in deep residual networks[C]. 14th European Conference on Computer Vision, Amsterdam, The Netherlands, 2016: 630–645. doi: 10.1007/978-3-319-46493-0_38.
    [5] IOFFE S and SZEGEDY C. Batch normalization: Accelerating deep network training by reducing internal covariate shift[C]. Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015: 448–456.
    [6] BAO Chun, XIE Tao, FENG Wenbin, et al. A power-efficient optimizing framework FPGA accelerator based on Winograd for YOLO[J]. IEEE Access, 2020, 8: 94307–94317. doi: 10.1109/ACCESS.2020.2995330.
    [7] YE Jinlin, LIU Yuhan, CHEN Haiyong, et al. Edge computing accelerator for real-time defect detection of photovoltaic panel on lightweight FPGAs[J]. IEEE Transactions on Instrumentation and Measurement, 2025, 74: 3001815. doi: 10.1109/TIM.2025.3563001.
    [8] ZHANG Chen, WANG Xin’an, YONG Shanshan, et al. An energy-efficient convolutional neural network processor architecture based on a systolic array[J]. Applied Sciences, 2022, 12(24): 12633. doi: 10.3390/app122412633.
    [9] XU Yuhua, LUO Jie, and SUN Wei. Flare: An FPGA-based full precision low power CNN accelerator with reconfigurable structure[J]. Sensors, 2024, 24(7): 2239. doi: 10.3390/s24072239.
    [10] ZHANG Yonghua, WANG Haojie, and PAN Zhenhua. An efficient CNN accelerator for pattern-compressed sparse neural networks on FPGA[J]. Neurocomputing, 2025, 611: 128700. doi: 10.1016/j.neucom.2024.128700.
    [11] HU Xianghong, FU Shansen, LIN Yuanmiao, et al. An FPGA-based bit-level weight sparsity and mixed-bit accelerator for neural networks[J]. Journal of Systems Architecture, 2025, 166: 103463. doi: 10.1016/j.sysarc.2025.103463.
    [12] LI Gang, LIU Zejian, LI Fanrong, et al. Block convolution: Toward memory-efficient inference of large-scale CNNs on FPGA[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2022, 41(5): 1436–1447. doi: 10.1109/TCAD.2021.3082868.
    [13] PACINI T, RAPUANO E, DINELLI G, et al. A multi-cache system for on-chip memory optimization in FPGA-based CNN accelerators[J]. Electronics, 2021, 10(20): 2514. doi: 10.3390/electronics10202514.
    [14] OU Yaozhong, YU Weihan, UN K F, et al. A 119.64 GOPs/W FPGA-based ResNet50 mixed-precision accelerator using the dynamic DSP packing[J]. IEEE Transactions on Circuits and Systems II: Express Briefs, 2024, 71(5): 2554–2558. doi: 10.1109/TCSII.2024.3377356.
    [15] FUKUSHIMA Y, IIZUKA K, and AMANO H. Parallel implementation of CNN on multi-FPGA cluster[J]. IEICE Transactions on Information and Systems, 2023, E106.D(7): 1198–1208. doi: 10.1587/transinf.2022EDP7175.
    [16] ALWANI M, CHEN Han, FERDMAN M, et al. Fused-layer CNN accelerators[C]. 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Taipei, China, 2016: 1–12. doi: 10.1109/MICRO.2016.7783725.
    [17] CHEN Yunji, LI Ling, ZHAO Yongwei, et al. AI Computing Systems: From Deep Learning to Large Models[M]. 2nd ed. Beijing: China Machine Press, 2024: 256–257. (in Chinese)
    [18] LIU Yanyi, DU Hang, WU Yin, et al. FPGA accelerated deep learning for industrial and engineering applications: Optimal design under resource constraints[J]. Electronics, 2025, 14(4): 703. doi: 10.3390/electronics14040703.
Publication history
  • Revised: 2025-12-29
  • Accepted: 2025-12-29
  • Available online: 2026-01-05
