高级搜索

留言板

尊敬的读者、作者、审稿人, 关于本刊的投稿、审稿、编辑和出版的任何问题, 您可以本页添加留言。我们将尽快给您答复。谢谢您的支持!

姓名
邮箱
手机号码
标题
留言内容
验证码

一种基于三维可变换CNN加速结构的并行度优化搜索算法

屈心媛 徐宇 黄志洪 蔡刚 方震

屈心媛, 徐宇, 黄志洪, 蔡刚, 方震. 一种基于三维可变换CNN加速结构的并行度优化搜索算法[J]. 电子与信息学报, 2022, 44(4): 1503-1512. doi: 10.11999/JEIT210059
引用本文: 屈心媛, 徐宇, 黄志洪, 蔡刚, 方震. 一种基于三维可变换CNN加速结构的并行度优化搜索算法[J]. 电子与信息学报, 2022, 44(4): 1503-1512. doi: 10.11999/JEIT210059
QU Xinyuan, XU Yu, HUANG Zhihong, CAI Gang, FANG Zhen. A Parallelism Strategy Optimization Search Algorithm Based on Three-dimensional Deformable CNN Acceleration Architecture[J]. Journal of Electronics & Information Technology, 2022, 44(4): 1503-1512. doi: 10.11999/JEIT210059
Citation: QU Xinyuan, XU Yu, HUANG Zhihong, CAI Gang, FANG Zhen. A Parallelism Strategy Optimization Search Algorithm Based on Three-dimensional Deformable CNN Acceleration Architecture[J]. Journal of Electronics & Information Technology, 2022, 44(4): 1503-1512. doi: 10.11999/JEIT210059

一种基于三维可变换CNN加速结构的并行度优化搜索算法

doi: 10.11999/JEIT210059
基金项目: 国家自然科学基金(61704173, 61974146),北京市科技重大专项(Z171100000117019)
详细信息
    作者简介:

    屈心媛:女,1994年生,博士生,研究方向为基于FPGA的CNN加速器架构设计

    徐宇:男,1990年生,博士,研究方向为大规模集成电路设计自动化

    黄志洪:男,1984年生,高级工程师,研究方向为可编程芯片设计与FPGA硬件加速

    蔡刚:男,1980年生,正高级工程师,硕士生导师,研究方向为集成电路设计、抗辐照加固设计、人工智能系统设计

    方震:男,1976年生,研究员,博士生导师,研究方向为新型医疗电子及医学人工智能技术

    通讯作者:

    黄志洪 huangzhihong@mail.ie.ac.cn

  • 1)本节给出的数据均为基于KCU1500的AlexNet加速器的实验结果。2) (Parain=1, Paraseg=2)相当于Parain=1/2;(Parain=3, Paraseg=5)相当于Parain=3/5;以此类推。
  • 中图分类号: TN47

A Parallelism Strategy Optimization Search Algorithm Based on Three-dimensional Deformable CNN Acceleration Architecture

Funds: The National Natural Science Foundation of China (61704173, 61974146), The Major Program of Beijing Science and Technology (Z171100000117019)
  • 摘要: 现场可编程门阵列(FPGA)被广泛应用于卷积神经网络(CNN)的硬件加速中。为优化加速器性能,Qu等人(2021)提出了一种3维可变换的CNN加速结构,但该结构使得并行度探索空间爆炸增长,搜索最优并行度的时间开销激增,严重降低了加速器实现的可行性。为此该文提出一种细粒度迭代优化的并行度搜索算法,该算法通过多轮迭代的数据筛选,高效地排除冗余的并行度方案,压缩了超过99%的搜索空间。同时算法采用剪枝操作删减无效的计算分支,成功地将计算所需时长从106 h量级减少到10 s内。该算法可适用于不同规格型号的FPGA芯片,其搜索得到的最优并行度方案性能突出,可在不同芯片上实现平均(R1, R2)达(0.957, 0.962)的卓越计算资源利用率。
  • 图  1  CNN加速器单层结构示意图

    图  2  矩阵卷积分段计算示意图

    图  3  β=0.20时,算法搜索的计算量随α的变化情况

    图  4  基于不同规格FPGA的AlexNet加速器性能随(α, β)变化色温图

    表  1  AlexNet网络结构参数

    NinNoutSIZEinSIZEoutSIZEkerStrideNpad
    CONV1396227551140
    POOL196965527320
    CONV2482562727512
    POOL22562562713320
    CONV32563841313311
    CONV41923841313311
    CONV51922561313311
    POOL5256256136320
    FC19216409611
    FC24096409611
    FC34096100011
    下载: 导出CSV

    表  2  不同FPGA CNN加速器的资源利用率

    文献VGG文献AlexNet
    R1R2R1R2
    [5]0.80.8[3]0.320.38
    [11]0.710.71[4]0.420.55
    [14]0.770.84[6]0.500.85
    [8]0.780.99[8]0.670.76
    [15]0.660.80[14]0.620.78
    下载: 导出CSV

    表  3  细粒度并行度迭代算法

     输入:片上可用DSP数#DSPlimit、可用BRAM数量#BRAMlimit、CNN网络结构参数及α, β
     输出:Parain,Paraout及Paraseg
     (1) 计算各层计算量#OPi与网络总计算量#OPtotal之比γi
     (2) 按照计算量分布比例将片上可用DSP分配给各层,各层分配到的DSP数#DSPiallocγi·#DSPtotal
     (3)根据计算总量和计算资源总数,算出理论最小计算周期数#cyclebaseline
     (4) 第i层,遍历Parain,Paraout及ROWout的所有离散可行取值(即3者定义域形成的笛卡儿积),生成全组合情况下的并行度参数配置
       集S0i,计算对应的#cyclei、#BRAMi与#DSPi
     (5) 筛选满足α, β约束的数据集Si
     Si←select ele from S0i where (#cyclei/#cyclebaseline in [1–α,1+α] and #DSPi/#DSPialloc in [1–β,1+β])
     (6)数据粗筛,集合Si:任意相邻的两个元素不存在“KO”偏序关系。
       for i in range(5):
         orders←[(cycle, dsp, bram), (dsp, cycle, bram), (bram, cycle, dsp)]
         for k in range(3):
           Si.sort_ascend_by(orders[k])
           p←0
           for j in range(1, size(Si)):
             if σj KO σp then Si.drop(σp), pj
             else Si.drop(σj)
         SiSi
     (7)数据精筛,集合Ti:任意两个元素不存在“KO”偏序关系。
      for i in range(5):
       Si.sort_ascend_by((cycle,dsp,bram))
       for j in range(1, size(Si)):
          for k in range(j):
           if σk KO σj then S’i.drop(σj), break
       TiS’i
     (8)搜索剪枝。
      maxCycle←INT_MAX, dspUsed←0, bramUsed←0
      def calc(i):
       if i==5 then
         update(maxCycle)
         return
       for j in range(size(Ti)):
         tmpDsp←dspUsed+dspji, tmpBram←bramUsed+bramji
         if not(tmpDsp>dspTotal or tmpBram>bramTotal or
           cycleji≥maxCycle) then
           dspUsed←tmpDsp, bramUsed←tmpBram
           calc(i+1)
           dspUsed←tmpDsp-dspji, bramUsed←tmpBram-bramji
         else
           continue
      calc(0)
     (9)选出maxCycle(即min{max{#cyclei}})对应的并行度元素,输出约束条件下最优并行度的参数信息。
    下载: 导出CSV

    表  4  不同规格FPGA上AlexNet加速器资源利用率、计算量与计算时长

    FPGA型号DSP资源数R1R2原始计算量压缩比(%)执行时间(s)
    Arria10 GT 115015180.9870.9895.683×10799.8921.544
    KU06027600.9470.9513.026×10899.9796.444
    Virtex7 VX485T28000.9360.9419.903×10899.9945.841
    Virtex7 VX690T36000.9600.9672.082×10899.9982.775
    KCU150055200.9550.9625.772×10999.9998.115
    下载: 导出CSV

    表  5  AlexNet加速器性能对比

    型号文献[4]文献[11]文献[12]文献[8]本文
    量化位宽16 bit定点16 bit定点16 bit定点16 bit定点16 bit定点
    频率(MHz)250/500200150220230
    FPGA型号KCU1500Arria10GX1150ZynqXC7Z045KCU1500KCU1500
    吞吐率(GOP/s)2335.4584.8137.01633.02425.5
    性能功耗比(GOP/s/W)37.31无数据14.2172.3162.35
    资源利用率(R1, R2)(0.42, 0.55)(0.48, 0.48)(0.51, 0.59)(0.67, 0.76)(0.96, 0.96)
    下载: 导出CSV
  • [1] LECUN Y, BOTTOU L, BENGIO Y, et al. Gradient-based learning applied to document recognition[J]. Proceedings of the IEEE, 1998, 86(11): 2278–2324. doi: 10.1109/5.726791
    [2] QU Xinyuan, HUANG Zhihong, XU Yu, et al. Cheetah: An accurate assessment mechanism and a high-throughput acceleration architecture oriented toward resource efficiency[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2021, 40(5): 878–891. doi: 10.1109/TCAD.2020.3011650
    [3] REGGIANI E, RABOZZI M, NESTOROV A M, et al. Pareto optimal design space exploration for accelerated CNN on FPGA[C]. 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Rio de Janeiro, Brazil, 2019: 107–114. doi: 10.1109/IPDPSW.2019.00028.
    [4] YU Xiaoyu, WANG Yuwei, MIAO Jie, et al. A data-center FPGA acceleration platform for convolutional neural networks[C]. 2019 29th International Conference on Field Programmable Logic and Applications (FPL), Barcelona, Spain, 2019: 151–158. doi: 10.1109/FPL.2019.00032.
    [5] LIU Zhiqiang, CHOW P, XU Jinwei, et al. A uniform architecture design for accelerating 2D and 3D CNNs on FPGAs[J]. Electronics, 2019, 8(1): 65. doi: 10.3390/electronics8010065
    [6] LI Huimin, FAN Xitian, JIAO Li, et al. A high performance FPGA-based accelerator for large-scale convolutional neural networks[C]. 2016 26th International Conference on Field Programmable Logic and Applications (FPL), Lausanne, Swiss, 2016: 1–9. doi: 10.1109/FPL.2016.7577308.
    [7] QIU Jiantao, WANG Jie, YAO Song, et al. Going deeper with embedded FPGA platform for convolutional neural network[C]. The 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, California, USA, 2016: 26–35.
    [8] ZHANG Xiaofan, WANG Junsong, ZHU Chao, et al. DNNBuilder: An automated tool for building high-performance DNN hardware accelerators for FPGAs[C]. 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), San Diego, USA, 2018: 1–8. doi: 10.1145/3240765.3240801.
    [9] LIU Zhiqiang, DOU Yong, JIANG Jingfei, et al. Automatic code generation of convolutional neural networks in FPGA implementation[C]. 2016 International Conference on Field-Programmable Technology (FPT), Xi’an, China, 2016: 61–68. doi: 10.1109/FPT.2016.7929190.
    [10] KRIZHEVSKY A, SUTSKEVER I, and HINTON G E. ImageNet classification with deep convolutional neural networks[J]. Communications of the ACM, 2017, 60(6): 84–90. doi: 10.1145/3065386
    [11] MA Yufei, CAO Yu, VRUDHULA S, et al. Optimizing the convolution operation to accelerate deep neural networks on FPGA[J]. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2018, 26(7): 1354–1367. doi: 10.1109/TVLSI.2018.2815603
    [12] GUO Kaiyuan, SUI Lingzhi, QIU Jiantao, et al. Angel-Eye: A complete design flow for mapping CNN onto embedded FPGA[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2018, 37(1): 35–47. doi: 10.1109/TCAD.2017.2705069
    [13] ZHANG Chen, SUN Guangyu, FANG Zhenman, et al. Caffeine: Toward uniformed representation and acceleration for deep convolutional neural networks[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2019, 38(11): 2072–2085. doi: 10.1109/TCAD.2017.2785257
    [14] ZHANG Jialiang and LI Jing. Improving the performance of OpenCL-based FPGA accelerator for convolutional neural network[C]. The 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, California, USA, 2017: 25–34. doi: 10.1145/3020078.3021698.
    [15] LIU Zhiqiang, DOU Yong, JIANG Jingfei, et al. Throughput-optimized FPGA accelerator for deep convolutional neural networks[J]. ACM Transactions on Reconfigurable Technology and Systems, 2017, 10(3): 17. doi: 10.1145/3079758
  • 加载中
图(4) / 表(5)
计量
  • 文章访问数:  1048
  • HTML全文浏览量:  412
  • PDF下载量:  110
  • 被引次数: 0
出版历程
  • 收稿日期:  2021-01-08
  • 修回日期:  2021-08-04
  • 网络出版日期:  2021-09-09
  • 刊出日期:  2022-04-18

目录

    /

    返回文章
    返回