
FPGA-Based Unified Accelerator for Convolutional Neural Network and Vision Transformer

LI Tianyang, ZHANG Fan, WANG Song, CAO Wei, CHEN Li

Citation: LI Tianyang, ZHANG Fan, WANG Song, CAO Wei, CHEN Li. FPGA-Based Unified Accelerator for Convolutional Neural Network and Vision Transformer[J]. Journal of Electronics & Information Technology, 2024, 46(6): 2663-2672. doi: 10.11999/JEIT230713


doi: 10.11999/JEIT230713
Funds: The National Key R&D Program of China (2022YFB4500900)
Details
    About the authors:

    LI Tianyang: male, Ph.D. candidate; research interests: high-performance computing, reconfigurable computing

    ZHANG Fan: male, Associate Researcher; research interests: high-performance computing, big data, system-on-chip design

    WANG Song: male, Engineer; research interest: information security

    CAO Wei: male, Lecturer; research interests: reconfigurable computing, system-on-chip design

    CHEN Li: male, Ph.D. candidate; research interests: advanced computing and brain-inspired intelligence, computer vision

    Corresponding author: ZHANG Fan, 17034203@qq.com

  • CLC number: TP331; TN47

  • Abstract: Conventional FPGA-based Convolutional Neural Network (CNN) accelerators in computer vision are a poor fit for vision Transformer networks. To address this problem, this paper proposes a unified FPGA accelerator for convolutional neural networks and Transformers. First, a general FPGA-oriented computation mapping method is derived from the computational characteristics of convolution and the attention mechanism. Second, a nonlinear and normalization acceleration unit is proposed to support the various nonlinear and normalization operations in computer-vision neural network models. The accelerator design is implemented on a Xilinx XCVU37P FPGA. Experimental results show that the proposed nonlinear and normalization acceleration unit improves throughput with only a small loss of accuracy, and ResNet-50 and ViT-B/16 reach 589.94 GOPS and 564.76 GOPS, respectively, on the proposed accelerator. Compared with a GPU implementation, the energy-efficiency ratio improves by factors of 5.19 and 7.17, respectively; compared with other large-scale FPGA accelerator designs, the energy-efficiency ratio improves markedly, while the computing efficiency is 8.02%-177.53% higher than that of the compared FPGA accelerators.
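
    The unification rests on a standard observation: with im2col, a convolution becomes one matrix multiplication, and attention is already a softmax between two matrix multiplications, so both can share one GEMM-style compute array. The NumPy sketch below illustrates that shared structure; it is a reader's illustration with assumed shapes and names, not the paper's mapping method.

        import numpy as np

        def conv2d_as_gemm(x, w):
            # x: (C_in, H, W) feature map; w: (C_out, C_in, K, K) kernels.
            # Stride 1, no padding, purely to expose the GEMM structure.
            c_in, h, wid = x.shape
            c_out, _, k, _ = w.shape
            h_out, w_out = h - k + 1, wid - k + 1
            # im2col: every output position becomes a column of length C_in*K*K.
            cols = np.empty((c_in * k * k, h_out * w_out))
            for i in range(h_out):
                for j in range(w_out):
                    cols[:, i * w_out + j] = x[:, i:i + k, j:j + k].ravel()
            # One (C_out, C_in*K*K) x (C_in*K*K, H_out*W_out) matrix multiply.
            return (w.reshape(c_out, -1) @ cols).reshape(c_out, h_out, w_out)

        def attention_as_gemm(q, k, v):
            # q, k, v: (tokens, d). Attention is GEMM -> softmax -> GEMM,
            # so it can reuse the same compute array as convolution.
            s = q @ k.T / np.sqrt(q.shape[1])              # GEMM 1: scores
            p = np.exp(s - s.max(axis=1, keepdims=True))   # stable softmax
            p /= p.sum(axis=1, keepdims=True)
            return p @ v                                   # GEMM 2: weighted sum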
  • Figure 1  Illustration of convolution

    Figure 2  Attention computation process

    Figure 3  Runtime breakdown of ViT under various configurations

    Figure 4  Matrix-multiplication partitioning for convolution and attention

    Figure 5  Design-space exploration of the compute array for ResNet-50 and ViT-B/16

    Figure 6  Comparison of natural-exponential approximation functions

    Figure 7  Nonlinear and normalization acceleration unit

    Figure 8  Accelerator architecture

    Figure 9  Energy-efficiency comparison

    Figure 10  Computing-efficiency comparison
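
    Figures 6 and 7 concern the nonlinear and normalization unit. Since the reference list cites Lomont's fast inverse square root [15], a plausible role for it is the $1/\sqrt{\sigma^2 + \varepsilon}$ in layer normalization; the Python sketch below shows that trick inside a LayerNorm as a reader's illustration under that assumption, not the paper's actual circuit.

        import struct

        def fast_inv_sqrt(x, iters=1):
            # Quake-style fast 1/sqrt(x) (Ref. [15]): reinterpret the float32
            # bits, shift, subtract from a magic constant, then refine with
            # Newton-Raphson. Division-free, hence FPGA-friendly.
            i = struct.unpack('<I', struct.pack('<f', x))[0]
            i = 0x5F3759DF - (i >> 1)
            y = struct.unpack('<f', struct.pack('<I', i))[0]
            for _ in range(iters):
                y = y * (1.5 - 0.5 * x * y * y)  # Newton-Raphson refinement
            return y

        def layernorm(x, gamma, beta, eps=1e-5):
            # LayerNorm over one token vector, using the approximate 1/sqrt.
            mean = sum(x) / len(x)
            var = sum((v - mean) ** 2 for v in x) / len(x)
            inv_std = fast_inv_sqrt(var + eps)
            return [g * (v - mean) * inv_std + b
                    for v, g, b in zip(x, gamma, beta)]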

    Algorithm 1  Exhaustive search of the on-chip model computation workload
     Input: neural network model Model, multiplier count $ {\text{Mult}}_{{\text{num}}} $
     Output: design-space exploration result $({N_{\rm{i}}},{N_{\rm{o}}})$, on-chip model computation workload ${\text{Comp}}_{{\text{num}}}$
     (1) Initialize $({N_{\rm{i}}},{N_{\rm{o}}})$, ${\text{Comp}}_{{\text{num}}}$
     (2) for $i = 0$; $i < \left\lfloor {\sqrt {{\text{Mult}}_{{\text{num}}}}} \right\rfloor$; $i++$ do:
     (3)  ${n_{\rm{i}}} = i + 1$
     (4)  ${n_{\rm{o}}} = \left\lfloor {\dfrac{{\text{Mult}}_{{\text{num}}}}{{n_{\rm{i}}}}} \right\rfloor$
     (5)  if $\text{mod}({n_{\rm{o}}},{n_{\rm{i}}}) == 0$ then
     (6)   $({N_{\rm{i}}},{N_{\rm{o}}})$.append$({n_{\rm{i}}},{n_{\rm{o}}})$
     (7)  end if
     (8) end for
     (9) for each $({n_{\rm{i}}},{n_{\rm{o}}})$ in $({N_{\rm{i}}},{N_{\rm{o}}})$ do
     (10) ${\text{comp}}_{{\text{total}}} = 0$
     (11) for each layer in Model do
     (12)  $({\text{ic}},{\text{oc}},{\text{madd}}_{{\text{num}}})$ = layer.configuration
     (13)  ${n_{{\text{iu}}}} = \text{mod}({\text{ic}},{n_{\rm{i}}}) == 0\ ?\ 1\ :\ \dfrac{\text{mod}({\text{ic}},{n_{\rm{i}}})}{{n_{\rm{i}}}}$
     (14)  ${n_{{\text{ou}}}} = \text{mod}({\text{oc}},{n_{\rm{o}}}) == 0\ ?\ 1\ :\ \dfrac{\text{mod}({\text{oc}},{n_{\rm{o}}})}{{n_{\rm{o}}}}$
     (15)  ${\text{comp}}_{{\text{total}}}\ {+}{=}\ {\text{madd}}_{{\text{num}}}/({n_{{\text{iu}}}} \times {n_{{\text{ou}}}})$
     (16) end for
     (17) ${\text{Comp}}_{{\text{num}}}$.append(${\text{comp}}_{{\text{total}}}$)
     (18) end for
     (19) return $({N_{\rm{i}}},{N_{\rm{o}}})$, ${\text{Comp}}_{{\text{num}}}$
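
    The pseudocode translates directly into a runnable sketch. The Python below mirrors Algorithm 1, reading the membership test in step (5) as $\text{mod}(n_{\rm{o}}, n_{\rm{i}}) == 0$ (this reading reproduces Table 1 exactly), and assuming a model is given as a list of (ic, oc, madd_num) layer tuples; the function and variable names are mine, not the paper's.

        from math import isqrt

        def explore_design_space(model, mult_num):
            # Step 1 (lines 2-8): enumerate array shapes (n_i, n_o) with
            # n_o = floor(mult_num / n_i) such that n_i divides n_o.
            shapes = []
            for i in range(isqrt(mult_num)):
                n_i = i + 1
                n_o = mult_num // n_i
                if n_o % n_i == 0:
                    shapes.append((n_i, n_o))
            # Step 2 (lines 9-18): for each shape, accumulate the effective
            # computation of the model, inflating layers whose channel counts
            # leave the array partially utilized.
            comp_num = []
            for n_i, n_o in shapes:
                comp_total = 0.0
                for ic, oc, madd_num in model:
                    n_iu = 1.0 if ic % n_i == 0 else (ic % n_i) / n_i
                    n_ou = 1.0 if oc % n_o == 0 else (oc % n_o) / n_o
                    comp_total += madd_num / (n_iu * n_ou)
                comp_num.append(comp_total)
            return shapes, comp_num

    With mult_num = 512, the enumeration yields exactly the 512 column of Table 1: (1, 512), (2, 256), (4, 128), (8, 64), (13, 39), (16, 32). For example, (13, 39) is kept because $\lfloor 512/13 \rfloor = 39$ and $\text{mod}(39, 13) = 0$.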

    Table 1  Valid compute-array configurations

    | $({N_{\rm{i}}},{N_{\rm{o}}})$ | 256 | 512 | 1024 | 2048 |
    |---|---|---|---|---|
    | 1 | (1, 256) | (1, 512) | (1, 1024) | (1, 2048) |
    | 2 | (2, 128) | (2, 256) | (2, 512) | (2, 1024) |
    | 3 | (4, 64) | (4, 128) | (4, 256) | (4, 512) |
    | 4 | (6, 42) | (8, 64) | (8, 128) | (8, 256) |
    | 5 | (8, 32) | (13, 39) | (13, 78) | (16, 128) |
    | 6 | (16, 16) | (16, 32) | (16, 64) | (26, 78) |
    | 7 | / | / | (32, 32) | (32, 64) |
    | 8 | / | / | / | (45, 45) |

    Table 2  Model accuracy ablation results (%)

    | Model | Mode | Data type | Top-1 Accuracy | Top-5 Accuracy | Top-1 Diff | Top-5 Diff |
    |---|---|---|---|---|---|---|
    | DeiT-S/16 | Baseline | FP32 | 79.834 | 94.950 | / | / |
    | | OS | FP16 | 79.814 | 94.968 | –0.020 | +0.018 |
    | | OG | FP16 | 79.812 | 94.952 | –0.022 | +0.002 |
    | | OL | FP16 | 79.816 | 94.948 | –0.018 | –0.002 |
    | | ALL | FP16 | 79.814 | 94.984 | –0.020 | +0.034 |
    | DeiT-B/16 | Baseline | FP32 | 81.798 | 95.594 | / | / |
    | | OS | FP16 | 81.842 | 95.620 | +0.044 | +0.026 |
    | | OG | FP16 | 81.832 | 95.610 | +0.034 | +0.016 |
    | | OL | FP16 | 81.828 | 95.600 | +0.030 | +0.006 |
    | | ALL | FP16 | 81.824 | 95.634 | +0.026 | +0.040 |
    | ViT-B/16 | Baseline | FP32 | 84.528 | 97.294 | / | / |
    | | OS | FP16 | 84.522 | 97.262 | –0.006 | –0.032 |
    | | OG | FP16 | 84.520 | 97.306 | –0.008 | +0.012 |
    | | OL | FP16 | 84.526 | 97.292 | –0.002 | –0.002 |
    | | ALL | FP16 | 84.524 | 97.262 | –0.004 | –0.032 |
    | ViT-L/16 | Baseline | FP32 | 85.840 | 97.818 | / | / |
    | | OS | FP16 | 85.800 | 97.816 | –0.040 | –0.002 |
    | | OG | FP16 | 85.818 | 97.818 | –0.022 | 0.000 |
    | | OL | FP16 | 85.820 | 97.818 | –0.020 | 0.000 |
    | | ALL | FP16 | 85.784 | 97.810 | –0.056 | –0.008 |
    | Swin-T | Baseline | FP32 | 81.172 | 95.320 | / | / |
    | | OS | FP16 | 81.152 | 95.304 | –0.020 | –0.016 |
    | | OG | FP16 | 81.156 | 95.320 | –0.016 | 0.000 |
    | | OL | FP16 | 81.164 | 95.322 | –0.008 | +0.002 |
    | | ALL | FP16 | 81.148 | 95.300 | –0.024 | –0.020 |
    | Swin-S | Baseline | FP32 | 83.648 | 97.050 | / | / |
    | | OS | FP16 | 83.642 | 97.020 | –0.006 | –0.030 |
    | | OG | FP16 | 83.646 | 97.080 | –0.002 | +0.030 |
    | | OL | FP16 | 83.638 | 97.040 | –0.010 | –0.010 |
    | | ALL | FP16 | 83.636 | 96.966 | –0.012 | –0.084 |
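    Read together with Table 3's flexibility rows, OS, OG and OL plausibly denote applying the approximate unit to only the Softmax, GELU and LayerNorm operators respectively, with ALL applying it to all three.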

    Table 3  Comparison with related FPGA accelerators

    | Category | | Ref. [20] | Ref. [21] | Ref. [22] | Ref. [23] | This work |
    |---|---|---|---|---|---|---|
    | Flexibility | Softmax | √ | √ | √ | √ | √ |
    | | GELU | × | × | × | √ | √ |
    | | Layer normalization | × | × | × | × | √ |
    | Resource usage | LUT | 17870 | 2564 | 2229 | 324 | 52639 |
    | | Slice Register | 16400 | 2794 | 2243 | 182 | 4403 |
    | | DSP | 0 | 0 | 8 | 1 | 0 |
    | FPGA device | | ZYNQ-7000 | Kintex-7 KC705 | Zynq-7000 ZC706 | Kintex XCKU15P | Virtex XCVU37P |
    | Frequency (MHz) | | 150 | 436 | 154 | 410 | 200 |
    | Data precision (bit) | | 32 | 16 | 16 | 9+12 | 16+32 |
    | Throughput (Gbit/s) | | 2.4 | 3.49 | 1.2 | 0.41 | 34.13 |
    | TPL at 100 MHz (Mbit/s) | | 0.089 | 0.312 | 0.349 | 0.309 | 0.324 |
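    The TPL row appears to be throughput per LUT normalized to a 100 MHz clock; the arithmetic checks out for every column, e.g. for Ref. [20], $2.4\ \text{Gbit/s} \times (100/150)/17870 \approx 0.089$ Mbit/s per LUT, and for this work, $34.13 \times (100/200)/52639 \approx 0.324$ Mbit/s per LUT.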

    Table 4  Comparison with the GPU implementation and FPGA accelerators from other works

    | | GPU | GPU | Ref. [13] | Ref. [24] | Ref. [9] | Ref. [10] | This work | This work |
    |---|---|---|---|---|---|---|---|---|
    | Model | ResNet-50 | ViT-B/16 | ResNet-50 | ResNet-50 | Swin-T | ViT-B/16 | ResNet-50 | ViT-B/16 |
    | Platform | Nvidia V100 | Nvidia V100 | Xilinx KCU1500 | Xilinx XCVU9P | Xilinx Alveo U50 | Xilinx ZC7020 | Xilinx XCVU37P | Xilinx XCVU37P |
    | Process (nm) | 12 | 12 | 20 | 16 | 16 | 28 | 16 | 16 |
    | Frequency (MHz) | 1460 | 1460 | 200 | 125 | 300 | 150 | 200/400 | 200/400 |
    | Data type | FP32 | FP32 | INT8 | INT8 | FP16 | INT8 | INT8 | INT8+FP16 |
    | Input size | 224×224 | 224×224 | 256×256 | 224×224 | 224×224 | 224×224 | 224×224 | 224×224 |
    | GOP | 7.74 | 17.56 | 11.76 | 7.74 | / | 17.56 | 7.74 | 17.56 |
    | DSP | / | / | 2240 | 6005 | 2420 | 220 | 608 | 608 |
    | Latency (ms) | 6.32 | 17.74 | 11.69 | 28.90 | / | 363.64 | 13.12 | 31.09 |
    | Frame rate (FPS) | 158.19 | 56.37 | 85.54 | 34.60 | / | 2.75 | 76.22 | 32.16 |
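    The headline throughput figures follow from Table 4 as GOP × frame rate: $7.74 \times 76.22 \approx 589.94$ GOPS for ResNet-50 and $17.56 \times 32.16 \approx 564.7$ GOPS for ViT-B/16, consistent with the 589.94 GOPS and 564.76 GOPS reported in the abstract.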
  • [1] SIMONYAN K and ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[C]. 3rd International Conference on Learning Representations, San Diego, USA, 2015.
    [2] HE Kaiming, ZHANG Xiangyu, REN Shaoqing, et al. Deep residual learning for image recognition[C]. 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016: 770–778.
    [3] SZEGEDY C, LIU Wei, JIA Yangqing, et al. Going deeper with convolutions[C]. 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, USA, 2015: 1–9.
    [4] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]. The 31st International Conference on Neural Information Processing Systems, Long Beach, USA, 2017: 6000–6010.
    [5] CARION N, MASSA F, SYNNAEVE G, et al. End-to-end object detection with transformers[C]. The 16th European Conference on Computer Vision, Glasgow, UK, 2020: 213–229.
    [6] CHEN Ying and KUANG Cheng. Pedestrian re-identification based on CNN and Transformer multi-scale learning[J]. Journal of Electronics & Information Technology, 2023, 45(6): 2256-2263. doi: 10.11999/JEIT220601. (in Chinese)
    [7] ZHAI Xiaohua, KOLESNIKOV A, HOULSBY N, et al. Scaling vision transformers[C]. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 1204–1213.
    [8] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: transformers for image recognition at scale[C]. 9th International Conference on Learning Representations, 2021.
    [9] WANG Teng, GONG Lei, WANG Chao, et al. ViA: A novel vision-transformer accelerator based on FPGA[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2022, 41(11): 4088–4099. doi: 10.1109/TCAD.2022.3197489.
    [10] NAG S, DATTA G, KUNDU S, et al. ViTA: A vision transformer inference accelerator for edge applications[C]. 2023 IEEE International Symposium on Circuits and Systems, Monterey, USA, 2023: 1–5.
    [11] LI Zhengang, SUN Mengshu, LU A, et al. Auto-ViT-Acc: an FPGA-aware automatic acceleration framework for vision transformer with mixed-scheme quantization[C]. 2022 32nd International Conference on Field-Programmable Logic and Applications, Belfast, UK, 2022: 109–116.
    [12] WU Ruidong, LIU Bing, FU Ping, et al. Convolutional neural network accelerator architecture design for ultimate edge computing scenario[J]. Journal of Electronics & Information Technology, 2023, 45(6): 1933-1943. doi: 10.11999/JEIT220130. (in Chinese)
    [13] NGUYEN D T, JE H, NGUYEN T N, et al. ShortcutFusion: from tensorflow to FPGA-based accelerator with a reuse-aware memory allocation for shortcut data[J]. IEEE Transactions on Circuits and Systems I: Regular Papers, 2022, 69(6): 2477-2489. doi: 10.1109/TCSI.2022.3153288.
    [14] LI Tianyang, ZHANG Fan, FAN Xitian, et al. Unified accelerator for attention and convolution in inference based on FPGA[C]. 2023 IEEE International Symposium on Circuits and Systems, Monterey, USA, 2023: 1–5.
    [15] LOMONT C. Fast inverse square root[EB/OL]. http://lomont.org/papers/2003/InvSqrt.pdf, 2023.
    [16] WU E, ZHANG Xiaoqian, BERMAN D, et al. A high-throughput reconfigurable processing array for neural networks[C]. 27th International Conference on Field Programmable Logic and Applications, Ghent, Belgium, 2017: 1–4.
    [17] FU Yao, WU E, SIRASAO A, et al. Deep learning with INT8 optimization on Xilinx devices[EB/OL]. Xilinx. https://www.origin.xilinx.com/content/dam/xilinx/support/documents/white_papers/wp486-deep-learning-int8.pdf, 2017.
    [18] ZHU Feng, GONG Ruihao, YU Fengwei, et al. Towards unified INT8 training for convolutional neural network[C]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 1966–1976.
    [19] JACOB B, KLIGYS S, CHEN Bo, et al. Quantization and training of neural networks for efficient integer-arithmetic-only inference[C]. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 2704–2713.
    [20] SUN Qiwei, DI Zhixiong, LV Zhengyang, et al. A high speed SoftMax VLSI architecture based on basic-split[C]. 2018 14th IEEE International Conference on Solid-State and Integrated Circuit Technology, Qingdao, China, 2018: 1–3.
    [21] WANG Meiqi, LU Siyuan, ZHU Danyang, et al. A high-speed and low-complexity architecture for softmax function in deep learning[C]. 2018 IEEE Asia Pacific Conference on Circuits and Systems, Chengdu, China, 2018: 223–226.
    [22] GAO Yue, LIU Weiqiang, and LOMBARDI F. Design and implementation of an approximate softmax layer for deep neural networks[C]. 2020 IEEE International Symposium on Circuits and Systems, Seville, Spain, 2020: 1–5.
    [23] LI Yue, CAO Wei, ZHOU Xuegong, et al. A low-cost reconfigurable nonlinear core for embedded DNN applications[C]. 2020 International Conference on Field-Programmable Technology, Maui, USA, 2020: 35–38.
    [24] HADJIS S and OLUKOTUN K. TensorFlow to cloud FPGAs: Tradeoffs for accelerating deep neural networks[C]. 29th International Conference on Field Programmable Logic and Applications, Barcelona, Spain, 2019: 360–366.
Publication history
  • Received: 2023-07-15
  • Revised: 2023-09-27
  • Published online: 2023-10-08
  • Issue date: 2024-06-30
