A Research and Design of Reconfigurable CNN Co-Processor for Edge Computing

LI Wei, CHEN Yi, CHEN Tao, NAN Longmei, DU Yiran

Citation: LI Wei, CHEN Yi, CHEN Tao, NAN Longmei, DU Yiran. A Research and Design of Reconfigurable CNN Co-Processor for Edge Computing[J]. Journal of Electronics & Information Technology, 2024, 46(4): 1499-1512. doi: 10.11999/JEIT230509

doi: 10.11999/JEIT230509
Funds: The Fundamental Enhancement Program Focused Essential Research Projects (2019-JCJQ-ZD-187-00-02)
Article information
    Author biographies:

    LI Wei: male, professor, doctoral supervisor; research interests include cryptographic processor design and ASIC design

    CHEN Yi: male, master's student; research interests include intelligent reconfigurable chip circuits and algorithms

    CHEN Tao: male, associate professor, master's supervisor; research interest is security-oriented application-specific chip design

    NAN Longmei: female, associate professor; research interests include large-scale integrated circuit design and application-specific integrated circuit design

    DU Yiran: male, lecturer; research interest is reconfigurable cryptographic chip design

    Corresponding author:

    CHEN Yi: 18236403130@163.com

  • CLC number: TN492; TP183

  • Abstract: With the development of deep learning, the number of parameters and the computational workload of convolutional neural network (CNN) models have increased sharply, which greatly raises the cost of deploying CNN algorithms on edge devices. To ease the deployment of CNN algorithms on edge devices and to reduce inference latency and energy overhead, this paper proposes a reconfigurable CNN co-processor architecture for edge computing. Building on a channel-wise data-flow scheme, a two-level distributed storage scheme is proposed to remove the power overhead and performance loss caused by large-scale on-chip data movement and by the heavy data exchange between PE units during reconfigurable computation. To avoid a complex data interconnection network in the acceleration array and to reduce control complexity, a flexible local memory-access mechanism and an address-translation-based padding mechanism are proposed, which allow the co-processor to flexibly perform standard convolution, depthwise separable convolution, pooling, and fully connected operations of arbitrary size and improve the flexibility of the hardware architecture. The proposed co-processor contains 256 PE units and 176 kB of on-chip private memory. After logic synthesis and place-and-route in a 55 nm CMOS process at the TT corner (25 °C, 1.2 V), the maximum clock frequency reaches 328 MHz and the implemented area is 4.41 mm². At an operating frequency of 320 MHz, the co-processor achieves a peak performance of 163.8 GOPs and an area efficiency of 37.14 GOPs/mm²; its energy efficiency when running the LeNet-5 and MobileNet networks is 210.7 GOPs/W and 340.08 GOPs/W respectively, which meets the energy-efficiency and performance requirements of edge intelligent computing scenarios.
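
    As a quick consistency check on the headline figures, assuming each of the 256 PE units completes one multiply-accumulate (two operations) per cycle at 320 MHz:
    $256 \times 2 \times 320\ {\mathrm{MHz}} = 163.84\ {\mathrm{GOPs}}, \qquad 163.8\ {\mathrm{GOPs}} / 4.41\ {\mathrm{mm}}^2 \approx 37.14\ {\mathrm{GOPs/mm}}^2$
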
  • Figure 1  Schematic of standard 3D convolution

    Figure 2  Schematic of depthwise separable convolution

    Figure 3  Framework of the channel-wise processing data flow

    Figure 4  Data-flow mapping of standard 3D convolution

    Figure 5  Data-flow mapping of depthwise convolution

    Figure 6  Data-flow mapping of the fully connected operation

    Figure 7  Memory-access scheme for standard 3D convolution

    Figure 8  Data-flow mapping of pointwise convolution

    Figure 9  Memory-access scheme for pointwise convolution

    Figure 10  Padding example

    Figure 11  Schematic of 8-bit symmetric quantization

    Figure 12  Hardware architecture of the CNN co-processor

    Figure 13  Hardware structure of the multiply-accumulate unit

    Figure 14  Hardware structure of the accumulation unit and post-processing unit

    Figure 15  Hardware structure of the max-pooling unit

    Figure 16  Resource utilization breakdown of the CNN co-processor

    Figure 17  Power breakdown of the LeNet-5 network

    Figure 18  Power breakdown of the MobileNet network

    Algorithm 1  Standard 3D convolution
     Input: IC, OC, OH, OW, KH, KW, If, K, S
     Output: Of
     FOR no=0; no<OC; no++ {
     FOR ni=0; ni<IC; ni++ {
     FOR Or=0; Or<OH; Or++ {
     FOR Oc=0; Oc<OW; Oc++ {
     FOR i=0; i<KH; i++ {
     FOR j=0; j<KW; j++ {
     Of[no][Or][Oc] += K[no][ni][i][j]×If[ni][S×Or+i][S×Oc+j];
     } } } } } }
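
     For reference, a minimal runnable C version of the Algorithm 1 loop nest is sketched below. The fixed dimensions and test data are hypothetical, chosen only to keep the example self-contained, and the input feature map If is assumed to be pre-padded so every access stays in bounds.

     #include <stdio.h>

     /* Illustrative sizes only (not taken from the paper). */
     #define IC 2                    /* input channels      */
     #define OC 2                    /* output channels     */
     #define OH 3                    /* output height       */
     #define OW 3                    /* output width        */
     #define KH 3                    /* kernel height       */
     #define KW 3                    /* kernel width        */
     #define S  1                    /* stride              */
     #define IH (S*(OH-1)+KH)        /* padded input height */
     #define IW (S*(OW-1)+KW)        /* padded input width  */

     static int If[IC][IH][IW];      /* input feature map   */
     static int K[OC][IC][KH][KW];   /* convolution kernels */
     static int Of[OC][OH][OW];      /* output feature map  */

     int main(void) {
         /* Simple deterministic test data. */
         for (int ni = 0; ni < IC; ni++)
             for (int r = 0; r < IH; r++)
                 for (int c = 0; c < IW; c++)
                     If[ni][r][c] = ni + r + c;
         for (int no = 0; no < OC; no++)
             for (int ni = 0; ni < IC; ni++)
                 for (int i = 0; i < KH; i++)
                     for (int j = 0; j < KW; j++)
                         K[no][ni][i][j] = 1;

         /* The six nested loops of Algorithm 1. */
         for (int no = 0; no < OC; no++)
             for (int ni = 0; ni < IC; ni++)
                 for (int Or = 0; Or < OH; Or++)
                     for (int Oc = 0; Oc < OW; Oc++)
                         for (int i = 0; i < KH; i++)
                             for (int j = 0; j < KW; j++)
                                 Of[no][Or][Oc] += K[no][ni][i][j] *
                                                   If[ni][S*Or + i][S*Oc + j];

         printf("Of[0][0][0] = %d\n", Of[0][0][0]);
         return 0;
     }
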
    Algorithm 2  Channel-wise standard 3D convolution
     Input: IC, OC, OH, OW, KH, KW, If, K, S, N
     Output: Of
     FOR no=0; no<$\left\lceil {{\mathrm{OC}}/N} \right\rceil $; no++ {
     FOR ni=0; ni<$\left\lceil {{\mathrm{IC}}/N} \right\rceil $; ni++ {
     FOR Or=0; Or<OH; Or++ {
     FOR Oc=0; Oc<OW; Oc++ {
     FOR i=0; i<KH; i++{
     FOR j=0; j<KW; j++ {
     Of[no×N][Or][Oc] += K[no×N][ni×N][i][j]×If[ni×N][S×Or+i]
     [S×Oc+j]+ K[no×N][ni×N+1][i][j]×If[ni×N+1][S×Or+i]
     [S×Oc+j]+…+K[no×N][ni×N+N–1][i][j]×If[ni×N+N–1]
     [S×Or+i][S×Oc+j];
     Of[no×N+1][Or][Oc] +=
     K[no×N+1][ni×N][i][j]×If[ni×N][S×Or+i][S×Oc+j]+
     K[no×N+1][ni×N+1][i][j]×If[ni×N+1][S×Or+i][S×Oc+j]+…+
     K[no×N+1][ni×N+N–1][i][j]×If[ni×N+N–1][S×Or+i][S×Oc+j];
     …
     Of[no×N+N–1][Or][Oc] +=
     K[no×N+N–1][ni×N][i][j]×If[ni×N][S×Or+i][S×Oc+j] +
     K[no×N+N–1][ni×N+1][i][j]×If[ni×N+1][S×Or+i]
     [S×Oc+j]+…+
     K[no×N+N–1][ni×N+N–1][i][j]×If[ni×N+N–1][S×Or+i][S×Oc+j];
    } } } } } } }
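
     The long unrolled body of Algorithm 2 is equivalent to two extra inner loops of length N over one output-channel group and one input-channel group. A C sketch of that channel-tiled accumulation is given below (sizes again hypothetical, with OC and IC assumed to be multiples of N); with N = 1 it degenerates to Algorithm 1.

     #include <stdio.h>

     /* Channel-tiled convolution in the spirit of Algorithm 2: output and input
      * channels are processed in groups of N, so the N x N products of one
      * kernel tap (i, j) are accumulated together.  Sizes are illustrative;
      * If is assumed to be pre-padded. */
     #define N  2
     #define IC 4
     #define OC 4
     #define OH 3
     #define OW 3
     #define KH 3
     #define KW 3
     #define S  1
     #define IH (S*(OH-1)+KH)
     #define IW (S*(OW-1)+KW)

     static int If[IC][IH][IW];
     static int K[OC][IC][KH][KW];
     static int Of[OC][OH][OW];

     int main(void) {
         for (int ni = 0; ni < IC; ni++)
             for (int r = 0; r < IH; r++)
                 for (int c = 0; c < IW; c++)
                     If[ni][r][c] = ni + r + c;
         for (int no = 0; no < OC; no++)
             for (int ni = 0; ni < IC; ni++)
                 for (int i = 0; i < KH; i++)
                     for (int j = 0; j < KW; j++)
                         K[no][ni][i][j] = 1;

         for (int no = 0; no < OC / N; no++)          /* output-channel groups */
             for (int ni = 0; ni < IC / N; ni++)      /* input-channel groups  */
                 for (int Or = 0; Or < OH; Or++)
                     for (int Oc = 0; Oc < OW; Oc++)
                         for (int i = 0; i < KH; i++)
                             for (int j = 0; j < KW; j++)
                                 /* Unrolled body of Algorithm 2: N outputs, each
                                  * accumulating contributions from N inputs.   */
                                 for (int po = 0; po < N; po++)
                                     for (int pi = 0; pi < N; pi++)
                                         Of[no*N + po][Or][Oc] +=
                                             K[no*N + po][ni*N + pi][i][j] *
                                             If[ni*N + pi][S*Or + i][S*Oc + j];

         printf("Of[0][0][0] = %d\n", Of[0][0][0]);
         return 0;
     }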

    Table 1  Energy-efficiency comparison of different hardware acceleration platforms

    Platform CPU GPU This work
    Process (nm) 12 12 55
    Precision INT8 INT8 INT8
    Test model LeNet-5 MobileNet LeNet-5 MobileNet LeNet-5 MobileNet
    Power (W) 3.4 6.3 3.9 25.1 0.138 0.279
    Energy efficiency (GOPs/W) 2.21 8.97 2.75 19.85 210.7 340.0
    Recognition rate (Images/s) 8771 602 12345 5617 41851 1272

    Table 2  Performance comparison of different CNN accelerators

    Design: DSIP (JSSC 2018)[17], ZASCAD (TC 2020)[18], AICAS 2020[20], CARLA (TCAS-I 2021)[19], IECA (TCAS-I 2021)[21], This work
    Process (nm) 65 65 40 65 55 55
    Measurement Chip Post-Layout Post-Layout Post-Layout Chip Post-Layout
    On-chip SRAM (KB) 139.6 36.9 44.3 85.5 109.0 176.0
    Voltage (V) 1.2 1.0 1.2
    Number of PE units 64 192 144 192 168 256
    Clock frequency (MHz) 250 200 750 200 250 320
    Peak performance (GOPs) 32 76.8 216 77.4 84.0 163.8
    Area (mm²) 12.25 6 8.04 6.2 2.75 4.41
    Area efficiency(1) (GOPs/mm²) 6.48 15.12 19.53 14.75 30.55 37.14
    (1) Normalized by process ratio: process/55 nm

    Table 3  Energy-efficiency comparison of different CNN accelerators

    Design: Eyeriss (JSSC 2017)[10], Eyeriss v2 (JETCAS 2019)[22], AICAS 2021[23], CARLA (TCAS-I 2021)[19], This work
    Process (nm) 65 65 65 65 55
    Measurement Chip Post-Layout Post-Layout Post-Layout Post-Layout
    On-chip SRAM (KB) 181.5 246 216 85.5 176
    Quantization precision (bit) 16 8 8 16 8
    Voltage (V) 1.0 1.2 1.2
    Clock frequency (MHz) 200 200 200 320
    Test model AlexNet MobileNet MobileNet VGG-16 MobileNet
    Power (mW) 278 247 279.6
    Energy efficiency (GOPs/W) 166.2 193.7 45.8 313.4 340.0
  • [1] FIROUZI F, FARAHANI B, and MARINŠEK A. The convergence and interplay of edge, fog, and cloud in the AI-driven Internet of Things (IoT)[J]. Information Systems, 2022, 107: 101840. doi: 10.1016/j.is.2021.101840.
    [2] ALAM F, ALMAGHTHAWI A, KATIB I, et al. IResponse: An AI and IoT-enabled framework for autonomous COVID-19 pandemic management[J]. Sustainability, 2021, 13(7): 3797. doi: 10.3390/su13073797.
    [3] CHAUDHARY V, KAUSHIK A, FURUKAWA H, et al. Review-Towards 5th generation AI and IoT driven sustainable intelligent sensors based on 2D MXenes and borophene[J]. ECS Sensors Plus, 2022, 1(1): 013601. doi: 10.1149/2754-2726/ac5ac6.
    [4] KRIZHEVSKY A, SUTSKEVER I, and HINTON G E. ImageNet classification with deep convolutional neural networks[J]. Communications of the ACM, 2017, 60(6): 84–90. doi: 10.1145/3065386.
    [5] LU Wenyan, YAN Guihai, LI Jiajun, et al. FlexFlow: A flexible dataflow accelerator architecture for convolutional neural networks[C]. 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), Austin, USA, 2017: 553–564. doi: 10.1109/HPCA.2017.29.
    [6] PARK J S, PARK C, KWON S, et al. A multi-mode 8k-MAC HW-utilization-aware neural processing unit with a unified multi-precision Datapath in 4-nm flagship mobile SoC[J]. IEEE Journal of Solid-State Circuits, 2023, 58(1): 189–202. doi: 10.1109/JSSC.2022.3205713.
    [7] GOKHALE V, JIN J, DUNDAR A, et al. A 240 G-ops/s mobile coprocessor for deep neural networks[C]. IEEE Conference on Computer Vision and Pattern Recognition Workshops, Columbus, USA, 2014: 682–687. doi: 10.1109/CVPRW.2014.106.
    [8] DU Zidong, FASTHUBER R, CHEN Tianshi, et al. ShiDianNao: Shifting vision processing closer to the sensor[C]. 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), Portland, USA, 2015: 92–104. doi: 10.1145/2749469.2750389.
    [9] ZHANG Chen, LI Peng, SUN Guangyu, et al. Optimizing FPGA-based accelerator design for deep convolutional neural networks[C]. The 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, USA, 2015: 161–170. doi: 10.1145/2684746.2689060.
    [10] CHEN Y H, KRISHNA T, EMER J S, et al. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks[J]. IEEE Journal of Solid-State Circuits, 2017, 52(1): 127–138. doi: 10.1109/JSSC.2016.2616357.
    [11] HOWARD A G, ZHU Menglong, CHEN Bo, et al. MobileNets: Efficient convolutional neural networks for mobile vision applications[EB/OL]. https://arxiv.org/abs/1704.04861, 2017.
    [12] DING Caiwen, WANG Shuo, LIU Ning, et al. REQ-YOLO: A resource-aware, efficient quantization framework for object detection on FPGAs[C]. The 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Seaside, USA, 2019: 33–42. doi: 10.1145/3289602.3293904.
    [13] LE M Q, NGUYEN Q T, DAO V H, et al. CNN quantization for anatomical landmarks classification from upper gastrointestinal endoscopic images on Edge devices[C]. 2022 IEEE Ninth International Conference on Communications and Electronics (ICCE), Nha Trang, Vietnam, 2022: 389–394. doi: 10.1109/ICCE55644.2022.9852098.
    [14] KWAK J, KIM K, LEE S S, et al. Quantization aware training with order strategy for CNN[C]. 2022 IEEE International Conference on Consumer Electronics-Asia (ICCE-Asia), Yeosu, Republic of Korea, 2022: 1–3. doi: 10.1109/ICCE-Asia57006.2022.9954693.
    [15] JACOB B, KLIGYS S, CHEN Bo, et al. Quantization and training of neural networks for efficient integer-arithmetic-only inference[C]. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 2704–2713. doi: 10.1109/CVPR.2018.00286.
    [16] LECUN Y, BOTTOU L, BENGIO Y, et al. Gradient-based learning applied to document recognition[J]. Proceedings of the IEEE, 1998, 86(11): 2278–2324. doi: 10.1109/5.726791.
    [17] JO J, CHA S, RHO D, et al. DSIP: A scalable inference accelerator for convolutional neural networks[J]. IEEE Journal of Solid-State Circuits, 2018, 53(2): 605–618. doi: 10.1109/JSSC.2017.2764045.
    [18] ARDAKANI A, CONDO C, and GROSS W J. Fast and efficient convolutional accelerator for edge computing[J]. IEEE Transactions on Computers, 2020, 69(1): 138–152. doi: 10.1109/TC.2019.2941875.
    [19] AHMADI M, VAKILI S, and LANGLOIS J M P. CARLA: A convolution accelerator with a reconfigurable and low-energy architecture[J]. IEEE Transactions on Circuits and Systems I: Regular Papers, 2021, 68(8): 3184–3196. doi: 10.1109/TCSI.2021.3066967.
    [20] LU Yi, WU Yilin, and HUANG J D. A coarse-grained dual-convolver based CNN accelerator with high computing resource utilization[C]. 2020 2nd IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS), Genova, Italy, 2020: 198–202. doi: 10.1109/AICAS48895.2020.9073835.
    [21] HUANG Boming, HUAN Yuxiang, CHU Haoming, et al. IECA: An in-execution configuration CNN accelerator with 30.55 GOPS/mm² area efficiency[J]. IEEE Transactions on Circuits and Systems I: Regular Papers, 2021, 68(11): 4672–4685. doi: 10.1109/TCSI.2021.3108762.
    [22] CHEN Y H, YANG T J, EMER J, et al. Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices[J]. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 2019, 9(2): 292–308. doi: 10.1109/JETCAS.2019.2910232.
    [23] HOSSAIN M D S and SAVIDIS I. Energy efficient computing with heterogeneous DNN accelerators[C]. 2021 IEEE 3rd International Conference on Artificial Intelligence Circuits and Systems (AICAS), Washington, USA, 2021: 1–4. doi: 10.1109/AICAS51828.2021.9458474.
Publication history
  • Received: 2023-05-29
  • Revised: 2023-12-04
  • Available online: 2023-12-25
  • Issue published: 2024-04-24
