
A Research and Design of Reconfigurable CNN Co-Processor for Edge Computing

LI Wei, CHEN Yi, CHEN Tao, NAN Longmei, DU Yiran

Citation: LI Wei, CHEN Yi, CHEN Tao, NAN Longmei, DU Yiran. A Research and Design of Reconfigurable CNN Co-Processor for Edge Computing[J]. Journal of Electronics & Information Technology, 2024, 46(4): 1499-1512. doi: 10.11999/JEIT230509


doi: 10.11999/JEIT230509
Funds: The Fundamental Enhancement Program Focused Essential Research Projects (2019-JCJQ-ZD-187-00-02)
Details
    Author biographies:

    LI Wei: Male, Professor, doctoral supervisor. His research interests include cryptographic processor design and application-specific integrated circuit (ASIC) design.

    CHEN Yi: Male, master's student. His research interests include intelligent reconfigurable chip circuits and algorithms.

    CHEN Tao: Male, Associate Professor, master's supervisor. His research interest is the design of dedicated security chips.

    NAN Longmei: Female, Associate Professor. Her research interests include large-scale integrated circuit design and application-specific integrated circuit design.

    DU Yiran: Male, Lecturer. His research interest is reconfigurable cryptographic chip design.

    Corresponding author:

    CHEN Yi, 18236403130@163.com

  • CLC number: TN492; TP183

  • Abstract: With the development of deep learning, the parameter counts and computational workloads of Convolutional Neural Network (CNN) models have increased sharply, which greatly raises the cost of deploying CNN algorithms on edge devices. To reduce the deployment difficulty, inference latency, and energy overhead of CNN algorithms on edge devices, this paper proposes a reconfigurable CNN co-processor architecture for edge computing. Built on a channel-wise dataflow, the proposed two-level distributed storage scheme addresses the power overhead and performance degradation caused by large-scale on-chip data movement and by heavy data transfers between PE units during reconfigurable operation. To avoid a complex data-interconnection network in the acceleration array and to reduce control complexity, a flexible local memory-access mechanism and an address-translation-based padding mechanism are proposed, allowing the co-processor to flexibly perform standard convolution, depthwise separable convolution, pooling, and fully connected operations of arbitrary size and improving the flexibility of the hardware architecture. The proposed co-processor contains 256 PE units and 176 kB of on-chip private memory. After logic synthesis and place-and-route in a 55 nm CMOS process (TT corner, 25 °C, 1.2 V), the maximum clock frequency reaches 328 MHz and the implemented area is 4.41 mm². At a working frequency of 320 MHz, the co-processor achieves a peak performance of 163.8 GOPs and an area efficiency of 37.14 GOPs/mm², and its energy efficiency when running the LeNet-5 and MobileNet networks is 210.7 GOPs/W and 340.08 GOPs/W, respectively, meeting the energy-efficiency and performance requirements of edge intelligent-computing scenarios.
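As a quick sanity check of the headline figures, the peak performance and area efficiency follow directly from the reported PE count, frequency, and area, assuming each PE completes one multiply-accumulate (counted as 2 operations) per cycle; that counting convention is an assumption here, not stated explicitly in the abstract.

    # Sketch: reproduce the reported peak performance and area efficiency.
    pe_count = 256              # PE units (from the abstract)
    ops_per_pe_per_cycle = 2    # 1 MAC = multiply + add (assumed convention)
    freq_hz = 320e6             # working frequency (from the abstract)
    area_mm2 = 4.41             # post-layout area (from the abstract)

    peak_gops = pe_count * ops_per_pe_per_cycle * freq_hz / 1e9
    area_efficiency = peak_gops / area_mm2

    print(peak_gops)            # ~163.84, matching the reported 163.8 GOPs
    print(area_efficiency)      # ~37.15, matching the reported 37.14 GOPs/mm²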
  • Figure 1  Illustration of standard 3D convolution
    Figure 2  Illustration of depthwise separable convolution
    Figure 3  Framework of the channel-wise processing dataflow
    Figure 4  Dataflow mapping of standard 3D convolution
    Figure 5  Dataflow mapping of depthwise convolution
    Figure 6  Dataflow mapping of the fully connected operation
    Figure 7  Memory-access scheme for standard 3D convolution
    Figure 8  Dataflow mapping of pointwise convolution
    Figure 9  Memory-access scheme for pointwise convolution
    Figure 10  Padding example
    Figure 11  Illustration of 8-bit symmetric quantization
    Figure 12  Hardware architecture of the CNN co-processor
    Figure 13  Hardware structure of the multiply-accumulate unit
    Figure 14  Hardware structure of the accumulation and post-processing units
    Figure 15  Hardware structure of the max-pooling unit
    Figure 16  Resource-usage breakdown of the CNN co-processor
    Figure 17  Power breakdown for the LeNet-5 network
    Figure 18  Power breakdown for the MobileNet network
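Figure 11 refers to 8-bit symmetric quantization. For orientation only, the sketch below shows a common per-tensor symmetric INT8 scheme consistent with the INT8 precision reported in Table 1; the scale choice (largest absolute value mapped to 127) and the function names are assumptions, not details taken from the paper.

    # Minimal sketch of per-tensor symmetric 8-bit quantization (zero point = 0).
    import numpy as np

    def quantize_symmetric_int8(x):
        qmax = 127                                         # signed symmetric range [-127, 127]
        scale = max(float(np.abs(x).max()), 1e-12) / qmax  # assumed per-tensor scale
        q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

In such a scheme the INT8 products are accumulated in a wider register and rescaled afterwards, which is the usual reason an INT8 datapath pairs narrow multipliers with wide accumulators.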

    Algorithm 1 Standard 3D convolution
     Input: IC, OC, OH, OW, KH, KW, If, K, S
     Output: Of
     FOR no=0; no<OC; no++ {
     FOR ni=0; ni<IC; ni++ {
     FOR Or=0; Or<OH; Or++ {
     FOR Oc=0; Oc<OW; Oc++ {
     FOR i=0; i<KH; i++ {
     FOR j=0; j<KW; j++ {
     Of[no][Or][Oc] += K[no][ni][i][j]×If[ni][S×Or+i][S×Oc+j];
     } } } } } }
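For readers who prefer an executable reference, the loop nest of Algorithm 1 translates directly into the sketch below; the (channel, height, width) tensor layout and the 32-bit accumulator are assumptions, since the listing itself does not fix them.

    # Sketch: direct translation of Algorithm 1 (standard 3D convolution).
    # If: input feature map, shape (IC, IH, IW); K: kernels, shape (OC, IC, KH, KW); S: stride.
    import numpy as np

    def conv3d_standard(If, K, S):
        OC, IC, KH, KW = K.shape
        _, IH, IW = If.shape
        OH = (IH - KH) // S + 1
        OW = (IW - KW) // S + 1
        Of = np.zeros((OC, OH, OW), dtype=np.int32)   # wide accumulator for INT8 operands
        for no in range(OC):
            for ni in range(IC):
                for Or in range(OH):
                    for Oc in range(OW):
                        for i in range(KH):
                            for j in range(KW):
                                Of[no, Or, Oc] += int(K[no, ni, i, j]) * int(If[ni, S * Or + i, S * Oc + j])
        return Of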
    Algorithm 2 Standard 3D convolution with channel-wise processing
     Input: IC, OC, OH, OW, KH, KW, If, K, S, N
     Output: Of
     FOR no=0; no<⌈OC/N⌉; no++ {
     FOR ni=0; ni<⌈IC/N⌉; ni++ {
     FOR Or=0; Or<OH; Or++ {
     FOR Oc=0; Oc<OW; Oc++ {
     FOR i=0; i<KH; i++ {
     FOR j=0; j<KW; j++ {
     Of[no×N][Or][Oc] += K[no×N][ni×N][i][j]×If[ni×N][S×Or+i][S×Oc+j]
       + K[no×N][ni×N+1][i][j]×If[ni×N+1][S×Or+i][S×Oc+j]
       + … + K[no×N][ni×N+N-1][i][j]×If[ni×N+N-1][S×Or+i][S×Oc+j];
     Of[no×N+1][Or][Oc] += K[no×N+1][ni×N][i][j]×If[ni×N][S×Or+i][S×Oc+j]
       + K[no×N+1][ni×N+1][i][j]×If[ni×N+1][S×Or+i][S×Oc+j]
       + … + K[no×N+1][ni×N+N-1][i][j]×If[ni×N+N-1][S×Or+i][S×Oc+j];
     …
     Of[no×N+N-1][Or][Oc] += K[no×N+N-1][ni×N][i][j]×If[ni×N][S×Or+i][S×Oc+j]
       + K[no×N+N-1][ni×N+1][i][j]×If[ni×N+1][S×Or+i][S×Oc+j]
       + … + K[no×N+N-1][ni×N+N-1][i][j]×If[ni×N+N-1][S×Or+i][S×Oc+j];
     } } } } } }
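The same computation with the channel dimensions consumed in blocks of N, as in Algorithm 2, can be sketched as follows; the handling of channel counts that are not multiples of N is an assumption added for completeness. Only the loop grouping changes, so the result equals that of the plain loop nest, e.g. np.array_equal(conv3d_standard(If, K, S), conv3d_channel_blocked(If, K, S, 16)) holds for random INT8 tensors.

    # Sketch: Algorithm 2, i.e. Algorithm 1 with the input/output channels
    # processed in blocks of N per pass, mirroring the channel-wise processing order.
    import math
    import numpy as np

    def conv3d_channel_blocked(If, K, S, N):
        OC, IC, KH, KW = K.shape
        _, IH, IW = If.shape
        OH = (IH - KH) // S + 1
        OW = (IW - KW) // S + 1
        Of = np.zeros((OC, OH, OW), dtype=np.int32)
        for no in range(math.ceil(OC / N)):              # output-channel blocks
            for ni in range(math.ceil(IC / N)):          # input-channel blocks
                for Or in range(OH):
                    for Oc in range(OW):
                        for i in range(KH):
                            for j in range(KW):
                                for po in range(no * N, min((no + 1) * N, OC)):
                                    for pi in range(ni * N, min((ni + 1) * N, IC)):
                                        Of[po, Or, Oc] += int(K[po, pi, i, j]) * int(If[pi, S * Or + i, S * Oc + j])
        return Of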

    Table 1  Energy-efficiency comparison of different hardware acceleration platforms

    (Columns, in order: CPU, GPU, This work; rows with six values list LeNet-5 and then MobileNet for each platform)
    Process (nm): 12, 12, 55
    Precision: INT8, INT8, INT8
    Test model: LeNet-5, MobileNet, LeNet-5, MobileNet, LeNet-5, MobileNet
    Power (W): 3.4, 6.3, 3.92, 5.1, 0.138, 0.279
    Energy efficiency (GOPs/W): 2.21, 8.97, 2.75, 19.85, 210.7, 340.0
    Recognition rate (images/s): 877, 1602, 1234, 5561, 74185, 11272

    Table 2  Performance comparison of different CNN accelerators

    (Columns, in order: DSIP[17] JSSC 2017; ZASCAD[18] TOCC 2020; AICAS 2020[20]; CARLA[19] TCAS-I 2021; IECA[21] TCAS-I 2021; This work)
    Process (nm): 65, 65, 40, 65, 55, 55
    Measurement: Chip, Post-Layout, Post-Layout, Post-Layout, Chip, Post-Layout
    On-chip SRAM (KB): 139.6, 36.9, 44.3, 85.5, 109.0, 176.0
    Supply voltage (V): 1.2, 1.0, 1.2
    Number of PEs: 64, 192, 144, 192, 168, 256
    Clock frequency (MHz): 250, 200, 750, 200, 250, 320
    Peak performance (GOPs): 32, 76.8, 216, 77.4, 84.0, 163.8
    Area (mm²): 12.25, 6, 8.04, 6.2, 2.75, 4.41
    Area efficiency(1) (GOPs/mm²): 6.48, 15.12, 19.53, 14.75, 30.55, 37.14
    (1) Process scaling: process/55 nm
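One reading of footnote (1), consistent with the listed values, is that area efficiency is normalized to 55 nm by multiplying the raw GOPs/mm² figure by (process / 55 nm); this interpretation is an inference from the numbers, not an explicit statement in the table.

    # Checking the footnote-(1) scaling against three of the listed designs.
    carla = 77.4 / 6.2 * (65 / 55)        # ≈ 14.75, as listed for CARLA[19]
    ieca = 84.0 / 2.75 * (55 / 55)        # ≈ 30.55, as listed for IECA[21]
    this_work = 163.8 / 4.41 * (55 / 55)  # ≈ 37.14, as listed for this work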

    Table 3  Energy-efficiency comparison of different CNN accelerators

    (Columns, in order: Eyeriss[10] JSSC 2017; Eyeriss V2[22] JETCAS 2019; AICAS 2021[23]; CARLA[19] TCAS-I 2021; This work)
    Process (nm): 65, 65, 65, 65, 55
    Measurement: Chip, Post-Layout, Post-Layout, Post-Layout, Post-Layout
    On-chip SRAM (KB): 181.5, 246, 216, 85.5, 176
    Quantization precision (bit): 16, 8, 8, 16, 8
    Supply voltage (V): 1.0, 1.2, 1.2
    Clock frequency (MHz): 200, 200, 200, 320
    Test model: AlexNet, MobileNet, MobileNet, VGG-16, MobileNet
    Power (mW): 278, 247, 279.6
    Energy efficiency (GOPs/W): 166.2, 193.7, 45.8, 313.4, 340.0
  • [1] FIROUZI F, FARAHANI B, and MARINŠEK A. The convergence and interplay of edge, fog, and cloud in the AI-driven Internet of Things (IoT)[J]. Information Systems, 2022, 107: 101840. doi: 10.1016/j.is.2021.101840.
    [2] ALAM F, ALMAGHTHAWI A, KATIB I, et al. IResponse: An AI and IoT-enabled framework for autonomous COVID-19 pandemic management[J]. Sustainability, 2021, 13(7): 3797. doi: 10.3390/su13073797.
    [3] CHAUDHARY V, KAUSHIK A, FURUKAWA H, et al. Review-Towards 5th generation AI and IoT driven sustainable intelligent sensors based on 2D MXenes and borophene[J]. ECS Sensors Plus, 2022, 1(1): 013601. doi: 10.1149/2754-2726/ac5ac6.
    [4] KRIZHEVSKY A, SUTSKEVER I, and HINTON G E. ImageNet classification with deep convolutional neural networks[J]. Communications of the ACM, 2017, 60(6): 84–90. doi: 10.1145/3065386.
    [5] LU Wenyan, YAN Guihai, LI Jiajun, et al. FlexFlow: A flexible dataflow accelerator architecture for convolutional neural networks[C]. 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), Austin, USA, 2017: 553–564. doi: 10.1109/HPCA.2017.29.
    [6] PARK J S, PARK C, KWON S, et al. A multi-mode 8k-MAC HW-utilization-aware neural processing unit with a unified multi-precision Datapath in 4-nm flagship mobile SoC[J]. IEEE Journal of Solid-State Circuits, 2023, 58(1): 189–202. doi: 10.1109/JSSC.2022.3205713.
    [7] GOKHALE V, JIN J, DUNDAR A, et al. A 240 G-ops/s mobile coprocessor for deep neural networks[C]. IEEE Conference on Computer Vision and Pattern Recognition Workshops, Columbus, USA, 2014: 682–687. doi: 10.1109/CVPRW.2014.106.
    [8] DU Zidong, FASTHUBER R, CHEN Tianshi, et al. ShiDianNao: Shifting vision processing closer to the sensor[C]. 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), Portland, USA, 2015: 92–104. doi: 10.1145/2749469.2750389.
    [9] ZHANG Chen, LI Peng, SUN Guangyu, et al. Optimizing FPGA-based accelerator design for deep convolutional neural networks[C]. The 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, USA, 2015: 161–170. doi: 10.1145/2684746.2689060.
    [10] CHEN Y H, KRISHNA T, EMER J S, et al. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks[J]. IEEE Journal of Solid-State Circuits, 2017, 52(1): 127–138. doi: 10.1109/JSSC.2016.2616357.
    [11] HOWARD A G, ZHU Menglong, CHEN Bo, et al. MobileNets: Efficient convolutional neural networks for mobile vision applications[EB/OL]. https://arxiv.org/abs/1704.04861, 2017.
    [12] DING Caiwen, WANG Shuo, LIU Ning, et al. REQ-YOLO: A resource-aware, efficient quantization framework for object detection on FPGAs[C]. The 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Seaside, USA, 2019: 33–42. doi: 10.1145/3289602.3293904.
    [13] LE M Q, NGUYEN Q T, DAO V H, et al. CNN quantization for anatomical landmarks classification from upper gastrointestinal endoscopic images on Edge devices[C]. 2022 IEEE Ninth International Conference on Communications and Electronics (ICCE), Nha Trang, Vietnam, 2022: 389–394. doi: 10.1109/ICCE55644.2022.9852098.
    [14] KWAK J, KIM K, LEE S S, et al. Quantization aware training with order strategy for CNN[C]. 2022 IEEE International Conference on Consumer Electronics-Asia (ICCE-Asia), Yeosu, Republic of Korea, 2022: 1–3. doi: 10.1109/ICCE-Asia57006.2022.9954693.
    [15] JACOB B, KLIGYS S, CHEN Bo, et al. Quantization and training of neural networks for efficient integer-arithmetic-only inference[C]. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 2704–2713. doi: 10.1109/CVPR.2018.00286.
    [16] LECUN Y, BOTTOU L, BENGIO Y, et al. Gradient-based learning applied to document recognition[J]. Proceedings of the IEEE, 1998, 86(11): 2278–2324. doi: 10.1109/5.726791.
    [17] JO J, CHA S, RHO D, et al. DSIP: A scalable inference accelerator for convolutional neural networks[J]. IEEE Journal of Solid-State Circuits, 2018, 53(2): 605–618. doi: 10.1109/JSSC.2017.2764045.
    [18] ARDAKANI A, CONDO C, and GROSS W J. Fast and efficient convolutional accelerator for edge computing[J]. IEEE Transactions on Computers, 2020, 69(1): 138–152. doi: 10.1109/TC.2019.2941875.
    [19] AHMADI M, VAKILI S, and LANGLOIS J M P. CARLA: A convolution accelerator with a reconfigurable and low-energy architecture[J]. IEEE Transactions on Circuits and Systems I: Regular Papers, 2021, 68(8): 3184–3196. doi: 10.1109/TCSI.2021.3066967.
    [20] LU Yi, WU Yilin, and HUANG J D. A coarse-grained dual-convolver based CNN accelerator with high computing resource utilization[C]. 2020 2nd IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS), Genova, Italy, 2020: 198–202. doi: 10.1109/AICAS48895.2020.9073835.
    [21] HUANG Boming, HUAN Yuxiang, CHU Haoming, et al. IECA: An in-execution configuration CNN accelerator with 30.55 GOPS/mm² area efficiency[J]. IEEE Transactions on Circuits and Systems I: Regular Papers, 2021, 68(11): 4672–4685. doi: 10.1109/TCSI.2021.3108762.
    [22] CHEN Y H, YANG T J, EMER J, et al. Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices[J]. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 2019, 9(2): 292–308. doi: 10.1109/JETCAS.2019.2910232.
    [23] HOSSAIN M D S and SAVIDIS I. Energy efficient computing with heterogeneous DNN accelerators[C]. 2021 IEEE 3rd International Conference on Artificial Intelligence Circuits and Systems (AICAS), Washington, USA, 2021: 1–4. doi: 10.1109/AICAS51828.2021.9458474.
Publication history
  • Received: 2023-05-29
  • Revised: 2023-12-04
  • Published online: 2023-12-25
  • Issue published: 2024-04-24
