NN-EdgeBuilder: High-performance Neural Network Inference Framework for Edge Devices

ZHANG Meng, ZHANG Yu, ZHANG Jingwei, CAO Xinye, LI He

Citation: ZHANG Meng, ZHANG Yu, ZHANG Jingwei, CAO Xinye, LI He. NN-EdgeBuilder: High-performance Neural Network Inference Framework for Edge Devices[J]. Journal of Electronics & Information Technology, 2023, 45(9): 3132-3140. doi: 10.11999/JEIT230325

doi: 10.11999/JEIT230325
Funds: The Key-Area Research and Development Program of Guangdong Province (2021B1101270006); The Natural Science Foundation of Jiangsu Province (BK20201145)
Article information
    About the authors:

    ZHANG Meng: male, researcher; research interests include digital signal processing, deep learning algorithms, and hardware acceleration

    ZHANG Yu: male, M.S. candidate; research interest: design of deep learning hardware accelerators

    ZHANG Jingwei: male, Ph.D. candidate; research interests include computer vision and the design of deep learning hardware accelerators

    CAO Xinye: male, M.S. candidate; research interest: design of deep learning hardware accelerators

    LI He: male, associate researcher; research interests include field-programmable gate array (FPGA) development and system optimization

    Corresponding author:

    ZHANG Yu: zhangyu_seu@foxmail.com

  • CLC number: TN79.1

  • Abstract: Rapidly developing neural networks have achieved great success in fields such as object detection, and efficiently and automatically deploying network models on various edge devices through a neural network inference framework has become an important research direction. To this end, this paper presents NN-EdgeBuilder, a neural network inference framework for edge FPGAs. NN-EdgeBuilder uses a design space exploration algorithm based on multi-objective Bayesian optimization to fully explore the parallelism factor and quantization bit width of each network layer, and then invokes high-performance, general-purpose hardware acceleration operators to generate low-latency, low-power neural network accelerators. NN-EdgeBuilder is used to deploy the UltraNet and VGG networks on an Ultra96-V2 FPGA. Compared with the state-of-the-art custom UltraNet accelerator, the generated UltraNet-P1 accelerator improves power consumption and energy efficiency by 17.71% and 21.54%, respectively. Compared with mainstream inference frameworks, the VGG accelerator generated by NN-EdgeBuilder achieves 4.40x higher energy efficiency and 50.65% higher digital signal processor (DSP) computational efficiency.
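
The design space explored by NN-EdgeBuilder pairs each network layer with a parallelism factor and a quantization bit width. The sketch below illustrates one way such per-layer design points could be encoded; the class name, choice lists, and helper are illustrative assumptions, not the framework's actual API.

    # Minimal Python sketch of a per-layer design point (assumed encoding).
    import random
    from dataclasses import dataclass

    @dataclass
    class LayerDesignPoint:
        parallelism_factor: int  # number of MAC units the layer's loops unroll into
        bit_width: int           # quantization bit width for weights/activations

    PARALLELISM_CHOICES = [1, 2, 4, 8, 16]  # assumed candidate unroll factors
    BIT_WIDTH_CHOICES = [4, 8, 16]          # assumed candidate bit widths

    def random_design(num_layers):
        # One (parallelism factor, bit width) pair per layer is one point
        # in the design space that the optimizer explores.
        return [LayerDesignPoint(random.choice(PARALLELISM_CHOICES),
                                 random.choice(BIT_WIDTH_CHOICES))
                for _ in range(num_layers)]
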
  • Figure 1  Flow of the NN-EdgeBuilder inference framework deploying a network model

    Figure 2  Workflow of the quantization module

    Figure 3  Generic FC/Conv computing unit

    Figure 4  Workflow of the Line_buffer

    Figure 5  Parallelism factors controlling the number of MACs
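
As Figure 5 suggests, the parallelism factor of a layer sets the number of multiply-accumulate (MAC) units instantiated for it, and therefore divides the layer's sequential iteration count. A back-of-envelope helper under an assumed ideal-pipelining model:

    # Rough cycle estimate for a fully connected layer (Python sketch);
    # assumes P MAC units retire P multiply-accumulates per cycle.
    def fc_cycles(CI, CO, P):
        total_macs = CI * CO        # one MAC operation per (input, output) pair
        return -(-total_macs // P)  # ceil(total_macs / P) cycles

    # Example: CI = 512, CO = 10, P = 8 gives 640 cycles instead of 5120.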

    Algorithm 1  Loop nest of the fully connected operation
     (1) Loop1: for (ci = 0; ci < CI; ci++)
     (2) Loop2:   for (co = 0; co < CO; co++)
     (3)            O_fc[co] += I_fc[ci] × F_fc[co, ci]
     (4) EndLoop
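
A direct, runnable transliteration of Algorithm 1 is given below; it is a sketch for clarity, not the HLS code the framework generates.

    # Python rendering of Algorithm 1: fully connected layer as a two-level loop nest.
    def fc(I_fc, F_fc):
        CO, CI = len(F_fc), len(F_fc[0])
        O_fc = [0.0] * CO
        for ci in range(CI):      # Loop1 over input channels
            for co in range(CO):  # Loop2 over output channels
                O_fc[co] += I_fc[ci] * F_fc[co][ci]
        return O_fc

    # fc([1.0, 2.0], [[3.0, 4.0], [5.0, 6.0]]) returns [11.0, 17.0]
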
    Algorithm 2  Loop nest of the convolution operation
     (1) Loop1: for (wo = 0; wo < WO; wo++)
     (2) Loop2:   for (ho = 0; ho < HO; ho++)
     (3) Loop3:     for (co = 0; co < CO; co++)
     (4) Loop4:       for (ci = 0; ci < CI; ci++)
     (5) Loop5:         for (hf = 0; hf < HF; hf++)
     (6) Loop6:           for (wf = 0; wf < WF; wf++)
     (7)                     O_conv[co, ho, wo] += I_conv[ci, ho + hf, wo + wf] × F_conv[co, ci, hf, wf]
     (8) EndLoop
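
Algorithm 2 can likewise be checked against a plain Python loop nest; the sketch assumes stride 1 and no padding.

    # Python rendering of Algorithm 2: convolution as a six-level loop nest.
    def conv(I_conv, F_conv, HO, WO):
        CO, CI = len(F_conv), len(F_conv[0])
        HF, WF = len(F_conv[0][0]), len(F_conv[0][0][0])
        O_conv = [[[0.0] * WO for _ in range(HO)] for _ in range(CO)]
        for wo in range(WO):                      # Loop1
            for ho in range(HO):                  # Loop2
                for co in range(CO):              # Loop3
                    for ci in range(CI):          # Loop4
                        for hf in range(HF):      # Loop5
                            for wf in range(WF):  # Loop6
                                O_conv[co][ho][wo] += (I_conv[ci][ho + hf][wo + wf]
                                                       * F_conv[co][ci][hf][wf])
        return O_conv
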
    Algorithm 3  Bayesian optimization flow
     Input: design space $F$, surrogate model ${\rm GP}_M$, acquisition function ${\rm EHVIC}$,
        objective function $\varphi(x)$, constraints $C(x)$
     Output: Pareto front $P(\mathcal{V})$ of the design space of accelerators automatically
        deployed by the inference framework NN-EdgeBuilder
     (1) Sample in $F$ to obtain a dataset $D_\varphi = (\boldsymbol{X}, \boldsymbol{Y})$ of $J$ samples and a
        constraint set $D_C = \{C(x)\}$
     (2) while !(stopping condition) do
     (3)   Fit the surrogate model ${\rm GP}_M$ to the sample set $D_\varphi$ and the constraint set $D_C$
     (4)   For all $p \in P_n$, compute the expected hypervolume improvement ${\rm EHVI}(x)$ and the
        expected constraint satisfaction ${\rm CS}(p)$
     (5)   Select the new sample point by maximizing the acquisition function,
        $x^{J+1} = \arg\max_{x \in F} {\rm EHVIC}(x)$
     (6)   Run the Vivado tool flow to obtain the exact objective values $\varphi(x^{J+1})$ and
        constraints $C(x^{J+1})$
     (7)   Update the dataset $D_\varphi$ and the constraint set $D_C$
     (8) end while
     (9) return the Pareto front $P(\mathcal{V})$ of the accelerator design space
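
The loop of Algorithm 3 reduces to a generic constrained multi-objective Bayesian optimization skeleton. In the sketch below, fit_gp, ehvic, and run_vivado are hypothetical callables standing in for the surrogate model, the EHVIC acquisition function, and the Vivado tool flow; only the control flow mirrors the algorithm.

    import random

    def bayesian_optimize(design_space, fit_gp, ehvic, run_vivado, n_init, n_iter):
        X = random.sample(design_space, n_init)      # step (1): initial sampling
        Y = [run_vivado(x) for x in X]               # objective values and constraints
        for _ in range(n_iter):                      # step (2): until the budget is spent
            gp = fit_gp(X, Y)                        # step (3): fit the surrogate GP_M
            candidates = [x for x in design_space if x not in X]
            x_next = max(candidates,                 # steps (4)-(5): maximize EHVIC
                         key=lambda x: ehvic(gp, x))
            y_next = run_vivado(x_next)              # step (6): exact evaluation
            X.append(x_next)                         # step (7): update the datasets
            Y.append(y_next)
        return pareto_front(X, Y)                    # step (9): return the Pareto front

    def pareto_front(X, Y):
        # Keep the non-dominated points (minimization of every objective assumed).
        def dominates(a, b):
            return all(u <= v for u, v in zip(a, b)) and any(u < v for u, v in zip(a, b))
        return [(x, y) for x, y in zip(X, Y)
                if not any(dominates(y2, y) for y2 in Y if y2 is not y)]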

    Table 1  Performance comparison of UltraNet accelerators

    Accelerator    IoU     FPS    Energy (J)   GOPS    GOPS/W
    UltraNet-P1    0.702   2107   30.2         387.7   319.9
    UltraNet-P2    0.703   2090   33.0         384.6   292.8
    SEUer          0.703   2020   36.7         371.7   263.2
    ultrateam      0.703   2266   40.3         416.9   239.7

    Table 2  Performance comparison of NN-EdgeBuilder and other inference frameworks deploying the VGG network

                               NN-EdgeBuilder                DeepBurning-SEG[6]   fpgaConvNet[7]   HybridDNN[8]   DNNBuilder[9]
    Supported DL frameworks    PyTorch, TensorFlow & Keras   –                    Caffe & Torch    –              Caffe
    FPGA platform              ZU3EG                         ZU3EG                XC7Z020          XC7Z020        XC7Z045
    Frequency (MHz)            250                           200                  125              100            200
    DSP                        360                           264                  220              220            680
    Quantization precision     4 bit                         8 bit                16 bit           16 bit         8 bit
    GOPS                       418                           203                  48               83             524
    GOPS/DSP                   1.16                          0.77                 0.22             0.38           0.77
    GOPS/W                     320.2                         –                    7.3              32.0           72.8
  • [1] SIMONYAN K and ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[C]. The 3rd International Conference on Learning Representations, San Diego, USA, 2015: 1–14.
    [2] HE Kaiming, ZHANG Xiangyu, REN Shaoqing, et al. Deep residual learning for image recognition[C]. The 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016: 770–778.
    [3] ZHANG Meng, ZHANG Jingwei, LI Guoqing, et al. Efficient hardware optimization strategies for deep neural networks acceleration chip[J]. Journal of Electronics & Information Technology, 2021, 43(6): 1510–1517. doi: 10.11999/JEIT210002
    [4] ZHANG Xiaofan, LU Haoming, HAO Cong, et al. SkyNet: A hardware-efficient method for object detection and tracking on embedded systems[C]. Machine Learning and Systems, Austin, USA, 2020: 216–229.
    [5] LI Guoqing, ZHANG Jingwei, ZHANG Meng, et al. Efficient depthwise separable convolution accelerator for classification and UAV object detection[J]. Neurocomputing, 2022, 490: 1–16. doi: 10.1016/j.neucom.2022.02.071
    [6] CAI Xuyi, WANG Ying, MA Xiaohan, et al. DeepBurning-SEG: Generating DNN accelerators of segment-grained pipeline architecture[C]. 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), Chicago, USA, 2022: 1396–1413.
    [7] VENIERIS S I and BOUGANIS C S. fpgaConvNet: Mapping regular and irregular convolutional neural networks on FPGAs[J]. IEEE Transactions on Neural Networks and Learning Systems, 2019, 30(2): 326–342. doi: 10.1109/TNNLS.2018.2844093
    [8] YE Hanchen, ZHANG Xiaofan, HUANG Zhize, et al. HybridDNN: A framework for high-performance hybrid DNN accelerator design and implementation[C]. 2020 57th ACM/IEEE Design Automation Conference (DAC), San Francisco, USA, 2020: 1–6.
    [9] ZHANG Xiaofan, WANG Junsong, ZHU Chao, et al. DNNBuilder: An automated tool for building high-performance DNN hardware accelerators for FPGAs[C]. 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), San Diego, USA, 2018: 1–8.
    [10] BANNER R, NAHSHAN Y, and SOUDRY D. Post training 4-bit quantization of convolutional networks for rapid-deployment[C]. The 33rd International Conference on Neural Information Processing Systems, Vancouver, Canada, 2019: 714.
    [11] DUARTE J, HAN S, HARRIS P, et al. Fast inference of deep neural networks in FPGAs for particle physics[J]. Journal of Instrumentation, 2018, 13: P07027. doi: 10.1088/1748-0221/13/07/P07027
    [12] GHIELMETTI N, LONCAR V, PIERINI M, et al. Real-time semantic segmentation on FPGAs for autonomous vehicles with hls4ml[J]. Machine Learning: Science and Technology, 2022, 3(4): 045011. doi: 10.1088/2632-2153/ac9cb5
    [13] ZHANG Zheng, CHEN Tinghuan, HUANG Jiaxin, et al. A fast parameter tuning framework via transfer learning and multi-objective Bayesian optimization[C]. The 59th ACM/IEEE Design Automation Conference, San Francisco, USA, 2022: 133–138. doi: 10.1145/3489517.3530430
    [14] HUTTER F, HOOS H H, and LEYTON-BROWN K. Sequential model-based optimization for general algorithm configuration[C]. The 5th International Conference on Learning and Intelligent Optimization, Rome, Italy, 2011: 507–523.
    [15] ZHAN Dawei and XING Huanlai. Expected improvement for expensive optimization: A review[J]. Journal of Global Optimization, 2020, 78(3): 507–544. doi: 10.1007/s10898-020-00923-x
    [16] EMMERICH M T M, DEUTZ A H, and KLINKENBERG J W. Hypervolume-based expected improvement: Monotonicity properties and exact computation[C]. 2011 IEEE Congress of Evolutionary Computation (CEC), New Orleans, USA, 2011: 2147–2154.
    [17] ABDOLSHAH M, SHILTON A, RANA S, et al. Expected hypervolume improvement with constraints[C]. 2018 24th International Conference on Pattern Recognition (ICPR), Beijing, China, 2018: 3238–3243.
Publication history
  • Received: 2023-04-26
  • Revised: 2023-08-23
  • Available online: 2023-08-28
  • Issue published: 2023-09-27
