Dense Crowd Counting Algorithm Based on New Multi-scale Attention Mechanism

WAN Honglin; WANG Xiaomin; PENG Zhenwei; BAI Zhiquan; YANG Xinghai; SUN Jiande

doi:10.11999/JEIT210163

cstr: 32379.14.JEIT210163

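1. School of Physics and Electronic Science, Shandong Normal University, Jinan 250358, China
2. School of Information Science and Engineering, Shandong Normal University, Jinan 250358, China
3. School of Information Science and Engineering, Shandong University, Qingdao 266237, China
4. School of Information Science and Technology, Qingdao University of Science and Technology, Qingdao 266061, China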

Funding: National Natural Science Foundation of China (61971271); Key Research and Development Program of Shandong Province (2018GGX106008)

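Author biographies:
WAN Honglin: male, born in 1979, associate professor, Ph.D.; research interests include computer vision and artificial intelligence.

WANG Xiaomin: female, born in 1998, master's student; research interests include image processing and crowd counting.

PENG Zhenwei: male, born in 1995, master's degree; research interests include image processing and crowd counting.

BAI Zhiquan: male, born in 1978, professor and doctoral supervisor; research interests include cooperative communication and wireless optical communication technologies.

SUN Jiande: male, born in 1978, professor and doctoral supervisor; research interests include multimedia information processing, analysis, understanding, and their applications.

Corresponding author:
WAN Honglin, visage1979@sdu.edu.cn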

CLC number: TP391
Publication history
- Received: 2021-02-25
- Revised: 2021-10-23
- Accepted: 2021-11-05
- Published online: 2021-11-11
- Issue date: 2022-03-28

Abstract

Dense crowd counting is a classic problem in computer vision that remains challenging because of non-uniform scale, noise, and occlusion. This paper proposes a dense crowd counting method based on a new multi-scale attention mechanism. The deep network comprises a backbone network, a feature extraction network, and a feature fusion network. The feature extraction network consists of a feature branch and an attention branch and adopts a new multi-scale module composed of parallel convolution kernels, which better captures crowd features at different scales and thus adapts to the non-uniform scale of dense crowd distributions. The feature fusion network uses an attention fusion module to enhance the output features of the feature extraction network, effectively fusing attention features with image features and improving counting accuracy. Experiments on public datasets including ShanghaiTech, UCF_CC_50, Mall, and UCSD show that the proposed method outperforms existing methods on both the MAE and MSE metrics.

Keywords: Crowd counting; New multi-scale attention; Convolutional neural network; Artificial intelligence
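
The two ideas summarized in the abstract, parallel convolution branches with different kernel sizes and attention-weighted feature fusion, can be illustrated with a minimal pure-Python sketch. It is not the authors' network: the kernel sizes (1, 3, 5), the averaging kernels standing in for learned weights, the sigmoid gating, and all function names are assumptions made for illustration.

```python
import math


def conv2d_same(image, kernel):
    """Zero-padded 2-D cross-correlation with an odd-sized square kernel."""
    h, w = len(image), len(image[0])
    k = len(kernel)
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(k):
                for dx in range(k):
                    yy, xx = y + dy - r, x + dx - r
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += image[yy][xx] * kernel[dy][dx]
            out[y][x] = acc
    return out


def box_kernel(size):
    """Hypothetical stand-in for a learned kernel: a normalized averaging filter."""
    return [[1.0 / (size * size)] * size for _ in range(size)]


def multi_scale_module(feature_map, kernel_sizes=(1, 3, 5)):
    """Run parallel branches with different receptive fields; one output map per branch."""
    return [conv2d_same(feature_map, box_kernel(k)) for k in kernel_sizes]


def attention_fusion(feature_maps, attention_logits):
    """Weight every feature map element-wise by sigmoid(attention_logits)."""
    attn = [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in attention_logits]
    return [
        [[f * a for f, a in zip(frow, arow)] for frow, arow in zip(fmap, attn)]
        for fmap in feature_maps
    ]


# Toy input: a single bright pixel, and a neutral attention map (all logits 0).
features = [[0.0, 0.0, 0.0], [0.0, 9.0, 0.0], [0.0, 0.0, 0.0]]
logits = [[0.0] * 3 for _ in range(3)]
branches = multi_scale_module(features)
fused = attention_fusion(branches, logits)
```

The 1x1 branch keeps the pixel sharp while the larger kernels spread it over a wider neighborhood, which is the sense in which parallel branches respond to objects of different sizes. In a trained model the kernels and the attention logits would be learned rather than fixed.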

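Figure 1 Network structure proposed in this paper

Figure 2 Basic feature extraction module, also used in this paper as the basic attention module

Figure 3 Traditional Inception structure

Figure 4 Improved Inception structure

Figure 5 New multi-scale module

Figure 6 Attention fusion module

Figure 7 Estimated density maps, ground truth, and original images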
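
Figure 7 compares estimated density maps with ground truth. This page does not say how the ground-truth maps were built; a common approach in density-based counting, assumed here, places a normalized Gaussian at each annotated head position so that the map sums to the number of people. The fixed `sigma`, the grid size, and the function names below are hypothetical.

```python
import math


def gaussian_density_map(points, height, width, sigma=1.0):
    """Add one unit-mass Gaussian per annotated head at (row, col)."""
    density = [[0.0] * width for _ in range(height)]
    for py, px in points:
        weights = [
            [math.exp(-((y - py) ** 2 + (x - px) ** 2) / (2.0 * sigma ** 2))
             for x in range(width)]
            for y in range(height)
        ]
        total = sum(sum(row) for row in weights)
        # Normalize inside the image so each person contributes exactly 1.
        for y in range(height):
            for x in range(width):
                density[y][x] += weights[y][x] / total
    return density


def count_from_density(density):
    """The estimated count is the sum over the density map."""
    return sum(sum(row) for row in density)


heads = [(1, 1), (4, 5), (6, 2)]  # hypothetical head annotations
density = gaussian_density_map(heads, height=8, width=8)
estimated_count = count_from_density(density)
```

Because each Gaussian is normalized within the image, summing the map recovers the number of annotated heads, which is what lets a network be trained to regress a map and still report a count.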

Table 1 Experimental results on the ShanghaiTech dataset

Method	Part A MAE	Part A MSE	Part B MAE	Part B MSE
MCNN^[4]	110.2	173.2	26.4	41.3
EDMNet^[14]	76.5	100.2	15.4	26.3
MSFNet^[15]	63.4	97.2	9.6	14.3
Switching-CNN^[9]	90.4	135.0	21.6	33.4
CSRNet^[8]	68.2	115.0	10.6	16.0
SCAR^[21]	66.3	114.1	9.5	15.2
MRA-CNN^[22]	74.2	112.5	11.9	21.3
ACSPNet^[23]	85.2	137.1	15.4	23.1
ACM-CNN^[16]	72.2	103.5	17.5	22.7
SFANet^[24]	59.8	99.3	26.0	30.5
FPNet^[33]	108.6	126.3	26.0	30.5
Proposed method	57.1	91.9	6.87	9.8
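
The MAE and MSE columns are not defined on this page. In the crowd counting literature MAE is the mean absolute count error per image, and "MSE" commonly denotes the square root of the mean squared count error; the sketch below assumes that convention. The sample counts are made up for illustration and are not taken from the paper.

```python
import math


def mae(predicted, actual):
    """Mean absolute error between predicted and true counts."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def root_mse(predicted, actual):
    """Root of the mean squared error (reported as "MSE" under the assumed convention)."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


predicted = [98.0, 205.0, 51.0]  # hypothetical per-image predicted counts
actual = [100.0, 200.0, 50.0]    # hypothetical ground-truth counts

print(f"MAE = {mae(predicted, actual):.2f}")
print(f"MSE = {root_mse(predicted, actual):.2f}")
```

For these three images the absolute errors are 2, 5, and 1, so MAE is 8/3 (about 2.67) and the root mean squared error is the square root of 10 (about 3.16). The squared term makes the second metric more sensitive to occasional large misses.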

Table 2 Experimental results on UCF_CC_50

Method	MAE	MSE
MCNN^[4]	377.6	509.1
MSFNet^[15]	257.2	380.8
Switching-CNN^[9]	318.1	439.2
CSRNet^[8]	266.1	397.5
ic-CNN^[25]	260.9	365.5
SCAR^[21]	259.0	374.0
MRA-CNN^[22]	240.8	352.6
ACSPNet^[23]	275.2	383.7
ACM-CNN^[16]	291.6	337.0
SDA-MCNN^[26]	306.6	313.2
SFANet^[24]	219.6	316.2
FPNet^[33]	463.0	501.6
Proposed method	175.2	233.6
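
The margin over the strongest baseline in the UCF_CC_50 table can be checked with simple arithmetic. The sketch below copies the MAE and MSE values from the table and reports the relative error reduction of the proposed method against the best baseline on each metric; the helper name `relative_reduction` is hypothetical.

```python
# (MAE, MSE) per baseline, copied from the UCF_CC_50 table above.
baselines = {
    "MCNN": (377.6, 509.1),
    "MSFNet": (257.2, 380.8),
    "Switching-CNN": (318.1, 439.2),
    "CSRNet": (266.1, 397.5),
    "ic-CNN": (260.9, 365.5),
    "SCAR": (259.0, 374.0),
    "MRA-CNN": (240.8, 352.6),
    "ACSPNet": (275.2, 383.7),
    "ACM-CNN": (291.6, 337.0),
    "SDA-MCNN": (306.6, 313.2),
    "SFANet": (219.6, 316.2),
    "FPNet": (463.0, 501.6),
}
proposed = (175.2, 233.6)


def relative_reduction(baseline, ours):
    """Percentage by which `ours` lowers the error relative to `baseline`."""
    return (baseline - ours) / baseline * 100.0


best_mae_name = min(baselines, key=lambda name: baselines[name][0])
best_mse_name = min(baselines, key=lambda name: baselines[name][1])
mae_gain = relative_reduction(baselines[best_mae_name][0], proposed[0])
mse_gain = relative_reduction(baselines[best_mse_name][1], proposed[1])

print(f"MAE: {mae_gain:.1f}% lower than {best_mae_name}")
print(f"MSE: {mse_gain:.1f}% lower than {best_mse_name}")
```

On these figures the proposed method lowers MAE by roughly 20% relative to SFANet and MSE by roughly 25% relative to SDA-MCNN, the best baselines on each metric.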

Table 3 Experimental results on Mall

Method	MAE	MSE
EDMNet^[14]	1.80	5.36
R-FCN^[27]	6.02	5.46
Faster R-CNN^[28]	5.91	6.60
BidirectionalConvLSTM^[29]	2.10	7.6
DigCrowd^[30]	3.21	16.4
ACM-CNN^[16]	2.3	3.1
Proposed method	1.57	2.03

Table 4 Experimental results on UCSD

Method	MAE	MSE
MCNN^[4]	1.07	1.35
Switching-CNN^[9]	1.62	2.10
BidirectionalConvLSTM^[29]	1.13	1.43
ACSCP^[31]	1.04	1.35
CSRNet^[8]	1.16	1.47
SaNet^[32]	1.02	1.29
ACSPNet^[23]	1.02	1.28
ACM-CNN^[16]	1.01	1.29
SFANet^[24]	0.82	1.07
FPNet^[33]	1.67	3.91
Proposed method	0.97	1.27

Table 5 Ablation experiment results

Method	MAE	MSE
Backbone + D + M	58.6	96.6
Backbone + D + M + C	57.8	92.7
Backbone + ND + NM + C	57.1	91.9

References (33)

[1]	ARTETA C, LEMPITSKY V, and ZISSERMAN A. Counting in the wild[C]. Proceedings of the 14th European Conference on Computer Vision, Amsterdam, The Netherlands, 2016: 483–498.
[2]	ARTETA C, LEMPITSKY V, NOBLE J A, et al. Interactive object counting[C]. Proceedings of the 13th European Conference on Computer Vision, Zurich, Switzerland, 2014: 504–518.
[3]	SHANG Chong, AI Haizhou, and BAI Bo. End-to-end crowd counting via joint learning local and global count[C]. Proceedings of 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, USA, 2016: 1215–1219.
[4]	ZHANG Yingying, ZHOU Desen, CHEN Siqin, et al. Single-image crowd counting via multi-column convolutional neural network[C]. Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016: 589–597.
[5]	OÑORO-RUBIO D and LÓPEZ-SASTRE R J. Towards perspective-free object counting with deep learning[C]. Proceedings of the 14th European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 2016: 615–629.
[6]	MARSDEN M, MCGUINNESS K, LITTLE S, et al. ResnetCrowd: A residual deep learning architecture for crowd counting, violent behaviour detection and crowd density level classification[C]. Proceedings of the 14th IEEE International Conference on Advanced Video and Signal Based Surveillance, Lecce, Italy, 2017: 123–126.
[7]	HE Kaiming, ZHANG Xiangyu, REN Shaoqing, et al. Deep residual learning for image recognition[C]. Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016: 770–778.
[8]	LI Yuhong, ZHANG Xiaofan, and CHEN Deming. CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes[C]. Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 1091–1100.
[9]	SAM D B, SURYA S, and BABU R V. Switching convolutional neural network for crowd counting[C]. Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, 2017: 4031–4039.
[10]	WANG Qi, GAO Junyu, LIN Wei, et al. Learning from synthetic data for crowd counting in the wild[C]. Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019: 8190–8199.
[11]	LI Zhengqi, DEKEL T, COLE F, et al. Learning the depths of moving people by watching frozen people[C]. Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019: 4516–4525.
[12]	WANG Shunzhou, LU Yao, ZHOU Tianfei, et al. SCLNet: Spatial context learning network for congested crowd counting[J]. Neurocomputing, 2020, 404: 227–239. doi: 10.1016/j.neucom.2020.04.139
[13]	CHEN Xinya, BIN Yanrui, GAO Changxin, et al. Relevant region prediction for crowd counting[J]. Neurocomputing, 2020, 407: 399–408. doi: 10.1016/j.neucom.2020.04.117
[14]	MENG Yuebo, JI Tuo, LIU Guanghui, et al. Encoding-decoding multi-scale convolutional neural network for crowd counting[J]. Journal of Xi'an Jiaotong University, 2020, 54(5): 149–157.
[15]	ZUO Jing and BA Yulin. Population-depth counting algorithm based on multiscale fusion[J]. Laser & Optoelectronics Progress, 2020, 57(24): 307–315.
[16]	ZOU Zhikang, CHENG Yu, QU Xiaoye, et al. Attend to count: Crowd counting with adaptive capacity multi-scale CNNs[J]. Neurocomputing, 2019, 367: 75–83. doi: 10.1016/j.neucom.2019.08.009
[17]	SZEGEDY C, LIU Wei, JIA Yangqing, et al. Going deeper with convolutions[C]. Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, USA, 2015: 1–9.
[18]	IDREES H, SALEEMI I, SEIBERT C, et al. Multi-source multi-scale counting in extremely dense crowd images[C]. Proceedings of 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, USA, 2013: 2547–2554.
[19]	CHEN Ke, LOY C C, GONG Shaogang, et al. Feature mining for localised crowd counting[C]. Proceedings of the British Machine Vision Conference, Surrey, UK, 2012: 3–5.
[20]	CHAN A B, LIANG Z S J, and VASCONCELOS N. Privacy preserving crowd monitoring: Counting people without people models or tracking[C]. Proceedings of 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, USA, 2008: 1–7.
[21]	GAO Junyu, WANG Qi, and YUAN Yuan. SCAR: Spatial-/channel-wise attention regression networks for crowd counting[J]. Neurocomputing, 2019, 363: 1–8. doi: 10.1016/j.neucom.2019.08.018
[22]	ZHANG Youmei, ZHOU Chunluan, CHANG Faliang, et al. Multi-resolution attention convolutional neural network for crowd counting[J]. Neurocomputing, 2018, 329: 144–152.
[23]	MA Junjie, DAI Yaping, and TAN Y P. Atrous convolutions spatial pyramid network for crowd counting and density estimation[J]. Neurocomputing, 2019, 350: 91–101. doi: 10.1016/j.neucom.2019.03.065
[24]	ZHU Liang, ZHAO Zhijian, LU Chao, et al. Dual path multi-scale fusion networks with attention for crowd counting[J]. arXiv: 1902.01115, 2019.
[25]	RANJAN V, LE H, and HOAI M. Iterative crowd counting[C]. Proceedings of the 15th European Conference on Computer Vision, Munich, Germany, 2018: 278–293.
[26]	YANG Biao, ZHAN Weiqin, WANG Nan, et al. Counting crowds using a scale-distribution-aware network and adaptive human-shaped kernel[J]. Neurocomputing, 2020, 390: 207–216. doi: 10.1016/j.neucom.2019.02.071
[27]	DAI Jifeng, LI Yi, HE Kaiming, et al. R-FCN: Object detection via region-based fully convolutional networks[C]. Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 2016: 379–387.
[28]	REN Shaoqing, HE Kaiming, GIRSHICK R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137–1149. doi: 10.1109/TPAMI.2016.2577031
[29]	XIONG Feng, SHI Xingjian, and YEUNG D Y. Spatiotemporal modeling for crowd counting in videos[C]. Proceedings of 2017 IEEE International Conference on Computer Vision, Venice, Italy, 2017: 5161–5169.
[30]	XU Mingliang, GE Zhaoyang, JIANG Xiaoheng, et al. Depth Information Guided Crowd Counting for complex crowd scenes[J]. Pattern Recognition Letters, 2019, 125: 563–569. doi: 10.1016/j.patrec.2019.02.026
[31]	SHEN Zan, XU Yi, NI Bingbing, et al. Crowd counting via adversarial cross-scale consistency pursuit[C]. Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 5245–5254.
[32]	CAO Xinkun, WANG Zhipeng, ZHAO Yanyun, et al. Scale aggregation network for accurate and efficient crowd counting[C]. Proceedings of the 15th European Conference on Computer Vision, Munich, Germany, 2018: 757–773.
[33]	DENG Yuanzhi and HU Gang. Crowd density evaluation method based on feature pyramid[J]. Measurement & Control Technology, 2020, 39(6): 108–114.