Sound Event Detection with Audio Tagging Consistency Constraint CRNN

YANG Liping; HAO Junyong; GU Xiaohua; HOU Zhenwei

doi:10.11999/JEIT210131

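Details

About the authors:
YANG Liping: male, born in 1981, associate professor. His research interests include signal processing, pattern recognition, and machine learning.

HAO Junyong: male, born in 1993, master's student. His research interests include signal processing and machine learning.

GU Xiaohua: female, born in 1982, professor. Her research interests include signal processing and machine learning.

HOU Zhenwei: male, born in 1996, master's student. His research interests include signal processing and machine learning.

Corresponding author:
YANG Liping　yanglp@cqu.edu.cn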

¹⁾ https://project.inria.fr/desed/
²⁾ http://dcase.community/challenge2019/task-sound-event-detection-in-domestic-environments-results
³⁾ http://dcase.community/challenge2020/task-sound-event-detection-and-separation-in-domestic-environments-results
⁴⁾ https://github.com/sovrasov/flops-counter.pytorch
CLC number: TN912.3; TP391.4
Publication history
- Received: 2021-02-05
- Revised: 2021-05-30
- Accepted: 2021-11-05
- Published online: 2021-11-18
- Issue date: 2022-03-28

Sound Event Detection with Audio Tagging Consistency Constraint CRNN

1.
Key Laboratory of Optoelectronic Technology and Systems (Chongqing University), Ministry of Education, Chongqing 400044, China
2.
School of Electrical Engineering, Chongqing University of Science & Technology, Chongqing 401331, China

Funds: The National Natural Science Foundation of China (61903054), The Natural Science Foundation of Chongqing, China (cstc2021jcyj-msxmX0478)

Abstract: The Convolutional Recurrent Neural Network (CRNN), which cascades a Convolutional Neural Network (CNN) structure and a Recurrent Neural Network (RNN) structure, and its variants are the mainstream models for sound event detection. However, a CRNN trained end to end places no functional constraint on the roles of its CNN and RNN structures, which may limit detection performance. To alleviate this problem, this paper proposes an Audio Tagging Consistency Constraint CRNN (ATCC-CRNN) method for sound event detection. ATCC-CRNN adds a CRNN audio tagging branch to the sound event classification network of the CRNN model, and adds a CNN audio tagging network that predicts audio tags from the feature maps output by the CNN structure. During training, the audio tagging predictions of the CNN and the CRNN are constrained to be consistent, so that the CNN structure concentrates on the audio tagging task and the RNN structure concentrates on modelling the inter-frame relationships of an audio sample. As a result, the CNN and RNN structures of the CRNN take on different feature description functions. Experiments are carried out on the dataset of the IEEE DCASE 2019 sound event detection in domestic environments task (task 4). The results show that ATCC-CRNN significantly improves the sound event detection performance of the CRNN model: the event-based F1 scores on the validation and evaluation sets increase by at least 3.7 percentage points (from 38.9% to 43.0% and from 42.1% to 45.8%, respectively). These results indicate that ATCC-CRNN promotes a clearer division of function between the CNN and RNN structures and effectively improves the generalization ability of the CRNN sound event detection model.

Key words: Sound event detection; Audio tagging; Deep learning; Convolutional Recurrent Neural Network (CRNN)
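
As a rough illustration of the consistency constraint described in the abstract, the sketch below compares the clip-level tag probabilities predicted by the CNN audio tagging network with those of the CRNN audio tagging branch. The abstract does not state the form of the constraint or how it is weighted, so the mean squared difference, the binary cross-entropy tagging terms, the function names, and the `weight` parameter are all assumptions made for illustration, not the authors' implementation. The frame-level sound event detection loss is left out for brevity.

```python
import math
from typing import Sequence


def consistency_loss(cnn_tags: Sequence[float], crnn_tags: Sequence[float]) -> float:
    """Mean squared difference between two clip-level tag probability vectors.

    Assumption: the paper's constraint is approximated here by a plain MSE.
    """
    if len(cnn_tags) != len(crnn_tags) or not cnn_tags:
        raise ValueError("tag vectors must be non-empty and of equal length")
    return sum((a - b) ** 2 for a, b in zip(cnn_tags, crnn_tags)) / len(cnn_tags)


def tagging_loss(tags: Sequence[float], labels: Sequence[int], eps: float = 1e-7) -> float:
    """Binary cross-entropy between tag probabilities and clip-level labels."""
    total = 0.0
    for p, y in zip(tags, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp so that log() stays finite
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(tags)


def training_loss(cnn_tags, crnn_tags, labels, weight: float = 1.0) -> float:
    """Hypothetical objective: both tagging losses plus the weighted consistency term."""
    return (
        tagging_loss(cnn_tags, labels)
        + tagging_loss(crnn_tags, labels)
        + weight * consistency_loss(cnn_tags, crnn_tags)
    )


# Toy example with three event classes; the clip contains only the first one.
labels = [1, 0, 0]
cnn_tags = [0.8, 0.3, 0.1]   # output of the CNN audio tagging network
crnn_tags = [0.9, 0.1, 0.1]  # output of the CRNN audio tagging branch
print(consistency_loss(cnn_tags, crnn_tags))
print(training_loss(cnn_tags, crnn_tags, labels))
```

Under this assumed objective, the consistency term is zero when both branches predict the same tags and grows as they disagree, which is the pressure that would push the CNN structure toward the tagging task.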

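Figure 1 Network architecture of the audio tagging consistency constraint sound event detection method

Figure 2 Training process of the audio tagging consistency constraint sound event detection model

Figure 3 Trends of the individual loss terms during training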

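Table 1 Sound event detection F1 scores, DR, and IR of different CRNN models

Model	Validation F1 (%)	Validation DR	Validation IR	Evaluation F1 (%)	Evaluation DR	Evaluation IR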
Baseline	23.3	0.78	0.62	28.6	0.75	0.46
ATCC-Baseline	28.6	0.76	0.35	34.6	0.72	0.27
CRNN	38.9	0.64	0.51	42.1	0.62	0.40
ATCC-CRNN	43.0	0.60	0.46	45.8	0.60	0.35
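
The size of the improvement can be read directly off the F1 columns above. The short sketch below recomputes the absolute gains in percentage points; the dictionary and function names are hypothetical, and the numbers are copied from the table.

```python
# F1 scores (%) copied from the table: (validation set, evaluation set)
f1 = {
    "Baseline": (23.3, 28.6),
    "ATCC-Baseline": (28.6, 34.6),
    "CRNN": (38.9, 42.1),
    "ATCC-CRNN": (43.0, 45.8),
}


def gain(base: str, improved: str):
    """Absolute F1 gain in percentage points on (validation, evaluation)."""
    return tuple(round(b - a, 1) for a, b in zip(f1[base], f1[improved]))


baseline_gain = gain("Baseline", "ATCC-Baseline")  # (5.3, 6.0)
crnn_gain = gain("CRNN", "ATCC-CRNN")              # (4.1, 3.7)
print(baseline_gain, crnn_gain)
```

The smallest of the four gains, 3.7 points for the CRNN on the evaluation set, matches the figure quoted in the abstract.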

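Table 2 F1 scores and error rates (ER) of the CRNN networks for each sound event class on the validation and evaluation sets

Sound event	CRNN validation F1 (%)	CRNN validation ER	CRNN evaluation F1 (%)	CRNN evaluation ER	ATCC-CRNN validation F1 (%)	ATCC-CRNN validation ER	ATCC-CRNN evaluation F1 (%)	ATCC-CRNN evaluation ER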
alarm	46.2	0.91	40.4	0.93	44.4	1.02	54.2	0.76
blender	40.7	1.09	33.3	1.38	48.0	1.22	45.1	1.27
cat	34.5	1.30	53.7	0.88	32.2	1.27	46.5	0.97
dishes	23.9	1.33	31.9	1.05	33.2	1.28	31.6	1.03
dog	26.7	1.24	39.3	1.04	29.0	1.15	37.0	1.01
electric	50.0	1.02	35.2	1.06	54.8	0.86	57.5	0.71
frying	23.8	1.69	42.1	1.10	29.2	1.33	37.3	1.12
running water	35.8	1.15	31.2	1.21	41.7	1.03	37.0	1.09
speech	53.8	0.82	57.3	0.75	56.6	0.79	59.7	0.73
vacuum cleaner	53.9	0.89	56.0	0.83	61.0	0.65	52.5	0.79
overall	38.9	1.14	42.0	1.02	43.0	1.06	45.8	0.95
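
The page does not expand DR and IR, but the overall error rates in this table equal the sum of the DR and IR values reported in Table 1 to within rounding, which is consistent with reading them as deletion and insertion rates. The sketch below checks that relation and shows how the per-class quantities are usually obtained from event counts. The count-based definitions are the standard event-based ones and are assumed here, since the page does not say which tooling produced the scores; all names are hypothetical.

```python
# (DR, IR) from Table 1 and overall ER from Table 2 for each model and split
reported = {
    ("CRNN", "validation"): (0.64, 0.51, 1.14),
    ("CRNN", "evaluation"): (0.62, 0.40, 1.02),
    ("ATCC-CRNN", "validation"): (0.60, 0.46, 1.06),
    ("ATCC-CRNN", "evaluation"): (0.60, 0.35, 0.95),
}

# Largest disagreement between DR + IR and the reported ER
max_gap = max(abs(dr + ir - er) for dr, ir, er in reported.values())


def class_metrics(tp: int, fp: int, fn: int):
    """Return (F1, DR, IR, ER) for one event class from event counts.

    Assumed definitions: deletions are missed reference events (fn),
    insertions are spurious detections (fp), and both rates are
    normalised by the number of reference events (tp + fn).
    """
    n_ref = tp + fn
    if n_ref == 0:
        raise ValueError("no reference events for this class")
    f1 = 2 * tp / (2 * tp + fp + fn)
    dr = fn / n_ref
    ir = fp / n_ref
    return f1, dr, ir, dr + ir


print(max_gap)
print(class_metrics(tp=5, fp=5, fn=5))
```

Under these definitions the error rate is not bounded by 1: a class with more spurious and missed events than reference events yields ER above 1, as several rows of the table show.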

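Table 3 Comparison of sound event detection F1 scores (%) between ATCC-CRNN and several representative CRNN networks on DCASE challenge task 4

Model	Validation set	Evaluation set
Delphin [20]	43.6	45.8
Shi [21]	42.5	46.1
Baseline2020 [22]	36.5	39.5
Hou [23]	42.2	/
ATCC-CRNN	43.0	45.8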

Table 4 Number of parameters and computational complexity (FLOPs) of each component of the CRNN network

Component	Parameters	FLOPs
$\theta_{\mathrm{CNN}}$	1242304	5.12G
$\theta_{\mathrm{RNN}}$	494592	0.06G
$\theta_{\mathrm{SED}}$	5140	0.66M
$\theta_{\mathrm{AT}}$	17802	17.80K
Total	1759838	5.18G
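
The totals in the last row can be checked against the per-component rows. In the sketch below the dictionary keys are hypothetical labels, and the second component is assumed to be the recurrent part of the network; the numbers are copied from the table.

```python
# Parameter counts per component, copied from the table
params = {
    "theta_CNN": 1_242_304,
    "theta_RNN": 494_592,   # assumption: second row is the recurrent part
    "theta_SED": 5_140,
    "theta_AT": 17_802,
}

# FLOPs per component, converted from the G / M / K suffixes used in the table
flops = {
    "theta_CNN": 5.12e9,
    "theta_RNN": 0.06e9,
    "theta_SED": 0.66e6,
    "theta_AT": 17.80e3,
}

total_params = sum(params.values())                  # 1759838
total_gflops = round(sum(flops.values()) / 1e9, 2)   # 5.18
cnn_flops_share = flops["theta_CNN"] / sum(flops.values())

print(total_params, total_gflops, round(cnn_flops_share, 3))
```

Both sums agree with the reported totals, and the convolutional part accounts for nearly all of the computation, while the audio tagging network adds under 18K parameters.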

References (23)

[1]	HUMAYUN A I, GHAFFARZADEGAN S, FENG Z, et al. Learning front-end filter-bank parameters using convolutional neural networks for abnormal heart sound detection[C]. Proceedings of the 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Honolulu, USA, 2018: 1408–1411.
[2]	BANDI A K, RIZKALLA M, and SALAMA P. A novel approach for the detection of gunshot events using sound source localization techniques[C]. Proceedings of the IEEE 55th International Midwest Symposium on Circuits and Systems (MWSCAS), Boise, USA, 2012: 494–497.
[3]	DARGIE W. Adaptive audio-based context recognition[J]. IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, 2009, 39(4): 715–725. doi: 10.1109/TSMCA.2009.2015676
[4]	ZHANG Haomin, MCLOUGHLIN I, and SONG Yan. Robust sound event recognition using convolutional neural networks[C]. Proceedings of 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, South Brisbane, Australia, 2015: 559–563.
[5]	HIRATA K, KATO T, and OSHIMA R. Classification of environmental sounds using convolutional neural network with bispectral analysis[C]. Proceedings of 2019 International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS), Taipei, China, 2019: 1–2.
[6]	ÇAKIR E, PARASCANDOLO G, HEITTOLA T, et al. Convolutional recurrent neural networks for polyphonic sound event detection[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2017, 25(6): 1291–1303. doi: 10.1109/TASLP.2017.2690575
[7]	HAYASHI T, WATANABE S, TODA T, et al. Duration-controlled LSTM for polyphonic sound event detection[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2017, 25(11): 2059–2070. doi: 10.1109/TASLP.2017.2740002
[8]	KONG Qiuqiang, XU Yong, SOBIERAJ I, et al. Sound event detection and time–frequency segmentation from weakly labelled data[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019, 27(4): 777–787. doi: 10.1109/TASLP.2019.2895254
[9]	LU Jiakai. Mean teacher convolution system for DCASE 2018 task 4[R]. Technical Report of DCASE 2018 Challenge, 2018.
[10]	CHATTERJEE C C, MULIMANI M, and KOOLAGUDI S G. Polyphonic sound event detection using transposed convolutional recurrent neural network[C]. Proceedings of 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020: 661–665.
[11]	LI Yanxiong, LIU Mingle, DROSSOS K, et al. Sound event detection via dilated convolutional recurrent neural networks[C]. Proceedings of 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020: 286–290.
[12]	XU Yong, KONG Qiuqiang, WANG Wenwu, et al. Large-scale weakly supervised audio classification using gated convolutional neural network[C]. Proceedings of 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, Canada, 2018: 121–125.
[13]	YAN Jie, SONG Yan, GUO Wu, et al. A region based attention method for weakly supervised sound event detection and classification[C]. Proceedings of 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, United Kingdom, 2019: 755–759.
[14]	HU Jie, SHEN Li, ALBANIE S, et al. Squeeze-and-excitation networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(8): 2011–2023. doi: 10.1109/TPAMI.2019.2913372
[15]	BA J L, KIROS J R, and HINTON G E. Layer normalization[OL]. arXiv: 1607.06450, 2016.
[16]	IOFFE S and SZEGEDY C. Batch normalization: Accelerating deep network training by reducing internal covariate shift[C]. Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015: 448–456.
[17]	TARVAINEN A and VALPOLA H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results[C]. Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, USA, 2017: 1195–1204.
[18]	TURPAULT N, SERIZEL R, SALAMON J, et al. Sound event detection in domestic environments with weakly labeled data and soundscape synthesis[C]. 2019 Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2019), New York, USA, 2019: 253–257.
[19]	GEMMEKE J F, ELLIS D P W, FREEDMAN D, et al. Audio set: An ontology and human-labeled dataset for audio events[C]. Proceedings of 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, USA, 2017: 776–780.
[20]	DELPHIN-POULAT L and PLAPOUS C. Mean teacher with data augmentation for DCASE 2019 task 4[R]. Technical Report of DCASE 2019 Challenge, 2019.
[21]	SHI Ziqiang, LIU Liu, LIN Huibin, et al. HODGEPODGE: Sound event detection based on ensemble of semi-supervised learning methods[C]. Proceedings of 2019 Workshop on Detection and Classification of Acoustic Scenes and Events, New York, USA, 2019: 224–228.
[22]	TURPAULT N and SERIZEL R. Training sound event detection on a heterogeneous dataset[C]. 2020 Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE2020), Tokyo, Japan, 2020: 200–204.
[23]	HOU Z W and HAO J Y. Efficient CRNN network based on context gating and channel attention mechanism[R]. Technical Report of DCASE 2020 Challenge, 2020.