Sound Event Detection with Audio Tagging Consistency Constraint CRNN

YANG Liping; HAO Junyong; GU Xiaohua; HOU Zhenwei

doi:10.11999/JEIT210131

Volume 44 Issue 3

Mar. 2022


Article Navigation > Journal of Electronics & Information Technology > 2022 > 44(3): 1102-1110

YANG Liping, HAO Junyong, GU Xiaohua, HOU Zhenwei. Sound Event Detection with Audio Tagging Consistency Constraint CRNN[J]. Journal of Electronics & Information Technology, 2022, 44(3): 1102-1110. doi: 10.11999/JEIT210131



1. Key Laboratory of Optoelectronic Technology and Systems (Chongqing University), Ministry of Education, Chongqing 400044, China
2. School of Electrical Engineering, Chongqing University of Science & Technology, Chongqing 401331, China

Funds: The National Natural Science Foundation of China (61903054), The Natural Science Foundation of Chongqing, China (cstc2021jcyj-msxmX0478)

Received Date: 2021-02-05
Revised Date: 2021-05-30
Accepted Date: 2021-11-05

Available Online: 2021-11-18

Publish Date: 2022-03-28

Abstract


The Convolutional Recurrent Neural Network (CRNN), which cascades a Convolutional Neural Network (CNN) structure and a Recurrent Neural Network (RNN) structure, and its variants are the mainstream models for sound event detection. However, end-to-end training does not ensure that the CNN and RNN structures of a CRNN take on distinct, meaningful functions, which may degrade sound event detection performance. To alleviate this problem, this paper proposes an Audio Tagging Consistency Constraint CRNN (ATCC-CRNN) method for sound event detection. In ATCC-CRNN, a CRNN audio tagging branch is embedded in the sound event classification network, and a CNN audio tagging network is designed to predict audio tags from the output of the CNN structure. During CRNN training, the audio tagging predictions of the CNN and the CRNN are constrained to be consistent, so that the CNN structure concentrates on the audio tagging task and the RNN structure concentrates on modelling the inter-frame relationships of the audio sample. As a result, the CNN and RNN structures of the CRNN serve different feature description functions for sound event detection. Experiments are carried out on the dataset of the IEEE DCASE 2019 sound event detection in domestic environments task (task 4). The results demonstrate that the proposed ATCC-CRNN method significantly improves the performance of the CRNN model in sound event detection: the event-based F1 scores on the validation and evaluation datasets are improved by more than 3.7%. These results indicate that ATCC-CRNN gives the CNN and RNN structures of the CRNN clearly defined functions and improves the generalization ability of the CRNN sound event detection model.
Keywords:
- Sound event detection
- Audio tagging
- Deep learning
- Convolutional Recurrent Neural Network (CRNN)
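
The consistency constraint described in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact form of the constraint, so the following is an assumption rather than the authors' formulation: the clip-level tag probabilities of the CNN branch and the CRNN branch are compared with a mean squared difference, and that term is added to a binary cross-entropy tagging loss. The function names (`binary_cross_entropy`, `tagging_consistency`, `atcc_loss`) and the `weight` parameter are hypothetical.

```python
import math


def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over the event classes of one clip."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(probs)


def tagging_consistency(cnn_tags, crnn_tags):
    """Mean squared difference between the two branches' tag probabilities.

    Assumption: the abstract only says the predictions are constrained to be
    consistent; a squared-difference penalty is one common way to do that.
    """
    diffs = [(a - b) ** 2 for a, b in zip(cnn_tags, crnn_tags)]
    return sum(diffs) / len(diffs)


def atcc_loss(cnn_tags, crnn_tags, clip_labels, weight=1.0):
    """Tagging loss of the CRNN branch plus a weighted consistency penalty.

    `weight` is a hypothetical hyperparameter balancing the two terms.
    """
    supervised = binary_cross_entropy(crnn_tags, clip_labels)
    consistency = tagging_consistency(cnn_tags, crnn_tags)
    return supervised + weight * consistency


if __name__ == "__main__":
    # Hypothetical clip-level probabilities for three event classes.
    cnn = [0.9, 0.2, 0.1]
    crnn = [0.8, 0.3, 0.1]
    labels = [1.0, 0.0, 0.0]
    print(atcc_loss(cnn, crnn, labels, weight=0.5))
```

In this sketch the penalty vanishes when the two branches agree, so minimizing it pushes the CNN features toward being sufficient for audio tagging on their own, which is the division of labour the abstract attributes to the method.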
