Sound Event Detection with Audio Tagging Consistency Constraint CRNN

YANG Liping, HAO Junyong, GU Xiaohua, HOU Zhenwei

Citation: YANG Liping, HAO Junyong, GU Xiaohua, HOU Zhenwei. Sound Event Detection with Audio Tagging Consistency Constraint CRNN[J]. Journal of Electronics & Information Technology, 2022, 44(3): 1102-1110. doi: 10.11999/JEIT210131

doi: 10.11999/JEIT210131
Funds: The National Natural Science Foundation of China (61903054), The Natural Science Foundation of Chongqing, China (cstc2021jcyj-msxmX0478)
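Details
    About the authors:

    YANG Liping: male, born in 1981, associate professor. Research interests: signal processing, pattern recognition, and machine learning.

    HAO Junyong: male, born in 1993, master's student. Research interests: signal processing and machine learning.

    GU Xiaohua: female, born in 1982, professor. Research interests: signal processing and machine learning.

    HOU Zhenwei: male, born in 1996, master's student. Research interests: signal processing and machine learning.

    Corresponding author:

    YANG Liping yanglp@cqu.edu.cn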

  • 1) https://project.inria.fr/desed/
  • 2) http://dcase.community/challenge2019/task-sound-event-detection-in-domestic-environments-results
  • 3) http://dcase.community/challenge2020/task-sound-event-detection-and-separation-in-domestic-environments-results
  • 4) https://github.com/sovrasov/flops-counter.pytorch
  • CLC number: TN912.3; TP391.4

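Abstract: The Convolutional Recurrent Neural Network (CRNN), which cascades a Convolutional Neural Network (CNN) structure with a Recurrent Neural Network (RNN) structure, and its variants are the mainstream models for sound event detection. However, a CRNN trained end to end places no functional constraint on the respective roles of its CNN and RNN structures. To address this problem, this paper proposes a sound event detection method based on an Audio Tagging Consistency Constraint CRNN (ATCC-CRNN). The method adds a CRNN audio tagging branch to the sound event classification network of the CRNN model, and adds a CNN audio tagging network that performs audio tagging on the feature maps output by the CNN structure of the CRNN. During training, the audio tagging predictions of the CNN and the CRNN are constrained to be consistent, so that the CNN structure focuses on the audio tagging task while the RNN structure focuses on modeling the inter-frame relationships of an audio sample. The CNN and RNN structures thereby acquire distinct feature description functions. Experiments are conducted on the dataset of the domestic-environment sound event detection task (Task 4) of the IEEE DCASE 2019 challenge. The results show that ATCC-CRNN significantly improves the sound event detection performance of the CRNN model: the F1 score increases by 3.7 percentage points or more on both the validation and evaluation sets. This indicates that ATCC-CRNN promotes the functional division within the CRNN model and effectively improves the generalization ability of CRNN sound event detection models.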
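The training constraint described in the abstract can be illustrated with a small sketch. The abstract does not give the loss formula, so the mean-squared-error consistency term, the binary cross-entropy tagging terms, the `weight` parameter, and all function names below are assumptions made for illustration. The frame-level detection loss is omitted.

```python
import math


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between tag probabilities and 0/1 clip labels."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(pred)


def consistency(cnn_tags, crnn_tags):
    """Mean squared difference between the two branches' clip-level tag probabilities."""
    return sum((a - b) ** 2 for a, b in zip(cnn_tags, crnn_tags)) / len(cnn_tags)


def atcc_loss(cnn_tags, crnn_tags, weak_labels, weight=1.0):
    """Tagging loss of both branches plus the weighted consistency term (assumed form)."""
    tagging = bce(cnn_tags, weak_labels) + bce(crnn_tags, weak_labels)
    return tagging + weight * consistency(cnn_tags, crnn_tags)


# Hypothetical clip with three classes; only the first event is present.
labels = [1.0, 0.0, 0.0]
cnn_out = [0.7, 0.2, 0.1]    # tags predicted from the CNN feature maps
crnn_out = [0.9, 0.1, 0.1]   # tags predicted by the CRNN tagging branch
print(round(consistency(cnn_out, crnn_out), 4))
print(round(atcc_loss(cnn_out, crnn_out, labels), 4))
```

In this sketch, setting `weight=0` leaves the two tagging branches to train independently, while a larger value penalizes disagreement between the CNN-only prediction and the CRNN prediction more strongly.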
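    Figure 1  Network architecture of the audio tagging consistency constraint sound event detection method

    Figure 2  Schematic of the training process of the audio tagging consistency constraint sound event detection model

    Figure 3  Trends of the individual loss terms during training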

    Table 1  Sound event detection F1 scores, DR, and IR of different CRNN models

                      Validation set           Evaluation set
    Model             F1(%)   DR     IR        F1(%)   DR     IR
    Baseline          23.3    0.78   0.62      28.6    0.75   0.46
    ATCC-Baseline     28.6    0.76   0.35      34.6    0.72   0.27
    CRNN              38.9    0.64   0.51      42.1    0.62   0.40
    ATCC-CRNN         43.0    0.60   0.46      45.8    0.60   0.35
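As a reading aid for the columns of Table 1, the sketch below computes event-based scores from event counts. It assumes that DR and IR denote the deletion rate and insertion rate of the usual event-based error rate, with ER = DR + IR; the page itself does not spell the abbreviations out. The function name and the example counts are hypothetical.

```python
def event_metrics(n_ref, n_sys, n_correct):
    """Event-based F1, deletion rate, insertion rate, and error rate from event counts.

    n_ref: reference (ground-truth) events; n_sys: events output by the system;
    n_correct: system events matched to a reference event.
    """
    if n_ref <= 0:
        raise ValueError("n_ref must be positive")
    deletions = n_ref - n_correct    # reference events the system missed
    insertions = n_sys - n_correct   # system events with no matching reference
    precision = n_correct / n_sys if n_sys else 0.0
    recall = n_correct / n_ref
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    dr = deletions / n_ref
    ir = insertions / n_ref
    return {"F1": f1, "DR": dr, "IR": ir, "ER": dr + ir}


# Hypothetical counts, chosen only to exercise the function.
scores = event_metrics(n_ref=100, n_sys=80, n_correct=40)
print({name: round(value, 3) for name, value in scores.items()})

# (DR, IR) from Table 1 and overall ER from Table 2, as (validation, evaluation) rows.
reported = {
    "CRNN": [(0.64, 0.51, 1.14), (0.62, 0.40, 1.02)],
    "ATCC-CRNN": [(0.60, 0.46, 1.06), (0.60, 0.35, 0.95)],
}
for model, rows in reported.items():
    for dr, ir, er in rows:
        # Agreement is only expected up to rounding of the published values.
        print(model, round(dr + ir, 2), er)
```

Under this reading, the DR and IR values of Table 1 add up to the overall ER values of Table 2 within rounding, which supports the assumed interpretation without confirming it.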

    Table 2  Per-class F1 scores and error rates (ER) of the CRNN networks on the validation and evaluation sets

                      CRNN                              ATCC-CRNN
                      Validation set   Evaluation set   Validation set   Evaluation set
    Sound event       F1(%)   ER       F1(%)   ER       F1(%)   ER       F1(%)   ER
    alarm             46.2    0.91     40.4    0.93     44.4    1.02     54.2    0.76
    blender           40.7    1.09     33.3    1.38     48.0    1.22     45.1    1.27
    cat               34.5    1.30     53.7    0.88     32.2    1.27     46.5    0.97
    dishes            23.9    1.33     31.9    1.05     33.2    1.28     31.6    1.03
    dog               26.7    1.24     39.3    1.04     29.0    1.15     37.0    1.01
    electric          50.0    1.02     35.2    1.06     54.8    0.86     57.5    0.71
    frying            23.8    1.69     42.1    1.10     29.2    1.33     37.3    1.12
    running water     35.8    1.15     31.2    1.21     41.7    1.03     37.0    1.09
    speech            53.8    0.82     57.3    0.75     56.6    0.79     59.7    0.73
    vacuum cleaner    53.9    0.89     56.0    0.83     61.0    0.65     52.5    0.79
    overall           38.9    1.14     42.0    1.02     43.0    1.06     45.8    0.95
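The page does not say how the "overall" row of Table 2 is obtained. The sketch below assumes it is the unweighted (macro) average over the ten classes and recomputes it from the per-class F1 values copied from the table; the variable and function names are illustrative.

```python
from statistics import mean

# Per-class F1 (%) from Table 2, in the order:
# (CRNN validation, CRNN evaluation, ATCC-CRNN validation, ATCC-CRNN evaluation)
F1_BY_CLASS = {
    "alarm": (46.2, 40.4, 44.4, 54.2),
    "blender": (40.7, 33.3, 48.0, 45.1),
    "cat": (34.5, 53.7, 32.2, 46.5),
    "dishes": (23.9, 31.9, 33.2, 31.6),
    "dog": (26.7, 39.3, 29.0, 37.0),
    "electric": (50.0, 35.2, 54.8, 57.5),
    "frying": (23.8, 42.1, 29.2, 37.3),
    "running water": (35.8, 31.2, 41.7, 37.0),
    "speech": (53.8, 57.3, 56.6, 59.7),
    "vacuum cleaner": (53.9, 56.0, 61.0, 52.5),
}


def macro_average(table):
    """Average each column over all classes, giving every class equal weight."""
    return tuple(round(mean(column), 1) for column in zip(*table.values()))


overall = macro_average(F1_BY_CLASS)
print(overall)  # (38.9, 42.0, 43.0, 45.8), the same as the table's overall row

# Classes whose evaluation-set F1 is higher with ATCC-CRNN than with CRNN.
improved = [name for name, row in F1_BY_CLASS.items() if row[3] > row[1]]
print(improved)
```

The macro averages reproduce the overall row exactly, and the class-wise comparison shows that the evaluation-set gain is not uniform across sound events.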

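    Table 3  Comparison of sound event detection F1 scores (%) between ATCC-CRNN and several representative CRNN networks on DCASE challenge Task 4

    Model               Validation set   Evaluation set
    Delphin [20]        43.6             45.8
    Shi [21]            42.5             46.1
    Baseline2020 [22]   36.5             39.5
    Hou [23]            42.2             /
    ATCC-CRNN           43.0             45.8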

    Table 4  Number of parameters and computational complexity (FLOPs) of each structure of the CRNN network

    Structure             Parameters   FLOPs
    $\theta_{\rm{CNN}}$   1242304      5.12G
    $\theta_{\rm{RNN}}$   494592       0.06G
    $\theta_{\rm{SED}}$   5140         0.66M
    $\theta_{\rm{AT}}$    17802        17.80K
    Total                 1759838      5.18G
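The totals in Table 4 can be checked by summing the per-structure values after converting the K/M/G suffixes. In the sketch below the dictionary keys and helper names are illustrative, and labelling the second row as the recurrent (RNN) part is an assumption about the table.

```python
UNIT = {"K": 1e3, "M": 1e6, "G": 1e9}


def to_flops(text):
    """Convert a string such as '5.12G' or '17.80K' to a number of operations."""
    return float(text[:-1]) * UNIT[text[-1]]


# (parameters, FLOPs) per structure as listed in Table 4.
# The key "RNN" is an assumed reading of the table's second row.
COMPONENTS = {
    "CNN": (1242304, "5.12G"),
    "RNN": (494592, "0.06G"),
    "SED": (5140, "0.66M"),
    "AT": (17802, "17.80K"),
}

total_params = sum(params for params, _ in COMPONENTS.values())
total_flops = sum(to_flops(flops) for _, flops in COMPONENTS.values())

print(total_params)                 # 1759838, matching the table's total
print(f"{total_flops / 1e9:.2f}G")  # 5.18G, matching the table's total

# Share of each structure in the parameter count and in the FLOPs.
for name, (params, flops) in COMPONENTS.items():
    print(name, f"{params / total_params:.1%}", f"{to_flops(flops) / total_flops:.1%}")
```

The shares printed at the end make it easy to see that the two tagging and detection heads add little to the cost of the convolutional and recurrent structures.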
  • [1] HUMAYUN A I, GHAFFARZADEGAN S, FENG Z, et al. Learning front-end filter-bank parameters using convolutional neural networks for abnormal heart sound detection[C]. Proceedings of the 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Honolulu, USA, 2018: 1408–1411.
    [2] BANDI A K, RIZKALLA M, and SALAMA P. A novel approach for the detection of gunshot events using sound source localization techniques[C]. Proceedings of the IEEE 55th International Midwest Symposium on Circuits and Systems (MWSCAS), Boise, USA, 2012: 494–497.
    [3] DARGIE W. Adaptive audio-based context recognition[J]. IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, 2009, 39(4): 715–725. doi: 10.1109/TSMCA.2009.2015676
    [4] ZHANG Haomin, MCLOUGHLIN I, and SONG Yan. Robust sound event recognition using convolutional neural networks[C]. Proceedings of 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, South Brisbane, Australia, 2015: 559–563.
    [5] HIRATA K, KATO T, and OSHIMA R. Classification of environmental sounds using convolutional neural network with bispectral analysis[C]. Proceedings of 2019 International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS), Taipei, China, 2019: 1–2.
    [6] ÇAKIR E, PARASCANDOLO G, HEITTOLA T, et al. Convolutional recurrent neural networks for polyphonic sound event detection[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2017, 25(6): 1291–1303. doi: 10.1109/TASLP.2017.2690575
    [7] HAYASHI T, WATANABE S, TODA T, et al. Duration-controlled LSTM for polyphonic sound event detection[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2017, 25(11): 2059–2070. doi: 10.1109/TASLP.2017.2740002
    [8] KONG Qiuqiang, XU Yong, SOBIERAJ I, et al. Sound event detection and time–frequency segmentation from weakly labelled data[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019, 27(4): 777–787. doi: 10.1109/TASLP.2019.2895254
    [9] LU Jiakai. Mean teacher convolution system for DCASE 2018 task 4[R]. Technical Report of DCASE 2018 Challenge, 2018.
    [10] CHATTERJEE C C, MULIMANI M, and KOOLAGUDI S G. Polyphonic sound event detection using transposed convolutional recurrent neural network[C]. Proceedings of 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020: 661–665.
    [11] LI Yanxiong, LIU Mingle, DROSSOS K, et al. Sound event detection via dilated convolutional recurrent neural networks[C]. Proceedings of 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020: 286–290.
    [12] XU Yong, KONG Qiuqiang, WANG Wenwu, et al. Large-scale weakly supervised audio classification using gated convolutional neural network[C]. Proceedings of 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, Canada, 2018: 121–125.
    [13] YAN Jie, SONG Yan, GUO Wu, et al. A region based attention method for weakly supervised sound event detection and classification[C]. Proceedings of 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, United Kingdom, 2019: 755–759.
    [14] HU Jie, SHEN Li, ALBANIE S, et al. Squeeze-and-excitation networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(8): 2011–2023. doi: 10.1109/TPAMI.2019.2913372
    [15] BA J L, KIROS J R, and HINTON G E. Layer normalization[OL]. arXiv: 1607.06450, 2016.
    [16] IOFFE S and SZEGEDY C. Batch normalization: Accelerating deep network training by reducing internal covariate shift[C]. Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015: 448–456.
    [17] TARVAINEN A and VALPOLA H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results[C]. Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, USA, 2017: 1195–1204.
    [18] TURPAULT N, SERIZEL R, SALAMON J, et al. Sound event detection in domestic environments with weakly labeled data and soundscape synthesis[C]. 2019 Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2019), New York, USA, 2019: 253–257.
    [19] GEMMEKE J F, ELLIS D P W, FREEDMAN D, et al. Audio set: An ontology and human-labeled dataset for audio events[C]. Proceedings of 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, USA, 2017: 776–780.
    [20] DELPHIN-POULAT L and PLAPOUS C. Mean teacher with data augmentation for DCASE 2019 task 4[R]. Technical Report of DCASE 2019 Challenge, 2019.
    [21] SHI Ziqiang, LIU Liu, LIN Huibin, et al. HODGEPODGE: Sound event detection based on ensemble of semi-supervised learning methods[C]. Proceedings of 2019 Workshop on Detection and Classification of Acoustic Scenes and Events, New York, USA, 2019: 224–228.
    [22] TURPAULT N and SERIZEL R. Training sound event detection on a heterogeneous dataset[C]. 2020 Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE2020), Tokyo, Japan, 2020: 200–204.
    [23] HOU Z W and HAO J Y. Efficient CRNN network based on context gating and channel attention mechanism[R]. Technical Report of DCASE 2020 Challenge, 2020.
Publication history
  • Received: 2021-02-05
  • Revised: 2021-05-30
  • Accepted: 2021-11-05
  • Available online: 2021-11-18
  • Published: 2022-03-28