Cross-modal Audiovisual Separation Based on U-Net Network Combining Optical Flow Algorithm and Attention Mechanism

LAN Chaofeng, JIANG Pengwei, CHEN Huan, HAN Chuang, GUO Xiaoxia

Citation: LAN Chaofeng, JIANG Pengwei, CHEN Huan, HAN Chuang, GUO Xiaoxia. Cross-modal Audiovisual Separation Based on U-Net Network Combining Optical Flow Algorithm and Attention Mechanism[J]. Journal of Electronics & Information Technology, 2023, 45(10): 3538-3546. doi: 10.11999/JEIT221500

doi: 10.11999/JEIT221500
Funds: The National Natural Science Foundation of China (11804068); The Natural Science Foundation of Heilongjiang Province (LH2020F033)
Details
    Author biographies:

    LAN Chaofeng: female, associate professor, doctoral supervisor; research interests include speech signal processing and analysis, deep learning, underwater signal processing, and noise control

    JIANG Pengwei: male, master's student; research interest is speech separation

    CHEN Huan: male, Ph.D.; research interests include acoustic signal analysis and processing

    HAN Chuang: male, Ph.D.; research interests include acoustic signal diagnosis, detection, and localization

    GUO Xiaoxia: female, Ph.D.; research interests include acoustic signal analysis and processing

    Corresponding author:

    HAN Chuang, hanchuang@hrbust.edu.cn

  • CLC number: TN912.3

  • Abstract: Most existing audiovisual speech separation models simply concatenate video and audio features without fully modeling the relationships between the modalities, so the visual information is underexploited. To address this problem, this paper proposes a new model: by fully exploiting the relationship between visual and audio features, and combining a multi-head attention mechanism with the dense optical flow (Farneback) algorithm and a U-Net network, a cross-modal-fusion optical Flow-AudioVisual Speech Separation (Flow-AVSS) model is presented. The model extracts motion features with the Farneback algorithm and lip features with the lightweight ShuffleNet v2 network, combines the motion and lip features through an affine transformation, and feeds the result through a Temporal Convolutional Network (TCN) to obtain the visual features. To make full use of the visual information, feature fusion uses multi-head attention to fuse the visual and audio features across modalities; the fused audiovisual features are finally passed through a U-Net separation network to obtain the separated speech. Experiments are conducted on the AVSpeech dataset with the Perceptual Evaluation of Speech Quality (PESQ), Short-Time Objective Intelligibility (STOI), and Source-to-Distortion Ratio (SDR) metrics. The results show that, compared with an audio-only separation network and an audiovisual separation network using only feature concatenation, the proposed method improves SDR by 2.23 dB and 1.68 dB, respectively. This indicates that cross-modal attention fusion exploits the correlation between modalities more fully, and that the added lip-motion features effectively improve the robustness of the video features and the separation performance.
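As a concrete sketch of the motion-feature step described above (Python with OpenCV; the frame file names and parameter values are illustrative assumptions, not the paper's settings), Farneback dense optical flow yields a per-pixel field of 2-D displacement vectors between consecutive lip-region frames:

    # Minimal sketch: Farneback dense optical flow between two
    # consecutive lip-region frames (assumes OpenCV; parameter values
    # are illustrative, not the paper's actual configuration).
    import cv2
    import numpy as np

    prev = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)
    curr = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

    # Returns an H x W x 2 array of (dx, dy) displacement vectors,
    # i.e. the 2-D projection of the scene's 3-D motion (cf. Fig. 1).
    flow = cv2.calcOpticalFlowFarneback(
        prev, curr, None,
        pyr_scale=0.5,   # image pyramid downscale factor
        levels=3,        # number of pyramid levels
        winsize=15,      # averaging window size
        iterations=3,    # iterations per pyramid level
        poly_n=5,        # pixel neighborhood for polynomial expansion
        poly_sigma=1.2,  # Gaussian sigma for the expansion
        flags=0)

    # Magnitude/angle form is a common input representation for a
    # downstream motion-feature network.
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])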
  • Figure 1. A 2-D optical flow vector represents the projection of the 3-D velocity of the observed scene onto the imaging surface

    Figure 2. Computation process of Q, K, V

    Figure 3. Defining multiple groups of A to generate multiple groups of Q, K, V

    Figure 4. Framework of the cross-modal-fusion optical flow audiovisual separation model

    Figure 5. Overall structure of the cross-modal fusion module

    Figure 6. Affine transformation and TCN module

    Figure 7. Cross-modal attention fusion strategy
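To make the attention computation of Figs. 2 and 3 and the fusion strategy of Fig. 7 concrete, here is a minimal PyTorch sketch of cross-modal multi-head attention; the feature dimensions, sequence lengths, and the audio-as-query / visual-as-key-and-value assignment are illustrative assumptions rather than the paper's exact wiring:

    # Minimal sketch: cross-modal multi-head attention fusion (PyTorch).
    # Each head projects its inputs to Q, K, V (Fig. 2); multiple heads
    # run in parallel and their outputs are concatenated (Fig. 3).
    import torch
    import torch.nn as nn

    d_model, n_heads = 256, 8   # assumed dimensions
    attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    batch, t_audio, t_video = 4, 200, 50
    audio_feat = torch.randn(batch, t_audio, d_model)   # audio stream
    visual_feat = torch.randn(batch, t_video, d_model)  # lip + flow stream

    # Q from audio, K and V from the visual features, so every audio
    # frame attends over the visual time steps (cf. Fig. 7).
    fused, weights = attn(query=audio_feat, key=visual_feat,
                          value=visual_feat)

    # A residual connection keeps the original audio information; the
    # result plays the role of the fused audiovisual feature fed to
    # the U-Net separator.
    fused = fused + audio_feat
    print(fused.shape)  # torch.Size([4, 200, 256])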

    Table 1. Performance evaluation of speech separation (SDR in dB)

    Model                            SDR     PESQ    STOI
    PIT (audio only) [6]             7.73    2.56    0.83
    AV baseline [15]                 8.65    2.62    0.83
    Lip+HCMA                         8.71    2.63    0.83
    Lip+Flow+feature concatenation   8.73    2.65    0.83
    Lip+Flow+SCMA                    9.25    2.67    0.84
    Lip+Flow+HCMA                    9.96    2.72    0.85

    Table 2. Separation results of different models on the same dataset and server (SDR in dB)

    Model            SDR     PESQ    STOI
    Ref. [29]        7.73    2.56    0.83
    Ref. [14]        8.28    2.62    0.83
    Ref. [15]        8.65    2.64    0.84
    Ref. [16]        9.14    2.65    0.84
    Lip+Flow+SCMA    9.25    2.67    0.84
    Lip+Flow+HCMA    9.96    2.72    0.85
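The metrics reported in Tables 1 and 2 can be computed with common open-source tools. The sketch below assumes the third-party pesq and pystoi Python packages and 16 kHz audio, and evaluates one separated utterance with a plain scale-dependent SDR (see [28] for SDR's caveats); the file names are illustrative:

    # Minimal sketch: PESQ, STOI and a plain SDR for one separated
    # utterance. Assumes the third-party "pesq" and "pystoi" packages;
    # file paths and the 16 kHz sample rate are illustrative.
    import numpy as np
    import soundfile as sf          # assumed available for WAV I/O
    from pesq import pesq           # ITU-T P.862 implementation
    from pystoi import stoi

    ref, fs = sf.read("clean.wav")      # reference clean speech
    est, _ = sf.read("separated.wav")   # model output, same length

    pesq_score = pesq(fs, ref, est, "wb")           # wideband PESQ
    stoi_score = stoi(ref, est, fs, extended=False)

    # Scale-dependent SDR in dB: ratio of reference energy to the
    # energy of the estimation error (see [28] for its pitfalls).
    sdr = 10 * np.log10(np.sum(ref**2) / np.sum((ref - est)**2))

    print(f"PESQ={pesq_score:.2f}  STOI={stoi_score:.2f}  SDR={sdr:.2f} dB")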
  • [1] WANG Deliang and BROWN G J. Computational Auditory Scene Analysis: Principles, Algorithms, and Applications[M]. Hoboken: Wiley, 2006: 1–14.
    [2] SCHMIDT M N and OLSSON R K. Single-channel speech separation using sparse non-negative matrix factorization[C]. The INTERSPEECH 2006, Pittsburgh, USA, 2006.
    [3] ZHOU Weili, ZHU Zhen, and LIANG Peiying. Speech denoising using Bayesian NMF with online base update[J]. Multimedia Tools and Applications, 2019, 78(11): 15647–15664. doi: 10.1007/s11042-018-6990-5
    [4] SUN Lei, DU Jun, DAI Lirong, et al. Multiple-target deep learning for LSTM-RNN based speech enhancement[C]. 2017 Hands-free Speech Communications and Microphone Arrays, San Francisco, USA, 2017: 136–140.
    [5] HERSHEY J R, CHEN Zhuo, ROUX J L, et al. Deep clustering: Discriminative embeddings for segmentation and separation[C]. 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, Shanghai, China, 2016: 31–35.
    [6] YU Dong, KOLBÆK M, TAN Zhenghua, et al. Permutation invariant training of deep models for speaker-independent multi-talker speech separation[C]. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing, New Orleans, USA, 2017: 241–245.
    [7] GOLUMBIC E Z, COGAN G B, SCHROEDER C E, et al. Visual input enhances selective speech envelope tracking in auditory cortex at a “cocktail party”[J]. The Journal of Neuroscience, 2013, 33(4): 1417–1426. doi: 10.1523/jneurosci.3675-12.2013
    [8] SUSSMAN E S. Integration and segregation in auditory scene analysis[J]. The Journal of the Acoustical Society of America, 2005, 117(3): 1285–1298. doi: 10.1121/1.1854312
    [9] TAO Ruijie, PAN Zexu, DAS R K, et al. Is someone speaking?: Exploring long-term temporal features for audio-visual active speaker detection[C]. The ACM Multimedia Conference, Chengdu, China, 2021: 3927–3935.
    [10] LAKHAN A, MOHAMMED M A, KADRY S, et al. Federated learning-aware multi-objective modeling and blockchain-enable system for IIoT applications[J]. Computers and Electrical Engineering, 2022, 100: 107839. doi: 10.1016/j.compeleceng.2022.107839
    [11] MORGADO P, LI Yi, and VASCONCELOS N. Learning representations from audio-visual spatial alignment[C]. The 34th International Conference on Neural Information Processing Systems, Vancouver, Canada, 2020: 397.
    [12] GABBAY A, EPHRAT A, HALPERIN T, et al. Seeing through noise: Visually driven speaker separation and enhancement[C]. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, Canada, 2018: 3051–3055.
    [13] AFOURAS T, CHUNG J S, and ZISSERMAN A. The conversation: Deep audio-visual speech enhancement[C]. The 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, 2018: 3244–3248.
    [14] EPHRAT A, MOSSERI I, LANG O, et al. Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation[J]. ACM Transactions on Graphics, 2018, 37(4): 112. doi: 10.1145/3197517.3201357
    [15] GAO Ruohan and GRAUMAN K. VisualVoice: Audio-visual speech separation with cross-modal consistency[C]. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 15490–15500.
    [16] XIONG Junwen, ZHANG Peng, XIE Lei, et al. Audio-visual speech separation based on joint feature representation with cross-modal attention[J]. arXiv preprint arXiv: 2203.02655, 2022.
    [17] YANG Zilong, ZHU Fuping, TIAN Jinwen, et al. Ship smoke detection method based on saliency and dense optical flow[J]. Infrared and Laser Engineering, 2021, 50(7): 20200496. doi: 10.3788/IRLA20200496
    [18] TIAN Jiali. Research on abnormal behavior recognition of examination room based on multi-features[D]. [Master dissertation], Shenyang University of Technology, 2021.
    [19] OUYANG Yumei. DOF algorithm based moving object detection programmed by Python[J]. Modern Electronics Technique, 2021, 44(1): 78–82. doi: 10.16652/j.issn.1004-373x.2021.01.017
    [20] MA Ningning, ZHANG Xiangyu, ZHENG Haitao, et al. ShuffleNet V2: Practical guidelines for efficient CNN architecture design[C]. The 15th European Conference on Computer Vision, Munich, Germany, 2018: 116–131.
    [21] CARREIRA J and ZISSERMAN A. Quo vadis, action recognition? A new model and the kinetics dataset[C]. 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, 2017: 4724–4733.
    [22] RONNEBERGER O, FISCHER P, and BROX T. U-Net: Convolutional networks for biomedical image segmentation[C]. The 18th International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 2015: 234–241.
    [23] PEREZ E, STRUB F, DE VRIES H, et al. FiLM: Visual reasoning with a general conditioning layer[C]. The Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, USA, 2018: 3942–3951.
    [24] BAI Shaojie, KOLTER J Z, and KOLTUN V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling[J]. arXiv preprint arXiv: 1803.01271, 2018.
    [25] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]. The 31st International Conference on Neural Information Processing Systems, Long Beach, USA, 2017: 6000–6010.
    [26] THIEDE T, TREURNIET W, BITTO R, et al. PEAQ - The ITU standard for objective measurement of perceived audio quality[J]. Journal of the Audio Engineering Society, 2000, 48(1): 3–29.
    [27] TAAL C H, HENDRIKS R C, HEUSDENS R, et al. An algorithm for intelligibility prediction of time–frequency weighted noisy speech[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2011, 19(7): 2125–2136. doi: 10.1109/TASL.2011.2114881
    [28] LE ROUX J, WISDOM S, ERDOGAN H, et al. SDR-half-baked or well done?[C]. 2019 IEEE International Conference on Acoustics, Speech and Signal Processing, Brighton, UK, 2019: 626–630.
    [29] LUO Y and MESGARANI N. Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019, 27(8): 1256–1266. doi: 10.1109/TASLP.2019.2915167
Publication history
  • Received: 2022-12-02
  • Revised: 2023-05-09
  • Available online: 2023-05-17
  • Issue published: 2023-10-31
