Cross-modal Audiovisual Separation Based on U-Net Network Combining Optical Flow Algorithm and Attention Mechanism

LAN Chaofeng, JIANG Pengwei, CHEN Huan, HAN Chuang, GUO Xiaoxia

Citation: LAN Chaofeng, JIANG Pengwei, CHEN Huan, HAN Chuang, GUO Xiaoxia. Cross-modal Audiovisual Separation Based on U-Net Network Combining Optical Flow Algorithm and Attention Mechanism[J]. Journal of Electronics & Information Technology, 2023, 45(10): 3538-3546. doi: 10.11999/JEIT221500

doi: 10.11999/JEIT221500
Funds: The National Natural Science Foundation of China (11804068); The Natural Science Foundation of Heilongjiang Province (LH2020F033)
Details
    Author biographies:

    LAN Chaofeng: female, associate professor, doctoral supervisor; research interests include speech signal processing and analysis, deep learning, underwater signal processing, and noise control

    JIANG Pengwei: male, master's student; research interest is speech separation

    CHEN Huan: male, Ph.D.; research interests include acoustic signal analysis and processing

    HAN Chuang: male, Ph.D.; research interests include acoustic signal diagnosis, detection, and localization

    GUO Xiaoxia: female, Ph.D.; research interests include acoustic signal analysis and processing

    Corresponding author:

    HAN Chuang, hanchuang@hrbust.edu.cn

  • CLC number: TN912.3

  • Abstract: Most existing audiovisual speech separation models simply concatenate video and audio features without fully modeling the relationships between the modalities, so the visual information is underexploited. To address this problem, this paper proposes a new model: by fully exploiting the relationship between visual and audio features, and combining a multi-head attention mechanism with the dense optical flow (Farneback) algorithm and a U-Net network, a cross-modal-fusion optical Flow-AudioVisual Speech Separation (Flow-AVSS) model is presented. The model extracts motion features with the Farneback algorithm and lip features with the lightweight ShuffleNet v2 network, combines the motion and lip features through an affine transformation, and feeds the result through a Temporal Convolutional Network (TCN) to obtain the visual features. To make full use of the visual information, feature fusion uses multi-head attention to fuse the visual and audio features across modalities; the fused audiovisual features are finally passed through a U-Net separation network to obtain the separated speech. Experiments are conducted on the AVSpeech dataset with the Perceptual Evaluation of Speech Quality (PESQ), Short-Time Objective Intelligibility (STOI), and Source-to-Distortion Ratio (SDR) metrics. The results show that, compared with an audio-only separation network and an audiovisual separation network using only feature concatenation, the proposed method improves SDR by 2.23 dB and 1.68 dB, respectively. This indicates that cross-modal attention fusion exploits the correlation between modalities more fully, and that the added lip-motion features effectively improve the robustness of the video features and the separation performance.
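As a concrete sketch of the motion-feature step described above (Python with OpenCV; the frame file names and parameter values are illustrative assumptions, not the paper's settings), Farneback dense optical flow yields a per-pixel field of 2-D displacement vectors between consecutive lip-region frames:

    # Minimal sketch: Farneback dense optical flow between two
    # consecutive lip-region frames (assumes OpenCV; parameter values
    # are illustrative, not the paper's actual configuration).
    import cv2
    import numpy as np

    prev = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)
    curr = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

    # Returns an H x W x 2 array of (dx, dy) displacement vectors,
    # i.e. the 2-D projection of the scene's 3-D motion (cf. Fig. 1).
    flow = cv2.calcOpticalFlowFarneback(
        prev, curr, None,
        pyr_scale=0.5,   # image pyramid downscale factor
        levels=3,        # number of pyramid levels
        winsize=15,      # averaging window size
        iterations=3,    # iterations per pyramid level
        poly_n=5,        # pixel neighborhood for polynomial expansion
        poly_sigma=1.2,  # Gaussian sigma for the expansion
        flags=0)

    # Magnitude/angle form is a common input representation for a
    # downstream motion-feature network.
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])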
  • Figure 1. A 2-D optical flow vector represents the projection of the 3-D velocity of the observed scene onto the imaging surface

    Figure 2. Computation process of Q, K, V

    Figure 3. Defining multiple groups of A to generate multiple groups of Q, K, V

    Figure 4. Framework of the cross-modal-fusion optical flow audiovisual separation model

    Figure 5. Overall structure of the cross-modal fusion module

    Figure 6. Affine transformation and TCN module

    Figure 7. Cross-modal attention fusion strategy
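To make the attention computation of Figs. 2 and 3 and the fusion strategy of Fig. 7 concrete, here is a minimal PyTorch sketch of cross-modal multi-head attention; the feature dimensions, sequence lengths, and the audio-as-query / visual-as-key-and-value assignment are illustrative assumptions rather than the paper's exact wiring:

    # Minimal sketch: cross-modal multi-head attention fusion (PyTorch).
    # Each head projects its inputs to Q, K, V (Fig. 2); multiple heads
    # run in parallel and their outputs are concatenated (Fig. 3).
    import torch
    import torch.nn as nn

    d_model, n_heads = 256, 8   # assumed dimensions
    attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    batch, t_audio, t_video = 4, 200, 50
    audio_feat = torch.randn(batch, t_audio, d_model)   # audio stream
    visual_feat = torch.randn(batch, t_video, d_model)  # lip + flow stream

    # Q from audio, K and V from the visual features, so every audio
    # frame attends over the visual time steps (cf. Fig. 7).
    fused, weights = attn(query=audio_feat, key=visual_feat,
                          value=visual_feat)

    # A residual connection keeps the original audio information; the
    # result plays the role of the fused audiovisual feature fed to
    # the U-Net separator.
    fused = fused + audio_feat
    print(fused.shape)  # torch.Size([4, 200, 256])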

    Table 1. Performance evaluation of speech separation (SDR in dB)

    Model                            SDR     PESQ    STOI
    PIT (audio only) [6]             7.73    2.56    0.83
    AV baseline [15]                 8.65    2.62    0.83
    Lip+HCMA                         8.71    2.63    0.83
    Lip+Flow+feature concatenation   8.73    2.65    0.83
    Lip+Flow+SCMA                    9.25    2.67    0.84
    Lip+Flow+HCMA                    9.96    2.72    0.85

    Table 2. Separation results of different models on the same dataset and server (SDR in dB)

    Model            SDR     PESQ    STOI
    Ref. [29]        7.73    2.56    0.83
    Ref. [14]        8.28    2.62    0.83
    Ref. [15]        8.65    2.64    0.84
    Ref. [16]        9.14    2.65    0.84
    Lip+Flow+SCMA    9.25    2.67    0.84
    Lip+Flow+HCMA    9.96    2.72    0.85
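The metrics reported in Tables 1 and 2 can be computed with common open-source tools. The sketch below assumes the third-party pesq and pystoi Python packages and 16 kHz audio, and evaluates one separated utterance with a plain scale-dependent SDR (see [28] for SDR's caveats); the file names are illustrative:

    # Minimal sketch: PESQ, STOI and a plain SDR for one separated
    # utterance. Assumes the third-party "pesq" and "pystoi" packages;
    # file paths and the 16 kHz sample rate are illustrative.
    import numpy as np
    import soundfile as sf          # assumed available for WAV I/O
    from pesq import pesq           # ITU-T P.862 implementation
    from pystoi import stoi

    ref, fs = sf.read("clean.wav")      # reference clean speech
    est, _ = sf.read("separated.wav")   # model output, same length

    pesq_score = pesq(fs, ref, est, "wb")           # wideband PESQ
    stoi_score = stoi(ref, est, fs, extended=False)

    # Scale-dependent SDR in dB: ratio of reference energy to the
    # energy of the estimation error (see [28] for its pitfalls).
    sdr = 10 * np.log10(np.sum(ref**2) / np.sum((ref - est)**2))

    print(f"PESQ={pesq_score:.2f}  STOI={stoi_score:.2f}  SDR={sdr:.2f} dB")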
  • [1] WANG Deliang and BROWN G J. Computational Auditory Scene Analysis: Principles, Algorithms, and Applications[M]. Hoboken: Wiley, 2006: 1–14.
    [2] SCHMIDT M N and OLSSON R K. Single-channel speech separation using sparse non-negative matrix factorization[C]. The INTERSPEECH 2006, Pittsburgh, USA, 2006.
    [3] ZHOU Weili, ZHU Zhen, and LIANG Peiying. Speech denoising using Bayesian NMF with online base update[J]. Multimedia Tools and Applications, 2019, 78(11): 15647–15664. doi: 10.1007/s11042-018-6990-5
    [4] SUN Lei, DU Jun, DAI Lirong, et al. Multiple-target deep learning for LSTM-RNN based speech enhancement[C]. 2017 Hands-free Speech Communications and Microphone Arrays, San Francisco, USA, 2017: 136–140.
    [5] HERSHEY J R, CHEN Zhuo, ROUX J L, et al. Deep clustering: Discriminative embeddings for segmentation and separation[C]. 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, Shanghai, China, 2016: 31–35.
    [6] YU Dong, KOLBÆK M, TAN Zhenghua, et al. Permutation invariant training of deep models for speaker-independent multi-talker speech separation[C]. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing, New Orleans, USA, 2017: 241–245.
    [7] GOLUMBIC E Z, COGAN G B, SCHROEDER C E, et al. Visual input enhances selective speech envelope tracking in auditory cortex at a “cocktail party”[J]. The Journal of Neuroscience, 2013, 33(4): 1417–1426. doi: 10.1523/jneurosci.3675-12.2013
    [8] SUSSMAN E S. Integration and segregation in auditory scene analysis[J]. The Journal of the Acoustical Society of America, 2005, 117(3): 1285–1298. doi: 10.1121/1.1854312
    [9] TAO Ruijie, PAN Zexu, DAS R K, et al. Is someone speaking?: Exploring long-term temporal features for audio-visual active speaker detection[C]. The ACM Multimedia Conference, Chengdu, China, 2021: 3927–3935.
    [10] LAKHAN A, MOHAMMED M A, KADRY S, et al. Federated learning-aware multi-objective modeling and blockchain-enable system for IIoT applications[J]. Computers and Electrical Engineering, 2022, 100: 107839. doi: 10.1016/j.compeleceng.2022.107839
    [11] MORGADO P, LI Yi, and VASCONCELOS N. Learning representations from audio-visual spatial alignment[C]. The 34th International Conference on Neural Information Processing Systems, Vancouver, Canada, 2020: 397.
    [12] GABBAY A, EPHRAT A, HALPERIN T, et al. Seeing through noise: Visually driven speaker separation and enhancement[C]. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, Canada, 2018: 3051–3055.
    [13] AFOURAS T, CHUNG J S, and ZISSERMAN A. The conversation: Deep audio-visual speech enhancement[C]. The 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, 2018: 3244–3248.
    [14] EPHRAT A, MOSSERI I, LANG O, et al. Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation[J]. ACM Transactions on Graphics, 2018, 37(4): 112. doi: 10.1145/3197517.3201357
    [15] GAO Ruohan and GRAUMAN K. VisualVoice: Audio-visual speech separation with cross-modal consistency[C]. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 15490–15500.
    [16] XIONG Junwen, ZHANG Peng, XIE Lei, et al. Audio-visual speech separation based on joint feature representation with cross-modal attention[J]. arXiv preprint arXiv: 2203.02655, 2022.
    [17] YANG Zilong, ZHU Fuping, TIAN Jinwen, et al. Ship smoke detection method based on saliency and dense optical flow[J]. Infrared and Laser Engineering, 2021, 50(7): 20200496. doi: 10.3788/IRLA20200496
    [18] TIAN Jiali. Research on abnormal behavior recognition of examination room based on multi-features[D]. [Master dissertation], Shenyang University of Technology, 2021.
    [19] OUYANG Yumei. DOF algorithm based moving object detection programmed by Python[J]. Modern Electronics Technique, 2021, 44(1): 78–82. doi: 10.16652/j.issn.1004-373x.2021.01.017
    [20] MA Ningning, ZHANG Xiangyu, ZHENG Haitao, et al. ShuffleNet V2: Practical guidelines for efficient CNN architecture design[C]. The 15th European Conference on Computer Vision, Munich, Germany, 2018: 116–131.
    [21] CARREIRA J and ZISSERMAN A. Quo vadis, action recognition? A new model and the kinetics dataset[C]. 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, 2017: 4724–4733.
    [22] RONNEBERGER O, FISCHER P, and BROX T. U-Net: Convolutional networks for biomedical image segmentation[C]. The 18th International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 2015: 234–241.
    [23] PEREZ E, STRUB F, DE VRIES H, et al. FiLM: Visual reasoning with a general conditioning layer[C]. The Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, USA, 2018: 3942–3951.
    [24] BAI Shaojie, KOLTER J Z, and KOLTUN V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling[J]. arXiv preprint arXiv: 1803.01271, 2018.
    [25] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]. The 31st International Conference on Neural Information Processing Systems, Long Beach, USA, 2017: 6000–6010.
    [26] THIEDE T, TREURNIET W, BITTO R, et al. PEAQ - The ITU standard for objective measurement of perceived audio quality[J]. Journal of the Audio Engineering Society, 2000, 48(1): 3–29.
    [27] TAAL C H, HENDRIKS R C, HEUSDENS R, et al. An algorithm for intelligibility prediction of time–frequency weighted noisy speech[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2011, 19(7): 2125–2136. doi: 10.1109/TASL.2011.2114881
    [28] LE ROUX J, WISDOM S, ERDOGAN H, et al. SDR-half-baked or well done?[C]. 2019 IEEE International Conference on Acoustics, Speech and Signal Processing, Brighton, UK, 2019: 626–630.
    [29] LUO Y and MESGARANI N. Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019, 27(8): 1256–1266. doi: 10.1109/TASLP.2019.2915167
Publication history
  • Received: 2022-12-02
  • Revised: 2023-05-09
  • Available online: 2023-05-17
  • Issue published: 2023-10-31
