Monaural Speech Separation Method Based on Deep Learning Feature Fusion and Joint Constraints
Abstract: To improve the performance of monaural speech separation, this paper proposes a separation method based on deep learning feature fusion and joint constraints. The loss function of conventional deep-learning-based separation algorithms considers only the error between the predicted and true values of the training target, which leaves a large error between the separated speech and the clean speech. To address this, a new joint-constraint loss function is proposed that not only constrains the error between the predicted and true values of the ideal ratio mask, but also penalizes the error of the corresponding magnitude spectrum. In addition, to make full use of the complementarity of multiple features, a Convolutional Neural Network (CNN) structure with a feature fusion layer is proposed. The CNN extracts deep features from the multi-channel input features, and the fusion layer combines these deep features with acoustic features to train the separation model. Because the fused features carry rich speech information and represent the speech signal well, the mask predicted by the separation model is more accurate. Experimental results show that, measured by Signal-to-Distortion Ratio (SDR), Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI), the proposed method separates the target speech more effectively than other strong deep-learning-based speech separation methods.
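The joint-constraint idea described in the abstract can be sketched in a few lines of NumPy. The exact form of the paper's loss function is not reproduced in this excerpt, so the version below is only an assumption: a mean-squared error on the mask plus a weighted mean-squared error on the magnitude spectrum obtained by applying the predicted mask to the mixture. The function name `joint_constraint_loss`, its argument names and the weight `alpha` are hypothetical.

```python
import numpy as np


def joint_constraint_loss(mask_pred, mask_true, mix_mag, target_mag, alpha=0.5):
    """Mask error plus a penalty on the resulting magnitude-spectrum error.

    All inputs are arrays of the same shape (frequency bins x frames).
    alpha is a hypothetical trade-off weight; the paper may weight or
    normalise the two terms differently.
    """
    mask_pred = np.asarray(mask_pred, dtype=float)
    mask_true = np.asarray(mask_true, dtype=float)
    mix_mag = np.asarray(mix_mag, dtype=float)
    target_mag = np.asarray(target_mag, dtype=float)

    # Conventional term: error between the predicted and the ideal ratio mask.
    mask_err = np.mean((mask_pred - mask_true) ** 2)

    # Added constraint: error of the magnitude spectrum implied by the mask.
    mag_err = np.mean((mask_pred * mix_mag - target_mag) ** 2)

    return mask_err + alpha * mag_err


# Toy example with a single time-frequency bin.
loss = joint_constraint_loss([[0.5]], [[1.0]], [[2.0]], [[2.0]], alpha=0.5)
print(loss)
```

Setting `alpha = 0` reduces this sketch to a conventional mask-only loss, which makes the contribution of the extra magnitude term easy to isolate.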
Figure 8. Time-domain waveform comparison between the proposed method and the method of Ref. [20]
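Table 1. Procedure of the joint-constraint monaural speech separation algorithm based on CNN feature fusion

(1) Training stage:
Input: the mixed speech signal $\boldsymbol{s}_{\rm train}$, the target speech signal $\boldsymbol{s}_{1{\rm train}}$ and the other source speech signal.
Output: the trained separation model.
Step 1: Frame the target source speech and the mixed speech and apply the STFT. After normalization, obtain the magnitude spectrum of the mixed signal $\boldsymbol{Y}_{\rm train}$, the magnitude spectrum of the target signal $\boldsymbol{Y}_{1{\rm train}}$ and the magnitude spectrum of the other source speech signal, together with the power spectrum $\|\boldsymbol{Y}_{\rm train}\|^2$ and the log-power spectrum $\lg\|\boldsymbol{Y}_{\rm train}\|^2$.
Step 2: Compute the ideal ratio mask of the target signal using Eq. (10).
Step 3: Use $[\boldsymbol{Y}_{\rm train}, \|\boldsymbol{Y}_{\rm train}\|^2, \lg\|\boldsymbol{Y}_{\rm train}\|^2]$ as the CNN input features and $\boldsymbol{M}_{1{\rm train}}$ as the output target of the separation model.
Step 4: In the forward-propagation stage, randomly initialize the weights and biases of the neurons in each layer. Concatenate the CNN output features with $\boldsymbol{Y}_{\rm train}$ and feed the fused features to the fully connected layers to estimate the ratio mask $\hat{\boldsymbol{M}}_{1{\rm train}}$.
Step 5: In the back-propagation stage, use the integrated optimizer to minimize the joint loss function of Eq. (3), iteratively adjusting the weights and biases of the neurons in each layer.
Step 6: Obtain the trained deep neural network used for separation.

(2) Separation stage:
Input: the test signal $\boldsymbol{s}_{\rm test}$ and the trained separation model.
Output: the estimated target $\hat{\boldsymbol{M}}_{1{\rm test}}$.
Step 1: Preprocess the mixed test speech and apply the STFT. After normalization, obtain the magnitude spectrum of the mixed signal $\boldsymbol{Y}_{\rm test}$, the power spectrum $\|\boldsymbol{Y}_{\rm test}\|^2$, the log-power spectrum $\lg\|\boldsymbol{Y}_{\rm test}\|^2$ and the phase spectrum of the mixed speech $\boldsymbol{P}_{\rm test}$.
Step 2: Obtain the estimated target $\hat{\boldsymbol{M}}_{1{\rm test}}$ from the output of the separation model.

(3) Speech reconstruction stage:
Step 1: Using $\hat{\boldsymbol{M}}_{1{\rm test}}$ and the magnitude spectrum of the mixed speech, compute the estimated magnitude spectrum of each target speech frame $\hat{\boldsymbol{S}}_{1{\rm test}}$ with Eq. (11).
Step 2: Combine the estimated magnitude spectrum with the extracted phase spectrum of the mixed speech to obtain the spectrum of the estimated speech.
Step 3: Apply the ISTFT to obtain the time-domain signal of each target speech frame, then concatenate all frames to obtain the target speech signal.

Table 2. Performance comparison between the proposed method and the DNN-based method

| Method | Gender combination (F: female, M: male) | SDR (dB) | PESQ | STOI |
| Proposed method | F-F | 8.4626 | 2.567 | 0.76 |
| Proposed method | F-M | 10.6765 | 2.887 | 0.88 |
| Proposed method | M-M | 7.7658 | 2.335 | 0.65 |
| Method of Ref. [20] | F-F | 7.3623 | 2.306 | 0.68 |
| Method of Ref. [20] | F-M | 8.9548 | 2.632 | 0.80 |
| Method of Ref. [20] | M-M | 6.9021 | 2.061 | 0.59 |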
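The feature construction, fusion and reconstruction steps of Table 1 can be illustrated on spectrogram arrays. This is a minimal sketch, not the authors' implementation: the STFT/ISTFT and the CNN are assumed to be provided elsewhere, the ratio-mask formula is a common definition standing in for Eq. (10), which is not shown in this excerpt, and all function names are hypothetical.

```python
import numpy as np

EPS = 1e-8  # guards against division by zero and log of zero


def input_channels(mix_mag):
    """Stack magnitude, power and log-power spectra as three input channels."""
    power = mix_mag ** 2
    log_power = np.log10(power + EPS)
    return np.stack([mix_mag, power, log_power], axis=0)


def ideal_ratio_mask(target_mag, interferer_mag):
    """Assumed mask form: target magnitude over the sum of source magnitudes."""
    return target_mag / (target_mag + interferer_mag + EPS)


def fuse(deep_features, acoustic_frame):
    """Concatenate flattened CNN features with the mixture magnitude frame."""
    return np.concatenate([np.ravel(deep_features), np.ravel(acoustic_frame)])


def reconstruct_spectrum(mask_est, mix_mag, mix_phase):
    """Mask the mixture magnitude and reuse the mixture phase."""
    target_mag_est = mask_est * mix_mag
    return target_mag_est * np.exp(1j * mix_phase)


# Toy spectrogram: 2 frequency bins x 2 frames.
mix_mag = np.array([[1.0, 2.0], [3.0, 4.0]])
mix_phase = np.zeros_like(mix_mag)

features = input_channels(mix_mag)                    # shape (3, 2, 2)
mask = ideal_ratio_mask(mix_mag * 0.25, mix_mag * 0.75)
spectrum = reconstruct_spectrum(mask, mix_mag, mix_phase)
print(features.shape, np.abs(spectrum))
```

The complex array returned by `reconstruct_spectrum` is what an inverse STFT would consume in the final reconstruction step; in a real pipeline `mask` would come from the trained network rather than from the ideal sources.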
[1] TIAN Yuanrong, WANG Xing, and ZHOU Yipeng. Novel single channel blind source separation algorithm based on sparse representation[J]. Journal of Electronics & Information Technology, 2017, 39(6): 1371–1378. doi: 10.11999/JEIT160888
[2] FU Weihong and ZHANG Cong. Independent vector analysis convolutive blind separation algorithm based on step-size adaptive[J]. Journal of Electronics & Information Technology, 2018, 40(9): 2158–2164. doi: 10.11999/JEIT171156
[3] LI Hongguang, GUO Ying, ZHANG Dongwei, et al. Synchronous frequency hopping signal network station sorting based on underdetermined blind source separation[J]. Journal of Electronics & Information Technology, 2021, 43(2): 319–328. doi: 10.11999/JEIT190920
[4] UDREA R M, CIOCHINA S, and VIZIREANU D N. Multi-band Bark scale spectral over-subtraction for colored noise reduction[C]. International Symposium on Signals, Circuits and Systems, Iasi, Romania, 2005: 311–314.
[5] CHEN Jingdong, BENESTY J, HUANG Yiteng, et al. New insights into the noise reduction Wiener filter[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2006, 14(4): 1218–1234. doi: 10.1109/TSA.2005.860851
[6] WIEM B, ANOUAR B M M, and AICHA B. Monaural speech separation based on linear regression optimized using gradient descent[C]. 2020 5th International Conference on Advanced Technologies for Signal and Image Processing, Sousse, Tunisia, 2020: 1–6.
[7] WANG Chunpeng and ZHU Jie. Neural network based phase compensation methods on monaural speech separation[C]. 2019 IEEE International Conference on Multimedia and Expo (ICME), Shanghai, China, 2019: 1384–1389.
[8] SUN Yang, WANG Wenwu, CHAMBERS J, et al. Two-stage monaural source separation in reverberant room environments using deep neural networks[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019, 27(1): 125–139. doi: 10.1109/TASLP.2018.2874708
[9] XIAN Yang, SUN Yang, WANG Wenwu, et al. Two stage audio-video speech separation using multimodal convolutional neural networks[C]. 2019 Sensor Signal Processing for Defence Conference (SSPD), Brighton, UK, 2019: 1–5.
[10] LIU Yuzhou, DELFARAH M, and WANG Deliang. Deep CASA for talker-independent monaural speech separation[C]. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020: 6354–6358.
[11] WANG Deliang. On ideal binary mask as the computational goal of auditory scene analysis[M]. DIVENYI P. Speech Separation by Humans and Machines. New York: Springer, 2005, 60: 63–64.
[12] KIM G, LU Yang, HU Yi, et al. An algorithm that improves speech intelligibility in noise for normal-hearing listeners[J]. The Journal of the Acoustical Society of America, 2009, 126(3): 1486–1494. doi: 10.1121/1.3184603
[13] HAN Kun and WANG Deliang. A classification based approach to speech segregation[J]. The Journal of the Acoustical Society of America, 2012, 132(5): 3475–3483. doi: 10.1121/1.4754541
[14] SRINIVASAN S, ROMAN N, and WANG Deliang. Binary and ratio time-frequency masks for robust speech recognition[J]. Speech Communication, 2006, 48(11): 1486–1501. doi: 10.1016/j.specom.2006.09.003
[15] ZHANG Xiaolei and WANG Deliang. A deep ensemble learning method for monaural speech separation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016, 24(5): 967–977. doi: 10.1109/TASLP.2016.2536478
[16] HUANG Posen, KIM N, HASEGAWA-JOHNSON M, et al. Joint optimization of masks and deep recurrent neural networks for monaural source separation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015, 23(12): 2136–2147. doi: 10.1109/TASLP.2015.2468583
[17] DU Jun, TU Yanhui, DAI Lirong, et al. A regression approach to single-channel speech separation via high-resolution deep neural networks[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016, 24(8): 1424–1437. doi: 10.1109/TASLP.2016.2558822
[18] WANG Yannan, DU Jun, DAI Lirong, et al. A gender mixture detection approach to unsupervised single-channel speech separation based on deep neural networks[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2017, 25(7): 1535–1546. doi: 10.1109/TASLP.2017.2700540
[19] LI Xiang, WU Xihong, and CHEN Jing. A spectral-change-aware loss function for DNN-based speech separation[C]. ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 2019: 6870–6874.
[20] SUN Linhui, ZHU Ge, and LI Ping’an. Joint constraint algorithm based on deep neural network with dual outputs for single-channel speech separation[J]. Signal, Image and Video Processing, 2020, 14(7): 1387–1395. doi: 10.1007/s11760-020-01676-6
[21] COOKE M, BARKER J, CUNNINGHAM S, et al. An audio-visual corpus for speech perception and automatic speech recognition[J]. The Journal of the Acoustical Society of America, 2006, 120(5): 2421–2424. doi: 10.1121/1.2229005