Citation: SUN Linhui, WANG Can, LIANG Wenqing, LI Ping’an. Monaural Speech Separation Method Based on Deep Learning Feature Fusion and Joint Constraints[J]. Journal of Electronics & Information Technology, 2022, 44(9): 3266-3276. doi: 10.11999/JEIT210606
 
