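Research on Long Short-Term Memory Networks Speech Separation Algorithm Based on Beamforming

LAN Chaofeng, LIU Yan, ZHAO Hongyun, LIU Chundong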

Citation: LAN Chaofeng, LIU Yan, ZHAO Hongyun, LIU Chundong. Research on Long Short-Term Memory Networks Speech Separation Algorithm Based on Beamforming[J]. Journal of Electronics & Information Technology, 2022, 44(7): 2531-2538. doi: 10.11999/JEIT210229

doi: 10.11999/JEIT210229
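Funds: The National Natural Science Youth Foundation of China (11804068), The Natural Science Foundation of Heilongjiang Province (LH2020F033)
Details
    About the authors:

    LAN Chaofeng: female, born in 1981, Ph.D., doctoral supervisor. Research interests: intelligent speech human-computer interaction, image processing, and noise control.

    LIU Yan: male, born in 1995, master's degree. Research interests: speech signal processing, speech separation, and deep learning.

    ZHAO Hongyun: male, born in 1994, master's degree. Research interests: speech recognition and deep learning.

    LIU Chundong: female, born in 1996, master's degree. Research interests: speech enhancement and deep learning.

    Corresponding author:

    LAN Chaofeng lanchaofeng@hrbust.edu.cn

  • CLC number: TN912.3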

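  • Abstract: In deep-learning-based speech separation, the Recurrent Neural Network (RNN) is commonly used, but this model suffers from the vanishing gradient problem, so its separation results are unsatisfactory. To address this, the paper explores signal separation with a Long Short-Term Memory (LSTM) network, which compensates for this shortcoming of the RNN. Separating the voices of multiple speakers is complex, and most current methods rely on spectral mapping and make no effective use of the spatial information in the speech signals. To address this, the paper combines a beamforming algorithm with an LSTM network and proposes a beamforming LSTM algorithm. Speech files from three randomly selected speakers in the TIMIT corpus are used. A superdirective beamforming algorithm produces beams in three different directions, spectral magnitude features are extracted from each beam, and a neural network is built to predict the mask values. The spectrum of the target speech is then obtained and the time-domain signal is reconstructed, which achieves speech separation. The algorithm exploits both the spatial features and the frequency-domain features of the speech signals. Experiments verify the separation performance in different directions. In the 60° direction, compared with the IBM-LSTM network, the proposed algorithm improves the Perceptual Evaluation of Speech Quality (PESQ) score by 0.59, the Short-Time Objective Intelligibility (STOI) score by 0.06, and the Signal-to-Noise Ratio (SNR) by 1.13 dB. Results in the other two directions likewise show that the algorithm separates better than the IBM-LSTM and RNN algorithms.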
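The superdirective beamforming step named in the abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes the textbook superdirective design (MVDR weights under a diffuse noise model with diagonal loading), a far-field source, and a hypothetical six-microphone circular array of radius 5 cm. The array geometry, the speed of sound, the loading factor, the 1 kHz test frequency, and all function names are assumptions introduced here; only the three look directions (60°, 120°, 240°) come from Table 1.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s; assumed room-temperature value


def circular_array(num_mics, radius_m):
    """Microphone (x, y) positions on a circle. The geometry is an assumption."""
    angles = 2.0 * np.pi * np.arange(num_mics) / num_mics
    return radius_m * np.stack([np.cos(angles), np.sin(angles)], axis=1)


def steering_vector(positions, azimuth_deg, freq_hz):
    """Far-field steering vector for a plane wave arriving from azimuth_deg."""
    azimuth = np.deg2rad(azimuth_deg)
    unit = np.array([np.cos(azimuth), np.sin(azimuth)])
    advance_s = positions @ unit / SPEED_OF_SOUND  # arrival-time advance per microphone
    return np.exp(2j * np.pi * freq_hz * advance_s)


def superdirective_weights(positions, azimuth_deg, freq_hz, loading=1e-2):
    """MVDR weights under a diffuse (spherically isotropic) noise model."""
    spacing = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
    # np.sinc(x) is sin(pi x) / (pi x), so this equals sin(kd) / (kd).
    coherence = np.sinc(2.0 * freq_hz * spacing / SPEED_OF_SOUND)
    coherence = coherence + loading * np.eye(len(positions))  # diagonal loading
    d = steering_vector(positions, azimuth_deg, freq_hz)
    solved = np.linalg.solve(coherence, d)
    return solved / (d.conj() @ solved)


positions = circular_array(num_mics=6, radius_m=0.05)
freq_hz = 1000.0
weights = {az: superdirective_weights(positions, az, freq_hz) for az in (60, 120, 240)}

# Each beam passes its own look direction with unit gain: w^H d = 1.
look_gain = {az: weights[az].conj() @ steering_vector(positions, az, freq_hz)
             for az in weights}
```

In a full system the weights would be computed for every STFT frequency bin, and the beam output in each bin would be the weighted sum `w.conj() @ x` of the microphone spectra `x`.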
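  • Figure 1  Flowchart of the supervised speech separation system

    Figure 2  Schematic of gradient information preservation in LSTM

    Figure 3  LSTM network memory block

    Figure 4  Schematic of the frequency-domain summation structure of beamforming

    Figure 5  Schematic of the frequency-domain summation structure of beamforming

    Figure 6  Flowchart of the separation algorithm

    Figure 7  Array layout and sound source positions

    Figure 8  LSTM neural network structure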
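The masking and reconstruction stage described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' network: it reads "IBM" as the ideal binary mask and computes that oracle mask directly, where the paper's LSTM would predict the mask from beam magnitude features. The 0 dB threshold, the function names, and the toy spectrogram values are assumptions introduced here.

```python
import numpy as np


def ideal_binary_mask(target_mag, interference_mag, threshold_db=0.0):
    """Return 1 where the target exceeds the interference by threshold_db, else 0."""
    eps = 1e-12  # avoids division by zero and log of zero in silent bins
    local_snr_db = 20.0 * np.log10((target_mag + eps) / (interference_mag + eps))
    return (local_snr_db > threshold_db).astype(np.float64)


def apply_mask(mixture_stft, mask):
    """Scale the mixture magnitude by the mask and keep the mixture phase."""
    return mask * np.abs(mixture_stft) * np.exp(1j * np.angle(mixture_stft))


# Toy magnitude spectrograms with shape (frequency bins, frames); values are made up.
target = np.array([[1.0, 0.1], [0.2, 2.0]])
interference = np.array([[0.5, 0.4], [0.8, 0.5]])
mixture = (target + interference).astype(np.complex128)

mask = ideal_binary_mask(target, interference)
estimate = apply_mask(mixture, mask)
```

The masked spectrogram `estimate` would then go through an inverse STFT to recover the time-domain signal. Reusing the mixture phase is a common simplification and is assumed here.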

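    Table 1  Speech separation results for different network structures

    Observation angle (°)   Network structure   PESQ   STOI   SNR (dB)
    60                      Beamforming LSTM    3.34   0.91   6.75
    60                      IBM-LSTM            2.75   0.85   5.62
    60                      RNN                 2.58   0.82   4.59
    120                     Beamforming LSTM    3.28   0.89   6.74
    120                     IBM-LSTM            2.72   0.84   5.61
    120                     RNN                 2.52   0.80   4.56
    240                     Beamforming LSTM    3.32   0.91   6.74
    240                     IBM-LSTM            2.76   0.84   5.60
    240                     RNN                 2.54   0.81   4.54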
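The SNR column in Table 1 can be read against a common definition of signal-to-noise ratio for separated speech, sketched below. The page does not state the exact formula the authors used, so the definition (energy of the clean reference over energy of the residual error), the function name, and the sample values are assumptions.

```python
import math


def snr_db(reference, estimate):
    """SNR in dB: reference energy divided by the energy of the residual error."""
    signal_energy = sum(r * r for r in reference)
    noise_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if noise_energy == 0.0:
        return math.inf  # a perfect estimate has no residual
    return 10.0 * math.log10(signal_energy / noise_energy)


# Toy signals: every sample of the estimate is off by 0.1.
reference = [1.0, -1.0, 1.0, -1.0]
estimate = [0.9, -1.1, 1.1, -0.9]
value = snr_db(reference, estimate)  # about 20 dB
```

Under this definition, a gain of 1.13 dB corresponds to the residual error energy dropping by a factor of about 1.3 for the same reference signal.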
Publication history
  • Received: 2021-03-22
  • Revised: 2021-07-20
  • Published online: 2021-07-28
  • Issue published: 2022-07-25
