Monaural Speech Enhancement Based on Attention-Gate Dilated Convolution Network

ZHANG Tianqi, BAI Haojun, YE Shaopeng, LIU Jianxing

Citation: ZHANG Tianqi, BAI Haojun, YE Shaopeng, LIU Jianxing. Monaural Speech Enhancement Based on Attention-Gate Dilated Convolution Network[J]. Journal of Electronics & Information Technology, 2022, 44(9): 3277-3288. doi: 10.11999/JEIT210654


doi: 10.11999/JEIT210654
Details
    Author biographies:

    ZHANG Tianqi: Male, Professor and Ph.D. supervisor. His research interests include modulation and demodulation of communication signals, blind signal processing, image and speech signal processing, neural network implementation, and FPGA/VLSI implementation

    BAI Haojun: Male, M.S. candidate. His research interests include speech signal processing and speech enhancement

    YE Shaopeng: Male, M.S. candidate. His research interests include image processing, digital watermarking, and information hiding

    LIU Jianxing: Male, M.S. candidate. His research interest is blind identification of channel coding parameters

    Corresponding author:

    BAI Haojun, hjbai1997@163.com

  • CLC number: TN912.35


Funds: The National Natural Science Foundation of China (61671095, 61702065, 61701067, 61771085), The Project of Key Laboratory of Signal and Information Processing of Chongqing (CSTC2009CA2003), The Natural Science Foundation of Chongqing (cstc2021jcyj-msxmX0836)
  • Abstract: In supervised speech enhancement, contextual information strongly affects the estimation of the target speech. To capture richer global speech correlations while keeping the parameter count as small as possible, this paper designs a novel convolutional network for speech enhancement. The proposed network consists of three parts: an encoding layer, a transmission layer, and a decoding layer. For the encoder and decoder, a 2-D asymmetric dilated residual (2D-ADR) module is proposed, which markedly reduces the number of trainable parameters, enlarges the receptive field, and improves the network's ability to capture contextual information. For the transmission layer, a 1-D gated dilated residual (1D-GDR) module is proposed; by combining dilated convolution, residual learning, and a gating mechanism, it passes features selectively and captures more temporal dependencies. Eight 1D-GDR modules are stacked with dense skip connections to strengthen inter-layer information flow and provide additional gradient propagation paths. Finally, skip connections between corresponding encoder and decoder layers are equipped with an attention mechanism so that decoding receives more robust low-level features. In the experiments, different parameter settings and comparison methods are used to verify the effectiveness and robustness of the network. Trained and tested in 28 noise environments, the proposed method achieves better objective and subjective scores than the comparison methods with only 1.25×10^6 parameters, demonstrating strong enhancement performance and generalization ability.
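The 1D-GDR module described in the abstract combines a dilated convolution, a learned sigmoid gate (cf. gated convolutions [21]), and an identity residual connection. The following is a minimal PyTorch sketch of such a block, not the authors' implementation; the class name, activation choices, and the 1×1 projection are assumptions, with sizes taken from Table 1 (c1 = 128, c2 = 512).

```python
import torch
import torch.nn as nn

class GDR1d(nn.Module):
    """Hypothetical 1-D gated dilated residual block: a dilated
    Conv1d feature path modulated by a sigmoid gate, plus an
    identity residual connection (names/activations assumed)."""
    def __init__(self, channels, hidden, dilation):
        super().__init__()
        pad = dilation  # keeps the frame count for kernel_size=3
        self.feature = nn.Conv1d(channels, hidden, 3, padding=pad, dilation=dilation)
        self.gate = nn.Conv1d(channels, hidden, 3, padding=pad, dilation=dilation)
        self.project = nn.Conv1d(hidden, channels, 1)  # back to the input width

    def forward(self, x):                 # x: (batch, channels, frames)
        h = torch.tanh(self.feature(x)) * torch.sigmoid(self.gate(x))
        return x + self.project(h)        # residual connection

# Shape check with the sizes used in Table 1 (32 frames, c1=128, c2=512):
y = GDR1d(512, 128, dilation=5)(torch.randn(1, 512, 32))
assert y.shape == (1, 512, 32)
```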
  • Figure 1. Flowchart of speech enhancement

    Figure 2. The 2-D asymmetric dilated residual (2D-ADR) module and the 1-D gated dilated residual (1D-GDR) module

    Figure 3. Densely connected gated dilated residual network (DCGDR)

    Figure 4. Attention gate module

    Figure 5. Attention-gate dilated convolution network

    Figure 6. Training and validation errors for each loss function

    Figure 7. LSD comparison of the methods

    Figure 8. Spectrograms

    Figure 9. Parameter comparison

    Table 1. Network parameters

| No. | Module   | Parameters                           | Input size    | Output size |
|-----|----------|--------------------------------------|---------------|-------------|
| 1   | Conv2d   | k=(3,3), s=(2,2), c=16               | 128×128×1     | 64×64×16    |
| 2   | 2D-ADR   | d=1                                  | 64×64×16      | 64×64×16    |
| 3   | Conv2d   | k=(3,3), s=(2,2), c=32               | 64×64×16      | 32×32×32    |
| 4   | 2D-ADR   | d=2                                  | 32×32×32      | 32×32×32    |
| 5   | Conv2d   | k=(3,3), s=(1,2), c=64               | 32×32×32      | 32×16×64    |
| 6   | 2D-ADR   | d=4                                  | 32×16×64      | 32×16×64    |
| 7   | Conv2d   | k=(3,3), s=(1,2), c=64               | 32×16×64      | 32×8×64     |
| 8   | Reshape  | –                                    | 32×8×64       | 32×512      |
| 9   | DCGDR    | c1=128, c2=512, d=(1,2,5,9,2,5,9,17) | 32×512        | 32×512      |
| 10  | Reshape  | –                                    | 32×512        | 32×8×64     |
| 11  | Deconv2d | k=(3,3), s=(1,2), c=64               | 32×8×(64+64)  | 32×16×64    |
| 12  | 2D-ADR   | d=1                                  | 32×16×64      | 32×16×64    |
| 13  | Deconv2d | k=(3,3), s=(1,2), c=32               | 32×16×(64+64) | 32×32×32    |
| 14  | 2D-ADR   | d=2                                  | 32×32×32      | 32×32×32    |
| 15  | Deconv2d | k=(3,3), s=(2,2), c=16               | 32×32×(32+32) | 64×64×16    |
| 16  | 2D-ADR   | d=4                                  | 64×64×16      | 64×64×16    |
| 17  | Deconv2d | k=(3,3), s=(2,2), c=1                | 64×64×(16+16) | 128×128×1   |
| 18  | Deconv2d | k=(1,1), s=(1,1), c=1                | 128×128×1     | 128×128×1   |
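Every 2D-ADR row in Table 1 preserves its input size, which constrains the block's form: dilated convolutions with matching padding plus an identity residual. One plausible reading of "asymmetric" is a factorized 3×1/1×3 kernel pair in the spirit of LEDNet [18], which uses fewer parameters than a full 3×3 kernel at the same receptive field. The sketch below is written under that assumption and is not the paper's exact block; the layer ordering and activations are guesses.

```python
import torch
import torch.nn as nn

class ADR2d(nn.Module):
    """Hypothetical 2-D asymmetric dilated residual block: a 3x3
    dilated convolution factorized into 3x1 and 1x3 passes, with
    an identity residual so input and output sizes match Table 1."""
    def __init__(self, channels, dilation):
        super().__init__()
        d = dilation
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, (3, 1), padding=(d, 0), dilation=(d, 1)),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, (1, 3), padding=(0, d), dilation=(1, d)),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(x + self.block(x))  # residual connection

# Shape check against row 6 of Table 1: d=4, a 32x16 map with 64 channels
y = ADR2d(64, dilation=4)(torch.randn(1, 64, 32, 16))
assert y.shape == (1, 64, 32, 16)
```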

    Table 2. Noise types

| Type | Noises |
|---|---|
| Training noise | Babble, Factory1, Volvo, White, F16, Pink, Tank, Machinegun, Office, Street, Restaurant, Bell, Alarm, Destroyer, Hfchannel, Alarm, Traffic, Animal, Wind, Cry, Shower, Laugh |
| Test noise (matched) | Babble, Factory1, Volvo, White, F16, Hfchannel, Street, Restaurant |
| Test noise (mismatched) | Factory2, Buccaneer1, Engine, Leopard, Destroyer, Crowd |

    Table 3. Effect of different dilation rates on the PESQ and STOI (%) of enhanced speech

| Dilation rates (encoder/decoder) | Dilation rates (transmission layer) | PESQ | STOI (%) |
|---|---|---|---|
| 1, 2, 5 | 1, 2, 4, 8, 1, 2, 4, 8 | 2.697 | 84.34 |
| 1, 2, 5 | 1, 2, 5, 9, 1, 2, 5, 9 | 2.722 | 84.68 |
| 1, 2, 4 | 1, 2, 4, 8, 1, 2, 5, 9 | 2.708 | 84.45 |
| 1, 2, 4 | 1, 2, 5, 9, 2, 5, 9, 17 | 2.769 | 84.90 |
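A back-of-the-envelope way to read Table 3: for a stack of kernel-size-3 dilated convolutions, each layer widens the temporal receptive field by 2d frames, so the field grows as RF = 1 + 2·Σd. By that count (our arithmetic, not the paper's), the two transmission-layer settings cover 61 and 101 frames respectively, which is consistent with the larger dilation set scoring higher.

```python
# Receptive field of a stack of kernel-size-3 dilated 1-D convolutions:
# RF = 1 + sum over layers of (kernel - 1) * dilation
def receptive_field(dilations, kernel=3):
    return 1 + sum((kernel - 1) * d for d in dilations)

print(receptive_field([1, 2, 4, 8, 1, 2, 4, 8]))   # -> 61 frames
print(receptive_field([1, 2, 5, 9, 2, 5, 9, 17]))  # -> 101 frames
```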

    Table 4. Effect of the number of attention gates (AGs) on the PESQ and STOI (%) of enhanced speech

| Number of AGs | PESQ | STOI (%) | Parameters (M) |
|---|---|---|---|
| 0 | 2.688 | 84.05 | 1.226 |
| 1 | 2.711 | 84.36 | 1.233 |
| 2 | 2.753 | 84.80 | 1.246 |
| 3 | 2.764 | 84.90 | 1.249 |
| 4 | 2.769 | 84.90 | 1.250 |
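The attention gates counted in Table 4 sit on the encoder-decoder skip connections. The paper cites the attention-gate idea of Attention U-Net [17] and NAAGN [16]; a generic additive attention gate of that kind might be sketched as follows (channel sizes, the choice of 1×1 projections, and the absence of normalization are assumptions, not the paper's exact module).

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Hypothetical additive attention gate in the style of
    Attention U-Net [17]: a decoder (gating) signal g re-weights
    the encoder skip feature x before it is concatenated."""
    def __init__(self, x_ch, g_ch, inter_ch):
        super().__init__()
        self.wx = nn.Conv2d(x_ch, inter_ch, 1)   # 1x1 projections
        self.wg = nn.Conv2d(g_ch, inter_ch, 1)
        self.psi = nn.Conv2d(inter_ch, 1, 1)     # scalar attention map

    def forward(self, x, g):
        # x, g: (batch, ch, H, W), assumed to share spatial size
        a = torch.sigmoid(self.psi(torch.relu(self.wx(x) + self.wg(g))))
        return x * a  # suppresses irrelevant encoder activations

# e.g. gating the 64x64x16 skip connection of Table 1:
x = torch.randn(1, 16, 64, 64)  # encoder feature
g = torch.randn(1, 16, 64, 64)  # upsampled decoder feature
assert AttentionGate(16, 16, 8)(x, g).shape == x.shape
```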

    Table 5. Average PESQ and STOI (%) of each method under matched noise

| Method | PESQ, –5 dB | PESQ, 0 dB | PESQ, 5 dB | PESQ, 10 dB | PESQ, avg. | STOI, –5 dB | STOI, 0 dB | STOI, 5 dB | STOI, 10 dB | STOI, avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| Noisy speech | 1.389 | 1.595 | 1.856 | 2.183 | 1.756 | 58.21 | 65.71 | 72.63 | 78.42 | 68.74 |
| Spectral subtraction | 1.478 | 1.682 | 2.158 | 2.331 | 1.912 | 61.58 | 70.47 | 75.28 | 82.43 | 72.44 |
| RCED | 1.875 | 2.272 | 2.601 | 2.815 | 2.391 | 71.34 | 79.91 | 83.62 | 86.13 | 80.25 |
| CRN | 2.184 | 2.525 | 2.731 | 2.907 | 2.587 | 72.47 | 82.06 | 85.68 | 88.87 | 82.27 |
| BiLSTM | 2.123 | 2.507 | 2.765 | 2.957 | 2.588 | 72.85 | 82.81 | 86.53 | 89.62 | 82.95 |
| GRN | 2.201 | 2.618 | 2.877 | 3.023 | 2.680 | 74.12 | 84.03 | 87.91 | 90.03 | 84.02 |
| AU-net | 2.286 | 2.589 | 2.854 | 3.105 | 2.709 | 76.08 | 84.32 | 87.46 | 90.52 | 84.60 |
| NAAGN | 2.324 | 2.687 | 2.856 | 3.208 | 2.769 | 76.92 | 84.21 | 87.95 | 90.89 | 84.99 |
| Proposed | 2.365 | 2.714 | 3.002 | 3.254 | 2.834 | 76.58 | 84.79 | 89.13 | 92.12 | 85.66 |

    Table 6. Average PESQ and STOI (%) of each method under mismatched noise

| Method | PESQ, –5 dB | PESQ, 0 dB | PESQ, 5 dB | PESQ, 10 dB | PESQ, avg. | STOI, –5 dB | STOI, 0 dB | STOI, 5 dB | STOI, 10 dB | STOI, avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| Noisy speech | 1.402 | 1.674 | 1.927 | 2.238 | 1.810 | 57.74 | 65.70 | 73.02 | 78.33 | 68.70 |
| Spectral subtraction | 1.511 | 1.721 | 2.149 | 2.374 | 1.939 | 62.16 | 68.79 | 76.02 | 82.58 | 72.39 |
| RCED | 1.823 | 2.181 | 2.495 | 2.696 | 2.299 | 68.87 | 76.97 | 82.18 | 85.22 | 78.31 |
| CRN | 1.915 | 2.401 | 2.592 | 2.768 | 2.419 | 70.73 | 78.01 | 84.13 | 86.45 | 79.83 |
| BiLSTM | 1.893 | 2.387 | 2.625 | 2.794 | 2.425 | 70.98 | 79.49 | 84.62 | 88.21 | 80.83 |
| GRN | 2.043 | 2.486 | 2.725 | 2.891 | 2.536 | 72.97 | 82.62 | 86.34 | 88.97 | 82.73 |
| AU-net | 2.121 | 2.495 | 2.731 | 2.942 | 2.572 | 73.66 | 83.05 | 86.08 | 89.22 | 83.00 |
| NAAGN | 2.134 | 2.486 | 2.802 | 2.955 | 2.594 | 73.99 | 83.18 | 86.47 | 89.15 | 83.20 |
| Proposed | 2.224 | 2.621 | 2.865 | 3.108 | 2.705 | 74.37 | 83.55 | 87.58 | 91.05 | 84.14 |

    Table 7. CSIG, CBAK, and COVL scores of each method

| Metric | Noisy speech | Spectral subtraction | RCED | CRN | BiLSTM | GRN | AU-net | NAAGN | Proposed |
|---|---|---|---|---|---|---|---|---|---|
| CSIG | 2.98 | 3.27 | 2.74 | 3.15 | 3.24 | 3.49 | 3.69 | 3.76 | 4.13 |
| CBAK | 1.63 | 2.11 | 2.66 | 2.75 | 2.83 | 3.00 | 3.22 | 3.10 | 3.49 |
| COVL | 1.85 | 2.15 | 2.45 | 2.82 | 2.80 | 3.15 | 3.21 | 3.29 | 3.38 |
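For readers who want to reproduce scores of the kind reported in Tables 5-7: the paper does not state which PESQ/STOI implementations it used, but the open-source pesq and pystoi Python packages compute the metrics defined in [25] and [26]. (CSIG, CBAK, and COVL are the standard composite measures predicting mean opinion scores for signal distortion, background noise intrusiveness, and overall quality, respectively.) A sketch with placeholder file names follows; the wide-band PESQ mode is an assumption.

```python
import soundfile as sf   # pip install soundfile pesq pystoi
from pesq import pesq
from pystoi import stoi

# Placeholder file names: substitute your own clean/enhanced pair.
ref, fs = sf.read("clean.wav")      # reference clean speech
deg, _ = sf.read("enhanced.wav")    # enhanced speech at the same rate

print(pesq(fs, ref, deg, "wb"))                  # PESQ [25], wide-band mode
print(100 * stoi(ref, deg, fs, extended=False))  # STOI in % [26]
```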
  • [1] YELWANDE A, KANSAL S, and DIXIT A. Adaptive Wiener filter for speech enhancement[C]. International Conference on Information, Communication, Instrumentation and Control, Indore, India, 2017: 1–4.
    [2] MIYAZAKI R, SARUWATARI H, INOUE T, et al. Musical-noise-free speech enhancement based on optimized iterative spectral subtraction[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2012, 20(7): 2080–2094. doi: 10.1109/TASL.2012.2196513
    [3] HENDRIKS R C, HEUSDENS R, and JENSEN J. An MMSE estimator for speech enhancement under a combined stochastic–deterministic speech model[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2007, 15(2): 406–415. doi: 10.1109/TASL.2006.881666
    [4] XU Yong, DU Jun, DAI Lirong, et al. An experimental study on speech enhancement based on deep neural networks[J]. IEEE Signal Processing Letters, 2014, 21(1): 65–68. doi: 10.1109/LSP.2013.2291240
    [5] WANG Qing, DU Jun, DAI Lirong, et al. Joint noise and mask aware training for DNN-based speech enhancement with SUB-band features[C]. 2017 Hands-free Speech Communications and Microphone Arrays, San Francisco, USA, 2017: 101–105.
    [6] SALEEM N, KHATTAK M I, AL-HASAN M, et al. On learning spectral masking for single channel speech enhancement using feedforward and recurrent neural networks[J]. IEEE Access, 2020, 8: 160581–160595. doi: 10.1109/ACCESS.2020.3021061
    [7] GERS F A, SCHMIDHUBER J, and CUMMINS F. Learning to forget: Continual prediction with LSTM[J]. Neural Computation, 2000, 12(10): 2451–2471. doi: 10.1162/089976600300015015
    [8] LI Xiaoqi, LI Yaxing, DONG Yuanjie, et al. Bidirectional LSTM network with ordered neurons for speech enhancement[C]. Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Shanghai, China, 2020: 2702–2706.
    [9] YOUSEFI M and HANSEN J H L. Block-based high performance CNN architectures for frame-level overlapping speech detection[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 28–40. doi: 10.1109/TASLP.2020.3036237
    [10] CHOI H, PARK S, PARK J, et al. Multi-speaker emotional acoustic modeling for CNN-based speech synthesis[C]. 2019 IEEE International Conference on Acoustics, Speech and Signal Processing, Brighton, UK, 2019: 6950–6954.
    [11] PARK S R and LEE J W. A fully convolutional neural network for speech enhancement[C]. Proceedings of the Interspeech 2017, Stockholm, Sweden, 2017: 1993–1997.
    [12] TAN Ke and WANG Deliang. A convolutional recurrent neural network for real-time speech enhancement[C]. Proceedings of the Interspeech 2018, Hyderabad, India, 2018: 3229–3233.
    [13] TAN Ke, CHEN Jitong, and WANG Deliang. Gated residual networks with dilated convolutions for monaural speech enhancement[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019, 27(1): 189–198. doi: 10.1109/TASLP.2018.2876171
    [14] HE Kaiming, ZHANG Xiangyu, REN Shaoqing, et al. Deep residual learning for image recognition[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016: 770–778.
    [15] RONNEBERGER O, FISCHER P, and BROX T. U-Net: Convolutional networks for biomedical image segmentation[C]. 18th International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 2015: 234–241.
    [16] DENG Feng, JIANG Tao, WANG Xiaorui, et al. NAAGN: Noise-aware attention-gated network for speech enhancement[C]. Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Shanghai, China, 2020: 2457–2461.
    [17] OKTAY O, SCHLEMPER J, LE FOLGOC L, et al. Attention U-Net: Learning where to look for the pancreas[J]. arXiv: 1804.03999, 2018.
    [18] WANG Yu, ZHOU Quan, LIU Jie, et al. LEDNet: A lightweight encoder-decoder network for real-time semantic segmentation[C]. 2019 IEEE International Conference on Image Processing, Taipei, China, 2019: 1860–1864.
    [19] HUANG Gao, LIU Zhuang, VAN DER MAATEN L, et al. Densely connected convolutional networks[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, 2017: 2261–2269.
    [20] SUBAKAN C, RAVANELLI M, CORNELL S, et al. Attention is all you need in speech separation[C]. IEEE International Conference on Acoustics, Speech and Signal Processing, Toronto, Canada, 2021: 21–25.
    [21] DAUPHIN Y N, FAN A, AULI M, et al. Language modeling with gated convolutional networks[C]. Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 2017: 933–941.
    [22] HU Jie, SHEN L, ALBANIE S, et al. Squeeze-and-excitation networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(8): 2011–2023. doi: 10.1109/TPAMI.2019.2913372
    [23] KANEKO T, KAMEOKA H, TANAKA K, et al. CycleGAN-VC2: Improved CycleGAN-based non-parallel voice conversion[C]. IEEE International Conference on Acoustics, Speech and Signal Processing, Brighton, UK, 2019: 6820–6824.
    [24] GAROFOLO J S, LAMEL L F, FISHER W M, et al. DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM (TIMIT)[R]. NIST Interagency/Internal Report (NISTIR) 4930, 1993.
    [25] ITU-T. P. 862 Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs[S]. International Telecommunications Union (ITU-T) Recommendation, 2001: 862.
    [26] TAAL C H, HENDRIKS R C, HEUSDENS R, et al. An algorithm for intelligibility prediction of time–frequency weighted noisy speech[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2011, 19(7): 2125–2136. doi: 10.1109/TASL.2011.2114881
Publication history
  • Received: 2021-06-30
  • Revised: 2022-03-07
  • Accepted: 2022-03-22
  • Available online: 2022-03-24
  • Published: 2022-09-19
