
Spatial Smoothing Regularization for Bi-direction Long Short-term Memory Model

Wenjie LI, Fengpei GE, Pengyuan ZHANG, Yonghong YAN

Citation: Wenjie LI, Fengpei GE, Pengyuan ZHANG, Yonghong YAN. Spatial Smoothing Regularization for Bi-direction Long Short-term Memory Model[J]. Journal of Electronics & Information Technology, 2019, 41(3): 544-550. doi: 10.11999/JEIT180314


doi: 10.11999/JEIT180314
    About the authors:

    Wenjie LI: female, born in 1993, Ph.D. candidate; research interests include speech signal processing, speech recognition, acoustic modeling, and far-field speech recognition

    Fengpei GE: female, born in 1982, associate research fellow; research interests include speech recognition, pronunciation quality assessment, acoustic modeling, and adaptation

    Pengyuan ZHANG: male, born in 1978, research fellow and master's supervisor; research interests include large-vocabulary speaker-independent continuous speech recognition, keyword search, acoustic modeling, and robust speech recognition

    Yonghong YAN: male, born in 1967, research fellow and Ph.D. supervisor; research interests include speech signal processing, speech recognition, spoken-dialogue and multimodal systems, and human-machine interface technology

    Corresponding author:

    Pengyuan ZHANG, pzhang@hccl.ioa.ac.cn

  • CLC number: TN912.34

Spatial Smoothing Regularization for Bi-direction Long Short-term Memory Model

Funds: The National Key Research and Development Plan (2016YFB0801203, 2016YFB0801200), The National Natural Science Foundation of China (11590770-4, U1536117, 11504406, 11461141004), The Key Science and Technology Project of the Xinjiang Uygur Autonomous Region (2016A03007-1)
  • Abstract:

    The Bi-direction Long Short-Term Memory (BLSTM) model has become the dominant acoustic-model architecture in speech recognition, thanks to its strong temporal-sequence modeling ability and good training stability. However, its larger computational cost and parameter count make it prone to overfitting during training, which keeps it from reaching its ideal recognition performance. In practice, various techniques are used to alleviate overfitting; adding an L2 regularization term to the objective function is one of the most common. This paper proposes a spatial smoothing method: the vector of BLSTM activations is reorganized into a 2-D grid, a filtering transform extracts its spatial information, and smoothing this spatial information serves as an auxiliary optimization target that is combined with the conventional loss function as the learning criterion for the network parameters. Experiments on a conversational telephone speech recognition task show that this method yields a 4% relative Word Error Rate (WER) reduction over the baseline model. The complementarity of L2-norm regularization and spatial smoothing is explored further; applying the two techniques together achieves a relative WER reduction of 8.6%.
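The spatial smoothing objective described above can be sketched in a few lines. This is a minimal illustration only: the 32×32 grid shape and the 4-neighbour Laplacian high-pass filter are assumptions standing in for the paper's filtering transform, whose exact form may differ.

```python
import numpy as np

def spatial_smoothing_loss(activations, grid_shape=(32, 32)):
    """Reshape a 1-D activation vector into a 2-D grid and penalize
    high-frequency spatial content via a Laplacian high-pass filter."""
    g = np.asarray(activations, dtype=float).reshape(grid_shape)
    # 4-neighbour Laplacian (with wrap-around): large where adjacent
    # units in the grid have very different activations.
    lap = (np.roll(g, 1, 0) + np.roll(g, -1, 0)
           + np.roll(g, 1, 1) + np.roll(g, -1, 1) - 4.0 * g)
    return float(np.mean(lap ** 2))

# Used as an auxiliary objective with weight c alongside the usual loss:
# total_loss = ce_loss + c * spatial_smoothing_loss(h_t)
```

A perfectly flat grid incurs zero penalty, while a grid whose neighbouring activations oscillate is penalized, which is what pushes adjacent units toward similar responses during training.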

  • Fig. 1  Memory cell of the LSTM network

    Fig. 2  Reorganizing the 1-D activation vector into a 2-D grid

    Fig. 3  Model architecture

    Table 1  Results of spatial smoothing at different positions

    Position    Weight (c)    CallHm WER (%)    Swbd WER (%)    Overall WER (%)
    Baseline    –             20.0              10.3            15.2
    P1          0.0020        19.9              10.4            15.2
    P1          0.0010        19.9              10.0            15.0
    P1          0.0007        20.0              10.3            15.2
    P2          0.0020        19.7              10.0            14.9
    P2          0.0010        19.7              9.8             14.8
    P2          0.0007        19.9              9.8             15.0
    P3          0.0020        20.1              10.3            15.2
    P3          0.0010        20.0              9.8             15.0
    P3          0.0007        20.0              10.1            15.1
    P4          0.0010        20.9              10.6            15.8
    P4          0.0007        20.6              10.3            15.5
    P4          0.0006        20.5              10.6            15.6

    Table 2  Spatial smoothing results on the cell state ${c_t}$ under different weights

    Weight (c)    CallHm WER (%)    Swbd WER (%)    Overall WER (%)
    Baseline      20.0              10.3            15.2
    0.0100        20.3              10.4            15.4
    0.0010        19.7              9.8             14.8
    0.0009        19.3              9.8             14.6
    0.0008        19.6              9.7             14.7
    0.0007        19.9              9.8             15.0

    Table 3  Results after adding L2 regularization to the network

    L2 regularization    Spatial smoothing    CallHm WER (%)    Swbd WER (%)    Overall WER (%)
    No                   No                   20.0              10.3            15.2
    No                   Yes                  19.3              9.8             14.6
    Yes                  No                   19.0              9.5             14.3
    Yes                  Yes                  18.5              9.3             13.9
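The combination examined in Table 3 amounts to adding both penalties to the training objective. A hedged sketch follows; the weight names `l2_lambda` and `smooth_c`, the grid shape, and the Laplacian filter are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def combined_loss(ce_loss, weights, activations,
                  l2_lambda=1e-4, smooth_c=0.0009, grid_shape=(32, 32)):
    """Cross-entropy plus the two complementary regularizers:
    an L2 penalty on the weights and spatial smoothing of the
    activations reshaped into a 2-D grid."""
    # L2 penalty over all weight matrices.
    l2 = l2_lambda * sum(float(np.sum(w ** 2)) for w in weights)
    # Spatial smoothing penalty on the reshaped activation grid.
    g = np.asarray(activations, dtype=float).reshape(grid_shape)
    lap = (np.roll(g, 1, 0) + np.roll(g, -1, 0)
           + np.roll(g, 1, 1) + np.roll(g, -1, 1) - 4.0 * g)
    smooth = smooth_c * float(np.mean(lap ** 2))
    return ce_loss + l2 + smooth
```

Because the two terms act on different quantities (parameters vs. activations), they can shrink different sources of overfitting, which is consistent with the additive gains reported in Table 3.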
Publication history
  • Received: 2018-04-03
  • Revised: 2018-11-22
  • Available online: 2018-12-03
  • Published: 2019-03-01
