Fast Decoding Algorithm for Automatic Speech Recognition Based on Recurrent Neural Networks

ZHANG Ge, ZHANG Pengyuan, PAN Jielin, YAN Yonghong

Citation: ZHANG Ge, ZHANG Pengyuan, PAN Jielin, YAN Yonghong. Fast Decoding Algorithm for Automatic Speech Recognition Based on Recurrent Neural Networks[J]. Journal of Electronics & Information Technology, 2017, 39(4): 930-937. doi: 10.11999/JEIT160543


doi: 10.11999/JEIT160543

Funds: The National Natural Science Foundation of China (U1536117, 11590770-4), The National Key Research and Development Plan of China (2016YFB0801200, 2016YFB0801203), The Key Science and Technology Project of the Xinjiang Uygur Autonomous Region (2016A03007-1)

  • Abstract: Recurrent Neural Networks (RNNs) are now widely used for acoustic modeling in Automatic Speech Recognition (ASR). Although they offer substantial advantages over traditional acoustic modeling approaches, their relatively high computational complexity limits their application, especially in real-time scenarios. Because the input features of an RNN typically span a long temporal context, the overlapping information can be exploited to reduce the time complexity of both acoustic posterior computation and token passing. This paper introduces a new decoder structure that lowers the computational cost of decoding by regularly discarding frames whose contexts overlap. Notably, the method can be applied directly to the original RNN model and requires only minor changes to the Hidden Markov Model (HMM) structure, which makes it highly flexible. The proposed method is validated on a time-delay neural network, showing that it achieves a speedup of 2 to 4 times with a relatively small loss in accuracy.
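The frame-dropping idea described in the abstract can be illustrated with a short sketch. The Python code below is only an illustrative approximation under assumed interfaces, not the decoder implemented in the paper: DummyAcousticModel, DummyTokenPassingDecoder, frame_dropping_decode, and keep_every are hypothetical stand-ins. It shows how evaluating the acoustic model and updating the token-passing search only on every k-th frame reduces both costs at once, assuming the HMM topology has been adjusted to match the coarser frame rate.

import numpy as np

class DummyAcousticModel:
    """Hypothetical stand-in for an RNN/TDNN acoustic model: returns random log-posteriors."""
    def __init__(self, num_states=10):
        self.num_states = num_states

    def forward(self, frame):
        logits = np.random.randn(self.num_states)
        return logits - np.logaddexp.reduce(logits)  # log-softmax over HMM states

class DummyTokenPassingDecoder:
    """Hypothetical stand-in for an HMM/WFST token-passing decoder (greedy argmax here)."""
    def __init__(self):
        self.path = []

    def step(self, log_posterior):
        self.path.append(int(np.argmax(log_posterior)))

    def best_hypothesis(self):
        return self.path

def frame_dropping_decode(model, features, decoder, keep_every=3):
    """Decode while processing only every `keep_every`-th frame.

    Neighbouring frames share most of their input context, so the dropped
    frames carry largely redundant information; the HMM topology is assumed
    to be adjusted so that state durations match the coarser frame rate."""
    for t in range(0, len(features), keep_every):
        posterior = model.forward(features[t])  # acoustic forward pass only on kept frames
        decoder.step(posterior)                 # token passing only on kept frames
    return decoder.best_hypothesis()

if __name__ == "__main__":
    feats = np.random.randn(100, 40)  # 100 frames of 40-dimensional acoustic features
    hyp = frame_dropping_decode(DummyAcousticModel(), feats, DummyTokenPassingDecoder())
    print(f"{len(hyp)} frames processed instead of {len(feats)}")

With keep_every = 3, only about one third of the frames incur a neural-network forward pass and a token-passing update, which is consistent with the 2 to 4 times speedup range reported in the abstract.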
Publication History
  • Received Date: 2016-05-26
  • Revised Date: 2017-01-09
  • Published Date: 2017-04-19
