Fast Decoding Algorithm for Automatic Speech Recognition Based on Recurrent Neural Networks

ZHANG Ge; ZHANG Pengyuan; PAN Jielin; YAN Yonghong

doi:10.11999/JEIT160543

Volume 39 Issue 4

Apr. 2017

Turn off MathJax

Article Contents

Article Navigation > Journal of Electronics & Information Technology > 2017 > 39(4): 930-937

ZHANG Ge, ZHANG Pengyuan, PAN Jielin, YAN Yonghong. Fast Decoding Algorithm for Automatic Speech Recognition Based on Recurrent Neural Networks[J]. Journal of Electronics & Information Technology, 2017, 39(4): 930-937. doi: 10.11999/JEIT160543

Citation:

ZHANG Ge, ZHANG Pengyuan, PAN Jielin, YAN Yonghong. Fast Decoding Algorithm for Automatic Speech Recognition Based on Recurrent Neural Networks[J]. Journal of Electronics & Information Technology, 2017, 39(4): 930-937. doi: 10.11999/JEIT160543

Citation:

PDF( 233 KB)

Fast Decoding Algorithm for Automatic Speech Recognition Based on Recurrent Neural Networks

doi: 10.11999/JEIT160543 cstr: 32379.14.JEIT160543

Funds:

The National Natural Science Foundation of China (U1536117, 11590770-4), The National Key Research and Development Plan of China (2016YFB0801200, 2016YFB0801203), The Key Science and Technology Project of the Xinjiang Uygur Autonomous Region (2016A03007-1)

Received Date: 2016-05-26
Rev Recd Date: 2017-01-09
Publish Date: 2017-04-19

Abstract

Abstract

Recurrent Neural Networks (RNN) are widely used for acoustic modeling in Automatic Speech Recognition (ASR). Although RNNs show many advantages over traditional acoustic modeling methods, the inherent higher computational cost limits its usage, especially in real-time applications. Noticing that the features used by RNNs usually have relatively long acoustic contexts, it is possible to lower the computational complexity of both posterior calculation and token passing process with overlapped information. This paper introduces a novel decoder structure that drops the overlapped acoustic frames regularly, which leads to a significant computational cost reduction in the decoding process. Especially, the new approach can directly use the original RNNs with minor modifications on the HMM topology, which makes it flexible. In experiments on conversation telephone speech datasets, this approach achieves 2 to 4 times speedup with little relative accuracy reduction.
- Speech recognition,
- Recurrent Neural Network (RNN),
- Decoder,
- Frame skipping

FullText(HTML)

References(18)

References

GRAVES Alex, JAITLY Navdeep, and MOHAMED Abdel-rahman. Hybrid speech recognition with deep bidirectional LSTM[C]. 2013 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Olomouc, Czech Republic, 2013: 273-278.

SAK Hasim, SENIOR Andrew, and BEAUFAYS Franoise. Long short-term memory recurrent neural network architectures for large scale acoustic modeling[C]. 15th Annual Conference of the International Speech Communication Association (Interspeech 2014), Singapore, 2014: 338-342.

NARAYANAN Arun, MISRA Ananya, and CHIN Kean. Large-scale, sequence-discriminative, joint adaptive training for masking-based robust ASR[C]. 16th Annual Conference of the International Speech Communication Association (Interspeech 2015), Dresden, Germany, 2015: 3571-3575.

LI Jinyu, MOHAMED Abdelrahman, ZWEIG Geoffrey, et al. Exploring multidimensional LSTMs for large vocabulary ASR[C]. 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 2016: 4940-4944.

PEDDINTI Vijayaditya, POVEY Daniel, and KHUDANPUR Sanjeev. A time delay neural network architecture for efficient modeling of long temporal contexts[C]. 16th Annual Conference of the International Speech Communication Association (Interspeech 2015), Dresden, Germany, 2015: 3214-3218.

SNYDER David, GARCIA-ROMERO Daniel, and POVEY Daniel. Time delay deep neural network-based universal background models for speaker recognition[C]. 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Scottsdale, USA, 2015: 92-97.

PEDDINTI Vijayaditya, CHEN Guoguo, MANOHAR Vimal, et al. JHU ASpIRE system: robust LVCSR with TDNNs, i-vector adaptation, and RNN-LMs[C]. 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Scottsdale, USA, 2015: 539-546.

SEIDE Frank, LI Gang, and YU Dong. Conversational speech transcription using context-dependent deep neural networks[C]. 12th Annual Conference of the International Speech Communication Association (Interspeech 2011), Florence, Italy, 2011: 437-440.

SELTZER Michael L, YU Dong, and WANG Yongqiang. An investigation of deep neural networks for noise robust speech recognition[C]. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, Canada, 2013: 7398-7402.

VANHOUCKE Vincent, DEVIN Matthieu, and HEIGOLD Georg. Multiframe deep neural networks for acoustic modeling[C]. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, Canada, 2013: 7582-7585.

MOORE Darren, DINES John, DOSS Mathew Magimai, et al. Juicer: A Weighted Finite-State Transducer Speech Decoder[M]. Berlin, Heidelberg, Springer, 2006: 285-296.

YOUNG S J, RUSSELL N H, and THORNTON J H S. Token passing: A simple conceptual model for connected speech recognition systems[R]. CUED/F-INFENG/TR38, Engineering Department, Cambridge University, 1989.

NOLDEN David, SCHLTER Ralf, and NEY Hermann. Extended search space pruning in LVCSR[C]. 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, 2012: 4429-4432.

郭宇弘. 基于加权有限状态转换机的语音识别系统研究[D]. [博士论文], 中国科学院大学, 2013: 1-20.

GUO Yuhong. Automatic speech recognition system based on weighted finite-state transducers[D]. [Ph.D. dissertation], University of Chinese Academy of Sciences, 2013: 1-20.

RABINER Lawrence R and JUANG Biinghwang. An introduction to hidden Markov models[J]. IEEE ASSP Magazine, 1986, 3(1): 4-16. doi: 10.1109/MASSP.1986. 1165342

YOUNG Steve, EVERMANN Gunnar, GALES Mark, et al. The HTK Book Vol. 2[M]. Cambridge, Entropic Cambridge Research Laboratory, 1997: 59-210.

ZHANG Qingqing, SOONG Frank, QIAN Yao, et, al. Improved modeling for F0 generation and V/U decision in HMM-based TTS[C]. 2010 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), Dallas, USA, 2010: 4606-4609.

Relative Articles

Supplements(0)

Cited By

Proportional views