Volume 26 Issue 3
Mar. 2004
Jiang Dong-mei, Xie Lei, Ilse Ravyse, Zhao Rong-chun, Hichem Sahli, Jan Cornelis. The Viseme Based Continuous Speech Recognition System for a Talking Head[J]. Journal of Electronics & Information Technology, 2004, 26(3): 375-381.

The Viseme Based Continuous Speech Recognition System for a Talking Head

  • Received Date: 2002-07-25
  • Rev Recd Date: 2003-03-10
  • Publish Date: 2004-03-19
  • A continuous speech recognition system for a talking head is presented, based on viseme (the basic speech unit in the visual domain) HMMs, which segments speech into mouth-shape sequences with timing boundaries. Trisemes are formalized to take viseme contexts into account. Based on the 3D talking-head images, the viseme similarity weight (VSW) is defined, and 166 visual questions are designed for building the triseme decision trees that tie the states of trisemes with similar contexts so that they can share the same parameters. For system evaluation, besides the recognition rate, an image-related measurement, the viseme-similarity-weighted accuracy, accounts for mismatches between the recognized viseme sequence and its reference (an illustrative sketch of this metric follows below), and jerky points in the lip-rounding and VSW graphs help assess the smoothness of the resulting viseme image sequences. Results show that the viseme-based speech recognition system produces smoother and more plausible mouth shapes.
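The viseme-similarity-weighted accuracy mentioned in the abstract is only described at a high level on this page. The following Python sketch is a hypothetical illustration of how such a metric could be computed, assuming that substitutions between visually similar visemes are discounted by their VSW while insertions and deletions count as full errors; the paper's exact definition may differ. The function name vsw_weighted_accuracy, the toy viseme labels, and the similarity weights are illustrative assumptions, not taken from the paper.

    # Hypothetical sketch of a viseme-similarity-weighted accuracy (VSWA).
    # Assumption: a substitution is penalized by (1 - VSW) between the
    # recognized and reference visemes, while insertions and deletions cost 1,
    # analogous to a standard accuracy (N - D - S - I) / N but with
    # similarity-weighted substitution errors.

    def vsw_weighted_accuracy(reference, recognized, vsw):
        """Align two viseme sequences and return a similarity-weighted accuracy.

        reference, recognized: lists of viseme labels (strings).
        vsw: dict mapping (reference_viseme, recognized_viseme) -> weight in [0, 1].
        """
        n, m = len(reference), len(recognized)
        # dp[i][j] = minimum weighted edit cost aligning reference[:i] with recognized[:j]
        dp = [[0.0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            dp[i][0] = float(i)          # deletions only
        for j in range(1, m + 1):
            dp[0][j] = float(j)          # insertions only
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                ref_v, rec_v = reference[i - 1], recognized[j - 1]
                # Visually similar visemes (high VSW) incur a small substitution penalty.
                sub_cost = 0.0 if ref_v == rec_v else 1.0 - vsw.get((ref_v, rec_v), 0.0)
                dp[i][j] = min(
                    dp[i - 1][j] + 1.0,          # deletion
                    dp[i][j - 1] + 1.0,          # insertion
                    dp[i - 1][j - 1] + sub_cost  # match or weighted substitution
                )
        return 1.0 - dp[n][m] / max(n, 1)


    if __name__ == "__main__":
        # Toy example with made-up viseme labels and similarity weights.
        ref = ["sil", "p", "aa", "f", "sil"]
        hyp = ["sil", "b", "aa", "v", "sil"]
        weights = {("p", "b"): 0.9, ("f", "v"): 0.8}   # visually near-identical pairs
        print(f"VSW-weighted accuracy: {vsw_weighted_accuracy(ref, hyp, weights):.3f}")

In this toy run, confusing p with b and f with v costs only 0.3 error in total, so the weighted accuracy stays at 0.94, whereas an unweighted accuracy would drop to 0.6 for the same two substitutions.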
