A Viseme-Based Continuous Speech Recognition System and Talking Head

Jiang Dongmei; Xie Lei; Ilse Ravyse; Zhao Rongchun; Hichem Sahli; Jan Cornelis

Publication history
- Received: 2002-07-25
- Revised: 2003-03-10
- Published: 2004-03-19

The Viseme Based Continuous Speech Recognition System for a Talking Head

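Abstract

Abstract (Chinese version): To realize audio/visually driven talking head animation, this paper presents a continuous speech recognition system based on visemes (the basic mouth-shape units of speech). The system trains viseme hidden Markov models (HMMs) and recognizes speech as a sequence of viseme images. The modeling uses the concept of the triseme to account for the context dependency of visemes, but this requires a very large amount of training data. Based on the viseme images and their viseme similarity weights (VSW), the paper defines a set of visual questions that is used to build triseme decision trees, so that triseme states can be tied and HMM parameters shared. To compare system performance, the output of a speech recognizer based on phonemes (the basic speech units in the auditory domain) is also mapped to viseme sequences. As evaluation criteria, a viseme image similarity weighted recognition accuracy is defined to account more fully for the differences between the output and reference image sequences, and abrupt change points in the lip-rounding and VSW curves are used to assess the smoothness of the resulting viseme sequences. The results show that the viseme-based speech recognition system yields smoother and more plausible mouth-shape image sequences.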
- Keywords: Talking head animation; Viseme; Triseme decision tree; Viseme image similarity weighting; Lip rounding; VSW curve
Abstract: This paper presents a continuous speech recognition system for a talking head. The system is based on HMMs of visemes (the basic speech units in the visual domain) and segments speech into mouth-shape sequences with timing boundaries. Trisemes are formalized to take the viseme contexts into account. Based on the 3D talking head images, the viseme similarity weight (VSW) is defined, and 166 visual questions are designed for building the triseme decision trees, which tie the states of trisemes with similar contexts so that they can share the same parameters. For system evaluation, besides the recognition rate, an image-related measure, the viseme similarity weighted accuracy, accounts for the mismatches between the recognized viseme sequence and its reference, and jerky points in the lip-rounding and VSW graphs help evaluate the smoothness of the resulting viseme image sequences. Results show that the viseme-based speech recognition system gives smoother and more plausible mouth shapes.
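
The evaluation ideas named in the abstract can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration: the phoneme-to-viseme table, the viseme labels, the similarity values, the frame-aligned averaging used for the weighted accuracy, and the threshold test used for jerky points are hypothetical and are not taken from the paper, which does not give its definitions here.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical phoneme-to-viseme table (illustrative only, not the paper's mapping).
PHONEME_TO_VISEME: Dict[str, str] = {
    "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
    "f": "V_labiodental", "v": "V_labiodental",
    "uw": "V_rounded", "ow": "V_rounded",
    "aa": "V_open", "ae": "V_open",
}

# Hypothetical symmetric similarity weights in [0, 1] between viseme images.
# Pairs that are not listed are treated as completely dissimilar (0.0).
SIMILARITY: Dict[Tuple[str, str], float] = {
    ("V_bilabial", "V_labiodental"): 0.6,
    ("V_rounded", "V_open"): 0.3,
}


def vsw(a: str, b: str) -> float:
    """Return the assumed similarity weight of two viseme labels."""
    if a == b:
        return 1.0
    return SIMILARITY.get((a, b), SIMILARITY.get((b, a), 0.0))


def map_phonemes(phonemes: Sequence[str]) -> List[str]:
    """Map a phoneme sequence to a viseme sequence via the lookup table."""
    return [PHONEME_TO_VISEME[p] for p in phonemes]


def weighted_accuracy(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Average similarity of aligned reference/recognized visemes.

    Assumes both sequences are already aligned item by item
    (for example, one label per video frame).
    """
    if not ref or len(ref) != len(hyp):
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(vsw(r, h) for r, h in zip(ref, hyp)) / len(ref)


def jerky_points(curve: Sequence[float], threshold: float) -> List[int]:
    """Indices where the curve jumps by more than `threshold` in one step."""
    return [
        i for i in range(1, len(curve))
        if abs(curve[i] - curve[i - 1]) > threshold
    ]


if __name__ == "__main__":
    reference = map_phonemes(["p", "aa", "m", "uw"])
    recognized = map_phonemes(["b", "aa", "f", "uw"])
    print(weighted_accuracy(reference, recognized))
    print(jerky_points([0.9, 0.85, 0.2, 0.25, 0.8], threshold=0.5))
```

Under these assumptions, a substitution between visually similar visemes (a bilabial for a labiodental) costs less than one between dissimilar mouth shapes, which is the intuition behind weighting accuracy by image similarity rather than counting every mismatch equally.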
