摘要:
为实现听觉/视觉驱动的说话人头部动画,该文给出了一个基于viseme(说话时的基本嘴形单位)的连续语音识别系统。它训练viseme隐马尔可夫模型(HMM),识别语音为viseme图像序列。建模采用triseme的概念来考虑viseme的上下文相关性,但它需要超大量的训练数据。该文根据viseme图像及其相似度权值(VSW)定义视觉问题集,用来建立triseme决策树,以实现triseme的状态捆绑及HMM参数共享。为比较系统性能,基于phoneme(听觉领域的语音基本单位)的语音识别结果也被映射为viseme序列。在评价准则上,定义viseme图像相似度加权识别精度,更全面地考虑输出和参考图像序列的差别,并用嘴形圆度和VSW曲线中的突变点来评估所得viseme序列的平滑性。结果表明,基于viseme的语音识别系统能给出更平滑和合理的嘴形图像序列。
Abstract:
A continuous speech recognition system for a talking head is presented in this paper, which is based on the viseme (the basic speech unit in visual domain) HMMs and segments speech to mouth shape sequences with timing boundaries. The trisemes are for malized to consider the viseme contexts. Based on the 3D talking head images, the viseme similarity weight (VSW) is denned, and 166 visual questions are designed for the building of the triseme decision trees to tie the states of the trisemes with similar contexts, so that they can share the same parameters. For the system evaluation, besides the recognition rate, an image related measurement, the viseme similarity weighted accuracy accounts for the mismatches of the recognized viseme sequence with its reference, and jerky points in liprounding and VSW graphs help evaluate the smoothness of the resulting viseme image sequences. Results show that the viseme based speech recognition system gives smoother and more plausible mouth shapes.