Visual Speech Synthesis Algorithm Based on Chinese Visual Triphone

Zhao Hui; Tang Chao-jing

doi:10.3724/SP.J.1146.2008.01634


Funding:

Supported by the National Ministries and Commissions Foundation of China (51329060101)

Publication history
- Received: 2008-12-05
- Revised: 2009-06-19
- Published: 2009-12-19

Visual Speech Synthesis Algorithm Based on Chinese Visual Triphone

Abstract

To synthesize realistic video sequences, this paper proposes a visual speech synthesis method based on Chinese visual triphones. The concept of the visual triphone is derived from the pronunciation rules of Chinese and the correspondence between phonemes and visemes. On this basis, hidden Markov model (HMM) training and synthesis models are built. Training uses joint audio-visual features augmented with dynamic features. In synthesis, the visual triphone HMMs are concatenated into a sentence HMM, from which feature parameters are extracted to synthesize visual speech. Subjective and objective evaluations indicate that the synthesized video is realistic and rated as satisfactory.

Keywords: visual speech synthesis; visual triphone; hidden Markov model (HMM); joint features
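
The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact definition of a visual triphone, the number of HMM states, or the dynamic-feature window. The `left-center+right` label format, the `sil` boundary unit, the three states per model, and the central-difference delta are therefore all assumptions made for illustration, and the function names are hypothetical.

```python
def to_triphones(units, boundary="sil"):
    """Expand a unit sequence into context-dependent labels.

    Assumption: labels use the common 'left-center+right' notation and
    the utterance is padded with a boundary unit on both sides.
    """
    padded = [boundary] + list(units) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


def sentence_states(triphones, states_per_model=3):
    """Concatenate the left-to-right states of each triphone model
    into one sentence-level state sequence (assumed 3 states each)."""
    return [(tri, s) for tri in triphones for s in range(states_per_model)]


def add_delta(frames):
    """Append first-order dynamic features to each static frame.

    Assumption: delta[t] = 0.5 * (x[t+1] - x[t-1]), with the first and
    last frames repeated at the edges.
    """
    n = len(frames)
    out = []
    for t in range(n):
        prev_frame = frames[max(t - 1, 0)]
        next_frame = frames[min(t + 1, n - 1)]
        delta = [0.5 * (b - a) for a, b in zip(prev_frame, next_frame)]
        out.append(list(frames[t]) + delta)
    return out


def joint_features(audio_frames, visual_frames):
    """Concatenate time-aligned audio and visual frames, then add deltas."""
    joint = [list(a) + list(v) for a, v in zip(audio_frames, visual_frames)]
    return add_delta(joint)
```

For example, `to_triphones(["n", "i"])` yields one label per unit, and `sentence_states` then lists the states that a sentence-level model built from those labels would traverse in order.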


