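Efficient Audio-visual Cross-modal Speaker Tagging via Supervised Joint Correspondence Auto-encoder

LIU Xin, LI Heyang, ZHONG Bineng, DU Jixiang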

Citation: LIU Xin, LI Heyang, ZHONG Bineng, DU Jixiang. Efficient Audio-visual Cross-modal Speaker Tagging via Supervised Joint Correspondence Auto-encoder[J]. Journal of Electronics & Information Technology, 2018, 40(7): 1635-1642. doi: 10.11999/JEIT171011


doi: 10.11999/JEIT171011

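Details
    About the authors:

    LIU Xin: male, born in 1982, Ph.D., associate professor; research interests: biometric recognition and machine learning. LI Heyang: male, born in 1994, master's student; research interests: computer vision and pattern recognition. ZHONG Bineng: male, born in 1981, Ph.D., professor; research interests: machine learning and pattern recognition. DU Jixiang: male, born in 1977, Ph.D., professor; research interests: computer vision and machine learning.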

  • Chinese Library Classification (CLC) number: TP391.4


Funds: 

The National Natural Science Foundation of China (61673185, 61572205, 61673186), The Natural Science Foundation of Fujian Province (2017J01112), The Promotion Program for Young and Middle-aged Teacher in Science and Technology Research of Huaqiao University (ZQN-309)

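  • Abstract: Cross-modal speaker tagging aims to match and mutually annotate different biometric traits of a speaker, and can be widely applied in human-computer interaction scenarios. To address the evident "semantic gap" between the face and voice modalities, this paper proposes a cross-modal audio-visual speaker tagging method based on a supervised joint correspondence auto-encoder. First, a convolutional neural network and a deep belief network are used to extract discriminative features from face images and speech data, respectively. Then, building on the joint auto-encoder model, a new supervised cross-modal neural network model is proposed, in which a softmax regression model is embedded to preserve both inter-modal and intra-modal sample similarity. This model is further extended into three supervised correspondence auto-encoder neural network models that mine the latent relationship between the heterogeneous audio and visual features, thereby enabling effective cross-modal mutual tagging between face and voice. Experimental results show that the proposed network models tag speakers across modalities effectively and are robust to pose variation and sample diversity.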
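
To make the idea in the abstract concrete, the following is a minimal sketch of what a supervised correspondence auto-encoder objective could look like for one face/voice pair. It is not the authors' formulation, which this page does not give. The single-layer sigmoid encoders, the linear decoders, the shared softmax layer `U`, the weights `alpha` and `beta`, and all function and parameter names are assumptions made for illustration. The inputs are assumed to be feature vectors already extracted by the CNN (face) and the DBN (voice).

```python
import math

def affine(W, b, x):
    """Return W @ x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def sigmoid(z):
    return [1.0 / (1.0 + math.exp(-v)) for v in z]

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample (log-sum-exp form for stability)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[label]

def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]

def zero_params(d_face, d_voice, d_hidden, n_classes):
    """Hypothetical parameter layout; all-zero values are only for demonstration."""
    return {
        "W_f": zeros(d_hidden, d_face), "b_f": [0.0] * d_hidden,   # face encoder
        "W_v": zeros(d_hidden, d_voice), "b_v": [0.0] * d_hidden,  # voice encoder
        "D_f": zeros(d_face, d_hidden), "c_f": [0.0] * d_face,     # face decoder
        "D_v": zeros(d_voice, d_hidden), "c_v": [0.0] * d_voice,   # voice decoder
        "U": zeros(n_classes, d_hidden), "d": [0.0] * n_classes,   # shared softmax
    }

def supervised_corr_ae_loss(x_face, x_voice, label, p, alpha=0.5, beta=0.5):
    """Assumed objective: reconstruction + alpha * correspondence + beta * softmax."""
    h_f = sigmoid(affine(p["W_f"], p["b_f"], x_face))    # face hidden code
    h_v = sigmoid(affine(p["W_v"], p["b_v"], x_voice))   # voice hidden code
    r_f = affine(p["D_f"], p["c_f"], h_f)                # reconstructed face feature
    r_v = affine(p["D_v"], p["c_v"], h_v)                # reconstructed voice feature

    recon = sq_dist(r_f, x_face) + sq_dist(r_v, x_voice)  # within-modality term
    corr = sq_dist(h_f, h_v)                              # pulls paired codes together
    cls = (cross_entropy(affine(p["U"], p["d"], h_f), label)
           + cross_entropy(affine(p["U"], p["d"], h_v), label))  # speaker supervision

    total = recon + alpha * corr + beta * cls
    return total, {"recon": recon, "corr": corr, "cls": cls}

# Usage with toy dimensions: 2-D face feature, 1-D voice feature, 3 speakers.
params = zero_params(d_face=2, d_voice=1, d_hidden=2, n_classes=3)
total, parts = supervised_corr_ae_loss([1.0, 2.0], [3.0], label=1, p=params)
```

In this sketch the correspondence term ties the two hidden codes of a paired sample together, while the shared softmax term encourages codes from either modality to be separable by speaker identity. Tagging a face with a voice (or the reverse) would then amount to comparing hidden codes across modalities.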
Publication history
  • Received: 2017-10-30
  • Revised: 2018-04-10
  • Published: 2018-07-19
