Volume 40 Issue 7
Jul. 2018
Citation: LIU Xin, LI Heyang, ZHONG Bineng, DU Jixiang. Efficient Audio-visual Cross-modal Speaker Tagging via Supervised Joint Correspondence Auto-encoder[J]. Journal of Electronics & Information Technology, 2018, 40(7): 1635-1642. doi: 10.11999/JEIT171011

Efficient Audio-visual Cross-modal Speaker Tagging via Supervised Joint Correspondence Auto-encoder

doi: 10.11999/JEIT171011
Funds:

The National Natural Science Foundation of China (61673185, 61572205, 61673186), The Natural Science Foundation of Fujian Province (2017J01112), The Promotion Program for Young and Middle-aged Teacher in Science and Technology Research of Huaqiao University (ZQN-309)

  • Received Date: 2017-10-30
  • Rev Recd Date: 2018-04-10
  • Publish Date: 2018-07-19
  • Cross-modal speaker tagging aims to learn the latent relationship between different biometrics for mutual annotation, which can potentially be used in various human-computer interaction scenarios. To bridge the “semantic gap” between the face and audio modalities, this paper presents an efficient supervised joint correspondence auto-encoder that links a face with its audio counterpart, whereby the speaker can be crosswise tagged. First, a Convolutional Neural Network (CNN) and a Deep Belief Network (DBN) are used to extract discriminative features from the face and audio samples, respectively. Then, a supervised neural network model associated with softmax regression is embedded into a joint auto-encoder model, which discriminatively preserves the inter-modal and intra-modal similarities. Accordingly, three kinds of supervised joint correspondence auto-encoder models are presented to correlate the semantic relationships between the face and audio counterparts, so that the speaker can be crosswise annotated efficiently. The experimental results show that the proposed supervised joint auto-encoder performs cross-modal speaker tagging with outstanding performance and is robust to facial posture variations and sample diversity.
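
    The abstract does not give the objective function, so the following is only a minimal sketch of how a supervised correspondence auto-encoder loss of this general kind could be assembled. It assumes one sigmoid encoder and one linear decoder per modality, a squared-distance correspondence term between the two hidden codes, and a single softmax classifier shared by both codes. The weights `alpha` and `beta`, the feature dimensions, and all function names are hypothetical, and the random vectors merely stand in for the CNN face features and DBN audio features. It is not one of the paper's three models.

    ```python
    import math
    import random


    def affine(W, b, x):
        """Return W @ x + b, with W stored as a list of rows."""
        return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


    def sigmoid(z):
        return [1.0 / (1.0 + math.exp(-v)) for v in z]


    def softmax(z):
        m = max(z)  # subtract the max for numerical stability
        e = [math.exp(v - m) for v in z]
        total = sum(e)
        return [v / total for v in e]


    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))


    def init_layer(n_out, n_in, rng):
        """Small random weights and zero biases for one layer."""
        W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        return W, [0.0] * n_out


    def joint_loss(face, audio, label, params, alpha=0.2, beta=0.5):
        """Loss terms for one (face, audio, speaker label) training pair.

        alpha trades reconstruction against correspondence; beta weights the
        supervised softmax term. Both weights are assumptions of this sketch.
        """
        enc_f, enc_a, dec_f, dec_a, clf = params
        h_f = sigmoid(affine(*enc_f, face))    # hidden code of the face feature
        h_a = sigmoid(affine(*enc_a, audio))   # hidden code of the audio feature
        # Reconstruction error of each modality from its own code.
        recon = sq_dist(face, affine(*dec_f, h_f)) + sq_dist(audio, affine(*dec_a, h_a))
        # Correspondence term: paired codes should lie close together.
        corr = sq_dist(h_f, h_a)
        # Supervised term: both codes should predict the same speaker label.
        ce = sum(-math.log(softmax(affine(*clf, h))[label]) for h in (h_f, h_a))
        total = (1 - alpha) * recon + alpha * corr + beta * ce
        return {"recon": recon, "corr": corr, "ce": ce, "total": total}


    # Hypothetical sizes: 8-d face feature, 6-d audio feature, 4-d code, 3 speakers.
    rng = random.Random(0)
    FACE_DIM, AUDIO_DIM, CODE_DIM, N_SPEAKERS = 8, 6, 4, 3
    params = (
        init_layer(CODE_DIM, FACE_DIM, rng),    # face encoder
        init_layer(CODE_DIM, AUDIO_DIM, rng),   # audio encoder
        init_layer(FACE_DIM, CODE_DIM, rng),    # face decoder
        init_layer(AUDIO_DIM, CODE_DIM, rng),   # audio decoder
        init_layer(N_SPEAKERS, CODE_DIM, rng),  # shared softmax classifier
    )
    face = [rng.random() for _ in range(FACE_DIM)]
    audio = [rng.random() for _ in range(AUDIO_DIM)]
    terms = joint_loss(face, audio, label=1, params=params)
    print(terms)
    ```

    In a sketch like this, minimizing `total` over many labeled face-audio pairs would pull paired codes together while keeping them predictive of the speaker, so that a face code could be matched against audio codes (and vice versa) at tagging time.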