Emotion Recognition with Speech and Facial Images

XUE Peiyun, DAI Shutao, BAI Jing, GAO Xiang

Citation: XUE Peiyun, DAI Shutao, BAI Jing, GAO Xiang. Emotion Recognition with Speech and Facial Images[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT240087

doi: 10.11999/JEIT240087
Funds: Shanxi Youth Foundation (20210302124544), Applied Basic Research Program of Shanxi Province (201901D111094)
Author information:

    XUE Peiyun: Female, Lecturer, research interests include signal processing

    DAI Shutao: Male, Master's student, research interests include signal processing

    BAI Jing: Female, Professor, research interests include signal processing

    GAO Xiang: Male, Master's student, research interests include signal processing

    Corresponding author:

    XUE Peiyun, xuepeiyun@tyut.edu.cn

  • CLC number: TN911.7; TP391.41

  • Abstract: To improve the accuracy of emotion recognition models and address the insufficient extraction of emotional features, bimodal emotion recognition based on speech and facial images is studied. For the speech modality, a Multi-branch Convolutional Neural Network (MCNN) feature extraction model combined with a channel-spatial attention mechanism is proposed to extract emotional features from speech spectrograms along the temporal, spatial, and local feature dimensions. For the facial image modality, a Residual Hybrid Convolutional Neural Network (RHCNN) feature extraction model is proposed, and a parallel attention mechanism is further established to attend to global emotional features and improve recognition accuracy. The extracted speech and facial image features are classified through separate classification layers, and decision fusion is applied to combine the two recognition results into the final classification. Experimental results show that the proposed bimodal fusion model achieves recognition accuracies of 97.22%, 94.78%, and 96.96% on the RAVDESS, eNTERFACE’05, and RML datasets, respectively, which are 11.02%, 4.24%, and 8.83% higher than the speech-only model and 4.60%, 6.74%, and 4.10% higher than the facial-image-only model, and which also exceed related methods reported on the corresponding datasets in recent years. This demonstrates that the proposed bimodal fusion model can effectively focus on emotional information and thereby improve the accuracy of emotion recognition.
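
The decision-fusion step named in the abstract combines the two unimodal classifiers at the score level; the paper's exact rule is given in Figure 6 and is not reproduced here. The snippet below is only a minimal sketch of one common decision-level scheme, a weighted average of the two branches' softmax scores, in which the weight, tensor shapes, and function name are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def decision_fusion(speech_logits: torch.Tensor,
                    face_logits: torch.Tensor,
                    speech_weight: float = 0.5) -> torch.Tensor:
    """Fuse speech and facial-image classifiers at the decision level.

    Both inputs have shape (batch, num_classes); the return value holds
    the fused class index for each sample.
    """
    # Turn each branch's logits into class-probability scores.
    p_speech = F.softmax(speech_logits, dim=-1)
    p_face = F.softmax(face_logits, dim=-1)
    # Weighted average of the two score vectors, then pick the best class.
    fused = speech_weight * p_speech + (1.0 - speech_weight) * p_face
    return fused.argmax(dim=-1)

# Example: 8 emotion classes (as in RAVDESS), a batch of 4 utterance/face pairs.
speech_logits = torch.randn(4, 8)
face_logits = torch.randn(4, 8)
print(decision_fusion(speech_logits, face_logits, speech_weight=0.4))
```
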
  • Figure 1  Overall structure of the bimodal emotion recognition model

    Figure 2  Log-Mel spectrogram extraction process (see the sketch after this figure list)

    Figure 3  MCNN structure

    Figure 4  RHCNN network structure

    Figure 5  Parallel attention mechanism

    Figure 6  Decision fusion algorithm

    Figure 7  Confusion matrix of the speech model

    Figure 8  Confusion matrix of the facial image model

    Figure 9  Confusion matrix of the bimodal model

    Figure 10  Effectiveness validation of different modules on model performance

    Figure 11  Effect of different data volumes on model performance
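
Figure 2 outlines the log-Mel spectrogram extraction used for the speech modality. As a point of reference only, the following is a minimal sketch of a typical log-Mel pipeline using librosa; the sampling rate, frame length, hop size, and number of Mel bands are illustrative assumptions and need not match the paper's settings.

```python
import librosa
import numpy as np

def log_mel_spectrogram(wav_path: str,
                        sr: int = 16000,
                        n_fft: int = 1024,
                        hop_length: int = 256,
                        n_mels: int = 64) -> np.ndarray:
    """Load a speech file and return its log-Mel spectrogram (n_mels x frames)."""
    # Load the waveform and resample it to the target rate.
    y, sr = librosa.load(wav_path, sr=sr)
    # Short-time Fourier transform followed by the Mel filter bank.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    # Log (dB) compression, the usual CNN input for speech emotion recognition.
    return librosa.power_to_db(mel, ref=np.max)
```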

    Table 1  Per-class emotion recognition results of the speech model (%)

    Class    | RAVDESS (P / R / S / F1)      | RML (P / R / S / F1)          | eNTERFACE’05 (P / R / S / F1)
    Neutral  | 83.22 / 87.94 / 98.84 / 85.52 | –                             | –
    Calm     | 86.94 / 88.35 / 97.94 / 87.64 | –                             | –
    Happy    | 79.86 / 97.46 / 97.20 / 87.79 | 78.12 / 92.59 / 95.62 / 84.75 | 95.66 / 95.94 / 99.12 / 95.80
    Sad      | 80.46 / 78.64 / 97.04 / 79.56 | 91.95 / 78.82 / 98.47 / 84.88 | 94.59 / 90.27 / 99.01 / 92.38
    Angry    | 87.03 / 84.62 / 97.93 / 85.80 | 96.09 / 94.51 / 99.25 / 95.29 | 84.44 / 87.99 / 96.85 / 86.18
    Fear     | 85.40 / 91.06 / 97.65 / 88.14 | 83.43 / 82.51 / 96.80 / 82.97 | 85.15 / 88.68 / 96.90 / 87.36
    Disgust  | 90.66 / 83.61 / 98.41 / 87.00 | 86.50 / 87.37 / 97.07 / 86.93 | 91.83 / 93.68 / 98.30 / 92.75
    Surprise | 94.31 / 82.30 / 99.19 / 87.89 | 93.30 / 94.27 / 98.60 / 93.78 | 92.15 / 85.67 / 98.47 / 88.79
    (– : emotion class not present in the dataset)
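
Tables 1, 3, and 5 report four per-class scores, P, R, S, and F1, for each dataset. Assuming these follow the usual one-vs-rest conventions (precision, recall, specificity, and F1-score), the sketch below shows how such per-class values are obtained from a confusion matrix; it is a generic illustration, not the authors' evaluation code.

```python
import numpy as np

def per_class_metrics(cm: np.ndarray):
    """Per-class precision, recall, specificity and F1 from a confusion matrix.

    cm[i, j] counts samples whose true class is i and predicted class is j;
    each class is scored one-vs-rest against all remaining classes.
    """
    total = cm.sum()
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp   # predicted as this class, but actually another
    fn = cm.sum(axis=1) - tp   # this class, but predicted as another
    tn = total - tp - fp - fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, specificity, f1

# Toy 3-class example (rows: true class, columns: predicted class).
cm = np.array([[50, 3, 2],
               [4, 45, 6],
               [1, 5, 44]])
for name, vals in zip(("P", "R", "S", "F1"), per_class_metrics(cm)):
    print(name, np.round(100 * vals, 2))
```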

    Table 2  Accuracy comparison between existing speech-modality methods and the proposed model (%)

    Compared methods (columns): CNN-Transformer[18], 2D Feature Extraction+SVM[19], CNN-BiLSTM[20], CNN-X[21], 3D-CNN[22], DCNN-GWO[23], Auto-encoder+SVM[24], Multi-label model[25], and the proposed method. Each row lists only the results reported on that dataset, with the proposed method last.

    RAVDESS       82.0   81.94  84.72  82.99  86.20 (proposed)
    RML           74.07  84.72  88.13 (proposed)
    eNTERFACE’05  88.47  90.40  90.54 (proposed)

    Table 3  Per-class emotion recognition results of the facial image model (%)

    Class    | RAVDESS (P / R / S / F1)      | RML (P / R / S / F1)          | eNTERFACE’05 (P / R / S / F1)
    Neutral  | 91.95 / 91.33 / 99.44 / 91.64 | –                             | –
    Calm     | 95.22 / 93.44 / 99.24 / 94.32 | –                             | –
    Happy    | 94.79 / 92.54 / 99.25 / 93.65 | 96.84 / 97.35 / 99.36 / 97.10 | 89.31 / 88.29 / 97.82 / 88.79
    Sad      | 94.04 / 92.21 / 99.10 / 93.11 | 93.22 / 88.24 / 98.71 / 90.66 | 86.62 / 84.21 / 97.57 / 85.40
    Angry    | 92.41 / 93.59 / 98.80 / 92.99 | 91.67 / 95.38 / 98.42 / 93.48 | 87.61 / 89.68 / 97.48 / 88.63
    Fear     | 87.27 / 92.13 / 97.95 / 89.63 | 91.35 / 87.56 / 98.27 / 89.42 | 87.04 / 87.78 / 97.29 / 87.41
    Disgust  | 94.58 / 96.62 / 99.09 / 95.59 | 93.75 / 94.74 / 98.71 / 94.24 | 90.42 / 89.68 / 97.99 / 90.81
    Surprise | 90.39 / 87.89 / 98.66 / 89.12 | 90.31 / 94.15 / 97.96 / 92.19 | 87.01 / 86.75 / 97.49 / 86.88
    (– : emotion class not present in the dataset)

    Table 4  Comparison between classical networks and the proposed method on the facial image modality (%)

    Dataset       Method            A      R      S      F1
    RAVDESS       ResNet-34[26]     88.37  88.64  98.34  88.19
                  ShuffleNetV2[27]  82.03  82.06  97.43  81.60
                  MobileNetV2[28]   84.03  84.46  97.71  83.86
                  Proposed method   92.62  92.47  98.94  92.51
    RML           ResNet-34[26]     89.29  89.41  97.86  89.27
                  ShuffleNetV2[27]  79.55  79.82  95.95  79.24
                  MobileNetV2[28]   86.07  86.33  97.22  86.05
                  Proposed method   92.86  92.90  98.57  92.85
    eNTERFACE’05  ResNet-34[26]     83.30  83.57  96.67  83.28
                  ShuffleNetV2[27]  82.13  82.17  96.43  82.04
                  MobileNetV2[28]   83.50  83.68  96.70  83.49
                  Proposed method   88.04  87.98  97.61  87.99

    Table 5  Per-class emotion recognition results of the bimodal model (%)

    Class    | RAVDESS (P / R / S / F1)        | RML (P / R / S / F1)          | eNTERFACE’05 (P / R / S / F1)
    Neutral  | 98.66 / 96.71 / 99.91 / 97.67   | –                             | –
    Calm     | 100.00 / 98.13 / 100.00 / 99.05 | –                             | –
    Happy    | 97.57 / 96.56 / 99.65 / 97.06   | 96.95 / 98.96 / 99.37 / 97.95 | 96.53 / 96.53 / 99.00 / 96.53
    Sad      | 96.36 / 97.98 / 99.45 / 97.16   | 97.22 / 95.63 / 99.48 / 96.42 | 95.54 / 95.24 / 98.59 / 95.39
    Angry    | 98.10 / 95.38 / 99.70 / 96.72   | 98.91 / 99.45 / 99.79 / 99.18 | 95.18 / 94.83 / 98.41 / 94.96
    Fear     | 93.48 / 96.47 / 98.95 / 94.95   | 95.24 / 93.75 / 99.06 / 94.49 | 92.44 / 94.56 / 99.30 / 93.48
    Disgust  | 96.99 / 98.47 / 99.49 / 97.72   | 98.02 / 95.59 / 99.58 / 97.30 | 93.24 / 93.77 / 99.19 / 93.50
    Surprise | 97.51 / 97.86 / 99.65 / 97.68   | 95.50 / 97.45 / 99.06 / 96.46 | 96.07 / 93.81 / 99.24 / 94.93
    (– : emotion class not present in the dataset)

    Table 6  Accuracy comparison between existing bimodal methods and the proposed method (%)

    Compared methods (columns): the methods of Refs. [29]–[37], followed by the proposed method. Each row lists only the results reported on that dataset, with the proposed method last.

    RAVDESS       86.0   86.7   87.5   87.89  82.99  93.23  97.22 (proposed)
    RML           82.47  96.79  96.96 (proposed)
    eNTERFACE’05  72.27  91.62  88.11  87.2   94.78 (proposed)
  • [1] KUMARAN U, RADHA RAMMOHAN S, NAGARAJAN S M, et al. Fusion of mel and gammatone frequency cepstral coefficients for speech emotion recognition using deep C-RNN[J]. International Journal of Speech Technology, 2021, 24(2): 303–314. doi: 10.1007/s10772-020-09792-x.
    [2] HAN Hu, FAN Yating, and XU Xuefeng. Multi-channel enhanced graph convolutional network for aspect-based sentiment analysis[J]. Journal of Electronics & Information Technology, 2024, 46(3): 1022–1032. doi: 10.11999/JEIT230353.
    [3] CORNEJO J and PEDRINI H. Bimodal emotion recognition based on audio and facial parts using deep convolutional neural networks[C]. Proceedings of the 18th IEEE International Conference On Machine Learning And Applications, Boca Raton, USA, 2019: 111–117. doi: 10.1109/ICMLA.2019.00026.
    [4] O’TOOLE A J, CASTILLO C D, PARDE C J, et al. Face space representations in deep convolutional neural networks[J]. Trends in Cognitive Sciences, 2018, 22(9): 794–809. doi: 10.1016/j.tics.2018.06.006.
    [5] CHEN Qiupu and HUANG Guimin. A novel dual attention-based BLSTM with hybrid features in speech emotion recognition[J]. Engineering Applications of Artificial Intelligence, 2021, 102: 104277. doi: 10.1016/J.ENGAPPAI.2021.104277.
    [6] PAN Bei, HIROTA K, JIA Zhiyang, et al. A review of multimodal emotion recognition from datasets, preprocessing, features, and fusion methods[J]. Neurocomputing, 2023, 561: 126866. doi: 10.1016/j.neucom.2023.126866.
    [7] LECUN Y, BOTTOU L, BENGIO Y, et al. Gradient-based learning applied to document recognition[J]. Proceedings of the IEEE, 1998, 86(11): 2278–2324. doi: 10.1109/5.726791.
    [8] HOCHREITER S and SCHMIDHUBER J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735–1780. doi: 10.1162/neco.1997.9.8.1735.
    [9] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]. Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, USA, 2017: 6000–6010.
    [10] KIM B K, LEE H, ROH J, et al. Hierarchical committee of deep CNNs with exponentially-weighted decision fusion for static facial expression recognition[C]. Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, Seattle, USA, 2015: 427–434. doi: 10.1145/2818346.2830590.
    [11] TZIRAKIS P, TRIGEORGIS G, NICOLAOU M A, et al. End-to-end multimodal emotion recognition using deep neural networks[J]. IEEE Journal of Selected Topics in Signal Processing, 2017, 11(8): 1301–1309. doi: 10.1109/JSTSP.2017.2764438.
    [12] SAHOO S and ROUTRAY A. Emotion recognition from audio-visual data using rule based decision level fusion[C]. Proceedings of 2016 IEEE Students’ Technology Symposium, Kharagpur, India, 2016: 7–12. doi: 10.1109/TechSym.2016.7872646.
    [13] WANG Qilong, WU Banggu, ZHU Pengfei, et al. ECA-Net: Efficient channel attention for deep convolutional neural networks[C]. Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 11534–11542. doi: 10.1109/CVPR42600.2020.01155.
    [14] CHOLLET F. Xception: Deep learning with depthwise separable convolutions[C]. Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, 2017: 1251–1258. doi: 10.1109/CVPR.2017.195.
    [15] LIVINGSTONE S R and RUSSO F A. The Ryerson audio-visual database of emotional speech and song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English[J]. PLoS One, 2018, 13(5): e0196391. doi: 10.1371/journal.pone.0196391.
    [16] WANG Yongjin and GUAN Ling. Recognizing human emotional state from audiovisual signals[J]. IEEE Transactions on Multimedia, 2008, 10(5): 936–946. doi: 10.1109/TMM.2008.927665.
    [17] MARTIN O, KOTSIA I, MACQ B, et al. The eNTERFACE’05 audio-visual emotion database[C]. Proceedings of the 22nd International Conference on Data Engineering Workshops, Atlanta, USA, 2006: 8. doi: 10.1109/ICDEW.2006.145.
    [18] VIJAYAN D M, ARUN A V, GANESHNATH R, et al. Development and analysis of convolutional neural network based accurate speech emotion recognition models[C]. Proceedings of the 19th India Council International Conference, Kochi, India, 2022: 1–6. doi: 10.1109/INDICON56171.2022.10040174.
    [19] AGGARWAL A, SRIVASTAVA A, AGARWAL A, et al. Two-way feature extraction for speech emotion recognition using deep learning[J]. Sensors, 2022, 22(6): 2378. doi: 10.3390/s22062378.
    [20] ZHANG Limin, LI Yang, ZHANG Yueting, et al. A deep learning method using gender-specific features for emotion recognition[J]. Sensors, 2023, 23(3): 1355. doi: 10.3390/s23031355.
    [21] KANANI C S, GILL K S, BEHERA S, et al. Shallow over deep neural networks: A empirical analysis for human emotion classification using audio data[M]. MISRA R, KESSWANI N, RAJARAJAN M, et al. Internet of Things and Connected Technologies. Cham: Springer, 2021: 134–146. doi: 10.1007/978-3-030-76736-5_13.
    [22] FALAHZADEH M R, FARSA E Z, HARIMI A, et al. 3D convolutional neural network for speech emotion recognition with its realization on Intel CPU and NVIDIA GPU[J]. IEEE Access, 2022, 10: 112460–112471. doi: 10.1109/ACCESS.2022.3217226.
    [23] FALAHZADEH M R, FAROKHI F, HARIMI A, et al. Deep convolutional neural network and gray wolf optimization algorithm for speech emotion recognition[J]. Circuits, Systems, and Signal Processing, 2023, 42(1): 449–492. doi: 10.1007/s00034-022-02130-3.
    [24] HARÁR P, BURGET R, and DUTTA M K. Speech emotion recognition with deep learning[C]. Proceedings of the 4th International Conference on Signal Processing and Integrated Networks, Noida, India, 2017: 137–140. doi: 10.1109/SPIN.2017.8049931.
    [25] SLIMI A, HAFAR N, ZRIGUI M, et al. Multiple models fusion for multi-label classification in speech emotion recognition systems[J]. Procedia Computer Science, 2022, 207: 2875–2882. doi: 10.1016/j.procs.2022.09.345.
    [26] HE Kaiming, ZHANG Xiangyu, REN Shaoqing, et al. Deep residual learning for image recognition[C]. Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016: 770–778. doi: 10.1109/CVPR.2016.90.
    [27] MA Ningning, ZHANG Xiangyu, ZHENG Haitao, et al. ShuffleNet V2: Practical guidelines for efficient CNN architecture design[C]. Proceedings of the 15th European Conference on Computer Vision, Munich, Germany, 2018: 116–131. doi: 10.1007/978-3-030-01264-9_8.
    [28] SANDLER M, HOWARD A, ZHU Menglong, et al. MobileNetV2: Inverted residuals and linear bottlenecks[C]. Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 4510–4520. doi: 10.1109/CVPR.2018.00474.
    [29] MIDDYA A I, NAG B, and ROY S. Deep learning based multimodal emotion recognition using model-level fusion of audio–visual modalities[J]. Knowledge-Based Systems, 2022, 244: 108580. doi: 10.1016/j.knosys.2022.108580.
    [30] LUNA-JIMÉNEZ C, KLEINLEIN R, GRIOL D, et al. A proposal for multimodal emotion recognition using aural transformers and action units on RAVDESS dataset[J]. Applied Sciences, 2021, 12(1): 327. doi: 10.3390/app12010327.
    [31] BOUALI Y L, AHMED O B, and MAZOUZI S. Cross-modal learning for audio-visual emotion recognition in acted speech[C]. Proceedings of the 6th International Conference on Advanced Technologies for Signal and Image Processing, Sfax, Tunisia, 2022: 1–6. doi: 10.1109/ATSIP55956.2022.9805959.
    [32] MOCANU B and TAPU R. Audio-video fusion with double attention for multimodal emotion recognition[C]. Proceedings of the 14th Image, Video, and Multidimensional Signal Processing Workshop, Nafplio, Greece, 2022: 1–5. doi: 10.1109/IVMSP54334.2022.9816349.
    [33] WOZNIAK M, SAKOWICZ M, LEDWOSINSKI K, et al. Bimodal emotion recognition based on vocal and facial features[J]. Procedia Computer Science, 2023, 225: 2556–2566. doi: 10.1016/j.procs.2023.10.247.
    [34] PAN Bei, HIROTA K, JIA Zhiyang, et al. Multimodal emotion recognition based on feature selection and extreme learning machine in video clips[J]. Journal of Ambient Intelligence and Humanized Computing, 2023, 14(3): 1903–1917. doi: 10.1007/s12652-021-03407-2.
    [35] TANG Guichen, XIE Yue, LI Ke, et al. Multimodal emotion recognition from facial expression and speech based on feature fusion[J]. Multimedia Tools and Applications, 2023, 82(11): 16359–16373. doi: 10.1007/s11042-022-14185-0.
    [36] CHEN Luefeng, WANG Kuanlin, LI Min, et al. K-means clustering-based kernel canonical correlation analysis for multimodal emotion recognition in human-robot interaction[J]. IEEE Transactions on Industrial Electronics, 2023, 70(1): 1016–1024. doi: 10.1109/TIE.2022.3150097.
    [37] CHEN Guanghui and ZENG Xiaoping. Multi-modal emotion recognition by fusing correlation features of speech-visual[J]. IEEE Signal Processing Letters, 2021, 28: 533–537. doi: 10.1109/LSP.2021.3055755.
Publication history
  • Received date: 2024-02-05
  • Revised date: 2024-11-06
  • Available online: 2024-11-08
