Citation: LIU Jia, ZHANG Yangrui, CHEN Dapeng, MAO Die, LU Guorui. Bimodal Emotion Recognition Method based on Dual-stream Attention and Adversarial Mutual Reconstruction[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT250424

Bimodal Emotion Recognition Method based on Dual-stream Attention and Adversarial Mutual Reconstruction

doi: 10.11999/JEIT250424 cstr: 32379.14.JEIT250424
Funds:  National Natural Science Foundation of China (62473200, 62003169), Jiangsu Province Youth Science and Technology Talent Support Project (JSTJ-2024-195), “Qinglan Project” of Jiangsu Province
  • Received Date: 2025-05-15
  • Accepted Date: 2025-12-12
  • Rev Recd Date: 2025-12-12
  • Available Online: 2025-12-19
  •   Objective  This paper proposes a multimodal emotion recognition method that integrates Electroencephalography (EEG) and speech signals to address the noise sensitivity and individual differences that limit single-modality emotion recognition. Despite progress in emotion recognition research, most methods still suffer from low cross-subject recognition accuracy and significant noise interference. EEG signals in particular show considerable variability in classification performance because of physiological differences across subjects, and speech signals are susceptible to noise and missing data. This work therefore develops a dual-modality method that combines EEG and speech signals to overcome these limitations and improve the stability and generalization of emotion recognition systems.
      Methods  The proposed method employs two independent feature extractors for EEG and speech signals. (1) For EEG, a dual feature extractor combining time-frame-channel joint attention with state-space modeling is designed to capture key temporal and spectral features. (2) For speech, a Bidirectional Long Short-Term Memory (Bi-LSTM) network with a frame-level random masking mechanism is used to improve robustness to missing or noisy speech data. (3) A modality refinement fusion module incorporates gradient reversal and orthogonal projection to optimize feature alignment and discrimination. (4) An adversarial mutual reconstruction mechanism ensures consistent emotion feature reconstruction across subjects in a shared latent space. (An illustrative sketch of the masking and fusion components follows the abstract.)
      Results and Discussions  The method is evaluated on several benchmark datasets, including MAHNOB-HCI, EAV, and SEED. In cross-subject validation on MAHNOB-HCI, the model achieves 81.09% accuracy for Valence and 80.11% for Arousal, outperforming several existing models. Five-fold cross-validation yields 98.14% accuracy for Valence and 98.37% for Arousal, indicating strong generalization and stability. On the EAV dataset, the model reaches 73.29% accuracy, well above the 60.85% of a conventional Convolutional Neural Network (CNN)-based baseline. In single-modality testing on the SEED dataset, it achieves 89.33% accuracy, confirming that the dual attention mechanism and adversarial mutual reconstruction improve cross-subject generalization.
      Conclusions  The proposed dual-stream attention and adversarial mutual reconstruction approach offers a promising solution to the challenges of cross-subject emotion recognition and multimodal fusion in affective computing. It provides a robust way to handle individual differences and noise in multimodal emotion recognition, with potential applications in real-world human-computer interaction systems.
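A minimal, hypothetical PyTorch sketch of two of the mechanisms named in the Methods is given below: frame-level random masking ahead of a Bi-LSTM speech encoder, and a refinement fusion step built from a gradient reversal layer and orthogonal projection. This is not the authors' implementation; the module names, feature dimensions, masking probability, and the subject-discriminator head are illustrative assumptions.

# Illustrative sketch only (assumed, not the authors' released code).
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates and scales gradients in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


def frame_level_random_mask(speech, mask_prob=0.15):
    """Zero out whole frames (time steps) at random to mimic missing or noisy speech.
    speech: (batch, time, feat); mask_prob is an assumed hyperparameter."""
    keep = torch.rand(speech.shape[:2], device=speech.device) > mask_prob
    return speech * keep.unsqueeze(-1)


class SpeechEncoder(nn.Module):
    """Bi-LSTM over (optionally masked) frame-level speech features."""
    def __init__(self, feat_dim=40, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, x):
        if self.training:
            x = frame_level_random_mask(x)
        out, _ = self.lstm(x)      # (batch, time, 2*hidden)
        return out.mean(dim=1)     # temporal average pooling -> (batch, 2*hidden)


class RefinementFusion(nn.Module):
    """Shared projection with an adversarial subject head behind gradient reversal;
    each modality keeps only the component orthogonal to the shared representation."""
    def __init__(self, dim=256, n_subjects=10):
        super().__init__()
        self.shared = nn.Linear(dim, dim)
        self.subject_head = nn.Linear(dim, n_subjects)  # trained adversarially

    @staticmethod
    def orthogonal_residual(x, shared):
        # Remove from x its projection onto the shared vector (per sample).
        coef = (x * shared).sum(-1, keepdim=True) / (shared.pow(2).sum(-1, keepdim=True) + 1e-8)
        return x - coef * shared

    def forward(self, eeg_feat, speech_feat, lambd=1.0):
        # eeg_feat, speech_feat: (batch, dim) embeddings; the EEG encoder is assumed
        # to output the same dimensionality as the speech encoder.
        shared = self.shared(eeg_feat + speech_feat)
        subject_logits = self.subject_head(GradReverse.apply(shared, lambd))
        fused = torch.cat([shared,
                           self.orthogonal_residual(eeg_feat, shared),
                           self.orthogonal_residual(speech_feat, shared)], dim=-1)
        return fused, subject_logits  # fused feeds the emotion classifier

In training, subject_logits would be penalized with a cross-entropy loss against subject labels; the reversed gradients then push the shared representation toward subject invariance, while the fused vector feeds the emotion classifier. The paper's adversarial mutual reconstruction of features across subjects in the shared latent space is not reproduced in this sketch.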