HAN Chuang, HUANG Jingyao, LAN Chaofeng. Facial Expression Recognition Model Based on an Improved YOLO12n[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT250936

Facial Expression Recognition Model Based on an Improved YOLO12n

doi: 10.11999/JEIT250936 cstr: 32379.14.JEIT250936
Funds:  The Program for Young Talents of Basic Research in Universities of Heilongjiang Province (YQJH2024077)
  • Received Date: 2025-09-19
  • Accepted Date: 2026-04-08
  • Rev Recd Date: 2026-04-02
  • Available Online: 2026-04-28
Objective  Facial Expression Recognition (FER) is a key technology in affective computing and intelligent human–computer interaction. In practical scenarios, recognition performance is often degraded by low resolution, complex illumination, partial occlusion, and class imbalance. Although deep learning-based methods have made substantial progress, lightweight models such as You Only Look Once version 12 nano (YOLO12n) still show limited feature-extraction ability and reduced robustness under degraded imaging conditions. To address these limitations, this paper proposes an improved FER model, termed YOLO-FER, designed to enhance feature representation, sharpen the discrimination of similar expressions, and maintain real-time detection performance in low-quality environments.

Methods  Building on YOLO12n, YOLO-FER introduces four targeted improvements. First, a C3k2_star module is constructed by embedding NewStarBlock into the original bottleneck structure; this design enhances high-dimensional nonlinear feature representation and alleviates feature loss during fusion (Fig. 2 and Fig. 3). Second, Multidimensional Collaborative Attention (MCA) is integrated with the A2C2f module to form A2C2f_MCA, which models the channel, height, and width dimensions jointly to capture fine-grained facial features (Fig. 4). Third, a Low Resolution Feature Extractor (LRFE) module is placed at the end of the backbone; it strengthens pixel-level feature representation under low-resolution and low-light conditions through dilated convolution and pixel attention (Fig. 5). Finally, Adaptive Threshold Focal Loss (ATFL) dynamically adjusts the contributions of easy and hard samples, mitigating class imbalance and improving the discrimination of similar expressions. The overall architecture is shown in Fig. 1. Experiments are conducted on the RAF-DB dataset and the Low Light Dataset (LLD).
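The ATFL idea can be illustrated with a minimal sketch: a focal-style loss in which a confidence threshold separates hard samples (which keep the full focal modulation) from easy ones (which are suppressed further). The fixed threshold and the extra down-weighting rule below are illustrative assumptions, not the paper's exact formulation.

```python
import math

def atfl(p_t: float, gamma: float = 2.0, threshold: float = 0.5) -> float:
    """Focal-style loss for the true-class probability p_t.

    Samples with p_t below the threshold are treated as hard and keep the
    full focal modulation (1 - p_t)**gamma; confident samples are
    down-weighted further. The fixed threshold stands in for the adaptive
    update used by ATFL and is an assumption of this sketch.
    """
    modulator = (1.0 - p_t) ** gamma
    if p_t >= threshold:          # easy sample: suppress its contribution
        modulator *= (1.0 - p_t)  # extra down-weighting (illustrative)
    return -modulator * math.log(max(p_t, 1e-12))
```

With this shape, a hard sample (low p_t) dominates the loss while confident predictions contribute almost nothing, which is the mechanism ATFL uses to counter class imbalance.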
Precision (P), recall (R), F1 score, and mAP@0.5 are used as evaluation metrics.

Results and Discussions  Extensive experiments show that YOLO-FER outperforms the baseline YOLO12n and other YOLO-series models. On the RAF-DB dataset (Table 2), YOLO-FER achieves P = 81.8%, R = 81.9%, and mAP@0.5 = 87.6%, a 3.8% gain in mAP@0.5 over the baseline. On the LLD dataset (Table 3), it reaches an mAP@0.5 of 95.9%, a 5.0% improvement, indicating strong robustness under low-light conditions. The ablation studies in Table 2 and Table 3 confirm that each proposed module contributes to the gains: C3k2_star, A2C2f_MCA, LRFE, and ATFL each raise detection accuracy, and their combination performs best with only a slight increase in parameters. The comparison with other YOLO variants in Table 5 further shows that YOLO-FER strikes a favorable balance between accuracy and model complexity. The mAP@0.5 curves in Fig. 8 show that the proposed model maintains consistent gains throughout training. The confusion-matrix analysis in Fig. 9 and Table 4 demonstrates that the MCA module improves the discrimination of similar expressions, such as Angry and Disgust, and reduces misclassification. Grad-CAM visualizations (Fig. 13) indicate that YOLO-FER attends more accurately than the baseline to key facial regions, including the eyes, eyebrows, and mouth. Experiments under degraded conditions (Fig. 14 and Table 13) show that YOLO-FER maintains higher detection performance than YOLO12n with a smaller overall performance drop, confirming its robustness in low-quality scenarios. Although the parameter count rises slightly from 2.5 M to 3.0 M, inference speed remains competitive (Table 7), so the proposed method retains real-time capability.

Conclusions  This paper proposes YOLO-FER, an improved FER model based on YOLO12n.
The model strengthens feature extraction and robustness in low-quality image scenarios. By integrating C3k2_star, MCA, LRFE, and ATFL, YOLO-FER improves recognition performance and generalization ability. Experimental results on the RAF-DB and LLD datasets confirm that the model achieves high detection performance while maintaining efficient inference, providing a practical solution for real-time FER applications in complex environments. Future work will target performance under extremely low-resolution conditions, cross-domain generalization, and micro-expression recognition.
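For reference, the per-class detection metrics used throughout the evaluation (precision, recall, and F1; mAP@0.5 then averages precision over recall levels at an IoU threshold of 0.5) reduce to simple count ratios. A minimal sketch under the usual definitions:

```python
def detection_metrics(tp: int, fp: int, fn: int) -> tuple:
    """Precision, recall, and F1 from per-class detection counts.

    tp: detections matching a ground-truth face in class and IoU
    fp: spurious or misclassified detections
    fn: missed ground-truth faces
    Guards against empty denominators by returning 0.0.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Averaging the area under the precision–recall curve at IoU 0.5 across all expression classes yields the mAP@0.5 figure reported in the tables.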