Citation: ZHANG Siyong, QIU Jiefan, ZHAO Xiangyun, XIAO Kejiang, CHEN Xiaofu, MAO Keji. Depression Screening Method Driven by Global-Local Feature Fusion[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT250035

Depression Screening Method Driven by Global-Local Feature Fusion

doi: 10.11999/JEIT250035 cstr: 32379.14.JEIT250035
Funds:  The National Natural Science Foundation of China (62173158, 62377023), ZJUT School-enterprise Joint Project (KYY-HX-20240328)
  • Received Date: 2025-01-16
  • Rev Recd Date: 2025-07-14
  • Available Online: 2025-07-22
Objective  Depression is a globally prevalent mental disorder that poses a serious threat to the physical and mental health of millions of individuals. Early screening and diagnosis are essential to reducing severe consequences such as self-harm and suicide. However, conventional questionnaire-based screening is limited by its dependence on the reliability of respondents’ answers, the difficulty of balancing efficiency with accuracy, and the uneven distribution of medical resources, so new auxiliary screening approaches are needed. Existing Artificial Intelligence (AI) methods for depression detection based on facial features primarily emphasize global expressions and often overlook subtle local cues such as eye features; their performance also declines when part of the face is obscured, for instance by a mask, and they raise privacy concerns. This study proposes a Global-Local Fusion Axial Network (GLFAN) for depression screening. By jointly extracting global facial and local eye features, the approach improves screening accuracy and robustness under complex conditions. A corresponding dataset is constructed, and experimental evaluations validate the method’s effectiveness. The model is deployed on edge devices to strengthen privacy protection while maintaining screening efficiency, offering a more objective, accurate, efficient, and secure depression screening solution that contributes to mitigating global mental health challenges.

Methods  To address the challenges of accuracy and efficiency in depression screening, this study proposes GLFAN. For long consultation videos with partial occlusions such as masks, data preprocessing is performed with OpenFace 2.0 and facial keypoint algorithms, combined with peak detection, clustering, and centroid search to segment the videos into short sequences that capture dynamic facial changes, thereby enhancing data validity (a sketch of this segmentation step follows below). At the model level, GLFAN adopts a dual-branch parallel architecture that extracts global facial and local eye features simultaneously. The global branch uses MTCNN for facial keypoint detection and enhances feature extraction under occlusion with an inverted bottleneck structure. The local branch detects eye regions with YOLO v7 and extracts eye-movement features with a ResNet-18 network integrated with a convolutional attention module. After dual-branch feature fusion, an integrated convolutional module refines the representation, and classification is performed by an axial attention network.
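To make the segmentation strategy concrete, the sketch below shows one way peak detection, clustering, and centroid search could cut a long consultation video into short, dynamics-rich clips. It is a hypothetical reconstruction, not the authors’ pipeline: the motion score, thresholds, clip length, and the (T, 68, 2) landmark track (assumed to be exported by OpenFace 2.0) are all illustrative assumptions.

```python
import numpy as np
from scipy.signal import find_peaks


def segment_video(landmarks, clip_len=90, min_gap=15):
    """Cut a long landmark track into short clips centred on facial activity.

    landmarks: (T, 68, 2) per-frame facial keypoints, e.g. from OpenFace 2.0.
    Returns a list of (start, end) frame indices.
    """
    # Per-frame motion score: mean landmark displacement between frames.
    motion = np.linalg.norm(np.diff(landmarks, axis=0), axis=2).mean(axis=1)
    # Peak detection: frames with pronounced facial dynamics.
    peaks, _ = find_peaks(motion, height=motion.mean() + motion.std())
    # Greedy 1-D clustering: peaks closer than min_gap frames share a cluster.
    clusters, current = [], []
    for p in peaks:
        if current and p - current[-1] > min_gap:
            clusters.append(current)
            current = []
        current.append(p)
    if current:
        clusters.append(current)
    # Centroid search: centre a fixed-length clip on each cluster centroid.
    half = clip_len // 2
    segments = []
    for c in clusters:
        centre = int(np.mean(c))
        start = max(0, centre - half)
        segments.append((start, min(len(landmarks), start + clip_len)))
    return segments


# Example on a synthetic 5-minute, 30 fps landmark track.
track = np.cumsum(np.random.randn(30 * 300, 68, 2) * 0.1, axis=0)
print(segment_video(track)[:3])
```

The resulting (start, end) windows would then replace the raw long video as model input, the idea being that frames with little facial movement carry less diagnostic signal.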
Results and Discussions  The performance of GLFAN is evaluated through comprehensive, multi-dimensional experiments. On the self-constructed depression dataset, high accuracy is achieved in binary classification, and the non-depression and severe-depression categories are accurately distinguished in four-class classification. Under mask-occluded conditions, a precision of 0.72 and a recall of 0.690 are obtained for depression detection. Although these values are lower than the precision of 0.87 and recall of 0.840 observed under non-occluded conditions, reliable screening performance is maintained. Compared with other advanced methods, GLFAN achieves higher recall and F1 scores. On the public AVEC2013 and AVEC2014 datasets, the model achieves lower Mean Absolute Error (MAE) values and shows advantages in processing both short and long video sequences. Heatmap visualizations indicate that GLFAN dynamically adjusts its attention according to the degree of facial occlusion, demonstrating stronger adaptability than ResNet-50. Edge-device tests further confirm that the average processing delay remains below 17.56 ms per frame and that stable performance is maintained under low-bandwidth conditions.

Conclusions  This study proposes a depression screening approach based on edge vision technology. A lightweight, end-to-end GLFAN is developed to address the limitations of existing screening methods. The model integrates global facial features extracted via MTCNN with local eye-region features captured by YOLO v7, followed by feature fusion and classification using an Axial Transformer module. By emphasizing local eye-region information, GLFAN enhances performance in occluded scenarios such as mask-wearing. Experimental validation on both self-constructed and public datasets demonstrates that GLFAN reduces missed detections and adapts better to short-duration video inputs than existing models. Grad-CAM visualizations further reveal that GLFAN prioritizes eye-region features under occluded conditions and shifts focus to global facial features when the full face is visible, confirming its context-specific adaptability. The model has been deployed on edge devices, offering a lightweight, efficient, and privacy-conscious solution for real-time depression screening.
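As a structural reference for the architecture summarized above, the following PyTorch sketch mirrors the dual-branch design: a global face branch with inverted bottleneck blocks, a local eye branch built on a ResNet-18 trunk, channel-wise fusion through a convolutional module, and a simplified axial-attention classifier. Everything here is an illustrative assumption rather than the released GLFAN: layer widths, the fusion convolution, and the attention block are invented for the sketch, the CBAM-style attention on the local branch is omitted for brevity, and the MTCNN and YOLO v7 detectors are replaced by precomputed face and eye crops.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models


class InvertedBottleneck(nn.Module):
    """ConvNeXt-style block: depthwise conv, expand 4x, project back."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(dim, dim, 7, padding=3, groups=dim),   # depthwise
            nn.BatchNorm2d(dim),
            nn.Conv2d(dim, dim * expansion, 1),              # expand
            nn.GELU(),
            nn.Conv2d(dim * expansion, dim, 1),              # project
        )

    def forward(self, x):
        return x + self.block(x)                             # residual


class AxialAttention(nn.Module):
    """Self-attention along one spatial axis at a time (width, then height)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.row_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                                    # x: (B, C, H, W)
        b, c, h, w = x.shape
        rows = x.permute(0, 2, 3, 1).reshape(b * h, w, c)    # attend along W
        rows, _ = self.row_attn(rows, rows, rows)
        x = rows.reshape(b, h, w, c).permute(0, 3, 1, 2)
        cols = x.permute(0, 3, 2, 1).reshape(b * w, h, c)    # attend along H
        cols, _ = self.col_attn(cols, cols, cols)
        return cols.reshape(b, w, h, c).permute(0, 3, 2, 1)


class GLFANSketch(nn.Module):
    def __init__(self, num_classes=2, dim=256):
        super().__init__()
        # Global branch: conv stem + inverted bottlenecks on the face crop.
        self.global_branch = nn.Sequential(
            nn.Conv2d(3, dim, 4, stride=4),
            InvertedBottleneck(dim),
            InvertedBottleneck(dim),
        )
        # Local branch: ResNet-18 trunk on the eye crop (CBAM omitted here).
        trunk = models.resnet18(weights=None)
        self.local_branch = nn.Sequential(
            *list(trunk.children())[:-2],
            nn.Conv2d(512, dim, 1),
        )
        self.fuse = nn.Conv2d(2 * dim, dim, 1)               # fusion conv module
        self.axial = AxialAttention(dim)
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(dim, num_classes)
        )

    def forward(self, face, eyes):
        g = self.global_branch(face)                         # (B, dim, 56, 56)
        l = self.local_branch(eyes)
        l = F.interpolate(l, size=g.shape[-2:], mode="bilinear",
                          align_corners=False)               # align feature grids
        fused = self.fuse(torch.cat([g, l], dim=1))
        return self.head(self.axial(fused))


model = GLFANSketch()
face = torch.randn(2, 3, 224, 224)                           # global face crop
eyes = torch.randn(2, 3, 64, 224)                            # local eye strip
print(model(face, eyes).shape)                               # torch.Size([2, 2])
```

Splitting attention into row and column passes keeps the cost linear in each axis length rather than quadratic in the number of spatial positions, which is part of what makes an axial design attractive for edge deployment.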
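The Grad-CAM heatmaps referred to in the Conclusions can be reproduced for any convolutional backbone with a small hook-based helper. The version below is a minimal from-scratch sketch of Selvaraju et al.’s technique, not the authors’ visualization code; the stock ResNet-18, target layer, and input shape in the usage lines are placeholders.

```python
import torch
import torch.nn.functional as F
from torchvision import models


def grad_cam(model, layer, x, class_idx):
    """Return a (B, h, w) Grad-CAM heatmap for `class_idx` at `layer`."""
    acts, grads = {}, {}
    h1 = layer.register_forward_hook(lambda m, i, o: acts.update(v=o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))
    try:
        model.zero_grad()
        model(x)[:, class_idx].sum().backward()              # class-score gradient
    finally:
        h1.remove()
        h2.remove()
    weights = grads["v"].mean(dim=(2, 3), keepdim=True)      # GAP over gradients
    cam = F.relu((weights * acts["v"]).sum(dim=1))           # weighted activations
    return cam / cam.amax(dim=(1, 2), keepdim=True).clamp(min=1e-8)


# Usage on a stock ResNet-18 (placeholder for the deployed model).
net = models.resnet18(weights=None).eval()
heatmap = grad_cam(net, net.layer4, torch.randn(1, 3, 224, 224), class_idx=1)
print(heatmap.shape)                                         # torch.Size([1, 7, 7])
```

Upsampling the returned map to the input resolution and overlaying it on the frame yields heatmaps of the kind used above to show attention shifting between the eye region and the full face.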