Integrating Representation Learning and Knowledge Graph Reasoning for Diabetes and Complications Prediction
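Summary: Joint prediction of diabetes and its complications is important for reducing the harm of chronic disease and improving patient prognosis. However, existing prediction methods face data heterogeneity and sparsity, complex entity relations, and difficulty in precisely capturing higher-order associations between diseases and medical concepts, all of which limit prediction accuracy and the ability to identify multiple conditions. To address these problems, this paper proposes a diabetes and complications prediction model based on representation learning and knowledge graph reasoning (REKG-MDP). A medical knowledge graph is constructed by integrating electronic health records with supplementary medical knowledge. On the patient side, basic personal information, examination indicators, and history of present illness are completed; on the disease side, comorbidity information, susceptible populations, common causes, and diagnostic criteria are added, which alleviates data sparsity and heterogeneity. Four relation connectivity patterns (symmetric, antisymmetric, inverse, and compositional) are considered jointly, and a reasoning module that combines a hierarchical attention mechanism with a graph convolutional network is designed to dynamically adjust neighbor-node weights at both the global and local levels, aggregating multi-order neighbor information and capturing higher-order semantic relations. Experimental results on the MIMIC-IV dataset show that the proposed model clearly outperforms existing methods on the joint prediction of diabetes and its complications, with significant improvements in prediction accuracy and multi-condition identification.

Abstract: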
Objective Diabetes mellitus and its complications are recognized as major global health challenges, causing severe morbidity, high healthcare costs, and reduced quality of life. Accurate joint prediction of these conditions is essential for early intervention but is hindered by data heterogeneity, sparsity, and complex inter-entity relationships. To address these challenges, a Representation Learning Enhanced Knowledge Graph-based Multi-Disease Prediction (REKG-MDP) model is proposed. Electronic Health Records (EHRs) are integrated with supplementary medical knowledge to construct a comprehensive Medical Knowledge Graph (MKG), and higher-order semantic reasoning combined with relation-aware representation learning is applied to capture complex dependencies and improve predictive accuracy across multiple diabetes-related conditions.

Methods The REKG-MDP framework consists of three modules. First, an MKG is constructed by integrating structured EHR data from the MIMIC-IV dataset with external disease knowledge. Patient-side features include demographics, laboratory indices, and medical history, whereas disease-side attributes cover comorbidities, susceptible populations, etiological factors, and diagnostic criteria. This integration mitigates data sparsity and enriches semantic representation. Second, a relation-aware embedding module captures four relational patterns: symmetric, antisymmetric, inverse, and compositional. These patterns are used to optimize entity and relation embeddings for semantic reasoning. Third, a Hierarchical Attention-based Graph Convolutional Network (HA-GCN) aggregates multi-hop neighborhood information. Dynamic attention weights capture both local and global dependencies, and a bidirectional mechanism enhances the modeling of patient–disease interactions.
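The four relational patterns named in the Methods can be illustrated with a rotation-based embedding. The sketch below assumes a RotatE-style scoring function, in which each relation is a per-dimension rotation in the complex plane; the abstract does not state which scoring function REKG-MDP actually uses, so the function `rotate_score`, the toy vectors, and the phase values are all illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def rotate_score(head, phase, tail):
    """RotatE-style score (assumed here): 0 for a perfect match, negative otherwise."""
    rotation = np.exp(1j * phase)  # unit-modulus complex number per dimension
    return -np.linalg.norm(head * rotation - tail, ord=1)

# Toy complex embedding for an entity A (hypothetical values).
a = np.array([1 + 1j, 0.5 - 2j, -1 + 0.3j, 2 + 0j])

# Symmetric pattern: a phase of pi applied twice is the identity,
# so r(A, B) and r(B, A) both score as plausible.
sym_phase = np.full(a.shape, np.pi)
b = a * np.exp(1j * sym_phase)
sym_forward = rotate_score(a, sym_phase, b)
sym_backward = rotate_score(b, sym_phase, a)

# Inverse pattern: r2 uses the negated phase of r1, so r1(A, C) implies r2(C, A).
r1 = np.array([0.3, 0.7, 1.1, 1.5])
c = a * np.exp(1j * r1)
inv_backward = rotate_score(c, -r1, a)

# Antisymmetric pattern: with a generic phase, r1(C, A) scores poorly
# even though r1(A, C) holds exactly.
anti_backward = rotate_score(c, r1, a)

# Compositional pattern: r3 is the sum of the phases of r1 and r2,
# so r1(A, C) and r2(C, D) imply r3(A, D).
r2 = np.array([0.2, -0.4, 0.9, -1.2])
d = c * np.exp(1j * r2)
comp_score = rotate_score(a, r1 + r2, d)

print(sym_forward, sym_backward, inv_backward, anti_backward, comp_score)
```

In this toy setting the symmetric, inverse, and compositional checks all score approximately zero (a perfect match), while the reversed antisymmetric triple receives a clearly negative score.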
Results and Discussions Experiments demonstrate that REKG-MDP consistently outperforms four baselines: two machine learning models (DCKD-RF and bSES-AC-RUN-FKNN) and two graph-based models (KGRec and PyRec). Compared with the strongest baseline, REKG-MDP achieves average improvements in P, F1, and NDCG of 19.39%, 19.67%, and 19.39% for single-disease prediction ($ n=1 $); 16.71%, 21.83%, and 23.53% for $ n=3 $; and 22.01%, 20.34%, and 20.88% for $ n=5 $ (Table 4). Ablation studies confirm the contribution of each module. Removing relation-pattern modeling reduces performance metrics by approximately 12%, removing hierarchical attention decreases them by 5–6%, and excluding disease-side knowledge produces the largest decline of up to 20% (Fig. 5). Sensitivity analysis indicates that increasing the embedding dimension from 32 to 128 enhances performance by more than 11%, whereas excessive dimensionality (256) leads to over-smoothing (Fig. 6). Adjusting the $ \beta $ parameter strengthens sample discrimination, improving P, F1, and NDCG by 9.28%, 27.9%, and 8.08%, respectively (Fig. 7).

Conclusions REKG-MDP integrates representation learning with knowledge graph reasoning to enable multi-disease prediction. The main contributions are as follows: (1) integrating heterogeneous EHR data with disease knowledge mitigates data sparsity and enhances semantic representation; (2) modeling diverse relational patterns and applying hierarchical attention improves the capture of higher-order dependencies; and (3) extensive experiments confirm the model’s superiority over state-of-the-art baselines, with ablation and sensitivity analyses validating the contribution of each module. Remaining challenges include managing extremely sparse data and ensuring generalization across broader populations. Future research will extend REKG-MDP to model temporal disease progression and additional chronic conditions.
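The P, F1, and NDCG values at $ n=1 $, $ n=3 $, and $ n=5 $ are top-n ranking metrics. The following is a minimal sketch of how such metrics are commonly computed for one patient, assuming binary relevance (a disease is either in the patient's true diagnosis set or not); the exact definitions used in the paper are not given in this section, and the function names and the example ranking are hypothetical.

```python
import math

def precision_at_n(ranked, relevant, n):
    """Fraction of the top-n predicted diseases that the patient actually has."""
    hits = sum(1 for disease in ranked[:n] if disease in relevant)
    return hits / n

def recall_at_n(ranked, relevant, n):
    """Fraction of the patient's true diseases recovered in the top n."""
    if not relevant:
        return 0.0
    hits = sum(1 for disease in ranked[:n] if disease in relevant)
    return hits / len(relevant)

def f1_at_n(ranked, relevant, n):
    """Harmonic mean of precision@n and recall@n."""
    p = precision_at_n(ranked, relevant, n)
    r = recall_at_n(ranked, relevant, n)
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

def ndcg_at_n(ranked, relevant, n):
    """Binary-relevance NDCG: discounted gain divided by the ideal discounted gain."""
    dcg = sum(1 / math.log2(i + 2)
              for i, disease in enumerate(ranked[:n]) if disease in relevant)
    ideal_hits = min(n, len(relevant))
    idcg = sum(1 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical patient: model ranking (ICD-10 codes) and true diagnoses.
ranked = ["E11", "I10", "N18", "E78.5", "I50"]
relevant = {"E11", "N18", "I50"}

for n in (1, 3, 5):
    print(n, precision_at_n(ranked, relevant, n),
          f1_at_n(ranked, relevant, n), ndcg_at_n(ranked, relevant, n))
```

Under binary relevance, NDCG@1 reduces to P@1, which matches the identical P@1 and NDCG@1 columns in Table 4. F1@1 is bounded by recall whenever a patient has more than one true disease, which is one plausible reason the F1@1 values are much lower than P@1.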
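Table 1 Examples of relation connectivity patterns in the medical knowledge graph

| Relation pattern | Explanation | Medical example |
| --- | --- | --- |
| Symmetric | The relation between two entities is mutual: if A has the relation with B, then B also has it with A | (Diabetes, comorbidity, Hyperlipidemia) |
| Antisymmetric | If A has the relation with B, then B does not have it with A | (Patient, BMI, Obesity) |
| Inverse | Under certain conditions the original relation is reversed: if $ r_1(A,B) $ holds, then $ r_2(B,A) $ holds | (Hyperglycemia, causes, Diabetes) → (Diabetes, risk factor, Hyperglycemia) |
| Compositional | One entity can be connected to another through a chain of relations: if $ r_1(A,B) $ and $ r_2(B,C) $ hold, then $ r_3(A,C) $ can be inferred | (Patient, has, Abnormal examination indicator) + (Disease, diagnostic basis, Abnormal examination indicator) → (Patient, suffers from, Disease) |

Table 2 Knowledge graph statistics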
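
| Data type | Size |
| --- | --- |
| Training set | 2910 |
| Test set | 1942 |
| Number of diseases | 18 |
| Number of patients | 4852 |
| Number of examination indicators | 92 |
| Number of basic personal information types | 18 |
| Number of comorbidities / common causes / susceptible populations | 485 |
| Number of relation types | 18 |
| Number of triples in the knowledge graph | 163118 |

Table 3 Disease information and disease categories used in this paper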
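
| Disease category | ICD-10 code | Disease name |
| --- | --- | --- |
| Metabolic diseases | E11 | Type 2 diabetes |
| | E78.5 | Hyperlipidemia |
| | E11.4 | Diabetic neuropathy |
| | E10.2 & E11.2 | Diabetic chronic kidney disease |
| | E10.65 & E11.65 | Hyperglycemia |
| | E10 | Type 1 diabetes |
| | E78.0 | Hypercholesterolemia |
| | E10.1 & E11.1 | Diabetic ketoacidosis |
| Cardiovascular and cerebrovascular diseases | I10 | Hypertension |
| | I50 | Heart failure |
| | I25.1 | Coronary atherosclerotic heart disease |
| | I21 | Myocardial infarction |
| | I63 | Ischemic stroke |
| | G45 | Transient ischemic attack |
| | I70 | Atherosclerosis |
| Kidney diseases | N18 | Chronic kidney disease |
| Non-alcoholic fatty liver disease | K75.81 | Non-alcoholic steatohepatitis |
| | K76.0 | Fatty liver |

Table 4 Performance comparison between REKG-MDP and the four baseline methods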

| Model | P@1 | P@3 | P@5 | F1@1 | F1@3 | F1@5 | NDCG@1 | NDCG@3 | NDCG@5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| REKG-MDP | 0.9655 (↑19.39%) | 0.8879 (↑16.71%) | 0.8280 (↑22.01%) | 0.4200 (↑19.67%) | 0.7332 (↑21.83%) | 0.8121 (↑20.34%) | 0.9655 (↑19.39%) | 0.9151 (↑23.53%) | 0.8946 (↑20.88%) |
| DCKD-RF | 0.7199 | 0.4192 | 0.3455 | 0.3058 | 0.3569 | 0.3375 | 0.7199 | 0.4651 | 0.4329 |
| bSES-AC-RUN-FKNN | 0.7106 | 0.4670 | 0.3995 | 0.3086 | 0.3972 | 0.3902 | 0.7106 | 0.4855 | 0.4384 |
| KGRec | 0.8087 | 0.7608 | 0.6786 | 0.2910 | 0.4804 | 0.5057 | 0.8087 | 0.6544 | 0.6017 |
| PyRec | 0.7948 | 0.7018 | 0.6537 | 0.3510 | 0.6018 | 0.6748 | 0.7948 | 0.7408 | 0.7401 |
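
The ↑ percentages in Table 4 appear to be relative gains over the strongest baseline for each metric. The sketch below recomputes them from the rounded table entries under that assumption; the helper `relative_gain` is a hypothetical name, and because the inputs are rounded to four decimals, a recomputed value can differ from the reported one in the last digit.

```python
metrics = ["P@1", "P@3", "P@5", "F1@1", "F1@3", "F1@5",
           "NDCG@1", "NDCG@3", "NDCG@5"]

# Values copied from Table 4.
rekg_mdp = [0.9655, 0.8879, 0.8280, 0.4200, 0.7332, 0.8121,
            0.9655, 0.9151, 0.8946]
baselines = {
    "DCKD-RF": [0.7199, 0.4192, 0.3455, 0.3058, 0.3569, 0.3375,
                0.7199, 0.4651, 0.4329],
    "bSES-AC-RUN-FKNN": [0.7106, 0.4670, 0.3995, 0.3086, 0.3972, 0.3902,
                         0.7106, 0.4855, 0.4384],
    "KGRec": [0.8087, 0.7608, 0.6786, 0.2910, 0.4804, 0.5057,
              0.8087, 0.6544, 0.6017],
    "PyRec": [0.7948, 0.7018, 0.6537, 0.3510, 0.6018, 0.6748,
              0.7948, 0.7408, 0.7401],
}

def relative_gain(model_value, baseline_values):
    """Percentage improvement over the best baseline value for one metric."""
    best = max(baseline_values)
    return (model_value - best) / best * 100

gains = {
    metric: relative_gain(rekg_mdp[i], [row[i] for row in baselines.values()])
    for i, metric in enumerate(metrics)
}

for metric, gain in gains.items():
    print(f"{metric}: +{gain:.2f}%")
```

Note that the strongest baseline is not the same model for every column: KGRec leads on the P columns and NDCG@1, while PyRec leads on the F1 columns and on NDCG@3 and NDCG@5.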