
Clinical Disease Risk Assessment System Based on Multi-source Genetic Information

NING Kaida, YU Zhengyang, ZHAO Xin, LI Ziyan, DAI Ju, XIA Li

Citation: NING Kaida, YU Zhengyang, ZHAO Xin, LI Ziyan, DAI Ju, XIA Li. Clinical Disease Risk Assessment System Based on Multi-source Genetic Information[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT251025

doi: 10.11999/JEIT251025 cstr: 32379.14.JEIT251025
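Funds: The Major Key Project of Pengcheng Laboratory (PCL2025AS212-3, PCL2024A02-2), the National Natural Science Foundation of China (12571529), Guangdong Basic and Applied Basic Research Foundation (2024A1515-010699, 2022A1515-011426)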
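Details
    Author biographies:

    NING Kaida: female, assistant researcher; research interests include medical big data modeling and computational medicine.

    YU Zhengyang: male, master's degree; research interests include medical artificial intelligence and bioinformatics.

    ZHAO Xin: male, PhD candidate; research interests include medical image pattern recognition and artificial intelligence.

    LI Ziyan: female, master's student; research interest is genetic data modeling.

    DAI Ju: female, assistant researcher; research interests include medical big data and image pattern recognition.

    XIA Li: male, professor; research interest is multimodal big data modeling.

    Corresponding author:

    XIA Li lcxia@scut.edu.cn

  • CLC number: TN911.7; TP391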

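  • Abstract: Complex diseases arise from the joint action of polygenic inheritance and environmental factors. Their pathogenic mechanisms are highly heterogeneous, and their high prevalence and mortality make them a major public health challenge. Conventional single-disease polygenic risk scores (PRS) aggregate genetic variants for a single trait only and ignore cross-trait genetic correlations, which limits predictive performance. In addition, most methods rely on linear modeling, so they struggle to capture nonlinear interactions among single nucleotide polymorphisms (SNPs) and between SNPs and PRS, and they do not fully exploit the shared genetic information carried by the PRS of multiple diseases. To address these limitations, this paper proposes mtSNPPRS_XGB, a statistical-learning-based multi-source PRS disease prediction model that builds an SNP-PRS fusion architecture and uses XGBoost to capture nonlinear interactions among multi-source genetic features. The model jointly integrates raw SNP data with multi-source PRS information and achieves a mean AUC of 66.70% (95% confidence interval: 66.46%-66.95%) across 18 diseases in the UK Biobank, an improvement of 4.39% over the conventional UniPRS and of 1.04% over an elastic-net-based model. The study offers a new approach to individualized genetic risk prediction for complex diseases.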
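  • Figure 1  Framework of the mtSNPPRS_XGB model

    In the SNP feature extraction stage, SNP loci significantly associated with the target trait are selected from the GWAS Catalog. In the PRS feature extraction stage, polygenic risk scores (PRS) for multiple related traits are computed using the PGS Catalog, so that genetic information shared across traits is integrated. In the classification stage, the SNP features and PRS features are concatenated and an XGBoost model is fitted on the combined features to improve the accuracy of genetic risk prediction.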
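    The two feature extraction stages and the concatenation step described for Figure 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that each PRS is the usual weighted sum of effect-allele dosages, and the function names (`polygenic_risk_score`, `fuse_features`), SNP identifiers, and weights are all hypothetical. The classifier itself is omitted; in the paper the fused feature rows are passed to XGBoost.

```python
from typing import Dict, List


def polygenic_risk_score(dosages: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of effect-allele dosages over the SNPs in one score file.

    SNPs that are absent from the genotype are skipped, i.e. they contribute 0.
    """
    return sum(beta * dosages[snp] for snp, beta in weights.items() if snp in dosages)


def fuse_features(
    dosages: Dict[str, float],
    target_snps: List[str],
    trait_weights: Dict[str, Dict[str, float]],
) -> List[float]:
    """Build one feature row: raw SNP dosages followed by one PRS per trait."""
    # SNP block: dosages of the SNPs selected for the target disease.
    snp_part = [dosages.get(snp, 0.0) for snp in target_snps]
    # PRS block: one score per related trait, in a fixed (sorted) trait order.
    prs_part = [polygenic_risk_score(dosages, trait_weights[t]) for t in sorted(trait_weights)]
    return snp_part + prs_part


# Hypothetical toy data: one individual, three genotyped SNPs, two related traits.
dosages = {"rs1": 2.0, "rs2": 1.0, "rs3": 0.0}
target_snps = ["rs1", "rs3"]
trait_weights = {
    "trait_a": {"rs1": 0.5, "rs2": -0.25},
    "trait_b": {"rs2": 1.0, "rs3": 0.75, "rs4": 0.1},  # rs4 is not genotyped
}

row = fuse_features(dosages, target_snps, trait_weights)
print(row)  # [2.0, 0.0, 0.75, 1.0]
```

    Stacking such rows for all individuals gives the combined feature matrix on which a tree-based model can split on SNP columns and PRS columns alike, which is what lets it represent SNP-PRS interactions.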

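    Figure 2  AUC comparison of the 4 methods across 18 diseases

    Shows the mean AUC and the quartile distribution, reflecting the average performance of each model.

    Figure 3  Prediction performance of the 4 methods on 18 diseases

    Figure 4  Performance comparison of the models on 6 diseases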

    Figure 5  Decision curve analysis (DCA) of the mtSNPPRS_XGB model on 6 diseases
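    For readers unfamiliar with DCA, the quantity plotted on a decision curve is the net benefit at a given risk threshold. The sketch below uses the standard definition, NB = TP/n - (FP/n) * pt/(1 - pt); the function names and the toy labels and probabilities are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def net_benefit(y_true: Sequence[int], y_prob: Sequence[float], threshold: float) -> float:
    """Net benefit of treating everyone whose predicted risk is >= threshold."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    # False positives are penalised by the odds of the threshold probability.
    return tp / n - fp / n * threshold / (1.0 - threshold)


def net_benefit_treat_all(y_true: Sequence[int], threshold: float) -> float:
    """Reference strategy: treat every individual regardless of predicted risk."""
    return net_benefit(y_true, [1.0] * len(y_true), threshold)


# Hypothetical toy cohort of 8 individuals.
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.1, 0.3, 0.7, 0.05]

print(net_benefit(y_true, y_prob, 0.5))    # 0.125
print(net_benefit_treat_all(y_true, 0.5))  # -0.25
```

    A decision curve repeats this calculation over a range of thresholds and compares the model against the treat-all and treat-none (net benefit 0) strategies.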

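    Figure 6  Beeswarm plots of the 15 features with the highest SHAP values in the models for 4 diseases

    Each point represents one sample; the horizontal axis shows the SHAP value and the vertical axis lists the feature names. Point color runs from blue to red as the feature value goes from low to high, so redder points have larger feature values. The more widely the points are spread, the more the feature's effect differs between samples. A positive SHAP value means the feature increases disease risk; a negative value means it decreases disease risk.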
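    To make the sign convention concrete, the sketch below computes exact Shapley values for a tiny hand-written risk function by enumerating feature subsets, with absent features replaced by a baseline value. This is only an illustration of the underlying definition: the toy function `toy_risk`, the baseline, and the feature values are hypothetical, and SHAP values for tree models are normally obtained with a dedicated tree algorithm rather than by brute force.

```python
from itertools import combinations
from math import factorial
from typing import Callable, List, Sequence


def shapley_values(
    model: Callable[[Sequence[float]], float],
    x: Sequence[float],
    baseline: Sequence[float],
) -> List[float]:
    """Exact Shapley values of model(x) relative to model(baseline)."""
    n = len(x)

    def value(subset: frozenset) -> float:
        # Features in the subset take their observed value, the rest the baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain feature i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for s in combinations(others, k):
                s_set = frozenset(s)
                total += weight * (value(s_set | {i}) - value(s_set))
        phi.append(total)
    return phi


def toy_risk(z: Sequence[float]) -> float:
    """Hypothetical score: two additive PRS terms plus an SNP x PRS interaction."""
    snp, prs_a, prs_b = z
    return 0.25 * prs_a + 0.5 * prs_b + 0.5 * snp * prs_a


x = [2.0, 1.0, -1.0]          # observed SNP dosage, PRS A, PRS B
baseline = [0.0, 0.0, 0.0]    # assumed reference individual
phi = shapley_values(toy_risk, x, baseline)
print(phi)  # approximately [0.5, 0.75, -0.5]
```

    The interaction term is split equally between the SNP and PRS A, the low PRS B value receives a negative attribution, and the attributions sum to the difference between the prediction and the baseline prediction.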

    Table 1  Basic information on the 18 diseases

    Disease | Abbreviation | Chinese name | ICD-10 | Positive samples | SNPs
    Coronary Artery Disease | CAD | 冠心病 | I21, I22, I23, I24.1, I25.2 | 39,523 | 1548
    Heart Failure | HF | 心衰 | I50 | 14,411 | 173
    Ischaemic Stroke | ISS | 缺血性卒中 | I63, I64 | 12,165 | 159
    Alzheimer's Disease | AD | 阿尔兹海默病 | F00, G30 | 3,077 | 63
    Parkinson's Disease | PD | 帕金森病 | G20 | 3,145 | 355
    Bipolar Disorder | BD | 双相情感障碍 | F31 | 1,689 | 862
    Breast Cancer | BC | 乳腺癌 | C50 | 12,694 | 884
    Colorectal Carcinoma | CRC | 结直肠癌 | C18, C19, C20 | 7,027 | 44
    Skin Cancer | SKC | 皮肤癌 | C44 | 19,143 | 538
    Rheumatoid Arthritis | RA | 类风湿关节炎 | M05, M06, M08.0 | 8,835 | 1888
    Psoriasis | PSO | 牛皮癣 | L40 | 10,865 | 515
    Systemic Lupus Erythematosus | SLE | 系统性红斑狼疮 | M32 | 650 | 784
    Gout | GO | 痛风 | M10 | 14,220 | 134
    Celiac Disease | CED | 乳糜泻 | K90.0 | 2,455 | 153
    Asthma | AST | 哮喘 | J45, J46 | 49,477 | 1241
    Type 2 Diabetes Mellitus | T2D | 2型糖尿病 | E11 | 29,024 | 2131
    Non-alcoholic Fatty Liver Disease | NAFL | 非酒精性脂肪肝 | K76.0 | 4,931 | 205
    Glaucoma | GLAU | 青光眼 | H40 | 14,143 | 701

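    Table 2  AUC comparison of the methods across disease categories (%)

    Disease category | Method | Mean AUC | 95% Lower CI | 95% Upper CI
    Cardiovascular diseases | mtPRS_XGB | 64.60 | 64.50 | 64.70
    Cardiovascular diseases | mtPRS_ML | 64.83 | 64.76 | 64.90
    Cardiovascular diseases | UniPRS | 58.41 | 58.41 | 58.41
    Cardiovascular diseases | mtSNPPRS_XGB | 64.90 | 64.79 | 64.99
    Autoimmune diseases | mtPRS_XGB | 65.30 | 64.96 | 65.63
    Autoimmune diseases | mtPRS_ML | 66.07 | 65.83 | 66.33
    Autoimmune diseases | UniPRS | 63.14 | 63.14 | 63.14
    Autoimmune diseases | mtSNPPRS_XGB | 68.89 | 68.53 | 69.23
    Cancers | mtPRS_XGB | 61.52 | 61.35 | 61.67
    Cancers | mtPRS_ML | 61.91 | 61.82 | 62.01
    Cancers | UniPRS | 61.43 | 61.42 | 61.44
    Cancers | mtSNPPRS_XGB | 62.77 | 62.60 | 62.94
    Psychiatric disorders | mtPRS_XGB | 62.91 | 62.60 | 63.22
    Psychiatric disorders | mtPRS_ML | 63.67 | 63.50 | 63.81
    Psychiatric disorders | UniPRS | 62.93 | 62.93 | 62.93
    Psychiatric disorders | mtSNPPRS_XGB | 63.51 | 63.12 | 63.97
    Other diseases | mtPRS_XGB | 69.76 | 69.71 | 69.81
    Other diseases | mtPRS_ML | 70.08 | 70.06 | 70.10
    Other diseases | UniPRS | 64.38 | 64.38 | 64.38
    Other diseases | mtSNPPRS_XGB | 70.67 | 70.52 | 70.81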
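    As a reminder of what the AUC and interval columns measure, the sketch below computes the AUC as the probability that a randomly chosen case scores higher than a randomly chosen control, and attaches a percentile bootstrap interval. The page does not state how the reported intervals were obtained, so the bootstrap is an assumption chosen for illustration; the function names and toy data are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (case, control) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # Pairwise comparison is quadratic; fine for a toy example.
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(
    y_true: Sequence[int],
    scores: Sequence[float],
    n_boot: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the AUC (resampling individuals)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats: List[float] = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        # Skip resamples that contain only one class, where AUC is undefined.
        if 0 < sum(ys) < n:
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Hypothetical toy data: 4 cases and 4 controls.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.35, 0.4, 0.3, 0.2, 0.1, 0.7]

point = auc(y_true, scores)
lower, upper = bootstrap_ci(y_true, scores)
print(point)  # 0.9375
print(lower, upper)
```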
Publication history
  • Received: 2025-09-28
  • Revised: 2026-03-09
  • Accepted: 2026-03-09
  • Published online: 2026-03-12
