Cross Modal Hashing of Medical Image Semantic Mining for Large Language Model

LIU Qinghai, WU Qianlin, LUO Jia, TANG Lun, XU Liming

Citation: LIU Qinghai, WU Qianlin, LUO Jia, TANG Lun, XU Liming. Cross Modal Hashing of Medical Image Semantic Mining for Large Language Model[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT250529

doi: 10.11999/JEIT250529 cstr: 32379.14.JEIT250529
Funds: The National Natural Science Foundation of China (62071078), the Science and Technology Research Program of Chongqing Municipal Education Commission (KJQN202400643), and the Sichuan Science and Technology Program (2021YFQ0053)
Article information
    Author biographies:

    LIU Qinghai: male, Ph.D. candidate. His research interests include image processing and multimedia information processing and transmission.

    WU Qianlin: male, M.S. candidate. His research interests include image processing and multimedia information processing and transmission.

    LUO Jia: male, Ph.D., lecturer. His research interests include next-generation wireless communication networks, artificial intelligence, and blockchain.

    TANG Lun: male, professor and Ph.D. supervisor. His research interests include next-generation wireless communication networks, heterogeneous cellular networks, and image processing.

    XU Liming: male, associate professor and M.S. supervisor. His research interests include image processing and multimedia information processing.

    Corresponding author:

    LUO Jia, luojia@cqupt.edu.cn

  • CLC number: TN919.81; TP391

  • Abstract: To address insufficient modeling of the deep semantic relationship between medical images and text, this paper proposes a Large Language Model (LLM)-driven semantic mining hashing method for medical images. First, leveraging the semantic generalization capability of LLMs, three types of prompt templates (image description, diagnostic report summarization, and joint-modality reasoning) are designed for modality data augmentation, enabling deep semantic mining of medical image and diagnostic report data. Second, a structured encoding layer is designed so that image and text features can be matched precisely within a unified embedding space. Third, prompt instruction templates are designed, and the LLM is fine-tuned with a combination of soft and hard prompts to align image and text features. Finally, a Gaussian-Binary Restricted Boltzmann Machine (GBRBM) is introduced for probabilistic hash mapping, which effectively preserves the structural information of the data. Experiments show that, compared with recent classical cross-modal hashing retrieval methods, the proposed method improves mean average retrieval precision by 7.21% and 7.72% on two benchmark datasets, respectively.
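
    This page does not reproduce the authors' actual template wording, so the sketch below only illustrates the three template categories named in the abstract; the template strings, the field names image_tokens and report_text, and the build_prompt helper are all hypothetical.

```python
# Illustrative sketch of the three prompt-template categories from the
# abstract: image description, diagnostic report summarization, and
# joint-modality reasoning. All wording here is hypothetical.
TEMPLATES = {
    # Image-description template: ask the LLM to caption the medical image.
    "image_description": (
        "Describe the key anatomical structures and any abnormal findings "
        "visible in this medical image: {image_tokens}"
    ),
    # Report-summarization template: condense the free-text diagnostic report.
    "report_summary": (
        "Summarize the following diagnostic report into its core findings "
        "and impression: {report_text}"
    ),
    # Joint-modality-reasoning template: reason over image and report together.
    "joint_reasoning": (
        "Given the image features {image_tokens} and the report "
        "'{report_text}', explain whether the visual findings support the "
        "reported diagnosis."
    ),
}

def build_prompt(kind: str, **fields: str) -> str:
    """Fill the selected template with the supplied fields."""
    return TEMPLATES[kind].format(**fields)

if __name__ == "__main__":
    print(build_prompt("report_summary",
                       report_text="Mild cardiomegaly. No pleural effusion."))
```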
  • Figure 1  LSCH hash learning framework

    Figure 2  LSCH network architecture

    Figure 3  Prompt templates

    Figure 4  Comparison of PR curves of different algorithms on the MIMIC dataset

    Figure 5  Comparison of P@K curves of different algorithms on the MIMIC dataset

    Figure 6  Effect of hyperparameters α and β on MAP values on the MIMIC dataset

    Table 1  Summary of terms and notation

    Symbol                  Description
    ${\boldsymbol{Z}}$      Visual feature matrix obtained by passing medical image data ${\boldsymbol{I}}$ through the ViT module
    ${\boldsymbol{X}}$      Visual semantic embedding matrix obtained by passing ${\boldsymbol{Z}}$ through the structured encoding layer
    ${\boldsymbol{Y}}$      Text semantic embedding matrix obtained by passing diagnostic report data ${\boldsymbol{T}}$ through the LLM encoding layer
    ${v^{\boldsymbol{x}}}$  Feature representation of visual semantic embedding vector ${\boldsymbol{x}}$ after the LLM
    ${v^{\boldsymbol{y}}}$  Feature representation of text semantic embedding vector ${\boldsymbol{y}}$ after the LLM
    ${h^{\boldsymbol{x}}}$  Hidden-layer output of the GBRBM, i.e., the hash code of the medical image
    ${h^{\boldsymbol{y}}}$  Hidden-layer output of the GBRBM, i.e., the hash code of the diagnostic report
    g(·)                    ViT network function
    $w$                     Weights between visible layer $v$ and hidden layer $h$
    $\sigma$                Standard deviation associated with visible units $v$
    $b$                     Bias of visible layer $v$
    $a$                     Bias of hidden layer $h$
    $S_{\mathrm{ma}}$       Manifold similarity between instances of different modalities
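
    For context, the GBRBM used for probabilistic hash mapping is conventionally defined by the following energy function (see Melchior et al. [18]). This is the standard formulation written with Table 1's symbols, not a transcription of the paper's own equations; some variants scale the interaction term by $\sigma_i^2$ rather than $\sigma_i$:

    $$E({\boldsymbol{v}},{\boldsymbol{h}}) = \sum\nolimits_i \frac{(v_i - b_i)^2}{2\sigma_i^2} - \sum\nolimits_j a_j h_j - \sum\nolimits_{i,j} \frac{v_i}{\sigma_i} w_{ij} h_j$$

    $$P(h_j = 1 \mid {\boldsymbol{v}}) = \mathrm{sigmoid}\Big(a_j + \sum\nolimits_i \frac{v_i}{\sigma_i} w_{ij}\Big)$$

    Sampling or thresholding these hidden activations yields the binary codes ${h^{\boldsymbol{x}}}$ and ${h^{\boldsymbol{y}}}$.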

    Table 2  Partition of the MIMIC and URBN datasets

    Dataset  Total    Training set  Query set  Retrieval set
    MIMIC    116 282  58 141        10 000     106 282
    URBN     2 261    1 831         226        2 035

    Table 3  MAP results for different hash code lengths on the MIMIC dataset

    Method    I2T                         T2I
              32 bit  64 bit  128 bit    32 bit  64 bit  128 bit
    DCHUC     0.655   0.677   0.683      0.682   0.696   0.705
    BI_CMR    0.661   0.678   0.698      0.718   0.734   0.753
    CPAH      0.640   0.657   0.673      0.696   0.703   0.715
    HCCH      0.647   0.654   0.685      0.635   0.635   0.644
    GCDH      0.652   0.663   0.661      0.664   0.685   0.691
    DCHMT     0.672   0.683   0.691      0.732   0.741   0.742
    DHaPH     0.681   0.674   0.682      0.752   0.765   0.776
    LSCH      0.694   0.693   0.705      0.761   0.784   0.789

    Table 4  MAP results for different hash code lengths on the URBN dataset

    Method    I2T                         T2I
              32 bit  64 bit  128 bit    32 bit  64 bit  128 bit
    DCHUC     0.676   0.674   0.682      0.763   0.772   0.792
    BI_CMR    0.718   0.723   0.740      0.798   0.811   0.823
    CPAH      0.663   0.679   0.664      0.757   0.782   0.786
    HCCH      0.721   0.705   0.723      0.736   0.746   0.755
    GCDH      0.726   0.731   0.736      0.778   0.806   0.817
    DCHMT     0.741   0.748   0.756      0.806   0.825   0.845
    DHaPH     0.751   0.759   0.750      0.812   0.834   0.838
    LSCH      0.762   0.773   0.769      0.842   0.861   0.872
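
    Tables 3 and 4 (and the ablation in Table 6) report MAP scores. This page does not include the authors' evaluation code; the sketch below is one conventional MAP computation for hashing retrieval, assuming Hamming-distance ranking and binary shared-label relevance, the usual conventions in this literature.

```python
import numpy as np

def mean_average_precision(query_codes: np.ndarray,
                           retrieval_codes: np.ndarray,
                           relevance: np.ndarray) -> float:
    """Conventional MAP for hashing retrieval.

    query_codes:     (Q, b) binary hash codes of the queries
    retrieval_codes: (R, b) binary hash codes of the retrieval set
    relevance:       (Q, R) 0/1 matrix, 1 where a pair shares a label
    """
    average_precisions = []
    for q in range(query_codes.shape[0]):
        # Rank the retrieval set by Hamming distance (fewer differing bits first).
        dist = (query_codes[q] != retrieval_codes).sum(axis=1)
        rel = relevance[q][np.argsort(dist)]
        hits = np.flatnonzero(rel)  # 0-based ranks of the relevant items
        if hits.size == 0:
            continue
        # Precision at each relevant rank, averaged over the relevant items.
        precision_at_hits = np.cumsum(rel)[hits] / (hits + 1)
        average_precisions.append(precision_at_hits.mean())
    return float(np.mean(average_precisions)) if average_precisions else 0.0
```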

    Table 5  Optimal parameter combinations for each task on the two benchmark datasets

    Dataset   I2T            T2I
    MIMIC     α=2, β=1       α=5, β=2
    URBN      α=0.5, β=0.8   α=0.6, β=0.1

    Table 6  MAP comparison of LSCH and its variants on the MIMIC dataset

    Method    I2T                         T2I
              32 bit  64 bit  128 bit    32 bit  64 bit  128 bit
    LSCH      0.694   0.693   0.705      0.761   0.784   0.789
    LSCH-1    0.674   0.678   0.682      0.723   0.734   0.743
    LSCH-2    0.681   0.685   0.692      0.743   0.762   0.758
    LSCH-3    0.685   0.691   0.693      0.746   0.766   0.774
    LSCH-4    0.688   0.691   0.698      0.756   0.772   0.783
    LSCH-5    0.675   0.688   0.691      0.732   0.742   0.765
    LSCH-6    0.682   0.685   0.689      0.719   0.718   0.733
  • [1] HUANG S C, PAREEK A, SEYYEDI S, et al. Fusion of medical imaging and electronic health records using deep learning: A systematic review and implementation guidelines[J]. NPJ Digital Medicine, 2020, 3: 136. doi: 10.1038/s41746-020-00341-z.
    [2] HOLSTE G, PARTRIDGE S C, RAHBAR H, et al. End-to-end learning of fused image and non-image features for improved breast cancer classification from MRI[C]. 2021 IEEE/CVF International Conference on Computer Vision, Montreal, Canada, 2021: 3287–3296. doi: 10.1109/ICCVW54120.2021.00368.
    [3] FANG Shichao, HONG Shenda, LI Qing, et al. Cross-modal similar clinical case retrieval using a modular model based on contrastive learning and k-nearest neighbor search[J]. International Journal of Medical Informatics, 2025, 193: 105680. doi: 10.1016/j.ijmedinf.2024.105680.
    [4] ZHANG Yilin. Multi-modal medical image matching based on multi-task learning and semantic-enhanced cross-modal retrieval[J]. Traitement du Signal, 2023, 40(5): 2041–2049. doi: 10.18280/ts.400522.
    [5] ZHU Xiangru, LI Zhixu, WANG Xiaodan, et al. Multi-modal knowledge graph construction and application: A survey[J]. IEEE Transactions on Knowledge and Data Engineering, 2024, 36(2): 715–735. doi: 10.1109/TKDE.2022.3224228.
    [6] XU Liming, ZENG Xianhua, ZHENG Bochuan, et al. Multi-manifold deep discriminative cross-modal hashing for medical image retrieval[J]. IEEE Transactions on Image Processing, 2022, 31: 3371–3385. doi: 10.1109/TIP.2022.3171081.
    [7] FANG Jiansheng, FU Huazhu, and LIU Jiang. Deep triplet hashing network for case-based medical image retrieval[J]. Medical Image Analysis, 2021, 69: 101981. doi: 10.1016/j.media.2021.101981.
    [8] LI Junnan, LI Dongxu, SAVARESE S, et al. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models[C]. 40th International Conference on Machine Learning, Honolulu, USA, 2023: 19730–19742.
    [9] ZHANG Pan, DONG Xiaoyi, WANG Bin, et al. InternLM-XComposer: A vision-language large model for advanced text-image comprehension and composition[EB/OL]. https://doi.org/10.48550/arXiv.2309.15112, 2023.
    [10] ZHU Hongyi, HUANG Jiahong, RUDINAC S, et al. Enhancing interactive image retrieval with query rewriting using large language models and vision language models[C]. 2024 International Conference on Multimedia Retrieval, Phuket, Thailand, 2024: 978–987. doi: 10.1145/3652583.3658032.
    [11] LEE J, YOON W, KIM S, et al. BioBERT: A pre-trained biomedical language representation model for biomedical text mining[J]. Bioinformatics, 2020, 36(4): 1234–1240. doi: 10.1093/bioinformatics/btz682.
    [12] SANDERSON K. GPT-4 is here: What scientists think[J]. Nature, 2023, 615(7954): 773. doi: 10.1038/d41586-023-00816-5.
    [13] JIANG Qingyuan and LI Wujun. Deep cross-modal hashing[C]. 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, 2017: 3270–3278. doi: 10.1109/CVPR.2017.348.
    [14] WU Gengshen, LIN Zijia, HAN Jungong, et al. Unsupervised deep hashing via binary latent factor models for large-scale cross-modal retrieval[C]. 27th International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 2018: 2854–2860. doi: 10.24963/ijcai.2018/396.
    [15] LI Chao, DENG Cheng, LI Ning, et al. Self-supervised adversarial hashing networks for cross-modal retrieval[C]. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 4242–4251. doi: 10.1109/CVPR.2018.00446.
    [16] LI Tieying, YANG Xiaochun, WANG Bin, et al. Bi-CMR: Bidirectional reinforcement guided hashing for effective cross-modal retrieval[C]. 36th AAAI Conference on Artificial Intelligence, 2022: 10275–10282. doi: 10.1609/aaai.v36i9.21268.
    [17] BAO Hangbo, WANG Wenhui, DONG Li, et al. VLMo: Unified vision-language pre-training with mixture-of-modality-experts[C]. 36th International Conference on Neural Information Processing Systems, New Orleans, USA, 2022: 2384.
    [18] MELCHIOR J, WANG Nan, and WISKOTT L. Gaussian-binary restricted boltzmann machines for modeling natural image statistics[J]. PLoS One, 2017, 12(2): e0171015. doi: 10.1371/journal.pone.0171015.
    [19] JOHNSON A E W, POLLARD T J, GREENBAUM N R, et al. MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs[EB/OL]. https://doi.org/10.48550/arXiv.1901.07042, 2023.
    [20] DEMNER-FUSHMAN D, KOHLI M D, ROSENMAN M B, et al. Preparing a collection of radiology examinations for distribution and retrieval[J]. Journal of the American Medical Informatics Association, 2016, 23(2): 304–310. doi: 10.1093/jamia/ocv080.
    [21] SHARMA P, DING Nan, GOODMAN S, et al. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning[C]. 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 2018: 2556–2565. doi: 10.18653/v1/P18-1238.
    [22] DENG Jia, DONG Wei, SOCHER R, et al. ImageNet: A large-scale hierarchical image database[C]. 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, USA, 2009: 248–255. doi: 10.1109/CVPR.2009.5206848.
    [23] GILARDI F, ALIZADEH M, and KUBLI M. ChatGPT outperforms crowd workers for text-annotation tasks[J]. Proceedings of the National Academy of Sciences of the United States of America, 2023, 120(30): e2305016120. doi: 10.1073/pnas.2305016120.
    [24] DEVLIN J, CHANG Mingwei, LEE K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[C]. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, USA, 2019: 4171–4186. doi: 10.18653/v1/N19-1423.
    [25] LESTER B, AL-RFOU R, and CONSTANT N. The power of scale for parameter-efficient prompt tuning[C]. 2021 Conference on Empirical Methods in Natural Language Processing, 2021: 3045–3059. doi: 10.18653/v1/2021.emnlp-main.243.
    [26] LI X L and LIANG P. Prefix-tuning: Optimizing continuous prompts for generation[C]. 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021: 4582–4597. doi: 10.18653/v1/2021.acl-long.353.
    [27] WU Qianlin, TANG Lun, LIU Qinghai, et al. Cross-modal hash retrieval of medical images based on Transformer semantic alignment[J]. Journal of Biomedical Engineering, 2025, 42(1): 156–163. doi: 10.7507/1001-5515.202407034. (in Chinese)
    [28] TU Rongcheng, MAO Xianling, MA Bing, et al. Deep cross-modal hashing with hashing functions and unified hash codes jointly learning[J]. IEEE Transactions on Knowledge and Data Engineering, 2022, 34(2): 560–572. doi: 10.1109/TKDE.2020.2987312.
    [29] XIE De, DENG Cheng, LI Chao, et al. Multi-task consistency-preserving adversarial hashing for cross-modal retrieval[J]. IEEE Transactions on Image Processing, 2020, 29: 3626–3637. doi: 10.1109/TIP.2020.2963957.
    [30] SUN Yuan, REN Zhenwen, HU Peng, et al. Hierarchical consensus hashing for cross-modal retrieval[J]. IEEE Transactions on Multimedia, 2024, 26: 824–836. doi: 10.1109/TMM.2023.3272169.
    [31] BAI Cong, ZENG Chao, MA Qing, et al. Graph convolutional network discrete hashing for cross-modal retrieval[J]. IEEE Transactions on Neural Networks and Learning Systems, 2024, 35(4): 4756–4767. doi: 10.1109/TNNLS.2022.3174970.
    [32] TU Junfeng, LIU Xueliang, LIN Zongxiang, et al. Differentiable cross-modal hashing via multimodal transformers[C]. 30th ACM International Conference on Multimedia, Lisboa, Portugal, 2022: 453–461. doi: 10.1145/3503161.3548187.
    [33] HUO Yadong, QIN Qibing, ZHANG Wenfeng, et al. Deep hierarchy-aware proxy hashing with self-paced learning for cross-modal retrieval[J]. IEEE Transactions on Knowledge and Data Engineering, 2024, 36(11): 5926–5939. doi: 10.1109/TKDE.2024.3401050.
Publication history
  • Received: 2025-06-09
  • Revised: 2025-07-22
  • Available online: 2025-07-30
