Citation: LIU Qinghai, WU Qianlin, LUO Jia, TANG Lun, XU Liming. Cross Modal Hashing of Medical Image Semantic Mining for Large Language Model[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT250529

Cross Modal Hashing of Medical Image Semantic Mining for Large Language Model

doi: 10.11999/JEIT250529 cstr: 32379.14.JEIT250529
Funds:  The National Natural Science Foundation of China (62071078), Science and Technology Research Program of Chongqing Municipal Education Commission (KJQN202400643), Sichuan Science and Technology Program (2021YFQ0053)
  • Received Date: 2025-06-09
  • Rev Recd Date: 2025-07-22
  • Available Online: 2025-07-30
Objective  A novel cross-modal hashing framework driven by Large Language Models (LLMs) is proposed to address the semantic misalignment between medical images and their corresponding textual reports. The objective is to enhance cross-modal semantic representation and improve retrieval accuracy by effectively mining and matching semantic associations between modalities.

Methods  The generative capacity of LLMs is first leveraged to produce high-quality textual descriptions of medical images. These descriptions are integrated with diagnostic reports and structured clinical data through a dual-stream semantic enhancement module, designed to reinforce inter-modality alignment and improve semantic comprehension. A structural similarity-guided hashing scheme then encodes both visual and textual features into a unified Hamming space, ensuring semantic consistency and enabling efficient retrieval. To further strengthen semantic alignment, a prompt-driven attention template fuses image and text features through fine-tuned LLMs. Finally, a contrastive loss function with hard negative mining is employed to improve representation discrimination and retrieval accuracy.

Results and Discussions  Experiments on a multimodal medical dataset compare the proposed method with existing cross-modal hashing baselines. The results indicate that the proposed method significantly outperforms the baseline models in terms of precision and Mean Average Precision (MAP) (Table 3; Table 4). On average, a 7.21% improvement in retrieval accuracy and a 7.72% increase in MAP are achieved across multiple data scales.

Conclusions  The LLM-driven semantic mining and hashing framework effectively aligns medical images with their textual reports, and its consistent gains in retrieval accuracy and MAP over existing cross-modal hashing baselines across multiple data scales confirm the effectiveness of the approach.
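Two components named in the Methods are concrete enough to sketch: the hashing scheme that encodes both modalities into a shared Hamming space, and the contrastive loss with hard negative mining. The following is a minimal, illustrative PyTorch sketch, not the authors' implementation; the identifiers (HashHead, contrastive_loss_hard_neg), the 64-bit code length, and the 0.5 margin are assumptions for illustration only.

    # Minimal sketch: tanh-relaxed hashing heads into a shared Hamming space,
    # plus a contrastive loss with in-batch hard negative mining.
    # All names, dimensions, and the margin are illustrative assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class HashHead(nn.Module):
        """Project modality features to K relaxed hash bits in (-1, 1)."""
        def __init__(self, in_dim: int, n_bits: int = 64):
            super().__init__()
            self.proj = nn.Linear(in_dim, n_bits)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # tanh is a differentiable surrogate for sign(); sign() is
            # applied at retrieval time to obtain binary codes.
            return torch.tanh(self.proj(x))

    def contrastive_loss_hard_neg(img_codes, txt_codes, margin=0.5):
        """Pull matched image-text pairs together; push the hardest in-batch
        negative per anchor beyond the margin, in both retrieval directions."""
        img = F.normalize(img_codes, dim=1)
        txt = F.normalize(txt_codes, dim=1)
        sim = img @ txt.t()              # (B, B) cosine similarity matrix
        pos = sim.diag()                 # matched pairs lie on the diagonal
        diag = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
        sim_neg = sim.masked_fill(diag, float('-inf'))
        hard_i2t = sim_neg.max(dim=1).values  # hardest negative text per image
        hard_t2i = sim_neg.max(dim=0).values  # hardest negative image per text
        return (F.relu(margin + hard_i2t - pos) +
                F.relu(margin + hard_t2i - pos)).mean()

At retrieval time the relaxed codes are binarized with torch.sign(); for codes in {-1, +1}^K the Hamming distance between two codes b1 and b2 reduces to (K - b1·b2)/2, so ranking by inner product of binary codes is equivalent to ranking by Hamming distance.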