MCL-PhishNet: A Multi-Modal Contrastive Learning Network for Phishing URL Detection
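Summary: As phishing attacks grow more complex and dynamic, traditional detection methods struggle against novel attacks because of inflated feature dimensionality, multi-modal mismatch, and insufficient robustness to adversarial samples. This paper proposes a multi-modal contrastive learning framework (MCL-PhishNet) that detects phishing URLs accurately through a hierarchical syntactic encoder, a bidirectional cross-modal attention mechanism, and a curriculum contrastive learning strategy. Multi-scale residual convolutions and a Transformer jointly model the local syntactic patterns and global dependencies of URLs, and 17-dimensional statistical features strengthen robustness to adversarial samples. A dynamic contrastive learning mechanism partitions semantic subspaces through online spectral clustering and, combined with a boundary-margin constraint, optimizes the distribution of the feature space. Experiments show that MCL-PhishNet achieves 99.41% accuracy and a 99.65% F1 score on datasets such as EBUU17 and PhishStorm, significantly outperforming traditional machine learning and deep learning methods. The method provides an end-to-end technical paradigm for detecting dynamic adversarial attacks.
Abstract: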
Objective The growing complexity and rapid evolution of phishing attacks challenge traditional detection methods, which suffer from feature redundancy, multi-modal mismatch, and limited robustness to adversarial samples.

Methods MCL-PhishNet is proposed as a Multi-Modal Contrastive Learning framework that achieves precise phishing URL detection through a hierarchical syntactic encoder, bidirectional cross-modal attention mechanisms, and curriculum contrastive learning strategies. In this framework, multi-scale residual convolutions and Transformers jointly model the local syntactic patterns and global dependency relationships of URLs, whereas a 17-dimensional statistical feature set improves robustness to adversarial samples. The dynamic contrastive learning mechanism optimizes the feature-space distribution through online spectral-clustering-based semantic subspace partitioning and boundary-margin constraints.

Results and Discussions MCL-PhishNet performs consistently across datasets (accuracy of 99.41% on EBUU17, 99.41% on PhishStorm, and 99.30% on Kaggle), which validates its generalization capability. The three datasets differ significantly in sample distribution, attack types, and feature dimensions, yet the method maintains stable high performance, indicating that the multimodal contrastive learning framework adapts well across scenarios. Compared with methods optimized for specific datasets, this approach avoids overfitting to a particular dataset distribution through end-to-end learning and an adaptive feature fusion mechanism.

Conclusions This paper addresses the core challenges in phishing URL detection, namely the difficulty of modeling dynamic syntax patterns, multimodal feature mismatch, and insufficient adversarial robustness, and proposes a multimodal contrastive learning framework, MCL-PhishNet. Through a collaborative mechanism of hierarchical syntax encoding, dynamic semantic distillation, and curriculum optimization, it achieves 99.41% accuracy and a 99.65% F1 score on datasets such as EBUU17 and PhishStorm, improving on existing state-of-the-art methods by 0.27% to 3.76%. Experiments show that the residual convolution-Transformer collaborative architecture effectively captures local variation patterns in URLs (such as the numeric substitution attack in ‘payp41-log1n.com’), and that the bidirectional cross-modal attention mechanism reduces the false detection rate of path-sensitive parameters to 0.07%. However, the proposed framework has relatively high complexity. Although the hierarchical encoding module of MCL-PhishNet (including multi-scale CNNs, Transformers, and gated networks) improves detection accuracy, it also increases the number of model parameters. Moreover, the current model is trained primarily on English-based public datasets, so detection accuracy drops significantly for non-Latin characters (such as Cyrillic domain confusions) and regional phishing strategies (such as ‘fake’ URLs targeting local payment platforms).
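The boundary-margin constraint named in the Methods can be illustrated with the classic pairwise contrastive loss, which pulls same-class embeddings together and pushes different-class embeddings at least a margin apart. The abstract does not state the loss actually used by MCL-PhishNet, so the following is only a minimal sketch of the general technique; the function names, the Euclidean distance, and the default `margin=1.0` are assumptions made for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def margin_contrastive_loss(z1, z2, same_class, margin=1.0):
    """Classic pairwise contrastive loss (illustrative, not the paper's loss).

    Same-class pairs are penalised by their squared distance.
    Different-class pairs are penalised only while they are closer
    than `margin`; once they are farther apart the loss is zero.
    """
    d = euclidean(z1, z2)
    if same_class:
        return d ** 2
    return max(0.0, margin - d) ** 2


# Hypothetical 2-D embeddings of three URLs.
benign_a = [0.0, 0.0]
benign_b = [0.3, 0.4]   # distance 0.5 from benign_a
phish = [3.0, 4.0]      # distance 5.0 from benign_a

pull = margin_contrastive_loss(benign_a, benign_b, same_class=True)
push_far = margin_contrastive_loss(benign_a, phish, same_class=False)
push_near = margin_contrastive_loss(benign_a, benign_b, same_class=False)
```

In this toy setting the far phishing embedding contributes no loss because it already lies outside the margin, whereas a different-class pair only 0.5 apart would still be pushed away.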
A.1 Composition of the 17-dimensional URL statistical features and their innovations
| No. | Category | Feature | Computation | Innovation note |
|---|---|---|---|---|
| 1 | Traditional (8 dimensions) | URL length | $L = \mathrm{len}(\mathrm{url})$ | Existing feature |
| 2 | Traditional | Domain length | $L_{\mathrm{d}} = \mathrm{len}(\mathrm{domain})$ | Existing feature |
| 3 | Traditional | Path depth | $D_{\mathrm{p}} = \mathrm{count}(\text{'/'})$ | Existing feature |
| 4 | Traditional | Number of query parameters | $N_{\mathrm{q}} = \mathrm{count}(\text{'='})$ | Existing feature |
| 5 | Traditional | Number of special characters | $N_{\mathrm{s}} = \mathrm{count}(\{\text{'-'}, \text{'\_'}, \text{'@'}\})$ | Existing feature |
| 6 | Traditional | Digit ratio | $R_{\mathrm{d}} = \mathrm{count}(\mathrm{digits}) / L$ | Existing feature |
| 7 | Traditional | HTTPS flag | $\mathbb{I}_{\mathrm{https}} \in \{0, 1\}$ | Existing feature |
| 8 | Traditional | Number of subdomains | $N_{\mathrm{sub}} = \mathrm{count}(\text{'.'}) - 1$ | Existing feature |
| 9 | Improved (5 dimensions) | Adaptive weighted domain entropy | $H_{\mathrm{d}}^{*} = -\sum_{i=1}^{L_{\mathrm{d}}} w_i \cdot p(c_i) \log p(c_i)$, with $w_i = \mathrm{e}^{-\alpha (i-1) / L_{\mathrm{d}}}$ | Innovation 1: introduces position-decay weights so that prefix characters carry more weight |
| 10 | Improved | Path semantic density | $\rho_{\mathrm{p}} = \dfrac{\sum_{w \in \mathrm{path}} \mathbb{I}(w \in V_{\mathrm{sens}})}{N_{\mathrm{words}}}$, sensitive lexicon: $\{\mathrm{login}, \mathrm{verify}, \mathrm{secure}, \mathrm{account}\}$ | Innovation 2: proportion of sensitive words; $V_{\mathrm{sens}}$ is a predefined sensitive lexicon |
| 11 | Improved | Parameter anomaly | $A_{\mathrm{q}} = (1 / N_{\mathrm{q}}) \sum_{i=1}^{N_{\mathrm{q}}} \mathbb{I}(\mathrm{len}(v_i) > 20)$ | Innovation 3: proportion of long parameter values, used to detect hidden redirect URLs |
| 12 | Improved | Redirect chain depth | $D_{\mathrm{jump}} = \mathrm{count}(\mathrm{redirects})$ | Innovation 4: measured through HEAD requests to identify dynamic redirect attacks |
| 13 | Improved | Certificate trust | $T_{\mathrm{cert}} \in [0, 1]$, based on validity period, issuing authority, and domain match | Innovation 5: SSL certificate validity score computed across multiple dimensions |
| 14 | New (4 dimensions) | Character substitution similarity | $S_{\mathrm{char}} = \max_{b \in \mathcal{B}} \mathrm{sim}(d, b)$, with $\mathrm{sim}(d, b) = 1 - \dfrac{\mathrm{ED}(d, b)}{\max(\lvert d \rvert, \lvert b \rvert)}$ | Innovation 6: designed against obfuscation attacks; uses the edit distance to known brand domains |
| 15 | New | Brand name match | $M_{\mathrm{brand}} = \max_{b \in \mathcal{B}} \dfrac{\mathrm{LCS}(d, b)}{\lvert b \rvert}$, where $\mathcal{B}$ is the Alexa Top 1000 brand domain list | Innovation 7: longest-common-subsequence ratio detects brand impersonation |
| 16 | New | URL segment entropy difference | $\Delta H = \lvert H_{\mathrm{domain}} - H_{\mathrm{path}} \rvert$ | Innovation 8: the entropy gap between domain and path detects random path obfuscation |
| 17 | New | Domain registration age | $T_{\mathrm{age}} = T_{\mathrm{now}} - T_{\mathrm{reg}}$ (days) | Innovation 9: obtained through a WHOIS lookup; domains registered fewer than 7 days ago are treated as high risk |

Table 1 Performance comparison of different methods on the Kaggle URL dataset (%)
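| Method | Accuracy | Precision | F1 score |
|---|---|---|---|
| LR | 58.83 | 99.00 | 41.74 |
| DT | 95.41 | 95.80 | 95.91 |
| RF | 96.77 | 96.73 | 97.12 |
| NB | 88.39 | 94.92 | 88.96 |
| SVM | 71.80 | 96.34 | 65.67 |
| VAE-DNN | 97.45 | 97.02 | 96.54 |
| LR+SVC+DT | 98.12 | 97.31 | 95.89 |
| PDSMV3-DCRNN | 99.05 | 99.02 | 99.00 |
| MCL-PhishNet | 99.30 | 99.28 | 99.65 |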
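Several of the purely lexical features in A.1 (features 9, 10, 11, 14, and 15) can be computed from the URL string alone. The sketch below follows the formulas as listed, under assumptions that the source does not specify: `alpha=1.0`, the natural logarithm, splitting the path on non-alphanumeric characters, and a two-entry `BRANDS` tuple standing in for the Alexa Top 1000 list. All function names are hypothetical, and the label `payp41` is taken from the ‘payp41-log1n.com’ example in the abstract.

```python
import math
import re
from collections import Counter
from urllib.parse import parse_qsl

SENSITIVE_WORDS = {"login", "verify", "secure", "account"}  # V_sens in A.1
BRANDS = ("paypal", "google")  # assumed stand-in for the brand list B


def weighted_domain_entropy(domain, alpha=1.0):
    """Feature 9: -sum_i w_i * p(c_i) * log p(c_i), w_i = exp(-alpha*(i-1)/L_d).

    The sum runs over character positions, as in the listed formula,
    so a repeated character contributes once per occurrence.
    """
    n = len(domain)
    if n == 0:
        return 0.0
    freq = Counter(domain)
    total = 0.0
    for i, ch in enumerate(domain):  # 0-based i equals (i - 1) in the formula
        p = freq[ch] / n
        total -= math.exp(-alpha * i / n) * p * math.log(p)
    return total


def path_semantic_density(path):
    """Feature 10: share of path tokens found in the sensitive lexicon."""
    words = [w for w in re.split(r"[^a-z0-9]+", path.lower()) if w]
    if not words:
        return 0.0
    return sum(w in SENSITIVE_WORDS for w in words) / len(words)


def parameter_anomaly(query, threshold=20):
    """Feature 11: share of query values longer than `threshold` characters."""
    values = [v for _, v in parse_qsl(query, keep_blank_values=True)]
    if not values:
        return 0.0
    return sum(len(v) > threshold for v in values) / len(values)


def edit_distance(a, b):
    """Levenshtein distance ED(a, b) by row-wise dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def lcs_length(a, b):
    """Length of the longest common subsequence LCS(a, b)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def char_similarity(d, brands=BRANDS):
    """Feature 14: max over brands of 1 - ED(d, b) / max(|d|, |b|)."""
    return max(1 - edit_distance(d, b) / max(len(d), len(b)) for b in brands)


def brand_match(d, brands=BRANDS):
    """Feature 15: max over brands of LCS(d, b) / |b|."""
    return max(lcs_length(d, b) / len(b) for b in brands)


# 'payp41' differs from 'paypal' by two substitutions (4 -> a, 1 -> l).
s_char = char_similarity("payp41")
m_brand = brand_match("payp41")
rho_p = path_semantic_density("/secure/login/index.php")
a_q = parameter_anomaly("id=7&next=" + "x" * 30)
```

For the label `payp41`, the edit distance to `paypal` is 2 and the longest common subsequence is `payp`, so both similarity scores evaluate to 2/3 under these assumptions. The network-dependent features (redirect depth, certificate trust, WHOIS age) are omitted because they require live lookups.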