Combine the Pre-trained Model with Bidirectional Gated Recurrent Units and Graph Convolutional Network for Adversarial Word Sense Disambiguation

ZHANG Chunxiang, SUN Ying, GAO Kexin, GAO Xueyao

Citation: ZHANG Chunxiang, SUN Ying, GAO Kexin, GAO Xueyao. Combine the Pre-trained Model with Bidirectional Gated Recurrent Units and Graph Convolutional Network for Adversarial Word Sense Disambiguation[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT250386


doi: 10.11999/JEIT250386 cstr: 32379.14.JEIT250386
Funds: The National Natural Science Foundation of China (61502124, 60903082), China Postdoctoral Science Foundation (2014M560249), Heilongjiang Provincial Natural Science Foundation of China (LH2022F031, LH2022F030, F2015041, F201420)
Article information
    About the authors:

    ZHANG Chunxiang: Male, Professor. His research interests include natural language processing, machine learning, and graphics and image processing

    SUN Ying: Female, Master's student. Her research interest is natural language processing

    GAO Kexin: Male, Master's student. His research interest is natural language processing

    GAO Xueyao: Female, Professor. Her research interests include graphics and image processing, natural language processing, and machine learning

    Corresponding author:

    GAO Xueyao, xueyao_gao@163.com

  • CLC number: TN919.8; TP391.1

  • Abstract: Word Sense Disambiguation (WSD) is a key technology for improving a computer's natural language understanding and is widely used in machine translation, information retrieval, and other fields. To address the shortcomings of existing models in generalization and robustness, this paper proposes a WSD model that combines a pre-trained model with Bidirectional Gated Recurrent Units (BiGRU), Cross-Attention (CA) and a Graph Convolutional Network (GCN), and introduces Adversarial Training (AT) to optimize it. The word forms, parts of speech and semantic categories of the words to the left and right of the ambiguous word are used as disambiguation features and fed into LERT to obtain dynamic word vectors. Cross-attention then fuses the global semantic information of the token sequence and the local semantic information of the CLS sequence extracted by the BiGRU network, producing a more complete sentence-node representation for the disambiguation feature graph. The disambiguation feature graph is passed through the graph convolution to update the feature information between nodes, and an interpolation prediction layer and a semantic classification layer determine the true semantic category of the ambiguous word. The gradient of the input dynamic word vectors is computed to generate subtle continuous perturbations, which are added to the original word-vector matrix to produce adversarial examples. The loss of the fusion network and the loss from adversarial training on these examples are combined to optimize the disambiguation model. Experimental results show that the method not only strengthens the model's ability to handle complex lexical ambiguity, but also effectively improves its robustness and generalization, yielding better disambiguation performance.
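
The pipeline sketched in the abstract (LERT vectors → BiGRU encoding → cross-attention fusion → GCN propagation → classification) can be pictured with a minimal Python sketch. This is not the paper's implementation: the hidden sizes, the use of torch.nn.MultiheadAttention for the cross-attention, the mean-pooled sentence node, the batch-as-graph simplification and the omission of the interpolation prediction layer are assumptions made only for illustration.

    import torch
    import torch.nn as nn

    class FusionEncoder(nn.Module):
        """Illustrative sketch: BiGRU + cross-attention fusion feeding one GCN layer."""
        def __init__(self, dim=768, hidden=256, num_classes=4):
            super().__init__()
            # BiGRU over the LERT token vectors (global semantics of the token sequence).
            self.token_gru = nn.GRU(dim, hidden, batch_first=True, bidirectional=True)
            # BiGRU over the CLS sequence (local semantics).
            self.cls_gru = nn.GRU(dim, hidden, batch_first=True, bidirectional=True)
            # Cross-attention: the CLS-side features query the token-side features.
            self.cross_attn = nn.MultiheadAttention(2 * hidden, num_heads=4, batch_first=True)
            # A single dense GCN layer: normalized adjacency x node features x weight matrix.
            self.gcn_weight = nn.Linear(2 * hidden, 2 * hidden)
            self.classifier = nn.Linear(2 * hidden, num_classes)

        def forward(self, token_vecs, cls_vecs, adj_norm):
            # token_vecs: (B, L, dim) dynamic word vectors from LERT
            # cls_vecs:   (B, S, dim) CLS sequence
            # adj_norm:   (B, B) normalized adjacency; here the batch itself plays the
            #             role of the sentence nodes of the disambiguation feature graph
            g, _ = self.token_gru(token_vecs)        # global token features (B, L, 2*hidden)
            c, _ = self.cls_gru(cls_vecs)            # local CLS features (B, S, 2*hidden)
            fused, _ = self.cross_attn(c, g, g)      # cross-attention fusion
            sent_node = fused.mean(dim=1)            # sentence-node representation (B, 2*hidden)
            h = torch.relu(adj_norm @ self.gcn_weight(sent_node))   # one round of propagation
            return self.classifier(h)                # semantic-class logits

In this reading, the CLS-derived features act as queries over the BiGRU-encoded token sequence, so each sentence node carries both the local and the global view before graph propagation.
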
  • Figure 1  The LBGCA-GCN disambiguation model

    Figure 2  Framework of the adversarial-training-based WSD model, LBGCA-GCN-AT

    Figure 3  Comparison of disambiguation performance with different sequence fusion methods

    Table 1  Description of the adversarial training algorithms

    Algorithm     Description
    Input: the word vector T output by LERT
    FGSM / FGM    For each T:
                  ① Compute the forward loss on T and backpropagate to obtain the gradient $ \nabla_T L(\theta, T, y) $;
                  ② Compute the perturbation r_adv for the input sample T and generate the adversarial example T_adv = T + r_adv;
                  ③ Compute the forward loss on T_adv and accumulate the backpropagated gradient $ \nabla_T L(\theta, T, y) $ onto the gradient from step ①;
                  ④ Restore the vectors output by LERT to their values at step ①;
                  ⑤ Update the parameters with the gradient from step ③.
    PGD / FreeAT  For each T:
                  ① Compute the forward loss on T, backpropagate to obtain the gradient $ \nabla_T L(\theta, T, y) $, and back it up;
                  For each step t:
                  ② Compute the perturbation T_t for the input sample T and generate the adversarial example T_adv = T_t;
                  ③ When t != k: zero the gradients, compute the forward loss on T_adv, and backpropagate to obtain the gradient $ \nabla_T L(\theta, T, y) $;
                  ④ When t == k: restore the gradient from step ①, compute g(T_t) and accumulate it onto step ①;
                  ⑤ Restore the vectors output by LERT to their values at step ①;
                  ⑥ Update the parameters with the gradient from step ④.
    FreeLB        Starting from PGD, FreeLB changes step ④ to: when t == k, restore the gradient from step ①, compute the average gradient $ \nabla_T L(\theta, T, y)/k $ and accumulate it onto step ①.
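
As a complement to the step lists in Table 1, the following sketch mirrors the PGD-style inner loop under stated assumptions: PyTorch; a generic `model` that maps input vectors to logits; illustrative values for the step size alpha, the radius epsilon and the step count K; and whole-tensor L2 norms instead of per-example norms for brevity.

    import torch
    import torch.nn.functional as F

    def pgd_training_step(model, embeddings, labels, optimizer, K=3, alpha=0.3, epsilon=1.0):
        """One parameter update with a K-step perturbation loop on the input vectors."""
        emb0 = embeddings.detach()                    # backup of the LERT output vectors T
        optimizer.zero_grad()

        # Step 1: forward loss on the clean vectors; its gradient stays on the parameters.
        F.cross_entropy(model(emb0), labels).backward()

        r = torch.zeros_like(emb0)                    # accumulated perturbation
        for t in range(1, K + 1):
            emb_adv = (emb0 + r).requires_grad_(True) # adversarial example T_adv = T + r
            adv_loss = F.cross_entropy(model(emb_adv), labels)

            if t < K:
                # Intermediate steps only refine r; parameter gradients stay untouched
                # (Table 1 zeroes and later restores them, which amounts to the same thing).
                g, = torch.autograd.grad(adv_loss, emb_adv)
                r = r + alpha * g / (g.norm() + 1e-12)   # ascent on the perturbation
                if r.norm() > epsilon:                   # project back onto the
                    r = epsilon * r / r.norm()           # L2 ball of radius epsilon
            else:
                # Final step: accumulate the adversarial gradient onto the clean gradient.
                adv_loss.backward()

        optimizer.step()                              # update with clean + adversarial gradients

FGSM and FGM reduce the loop to a single perturbation step, while FreeLB keeps the adversarial loss of every step and averages the resulting gradients over K (the $ \nabla_T L(\theta, T, y)/k $ term in Table 1) before the parameter update.
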

    Table 2  Effect of the adversarial training algorithms on model performance

    Algorithm   SemEval-2007: Task#5 corpus            HealthWSD corpus
                A_mar   F1_mar  P_mar   R_mar          A_mar   F1_mar  P_mar   R_mar
    FGSM        0.7590  0.7068  0.7198  0.7573         0.9636  0.9561  0.9574  0.9575
    FGM         0.7686  0.7174  0.7244  0.7528         0.9633  0.9529  0.9510  0.9581
    PGD         0.7711  0.7150  0.7234  0.7560         0.9674  0.9582  0.9533  0.9605
    FreeAT      0.7616  0.7087  0.7188  0.7548         0.9684  0.9570  0.9567  0.9606
    FreeLB      0.8069  0.7802  0.7804  0.8104         0.9694  0.9591  0.9590  0.9615

    Table 3  Comparison of disambiguation performance with different numbers of perturbation steps

    Algorithm  Steps   SemEval-2007: Task#5 corpus            HealthWSD corpus
                       A_mar   F1_mar  P_mar   R_mar          A_mar   F1_mar  P_mar   R_mar
    PGD        2       0.7514  0.7094  0.7154  0.7566         0.9617  0.9505  0.9510  0.9537
               3       0.7563  0.7018  0.7117  0.7433         0.9670  0.9564  0.9544  0.9613
               4       0.7711  0.7150  0.7234  0.7560         0.9674  0.9582  0.9533  0.9605
    FreeAT     2       0.7568  0.7067  0.7141  0.7571         0.9656  0.9572  0.9572  0.9610
               3       0.7491  0.7066  0.7131  0.7472         0.9630  0.9527  0.9500  0.9600
               4       0.7616  0.7087  0.7188  0.7548         0.9684  0.9570  0.9567  0.9606
    FreeLB     2       0.7635  0.7191  0.7268  0.7659         0.9624  0.9483  0.9472  0.9543
               3       0.8069  0.7802  0.7804  0.8104         0.9694  0.9591  0.9590  0.9615
               4       0.7665  0.7230  0.7284  0.7656         0.9667  0.9527  0.9516  0.9593

    Table 4  Ablation experiments

    Model           SemEval-2007: Task#5 corpus            HealthWSD corpus
                    A_mar   F1_mar  P_mar   R_mar          A_mar   F1_mar  P_mar   R_mar
    LERT            0.7469  0.7059  0.7124  0.7535         0.9338  0.9194  0.9178  0.9244
    LBG             0.7680  0.7313  0.7376  0.7834         0.9467  0.9320  0.9298  0.9396
    LBG-GCN         0.7845  0.7431  0.7427  0.8028         0.9560  0.9509  0.9499  0.9541
    LBGCA-GCN       0.7976  0.7580  0.7629  0.8043         0.9635  0.9535  0.9515  0.9629
    LBGCA-GCN-AT    0.8069  0.7802  0.7804  0.8104         0.9694  0.9591  0.9590  0.9615

    Table 5  Comparison experiments

    Model           SemEval-2007: Task#5 corpus            HealthWSD corpus
                    A_mar   F1_mar  P_mar   R_mar          A_mar   F1_mar  P_mar   R_mar
    BiLSTM          0.6597  0.5532  0.6099  0.5806         0.7390  0.6317  0.6864  0.6451
    TextCNN         0.6606  0.5952  0.6308  0.6196         0.8503  0.7779  0.8470  0.7746
    TextGCN         0.6713  0.6178  0.6347  0.6513         0.8757  0.8237  0.8091  0.8752
    GraphSAGE       0.6587  0.6029  0.6273  0.6104         0.8289  0.7954  0.8273  0.7885
    BERT            0.7408  0.7004  0.7028  0.7579         0.9196  0.8983  0.8997  0.9023
    RoBERTa         0.7429  0.7031  0.7082  0.7402         0.9281  0.9156  0.9140  0.9216
    MacBERT         0.7367  0.6994  0.7052  0.7315         0.9223  0.9113  0.9115  0.9152
    LERT            0.7469  0.7059  0.7124  0.7535         0.9338  0.9194  0.9178  0.9244
    jina            0.7472  0.6935  0.7015  0.7327         0.9588  0.9484  0.9432  0.9605
    MRHA            0.7617  0.6846  0.7247  0.6943         0.9010  0.8078  0.8235  0.8106
    LBGCA-GCN-AT    0.8069  0.7802  0.7804  0.8104         0.9694  0.9591  0.9590  0.9615
  • [1] MENTE R, ALAND S, and CHENDAGE B. Review of word sense disambiguation and its approaches[EB/OL]. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4097221, 2022. doi: 10.2139/ssrn.4097221.
    [2] ABRAHAM A, GUPTA B K, MAURYA A S, et al. Naïve Bayes approach for word sense disambiguation system with a focus on parts-of-speech ambiguity resolution[J]. IEEE Access, 2024, 12: 126668–126678. doi: 10.1109/ACCESS.2024.3453912.
    [3] WANG Yue, LIANG Qiliang, YIN Yaqi, et al. Disambiguate words like composing them: A morphology-informed approach to enhance Chinese word sense disambiguation[C]. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, Bangkok, Thailand, 2024: 15354–15365. doi: 10.18653/v1/2024.acl-long.819.
    [4] LI Linlin, LI Juxing, WANG Hongli, et al. Application of the transformer model algorithm in Chinese word sense disambiguation: A case study in Chinese language[J]. Scientific Reports, 2024, 14(1): 6320. doi: 10.1038/s41598-024-56976-5.
    [5] WAEL T, ELREFAI E, MAKRAM M, et al. Pirates at arabicNLU2024: Enhancing Arabic word sense disambiguation using transformer-based approaches[C]. Proceedings of the Second Arabic Natural Language Processing Conference, Bangkok, Thailand, 2024: 372–376. doi: 10.18653/v1/2024.arabicnlp-1.31.
    [6] MISHRA B K and JAIN S. Word sense disambiguation for Indic language using Bi-LSTM[J]. Multimedia Tools and Applications, 2024, 84(16): 16631–16656. doi: 10.1007/S11042-024-19499-9.
    [7] LYU Meng and MO Shasha. HSRG-WSD: A novel unsupervised Chinese word sense disambiguation method based on heterogeneous sememe-relation graph[C]. Proceedings of the 19th International Conference on Advanced Intelligent Computing Technology and Applications, Zhengzhou, China, 2023: 623–633. doi: 10.1007/978-981-99-4752-2_51.
    [8] PU Xiao, PAPPAS N, HENDERSON J, et al. Integrating weakly supervised word sense disambiguation into neural machine translation[J]. Transactions of the Association for Computational Linguistics, 2018, 6: 635–649. doi: 10.1162/tacl_a_00242.
    [9] PADWAD H, KESWANI G, BISEN W, et al. Leveraging contextual factors for word sense disambiguation in Hindi language[J]. International Journal of Intelligent Systems and Applications in Engineering, 2024, 12(12s): 129–136.
    [10] LI Zhi, YANG Fan, and LUO Yaoru. Context embedding based on Bi-LSTM in semi-supervised biomedical word sense disambiguation[J]. IEEE Access, 2019, 7: 72928–72935. doi: 10.1109/ACCESS.2019.2912584.
    [11] BARBA E, PROCOPIO L, CAMPOLUNGO N, et al. MuLaN: Multilingual label propagation for word sense disambiguation[C]. Proceedings of the 29th International Joint Conference on Artificial Intelligence, Yokohama, Japan, 2021: 3837–3844. doi: 10.24963/ijcai.2020/531.
    [12] JIA Xiaojun, ZHANG Yong, WU Baoyuan, et al. Boosting fast adversarial training with learnable adversarial initialization[J]. IEEE Transactions on Image Processing, 2022, 31: 4417–4430. doi: 10.1109/TIP.2022.3184255.
    [13] RIBEIRO A H, SCHÖN T B, ZACHARIAH D, et al. Efficient optimization algorithms for linear adversarial training[C]. Proceedings of the 28th International Conference on Artificial Intelligence and Statistics, Mai Khao, Thailand, 2025: 1207–1215.
    [14] LI J W, LIANG Renwei, YEH C H, et al. Adversarial robustness overestimation and instability in TRADES[EB/OL]. https://arxiv.org/abs/2410.07675, 2024.
    [15] CHENG Xiwei, FU Kexin, and FARNIA F. Stability and generalization in free adversarial training[EB/OL]. https://arxiv.org/abs/2404.08980, 2024.
    [16] ZHU Chen, CHENG Yu, GAN Zhe, et al. FreeLB: Enhanced adversarial training for natural language understanding[C]. Proceedings of the 8th International Conference on Learning Representations, Xi’an, China, 2020: 11232–11245.
    [17] BAI Tao, LUO Jinqi, ZHAO Jun, et al. Recent advances in adversarial training for adversarial robustness[C]. Proceedings of the 30th International Joint Conference on Artificial Intelligence, Montreal, Canada, 2021: 4312–4321. doi: 10.24963/ijcai.2021/591.
    [18] ZHANG Liwei. Word sense disambiguation model based on Bi-LSTM[C]. Proceedings of the 2022 14th International Conference on Measuring Technology and Mechatronics Automation, Changsha, China, 2022: 848–851. doi: 10.1109/ICMTMA54903.2022.00172.
    [19] KIM Y. Convolutional neural networks for sentence classification[C]. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, 2014: 1746–1751. doi: 10.3115/v1/d14-1181.
    [20] YAO Liang, MAO Chengsheng, and LUO Yuan. Graph convolutional networks for text classification[C]. Proceedings of the 33rd AAAI Conference on Artificial Intelligence, Honolulu, USA, 2019: 7370–7377. doi: 10.1609/aaai.v33i01.33017370.
    [21] HAMILTON W L, YING Z, and LESKOVEC J. Inductive representation learning on large graphs[C]. Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, USA, 2017, 30: 1025–1035.
    [22] CUI Yiming, CHE Wanxiang, LIU Ting, et al. Pre-training with whole word masking for Chinese BERT[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 3504–3514. doi: 10.1109/TASLP.2021.3124365.
    [23] LIU Yinhan, OTT M, GOYAL N, et al. RoBERTa: A robustly optimized BERT pretraining approach[EB/OL]. https://doi.org/10.48550/arXiv.1907.11692, 2019.
    [24] CUI Yiming, CHE Wanxiang, LIU Ting, et al. Revisiting pre-trained models for Chinese natural language processing[C]. Proceedings of the Findings of the Association for Computational Linguistics: EMNLP, 2020: 657–668. doi: 10.48550/arXiv.2004.13922.
    [25] CUI Yiming, CHE Wanxiang, WANG Shijin, et al. LERT: A linguistically-motivated pre-trained language model[EB/OL]. https://ymcui.com/pdf/lert.pdf, 2022.
    [26] STURUA S, MOHR I, AKRAM M K, et al. jina-embeddings-v3: Multilingual embeddings with task LoRA[EB/OL]. https://arxiv.org/abs/2409.10173, 2024.
    [27] ZHANG Chunxiang, ZHANG Yulong, and GAO Xueyao. Multi-channel residual hybrid dilated convolution with attention for word sense disambiguation[J]. Journal of Beijing University of Posts and Telecommunications, 2024, 47(5): 128–134. doi: 10.13190/j.jbupt.2023-179.
Publication history
  • Received: 2025-05-08
  • Revised: 2025-08-28
  • Available online: 2025-09-02
