Author Name Disambiguation Based on Semi-supervised Learning with Graph Convolutional Network
Abstract: To match scholars precisely with their publications, this paper proposes an author name disambiguation method based on semi-supervised learning with a graph convolutional network (GCN). The method uses the pre-trained SciBERT language model to compute a semantic embedding vector for each paper from its title and keywords. The authors and organizations of the papers are used to build the adjacency matrices of a co-author network and a co-organization network over the papers. Pseudo labels are sampled from the co-author network to form positive and negative sample sets. The semantic embeddings, the adjacency matrices, and the positive and negative samples are fed to the GCN for semi-supervised learning, which yields paper node embeddings; these embeddings are then clustered to disambiguate the authors of the papers. Experimental results show that the method outperforms the compared disambiguation methods on the experimental dataset.
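As a minimal sketch of the graph side of the method, the following Python builds a paper-paper co-author adjacency matrix and applies one graph-convolution step. The linking rule (two papers are connected when they share a co-author other than the ambiguous name), the toy author lists, the feature matrix `X`, and the weights `W` are all assumptions made for illustration. The propagation uses the symmetric normalization of Kipf and Welling [21], which may differ in detail from the paper's own equations; a real implementation would use a tensor library rather than nested lists.

```python
import math


def coauthor_adjacency(author_lists, target_name):
    """Paper-paper adjacency: 1.0 when two papers share a co-author other
    than the ambiguous target name (an assumed linking rule)."""
    coauthors = [set(authors) - {target_name} for authors in author_lists]
    n = len(coauthors)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if coauthors[i] & coauthors[j]:
                adj[i][j] = adj[j][i] = 1.0
    return adj


def normalize_adjacency(adj):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2}."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    d_inv_sqrt = [1.0 / math.sqrt(sum(row)) for row in a_hat]
    return [[d_inv_sqrt[i] * a_hat[i][j] * d_inv_sqrt[j] for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def gcn_layer(a_norm, features, weight):
    """One graph-convolution layer with ReLU: relu(A_norm X W)."""
    hidden = matmul(matmul(a_norm, features), weight)
    return [[max(value, 0.0) for value in row] for row in hidden]


# Hypothetical author lists for three papers by the ambiguous name "Ming Li".
papers = [
    ["Ming Li", "A. Chen", "B. Zhou"],
    ["Ming Li", "A. Chen"],
    ["Ming Li", "C. Wu"],
]
A_ca = coauthor_adjacency(papers, "Ming Li")
A_norm = normalize_adjacency(A_ca)

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # stand-in for SciBERT vectors
W = [[1.0, -1.0], [0.5, 2.0]]             # stand-in for learned weights
Z = gcn_layer(A_norm, X, W)
print(A_ca)
print(Z)
```

In this toy graph, papers 0 and 1 share the co-author "A. Chen", so the convolution averages their features and gives them the same output vector, while the unconnected paper 2 keeps its own. This smoothing over linked papers is what makes the resulting node vectors easier to cluster by real author.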
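Table 1. Author name disambiguation algorithm based on semi-supervised learning with a graph convolutional network

Input: the set $ P $ of papers sharing the same author name
Output: the sequence of paper uuids and the corresponding cluster_out list
(1) Parse the paper metadata to obtain the unique identifier uuid, title, keywords, abstract, publication name, author list, and organization list;
(2) Preprocess the data, e.g. Chinese-English conversion and special-character handling;
(3) Feed the list of texts $ d $, each the concatenation of a title and its keywords, to the BERT model and compute the BERT semantic representation vectors $ \boldsymbol{X} $;
(4) Traverse the paper list to build the paper networks $ \mathcal{g}_{ca} $ and $ \mathcal{g}_{ci} $, establishing the co-author and co-organization relations, and obtain the adjacency matrices $ \boldsymbol{A}_{ca} $ and $ \boldsymbol{A}_{ci} $;
(5) Sample pseudo labels from the co-author network $ \mathcal{g}_{ca} $ to obtain the positive and negative sample sets $ \xi_{+} $ and $ \xi_{-} $;
(6) Initialize the model and start GCN training;
(7) for epoch in range(nums_epoch):
(8)     Perform graph convolution according to Eq. (1);
(9)     Perform graph convolution according to Eq. (2);
(10)    Apply the fully connected layer according to Eq. (3);
(11)    Obtain the node vectors $ \boldsymbol{Z} $ according to Eq. (4);
(12)    Compute the loss function according to Eq. (5) and back-propagate the gradients to update the parameters and node vectors;
(13)    Back-propagate to update the parameters;
(14) end for
(15) Cluster the final trained paper node vectors $ \boldsymbol{Z} $ with AgglomerativeClustering().

Table 2. Test set of author names to be disambiguated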
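| Name | Papers | True authors | Name | Papers | True authors |
|---|---|---|---|---|---|
| Tao Huang | 167 | 2 | Yunshan Wang | 46 | 2 |
| Haibo Li | 132 | 3 | Liang Wang | 119 | 6 |
| Ming Li | 312 | 10 | Lin Wang | 56 | 2 |
| Wei Li | 27 | 2 | Gang Xiong | 151 | 2 |
| Jia Liu | 29 | 2 | Jun Yang | 395 | 7 |
| Jie Liu | 30 | 2 | Peng Zhang | 237 | 10 |
| Jing Liu | 277 | 6 | Tao Zhang | 939 | 9 |
| Jun Liu | 228 | 7 | Xu Zhao | 131 | 3 |
| Yun Liu | 169 | 5 | Feng Zhao | 122 | 6 |
| Bin Wang | 201 | 8 | Ming Zhu | 31 | 2 |

Table 3. Comparison of experimental results (%)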
| Name | Proposed method | | | Anonymized graph network embedding [7] | | | Multi-dimensional network embedding [24] | | | Rule-based method | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Pre | Rec | F1 | Pre | Rec | F1 | Pre | Rec | F1 | Pre | Rec | F1 |
| Tao Huang | 98.29 | 97.20 | 97.74 | 87.32 | 82.90 | 85.05 | 90.12 | 86.20 | 88.46 | 73.53 | 29.33 | 41.93 |
| Haibo Li | 87.57 | 93.77 | 90.56 | 86.58 | 87.22 | 86.89 | 80.06 | 78.68 | 79.82 | 42.77 | 58.06 | 49.25 |
| Ming Li | 92.25 | 83.30 | 87.55 | 60.01 | 62.95 | 61.44 | 74.64 | 70.03 | 72.65 | 15.78 | 79.62 | 26.33 |
| Wei Li | 73.37 | 65.61 | 69.27 | 78.70 | 70.37 | 74.30 | 80.14 | 77.35 | 78.24 | 52.16 | 83.07 | 64.08 |
| Jia Liu | 56.82 | 84.03 | 67.8 | 92.59 | 84.03 | 88.11 | 88.23 | 78.65 | 82.78 | 58.62 | 60.01 | 73.91 |
| Jie Liu | 74.29 | 53.61 | 62.28 | 100.0 | 100.0 | 100.0 | 80.12 | 70.30 | 76.24 | 72.20 | 64.26 | 68.0 |
| Jing Liu | 92.63 | 66.91 | 77.69 | 76.96 | 54.26 | 63.65 | 78.11 | 56.47 | 64.94 | 35.47 | 60.16 | 44.63 |
| Jun Liu | 91.45 | 74.57 | 82.15 | 98.12 | 95.07 | 96.61 | 96.42 | 80.20 | 88.23 | 59.72 | 96.63 | 73.81 |
| Yun Liu | 99.33 | 99.72 | 99.52 | 97.30 | 87.35 | 92.06 | 91.84 | 82.42 | 86.21 | 48.39 | 62.36 | 54.49 |
| Bin Wang | 91.13 | 47.09 | 62.09 | 81.16 | 31.51 | 45.39 | 94.39 | 49.20 | 64.69 | 49.38 | 61.51 | 54.78 |
| Yunshan Wang | 85.8 | 81.92 | 83.82 | 93.01 | 90.21 | 91.59 | 87.01 | 84.18 | 86.53 | 51.30 | 60.01 | 57.81 |
| Liang Wang | 82.92 | 76.06 | 79.34 | 50.77 | 57.01 | 53.71 | 62.65 | 60.27 | 61.38 | 20.74 | 66.77 | 31.65 |
| Lin Wang | 62.75 | 82.53 | 71.30 | 63.73 | 86.54 | 73.41 | 66.19 | 88.20 | 76.69 | 64.23 | 90.42 | 75.11 |
| Gang Xiong | 99.00 | 89.21 | 93.85 | 98.43 | 84.30 | 90.82 | 94.33 | 89.21 | 92.30 | 83.70 | 42.35 | 56.24 |
| Jun Yang | 94.42 | 83.46 | 88.60 | 74.35 | 71.84 | 73.08 | 78.81 | 75.25 | 77.59 | 20.37 | 32.61 | 25.07 |
| Peng Zhang | 75.36 | 70.76 | 72.99 | 48.74 | 40.60 | 44.30 | 56.30 | 58.43 | 57.38 | 16.09 | 62.81 | 25.62 |
| Tao Zhang | 99.02 | 89.50 | 94.02 | 80.12 | 73.04 | 76.41 | 88.23 | 86.52 | 87.11 | 42.99 | 29.06 | 34.67 |
| Xu Zhao | 89.97 | 66.54 | 76.50 | 99.22 | 95.54 | 97.35 | 90.67 | 86.15 | 88.46 | 61.38 | 81.91 | 70.18 |
| Feng Zhao | 92.25 | 89 | 90.59 | 86.14 | 78.16 | 81.96 | 83.92 | 80.54 | 82.78 | 27.49 | 62.33 | 38.15 |
| Ming Zhu | 81.57 | 84.90 | 83.20 | 81.57 | 84.90 | 83.20 | 83.12 | 85.29 | 84.22 | 58.29 | 41.63 | 48.57 |
| Average | 86.01 | 78.98 | 81.54 | 81.75 | 75.88 | 77.97 | 82.27 | 76.18 | 78.84 | 47.73 | 59.29 | 48.96 |

Table 4. Comparison of clustering results by component (%)
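| Component | Avg-Pre | Avg-Rec | Avg-F1 |
|---|---|---|---|
| Clustering on the paper text semantic representation vectors | 52.20 | 66.26 | 57.03 |
| Clustering with the GCN computed over co-author/organization relations | 76.49 | 78.38 | 75.76 |
| Combined | 86.01 | 78.98 | 81.54 |

Table 5. Disambiguation results with different text semantic representation models (%)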
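| Model | Avg-Pre | Avg-Rec | Avg-F1 | Text language |
|---|---|---|---|---|
| Word2Vec | 47.73 | 65.24 | 51.22 | Mixed Chinese and English |
| BERT-base-multilingual-uncased | 45.68 | 67.07 | 54.35 | Mixed Chinese and English |
| BERT-wwm-Chinese | 48.07 | 63.88 | 51.66 | Mixed Chinese and English |
| BERT-base-uncased | 46.60 | 71.58 | 55.34 | English |
| SciBERT | 52.20 | 66.26 | 57.03 | English |

Table 6. Disambiguation results for different text content (%)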
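| Text content | Avg-Pre | Avg-Rec | Avg-F1 |
|---|---|---|---|
| Title, keywords | 52.20 | 66.26 | 57.03 |
| Title, keywords, publication name | 49.43 | 67.25 | 55.16 |
| Title, keywords, abstract | 50.75 | 58.87 | 52.98 |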
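The Pre, Rec, and F1 columns can be related with a short calculation. The sketch below assumes that F1 is the harmonic mean of precision and recall and that the "Average" row of Table 3 is the unweighted mean over the 20 names. Both assumptions are checked against the proposed-method values copied from Table 3; the function name `f1_score` is ours.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (Pre, Rec) of the proposed method for the first three names in Table 3.
rows = {
    "Tao Huang": (98.29, 97.20),
    "Haibo Li": (87.57, 93.77),
    "Ming Li": (92.25, 83.30),
}
per_name_f1 = {name: round(f1_score(p, r), 2) for name, (p, r) in rows.items()}

# F1 column of the proposed method for all 20 names in Table 3.
reported_f1 = [
    97.74, 90.56, 87.55, 69.27, 67.8, 62.28, 77.69, 82.15, 99.52, 62.09,
    83.82, 79.34, 71.30, 93.85, 88.60, 72.99, 94.02, 76.50, 90.59, 83.20,
]
macro_f1 = sum(reported_f1) / len(reported_f1)

# Harmonic mean of the averaged Pre and Rec, for comparison.
f1_of_averages = f1_score(86.01, 78.98)

print(per_name_f1)
print(round(macro_f1, 2), round(f1_of_averages, 2))
```

Under these assumptions the first three rows reproduce the reported F1 values (97.74, 90.56, 87.55), and the mean of the 20 per-name F1 values gives 81.54, matching the Average row. Taking the harmonic mean of the averaged precision and recall instead gives about 82.35, so the average F1 in Table 3 appears to be computed per name first and then averaged.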
[1] ORCID. What is ORCID[EB/OL]. https://www.lanl.gov/library/scholarly/orcid.php.
[2] Thomson Reuters Company. What is ResearcherID?[EB/OL]. https://libanswers.lib.xjtlu.edu.cn/faq/240918, 2020.
[3] HAN Hui, GILES L, ZHA Hongyuan, et al. Two supervised learning approaches for name disambiguation in author citations[C]. Proceedings of the 4th ACM/IEEE-CS Joint Conference on Digital Libraries, Tucson, USA, 2004: 296–305. doi: 10.1145/996350.996419.
[4] MALIN B. Unsupervised name disambiguation via social network similarity[C]. Proceedings of the SIAM Workshop on Link Analysis, Counterterrorism, and Security, Newport Beach, USA, 2005: 93–102.
[5] HAN Hui, ZHA Hongyuan, and GILES C L. Name disambiguation in author citations using a K-way spectral clustering method[C]. Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries, Denver, USA, 2005: 334–343. doi: 10.1145/1065385.1065462.
[6] ZHANG Yutao, ZHANG Fanjin, YAO Peiran, et al. Name disambiguation in AMiner: Clustering, maintenance, and human in the loop[C]. The 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 2018: 1002–1011. doi: 10.1145/3219819.3219859.
[7] ZHANG Baichuan and AL HASAN M. Name disambiguation in anonymized graphs using network embedding[C]. The 2017 ACM on Conference on Information and Knowledge Management, Singapore, 2017: 1239–1248. doi: 10.1145/3132847.3132873.
[8] GAI Shan and BAO Zhongyun. Banknote recognition research based on improved deep convolutional neural network[J]. Journal of Electronics & Information Technology, 2019, 41(8): 1992–2000. doi: 10.11999/JEIT181097.
[9] LU Junyan, JIA Hongguang, GAO Fang, et al. Reconstruction of digital surface model of single-view remote sensing image by semantic segmentation network[J]. Journal of Electronics & Information Technology, 2021, 43(4): 974–981. doi: 10.11999/JEIT200031.
[10] SUN Xiao, PENG Xiaoqi, HU Min, et al. Extended multi-modality features and deep learning based microblog short text sentiment analysis[J]. Journal of Electronics & Information Technology, 2017, 39(9): 2048–2055. doi: 10.11999/JEIT160975.
[11] ZHENG Ruigang, CHEN Weifu, and FENG Guocan. A concise survey on graph convolutional networks[J]. Acta Scientiarum Naturalium Universitatis Sunyatseni, 2020, 59(2): 1–14. doi: 10.13471/j.cnki.acta.snus.2020.02.001.
[12] XU Bingbing, CEN Keting, HUANG Junjie, et al. A survey on graph convolutional neural network[J]. Chinese Journal of Computers, 2020, 43(5): 755–780. doi: 10.11897/SP.J.1016.2020.00755.
[13] GE Yao and CHEN Songcan. Graph convolutional network for recommender systems[J]. Journal of Software, 2020, 31(4): 1101–1112. doi: 10.3969/j.issn.1000-9825.2020.04.016.
[14] WANG Xin, LI Ke, NING Chen, et al. Remote sensing image classification method based on deep convolution neural network and multi-kernel learning[J]. Journal of Electronics & Information Technology, 2019, 41(5): 1098–1105. doi: 10.11999/JEIT180628.
[15] HUANG Jian, ERTEKIN S, and GILES C L. Efficient name disambiguation for large-scale databases[C]. 10th European Conference on Principles and Practice of Knowledge Discovery, Berlin, Germany, 2006: 536–544. doi: 10.1007/11871637_53.
[16] YOSHIDA M, IKEDA M, ONO S, et al. Person name disambiguation by bootstrapping[C]. The 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Geneva, Switzerland, 2010: 10–17. doi: 10.1145/1835449.1835454.
[17] ZHU Jia, WU Xingcheng, LIN Xueqin, et al. A novel multiple layers name disambiguation framework for digital libraries using dynamic clustering[J]. Scientometrics, 2018, 114(3): 781–794. doi: 10.1007/s11192-017-2611-8.
[18] FAN Xiaoming, WANG Jianyong, PU Xu, et al. On graph-based name disambiguation[J]. Journal of Data and Information Quality, 2011, 2(2): 10. doi: 10.1145/1891879.1891883.
[19] TANG Jie, FONG A C M, WANG Bo, et al. A unified probabilistic framework for name disambiguation in digital library[J]. IEEE Transactions on Knowledge and Data Engineering, 2012, 24(6): 975–987. doi: 10.1109/TKDE.2011.13.
[20] HERMANSSON L, KEROLA T, JOHANSSON F, et al. Entity disambiguation in anonymized graphs using graph kernels[C]. The 22nd ACM International Conference on Information & Knowledge Management, San Francisco, USA, 2013: 1037–1046. doi: 10.1145/2505515.2505565.
[21] KIPF T N and WELLING M. Semi-supervised classification with graph convolutional networks[J]. arXiv: 1609.02907, 2016.
[22] DEVLIN J, CHANG Mingwei, LEE K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[EB/OL]. https://arxiv.org/pdf/1810.04805.pdf, 2019.
[23] BELTAGY I, LO K, and COHAN A. SciBERT: A pretrained language model for scientific text[C]. The 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Hong Kong, China, 2019: 3615–3620.
[24] XU Jun, SHEN Siqi, LI Dongsheng, et al. A network-embedding based method for author disambiguation[C]. The 27th ACM International Conference on Information and Knowledge Management, Torino, Italy, 2018: 1735–1738. doi: 10.1145/3269206.3269272.