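Author Name Disambiguation Based on Semi-supervised Learning with Graph Convolutional Network

Xiaoguang SHENG; Ying WANG; Li QIAN; Ying WANG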

doi:10.11999/JEIT200905

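Xiaoguang SHENG¹, Ying WANG², Li QIAN²,³, Ying WANG¹

1. School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
2. National Science Library, Chinese Academy of Sciences, Beijing 100190, China
3. Department of Library, Information and Archives Management, University of Chinese Academy of Sciences, Beijing 100190, China

Funds: The National Natural Science Foundation of China (61702038), The National Social Science Foundation of China (15CTQ006)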

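Author biographies:
Xiaoguang SHENG: male, born in 1989, Ph.D. candidate. Research interests: educational data mining and artificial intelligence.

Ying WANG: female, born in 1982, associate research librarian. Research interests: knowledge organization and knowledge mining.

Li QIAN: male, born in 1981, research librarian. Research interests: big data and machine intelligence.

Ying WANG: female, born in 1969, professor. Research interests: digital signal processing and educational data mining.

Corresponding author:
Xiaoguang SHENG, shengxiaoguang@ucas.ac.cn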

¹⁾ https://api.fanyi.baidu.com/
CLC number: TP391.1
Publication history
- Received: 2020-10-23
- Revised: 2021-09-23
- Accepted: 2021-11-04
- Published online: 2021-11-10
- Issue published: 2021-12-21

Abstract

Abstract: To address the problem of accurately matching scholars with their publications, this paper proposes an author name disambiguation method based on semi-supervised learning with a Graph Convolutional Network (GCN). The method applies the SciBERT pre-trained language model to the title and keywords of each paper to compute a semantic representation vector for the corresponding paper node. The author and organization information of the papers is used to build the adjacency matrices of the paper co-author network and the co-organization network, and pseudo labels are collected from the co-author network to form a positive sample set and a negative sample set. The semantic vectors, adjacency matrices, and positive and negative samples are fed to the GCN for semi-supervised learning. The resulting paper node embeddings are then clustered to disambiguate same-name authors. Experimental results show that the method outperforms the other disambiguation methods compared on the experimental dataset.

Keywords:
- Name disambiguation
- Graph Convolutional Network (GCN)
- BERT language model
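The abstract describes building a co-author adjacency matrix and collecting pseudo labels from it, but this section does not spell out the exact rules. The following is a minimal sketch under stated assumptions: two papers are linked when they share at least one co-author other than the ambiguous name, linked pairs are treated as positive samples, and unlinked pairs as negative samples. The function names `build_coauthor_adjacency` and `sample_pseudo_labels`, the toy paper ids, and the co-author names are all hypothetical.

```python
from itertools import combinations


def build_coauthor_adjacency(papers, target_name):
    """Return (ids, adj) where adj[i][j] = 1 if papers i and j share
    at least one co-author other than the ambiguous target name.

    `papers` maps a paper id to its author list (assumed input format).
    """
    ids = sorted(papers)
    coauthors = [set(papers[pid]) - {target_name} for pid in ids]
    n = len(ids)
    adj = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        if coauthors[i] & coauthors[j]:
            adj[i][j] = adj[j][i] = 1
    return ids, adj


def sample_pseudo_labels(adj):
    """Assumed rule (not taken from the paper): linked pairs are positive
    samples, unlinked pairs are negative samples."""
    n = len(adj)
    pairs = list(combinations(range(n), 2))
    positives = [(i, j) for i, j in pairs if adj[i][j]]
    negatives = [(i, j) for i, j in pairs if not adj[i][j]]
    return positives, negatives


# Toy data: three papers that all list the ambiguous name "Ming Li".
papers = {
    "p1": ["Ming Li", "Alice Chen"],
    "p2": ["Ming Li", "Alice Chen", "Bob Zhou"],
    "p3": ["Ming Li", "Carol Xu"],
}
ids, A_ca = build_coauthor_adjacency(papers, "Ming Li")
pos, neg = sample_pseudo_labels(A_ca)
print(ids, A_ca, pos, neg)
```

A co-organization adjacency matrix could be built the same way from organization lists. In practice the negative set would likely need subsampling, since unlinked pairs greatly outnumber linked ones and many of them may still belong to the same person.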

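Figure 1 Research framework

Figure 2 Semantic representation of papers based on the BERT pre-trained model

Figure 3 Paper co-author network and co-organization network

Figure 4 Performance comparison of weight combinations

Figure 5 Precision comparison under different values of the weight $\beta_{1}$

Figure 6 Comparison results for the balancing parameter $\mathrm{lam}$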

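Table 1 Author name disambiguation algorithm based on semi-supervised learning with a graph convolutional network

Input: set $P$ of papers by same-name authors
Output: sequence of paper uuids and the corresponding cluster_out list
(1) Parse the paper metadata to obtain the unique identifier uuid, title, keywords, abstract, publication name, author list, and organization list;
(2) Preprocess the data, e.g. Chinese-English conversion and special-character handling;
(3) Feed the list of texts $d$, each the concatenation of a title and its keywords, to the BERT model and compute the BERT semantic representation vectors $\boldsymbol{X}$;
(4) Traverse the paper list to build the paper networks $\mathcal{g}_{\mathrm{ca}}$ and $\mathcal{g}_{\mathrm{ci}}$ from the co-author and co-organization relations, and obtain the adjacency matrices $\boldsymbol{A}_{\mathrm{ca}}$ and $\boldsymbol{A}_{\mathrm{ci}}$;
(5) Collect pseudo labels from the paper co-author network $\mathcal{g}_{\mathrm{ca}}$ to obtain the positive and negative sample sets $\xi_{+}$ and $\xi_{-}$;
(6) Initialize the model and start GCN training;
(7) for epoch in range(nums_epoch):
(8)     Perform graph convolution according to Eq. (1);
(9)     Perform graph convolution according to Eq. (2);
(10)     Perform the fully connected layer computation according to Eq. (3);
(11)     Obtain the node vectors $\boldsymbol{Z}$ according to Eq. (4);
(12)     Compute the loss function according to Eq. (5);
(13)     Back-propagate the gradients to update the parameters and node vectors;
(14) end for
(15) Cluster the final trained paper node vectors $\boldsymbol{Z}$ with AgglomerativeClustering()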
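Steps (8) and (9) of the algorithm refer to Eqs. (1) and (2), which are not reproduced in this section. As a rough illustration only, the sketch below uses the standard propagation rule of Kipf and Welling [21], $H' = \sigma(\hat{D}^{-1/2}(A + I)\hat{D}^{-1/2} H W)$, which may differ from the paper's exact formulation. The toy graph `A`, the random features `X` (standing in for the SciBERT vectors), and the weights `W0` and `W1` are hypothetical. The fully connected layer, the loss of Eq. (5), and training are omitted.

```python
import numpy as np


def normalize_adjacency(adj):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} with self-loops."""
    a_tilde = adj + np.eye(adj.shape[0])
    deg = a_tilde.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Scale row i by d_i^{-1/2} and column j by d_j^{-1/2}.
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_forward(a_hat, x, w0, w1):
    """Two graph-convolution layers: ReLU after the first, linear second."""
    h = np.maximum(a_hat @ x @ w0, 0.0)
    return a_hat @ h @ w1


rng = np.random.default_rng(0)
# Toy 3-paper path graph: paper 0 - paper 1 - paper 2.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = rng.normal(size=(3, 8))    # stand-in for semantic vectors
W0 = rng.normal(size=(8, 4))   # untrained, random weights
W1 = rng.normal(size=(4, 2))

A_hat = normalize_adjacency(A)
Z = gcn_forward(A_hat, X, W0, W1)
print(Z.shape)  # (3, 2)
```

In a full pipeline, the rows of `Z` after training would be the node vectors passed to the clustering step (15).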


Table 2 Test set of author names to be disambiguated

Name	Papers	True authors	Name	Papers	True authors
Tao Huang	167	2	Yunshan Wang	46	2
Haibo Li	132	3	Liang Wang	119	6
Ming Li	312	10	Lin Wang	56	2
Wei Li	27	2	Gang Xiong	151	2
Jia Liu	29	2	Jun Yang	395	7
Jie Liu	30	2	Peng Zhang	237	10
Jing Liu	277	6	Tao Zhang	939	9
Jun Liu	228	7	Xu Zhao	131	3
Yun Liu	169	5	Feng Zhao	122	6
Bin Wang	201	8	Ming Zhu	31	2


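Table 3 Comparison of experimental results (%)

Name	Proposed method			Anonymized graph network embedding [7]			Multi-dimensional network embedding [24]			Rule-based method
Name	Pre	Rec	F1	Pre	Rec	F1	Pre	Rec	F1	Pre	Rec	F1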
Tao Huang	98.29	97.20	97.74	87.32	82.90	85.05	90.12	86.20	88.46	73.53	29.33	41.93
Haibo Li	87.57	93.77	90.56	86.58	87.22	86.89	80.06	78.68	79.82	42.77	58.06	49.25
Ming Li	92.25	83.30	87.55	60.01	62.95	61.44	74.64	70.03	72.65	15.78	79.62	26.33
Wei Li	73.37	65.61	69.27	78.70	70.37	74.30	80.14	77.35	78.24	52.16	83.07	64.08
Jia Liu	56.82	84.03	67.80	92.59	84.03	88.11	88.23	78.65	82.78	58.62	60.01	73.91
Jie Liu	74.29	53.61	62.28	100.00	100.00	100.00	80.12	70.30	76.24	72.20	64.26	68.00
Jing Liu	92.63	66.91	77.69	76.96	54.26	63.65	78.11	56.47	64.94	35.47	60.16	44.63
Jun Liu	91.45	74.57	82.15	98.12	95.07	96.61	96.42	80.20	88.23	59.72	96.63	73.81
Yun Liu	99.33	99.72	99.52	97.30	87.35	92.06	91.84	82.42	86.21	48.39	62.36	54.49
Bin Wang	91.13	47.09	62.09	81.16	31.51	45.39	94.39	49.20	64.69	49.38	61.51	54.78
Yunshan Wang	85.80	81.92	83.82	93.01	90.21	91.59	87.01	84.18	86.53	51.30	60.01	57.81
Liang Wang	82.92	76.06	79.34	50.77	57.01	53.71	62.65	60.27	61.38	20.74	66.77	31.65
Lin Wang	62.75	82.53	71.30	63.73	86.54	73.41	66.19	88.20	76.69	64.23	90.42	75.11
Gang Xiong	99.00	89.21	93.85	98.43	84.30	90.82	94.33	89.21	92.30	83.70	42.35	56.24
Jun Yang	94.42	83.46	88.60	74.35	71.84	73.08	78.81	75.25	77.59	20.37	32.61	25.07
Peng Zhang	75.36	70.76	72.99	48.74	40.60	44.30	56.30	58.43	57.38	16.09	62.81	25.62
Tao Zhang	99.02	89.50	94.02	80.12	73.04	76.41	88.23	86.52	87.11	42.99	29.06	34.67
Xu Zhao	89.97	66.54	76.50	99.22	95.54	97.35	90.67	86.15	88.46	61.38	81.91	70.18
Feng Zhao	92.25	89.00	90.59	86.14	78.16	81.96	83.92	80.54	82.78	27.49	62.33	38.15
Ming Zhu	81.57	84.90	83.20	81.57	84.90	83.20	83.12	85.29	84.22	58.29	41.63	48.57
Average	86.01	78.98	81.54	81.75	75.88	77.97	82.27	76.18	78.84	47.73	59.29	48.96
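The relationship between the Pre, Rec, and F1 columns can be checked with a short calculation. The sketch below assumes the usual definition F1 = 2PR/(P + R). This section does not state how precision and recall themselves are counted (for example, over paper pairs), so that part is left out. The helper name `f1_score` is hypothetical. The numbers are copied from the "Proposed method" columns of the table above.

```python
from statistics import mean


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Row "Tao Huang", proposed method: Pre = 98.29, Rec = 97.20.
tao_huang_f1 = f1_score(98.29, 97.20)

# Per-name F1 values of the proposed method, in table order.
proposed_f1 = [
    97.74, 90.56, 87.55, 69.27, 67.80, 62.28, 77.69, 82.15, 99.52, 62.09,
    83.82, 79.34, 71.30, 93.85, 88.60, 72.99, 94.02, 76.50, 90.59, 83.20,
]
macro_f1 = mean(proposed_f1)

# F1 computed from the averaged precision and recall, for comparison.
f1_of_averages = f1_score(86.01, 78.98)

print(round(tao_huang_f1, 2), round(macro_f1, 2), round(f1_of_averages, 2))
```

The mean of the per-name F1 values reproduces the 81.54 in the "Average" row, whereas the F1 of the averaged precision and recall comes out higher. The averages therefore appear to be plain per-name means of each column.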


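Table 4 Comparison of clustering results by component (%)

	Avg-Pre	Avg-Rec	Avg-F1
Clustering on paper text semantic representation vectors	52.20	66.26	57.03
Clustering with GCN over co-author/organization relations	76.49	78.38	75.76
Combined	86.01	78.98	81.54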


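Table 5 Comparison of disambiguation results with different text semantic representation models (%)

Model	Avg-Pre	Avg-Rec	Avg-F1	Text language
Word2Vec	47.73	65.24	51.22	Mixed Chinese and English
BERT-base-multilingual-uncased	45.68	67.07	54.35	Mixed Chinese and English
BERT-wwm-Chinese	48.07	63.88	51.66	Mixed Chinese and English
BERT-base-uncased	46.60	71.58	55.34	English
SciBERT	52.20	66.26	57.03	English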


Table 6 Comparison of disambiguation results for different text content (%)

	Avg-Pre	Avg-Rec	Avg-F1
Title, keywords	52.20	66.26	57.03
Title, keywords, publication name	49.43	67.25	55.16
Title, keywords, abstract	50.75	58.87	52.98


References (24)

[1]	ORCID. What is ORCID[EB/OL]. https://www.lanl.gov/library/scholarly/orcid.php.
[2]	Thomson Reuters Company. What is ResearcherID?[EB/OL]. https://libanswers.lib.xjtlu.edu.cn/faq/240918, 2020.
[3]	HAN Hui, GILES L, ZHA Hongyuan, et al. Two supervised learning approaches for name disambiguation in author citations[C]. Proceedings of the 4th ACM/IEEE-CS Joint Conference on Digital Libraries, Tucson, USA, 2004: 296–305. doi: 10.1145/996350.996419
[4]	MALIN B. Unsupervised name disambiguation via social network similarity[C]. Proceedings of the SIAM Workshop on Link Analysis, Counterterrorism, and Security, Newport Beach, USA, 2005: 93–102.
[5]	HAN Hui, ZHA Hongyuan, and GILES C L. Name disambiguation in author citations using a K-way spectral clustering method[C]. Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries, Denver, USA, 2005: 334–343. doi: 10.1145/1065385.1065462.
[6]	ZHANG Yutao, ZHANG Fanjin, YAO Peiran, et al. Name disambiguation in AMiner: Clustering, maintenance, and human in the loop[C]. The Twenty-Fourth ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 2018: 1002–1011. doi: 10.1145/3219819.3219859.
[7]	ZHANG Baichuan and AL HASAN M. Name disambiguation in anonymized graphs using network embedding[C]. The 2017 ACM on Conference on Information and Knowledge Management, Singapore, 2017: 1239–1248. doi: 10.1145/3132847.3132873.
[8]	GAI Shan and BAO Zhongyun. Banknote recognition research based on improved deep convolutional neural network[J]. Journal of Electronics & Information Technology, 2019, 41(8): 1992–2000. doi: 10.11999/JEIT181097
[9]	LU Junyan, JIA Hongguang, GAO Fang, et al. Reconstruction of digital surface model of single-view remote sensing image by semantic segmentation network[J]. Journal of Electronics & Information Technology, 2021, 43(4): 974–981. doi: 10.11999/JEIT200031
[10]	SUN Xiao, PENG Xiaoqi, HU Min, et al. Extended multi-modality features and deep learning based microblog short text sentiment analysis[J]. Journal of Electronics & Information Technology, 2017, 39(9): 2048–2055. doi: 10.11999/JEIT160975
[11]	ZHENG Ruigang, CHEN Weifu, and FENG Guocan. A concise survey on graph convolutional networks[J]. Acta Scientiarum Naturalium Universitatis Sunyatseni, 2020, 59(2): 1–14. doi: 10.13471/j.cnki.acta.snus.2020.02.001
[12]	XU Bingbing, CEN Keting, HUANG Junjie, et al. A survey on graph convolutional neural network[J]. Chinese Journal of Computers, 2020, 43(5): 755–780. doi: 10.11897/SP.J.1016.2020.00755
[13]	GE Yao and CHEN Songcan. Graph convolutional network for recommender systems[J]. Journal of Software, 2020, 31(4): 1101–1112. doi: 10.3969/j.issn.1000-9825.2020.04.016
[14]	WANG Xin, LI Ke, NING Chen, et al. Remote sensing image classification method based on deep convolution neural network and multi-kernel learning[J]. Journal of Electronics & Information Technology, 2019, 41(5): 1098–1105. doi: 10.11999/JEIT180628
[15]	HUANG Jian, ERTEKIN S, and GILES C L. Efficient name disambiguation for large-scale databases[C]. 10th European Conference on Principles and Practice of Knowledge Discovery, Berlin, Germany, 2006: 536–544. doi: 10.1007/11871637_53.
[16]	YOSHIDA M, IKEDA M, ONO S, et al. Person name disambiguation by bootstrapping[C]. The 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Geneva, Switzerland, 2010: 10–17. doi: 10.1145/1835449.1835454.
[17]	ZHU Jia, WU Xingcheng, LIN Xueqin, et al. A novel multiple layers name disambiguation framework for digital libraries using dynamic clustering[J]. Scientometrics, 2018, 114(3): 781–794. doi: 10.1007/s11192-017-2611-8
[18]	FAN Xiaoming, WANG Jianyong, PU Xu, et al. On graph-based name disambiguation[J]. Journal of Data and Information Quality, 2011, 2(2): 10. doi: 10.1145/1891879.1891883
[19]	TANG Jie, FONG A C M, WANG Bo, et al. A unified probabilistic framework for name disambiguation in digital library[J]. IEEE Transactions on Knowledge and Data Engineering, 2012, 24(6): 975–987. doi: 10.1109/TKDE.2011.13
[20]	HERMANSSON L, KEROLA T, JOHANSSON F, et al. Entity disambiguation in anonymized graphs using graph kernels[C]. The 22nd ACM International Conference on Information & Knowledge Management, San Francisco, USA, 2013: 1037–1046. doi: 10.1145/2505515.2505565.
[21]	KIPF T N and WELLING M. Semi-supervised classification with graph convolutional networks[J]. arXiv: 1609.02907, 2016.
[22]	DEVLIN J, CHANG Mingwei, LEE K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[EB/OL]. https://arxiv.org/pdf/1810.04805.pdf, 2019.
[23]	BELTAGY I, LO K, and COHAN A. SciBERT: A pretrained language model for scientific text[C]. The 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Hong Kong, China, 2019: 3615–3620.
[24]	XU Jun, SHEN Siqi, LI Dongsheng, et al. A network-embedding based method for author disambiguation[C]. The 27th ACM International Conference on Information and Knowledge Management, Torino, Italy, 2018: 1735–1738. doi: 10.1145/3269206.3269272.