Author Name Disambiguation Based on Semi-supervised Learning with Graph Convolutional Network

Xiaoguang SHENG, Ying WANG, Li QIAN, Ying WANG

Citation: Xiaoguang SHENG, Ying WANG, Li QIAN, Ying WANG. Author Name Disambiguation Based on Semi-supervised Learning with Graph Convolutional Network[J]. Journal of Electronics & Information Technology, 2021, 43(12): 3442-3450. doi: 10.11999/JEIT200905

doi: 10.11999/JEIT200905
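Funds: The National Natural Science Foundation of China (61702038), The National Social Science Foundation of China (15CTQ006)
Details
    Author biographies:

    Xiaoguang SHENG: male, born in 1989, PhD candidate; research interests: educational data mining and artificial intelligence

    Ying WANG: female, born in 1982, associate research librarian; research interests: knowledge organization and knowledge mining

    Li QIAN: male, born in 1981, research librarian; research interests: big data and machine intelligence

    Ying WANG: female, born in 1969, professor; research interests: digital signal processing and educational data mining

    Corresponding author:

    Xiaoguang SHENG shengxiaoguang@ucas.ac.cn

  • 1) https://api.fanyi.baidu.com/
  • CLC number: TP391.1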

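  • Abstract: To match scholars with their publications accurately, this paper proposes an author name disambiguation method based on semi-supervised learning with a graph convolutional network. The method uses the SciBERT pre-trained language model to encode paper titles and keywords into semantic representation vectors for the paper nodes. It uses the author and institution information of the papers to build the adjacency matrices of a paper co-authorship network and an institution-association network, and it collects pseudo-labels from the co-authorship network to obtain a positive sample set and a negative sample set. These serve as the input to a graph convolutional network trained by semi-supervised learning; the resulting paper node embeddings are then clustered to disambiguate authors who share a name. Experimental results show that the method outperforms the other disambiguation methods compared on the experimental dataset.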
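  • Figure 1  Research framework

    Figure 2  Semantic representation of papers based on the BERT pre-trained model

    Figure 3  Paper co-authorship network and institution-association network

    Figure 4  Performance comparison of weight combinations

    Figure 5  Precision comparison when adjusting the weight $ {\beta }_{1} $

    Figure 6  Comparison experiment results for the balancing parameter $ \mathrm{lam} $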

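    Table 1  Author name disambiguation algorithm based on semi-supervised learning with a graph convolutional network

     Input: set $ P $ of papers by authors sharing the same name
     Output: sequence of paper uuids and the corresponding cluster_out list
     (1) Parse the paper metadata to obtain the unique identifier uuid, title, keywords, abstract, publication name, author list, and institution list;
     (2) Preprocess the data, e.g. Chinese-to-English conversion and special-character handling;
     (3) Feed the list of texts $ d $, each the concatenation of a title and its keywords, into the BERT model to compute the BERT semantic representation vectors $ \boldsymbol{X} $;
     (4) Traverse the paper list to build the paper networks $ {\mathcal{G}}_{ca} $ and $ {\mathcal{G}}_{ci} $, establishing the co-authorship and institution-association relations, and obtain the adjacency matrices $ {\boldsymbol{A}}_{ca} $ and $ {\boldsymbol{A}}_{ci} $;
     (5) Collect pseudo-labels from the paper co-authorship network $ {\mathcal{G}}_{ca} $ to obtain the positive and negative sample sets $ {\xi }_{+} $ and $ {\xi }_{-} $;
     (6) Initialize the model and start GCN training;
     (7) for epoch in range(nums_epoch):
     (8)   Perform graph convolution according to Eq. (1);
     (9)   Perform graph convolution according to Eq. (2);
     (10)  Perform the fully connected layer operation according to Eq. (3);
     (11)  Obtain the node vectors $ \boldsymbol{Z} $ according to Eq. (4);
     (12)  Compute the loss function according to Eq. (5) and back-propagate the gradients to update the parameters and node vectors;
     (13)  Back-propagate to update the parameters;
     (14) end for
     (15) Cluster the final trained paper node vectors $ \boldsymbol{Z} $ with AgglomerativeClustering()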
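Steps (4) and (8) to (11) of Table 1 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation. Eqs. (1) to (4) are not reproduced on this page, so the sketch substitutes the standard propagation rule of Kipf and Welling [21], which multiplies by $ D^{-1/2}(A+I)D^{-1/2} $. The rule "two papers are linked if they share a co-author", the toy `papers` list, the helper names, the layer sizes, and the random stand-in for the SciBERT vectors $ \boldsymbol{X} $ are all hypothetical. No training is performed, so the weights stay random.

```python
import numpy as np

# Hypothetical toy input: each paper lists its co-authors
# (the shared ambiguous name itself is left out).
papers = [
    {"uuid": "p0", "coauthors": {"A. Chen", "B. Zhou"}},
    {"uuid": "p1", "coauthors": {"A. Chen"}},
    {"uuid": "p2", "coauthors": {"C. Xu"}},
    {"uuid": "p3", "coauthors": {"C. Xu", "D. Lin"}},
]


def coauthor_adjacency(papers):
    """Step (4), assumed rule: link two papers that share a co-author."""
    n = len(papers)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            if papers[i]["coauthors"] & papers[j]["coauthors"]:
                A[i, j] = A[j, i] = 1.0
    return A


def normalize(A):
    """Symmetric normalization with self-loops: D^{-1/2} (A + I) D^{-1/2}."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_forward(A_norm, X, W0, W1):
    """Two graph-convolution layers: Z = A_norm @ ReLU(A_norm @ X @ W0) @ W1."""
    H = np.maximum(A_norm @ X @ W0, 0.0)
    return A_norm @ H @ W1


rng = np.random.default_rng(0)
X = rng.normal(size=(len(papers), 8))  # random stand-in for SciBERT vectors
W0 = rng.normal(size=(8, 4))           # untrained, randomly initialized weights
W1 = rng.normal(size=(4, 2))

A_ca = coauthor_adjacency(papers)
Z = gcn_forward(normalize(A_ca), X, W0, W1)
print(A_ca)
print(Z.shape)
```

In a trained model, the rows of `Z` would be the node vectors passed to the clustering of step (15), for example through scikit-learn's `AgglomerativeClustering`.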

    Table 2  Test set of authors to be disambiguated

    Name          Papers  True authors  Name          Papers  True authors
    Tao Huang     167     2             Yunshan Wang  46      2
    Haibo Li      132     3             Liang Wang    119     6
    Ming Li       312     10            Lin Wang      56      2
    Wei Li        27      2             Gang Xiong    151     2
    Jia Liu       29      2             Jun Yang      395     7
    Jie Liu       30      2             Peng Zhang    237     10
    Jing Liu      277     6             Tao Zhang     939     9
    Jun Liu       228     7             Xu Zhao       131     3
    Yun Liu       169     5             Feng Zhao     122     6
    Bin Wang      201     8             Ming Zhu      31      2

    Table 3  Comparison of experimental results (%)

    Methods: (A) this paper; (B) anonymized-graph network embedding [7]; (C) multi-dimensional network embedding [24]; (D) rule-based method

    Name          A-Pre  A-Rec  A-F1   B-Pre  B-Rec  B-F1   C-Pre  C-Rec  C-F1   D-Pre  D-Rec  D-F1
    Tao Huang     98.29  97.20  97.74  87.32  82.90  85.05  90.12  86.20  88.46  73.53  29.33  41.93
    Haibo Li      87.57  93.77  90.56  86.58  87.22  86.89  80.06  78.68  79.82  42.77  58.06  49.25
    Ming Li       92.25  83.30  87.55  60.01  62.95  61.44  74.64  70.03  72.65  15.78  79.62  26.33
    Wei Li        73.37  65.61  69.27  78.70  70.37  74.30  80.14  77.35  78.24  52.16  83.07  64.08
    Jia Liu       56.82  84.03  67.8   92.59  84.03  88.11  88.23  78.65  82.78  58.62  60.01  73.91
    Jie Liu       74.29  53.61  62.28  100.0  100.0  100.0  80.12  70.30  76.24  72.20  64.26  68.0
    Jing Liu      92.63  66.91  77.69  76.96  54.26  63.65  78.11  56.47  64.94  35.47  60.16  44.63
    Jun Liu       91.45  74.57  82.15  98.12  95.07  96.61  96.42  80.20  88.23  59.72  96.63  73.81
    Yun Liu       99.33  99.72  99.52  97.30  87.35  92.06  91.84  82.42  86.21  48.39  62.36  54.49
    Bin Wang      91.13  47.09  62.09  81.16  31.51  45.39  94.39  49.20  64.69  49.38  61.51  54.78
    Yunshan Wang  85.8   81.92  83.82  93.01  90.21  91.59  87.01  84.18  86.53  51.30  60.01  57.81
    Liang Wang    82.92  76.06  79.34  50.77  57.01  53.71  62.65  60.27  61.38  20.74  66.77  31.65
    Lin Wang      62.75  82.53  71.30  63.73  86.54  73.41  66.19  88.20  76.69  64.23  90.42  75.11
    Gang Xiong    99.00  89.21  93.85  98.43  84.30  90.82  94.33  89.21  92.30  83.70  42.35  56.24
    Jun Yang      94.42  83.46  88.60  74.35  71.84  73.08  78.81  75.25  77.59  20.37  32.61  25.07
    Peng Zhang    75.36  70.76  72.99  48.74  40.60  44.30  56.30  58.43  57.38  16.09  62.81  25.62
    Tao Zhang     99.02  89.50  94.02  80.12  73.04  76.41  88.23  86.52  87.11  42.99  29.06  34.67
    Xu Zhao       89.97  66.54  76.50  99.22  95.54  97.35  90.67  86.15  88.46  61.38  81.91  70.18
    Feng Zhao     92.25  89     90.59  86.14  78.16  81.96  83.92  80.54  82.78  27.49  62.33  38.15
    Ming Zhu      81.57  84.90  83.20  81.57  84.90  83.20  83.12  85.29  84.22  58.29  41.63  48.57
    Average       86.01  78.98  81.54  81.75  75.88  77.97  82.27  76.18  78.84  47.73  59.29  48.96
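The relation between the Pre, Rec and F1 columns of Table 3 can be checked with a short calculation. The sketch below assumes F1 is the usual harmonic mean of precision and recall, which this page does not state explicitly; the function name `f1` and the variable names are illustrative. It recomputes the Tao Huang F1 of the proposed method from its precision and recall, and compares two ways of forming the average row: the mean of the twenty per-name F1 values versus the F1 of the averaged precision and recall.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    return 2 * precision * recall / (precision + recall)


# F1 values of the proposed method in Table 3, in row order.
f1_this_paper = [
    97.74, 90.56, 87.55, 69.27, 67.8, 62.28, 77.69, 82.15, 99.52, 62.09,
    83.82, 79.34, 71.30, 93.85, 88.60, 72.99, 94.02, 76.50, 90.59, 83.20,
]

tao_huang_f1 = round(f1(98.29, 97.20), 2)                    # from the Tao Huang row
mean_f1 = round(sum(f1_this_paper) / len(f1_this_paper), 2)  # mean of per-name F1
f1_of_means = round(f1(86.01, 78.98), 2)                     # F1 of averaged Pre and Rec

print(tao_huang_f1, mean_f1, f1_of_means)
```

Under this assumption the first two values agree with the table (97.74 and 81.54), while the F1 of the averaged precision and recall comes out higher, which suggests that the average row reports the mean of the per-name scores.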

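    Table 4  Comparison of clustering results by component (%)

    Component                                                                          Avg-Pre  Avg-Rec  Avg-F1
    Clustering on the text semantic representation vectors of the papers               52.20    66.26    57.03
    Clustering on co-author/institution relations computed by the graph convolutional network  76.49    78.38    75.76
    Combined                                                                           86.01    78.98    81.54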

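    Table 5  Comparison of disambiguation results using different text semantic representation models (%)

    Model                           Avg-Pre  Avg-Rec  Avg-F1  Text language
    Word2Vec                        47.73    65.24    51.22   Mixed Chinese and English
    BERT-base-multilingual-uncased  45.68    67.07    54.35   Mixed Chinese and English
    BERT-wwm-Chinese                48.07    63.88    51.66   Mixed Chinese and English
    BERT-base-uncased               46.60    71.58    55.34   English
    SciBERT                         52.20    66.26    57.03   English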

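    Table 6  Comparison of disambiguation results for different text content (%)

    Text content                       Avg-Pre  Avg-Rec  Avg-F1
    Title, keywords                    52.20    66.26    57.03
    Title, keywords, publication name  49.43    67.25    55.16
    Title, keywords, abstract          50.75    58.87    52.98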
  • [1] ORCID. What is ORCID[EB/OL]. https://www.lanl.gov/library/scholarly/orcid.php.
    [2] Thomson Reuters Company. What is ResearcherID?[EB/OL]. https://libanswers.lib.xjtlu.edu.cn/faq/240918, 2020.
    [3] HAN Hui, GILES L, ZHA Hongyuan, et al. Two supervised learning approaches for name disambiguation in author citations[C]. Proceedings of the 4th ACM/IEEE-CS Joint Conference on Digital Libraries, Tucson, USA, 2004: 296–305. doi: 10.1145/996350.996419
    [4] MALIN B. Unsupervised name disambiguation via social network similarity[C]. Proceedings of the SIAM Workshop on Link Analysis, Counterterrorism, and Security, Newport Beach, USA, 2005: 93–102.
    [5] HAN Hui, ZHA Hongyuan, and GILES C L. Name disambiguation in author citations using a K-way spectral clustering method[C]. Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries, Denver, USA, 2005: 334–343. doi: 10.1145/1065385.1065462
    [6] ZHANG Yutao, ZHANG Fanjin, YAO Peiran, et al. Name disambiguation in AMiner: Clustering, maintenance, and human in the loop[C]. The Twenty-Fourth ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 2018: 1002–1011. doi: 10.1145/3219819.3219859
    [7] ZHANG Baichuan and AL HASAN M. Name disambiguation in anonymized graphs using network embedding[C]. The 2017 ACM on Conference on Information and Knowledge Management, Singapore, 2017: 1239–1248. doi: 10.1145/3132847.3132873
    [8] GAI Shan and BAO Zhongyun. Banknote recognition research based on improved deep convolutional neural network[J]. Journal of Electronics & Information Technology, 2019, 41(8): 1992–2000. doi: 10.11999/JEIT181097
    [9] LU Junyan, JIA Hongguang, GAO Fang, et al. Reconstruction of digital surface model of single-view remote sensing image by semantic segmentation network[J]. Journal of Electronics & Information Technology, 2021, 43(4): 974–981. doi: 10.11999/JEIT200031
    [10] SUN Xiao, PENG Xiaoqi, HU Min, et al. Extended multi-modality features and deep learning based microblog short text sentiment analysis[J]. Journal of Electronics & Information Technology, 2017, 39(9): 2048–2055. doi: 10.11999/JEIT160975
    [11] ZHENG Ruigang, CHEN Weifu, and FENG Guocan. A concise survey on graph convolutional networks[J]. Acta Scientiarum Naturalium Universitatis Sunyatseni, 2020, 59(2): 1–14. doi: 10.13471/j.cnki.acta.snus.2020.02.001
    [12] XU Bingbing, CEN Keting, HUANG Junjie, et al. A survey on graph convolutional neural network[J]. Chinese Journal of Computers, 2020, 43(5): 755–780. doi: 10.11897/SP.J.1016.2020.00755
    [13] GE Yao and CHEN Songcan. Graph convolutional network for recommender systems[J]. Journal of Software, 2020, 31(4): 1101–1112. doi: 10.3969/j.issn.1000-9825.2020.04.016
    [14] WANG Xin, LI Ke, NING Chen, et al. Remote sensing image classification method based on deep convolution neural network and multi-kernel learning[J]. Journal of Electronics & Information Technology, 2019, 41(5): 1098–1105. doi: 10.11999/JEIT180628
    [15] HUANG Jian, ERTEKIN S, and GILES C L. Efficient name disambiguation for large-scale databases[C]. 10th European Conference on Principles and Practice of Knowledge Discovery, Berlin, Germany, 2006: 536–544. doi: 10.1007/11871637_53
    [16] YOSHIDA M, IKEDA M, ONO S, et al. Person name disambiguation by bootstrapping[C]. The 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Geneva, Switzerland, 2010: 10–17. doi: 10.1145/1835449.1835454
    [17] ZHU Jia, WU Xingcheng, LIN Xueqin, et al. A novel multiple layers name disambiguation framework for digital libraries using dynamic clustering[J]. Scientometrics, 2018, 114(3): 781–794. doi: 10.1007/s11192-017-2611-8
    [18] FAN Xiaoming, WANG Jianyong, PU Xu, et al. On graph-based name disambiguation[J]. Journal of Data and Information Quality, 2011, 2(2): 10. doi: 10.1145/1891879.1891883
    [19] TANG Jie, FONG A C M, WANG Bo, et al. A unified probabilistic framework for name disambiguation in digital library[J]. IEEE Transactions on Knowledge and Data Engineering, 2012, 24(6): 975–987. doi: 10.1109/TKDE.2011.13
    [20] HERMANSSON L, KEROLA T, JOHANSSON F, et al. Entity disambiguation in anonymized graphs using graph kernels[C]. The 22nd ACM International Conference on Information & Knowledge Management, San Francisco, USA, 2013: 1037–1046. doi: 10.1145/2505515.2505565
    [21] KIPF T N and WELLING M. Semi-supervised classification with graph convolutional networks[J]. arXiv: 1609.02907, 2016.
    [22] DEVLIN J, CHANG Mingwei, LEE K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[EB/OL]. https://arxiv.org/pdf/1810.04805.pdf, 2019.
    [23] BELTAGY I, LO K, and COHAN A. SciBERT: A pretrained language model for scientific text[C]. The 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Hong Kong, China, 2019: 3615–3620.
    [24] XU Jun, SHEN Siqi, LI Dongsheng, et al. A network-embedding based method for author disambiguation[C]. The 27th ACM International Conference on Information and Knowledge Management, Torino, Italy, 2018: 1735–1738. doi: 10.1145/3269206.3269272
Publication history
  • Received:  2020-10-23
  • Revised:  2021-09-23
  • Accepted:  2021-11-04
  • Published online:  2021-11-10
  • Issue date:  2021-12-21
