Author Name Disambiguation Based on Semi-supervised Learning with Graph Convolutional Network
Abstract: To match scholars precisely with their publications, this paper proposes an author name disambiguation method based on semi-supervised learning with a graph convolutional network (GCN). The method uses the SciBERT pre-trained language model to compute a semantic representation vector for each paper node from the paper's title and keywords. The authors and organizations of the papers are used to build the adjacency matrices of the paper co-author network and the co-organization network. Pseudo labels are collected from the co-author network to obtain positive and negative sample sets. The semantic vectors, the adjacency matrices, and the positive and negative sample sets are fed to the GCN for semi-supervised learning, which yields paper node embeddings; these embeddings are then clustered to disambiguate the paper authors. Experimental results show that the method achieves better results on the experimental dataset than the other disambiguation methods it is compared with.
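The abstract's co-author network and pseudo-label steps can be illustrated with a small sketch. It is not the paper's implementation: the linking rule (two papers are connected when they share a co-author other than the ambiguous name), the rule that every unlinked pair counts as a pseudo-negative, and the helper names `build_coauthor_adjacency` and `sample_pseudo_pairs` are all assumptions made for illustration.

```python
from itertools import combinations


def build_coauthor_adjacency(papers, target_name):
    """Return a symmetric 0/1 adjacency matrix over the given papers.

    Assumed rule: two papers are linked when they share at least one
    co-author other than the ambiguous target name.
    """
    target = target_name.strip().lower()
    coauthors = [
        {a.strip().lower() for a in p["authors"]} - {target} for p in papers
    ]
    n = len(papers)
    adj = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        if coauthors[i] & coauthors[j]:
            adj[i][j] = adj[j][i] = 1
    return adj


def sample_pseudo_pairs(adj):
    """Split paper pairs into pseudo-positive and pseudo-negative sets.

    Assumed rule: linked pairs are positives, all unlinked pairs are
    negatives. A real system would likely subsample the negatives.
    """
    pairs = list(combinations(range(len(adj)), 2))
    positives = [(i, j) for i, j in pairs if adj[i][j]]
    negatives = [(i, j) for i, j in pairs if not adj[i][j]]
    return positives, negatives


# Toy records for the ambiguous name "Wei Li" (made-up data).
papers = [
    {"uuid": "p1", "authors": ["Wei Li", "Jun Yang"]},
    {"uuid": "p2", "authors": ["Wei Li", "Jun Yang", "Ming Zhu"]},
    {"uuid": "p3", "authors": ["Wei Li", "Xu Zhao"]},
]
adj = build_coauthor_adjacency(papers, "Wei Li")
positives, negatives = sample_pseudo_pairs(adj)
```

In this toy input only p1 and p2 share a co-author, so they form the single pseudo-positive pair. The same construction, applied to organization lists instead of author lists, would give a co-organization adjacency matrix.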
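Table 1 Author name disambiguation algorithm based on semi-supervised learning with a graph convolutional network

Input: the set $ P $ of papers by same-name authors
Output: the sequence of paper uuids and the corresponding cluster_out list
(1) Parse the paper metadata to obtain the unique identifier uuid, title, keywords, abstract, publication name, author list, and organization list;
(2) Preprocess the data, e.g., Chinese-English conversion and special-character handling;
(3) Feed the list of concatenated title-and-keyword texts $ d $ into the BERT model and compute the BERT semantic representation vectors $ \boldsymbol{X} $;
(4) Traverse the paper list to build the paper networks $ \mathcal{g}_{ca} $ and $ \mathcal{g}_{ci} $, establishing the co-author and co-organization relations, and obtain the adjacency matrices $ \boldsymbol{A}_{ca} $ and $ \boldsymbol{A}_{ci} $;
(5) Collect pseudo labels from the paper co-author network $ \mathcal{g}_{ca} $ to obtain the positive and negative sample sets $ \xi_{+} $ and $ \xi_{-} $;
(6) Initialize the model and start GCN training
(7) for epoch in range(nums_epoch):
(8)     Perform graph convolution according to Eq. (1);
(9)     Perform graph convolution according to Eq. (2);
(10)    Apply the fully connected layer according to Eq. (3);
(11)    Obtain the node vectors $ \boldsymbol{Z} $ according to Eq. (4);
(12)    Compute the loss function according to Eq. (5) and backpropagate the gradients to update the parameters and node vectors;
(13)    Backpropagate to update the parameters;
(14) end for
(15) Cluster the final trained paper node vectors $ \boldsymbol{Z} $ with AgglomerativeClustering()

Table 2 Test set of author names to be disambiguated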
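| Name | Papers | True authors | Name | Papers | True authors |
|---|---|---|---|---|---|
| Tao Huang | 167 | 2 | Yunshan Wang | 46 | 2 |
| Haibo Li | 132 | 3 | Liang Wang | 119 | 6 |
| Ming Li | 312 | 10 | Lin Wang | 56 | 2 |
| Wei Li | 27 | 2 | Gang Xiong | 151 | 2 |
| Jia Liu | 29 | 2 | Jun Yang | 395 | 7 |
| Jie Liu | 30 | 2 | Peng Zhang | 237 | 10 |
| Jing Liu | 277 | 6 | Tao Zhang | 939 | 9 |
| Jun Liu | 228 | 7 | Xu Zhao | 131 | 3 |
| Yun Liu | 169 | 5 | Feng Zhao | 122 | 6 |
| Bin Wang | 201 | 8 | Ming Zhu | 31 | 2 |

Table 3 Comparison of experimental results (%)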
| Name | Proposed method | | | Anonymized graph network embedding [7] | | | Multi-dimensional network embedding [24] | | | Rule-based method | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Pre | Rec | F1 | Pre | Rec | F1 | Pre | Rec | F1 | Pre | Rec | F1 |
| Tao Huang | 98.29 | 97.20 | 97.74 | 87.32 | 82.90 | 85.05 | 90.12 | 86.20 | 88.46 | 73.53 | 29.33 | 41.93 |
| Haibo Li | 87.57 | 93.77 | 90.56 | 86.58 | 87.22 | 86.89 | 80.06 | 78.68 | 79.82 | 42.77 | 58.06 | 49.25 |
| Ming Li | 92.25 | 83.30 | 87.55 | 60.01 | 62.95 | 61.44 | 74.64 | 70.03 | 72.65 | 15.78 | 79.62 | 26.33 |
| Wei Li | 73.37 | 65.61 | 69.27 | 78.70 | 70.37 | 74.30 | 80.14 | 77.35 | 78.24 | 52.16 | 83.07 | 64.08 |
| Jia Liu | 56.82 | 84.03 | 67.8 | 92.59 | 84.03 | 88.11 | 88.23 | 78.65 | 82.78 | 58.62 | 60.01 | 73.91 |
| Jie Liu | 74.29 | 53.61 | 62.28 | 100.0 | 100.0 | 100.0 | 80.12 | 70.30 | 76.24 | 72.20 | 64.26 | 68.0 |
| Jing Liu | 92.63 | 66.91 | 77.69 | 76.96 | 54.26 | 63.65 | 78.11 | 56.47 | 64.94 | 35.47 | 60.16 | 44.63 |
| Jun Liu | 91.45 | 74.57 | 82.15 | 98.12 | 95.07 | 96.61 | 96.42 | 80.20 | 88.23 | 59.72 | 96.63 | 73.81 |
| Yun Liu | 99.33 | 99.72 | 99.52 | 97.30 | 87.35 | 92.06 | 91.84 | 82.42 | 86.21 | 48.39 | 62.36 | 54.49 |
| Bin Wang | 91.13 | 47.09 | 62.09 | 81.16 | 31.51 | 45.39 | 94.39 | 49.20 | 64.69 | 49.38 | 61.51 | 54.78 |
| Yunshan Wang | 85.8 | 81.92 | 83.82 | 93.01 | 90.21 | 91.59 | 87.01 | 84.18 | 86.53 | 51.30 | 60.01 | 57.81 |
| Liang Wang | 82.92 | 76.06 | 79.34 | 50.77 | 57.01 | 53.71 | 62.65 | 60.27 | 61.38 | 20.74 | 66.77 | 31.65 |
| Lin Wang | 62.75 | 82.53 | 71.30 | 63.73 | 86.54 | 73.41 | 66.19 | 88.20 | 76.69 | 64.23 | 90.42 | 75.11 |
| Gang Xiong | 99.00 | 89.21 | 93.85 | 98.43 | 84.30 | 90.82 | 94.33 | 89.21 | 92.30 | 83.70 | 42.35 | 56.24 |
| Jun Yang | 94.42 | 83.46 | 88.60 | 74.35 | 71.84 | 73.08 | 78.81 | 75.25 | 77.59 | 20.37 | 32.61 | 25.07 |
| Peng Zhang | 75.36 | 70.76 | 72.99 | 48.74 | 40.60 | 44.30 | 56.30 | 58.43 | 57.38 | 16.09 | 62.81 | 25.62 |
| Tao Zhang | 99.02 | 89.50 | 94.02 | 80.12 | 73.04 | 76.41 | 88.23 | 86.52 | 87.11 | 42.99 | 29.06 | 34.67 |
| Xu Zhao | 89.97 | 66.54 | 76.50 | 99.22 | 95.54 | 97.35 | 90.67 | 86.15 | 88.46 | 61.38 | 81.91 | 70.18 |
| Feng Zhao | 92.25 | 89 | 90.59 | 86.14 | 78.16 | 81.96 | 83.92 | 80.54 | 82.78 | 27.49 | 62.33 | 38.15 |
| Ming Zhu | 81.57 | 84.90 | 83.20 | 81.57 | 84.90 | 83.20 | 83.12 | 85.29 | 84.22 | 58.29 | 41.63 | 48.57 |
| Average | 86.01 | 78.98 | 81.54 | 81.75 | 75.88 | 77.97 | 82.27 | 76.18 | 78.84 | 47.73 | 59.29 | 48.96 |

Table 4 Comparison of clustering results by component (%)
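| Component | Avg-Pre | Avg-Rec | Avg-F1 |
|---|---|---|---|
| Clustering on the paper text semantic representation vectors | 52.20 | 66.26 | 57.03 |
| Clustering on co-author/organization relations computed by the graph convolutional network | 76.49 | 78.38 | 75.76 |
| Combined | 86.01 | 78.98 | 81.54 |

Table 5 Comparison of disambiguation results with different text semantic representation models (%)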
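| Model | Avg-Pre | Avg-Rec | Avg-F1 | Text language |
|---|---|---|---|---|
| Word2Vec | 47.73 | 65.24 | 51.22 | Mixed Chinese and English |
| BERT-base-multilingual-uncased | 45.68 | 67.07 | 54.35 | Mixed Chinese and English |
| BERT-wwm-Chinese | 48.07 | 63.88 | 51.66 | Mixed Chinese and English |
| BERT-base-uncased | 46.60 | 71.58 | 55.34 | English |
| SciBERT | 52.20 | 66.26 | 57.03 | English |

Table 6 Comparison of disambiguation results for different text content (%)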
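| Text content | Avg-Pre | Avg-Rec | Avg-F1 |
|---|---|---|---|
| Title, keywords | 52.20 | 66.26 | 57.03 |
| Title, keywords, publication name | 49.43 | 67.25 | 55.16 |
| Title, keywords, abstract | 50.75 | 58.87 | 52.98 |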
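
The tables report Pre, Rec, and F1, but this excerpt does not state how they are computed. As a minimal sketch, assuming the pairwise definition often used in name disambiguation work (a pair of papers counts as correct when both the prediction and the ground truth assign it to the same author), the scores for one ambiguous name could be computed as follows. The function name `pairwise_prf` and the toy labels are hypothetical.

```python
from itertools import combinations


def pairwise_prf(true_labels, pred_labels):
    """Pairwise precision, recall, and F1 for one clustering result.

    A pair (i, j) is a predicted positive when both papers fall in the
    same predicted cluster, and a true positive when they also belong
    to the same real author.
    """
    pairs = list(combinations(range(len(true_labels)), 2))
    same_true = {(i, j) for i, j in pairs if true_labels[i] == true_labels[j]}
    same_pred = {(i, j) for i, j in pairs if pred_labels[i] == pred_labels[j]}
    tp = len(same_true & same_pred)
    precision = tp / len(same_pred) if same_pred else 0.0
    recall = tp / len(same_true) if same_true else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Five papers, two real authors; the prediction misplaces paper 2.
true_labels = [0, 0, 0, 1, 1]
pred_labels = [0, 0, 1, 1, 1]
precision, recall, f1 = pairwise_prf(true_labels, pred_labels)
```

Under this assumption, the Avg-Pre, Avg-Rec, and Avg-F1 columns would be the means of the per-name scores over the 20 test names.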
