一种基于用户轨迹的跨社交网络用户身份识别算法

陈鸿昶; 徐乾; 黄瑞阳; 程晓涛; 吴铮

doi:10.11999/JEIT180130

一种基于用户轨迹的跨社交网络用户身份识别算法

doi: 10.11999/JEIT180130 cstr: 32379.14.JEIT180130

国家数字交换系统工程技术研究中心郑州 450002

基金项目: 国家自然科学基金(61521003)

详细信息

作者简介:
陈鸿昶：男，1964年生，教授，研究方向为通信与信息系统

徐乾：男，1993年生，硕士生，研究方向为社交网络挖掘、机器学习

黄瑞阳：男，1986年生，副研究员，研究方向为网络大数据处理与分析

程晓涛：男，1986年生，博士生，研究方向为社交网络挖掘、机器学习

吴铮：男，1992年生，博士生，研究方向为复杂网络、网络大数据处理与分析

通讯作者:
徐乾　 549529376@qq.com

中图分类号: TP393; TP391.4
计量
- 文章访问数: 2943
- HTML全文浏览量: 1227
- PDF下载量: 109
- 被引次数: 0
出版历程
- 收稿日期: 2018-01-30
- 修回日期: 2018-06-11
- 网络出版日期: 2018-06-30
- 刊出日期: 2018-11-01

User Identification Across Social Networks Based on User Trajectory

National Digital Switching Engineering & Technological Research Center, Zhengzhou 450002, China

Funds: The National Natural Science Foundation of China (61521003)

摘要

摘要: 针对现有的基于用户轨迹的跨社交网络用户身份识别算法未考虑用户轨迹中的位置访问顺序特征的缺点，该文提出一种基于Paragraph2vec的跨社交网络用户轨迹匹配算法(CDTraj2vec)。首先将用户轨迹转化为易于处理的网格化表示，并按照一定的时间粒度、距离尺度对原始的用户轨迹进行划分，使用户轨迹中的位置访问顺序特征易于抽取；然后利用Paragraph2vec算法中PV-DM模型抽取轨迹序列中位置访问顺序特征，得到用户轨迹的向量表示。最后通过用户轨迹向量判定轨迹是否匹配。在社交网络BrightKite上的实验结果表明，与基于位置访问频率或者基于轨迹间距离的方法相比，F值提高了2%～4%个百分点，所提算法能够有效地抽取出用户轨迹中的位置访问顺序特征，更加准确地实现了基于用户轨迹的跨社交网络用户身份识别。
- 社交网络 /
- 用户身份识别 /
- 轨迹相似度 /
- Paragraph2vec
Abstract: The performance of trajectory based user identification is poor since the existing methods ignore the order feature of location sequence. To solve this problem, a Cross Domain Trajectory matching algorithm based on Paragraph2vec (CDTraj2vec) is proposed. Firstly, the user trajectory is transformed to the grid representation which is easy to handle. The PV-DM model in the Paragraph2vec algorithm is utilized for extracting order feature of location sequence in trajectory. Then the original user trajectories are divided by a certain time size and distance scale to construct a training sample suitable for training PV-DM model. The PV-DM model is trained by different types of training samples, and the vector representation of the user trajectories is obtained. Finally, the matching of the trajectory is determined by the user trajectory vector. Experimental results on BrightKite shows that the F-measure is improved by 2%～4% compared with the existing frequency based and distance based algorithm. The proposed algorithm can effectively extract the order feature of location sequence, and realize the trajectory based user identification across social networks.
- Social networks /
- User identification /
- Trajectory similarity /
- Paragraph2vec

HTML全文

图 1 按时间划分轨迹

下载: 全尺寸图片幻灯片

图 2 按距离划分轨迹

下载: 全尺寸图片幻灯片

图 3 在时空域上构建位置序列

下载: 全尺寸图片幻灯片

图 4 用于轨迹特征抽取的PV-DM模型结构图

下载: 全尺寸图片幻灯片

表 1 CDTranj2vec伪代码

算法1 基于Paragraph2vec的用户轨迹匹配算法 CDTraj2vec $\left( {{D_A},{D_B},w,d} \right)$
输入：时空数据集 ${D_A},{D_B}$，窗口大小 $w$，用户轨迹向量的维数 $d$
输出：匹配结果
(1)将原始时空数据集 ${D_A},{D_B}$转化为小网格表示，转化后的用户轨迹为 $D_A^c,D_B^c$
(2) $D_{AB}^c = D_A^c + D_B^c$
(3)随机初始化 ${{{Φ}}_{\rm{t}}},{{{Φ}}_{\rm{d}}},{{{Φ}}_{{\rm{st}}}} \in {\mathbb{R}^{\left( {\left\| {{D_A}} \right\| + \left\| {{D_B}} \right\|} \right) \times d}}$
(4)按照2.2.2节中的3种训练样本构建方法，由 $D_{AB}^c$构建出3种训练样本集合 ${\rm{S}}{{\rm{p}}_{\rm{t}}}$, ${\rm{S}}{{\rm{p}}_{\rm{d}}}$, ${\rm{S}}{{\rm{p}}_{{\rm{st}}}}$
(5)　for each i in [‘t’, ‘d’, ‘st’] do:
(6)　　　for j = 0 to $\left\| {{\rm{S}}{{\rm{p}}_{{{ij}}}}} \right\|$ do: // ${\rm{S}}{{\rm{p}}_{{{ij}}}} = \left\{ {{‘\rm{ts}’}:\left[ {{c_{{t_1}}},{c_{{t_2}}}, ·\!·\!· ,{c_{{t_{\left\| {{{\rm Sp}_{ij}}} \right\|}}}}} \right],‘{\rm{ui}}’:{\rm{id}}\left( T \,\right)} \right\}$
(7)　　　　PV-DM $\left( {{{Φ} _i},{\rm{Sp}}_{ij},w} \right)$
(8)　　　end for
(9)　　end for
(10)　for i = 0 to $\left\| {{D_A}} \right\| + \left\| {{D_B}} \right\|$ do:
(11)　　　 ${{Φ}}\left( {i,:} \right) = {\rm{concatenate}}\left( {{{Φ}_{\rm t}}\left( {i,:} \right),{{Φ}_{\rm d}}\left( {i,:} \right),{{Φ}_{\rm st}}\left( {i,:} \right)} \right)$
(12)　end for
(13)　 ${\rm{rs}} = {\rm{matching\_strategy}}\left({Φ} \right)$
(14)　return ${\rm{rs}}$

下载: 导出CSV

表 2 PV-DM ( $ {{ {Φ}_i},{{Sp}}_{ij},w} $ )算法

算法2 PV-DM $\left( {{{Φ} _i},{\rm{Sp}}_{ij},w} \right)$
输入：按照方法 $i$生成的训练样本在训练模型的过程中待更新的用户轨迹向量 ${{Φ}_{{i}}}$，按照方法 $i$生成的训练样本中的第 $j$个样本 ${\rm{Sp}}_{ij}$，窗口大小 $w$
输出：无
(1)for $k = w$ to $\left\| {{\rm{Sp}}_{ij}} \right\|$ do: 　(2)　　 ${S} = {{Φ}_i}\left( {{\rm{Sp}}_{ij}.{\rm{ui}}} \right) + \sum\limits_{l = k - w + 1}^{w - 1} {{vc}\left( {{c_{{t_l}}}} \right)} $ 　(3)　　 $J = - \lg \left( {\left. {{c_{{t_k}}}} \right\|{S}} \right)$ 　(4)　for $l = k - w + 1$ to $w - 1$ do: 　(5)　　　 ${vc}\left( {{c_{{t_l}}}} \right) = {vc}\left( {{c_{{t_l}}}} \right) - \alpha \times \frac{{\partial J}}{{\partial {S}}}$ 　(6)　　end for 　(7)　　 ${{Φ} _i}\left( {{\rm{Sp}}_{ij}.{\rm{ui}}} \right) = {{Φ}_i}\left( {{\rm{Sp}}_{ij}.{\rm{ui}}} \right) - \alpha \times \frac{{\partial J}}{{\partial {S}}}$
(8)end for

下载: 导出CSV

表 3 数据集信息

数据来源	Geolife		BrightKite
划分后的数据集	D_A	D_B	D_C	D_E
用户轨迹数	182	182	2971	2974
坐标点个数	16128	16802	332326	328031
不同的小网格个数	3140	3517	73255	72098
网格范围	纬度：39°～41° 经度：115°～117°	纬度：39°～41° 经度：115°～117°	纬度：25°～49° 经度：–130°～–70°	纬度：25°～49° 经度：–130°～–70°
时间范围	2007-04～2012-08	2007-04～2012-08	2008-03～2010-10	2008-03～2010-10

下载: 导出CSV

表 4 文中所提参数的默认值

变量	值	说明
${\rm{sl}}$	100 m	小网格的边长
${\rm{la}}{{\rm{t}}_{\rm{1}}}{\rm{,la}}{{\rm{t}}_{2}}$	29°N, 49°N	小网格的纬度范围
${\rm{lo}}{{\rm{n}}_{\rm{1}}}{\rm{,lo}}{{\rm{n}}_{2}}$	70°W, 30°W	小网格的经度范围
r	53374	小网格行数
c	120932, 87541	最低纬度(29°N)和最高纬度(49°N) 上的小网格列数
Δt	4 h(Geolife), 3 h(BrightKite)	按时间划分轨迹时设定的时间阈值
Δd	1 km	按距离划分轨迹时设定的距离阈值

下载: 导出CSV

表 5 Geolife上的实验结果(Δt = 4 h, Δd = 1 km)

算法	准确率(pc)	召回率(rc)	F值
k-BCT	0.9080	0.9530	0.9299
TF-IDF	0.9651	0.9185	0.9412
CDTraj2vec-sf	0.9241	0.8462	0.8834
CDTraj2vec-wcet	0.9151	0.8694	0.8917
CDTraj2vec (使用训练样本A, B, C)	0.9888	0.9899	0.9893

下载: 导出CSV

表 6 BrightKite上的实验结果(Δt = 3 h, Δd = 1 km)

算法	准确率(pc)	召回率(rc)	F值
k-BCT	0.8802	0.9238	0.9015
TF-IDF	0.8741	0.9164	0.8947
CDTraj2vec-sf	0.8841	0.8457	0.8647
CDTraj2vec-wcet	0.8645	0.8108	0.8367
CDTraj2vec (使用训练样本A, B, C)	0.9465	0.9245	0.9354
CDTraj2vec (只使用训练样本A, B)	0.9440	0.9281	0.9360

下载: 导出CSV

参考文献(15)

桑基韬, 路冬媛, 徐常胜. 基于共同用户的跨网络分析: 社交媒体大数据中的多源问题[J]. 科学通报, 2014, 59(36): 3554–3560 doi: 10.1360/n972014-00292

SANG Jitao, LU Dongyuan, and XU Changsheng. Overlapped user-based cross-network analysis: Exploring variety in big social media data[J]. Chinese Science Bulletin, 2014, 59(36): 3554–3560 doi: 10.1360/n972014-00292

GONZÁLEZ M C, HIDALGO C A, and BARABÁSI A L. Understanding individual human mobility patterns[J]. Nature, 2008, 453(7196): 779–782 doi: 10.1038/nature06958

CAO Wei, WU Zhengwei, WANG Dong, et al. Automatic user identification method across heterogeneous mobility data sources[C]. IEEE International Conference on Data Engineering, Helsinki, Finland, 2016: 978–989.

HAO Tianyi, ZHOU Jingbo, CHENG Yunsheng, et al. User identification in cyber-physical space: A case study on mobile query logs and trajectories[C]. GIS′16 Proceedings of the 24th ACM International Conference on Advances in Geographic Information Systems, California, USA, 2016: 1–4.

HAN Xiaohui, WANG Lianhai, XU Shujiang, et al. Linking social network accounts by modeling user spatiotemporal habits[C]. IEEE International Conference on Intelligence and Security Informatics, Beijing, China, 2017: 19–24.

RIEDERER C, KIM Y, CHAINTREAU A, et al. Linking users across domains with location data: Theory and validation[C]. WWW′16 Proceedings of the 25th International Conference on World Wide Web. Montréal, Canada, 2016: 707–719.

HAN Xiaohui, WANG Lianhai, XU Lijuan, et al. Social Media account linkage using user-generated geo-location data[C]. Intelligence and Security Informatics, Tucson, USA, 2016: 157–162.

LE Q and MIKOLOV T. Distributed representations of sentences and documents[C]. International Conference on International Conference on Machine Learning, Beijing, China, 2014: II-1188.

殷浩腾, 刘洋. 基于社交属性的时空轨迹语义分析[J]. 中国科学: 信息科学, 2017, 47(8): 1051–1065 doi: 10.1360/N112016-00310

YIN Haoteng and LIU Yang. Semantic analysis of spatial temporal trajectory in LBSNs[J]. Scientia Sinica(Informationis), 2017, 47(8): 1051–1065 doi: 10.1360/N112016-00310

MIKOLOV T, SUTSKEVER I, CHEN K, et al. Distributed representations of words and phrases and their compositionality[C]. International Conference on Neural Information Processing Systems, Daegu, SUKO. 2013: 3111–3119.

BOYD S and VANDENBERGHE L. Convex Optimization[M]. Cambridge: Cambridge University Press, 2004: 466–468.

吴铮, 于洪涛, 刘树新, 等. 基于信息熵的跨社交网络用户身份识别方法[J]. 计算机应用, 2017, 37(8): 2374–2380 doi: 10.11772/j.issn.1001-9081.2017.08.2374

WU Zheng, YU Hongtao, LIU Shuxin, et al. User identification across multiple social networks based on information entropy[J]. Journal of Computer Applications, 2017, 37(8): 2374–2380 doi: 10.11772/j.issn.1001-9081.2017.08.2374

ZHENG Yu, XIE Xing, and MA Weiying. GeoLife: A collaborative social networking service among user, location and trajectory[J]. Bulletin of the Technical Committee on Data Engineering, 2010, 33(2): 32–39.

CHAINTREAU A. COMS 6998: Social Networks[EB/OL]. http://socialnetworksfall14.wikischolars.columbia.edu/, 2014-10/2017-12.

CHEN Zaiben, SHEN Hengtao, ZHOU Xiaofang, et al. Searching trajectories by locations: An efficiency study[C]. International Conference Proceedings, Association for Computing Machinery, Indianapolis, Indiana, USA, 2010: 255–266.

施引文献

资源附件(0)

访问统计