User Identification Across Social Networks Based on User Trajectory
-
摘要: 针对现有的基于用户轨迹的跨社交网络用户身份识别算法未考虑用户轨迹中的位置访问顺序特征的缺点,该文提出一种基于Paragraph2vec的跨社交网络用户轨迹匹配算法(CDTraj2vec)。首先将用户轨迹转化为易于处理的网格化表示,并按照一定的时间粒度、距离尺度对原始的用户轨迹进行划分,使用户轨迹中的位置访问顺序特征易于抽取;然后利用Paragraph2vec算法中PV-DM模型抽取轨迹序列中位置访问顺序特征,得到用户轨迹的向量表示。最后通过用户轨迹向量判定轨迹是否匹配。在社交网络BrightKite上的实验结果表明,与基于位置访问频率或者基于轨迹间距离的方法相比,F值提高了2%~4%个百分点,所提算法能够有效地抽取出用户轨迹中的位置访问顺序特征,更加准确地实现了基于用户轨迹的跨社交网络用户身份识别。
-
关键词:
- 社交网络 /
- 用户身份识别 /
- 轨迹相似度 /
- Paragraph2vec
Abstract: The performance of trajectory based user identification is poor since the existing methods ignore the order feature of location sequence. To solve this problem, a Cross Domain Trajectory matching algorithm based on Paragraph2vec (CDTraj2vec) is proposed. Firstly, the user trajectory is transformed to the grid representation which is easy to handle. The PV-DM model in the Paragraph2vec algorithm is utilized for extracting order feature of location sequence in trajectory. Then the original user trajectories are divided by a certain time size and distance scale to construct a training sample suitable for training PV-DM model. The PV-DM model is trained by different types of training samples, and the vector representation of the user trajectories is obtained. Finally, the matching of the trajectory is determined by the user trajectory vector. Experimental results on BrightKite shows that the F-measure is improved by 2%~4% compared with the existing frequency based and distance based algorithm. The proposed algorithm can effectively extract the order feature of location sequence, and realize the trajectory based user identification across social networks.-
Key words:
- Social networks /
- User identification /
- Trajectory similarity /
- Paragraph2vec
-
表 1 CDTranj2vec伪代码
算法1 基于Paragraph2vec的用户轨迹匹配算法
CDTraj2vec $\left( {{D_A},{D_B},w,d} \right)$输入:时空数据集 ${D_A},{D_B}$,窗口大小 $w$,用户轨迹向量的维数 $d$ 输出:匹配结果 (1)将原始时空数据集 ${D_A},{D_B}$转化为小网格表示,转化后的用户轨 迹为 $D_A^c,D_B^c$ (2) $D_{AB}^c = D_A^c + D_B^c$ (3)随机初始化 ${{{Φ}}_{\rm{t}}},{{{Φ}}_{\rm{d}}},{{{Φ}}_{{\rm{st}}}} \in {\mathbb{R}^{\left( {\left| {{D_A}} \right| + \left| {{D_B}} \right|} \right) \times d}}$ (4)按照2.2.2节中的3种训练样本构建方法,由 $D_{AB}^c$构建出3种训 练样本集合 ${\rm{S}}{{\rm{p}}_{\rm{t}}}$, ${\rm{S}}{{\rm{p}}_{\rm{d}}}$, ${\rm{S}}{{\rm{p}}_{{\rm{st}}}}$ (5) for each i in [‘t’, ‘d’, ‘st’] do: (6) for j = 0 to $\left| {{\rm{S}}{{\rm{p}}_{{{ij}}}}} \right|$ do: // ${\rm{S}}{{\rm{p}}_{{{ij}}}} = \left\{ {{‘\rm{ts}’}:\left[ {{c_{{t_1}}},{c_{{t_2}}}, ·\!·\!· ,{c_{{t_{\left| {{{\rm Sp}_{ij}}} \right|}}}}} \right],‘{\rm{ui}}’:{\rm{id}}\left( T \,\right)} \right\}$ (7) PV-DM $\left( {{{Φ} _i},{\rm{Sp}}_{ij},w} \right)$ (8) end for (9) end for (10) for i = 0 to $\left| {{D_A}} \right| + \left| {{D_B}} \right|$ do: (11) ${{Φ}}\left( {i,:} \right) = {\rm{concatenate}}\left( {{{Φ}_{\rm t}}\left( {i,:} \right),{{Φ}_{\rm d}}\left( {i,:} \right),{{Φ}_{\rm st}}\left( {i,:} \right)} \right)$ (12) end for (13) ${\rm{rs}} = {\rm{matching\_strategy}}\left({Φ} \right)$ (14) return ${\rm{rs}}$ 表 2 PV-DM (
$ {{ {Φ}_i},{{Sp}}_{ij},w} $ )算法算法2 PV-DM $\left( {{{Φ} _i},{\rm{Sp}}_{ij},w} \right)$ 输入:按照方法 $i$生成的训练样本在训练模型的过程中待更新的 用户轨迹向量 ${{Φ}_{{i}}}$,按照方法 $i$生成的训练样本中的第 $j$个样 本 ${\rm{Sp}}_{ij}$,窗口大小 $w$ 输出:无 (1)for $k = w$ to $\left| {{\rm{Sp}}_{ij}} \right|$ do:
(2) ${S} = {{Φ}_i}\left( {{\rm{Sp}}_{ij}.{\rm{ui}}} \right) + \sum\limits_{l = k - w + 1}^{w - 1} {{vc}\left( {{c_{{t_l}}}} \right)} $
(3) $J = - \lg \left( {\left. {{c_{{t_k}}}} \right|{S}} \right)$
(4) for $l = k - w + 1$ to $w - 1$ do:
(5) ${vc}\left( {{c_{{t_l}}}} \right) = {vc}\left( {{c_{{t_l}}}} \right) - \alpha \times \frac{{\partial J}}{{\partial {S}}}$
(6) end for
(7) ${{Φ} _i}\left( {{\rm{Sp}}_{ij}.{\rm{ui}}} \right) = {{Φ}_i}\left( {{\rm{Sp}}_{ij}.{\rm{ui}}} \right) - \alpha \times \frac{{\partial J}}{{\partial {S}}}$(8)end for 表 3 数据集信息
数据来源 Geolife BrightKite 划分后的数据集 DA DB DC DE 用户轨迹数 182 182 2971 2974 坐标点个数 16128 16802 332326 328031 不同的小网格个数 3140 3517 73255 72098 网格范围 纬度:39°~41°
经度:115°~117°纬度:39°~41°
经度:115°~117°纬度:25°~49°
经度:–130°~–70°纬度:25°~49°
经度:–130°~–70°时间范围 2007-04~2012-08 2007-04~2012-08 2008-03~2010-10 2008-03~2010-10 表 4 文中所提参数的默认值
变量 值 说明 ${\rm{sl}}$ 100 m 小网格的边长 ${\rm{la}}{{\rm{t}}_{\rm{1}}}{\rm{,la}}{{\rm{t}}_{2}}$ 29°N, 49°N 小网格的纬度范围 ${\rm{lo}}{{\rm{n}}_{\rm{1}}}{\rm{,lo}}{{\rm{n}}_{2}}$ 70°W, 30°W 小网格的经度范围 r 53374 小网格行数 c 120932, 87541 最低纬度(29°N)和最高纬度(49°N)
上的小网格列数Δt 4 h(Geolife),
3 h(BrightKite)按时间划分轨迹时设定的时间阈值 Δd 1 km 按距离划分轨迹时设定的距离阈值 表 5 Geolife上的实验结果(Δt = 4 h, Δd = 1 km)
算法 准确率(pc) 召回率(rc) F值 k-BCT 0.9080 0.9530 0.9299 TF-IDF 0.9651 0.9185 0.9412 CDTraj2vec-sf 0.9241 0.8462 0.8834 CDTraj2vec-wcet 0.9151 0.8694 0.8917 CDTraj2vec
(使用训练样本A, B, C)0.9888 0.9899 0.9893 表 6 BrightKite上的实验结果(Δt = 3 h, Δd = 1 km)
算法 准确率(pc) 召回率(rc) F值 k-BCT 0.8802 0.9238 0.9015 TF-IDF 0.8741 0.9164 0.8947 CDTraj2vec-sf 0.8841 0.8457 0.8647 CDTraj2vec-wcet 0.8645 0.8108 0.8367 CDTraj2vec
(使用训练样本A, B, C)0.9465 0.9245 0.9354 CDTraj2vec
(只使用训练样本A, B)0.9440 0.9281 0.9360 -
桑基韬, 路冬媛, 徐常胜. 基于共同用户的跨网络分析: 社交媒体大数据中的多源问题[J]. 科学通报, 2014, 59(36): 3554–3560 doi: 10.1360/n972014-00292SANG Jitao, LU Dongyuan, and XU Changsheng. Overlapped user-based cross-network analysis: Exploring variety in big social media data[J]. Chinese Science Bulletin, 2014, 59(36): 3554–3560 doi: 10.1360/n972014-00292 GONZÁLEZ M C, HIDALGO C A, and BARABÁSI A L. Understanding individual human mobility patterns[J]. Nature, 2008, 453(7196): 779–782 doi: 10.1038/nature06958 CAO Wei, WU Zhengwei, WANG Dong, et al. Automatic user identification method across heterogeneous mobility data sources[C]. IEEE International Conference on Data Engineering, Helsinki, Finland, 2016: 978–989. HAO Tianyi, ZHOU Jingbo, CHENG Yunsheng, et al. User identification in cyber-physical space: A case study on mobile query logs and trajectories[C]. GIS′16 Proceedings of the 24th ACM International Conference on Advances in Geographic Information Systems, California, USA, 2016: 1–4. HAN Xiaohui, WANG Lianhai, XU Shujiang, et al. Linking social network accounts by modeling user spatiotemporal habits[C]. IEEE International Conference on Intelligence and Security Informatics, Beijing, China, 2017: 19–24. RIEDERER C, KIM Y, CHAINTREAU A, et al. Linking users across domains with location data: Theory and validation[C]. WWW′16 Proceedings of the 25th International Conference on World Wide Web. Montréal, Canada, 2016: 707–719. HAN Xiaohui, WANG Lianhai, XU Lijuan, et al. Social Media account linkage using user-generated geo-location data[C]. Intelligence and Security Informatics, Tucson, USA, 2016: 157–162. LE Q and MIKOLOV T. Distributed representations of sentences and documents[C]. International Conference on International Conference on Machine Learning, Beijing, China, 2014: II-1188. 殷浩腾, 刘洋. 基于社交属性的时空轨迹语义分析[J]. 中国科学: 信息科学, 2017, 47(8): 1051–1065 doi: 10.1360/N112016-00310YIN Haoteng and LIU Yang. Semantic analysis of spatial temporal trajectory in LBSNs[J]. Scientia Sinica(Informationis) , 2017, 47(8): 1051–1065 doi: 10.1360/N112016-00310 MIKOLOV T, SUTSKEVER I, CHEN K, et al. Distributed representations of words and phrases and their compositionality[C]. International Conference on Neural Information Processing Systems, Daegu, SUKO. 2013: 3111–3119. BOYD S and VANDENBERGHE L. Convex Optimization[M]. Cambridge: Cambridge University Press, 2004: 466–468. 吴铮, 于洪涛, 刘树新, 等. 基于信息熵的跨社交网络用户身份识别方法[J]. 计算机应用, 2017, 37(8): 2374–2380 doi: 10.11772/j.issn.1001-9081.2017.08.2374WU Zheng, YU Hongtao, LIU Shuxin, et al. User identification across multiple social networks based on information entropy[J]. Journal of Computer Applications, 2017, 37(8): 2374–2380 doi: 10.11772/j.issn.1001-9081.2017.08.2374 ZHENG Yu, XIE Xing, and MA Weiying. GeoLife: A collaborative social networking service among user, location and trajectory[J]. Bulletin of the Technical Committee on Data Engineering, 2010, 33(2): 32–39. CHAINTREAU A. COMS 6998: Social Networks[EB/OL]. http://socialnetworksfall14.wikischolars.columbia.edu/, 2014-10/2017-12. CHEN Zaiben, SHEN Hengtao, ZHOU Xiaofang, et al. Searching trajectories by locations: An efficiency study[C]. International Conference Proceedings, Association for Computing Machinery, Indianapolis, Indiana, USA, 2010: 255–266.