Network Encrypted Traffic Side-channel Analysis on Chinese Search
-
摘要: 搜索引擎中的增量式搜索服务通过发送实时请求为用户更新建议列表。针对搜索加密流量存在的信息泄露,该文提出一种面向中文搜索的侧信道分析方法,利用搜索请求数据包长度增量和时间间隔的可区分性,构建了3阶段的分析模型以实现对用户输入查询的识别。实验结果表明,该方法在4个常用中文搜索引擎中的识别性能均达到理论量化值,对包含1.4×105查询监控集的综合识别准确率达到76%。最后通过评估4种针对性的缓解机制,证明了通过阻断信息泄露来源可有效防御侧信道分析。Abstract: Incremental search services in search engines update the suggestion list for users by sending real-time requests. Focusing on the information leakage of encrypted search traffic, a side-channel analysis method on Chinese search is proposed. Leveraging the distinguishability of packet length increments and time intervals, a three-stage analysis model is constructed to identify user queries. Experimental results show that the performance in four commonly used Chinese search engines achieves the theoretical quantified value. The identification accuracy for the set containing 1.4×105 monitored queries reaches 76%. Finally, four mitigation methods are evaluated to demonstrate that side-channel analysis can be effectively defended by blocking the information leakage sources.
-
Key words:
- Encrypted traffic /
- Side-channel analysis /
- Search engine /
- Information gain
-
表 1 常见单行模式拼音输入法
操作系统 输入法 应用类型 分隔符 Windows 微软拼音 内置 撇号 Windows QQ拼音 安装 撇号 macOS 简体拼音 内置 空格 macOS 搜狗拼音 安装 撇号 macOS 百度拼音 安装 撇号 iOS 简体拼音 内置 空格 表 2 URL参数对数据包长度增量可区分性的影响
协议 参数 参数变化 $d(x \in {\boldsymbol{A} })$ $d(x \in {\boldsymbol{B} })$ HTTP/1.1 无 – 1 4 cp 产生进位 2 5 pwd 末尾增加${\boldsymbol{A} }$ 2 5 pwd 末尾增加${\boldsymbol{B} }$ 5 8 HTTP/2 无 – 0, 1 2, 3 cp 模10大于2 0, 1 2, 3, 4 cp 产生进位 1, 2 3, 4 表 3 击键探测算法
输入:未知数据包长度序列$ \{ {s_1},{s_2}, \cdots ,{s_N}\} $,DFA模型$ M $ 输出:击键数据包状态序列$ {\boldsymbol{Q}} = \{ {q_1},{q_2}, \cdots ,{q_n}\} $ (1) 初始化$ N $个数据包状态序列${ {\boldsymbol{Q} }_i} = \varnothing$; (2) for $ i $ in range($ N $) do (3) DFA起始状态$ {q_i} = {q_0} $; (4) for $ j $ in range($ i - 1 $) do (5) 数据包长度增量$ {d_i} = {s_i} - {s_j} $; (6) if URL中含有pwd或cp参数 then (7) 根据$ {{\boldsymbol{Q}}_j} $上一接收状态$ {q_j} $调整$ {d_i} $; (8) DFA状态转移$ {q_i} = \delta ({q_j},{d_i}) $; (9) if ${q_i} \ne \varnothing$ and $ \left| {{{\boldsymbol{Q}}_i}} \right| < \left| {{{\boldsymbol{Q}}_j}} \right| $ then (10) 更新已接收的最长前缀$ {{\boldsymbol{Q}}_i} = {{\boldsymbol{Q}}_j} $; (11) 以第$ i $个包结尾的最长序列$ {{\boldsymbol{Q}}_i} = {{\boldsymbol{Q}}_i} + {q_i} $; (12) return 最长接收序列$ {\boldsymbol{Q}} \in \{ {{\boldsymbol{Q}}_i}\} $。 表 4 多层匹配算法
输入:数据包状态序列$ {\boldsymbol{Q}} = \{ {q_1},{q_2}, \cdots ,{q_n}\} $,中文查询$ C $ 输出:$ C $是否匹配状态序列$ {\boldsymbol{Q}} $ (1) 将$ C $转换为拼音字符串$ {\boldsymbol{X}} $; (2) if 拼音字母长度$ \left| {\boldsymbol{X}} \right| \ne n $ then (3) return False; (4) if 约简状态序列$ {\boldsymbol{\bar Q'}} \ne {\boldsymbol{\bar Q}} $ then (5) return False; (6) for r in range(8) do (7) if 初始比特余数为$ r $的状态序列${ {\boldsymbol Q}'_r} = {\boldsymbol{Q} }$ then (8) return True; (9) return False. 表 5 击键探测性能结果(%)及对比
探测方法 网站 HTTP F1分数 F1=100 本文方法 百度 1.1 98.63 98.22 Monaco方法 百度 1.1 96.85 86.70 本文方法 搜狗 1.1 96.40 95.17 本文方法 360 1.1 99.51 99.33 本文方法 必应 2 99.72 99.61 Monaco方法 谷歌 2 97.26 81.12 表 6 查询识别准确率结果(%)及对比
分析方法 网站 样本数 监控数 准确率 本文方法 百度 1800 ~1.4×105 64.56 Monaco方法 百度 4000 ∞ 12.85 本文方法 搜狗 1800 ~1.4×105 61.32 本文方法 360 1800 ~1.4×105 65.94 本文方法 必应 1800 ~1.4×105 76.06 Oh等人方法 必应 100 1×104 44.30 Schaub等人方法 谷歌 10 ~6.8×103 34.00 Monaco方法 谷歌 4000 ∞ 15.83 表 7 查询推断消融实验(%)
推断模型 百度 搜狗 360 必应 BRNN 64.56 61.32 65.94 76.06 RNN 64.11 60.39 65.28 75.44 HMM 61.89 58.28 62.61 73.56 MLP 61.28 57.67 61.50 72.17 Equal 57.94 54.78 59.33 69.83 -
[1] 武思齐, 王俊峰. 基于数据流多维特征的移动流量识别方法研究[J]. 四川大学学报:自然科学版, 2020, 57(2): 247–254. doi: 10.3969/j.issn.0490-6756.2020.02.008WU Siqi and WANG Junfeng. Research on mobile traffic identification based on multidimensional characteristics of data flow[J]. Journal of Sichuan University:Natural Science Edition, 2020, 57(2): 247–254. doi: 10.3969/j.issn.0490-6756.2020.02.008 [2] SIRINAM P, MATHEWS N, RAHMAN M S, et al. Triplet fingerprinting: More practical and portable website fingerprinting with N-shot learning[C]. The 2019 ACM SIGSAC Conference on Computer and Communications Security, London, UK, 2019: 1131–1148. [3] GU Jiaxi, WANG Jiliang, YU Zhiwen, et al. Traffic-based side-channel attack in video streaming[J]. IEEE/ACM Transactions on Networking, 2019, 27(3): 972–985. doi: 10.1109/TNET.2019.2906568 [4] WHITE A M, MATTHEWS A R, SNOW K Z, et al. Phonotactic reconstruction of encrypted VoIP conversations: Hookt on fon-iks[C]. 2011 IEEE Symposium on Security and Privacy, Oakland, USA, 2011: 3–18. [5] LI Hong, HE Yunhua, SUN Limin, et al. Side-channel information leakage of encrypted video stream in video surveillance systems[C]. The 35th Annual IEEE International Conference on Computer Communications, San Francisco, USA, 2016: 1–9. [6] SONG D X, WAGNER D, and TIAN Xuqing. Timing analysis of keystrokes and timing attacks on SSH[C]. The 10th conference on USENIX Security Symposium, Washington, USA, 2001: 25. [7] CHEN Shuo, WANG Rui, WANG Xiaofeng, et al. Side-channel leaks in web applications: A reality today, a challenge tomorrow[C]. 2010 IEEE Symposium on Security and Privacy, Oakland, USA, 2010: 191–206. [8] SCHAUB A, SCHNEIDER E, HOLLENDER A, et al. Attacking suggest boxes in web applications over HTTPS using side-channel stochastic algorithms[C]. The International Conference on Risks and Security of Internet and Systems, Trento, Italy, 2014: 116–130. [9] OH S E, LI Shuai, and HOPPER N. Fingerprinting keywords in search queries over tor[J]. Proceedings on Privacy Enhancing Technologies, 2017, 2017(4): 251–270. doi: 10.1515/popets-2017-0048 [10] MONACO J V. What are you searching for? a remote keylogging attack on search engine autocomplete[C]. The 28th USENIX Conference on Security Symposium, Santa Clara, USA, 2019: 959–976. [11] FITTS P M. The information capacity of the human motor system in controlling the amplitude of movement[J]. Journal of Experimental Psychology, 1954, 47(6): 381–391. doi: 10.1037/h0055392 [12] DHAKAL V, FEIT A M, KRISTENSSON P O, et al. Observations on typing from 136 million keystrokes[C]. The 2018 CHI Conference on Human Factors in Computing Systems, Montreal, Canada, 2018: 646. [13] Verizon: IP latency statistics[EB/OL]. https://www.verizon.com/business/terms/latency/, 2021. [14] KILLOURHY K S and AXION R A. Free vs. transcribed text for keystroke-dynamics evaluations[C]. The 2012 Workshop on Learning from Authoritative Security Experiment Results, Arlington, USA, 2012: 1–8. [15] SCHUSTER M and PALIWAL K K. Bidirectional recurrent neural networks[J]. IEEE Transactions on Signal Processing, 1997, 45(11): 2673–2681. doi: 10.1109/78.650093 [16] SCHWARZ M, LIPP M, GRUSS D, et al. KeyDrown: Eliminating software-based keystroke timing side-channel attacks[C]. The 25th Annual Network and Distributed System Security Symposium, San Diego, USA, 2018. [17] BALSA E, TRONCOSO C, and DIAZ C. OB-PWS: Obfuscation-based private web search[C]. 2012 IEEE Symposium on Security and Privacy, San Francisco, USA, 2012: 491–505. [18] ZHENG Li, ZHANG Liren, and XU Dong. Characteristics of network delay and delay jitter and its effect on voice over IP (VoIP)[C]. 2001 IEEE International Conference on Communications, Helsinki, Finland, 2001: 122–126.