(<i>ε</i>, <i>δ</i>)-本地差分隐私模型下的均值估计机制

张跃; 朱友文; 周玉倩; 袁家斌

doi:10.11999/JEIT221047

(ε, δ)-本地差分隐私模型下的均值估计机制

doi: 10.11999/JEIT221047

张跃¹,
朱友文^{1, 2, ,},
周玉倩¹,
袁家斌¹

1.
南京航空航天大学计算机科学与技术学院南京 211106
2.
桂林电子科技大学广西密码与信息安全重点实验室桂林 541004

基金项目: 国家重点研发计划(2021YFB3100400)，国家自然科学基金(62172216)，江苏省自然科学基金(BK20211180)，广西密码学与信息安全重点实验室开放课题(GCIS202107)，江苏省研究生科研与实践创新计划(KYCX19_0203)

详细信息

作者简介:
张跃：女，博士生，研究方向为隐私保护技术

朱友文：男，教授，研究方向为信息安全、数据隐私

周玉倩：女，讲师，研究方向为可信的量子随机数生成、隐私保护

袁家斌：男，教授，研究方向为云计算安全、量子信息安全

通讯作者:
朱友文　zhuyw@nuaa.edu.cn

中图分类号: TN918; TP309
计量
- 文章访问数: 1051
- HTML全文浏览量: 453
- PDF下载量: 174
- 被引次数: 0
出版历程
- 收稿日期: 2022-08-10
- 修回日期: 2022-12-05
- 网络出版日期: 2022-12-08
- 刊出日期: 2023-03-10

Mean Estimation Mechanisms under (ε, δ)-Local Differential Privacy

1.
College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
2.
Guangxi Key Laboratory of Cryptography and Information Security, Guilin University of Electronic Technology, Guilin 541004, China

Funds: The National Key Research and Development Program of China (2021YFB3100400), The National Natural Science Foundation of China (62172216), The Natural Science Foundation of Jiangsu Province (BK20211180), The Research Fund of Guangxi Key Laboratory of Cryptography and Information Security (GCIS202107), The Postgraduate Research & Practice Innovation Program of Jiangsu Province (KYCX19_0203)

摘要

摘要: 相对于ε-本地差分隐私(LDP)机制，(ε, δ)-本地差分隐私模型下的方案具有更小的误差边界和更高的数据效用。然而，当前的(ε, δ)-本地差分隐私均值估计机制仍存在估计误差大、数据效用低等问题。因此，针对均值估计问题，该文提出两种新的(ε, δ)-本地差分隐私均值估计机制：基于区间的均值估计机制(IM)和基于近邻的均值估计机制(NM)。IM的主要思想是：划分扰动后的数据到3个区间，真实数据以较大概率扰动到中间的区间，以较小概率扰动到两边的区间，收集者直接对扰动数据求均值得到无偏估计。NM的主要思想是：把真实数据以较大概率扰动到其邻域，以较小概率扰动到距离较远的值，收集者结合期望最大化算法得到高准确度的估计均值。最后，该文通过理论分析证明了IM和NM均可以满足隐私保护要求，并通过实验证实了IM和NM的数据效用优于现有机制。
- 隐私保护 /
- 本地差分隐私 /
- 数据聚合 /
- 均值估计
Abstract: Compared with the ε-Local Differential Privacy (LDP) mechanism, the scheme under (ε, δ)-local differential privacy has a smaller error bound and higher data utility. However, the current mean estimation mechanisms under (ε, δ)-local differential privacy still have problems such as large estimation error and low data utility. Therefore, for the mean estimation problem, two new mean estimation mechanisms under (ε, δ)-local differential privacy are proposed: the Interval-based Mechanism for mean estimation (IM) and the Neighbor-based Mechanism for mean estimation (NM). IM divides the perturbed data into three intervals. Then the real data is perturbed to the middle interval with a large probability, and the two sides are perturbed with a small probability. Collector averages directly the perturbed data to get an unbiased estimation. NM perturbs the real data to its near neighborhood with a large probability and perturbs it far away with a small probability. Then the collector applies the expectation maximization algorithm to obtain an estimated mean value with high accuracy. Finally, both IM and NM satisfy the privacy protection requirements are proved through theoretical analysis, and the data utility of IM and NM is superior to existing mechanisms is confirmed by experiments.
- Privacy protection /
- Local Differential Privacy (LDP) /
- Data collection /
- Mean estimation

HTML全文

图 1 本地差分隐私模型下均值估计

下载: 全尺寸图片幻灯片

图 2 4种机制估计误差的理论对比

下载: 全尺寸图片幻灯片

图 3 设定ε=0.5, 2.0时基于合成数据集的不同隐私参数下4种机制估计准确度的对比

下载: 全尺寸图片幻灯片

图 4 设定δ=10^–8, 10^–6时基于合成数据集的不同隐私参数下4种机制估计准确度的对比

下载: 全尺寸图片幻灯片

图 5 设定ε=0.5, 2.0时基于真实数据集的不同隐私参数下4种机制估计准确度的对比

下载: 全尺寸图片幻灯片

图 6 设定δ=10^–8, 10^–6时基于真实数据集的不同隐私参数下4种机制估计准确度的对比

下载: 全尺寸图片幻灯片

表 1 现有的满足(ε, δ)-LDP的均值估计机制的误差

	GM^[10]	Opt-GM^[11]	NDM^[12]
MSE	$\dfrac{2}{\varepsilon }\sqrt {2\ln \dfrac{ {1.25} }{\delta } }$	$ \dfrac{{\sqrt 2 }}{\varepsilon }(\xi + \sqrt {{\xi ^2} + \varepsilon } ) $^*	${\left( {\dfrac{ { {{\rm{e}}^\varepsilon } + 1} }{ { {{\rm{e}}^\varepsilon } + 2\delta - 1} } } \right)^2}$
^*：其中$ \xi $由${\rm{erfc} }(\xi ) - {{\rm{e}}^\varepsilon }{\rm{erfc} }(\sqrt { {\xi ^2} + \varepsilon } ) = 2\delta$计算可得，${\rm{erfc}}( \cdot )$是互补误差函数^[13]

下载: 导出CSV

算法1　基于区间的均值估计机制(IM)
输入：用户${u_i}$的数据${x_i} \in [ - 1,1]$，隐私预算$\varepsilon $，松弛因子$\delta $
输出：扰动数据${y_i} \in [ - C,C]$
(1) 初始时，令$q = \dfrac{ { {{\rm{e}}^{\varepsilon/2} } - 1 - 4\delta } }{ {2{{\rm{e}}^{\varepsilon/2} }({{\rm{e}}^{\varepsilon/2} } + 1 + 4\delta )} }$, $p = {{\rm{e}}^\varepsilon }q + \delta$, $a = \dfrac{1}{{4q}} - \sqrt {\dfrac{p}{{2q(q - p)}} + \dfrac{1}{{16{q^2}}}} $, $C = \dfrac{{1 - 2a(q - p)}}{{2p}}$, $b = a - C$, 　　 $l({x_i}) = a{x_i} + b$, $r({x_i}) = a{x_i} - b$；
(2) 对数据${x_i}$做如下扰动　　　　　　　　　　　　　　　　　　　${\rm{pdf} }[{y_i} = t\left\| { {x_i} } \right.] = \left\{ \begin{aligned} & p,\;{\text{ } }t \in \left[ {l({x_i}),r({x_i})} \right] \\ & q,\;{\text{ } }t \in \left[ {\left. { - C,l({x_i})} \right) \cup \left( {r({x_i}),C} \right.} \right] \end{aligned} \right.$　　　　　　　　　　　　　　　　　(2)
(3) 将扰动后的$ {y_i} $发送给数据收集者。

下载: 导出CSV

算法2　基于近邻的均值估计机制(NM)
输入：用户${u_i}$的数据${x_i} \in [ - 1,1]$，隐私预算$\varepsilon $，松弛因子$\delta $　输出：扰动数据${y_i} \in [ - b,b + 1]$
(1) 初始时，令$q = \dfrac{ {1 - 2b\delta } }{ {1 + 2b{{\rm{e}}^\varepsilon } } }$, $p = \dfrac{ { {{\rm{e}}^\varepsilon } + \delta } }{ {1 + 2b{{\rm{e}}^\varepsilon } } }$, $b = \dfrac{ { {{\rm{e}}^\varepsilon } - 1 - \varepsilon ({{\rm{e}}^\varepsilon } + \delta )} }{ {2({{\rm{e}}^\varepsilon }(1 - {{\rm{e}}^\varepsilon }) + \varepsilon ({{\rm{e}}^\varepsilon } + \delta ))} }$；
(2) 格式化数据$ {x'_i} = {{({x_i} + 1)} \mathord{\left/ {\vphantom {{({x_i} + 1)} 2}} \right. } 2} $；
(3) 对数据${x_i}$做如式(4)的扰动　　　　　　　　　　　　　　　　　　　　　　　${\rm{pdf} }\left( { {y_i}\left\| { { x'_i} } \right.} \right) = \left\{ \begin{gathered} p,{\text{ } }\left\| { { x'_i} - {y_i} } \right\| \le b \\ q,{\text{ } }\left\| { { x'_i} - {y_i} } \right\| > b \\ \end{gathered} \right.$　　　　　　　　　　　　　　　　　　　(4)
(4) 将扰动后的$ {y_i} $发送给数据收集者。

下载: 导出CSV

算法3　NM机制的后处理过程
输入：扰动数据$ y = \{ {y_1},{y_2}, \cdots ,{y_n}\} $
输出：估计均值$\tilde \mu $
(1) ${d_x} = {d_y} = {2^{\left\lfloor {{{\log }_2}\sqrt n } \right\rfloor }}$, $c = 0$；
(2) 对NM的格式化输入区间$[0,1]$分组为$ {G^x} = \{ G_1^x,G_2^x, \cdots ,G_{{d_x}}^x\} $，输出区间$[ - b,b + 1]$分组为${G^y} = \{ G_1^y,G_2^y, \cdots ,G_{{d_y}}^y\} $；
(3) 初始化原始数据在组$G_i^x$中的估计概率为$ {\tilde f^0} = \left\{ {{{\tilde f}^0}_1,{{\tilde f}^0}_2, \cdots ,{{\tilde f}^0}_{{d_x}}} \right\} = \left\{ {{1 \mathord{\left/ {\vphantom {1 {{d_x}}}} \right. } {{d_x}}},{1 \mathord{\left/ {\vphantom {1 {{d_x}}}} \right. } {{d_x}}}, \cdots ,{1 \mathord{\left/ {\vphantom {1 {{d_x}}}} \right. } {{d_x}}}} \right\} $；
(4) 构建扰动矩阵${\boldsymbol{M}}$：${M_{ji} } = {\Pr _{{\rm{NM}}} }[G_j^y\left\| {G_i^x} \right.]{\text{ } }(1 \le i \le {d_x},1 \le j \le {d_y})$；
(5) while $c < 10000$ and $\left\| {L({{\tilde f}^c}) - L({{\tilde f}^{c - 1}})} \right\| > {10^{ - 3}}$：
$c = c + 1$；　　E步骤：$ {P_i} = {\tilde f_i}^{c - 1}\displaystyle\sum\nolimits_{j = 1}^{{d_y}} {{n_j}\dfrac{{\Pr [y \in G_j^y\left\| {x \in G_i^x,{{\tilde f}^{c - 1}}} \right.]}}{{\Pr [y \in G_j^y\left\| {{{\tilde f}^{c - 1}}} \right.]}}} = {\tilde f_i}^{c - 1}\displaystyle\sum\nolimits_{j = 1}^{{d_y}} {{n_j}\dfrac{{{M_{ji}}}}{{\displaystyle\sum\nolimits_{k = 1}^{{d_x}} {{M_{jk}}{{\tilde f}^{c - 1}}_k} }}} {\text{ }}(1 \le i \le {d_x}) $；
M步骤：$ {\tilde f^c}_i = {{{P_i}} \mathord{\left/ {\vphantom {{{P_i}} {\displaystyle\sum\nolimits_{k = 1}^{{d_x}} {{P_k}} }}} \right. } {\displaystyle\sum\nolimits_{k = 1}^{{d_x}} {{P_k}} }}{\text{ }}(1 \le i \le {d_x}) $；　　平滑步骤：${\tilde f^c}_i = \left\{ \begin{array}{lllllll} \dfrac{2}{3}{ {\tilde f}^c}_i + \dfrac{1}{3}{ {\tilde f}^c}_{i + 1}{\text{, } }& i = 1 \\ \dfrac{1}{2}{ {\tilde f}^c}_i + \dfrac{1}{4}({ {\tilde f}^c}_{i - 1} + { {\tilde f}^c}_{i + 1}){\text{, } }& 1 < i < {d_x} \\ \dfrac{1}{3}{ {\tilde f}^c}_{i - 1} + \dfrac{2}{2}{ {\tilde f}^c}_i{\text{, } }& i = {d_x} \\ \end{array} \right.$；
end while
(6) $\tilde \mu = \displaystyle\sum\nolimits_{i = 1}^{{d_x}} {{{\tilde f}^c}_i({{(1 + 2i)} \mathord{\left/ {\vphantom {{(1 + 2i)} {{d_x}}}} \right. } {{d_x}}} - 1)} $；
(7) 返回$\tilde \mu $。

下载: 导出CSV

参考文献(15)

[1]	HITAJ B, ATENIESE G, and PEREZ-CRUZ F. Deep models under the GAN: Information leakage from collaborative deep learning[C]. The ACM SIGSAC Conference on Computer and Communications Security, Dallas Texas, USA, 2017: 603–618.
[2]	钱文君, 沈晴霓, 吴鹏飞, 等. 大数据计算环境下的隐私保护技术研究进展[J]. 计算机学报, 2022, 45(4): 669–701. doi: 10.11897/SP.J.1016.2022.00669 QIAN Wenjun, SHEN Qingni, WU Pengfei, et al. Research progress on privacy-preserving techniques in big data computing environment[J]. Chinese Journal of Computers, 2022, 45(4): 669–701. doi: 10.11897/SP.J.1016.2022.00669
[3]	KASIVISWANATHAN S P, LEE H K, NISSIM K, et al. What can we learn privately?[C]. The 49th Annual IEEE Symposium on Foundations of Computer Science, Philadelphia, USA, 2008: 531–540.
[4]	KASIVISWANATHAN S P, LEE H K, NISSIM K, et al. What can we learn privately?[J]. SIAM Journal on Computing, 2011, 40(3): 793–826. doi: 10.1137/090756090
[5]	冯登国, 张敏, 叶宇桐. 基于差分隐私模型的位置轨迹发布技术研究[J]. 电子与信息学报, 2020, 42(1): 74–88. doi: 10.11999/JEIT190632 FENG Dengguo, ZHANG Min, and YE Yutong. Research on differentially private trajectory data publishing[J]. Journal of Electronics &Information Technology, 2020, 42(1): 74–88. doi: 10.11999/JEIT190632
[6]	XUE Qiao, ZHU Youwen, WANG Jian, et al. Locally differentially private distributed algorithms for set intersection and union[J]. Science China Information Sciences, 2021, 64: 219101. doi: 10.1007/s11432-018-9899-8
[7]	WANG Ning, XIAO Xiaokui, YANG Yin, et al. Collecting and analyzing multidimensional data with local differential privacy[C]. 2019 IEEE 35th International Conference on Data Engineering, Macao, China, 2019: 638–649.
[8]	LI Zitao, WANG Tianhao, LOPUHAÄ-ZWAKENBERG M, et al. Estimating numerical distributions under local differential privacy[C]. The 2020 ACM SIGMOD International Conference on Management of Data, Portland, USA, 2020: 621–635.
[9]	BASSILY R. Linear queries estimation with local differential privacy[C]. The 22nd International Conference on Artificial Intelligence and Statistics, Naha, Japan, 2019: 721–729.
[10]	DWORK C and ROTH A. The algorithmic foundations of differential privacy[J]. Foundations and Trends® in Theoretical Computer Science, 2014, 9(3/4): 211–407. doi: 10.1561/0400000042
[11]	BALLE B and WANG Yuxiang. Improving the Gaussian mechanism for differential privacy: Analytical calibration and optimal denoising[C]. The 35th International Conference on Machine Learning, Stockholm, Sweden, 2018: 403–412.
[12]	WANG Teng, ZHAO Jun, HU Zhi, et al. Local differential privacy for data collection and analysis[J]. Neurocomputing, 2021, 426: 114–133. doi: 10.1016/j.neucom.2020.09.073
[13]	TELLAMBURA C and ANNAMALAI A. Efficient computation of erfc(x) for large arguments[J]. IEEE Transactions on Communications, 2000, 48(4): 529–532. doi: 10.1109/26.843116
[14]	YEH I C and LIEN C H. The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients[J]. Expert Systems with Applications, 2009, 36(2): 2473–2480. doi: 10.1016/j.eswa.2007.12.020
[15]	Minnesota Population Center. Integrated public use microdata series-international: version 7.3 [dataset][EB/OL]. https://doi.org/10.18128/D020.V7.3.