The privacy loss random variable

Part of a series on differential privacy. You might want to start with the previous articles below!

  1. Why differential privacy is awesome presents a non-technical explanation of the definition.
  2. Differential privacy in (a bit) more detail introduces the formal definition, with very little math.
  3. Differential privacy in practice (easy version) explains how to make simple statistics differentially private.
  4. Almost differential privacy describes how to publish private histograms without knowing the categories in advance.
  5. Local vs. global differential privacy presents the two main models of differential privacy, depending on the context.
  6. The privacy loss random variable (this article) explains the real meaning of (ε,δ)-differential privacy.

Remember the notion of « almost » differential privacy? We changed the original definition to add a new parameter, δ. We said that δ was « the probability that something goes wrong ». This was a bit of a shortcut: this nice and easy intuition is sometimes not exactly accurate. In this post, I'll do two things. I'll introduce a crucial concept in differential privacy: the « privacy loss random variable ». Then, I'll use it to explain what δ really means.

Friendly heads-up: this post has slightly more math than the rest of this series. But don't worry! I made it as nice and visual as I could, with graphs instead of equations. All the equations are in a proof hidden by default.

The privacy loss random variable

Recall the setting of the definition of ε-DP (short for differential privacy). The attacker tries to distinguish between two databases D₁ and D₂, that differ by only one record. If a mechanism A is ε-DP, then A(D₁) and A(D₂) will return output O with similar probability:

P[A(D₁) = O] ≤ e^ε · P[A(D₂) = O]

The inequality also goes in the other direction, but the relation between D₁ and D₂ is symmetrical, so to simplify, we only use this one inequality.
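
As a refresher, here is a minimal sketch of a mechanism satisfying this guarantee: the classic Laplace mechanism applied to a counting query. The function name and use of numpy are my choices; the scale 1/ε assumes the count has sensitivity 1.

```python
import numpy as np

def laplace_count(true_count: float, epsilon: float) -> float:
    """Release a count with ε-DP by adding Laplace noise of scale 1/ε.

    This assumes a sensitivity-1 query: adding or removing one record
    changes the true count by at most 1.
    """
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# Example: release a count of 1000 people with ε = ln(3)
print(laplace_count(1000, np.log(3)))
```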

We said before that the ε in ε-DP was the maximal knowledge gain of the attacker. We defined this knowledge gain in Bayesian terms, where the attacker is trying to guess whether the real database D is D₁ or D₂. We saw that ε bounds the evolution of betting odds. For each O, we had:

P[D = D₁ | A(D) = O] / P[D = D₂ | A(D) = O] ≤ e^ε · P[D = D₁] / P[D = D₂]

What if we don't just want to bound this quantity, but calculate it for a given output O? Let us define:

L_{D₁,D₂}(O) = ln( (P[D = D₁ | A(D) = O] / P[D = D₂ | A(D) = O]) / (P[D = D₁] / P[D = D₂]) )

This formula looks scary, but the intuition behind it is pretty simple. The denominator corresponds to the initial betting odds for D₁ vs. D₂: how likely one option is compared to the other, before looking at the result of the mechanism. In Bayesian terms, this is called the "prior". Meanwhile, the numerator of the fraction is the betting odds afterwards: the "posterior". Differential privacy guarantees that L_{D₁,D₂}(O) ≤ ε for all O.
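
For instance, with ε = ln(3) and an attacker who starts out undecided (prior odds 1:1), the guarantee says the posterior odds can never exceed 3:1, whatever output they observe:

P[D = D₁ | A(D) = O] / P[D = D₂ | A(D) = O] ≤ e^(ln 3) · (1/2) / (1/2) = 3.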

Bayes' rule allows us to reformulate this quantity:

L_{D₁,D₂}(O) = ln( P[A(D₁) = O] / P[A(D₂) = O] )

This is called the privacy loss random variable (PLRV for short). Intuitively, the PLRV is the « actual ε value » for a specific output O. Why is it a random variable? Because we typically consider L_{D₁,D₂}(O) when O varies according to A(D₁), with D₁ assumed to be the "real" database.
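
For completeness, here is where this reformulation comes from. Applying Bayes' rule to both the numerator and the denominator of the posterior odds, the P[A(D) = O] terms cancel:

P[D = D₁ | A(D) = O] / P[D = D₂ | A(D) = O] = (P[A(D₁) = O] / P[A(D₂) = O]) · (P[D = D₁] / P[D = D₂])

Dividing both sides by the prior odds P[D = D₁] / P[D = D₂] leaves exactly the likelihood ratio, and taking the logarithm gives the formula above.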

OK, this is very abstract. We need a concrete example.

A concrete example

Suppose that we're counting the number of people with blue eyes in the dataset. We make this differentially private by adding Laplace noise of scale 1/ln(3), to get ε = ln(3). The attacker hesitates between two possible datasets: one with 1000 blue-eyed people, the other with 1001. The real number is 1000, but the attacker doesn't know that. The two distributions look like this:

 

Graph showing two Laplace distributions with scale 1/ln(3), centered on 1000 and 1001

 

Let's consider three possible outputs of the mechanism, given the "real" database is D₁. We represent them below as O₁, O₂, and O₃.

Graph showing the two distributions again, with three possible outputs O₁, O₂, and O₃ marked

Say the attacker is very uncertain: initially, they give equal probabilities to D₁ and D₂. What are they going to think once we give them the output of the mechanism?

  • If we return O₁, the attacker is starting to suspect that the real database is D₁. There's a larger chance to get that output if D = D₁ than if D = D₂. How much larger? Exactly 3 times larger: the attacker's knowledge is tripled.
  • If we return O₂, the attacker is like: ¯\_(ツ)_/¯. This is not giving them much information. This output could have come from D₁, but it could just as well have come from D₂. The attacker's knowledge doesn't change.
  • If we return O₃, the attacker is getting tricked with wrong information. They will think it's more likely that the real database is D₂. Their "knowledge" is divided by 3. (The sketch after this list verifies these three numbers.)
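
Here is a quick numerical check of these three cases. It's a minimal sketch: the specific output values standing in for O₁, O₂, and O₃ are my own choices.

```python
import math

def laplace_pdf(x: float, mu: float, b: float) -> float:
    """Density of the Laplace distribution with mean mu and scale b."""
    return math.exp(-abs(x - mu) / b) / (2 * b)

b = 1 / math.log(3)  # scale chosen so that ε = ln(3)
for output in [999.0, 1000.5, 1002.0]:  # stand-ins for O1, O2, O3
    e_L = laplace_pdf(output, 1000, b) / laplace_pdf(output, 1001, b)
    print(f"output {output}: e^L = {e_L:.3f}")
# Prints e^L = 3 (knowledge tripled), 1 (unchanged), and 1/3 (divided by 3).
```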

Let's look at all possible events O = A(D₁), and order them. We'll put the ones that help the attacker most first, and look at the value of L_{D₁,D₂}(O). Let's call this L, for short, and plot it.

Graph showing the value of L for every possible event, ordered from most to least helpful to the attacker

This is why Laplace noise is so nice: look at this neat horizontal line. Oh my god. It even has a straight diagonal. It never goes above ε ≈ 1.1: a beautiful visual proof that Laplace noise gives ε-DP.

Let's change the graph above to more accurately represent that L is a random variable. On the x-axis, we represent all events according to their probability. We're also more interested in exp(L), so let's plot that instead of L.

Graph showing e^L for every event, with the x-axis scaled so that each event's width is its probability
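
If you want to reproduce this kind of graph yourself, here is one way to do it, as a sketch: sample many outputs from A(D₁), compute e^L for each, and sort them. The sample size and use of numpy are my choices.

```python
import numpy as np

b = 1 / np.log(3)
outputs = np.random.laplace(loc=1000, scale=b, size=100_000)  # O ~ A(D1)
# e^L is the ratio of the two Laplace densities at each sampled output:
e_L = np.exp((np.abs(outputs - 1001) - np.abs(outputs - 1000)) / b)
e_L = np.sort(e_L)[::-1]  # most helpful events for the attacker first
print(e_L.max(), e_L.min())  # ≈ 3 and ≈ 1/3: e^L never exceeds e^ε
# Plotting e_L against np.linspace(0, 1, e_L.size) gives the graph above.
```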

Now, what if you were using some other type of noise? Say, from a normal distribution? It would make data analysts happier: Laplace noise is weird to them; it never shows up in the real world. Normal distributions, by contrast, are familiar and friendly. A lot of natural data distributions can be modeled with them.

In the context of differential privacy, the normal distribution is called « Gaussian noise ». Let's try to add Gaussian noise, of variance σ² = 3:

 

Graph showing two normal distributions with variance 3, centered on 1000 and 1001

 

OK, looks reasonable, now let's see what e^L looks like:

Graph showing e^L for Gaussian noise, shooting up to infinity on the left side

Ew. Look at this line going up to infinity on the left side. Gross. We can't just draw a line at e^ε and say "everything is underneath". What do we do, then? We cheat, and use a δ.
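
Before we do, it's worth seeing where the blow-up comes from: for Gaussian noise, the PLRV can be computed explicitly. With the two distributions centered on 1000 and 1001, both of variance σ², the quadratic terms cancel:

L(O) = ln( e^(−(O−1000)²/2σ²) / e^(−(O−1001)²/2σ²) ) = ((O−1001)² − (O−1000)²) / 2σ² = (2001 − 2O) / 2σ²

This is linear in O: unlike with Laplace noise, it is not capped at ε, and it grows without bound as O gets smaller. With σ² = 3, it gives L(O) = (2001 − 2O)/6.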

δ and the PLRV

In a previous article, we said that the δ in (ε,δ)-DP is the probability that something terrible happens. What does that mean in the context of Gaussian noise? First, we pick an arbitrary ε, say, ε = ln(3). Then, we look at how likely it is for e^L to be above the e^ε = 3 line. It's easy to do: the x-axis is the probability space, so we can simply measure the width of the bad events.

Graph showing the e^L curve with a horizontal line at e^ε = 3, and the width of the bad events above the line highlighted
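
Here is a minimal sketch of that measurement, using the explicit formula for L derived above; the use of Python's statistics module is my choice.

```python
import math
from statistics import NormalDist

sigma2 = 3.0          # variance of the Gaussian noise
epsilon = math.log(3)
# L(O) = (2001 - 2*O) / (2*sigma2), so L > ε exactly when
# O < 1000.5 - sigma2 * ε. Measure that event under O ~ N(1000, sigma2).
cutoff = 1000.5 - sigma2 * epsilon
delta_1 = NormalDist(mu=1000, sigma=math.sqrt(sigma2)).cdf(cutoff)
print(delta_1)  # ≈ 0.05: the probability mass of all the bad events
```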

This simple intuition is correct: this mechanism is (ln(3), δ₁)-DP, with δ₁ ≈ 0.054. But it misses an important subtlety. Let's zoom in on the part where things go wrong, and consider two possible outputs.

Graph zooming in on the bad events, with two possible outputs O₁ and O₂ marked

Returning O₁ is not great: e^L > e^ε. But it's not terrible: the privacy loss is only a tiny bit larger than we'd hope. Returning O₂, however, is scary news: e^L is huge. Intuitively, O₂ leaks much more information than O₁.

With our way of quantifying δ, we don't account for this. We only measure the x-axis. What we count is whether e^L is above the line, not how much it's above the line. For each bad event of probability p, we're adding p × 1 to the δ. A finer approach is to weigh the bad events by "how bad they are". We want to give a weight of ≈ 1 to the very bad events, and a weight of ≈ 0 to the "not too bad" ones.

To do this, we transform the curve above a bit, in two steps. First, we take the inverse of the curve: very bad events are now close to 0 instead of very large. Second, we normalize the curve by taking the ratio e^ε/e^L. This way, events that are "not too bad" are close to 1.

Graph showing the e^ε/e^L curve, with the area between the curve and the y = 1 line highlighted

This allows us to consider the area between the curve and the y = 1 line. When L is very large, the inverse is close to 0, so the distance to 1 is almost 1. And when L is close to ε, the ratio is close to 1, so the distance is almost 0. Very bad events count more than sort-of-bad events.

This is the tighter, exact characterization of δ. In (ε,δ)-DP, the δ is the area highlighted above. It is the mass of all possible bad events, weighted by how likely they are and how bad they are. This tells us that the mechanism is (ln(3), δ₂)-DP with δ₂ ≈ 0.011, a much better characterization than before.
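
In formulas, this weighted mass is ∫ max(0, p₁(x) − e^ε · p₂(x)) dx, where p₁ and p₂ are the two output densities. Here is a sketch that estimates it with a simple Riemann sum; the grid choice is mine.

```python
import numpy as np

sigma = np.sqrt(3.0)   # same Gaussian noise as before
epsilon = np.log(3)

def gauss_pdf(x, mu):
    return np.exp(-(x - mu) ** 2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

# The area between the e^ε/e^L curve and y = 1, weighted by probability,
# equals the integral of max(0, p1 - e^ε * p2) over all outputs.
xs = np.linspace(960, 1040, 800_001)
gap = np.maximum(0.0, gauss_pdf(xs, 1000) - np.exp(epsilon) * gauss_pdf(xs, 1001))
delta_2 = gap.sum() * (xs[1] - xs[0])
print(delta_2)  # ≈ 0.011, smaller than the δ₁ we measured before
```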

The typical definition of (ε,δ)-DP doesn't use this complicated formulation. A mechanism A is (ε,δ)-DP if for any neighboring D₁ and D₂, and any set S of possible outputs:

P[A(D₁) ∈ S] ≤ e^ε · P[A(D₂) ∈ S] + δ.

This definition is equivalent to the previous characterization. If you want to see the proof of that, click here: Show me the proof

What about infinite values?

Using Gaussian noise, all possible values of L are finite. But for some mechanisms A, there are outputs O such that P[A(D₁) = O] > 0, but P[A(D₂) = O] = 0. In that case, L(O) = ∞. This kind of output is called a distinguishing event. If we return a distinguishing event, the attacker immediately finds out that D is D₁ and not D₂. This is the case for the "thresholding" example we looked at previously.
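
As a sketch of how such events arise, here is a toy version of that thresholding idea; all names, the noise scale, and the threshold are my own choices. A category that appears only in D₁ can show up in the output with nonzero probability, but never when the input is D₂.

```python
import numpy as np

def noisy_histogram(counts: dict, epsilon: float, threshold: float) -> dict:
    """Release noisy counts, keeping only categories above a threshold.

    Categories absent from the data are never considered at all, so
    they can never appear in the output.
    """
    released = {}
    for category, count in counts.items():
        noisy = count + np.random.laplace(scale=1.0 / epsilon)
        if noisy >= threshold:
            released[category] = noisy
    return released

d1 = {"blue eyes": 1000, "rare disease": 1}  # the extra record adds a category
d2 = {"blue eyes": 1000}
print(noisy_histogram(d1, epsilon=np.log(3), threshold=3.0))
# "rare disease" appears in noisy_histogram(d1, ...) with probability > 0,
# but never in noisy_histogram(d2, ...): a distinguishing event, with L = ∞.
```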

Our interpretation of δ captures this nicely. Since we inverted the curve, if L = ∞, we simply have e^ε/e^L = 0. The distance to 1 is exactly 1, so we count these events with maximal weight. The graph looks like this:

Graph showing the e^ε/e^L curve for a mechanism with distinguishing events: these events sit at 0, so they count with weight exactly 1

In that case, δ₁ = δ₂: all "bad" events are worst-case events. For such a mechanism, the two characterizations of δ are the same.

Why use Gaussian noise at all if it requires δ > 0?

This is an excellent question. This post is already pretty long, so I'll talk more about Gaussian noise later. Stay tuned (on RSS or Twitter) for more updates!

Thanks to Sebastian Meiser, who wrote the reference paper about the subtleties with δ. It makes for excellent reading if you want to dig a bit deeper into this. Thanks also to Antoine Amarilli for proofreading this post.
