Differential privacy in (a bit) more detail

Part of a series on differential privacy. In case you need some more reading material!

  1. Why differential privacy is awesome presents a non-technical explanation of the definition.
  2. Differential privacy in (a bit) more detail (this article) introduces the formal definition, with very little math.
  3. Differential privacy in practice (easy version) explains how to make simple statistics differentially private.
  4. Almost differential privacy describes how to publish private histograms without knowing the categories in advance.
  5. Local vs. global differential privacy presents the two main models of differential privacy, depending on who the attacker is.
  6. The privacy loss random variable explains the real meaning of (ε,δ)-differential privacy.

As I mentioned in the previous article, differential privacy is pretty awesome. If I did a good job, you're now wondering what the real definition looks like. So in this post, I will go into a bit more detail about what differential privacy actually means, and why it works so well. There will be some math! But I promise I will explain all the concepts I use, and give lots of intuition.

The definition

We saw that a process satisfies differential privacy if its output is basically the same if you change the data of one individual. And by "basically the same", we meant "the probabilities are close".


Let's now translate that into a formal definition.

A process A is ε-differentially private if for all databases D₁ and D₂ which differ in only one individual:

$$P[A(D_1) = O] \le e^{\varepsilon} \cdot P[A(D_2) = O]$$

… and this must be true for all possible outputs O. Let's unpack this.

P[A(D₁) = O] is the probability that when you run the process A on the database D₁, the output is O. This process is probabilistic: if you run it several times, it might give you different answers. A typical process might be: "count the people with blue eyes, add some random number to this count, and return this sum". Since the random number changes every time you run the process, the results will vary.
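To make this concrete, here is a minimal sketch of such a process in Python. Note that the description above only says "some random number": Laplace noise with scale 1/ε, used below, is one standard choice for counting queries, but take that detail as an assumption for now.

```python
import numpy as np

def noisy_blue_eyes_count(database, epsilon):
    """Count people with blue eyes, then add random noise to the count.

    Assumption: Laplace noise with scale 1/epsilon, a standard choice for
    counting queries (one person's data shifts the true count by at most 1).
    """
    true_count = sum(1 for person in database if person["eye_color"] == "blue")
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# The process is probabilistic: two runs on the same data give different outputs.
database = [{"eye_color": "blue"}, {"eye_color": "brown"}, {"eye_color": "blue"}]
print(noisy_blue_eyes_count(database, epsilon=1.1))
print(noisy_blue_eyes_count(database, epsilon=1.1))
```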

e^ε is the exponential function applied to the parameter ε > 0. If ε is very close to 0, then e^ε is very close to 1, so the probabilities are very similar. The bigger ε is, the more the probabilities can differ.

Of course, the definition is symmetrical: you can replace D₁ by D₂ and vice-versa, and the two databases will still differ in only one individual. So we could replace it by:

$$e^{-\varepsilon} \cdot P[A(D_2) = O] \le P[A(D_1) = O] \le e^{\varepsilon} \cdot P[A(D_2) = O]$$

Thus, this formula means that the output of the process is similar if you change or remove the data of one person. The degree of similarity depends on ε: the smaller it is, the more similar the outputs are.

What does this similarity have to do with privacy? First, I'll explain this with an intuitive example. Then, I'll formalize this idea with a more generic interpretation.

A simple example: randomized response

Suppose you want to run a survey to find out how many people are illegal drug users. If you naively go out and ask people whether they're using illegal drugs, many will lie to you. So you devise the following mechanism. The participants no longer directly answer the question "have you consumed illegal drugs in the past week?". Instead, each of them will flip a coin, without showing it to you.

  • On heads, the participant tells the truth (Yes or No).
  • On tails, they flip a second coin. If the second coin lands on heads, they answer Yes. Otherwise, they answer No.

How is this better for survey respondents? They can now answer Yes without revealing that they're doing something illegal. When someone answers Yes, you can't know their true answer for sure. They could be actually doing drugs, but they might also have answered at random.
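Here is what this mechanism looks like as a short Python sketch (the function name and data representation are mine):

```python
import random

def randomized_response(true_answer):
    """Answer the survey question with plausible deniability.

    true_answer: bool, whether the participant actually used illegal drugs.
    """
    if random.random() < 0.5:  # first coin lands on heads: tell the truth
        return "Yes" if true_answer else "No"
    # first coin lands on tails: answer at random using a second coin
    return "Yes" if random.random() < 0.5 else "No"
```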

Let's compute the probabilities of each answer for a drug user.

  • With probability 50%, they will tell the truth and answer Yes.
  • With probability 50%, they will answer at random.
    • They then have another 50% chance to answer Yes, so 25% chance in total.
    • Similarly, in total, they have a 25% chance to answer No.

All in all, we get a 75% chance to answer Yes and a 25% chance to answer No. For someone who is not doing drugs, the probabilities are reversed: 25% chance to answer Yes and 75% to answer No. Using the notations from earlier:

  • P[A(Yes) = Yes] = 0.75, P[A(Yes) = No] = 0.25
  • P[A(No) = Yes] = 0.25, P[A(No) = No] = 0.75

Now, 0.75 is three times larger than 0.25. So if we choose ε such that e^ε = 3 (that's ε ≈ 1.1), this process is ε-differentially private. So this plausible deniability translates nicely into the language of differential privacy.

Of course, with a differentially private process like this one, you're introducing some noise into your data. But if you have enough answers, with high probability, the noise will cancel itself out. Suppose you have 1000 answers in total: 400 of them are Yes and 600 are No. About 50% of all 1000 answers are random, so you can remove 250 random answers from each count. In total, you get 150 Yes answers out of 500 non-random answers, so about 30% Yes overall.
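This debiasing step is mechanical enough to sketch in code: since P[Yes] = 0.5 × (true fraction) + 0.25, we can just invert that formula (the helper below is mine, not part of the original survey design).

```python
def estimate_true_yes_fraction(answers):
    """Invert the randomized-response bias: P[Yes] = 0.5 * fraction + 0.25."""
    observed_yes = answers.count("Yes") / len(answers)
    return (observed_yes - 0.25) / 0.5

# The example from above: 400 Yes and 600 No answers.
answers = ["Yes"] * 400 + ["No"] * 600
print(estimate_true_yes_fraction(answers))  # 0.3, i.e. about 30% Yes overall
```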

What if you want more privacy? Instead of having the participants tell the truth with probability 50%, you can have them tell the truth 25% of the time. What if you want less noise instead, at the cost of less protection? Have them tell the truth 75% of the time. Finding out ε and quantifying the noise for each option is left as an exercise for the reader =)
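If you want to check your answer to that exercise, here is a small sketch. If participants tell the truth with probability t, a drug user answers Yes with probability (1+t)/2 and a non-user with probability (1−t)/2, so e^ε = (1+t)/(1−t):

```python
import math

def epsilon_for_truth_probability(t):
    """ε of randomized response when participants tell the truth w.p. t."""
    return math.log((1 + t) / (1 - t))

for t in (0.25, 0.5, 0.75):
    print(t, round(epsilon_for_truth_probability(t), 2))
# t = 0.25: ε ≈ 0.51 (more privacy, more noise)
# t = 0.5:  ε ≈ 1.1  (the mechanism above)
# t = 0.75: ε ≈ 1.95 (less noise, less protection)
```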

A generalization: quantifying the attacker's knowledge

Let's forget about the previous example and consider a more generic scenario. In line with the previous article, we will describe this scenario from the attacker's perspective. We have a mechanism A which is ε-differentially private. We run it on some database D, and release the output A(D) to an attacker. Then, the attacker tries to figure out whether someone (their target) is in D.

Under differential privacy, the attacker can't gain a lot of information about their target. And this is true even if this attacker has a lot of knowledge about the dataset. Let's take the strongest attacker we can think of: they know the whole database, except their target. This attacker has to determine which database is the real one, between two options: one with their target in it (let's call it D_in), the other without (D_out)¹.

So, in the attacker's model of the world, the actual database D can be either D_in or D_out. They might have an initial suspicion that their target is in the database. This suspicion is represented by a probability, P[D = D_in]. This probability can be anything between 0 and 1. Say, 0.9 if the attacker's suspicion is strong, 0.01 if they think it's very unlikely, 0.5 if they have no idea… Similarly, their suspicion that their target is not in the dataset is also a probability, P[D = D_out]. Since there are only two options, P[D = D_out] = 1 − P[D = D_in].

Now, suppose the attacker sees that the mechanism returns output O. How much information did the attacker gain? This is captured by looking at how much their suspicion changed after seeing this output. In mathematical terms, we have to compare P[D = D_in] with the updated suspicion P[D = D_in | A(D) = O]. This updated suspicion is the attacker's model of the world after seeing O.

With differential privacy, the updated probability is never too far from the initial suspicion. And we can quantify this phenomenon exactly. For example, with ε = 1.1, here is what the upper and lower bounds look like.

[Graph: the attacker's updated suspicion as a function of their initial suspicion, for ε = 1.1]
The black line is what happens if the attacker didn't get their suspicion updated at all. The blue lines are the lower and upper bounds on the updated suspicion: it can be anywhere between the two. We can visualize the example mentioned in the previous article: for an initial suspicion of 50%, the updated suspicion is approximately between 25% and 75%.

How do we prove that these bounds hold? We'll need a result from probability theory (Bayes' theorem), and some basic arithmetic manipulation. You don't have to read the proof to follow the rest of this article.
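Here is a condensed sketch of the argument (my reconstruction; the full proof handles the details more carefully). Write p = P[D = D_in] for the attacker's prior suspicion. By Bayes' theorem:

$$P[D = D_{in} \mid A(D) = O] = \frac{P[A(D_{in}) = O] \cdot p}{P[A(D_{in}) = O] \cdot p + P[A(D_{out}) = O] \cdot (1 - p)}$$

Divide the numerator and denominator by P[A(D_in) = O]. The differential privacy guarantee says that the ratio P[A(D_out) = O] / P[A(D_in) = O] is at least e^(−ε), which gives:

$$P[D = D_{in} \mid A(D) = O] \le \frac{1}{1 + e^{-\varepsilon} \cdot \frac{1 - p}{p}} = \frac{e^{\varepsilon} \cdot p}{e^{\varepsilon} \cdot p + 1 - p}$$

The lower bound follows symmetrically, since the same ratio is at most e^ε. Plugging in p = 0.5 and e^ε ≈ 3 gives an upper bound of 0.75 and a lower bound of 0.25: the 75% and 25% from the graph above.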

What does this look like for various values of ε? We can draw a generalization of this graph with pretty colors:

[Graph: the same lower and upper bounds, for various values of ε]

For larger values of ε, this gets scary quite fast. Let's say you're using ε = 5. Then, an attacker can go from a small suspicion (say, 10%) to a very high degree of certainty (94%).
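These numbers follow directly from the upper bound in the proof sketch above; a tiny Python helper (mine) reproduces them:

```python
import math

def max_updated_suspicion(prior, epsilon):
    """Upper bound on the attacker's updated suspicion (see proof sketch)."""
    return math.exp(epsilon) * prior / (math.exp(epsilon) * prior + 1 - prior)

print(max_updated_suspicion(0.5, 1.1))  # ≈ 0.75, the example for ε = 1.1
print(max_updated_suspicion(0.1, 5.0))  # ≈ 0.94, the scary example for ε = 5
```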

What about composition?

In the previous section, I formalized two claims I made in my last article. First, I explained what it means to quantify information gain. Second, I picked an attacker with full background knowledge: if the attacker knows less information in the first place, the bounds we showed still hold.

What about the third claim? I said that differential privacy was composable. Suppose that two algorithms A and B are ε-differentially private. We want to prove that publishing the result of both is 2ε-differentially private. Let's call C the algorithm which combines A and B: C(D) = (A(D), B(D)). The output of this algorithm will be a pair of outputs: O = (O_A, O_B).

The insight is that the two algorithms are independent. They each have their own randomness, so the result of one does not impact the result of the other. This allows us to simply write:

$$P[C(D_1) = O] = P[A(D_1) = O_A] \cdot P[B(D_1) = O_B] \le e^{\varepsilon} \, P[A(D_2) = O_A] \cdot e^{\varepsilon} \, P[B(D_2) = O_B] = e^{2\varepsilon} \cdot P[C(D_2) = O]$$

so C is 2ε-differentially private.
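To see this bound attained with concrete numbers, take the randomized response mechanism from earlier and ask each participant to run it twice, with fresh coin flips each time. For a drug user, P[(Yes, Yes)] = 0.75 × 0.75 ≈ 0.56; for a non-user, it is 0.25 × 0.25 ≈ 0.06. The ratio between the two is 9 = 3², which is exactly e^{2ε} for e^ε = 3: composing the two runs doubles ε.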

Future steps

I hope that I convinced you that differential privacy can be an excellent way to protect your data (if your ε is low). Now, if everything is going according to my master plan, you should be like… "This is awesome! I want to use it everywhere! How do I do that?"

Initially, I planned to answer this question in this post (insofar as it can be answered). But as I started writing it, I realized three things.

  • There are many different answers depending on what task you want to do.
  • There are many classical mistakes you can make when trying to use differential privacy. I would need to explain them to make sure you don't fall for them.
  • This post is pretty long already.

So, you guessed it, I'll keep that for the next article.

Thanks to Chao Li for introducing me to the Bayesian interpretation of differential privacy, and to a3nm, Armavica, immae, and p4bl0 for their helpful comments on drafts of this article (as well as previous ones).


  1. This can mean that D_out is the same as D_in with one fewer user. This can also mean that D_out is the same as D_in, except one user has been changed to some arbitrary other user. This distinction doesn't change anything in the reasoning, so we can simply forget about it.
