Why differential privacy is awesome

Part of a series on differential privacy. In case you need reading material once you've finished this post!

  1. Why differential privacy is awesome (this article) presents a non-technical explanation of the definition.
  2. Differential privacy in (a bit) more detail introduces the formal definition, with very little math.
  3. Differential privacy in practice (easy version) explains how to make simple statistics differentially private.
  4. Almost differential privacy describes how to publish private histograms without knowing the categories in advance.
  5. Local vs. global differential privacy presents the two main models of differential privacy, depending on who the attacker is.
  6. The privacy loss random variable explains the real meaning of (ε,δ)-differential privacy.

Are you following tech- or privacy-related news? If so, you might have heard about differential privacy. The concept is popular both in academic circles and inside tech companies. Both Apple and Google use differential privacy to collect data in a private way.

So, what's this definition about? How is it better than definitions that came before? More importantly, why should you care? What makes it so exciting to researchers and tech companies? In this post, I'll try to explain the idea behind differential privacy and its advantages. I'll do my best to keep it simple and accessible for everyone — not only technical folks.

What it means

Suppose you have a process that takes some database as input, and returns some output.

This process can be anything. For example, it can be:

  • computing some statistic ("tell me how many users have red hair")
  • an anonymization strategy ("remove names and last three digits of ZIP codes")
  • a machine learning training process ("build a model to predict which users like cats")
  • … you get the idea.

To make a process differentially private, you usually have to modify it a little bit. Typically, you add some randomness, or noise, in some places. What exactly you do, and how much noise you add, depends on which process you're modifying. I'll abstract that part away and simply say that your process is now doing some unspecified ✨ magic ✨.

Now, remove somebody from your database, and run your new process on it. If the new process is differentially private, then the two outputs are basically the same. This must be true no matter who you remove, and what database you had in the first place.

By "basically the same", I don't mean "it looks a bit similar". Instead, remember that the magic you added to the process was randomized. You don't always get the same output if you run the new process several times. So what does "basically the same" means in this context? That the probability distributions are similar. You can get the exact same output with database 1 or with database 2, with similar likelihood.

What does this have to do with privacy? Well, suppose you're a creepy person trying to figure out whether your target is in the original data. By looking at the output, you can't be 100% certain of anything. Sure, it could have come from a database with your target in it. But it could also have come from the exact same database, without your target. Both options have a similar probability, so there's not much you can say.

You might have noticed that this definition is not like the ones we've seen before. We're not saying that the output data satisfies differential privacy. We're saying that the process does. This is very different from k-anonymity and other definitions we've seen. There is no way to look at data and determine whether it satisfies differential privacy. You have to know the process to know whether it is "anonymizing" enough.

And that's about it. It's a tad more abstract than other definitions we've seen, but not that complicated. So, why all the hype? What makes it so awesome compared to older, more straightforward definitions?

Why it's awesome

Privacy experts, especially in academia, are enthusiastic about differential privacy. It was first proposed by Cynthia Dwork, Frank McSherry, Kobbi Nissim and Adam Smith in 2006¹. Very soon, almost all researchers working on anonymization started building differentially private algorithms. And, as we've already mentioned, tech companies are also trying to use it whenever possible. So, why all the hype? I can count three main reasons.

You no longer need attack modeling

Remember the previous definitions we've seen? (If not, you're fine, just take my word for it :D) Why did we need k-map in certain cases, and k-anonymity or δ-presence in others? To choose the right one, we had to figure out the attacker's capabilities and goals. In practice, this is pretty difficult. You might not know exactly what your attacker is capable of. Worse, there might be unknown unknowns: attack vectors that you hadn't imagined at all. You can't make very broad statements when you use old-school definitions. You have to make some assumptions, which you can't be 100% sure of.

By contrast, when you use differential privacy, you get two awesome guarantees.

  1. You protect any kind of information about an individual. It doesn't matter what the attacker wants to do. Reidentify their target, know if they're in the dataset, deduce some sensitive attribute… All those things are protected. Thus, you don't have to think about the goals of your attacker.
  2. It works no matter what the attacker knows about your data. They might already know some people in the database. They might even add some fake users to your system. With differential privacy, it doesn't matter. The users that the attacker doesn't know are still protected.

You can quantify the privacy loss

We saw that when using k-anonymity, choosing the parameter k is pretty tricky. There is no clear link between which k to choose and how "private" the dataset is. This problem is even worse with the other definitions we've seen so far.

Differential privacy is much better. When you use it, you can quantify the greatest possible information gain by the attacker. The corresponding parameter, usually named ε, allows you to make very strong statements. Suppose ε = 1.1. Then, you can say: "an attacker who thinks their target is in the dataset with probability 50% can increase their level of certainty to at most 75%."
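
Where does that 75% come from? Here is a quick back-of-the-envelope check (the function name and the odds-based framing are mine; the underlying fact, that an ε-DP output can multiply the attacker's odds by at most e^ε, follows from the definition):

```python
import math

def max_posterior(prior, epsilon):
    """Upper bound on the attacker's belief after seeing one output of
    an epsilon-DP process: by Bayes' rule, the odds that the target is
    in the dataset can grow by a factor of at most e^epsilon."""
    prior_odds = prior / (1 - prior)
    posterior_odds = math.exp(epsilon) * prior_odds
    return posterior_odds / (1 + posterior_odds)

print(max_posterior(0.5, 1.1))  # ≈ 0.75: from 50% certainty to at most 75%
```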

And do you remember the previous point about attack modeling? It means you can change this statement in many ways. You can replace "their target is in the dataset" by anything about one individual. And you can add "no matter what the attacker knows" if you want to be extra-precise. Altogether, that makes differential privacy much stronger than all definitions that came before.

You can compose multiple mechanisms

Suppose you have some data. You want to share it with Alex and with Brinn, in some anonymized fashion. You trust Alex and Brinn equally, so you use the same definition of privacy for both of them. They are not interested in the same aspects of the data, so you give them two different versions of your data. Both versions are "anonymous", for the definition you've chosen.

What happens if Alex and Brinn decide to conspire, and compare the data you gave them? Will the union of the two anonymized versions still be anonymous? It turns out that for most definitions of privacy, this is not the case. If you put two k-anonymous versions of the same data together, the result won't be k-anonymous. So if Alex and Brinn conspire, they might be able to reidentify users on their own… or even reconstruct all the original data! That's definitely not good news.

If you used differential privacy, you get to avoid this type of scenario. Suppose that you gave differentially private data to Alex and Brinn. Each time, you used a parameter of ε. Then if they conspire, the resulting data is still protected by differential privacy, except that the privacy is now weaker: the parameter becomes 2ε. So they gain something, but you can still quantify how much information they got. Privacy experts call this property composition.
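
In code, keeping track of this is as simple as adding up the parameters. A sketch, reusing the hypothetical private_count from the earlier example:

```python
import numpy as np

def private_count(database, epsilon):
    # Same illustrative Laplace-based count as in the earlier sketch.
    return len(database) + np.random.laplace(scale=1.0 / epsilon)

database = list(range(1000))
epsilon = 1.1

release_for_alex = private_count(database, epsilon)   # spends epsilon
release_for_brinn = private_count(database, epsilon)  # spends epsilon again

# Sequential composition: even if Alex and Brinn pool their releases,
# the combined view is still differentially private, with a total
# privacy "budget" of epsilon + epsilon = 2 * epsilon.
total_budget = 2 * epsilon
print(release_for_alex, release_for_brinn, total_budget)
```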

This scenario sounds a bit far-fetched, but composition is super useful in practice. Organizations often want to do many things with data. Publish statistics, release an anonymized version, train machine learning algorithms… Composition is a way to stay in control of the level of risk as new use cases appear and processes evolve.

Conclusion

I hope the basic intuition behind differential privacy is now clear. Want a one-line summary? Uncertainty in the process means uncertainty for the attacker, which means better privacy.

I also hope that you're now wondering how it actually works! What hides behind this magic that makes everything private and safe? Why does differential privacy have all the awesome properties I've mentioned? What a coincidence! That's the topic of a follow-up article, which tries to give more details while still staying clear of heavy math.


  1. The idea was first proposed in a scientific paper (pdf) presented at TCC 2006, and can also be found in a patent (pdf) filed by Dwork and McSherry in 2005. The name differential privacy seems to have appeared first in an invited paper (pdf) presented at ICALP 2006 by Dwork. 
