Local vs. global differential privacy

Part of a series on differential privacy. You might want to start with the previous articles below!

  1. Why differential privacy is awesome presents a non-technical explanation of the definition.
  2. Differential privacy in (a bit) more detail introduces the formal definition, with very little math.
  3. Differential privacy in practice (easy version) explains how to make simple statistics differentially private.
  4. Almost differential privacy describes how to publish private histograms without knowing the categories in advance.
  5. Local vs. global differential privacy (this article) presents the two main models of differential privacy, depending on the context.
  6. The privacy loss random variable explains the real meaning of (ε,δ)-differential privacy.

When people talk about differential privacy, they don't always have the same thing in mind. People agree on the definition, but context also matters a lot. There are two "flavors" of differential privacy: the global model and the local model. In this post, I'll first explain the difference between the two options. Then, I'll give a glimpse of an exciting new idea that combines them.

What do I mean by "context"? In abstract terms, a differentially private algorithm transforms an input into an output.

Diagram: "raw input" → "anonymized output", with "magic" on the arrow.

The main difference will be: who has access to the raw input?

Global differential privacy

In the global model of differential privacy, a central aggregator has access to the real data. What is this "aggregator"? Typically, it's a service or a research organization collecting data about individuals. In this model, each user sends their data to this aggregator without noise. The aggregator takes this data, and transforms it with a differentially private mechanism.

Diagram: each user sends their raw data to the central aggregator, which applies a differentially private mechanism and publishes the anonymized result.

The differentially private mechanism is only applied once, at the end of the process. The aggregator can then e.g. publish the result or share it with third parties.

This model has one big advantage: accuracy. In the global model, you usually don't need to add a lot of noise to get valuable results with a low ε. Remember when I explained how to release statistics or histograms using differential privacy? Those examples used this global model, and it worked pretty well: only a little noise was needed to hide someone in a count.
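
To make this concrete, here's a minimal sketch of the global model in Python (hypothetical code with made-up names, not from this article): the aggregator sees the raw 0/1 values, computes the exact count, and adds Laplace noise with scale 1/ε, since adding or removing one person changes the count by at most 1.

```python
import numpy as np

def global_dp_count(raw_values, epsilon):
    # Aggregator-side: the raw data is visible, so we can compute the exact
    # count and only perturb the final result.
    true_count = int(np.sum(raw_values))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Hypothetical example: 10,000 users with a 0/1 attribute, and epsilon = 0.1.
rng = np.random.default_rng(0)
raw_data = rng.integers(0, 2, size=10_000)
print(global_dp_count(raw_data, epsilon=0.1))  # off by about 10 on average
```

Even with a fairly small ε, the noise is tiny compared to the count itself.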

Where's the catch, then? Well, the central aggregator needs to know the real data. In the scenarios above, we added noise to real counts, which is only possible if we know the true numbers in the first place… To enable this, each user has to trust the aggregator enough to share their data with it. That might be difficult: the aggregator can be an untrusted company or government. Also, with the global model, all the data is collected in one place. This increases the risk of catastrophic failure, for example if the aggregator gets hacked and leaks all the data.

The most famous real-world example of the global model is probably the US Census: the Census Bureau will use differential privacy to anonymize the data of the 2020 census before publication. This is pretty exciting! You can read more about it here.

Local differential privacy

What's the alternative, then? It's the local model of differential privacy. In this model, there is still an aggregator, but they no longer have access to the real data. Instead, each user applies a differentially private mechanism to their own data. And they only send their data to the aggregator once it's already anonymized.

Diagram: each user applies a differentially private mechanism to their own data, then sends the noisy data to the aggregator, which computes and publishes statistics.

After collecting this noisy data, the aggregator can compute some statistics, and publish them. This last step doesn't need to be differentially private: the data is anonymous to begin with. In theory, the aggregator could publish the entire dataset they collected.

The big advantage of this model is that it no longer requires trust. Since each user is protecting their own data, they're safe even if the aggregator is malicious. This makes the local model well-suited to situations where trust is difficult to establish. And we already saw an example of this! Remember the survey about drug use that used randomized response to gather data? The scheme allowed subjects to answer honestly without admitting to breaking the law. This is a typical application of the local model.
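
As a refresher, here is a minimal sketch of that scheme (hypothetical Python, not the survey's exact protocol): each user randomizes their own answer before sending it, and the aggregator never sees anything but the noisy bits, which it can debias in aggregate.

```python
import random

def randomized_response(true_answer: bool) -> bool:
    # User-side: flip a coin. Heads: answer honestly.
    # Tails: flip a second coin and report that instead.
    # Each reported bit is epsilon-DP with epsilon = ln(3).
    if random.random() < 0.5:
        return true_answer
    return random.random() < 0.5

def estimate_true_rate(reported_answers):
    # Aggregator-side: E[reported rate] = 0.25 + 0.5 * true rate, so invert it.
    reported_rate = sum(reported_answers) / len(reported_answers)
    return 2 * reported_rate - 0.5

# Hypothetical example: 100,000 users, 30% of whom truthfully answer "yes".
reports = [randomized_response(random.random() < 0.3) for _ in range(100_000)]
print(estimate_true_rate(reports))  # close to 0.3, but noisier than a global count
```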

Can you guess the drawback of this model? Since each user must add noise to their own data, the total noise is much larger. You typically need many more users than in the global model to get useful results. To mitigate this problem, practical applications often use high values of ε.
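
To get a feel for the gap, here is a back-of-the-envelope comparison (with made-up numbers, and assuming basic randomized response as the local mechanism): in the global model, the noise on a count does not depend on the number of users, while in the local model the error of the debiased count grows like the square root of the number of users.

```python
import numpy as np

n = 100_000
epsilon = np.log(3)  # same privacy parameter as basic randomized response

# Global model: a single Laplace(1/epsilon) draw on the final count.
global_std = np.sqrt(2) / epsilon    # about 1.3 users, regardless of n

# Local model: each user reports a randomized bit; after debiasing, the count
# estimate has standard deviation 2 * sqrt(n * p * (1 - p)), roughly sqrt(n)
# when the reported rate p is near 0.5.
local_std = 2 * np.sqrt(n * 0.25)    # about 316 users for n = 100,000

print(global_std, local_std)
```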

Besides randomized response, the most famous example of this model is probably RAPPOR. This clever scheme was invented to collect differentially private data in Google Chrome. Another example, a bit more recent, is the mechanism that Apple uses to collect data on the iOS keyboard.

The best of both worlds

Until very recently, there was no middle ground between the two options. The choice was binary: either accept a much larger level of noise, or collect raw data. This is starting to change, thanks to recent work on a novel type of architecture. This new design is called ESA: Encode, Shuffle, Analyze. In the two previous models, there were only two types of participants: users and the aggregator. By contrast, the ESA architecture has three elements.

  • The encoder is just a fancy name for the user. It collects the data, encrypts it twice, and passes it to the shuffler.
  • The shuffler is an intermediary process. First, it removes identifiers, and groups similar pieces of data together. Then, it passes these pieces of data to the analyzer if there are enough of them.
  • The analyzer actually decrypts the data, and computes the statistics we're interested in.

Diagram: encoder (the user) → shuffler → analyzer, with the data encrypted in two layers.

In ESA, the data is encrypted twice, in two layers. The shuffler can only decrypt the first layer, which contains the user ID and something called a "group ID". The group ID describes what kind of data this is, but not the actual value of the data. For example, the group ID could be a label like "app latency for iOS", while the actual data would be the latency value.

The shuffler then groups the reports by group ID and counts how many users are in each group. If a group contains enough users, the shuffler passes all of its reports to the analyzer. The analyzer can then decrypt the second layer of the data and compute the output.

Having two layers makes it possible to separate the data from its metadata. The shuffler can see IDs, but not the actual data. Meanwhile, the analyzer only sees the data in batches, and cannot know who sent what. The magic comes from tweaking what the shuffler does: adding some randomness in the process, in well-chosen places, ensures that the final output is differentially private.
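
To illustrate the shuffler's role, here is a highly simplified sketch (hypothetical Python with made-up field names, ignoring the actual cryptography and the extra randomness mentioned above): strip user IDs, group reports by group ID, keep only groups above a threshold, and shuffle them so the analyzer cannot link reports back to users.

```python
import random
from collections import defaultdict

def shuffle_step(encoded_reports, threshold):
    # Shuffler-side: it can read the user ID and group ID (first layer),
    # but each payload is still encrypted for the analyzer (second layer).
    groups = defaultdict(list)
    for report in encoded_reports:
        groups[report["group_id"]].append(report["payload"])

    forwarded = {}
    for group_id, payloads in groups.items():
        if len(payloads) >= threshold:      # drop groups with too few users
            random.shuffle(payloads)        # drop any ordering information
            forwarded[group_id] = payloads
    return forwarded

# Hypothetical input: the payloads stand in for ciphertexts only the analyzer can open.
reports = [
    {"user_id": "alice", "group_id": "app latency for iOS", "payload": "cipher1"},
    {"user_id": "bob",   "group_id": "app latency for iOS", "payload": "cipher2"},
    {"user_id": "carol", "group_id": "crash reports",       "payload": "cipher3"},
]
print(shuffle_step(reports, threshold=2))   # only the iOS latency group gets through
```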

This nice property comes from the separation between shuffler and analyzer. Of course, if the same entity runs both, that benefit disappears. This design can still be useful! In particular, catastrophic failures are much less likely. But from a trust perspective, it becomes like the global model. So there are two options:

  • A different organization can run the shuffler as a service. In principle, a nonprofit like the EFF could play this role.
  • The shuffler can run on secure hardware, using e.g. remote attestation. This way, the client can verify that the shuffler does what it's supposed to.

Both options are probably a few years away, but I find them pretty exciting. They could, in theory, bring the best of both worlds: the low noise levels of the global model, and the trustless aspect of the local model. If you want to learn more about it, you can check out the paper that introduced this design.


What comes next

What will we be learning next? I'm not sure yet. I recently wrote a research paper about differential privacy variants, so I'll cover that here at some point. Apart from that, here are a few ideas…

  • Differentially private algorithms to publish things that are more complex than counts or sums. (Medians are interesting!)
  • The trade-off between privacy and utility, and how to give people an intuition about accuracy loss. (How to talk to data scientists when you're a privacy person.)
  • The unexpected challenges that occur when using differential privacy in practice. (It's like crypto: so many opportunities to shoot yourself in the foot!)

If you have opinions about what you want to learn next, let me know — my contact info is below.
