Perfectly Privacy-Preserving AI

Data privacy has been called “the most important issue in the next decade,” and has taken center stage thanks to legislation like the European Union’s General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA). Companies, developers, and researchers are scrambling to keep up with the requirements. In particular, “Privacy by Design” is integral to the GDPR and will likely only gain in popularity this decade. When using privacy-preserving techniques, legislation suddenly becomes less daunting, as does ensuring data security, which is central to maintaining user trust.

Data privacy is a central issue to training and testing AI models, especially ones that train and infer on sensitive data. Yet, to our knowledge, there have been no guides published regarding what it means to have perfectly privacy-preserving AI. We introduce the four pillars required to achieve perfectly privacy-preserving AI and discuss various technologies that can help address each of the pillars. We back our claims up with relatively new research in the quickly growing subfield of privacy-preserving machine learning.

The Four Pillars of Perfectly Privacy-Preserving AI

During our research, we identified four pillars of privacy-preserving machine learning. These are:

  1. Training Data Privacy: The guarantee that a malicious actor will not be able to reverse-engineer the training data.

  2. Input Privacy: The guarantee that a user’s input data cannot be observed by other parties, including the model creator.

  3. Output Privacy: The guarantee that the output of a model is not visible to anyone except for the user whose data is being inferred upon.

  4. Model Privacy: The guarantee that the model cannot be stolen by a malicious party.

While pillars 1–3 deal with protecting data creators, pillar 4 is meant to protect the model creator.

Training data privacy

While it may be slightly more difficult to gather information about training data and model weights than it is from plaintext (the technical term for unencrypted) input and output data, recent research has demonstrated that reconstructing training data and reverse-engineering models is not as huge a challenge as one would hope.

[Image: xkcd 2169. Source: https://xkcd.com/2169/]

Evidence

In [1], Carlini and Wagner calculate just how quickly generative sequence models (e.g., character language models) can memorize rare information within a training set. They train a character language model on 5% of the Penn Treebank Dataset (PTD) with a “secret” inserted into it exactly once: “the random number is ooooooooo”, where ooooooooo is meant to be a (fake) social security number. They then measure the exposure of this secret and the network’s amount of memorization over the course of training. Memorization peaks when the test set loss is lowest, and this coincides with the peak exposure of the secret.

Metrics

So how can we quantify how likely it is that a secret can be reverse-engineered from model outputs? [1] develops a metric known as exposure:

Given a canary s[r], a model with parameters θ, and the randomness space R, the exposure of s[r] is

exposure_θ(s[r]) = log2 |R| - log2 rank_θ(s[r])

and the rank is the index at which the true secret (or canary) falls among all possible secrets, given the model’s perplexity for each candidate. The smaller the index, the greater the likelihood that the sequence appears in the training data, so the goal is to minimize the exposure of a secret, which is something that Carlini and Wagner achieve by using differentially private gradient descent (see Solutions below).
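
To make this definition concrete, here is a minimal Python sketch of how rank and exposure could be computed, assuming we already have a log-perplexity score for every candidate canary under the trained model (the candidate set and scores below are made up purely for illustration):

```python
import math
import random

def exposure(candidate_log_perplexities, secret_index):
    """Exposure of the true canary among all candidates in the randomness space R.

    candidate_log_perplexities: one score per possible canary (lower = more
        likely under the model); its length is |R|.
    secret_index: position of the true (inserted) canary in that list.
    """
    r_size = len(candidate_log_perplexities)
    secret_score = candidate_log_perplexities[secret_index]
    # rank = 1 + number of candidates the model considers more likely than the secret
    rank = 1 + sum(score < secret_score for score in candidate_log_perplexities)
    return math.log2(r_size) - math.log2(rank)

# Toy example: 10,000 possible "secrets"; the model has memorized candidate 42,
# giving it the lowest perplexity of all, so its exposure is maximal.
random.seed(0)
scores = [random.uniform(5.0, 10.0) for _ in range(10_000)]
scores[42] = 1.0
print(f"exposure = {exposure(scores, 42):.2f} bits "
      f"(maximum possible: {math.log2(10_000):.2f})")
```

An exposure close to log2 |R| means the canary can be recovered by simply ranking all candidates by model perplexity.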

Another exposure metric is presented in [2], in which the authors calculate how much information can be leaked from a latent representation of private data sent over an insecure channel. While this paper falls more into the category of input data privacy analysis, it is still worth looking at the metrics it proposes and comparing them with the one presented in [1]. In fact, the authors propose two privacy metrics: one for demographic variables (evaluated on tasks such as sentiment analysis and blog post topic classification), and one for named entities (evaluated on tasks such as news topic classification). The metrics are:

  1. Demographic variables: “1 − X, where X is the average of the accuracy of the attacker on the prediction of gender and age,”

  2. Named entities: “1−F, where F is an F-score computed over the set of binary variables in z that indicate the presence of named entities in the input example,” where “z is a vector of private information contained in a [natural language text].”
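
Both metrics boil down to “one minus how well an attacker does.” As a small sketch, using scikit-learn’s accuracy and F-score helpers with made-up attacker predictions (all values below are illustrative), they could be computed like this:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(0)

# Hypothetical attacker predictions vs. ground truth for 100 examples.
true_gender, pred_gender = rng.integers(0, 2, 100), rng.integers(0, 2, 100)
true_age, pred_age = rng.integers(0, 5, 100), rng.integers(0, 5, 100)

# Demographic privacy: 1 - average attacker accuracy on gender and age.
x = (accuracy_score(true_gender, pred_gender) + accuracy_score(true_age, pred_age)) / 2
demographic_privacy = 1 - x

# Named-entity privacy: 1 - F-score over binary named-entity indicators z.
true_z, pred_z = rng.integers(0, 2, 100), rng.integers(0, 2, 100)
named_entity_privacy = 1 - f1_score(true_z, pred_z)

print(f"demographic privacy: {demographic_privacy:.2f}, "
      f"named-entity privacy: {named_entity_privacy:.2f}")
```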

When looking at the evidence, it’s important to keep in mind that this sub-field of AI (privacy-preserving AI) is brand-spanking new, so there are likely a lot of potential exploits that either have not been analyzed or even haven’t been thought of yet.

Solutions

There are two main proposed solutions for the problem of training data memorization which not only guarantee privacy, but also improve the generalizability of machine learning models:

  1. Differentially Private Stochastic Gradient Descent (DPSGD) [3, 4]: While differential privacy was originally created to allow one to make generalizations about a dataset without revealing any personal information about any individual within the dataset, the theory has been adapted to preserve training data privacy within deep learning systems. (A toy sketch of a single DP-SGD step follows this list.)

    For a thorough discussion on the use of differential privacy in machine learning, please read this interview with Dr. Parinaz Sobhani, Director of Machine Learning at Georgian Partners, one of Canada’s leading Venture Capital firms.

  2. Papernot’s PATE [5]: Professor Papernot created PATE as a more intuitive alternative to DPSGD. PATE can be thought of as an ensemble approach and works by training multiple teacher models on disjoint subsets of the dataset. At inference, if the majority of the models agree on the output, then the output doesn’t reveal any private information about the training data and can therefore be shared. (A toy sketch of this noisy aggregation step also follows this list.)
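
To make the DPSGD idea from item 1 concrete, here is a minimal numpy sketch of a differentially private gradient step for logistic regression: each example’s gradient is clipped to a maximum L2 norm, Gaussian noise is added, and only then is the averaged gradient applied. The clipping norm, noise multiplier, and toy data are illustrative placeholders, and the privacy accounting (computing the resulting ε guarantee) is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_sgd_step(w, X, y, lr=0.1, clip_norm=1.0, noise_multiplier=1.1):
    """One DP-SGD step for logistic regression (toy sketch, no accounting)."""
    per_example_grads = []
    for xi, yi in zip(X, y):
        pred = 1.0 / (1.0 + np.exp(-xi @ w))     # sigmoid
        grad = (pred - yi) * xi                  # gradient for this single example
        # Clip the per-example gradient to L2 norm <= clip_norm.
        grad = grad / max(1.0, np.linalg.norm(grad) / clip_norm)
        per_example_grads.append(grad)
    # Add Gaussian noise calibrated to the clipping norm, then average.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=w.shape)
    noisy_mean = (np.sum(per_example_grads, axis=0) + noise) / len(X)
    return w - lr * noisy_mean

# Toy data: 64 examples, 5 features.
X = rng.normal(size=(64, 5))
y = (X[:, 0] > 0).astype(float)
w = np.zeros(5)
for _ in range(200):
    w = dp_sgd_step(w, X, y)
print("weights after DP training:", np.round(w, 2))
```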

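And here is an equally minimal sketch of the noisy aggregation step at the heart of PATE (item 2): teachers trained on disjoint partitions of the sensitive data vote on a label, Laplace noise is added to the vote counts, and only the noisy winner is released. The scikit-learn models and random data are stand-ins chosen purely to show the mechanism:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Hypothetical sensitive dataset, split into disjoint partitions, one per teacher.
X = rng.normal(size=(1000, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
n_teachers = 10
teachers = [
    LogisticRegression().fit(X_part, y_part)
    for X_part, y_part in zip(np.array_split(X, n_teachers), np.array_split(y, n_teachers))
]

def noisy_aggregate(x, epsilon=0.2, n_classes=2):
    """Answer one labeling query with the Laplace-noised majority vote of the teachers."""
    votes = np.bincount(
        [int(t.predict(x.reshape(1, -1))[0]) for t in teachers], minlength=n_classes
    )
    noisy_votes = votes + rng.laplace(scale=1.0 / epsilon, size=n_classes)
    return int(np.argmax(noisy_votes))

# A "student" model would then be trained on public data labeled this way.
query = rng.normal(size=8)
print("noisy aggregated label:", noisy_aggregate(query))
```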

Input and output privacy

Input user data and resulting model outputs inferred from that data should not be visible to any parties except for the user in order to comply with the four pillars of perfectly privacy-preserving AI. Preserving user data privacy is not only beneficial for the users themselves, but also for the companies processing potentially sensitive information. Privacy goes hand in hand with security. Having proper security in place means that data leaks are much less likely to occur, leading to the ideal scenario: no loss of user trust and no fines for improper data management.

Evidence

This is important because, while it is standard for data to be encrypted in transit and (if a company is responsible for storing it) at rest as well, data is vulnerable when it is decrypted for processing.

Solutions

  1. Homomorphic Encryption: homomorphic encryption makes it possible to perform computations directly on encrypted data. For machine learning, this means training and inference can be performed on data that is never decrypted. Homomorphic encryption has successfully been applied to random forests, naive Bayes, and logistic regression [6]. [7] designed low-degree polynomial algorithms that classify encrypted data. More recently, there have been adaptations of deep learning models to the encrypted domain [8, 9, 10]. (A toy sketch of inference on encrypted data follows this list.)

    See this post for an introduction to homomorphic encryption.

  2. Secure Multiparty Computation (MPC): the idea behind MPC is that two or more parties who do not trust each other can transform their inputs into “nonsense” which gets sent into a function whose output only makes sense when the correct number of inputs is used. Among other applications, MPC has been used for genomic diagnosis using the genomic data owned by different hospitals [11], and for linear regression, logistic regression, and neural networks for classifying MNIST images [12]. [11] is a prime example of the kind of progress that can be made by having access to sensitive data if privacy is guaranteed. There are a number of tasks which cannot be accomplished with machine learning due to the lack of data required to train classification and generative models. Not because the data isn’t out there, but because the sensitive nature of the information means that it cannot be shared or sometimes even collected, whether it is medical data or speaker-specific metadata that might help improve automatic speech recognition systems (e.g., age group, location, first language). (A toy sketch of additive secret sharing, the core trick behind MPC, follows this list.)

    A great introduction to MPC can be found here. You can find Dr. Morten Dahl’s Private Deep Learning with MPC tutorial here.

  3. Federated Learning: federated learning is basically on-device machine learning. It is only truly made private when combined with differentially private training (see DPSGD in the previous section) and MPC for secure model aggregation [13], so that the data used to train a model cannot be reverse-engineered from the weight updates produced by a single phone. In practice, Google has deployed federated learning on Gboard (see their blog post about it) and Apple introduced federated learning support in CoreML3.
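
To give a flavor of what computing on encrypted data looks like, here is a small sketch of scoring a linear model on Paillier-encrypted features. It assumes the third-party python-paillier package (imported as phe) is installed; Paillier is only additively homomorphic, but addition of ciphertexts and multiplication by plaintext scalars is all a linear score needs:

```python
from phe import paillier  # assumes the python-paillier package is installed

# The user generates a keypair and encrypts their features locally.
public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)
features = [0.5, -1.2, 3.3]
encrypted_features = [public_key.encrypt(x) for x in features]

# The model owner computes a linear score without ever seeing the plaintext:
# only ciphertext + ciphertext, ciphertext + scalar, and ciphertext * scalar are used.
weights = [0.8, -0.4, 0.1]
bias = 0.2
encrypted_score = encrypted_features[0] * weights[0]
for w, enc_x in zip(weights[1:], encrypted_features[1:]):
    encrypted_score = encrypted_score + enc_x * w
encrypted_score = encrypted_score + bias

# Only the user, who holds the private key, can decrypt the result.
print("decrypted score:", round(private_key.decrypt(encrypted_score), 4))
```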

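The core trick behind MPC, additive secret sharing, fits in a few lines of plain Python: each party splits its private value into random shares that individually look like noise, and the parties can combine share-wise sums to learn the total without anyone revealing their own input. This toy version leaves out the communication layer and any protection against malicious parties:

```python
import random

PRIME = 2**61 - 1  # all arithmetic is done modulo a large prime

def share(secret, n_parties):
    """Split an integer secret into n additive shares that sum to it modulo PRIME."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    return sum(shares) % PRIME

# Three hospitals each hold a private patient count they do not want to reveal.
private_inputs = [1200, 845, 2310]
all_shares = [share(x, n_parties=3) for x in private_inputs]

# Party i receives only the i-th share of every input and sums them locally.
partial_sums = [sum(s[i] for s in all_shares) % PRIME for i in range(3)]

# Combining the three partial sums reveals the total, but not any individual input.
print("secure total:", reconstruct(partial_sums))  # 4355
```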

Model privacy

AI models can be a company’s bread and butter; many companies provide predictive capabilities to developers through APIs or, more recently, through downloadable software. Model privacy is the last of the four pillars that must be considered and is also core to both user and company interests. Companies will have little motivation to provide interesting products and spend money on improving AI capabilities if their competitors can easily copy their models (an act which is not straightforward to investigate).

Evidence

Machine learning models form the core product and IP of many companies, so having a model stolen is a severe threat and can have significant negative business implications. A model can be stolen outright or can be reverse-engineered based on its outputs [14].

Solutions

  1. There has been some work on applying differential privacy to model outputs to prevent model inversion attacks. Differential privacy usually means compromising model accuracy; however, [15] presents a method that does not sacrifice accuracy in exchange for privacy. (A toy sketch of output perturbation follows this list.)

  2. Homomorphic encryption can be used not only to preserve input and output privacy, but also model privacy, if one chooses to encrypt a model in the cloud. This comes at significant computational cost, however, and does not prevent model inversion attacks.
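
As a toy illustration of the first idea (output perturbation in general, not the specific method of [15]), a model owner can add calibrated noise to the prediction vector before returning it, trading a little accuracy for resistance to inversion attacks. The noise scale below is an illustrative placeholder:

```python
import numpy as np

rng = np.random.default_rng(3)

def private_predict(probabilities, epsilon=1.0, sensitivity=1.0):
    """Release a Laplace-noised copy of a model's output probabilities.

    Noise with scale sensitivity/epsilon is added to each entry, then the
    vector is clipped and renormalized so it still sums to one.
    """
    noisy = probabilities + rng.laplace(scale=sensitivity / epsilon, size=len(probabilities))
    noisy = np.clip(noisy, 1e-6, None)
    return noisy / noisy.sum()

model_output = np.array([0.85, 0.10, 0.05])  # stand-in for a classifier's softmax output
print("released output:", np.round(private_predict(model_output, epsilon=2.0), 3))
```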

Satisfying All Four Pillars

As can be seen from the previous sections, there is no blanket technology that will cover all privacy problems. Rather, to have perfectly privacy-preserving AI (something that both the research community and industry have yet to achieve), one must combine technologies:

  • Homomorphic Encryption + Differential Privacy
  • Secure Multiparty Computation + Differential Privacy
  • Federated Learning + Differential Privacy + Secure Multiparty Computation
  • Homomorphic Encryption + PATE
  • Secure Multiparty Computation + PATE
  • Federated Learning + PATE + Homomorphic Encryption

Other combinations also exist, including some with alternative technologies that do not have robust mathematical guarantees yet; namely, (1) secure enclaves (e.g., Intel SGX) which allow for computations to be performed without even the system kernel having access, (2) data de-identification, and (3) data synthesis.

For now, perfectly privacy-preserving AI is still a research problem, but there are a few tools that can address some of the most urgent privacy needs.

Privacy-Preserving Machine Learning Tools

Acknowledgments

Many thanks to Pieter Luitjens and Dr. Siavash Kazemian for their feedback on earlier drafts of this post and Professor Gerald Penn for his contributions to this work.

Translated from: https://towardsdatascience.com/perfectly-privacy-preserving-ai-c14698f322f5
