What Is Differential Privacy?

Differential Privacy is an important research branch in AI. It has brought a fundamental change to AI and continues to shape how AI develops. That is my motivation for writing this series of articles on Differential Privacy and Privacy-preserving Machine Learning (ML).

Two emerging trends in AI are “Explainable AI” and “Differential Privacy”. On Explainable AI, Dataman has published a series of articles including “An Explanation for eXplainable AI”, “Explain Your Model with the SHAP Values”, “Explain Your Model with LIME”, and “Explain Your Model with Microsoft’s InterpretML”. In this series, Dataman wants to bring Differential Privacy to your attention. On Differential Privacy, Dataman has published “You Can Be Identified by Your Netflix Watching History” and “What Is Differential Privacy?”, with more to come in the future.

Let’s Start with Data Privacy

Any private information in digital form is at risk. When you open a Facebook account, you give them your personally identifiable information (PII) such as name, address, date of birth, marital status, etc. This information is sensitive and may be compromised.

Even when the PII is anonymized, your true identity can still be revealed and put at risk. In my previous post “You Can Be Identified by Your Netflix Watching History”, two researchers showed that individuals can be identified from their anonymized Netflix watching history. With all the digital data about you in the modern era (shopping data, medical data, GPS data), your true identity is still at risk even if you use fake user names.

Sensitive survey questions are private data too. Suppose a researcher needs to know the percentage of males who have ever had sex with a prostitute. He surveys people at random. Does the researcher need to know which individuals had sex with a prostitute? Probably not. But the collected data can reveal which individuals did.

In summary, are there better ways to protect individual data privacy while still providing meaningful statistics? Yes. That’s Differential Privacy in AI.

What is Differential Privacy in AI?

Differential privacy is a formal mathematical definition of privacy. We need algorithms in rigorous mathematical form because we deal with trillions of data points and more. In the above survey example, the researcher computes the statistics (such as a count or percentage) of the collected data. An algorithm is said to be differentially private if, by looking at the outcome statistics, one cannot tell whether any particular individual was included in the dataset. On the other hand, if the researcher asks one more participant and the outcome percentage changes noticeably, that participant can be identified. This is not differentially private. Thus, with a differentially private algorithm, the outcome statistics hardly change when a single individual joins or leaves the dataset. This concept was proposed by Dwork, McSherry, Nissim and Smith in 2006 and is called ε-differential privacy.
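
For readers who want the formal statement, the standard definition from the 2006 paper can be written as follows (the notation varies slightly across sources): a randomized algorithm M is ε-differentially private if, for any two datasets D and D' that differ in one individual’s record, and for any set of possible outputs O,

Pr[M(D) ∈ O] ≤ exp(ε) · Pr[M(D') ∈ O].

A smaller ε means the two probabilities are closer, so the output reveals almost nothing about whether that individual’s record was present.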

Apple, Google and Big Tech Companies Adopted This Standard

Apple collects data from users’ devices to research and improve the user experience. They want to know which new words are trending, which emoji are chosen the most, or which websites affect battery life. These marketing or system insights help Apple make the most relevant recommendations. But the challenge is that the data is specific to each user. Does Apple need to know such specific data? No. Apple announced in 2016 that, starting with iOS 10, it uses Differential Privacy technology to discover usage patterns without compromising individual privacy. Apple explains in this report that before data is sent to Apple’s servers, the differential privacy algorithm adds random noise to the original data. The added noise makes individual users unidentifiable. But the noise averages out over a large number of data points, so Apple can still get valuable insights.
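
To make the noise-addition idea concrete, below is a minimal Python sketch (an illustration only, not Apple’s actual implementation; the epsilon value and the 0/1 per-user reports are assumed purely for demonstration). Each user’s report is perturbed with Laplace noise before it leaves the device, yet the average over many users stays close to the true rate.

```python
import numpy as np

# Minimal sketch of per-report noise addition (illustrative only, not Apple's code).
rng = np.random.default_rng(0)

n_users = 100_000
true_values = rng.integers(0, 2, size=n_users)   # e.g. 1 = user picked a given emoji

epsilon = 1.0                                     # assumed privacy budget
sensitivity = 1.0                                 # each user changes a count by at most 1
noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=n_users)

noisy_reports = true_values + noise               # what each device would send

print("true rate :", true_values.mean())
print("noisy rate:", noisy_reports.mean())        # close to the true rate for large n
```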

Google adopts the mechanism described in the article “RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response”. The RAPPOR mechanism provides a privacy guarantee for individual data when collecting statistics from end-user or client-side software. RAPPOR builds on the idea of randomized response, a surveying technique developed in the 1960s.

The Randomized Response technique was first proposed by S. L. Warner in 1965 in the paper “Randomized response: a survey technique for eliminating evasive answer bias” to allow respondents to answer sensitive questions while maintaining confidentiality. For example, a social scientist wants to know the percentage of people who have ever evaded taxes. To get the outcome statistics, the social scientist can survey 1,000 people at random and count how many answer “Yes”. But many respondents will not feel comfortable answering this sensitive question. (They may be concerned that the data will be shared with the IRS.) What can the researcher do? The researcher can design the interview process like this:

Each respondent rolls a die without letting the interviewer know the outcome. If it is a 6, the respondent answers question (A); otherwise, the respondent answers the inverse question (B):

(A) I have evaded taxes,

(B) I have never evaded taxes.

The basic assumptions are: (1) the randomization distribution is known to the researcher (in this case, the probability of answering (A) is 1/6), and (2) respondents comply with the instructions and answer the assigned question truthfully (this can be difficult).

Let T be the true percentage that the researcher wants to know, p the probability of answering question (A) (= 1/6 here), and S the surveyed percentage of affirmative answers observed by the researcher. The surveyed percentage S can be expressed by Eq. (1). After rearranging the terms, the true percentage T can be obtained from Eq. (2).

S = p * T + (1 - p) * (1 - T)   (1)

T = (S + p - 1) / (2p - 1)   (2)

Suppose the researcher gets S = 40%. He can derive the true percentage T = (40% + 1/6 - 1) / (2 * 1/6 - 1) = 65%.
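
As a sanity check on Eq. (1) and Eq. (2), here is a small Python simulation of the dice-based survey described above (the true rate of 65% and the sample size are assumed values used only to verify that the estimator recovers the truth):

```python
import numpy as np

# Simulation of the randomized-response survey described above (illustrative only).
rng = np.random.default_rng(1)

n = 100_000
p = 1 / 6                  # probability of being asked question (A), i.e. rolling a 6
true_rate = 0.65           # assumed true fraction of tax evaders

evaded = rng.random(n) < true_rate
asked_A = rng.random(n) < p                # rolled a 6 -> answer (A), else (B)

# An affirmative answer means the assigned statement is true for the respondent.
says_yes = np.where(asked_A, evaded, ~evaded)

S = says_yes.mean()
T_hat = (S + p - 1) / (2 * p - 1)          # Eq. (2)
print(f"surveyed S = {S:.3f}, estimated T = {T_hat:.3f}")   # S ~ 0.40, T ~ 0.65
```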

There are many variations of randomized response designs; for other examples, see the article “Design and Analysis of the Randomized Response Technique” (2015) by Graeme Blair, Kosuke Imai, and Yang-Yang Zhou.

RAPPOR to Protect Longitudinal Privacy

However, if the above one-time randomized response is applied to the same person repeatedly, the sequence of an individual’s data can still be used to reveal their identity. Apple and Google collect data from individual devices periodically. Even if each one-time report is randomized, the sequence of reports can still be attacked to identify individuals; the authors of RAPPOR call the protection against this longitudinal privacy. This vulnerability is similar to any sequence of data described in “You Can Be Identified by Your Netflix Watching History”, like the history of your Netflix viewing, shopping, or medical visits.

How does RAPPOR prevent this?

In order to protect longitudinal privacy, the authors of RAPPOR proposed a mechanism called the permanent randomized response. This mechanism permanently replaces the true value with a derived randomized value, i.e., all future reports reuse the same derived value. The derived value can be ‘1’, ‘0’, or the original value, each with a certain probability (the probabilities summing to 1). This mechanism ensures privacy because attackers cannot differentiate between true and “noisy” data in the longitudinal data. The authors of RAPPOR give an extreme case. Suppose each individual has to report their age (in days) every day. After a period of time, some individuals’ data will stop being collected or will be permanently mixed with increased noise. This makes an individual’s longitudinal data unidentifiable by attackers.
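
Below is a simplified Python sketch of the permanent randomized response for a single true/false value (the real RAPPOR applies this bit-wise to a Bloom filter and adds a second, instantaneous randomization step; the parameter f = 0.5 is an assumed value):

```python
import random

def permanent_randomized_response(true_bit: int, f: float = 0.5) -> int:
    """Return a permanently memoized noisy bit: 1 with probability f/2,
    0 with probability f/2, otherwise the true bit."""
    r = random.random()
    if r < f / 2:
        return 1
    if r < f:
        return 0
    return true_bit

# The noisy bit is computed once and memoized, so every future report reuses it
# and repeated collection reveals nothing new about the true value.
_memoized = {}

def report(user_id: str, true_bit: int) -> int:
    if user_id not in _memoized:
        _memoized[user_id] = permanent_randomized_response(true_bit)
    return _memoized[user_id]

print(report("user-42", true_bit=1))   # same answer every time it is called
print(report("user-42", true_bit=1))
```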

Google’s GitHub for RAPPOR

Readers who are interested in the simulation code in Python and R can check this GitHub repository to see how RAPPOR infers statistics about populations while preserving the privacy of individual users.

Translated from: https://medium.com/analytics-vidhya/what-is-differential-privacy-553f41a757fd
