When Machine Learning Leaks Private Information

We’re obsessed with the data that goes into machine learning models, but what about the data that comes out of them?

Scientia potentia est. Knowledge is power. For computers involved in machine learning, knowledge assumes the form of data. The more there is, the better trained they get at recognising patterns and the more accurate they become at making predictions — whether it’s recommending a new song on Spotify, identifying a malignant lump, or helping autonomous cars drive.

Much research has been, and continues to be, focused on the data that’s fed into such machine learning models. But increasingly, computer scientists are turning their attention to the data that comes out of them, data that isn’t supposed to come out at all.

“One of the fundamental problems in machine learning is information leakage,” says Reza Shokri, the NUS Presidential Young Professor at the NUS School of Computing and a researcher at the university’s Centre for Research in Privacy Technologies (N-CRiPT). A model is supposed to churn out new insights, yes, but it isn’t supposed to reveal the individual data used to train it. And yet this can happen, owing to vulnerabilities in the models, says Prof Shokri, who has led the design of inference attacks to expose such weaknesses.

“One of the fundamental problems in machine learning is information leakage.”

Through membership inference attacks, a malicious agent who can query a machine learning model can figure out whether a particular person’s data was part of its training set. This can be used as a tool to quantify the privacy risks of machine learning algorithms, and to help comply with privacy regulations such as the European Union’s General Data Protection Regulation (GDPR). To this end, Prof Shokri and his team recently released an open source tool, named ML Privacy Meter, that lets practitioners test their machine learning models and evaluate the privacy risk to their training data.

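To make the intuition concrete, here is a minimal, hypothetical sketch of a loss-threshold membership inference attack in Python (an illustration of the general idea only, not the ML Privacy Meter tool): an overfitted model tends to assign noticeably lower loss to the records it was trained on, and an attacker who can query the model can exploit that gap.

```python
# A toy, hypothetical sketch of membership inference via a loss threshold.
# This is not the ML Privacy Meter API, just the underlying intuition.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic data with 20% label noise, so the model has something to memorise.
X = rng.normal(size=(400, 20))
y = (X[:, 0] > 0).astype(int)
flip = rng.random(400) < 0.2
y[flip] = 1 - y[flip]

# "Members" are used for training; "non-members" are held out.
X_mem, y_mem = X[:200], y[:200]
X_non, y_non = X[200:], y[200:]

# An unconstrained decision tree memorises its training set almost perfectly.
target_model = DecisionTreeClassifier(random_state=0).fit(X_mem, y_mem)

def per_record_loss(model, X, y):
    """Cross-entropy loss the model assigns to each individual record."""
    p = np.clip(model.predict_proba(X)[np.arange(len(y)), y], 1e-12, 1.0)
    return -np.log(p)

loss_mem = per_record_loss(target_model, X_mem, y_mem)
loss_non = per_record_loss(target_model, X_non, y_non)

# The attack rule: guess "member" whenever the loss is below a threshold.
# A real attacker would calibrate the threshold with shadow models trained on
# similar data; here we simply take the midpoint of the two means to illustrate.
threshold = (loss_mem.mean() + loss_non.mean()) / 2
losses = np.concatenate([loss_mem, loss_non])
guesses = losses < threshold
truth = np.concatenate([np.ones(200, dtype=bool), np.zeros(200, dtype=bool)])
print(f"membership inference accuracy: {(guesses == truth).mean():.2f}")  # noticeably above 0.5
```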

Another, more advanced type of attack is the reconstruction attack, which aims to recover information in the training dataset itself. “The consequence of a model leaking information is that the attacker can infer sensitive attributes of the individuals,” says Prof Shokri. This includes details such as a person’s national identification number, location, credit rating, whether they have a certain disease, and so on.

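As a deliberately simplified sketch of that idea, here is a hypothetical attribute inference helper in Python (the function name and the assumption of a scikit-learn-style classifier with predict_proba are illustrative, not taken from the research): the attacker fills in every plausible value of the sensitive feature and keeps the one the target model finds most likely for the record’s known label.

```python
# Hypothetical sketch of attribute inference against a trained model: try each
# candidate value for one sensitive feature and keep the value the model finds
# most likely, i.e. the one with the lowest loss for the record's known label.
import numpy as np

def infer_sensitive_attribute(model, known_record, true_label, sensitive_idx, candidates):
    """Return the candidate value for the sensitive feature that best fits the model."""
    best_value, best_loss = None, np.inf
    for value in candidates:
        record = known_record.copy()
        record[sensitive_idx] = value          # fill in one guess for the unknown attribute
        prob = model.predict_proba(record.reshape(1, -1))[0][true_label]
        loss = -np.log(max(prob, 1e-12))       # cross-entropy on this single record
        if loss < best_loss:
            best_value, best_loss = value, loss
    return best_value
```

Against a model that has memorised its training records, the loss gap at the true value can be large enough for this guess to beat chance by a wide margin.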

Making matters worse, information leakage can occur across different types of machine learning. One approach that is gaining popularity is collaborative or federated machine learning, where multiple parties share some aspects of the data they have in order to create a global model.

Remember how knowledge is power? The more data points there are, the more accurate the machine learning model becomes. And so hospitals pool together the information they have on different patient populations to better diagnose future cases, banks collaborate to recognise fraudulent transactions, and so on.

“But we have shown that membership inference attacks work very well in this federated machine learning setting,” says Prof Shokri. “The reason is because this exchange of information happens repeatedly.” Local models get updated constantly, and these updates are sent to a central server, which aggregates them with those of all the other parties. In the process, information leaks out. “You want to make your model more accurate, but then you leak more about your training,” he says.

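To see why the repetition matters, here is a minimal federated-averaging sketch in plain numpy (a toy linear model with made-up sizes, not any particular production system): every round, each party’s update is computed directly from its private data and then shipped to the server, so whoever observes those updates observes something about that data, round after round.

```python
# A minimal federated-averaging sketch: a toy linear model, plain numpy, made-up sizes.
import numpy as np

rng = np.random.default_rng(1)
n_clients, n_features, n_rounds, lr = 3, 5, 10, 0.1

# Each client holds its own private dataset.
true_w = rng.normal(size=n_features)
client_data = []
for _ in range(n_clients):
    X = rng.normal(size=(50, n_features))
    y = X @ true_w + 0.1 * rng.normal(size=50)
    client_data.append((X, y))

global_w = np.zeros(n_features)
for _ in range(n_rounds):
    updates = []
    for X, y in client_data:
        w = global_w.copy()
        grad = 2 * X.T @ (X @ w - y) / len(y)   # local gradient, computed on private data
        w -= lr * grad
        updates.append(w)                        # this update is sent to the server
    # The server (or anyone watching the traffic) sees every per-client update,
    # every round, and each one is a function of that client's private data.
    global_w = np.mean(updates, axis=0)          # federated averaging
print("final global weights:", np.round(global_w, 2))
```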

This happens with other types of machine learning as well, for example when algorithms are used to assist in decision-making processes such as determining whether an applicant should receive a bank loan or whether a patient has cancer. This is called explainable or interpretable machine learning, because in these instances you want the model to be able to justify how it arrived at the decision it made. But this means making models more transparent by releasing additional information about their constituent data and underlying algorithms, and that transparency amounts to a privacy risk.

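A deliberately simplified example of that tension, assuming a toy linear model and made-up numbers: a common per-decision explanation is each feature’s contribution to the score, and anyone who knows or can estimate the model’s weights can turn those contributions straight back into the individual’s record.

```python
# Toy illustration: a per-decision explanation for a linear model can be inverted
# to recover the record it explains (all numbers here are made up).
import numpy as np

w = np.array([0.9, -1.3, 0.4, 2.0])    # model weights, often public or estimable
x = np.array([35.0, 1.0, 0.0, 5.2])    # one applicant's record: age, flag, ..., income (scaled)

explanation = w * x                     # "which features drove the decision, and by how much"
recovered = explanation / w             # dividing by the weights hands back the record
print(recovered)                        # prints the original record
```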

A third type of machine learning that is vulnerable to privacy attacks and information leaks is robust machine learning against adversarial examples. Models such as these are used by email programs to recognise spam, by search engines and social media platforms to detect harmful content, and by driverless cars to correctly identify road signs, among other things. But these models can be fooled, says Prof Shokri, by altering a few words in an email, tweaking an image ever so slightly, or covering up part of a sign.

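Here is a hedged, toy illustration of that kind of fooling, in the spirit of the fast gradient sign method (a made-up linear “image” classifier in numpy, not any deployed system): nudging every pixel by a few percent in the direction the model is most sensitive to is enough to swing its decision.

```python
# A toy, FGSM-style adversarial perturbation against a made-up linear "image" classifier.
import numpy as np

rng = np.random.default_rng(2)
n_pixels = 100
w = rng.normal(size=n_pixels)        # classifier weights
x = rng.uniform(size=n_pixels)       # a "clean image" with pixel values in [0, 1]
b = 0.8 - w @ x                      # bias chosen so x sits just on the class-1 side

def score(img):
    """Probability the classifier assigns to class 1 (say, 'stop sign')."""
    return 1 / (1 + np.exp(-(w @ img + b)))

# For a linear model, the gradient of the logit with respect to the input is just w,
# so nudging each pixel slightly against sign(w) pushes the score towards class 0.
eps = 0.03                           # change no pixel by more than 3% of its range
x_adv = np.clip(x - eps * np.sign(w), 0.0, 1.0)

print(f"clean score:     {score(x):.2f}")
print(f"perturbed score: {score(x_adv):.2f}  (max per-pixel change: {eps})")
```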

Similar to explainable machine learning, these models can be made more trustworthy by revealing parts of their inner workings — at the expense of privacy.

“We have designed attacks that can extract sensitive information from the training data of robust models,” says Prof Shokri. “And the solutions to prevent these information leaks are not good enough yet.” But he, and other researchers at N-CRiPT, are working on ways to strike a good balance between the need for greater transparency in machine learning models and the need to protect the data used to create them. These methods will be covered in an upcoming post — so stay tuned!

Translated from: https://medium.com/ncript/when-machine-learning-leaks-private-information-284d584303b1
