Machine Learning on Encrypted Data: No Longer a Fantasy

weixin_26636643

Original article: https://medium.com/intuit-engineering/machine-learning-on-encrypted-data-no-longer-a-fantasy-58e37e9f31d7

At Intuit, the proud maker of TurboTax, QuickBooks, and Mint, we’re the trusted stewards of our customers’ data, which is a responsibility we take seriously. As part of this commitment, we innovate in our security practices, just as we innovate in our product development, and that includes exploring advanced privacy technologies. As someone who cares about privacy and encryption, I find this makes Intuit truly exciting.

Lately, my team has been researching “operations on encrypted data.” Through this research, we aim to provide a secure environment for data scientists.

Now, what exactly do we mean by operations on encrypted data? People tend to think of data encryption as a black box: on the one side enter unencrypted plaintext data and a secret key, and out the other side comes gibberish, also known as ciphertext. A second black box, called decryption, reverses the process and reveals the plaintext again. It’s important to note that if you change the ciphertext — even slightly, by flipping a single bit — your plaintext will be corrupted and come out completely random.

This view of encryption is still generally true, at least for the most common block ciphers like AES (Advanced Encryption Standard), but in the last few years an important exception has emerged in the case of homomorphic encryption. Homomorphic encryption is, first of all, encryption. As with AES, it can be represented this way:

ciphertext = Enc(key, plaintext) and

plaintext = Dec(key, ciphertext)

Homomorphic encryption also has more interesting properties. One of them is this: you can change the ciphertext and get back useful plaintext, and specifically, you can actually compute basic arithmetic on encrypted values. To simplify a bit:

Enc(a + b) = Enc(a) + Enc(b) and

Enc(a * b) = Enc(a) * Enc(b)

In other words, while performing any operations on data encrypted using a traditional cipher would result in gibberish, homomorphic encryption allows you to do it without corrupting the data. This goes further than basic operations. Being able to perform addition and multiplication also means that you can compute polynomials. And, with polynomials you can approximate essentially any function. We’ll discuss what this means in a minute.
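
The additive half of this property can be tried out with a few lines of Python. What follows is a minimal sketch, not the scheme used at Intuit: it implements the textbook Paillier cryptosystem, which is only additively homomorphic (not fully homomorphic), with tiny hard-coded primes that are an assumption made for illustration and offer no real security. The function names are hypothetical. Note that the "+" on ciphertexts is actually a modular multiplication under the hood, which is the kind of detail the simplified equations above gloss over.

```python
import math
import random

# Toy Paillier cryptosystem: additively homomorphic only, NOT a full FHE scheme.
# The primes are tiny and chosen for illustration; this is not secure.
P, Q = 104729, 1299709
N = P * Q
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # lcm(p - 1, q - 1)
MU = pow(LAM, -1, N)  # valid shortcut because the generator is g = N + 1


def encrypt(m: int) -> int:
    """Enc(m) = (N + 1)^m * r^N mod N^2 for a random r coprime to N."""
    while True:
        r = random.randrange(1, N)
        if math.gcd(r, N) == 1:
            break
    return (pow(N + 1, m, N2) * pow(r, N, N2)) % N2


def decrypt(c: int) -> int:
    """Dec(c) = L(c^lam mod N^2) * mu mod N, where L(u) = (u - 1) // N."""
    return (((pow(c, LAM, N2) - 1) // N) * MU) % N


def add_encrypted(c1: int, c2: int) -> int:
    """Homomorphic addition: multiplying ciphertexts adds the plaintexts."""
    return (c1 * c2) % N2


def mul_by_constant(c: int, k: int) -> int:
    """Multiply an encrypted value by a plaintext constant k."""
    return pow(c, k, N2)


total = add_encrypted(encrypt(12), encrypt(30))
print(decrypt(total))                            # 42
print(decrypt(mul_by_constant(encrypt(7), 6)))   # 42
```

Multiplying two encrypted values by each other is beyond what Paillier can do; that is what fully homomorphic schemes add, and it is what makes polynomial evaluation on ciphertexts possible.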

Encryption and big data

Homomorphic encryption is fascinating for a cryptography buff like me, but is it at all useful? For a long time, people have suggested that a version called fully homomorphic encryption, or FHE, might offer a way for cloud providers to run computations on their tenants’ data without having cleartext access to it. That might improve security and privacy, but I personally doubt this will ever become viable: the performance degradation associated with this technique makes it unsuitable for the general-purpose computing that customers expect from a cloud provider. On the other hand, FHE does show considerable promise as a way for enterprises to significantly harden their resistance to cyber attacks in specific use cases. To see why, let’s look at today’s artificial intelligence (AI) practices.

Some large enterprises concentrate large amounts of data into a so-called data lake. In the middle of this lake sits a cloud application server which runs a large number of AI applications called machine learning (ML) models. A centralized architecture often leads to a large number of people having partial or full access to an enterprise’s data lake.

My team’s goal is to allow as many ML models as possible to run using only encrypted data, so that in the event of an attempted attack anywhere in the data lake, the attacker wouldn’t be able to access sensitive data. How? Remember that homomorphic encryption allows you to perform mathematical functions on encrypted data. In this way, sensitive data is stored in encrypted form, and ML models are re-implemented using homomorphic operations.

Here’s what such a solution would look like:

(As a side note: homomorphic encryption is not a panacea, and it should be deployed along with more conventional security solutions such as fine-grained access control.)

Working at the leading edge of cryptography

Neural networks currently represent the most advanced types of ML models, and there has been some academic work on evaluating them homomorphically. For now, however, the workhorses of many ML applications across the industry and at Intuit remain decision trees and their cousins, random forests and boosted trees. Recently, my team collaborated with researchers from Haifa University on a paper showing that decision trees can be evaluated homomorphically in real time for realistically sized data sets. The paper also shows that such models can be trained in practical time, again for real-life data sets. We do it by approximating the threshold function:

y is 1 if x>T, otherwise y is 0

The function is approximated by a polynomial which in turn can be computed homomorphically. This becomes a building block in the homomorphic evaluation of the decision tree, as the tree is a sequence of conditional (“if”) statements. In this way, the paper shows that homomorphic encryption can be a practical method for safeguarding data in real-world scenarios.
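
To make the idea concrete, here is a small sketch of how a threshold can be replaced by additions and multiplications only, and how a tree of "if" statements then collapses into a single arithmetic expression. It is an illustration under stated assumptions, not necessarily the polynomial used in the paper: it relies on the well-known iteration f(x) = (3x - x^3) / 2, which pushes any value in [-1, 1] toward -1 or +1. Plain Python floats stand in for ciphertexts, the inputs are assumed to be pre-scaled so that x - T stays within [-1, 1], and the feature names and leaf values are hypothetical.

```python
def sign_step(x: float) -> float:
    """One polynomial step f(x) = (3x - x^3) / 2 on [-1, 1]."""
    return (3.0 * x - x * x * x) / 2.0


def soft_threshold(x: float, t: float, rounds: int = 10) -> float:
    """Polynomial stand-in for 'y is 1 if x > T, otherwise y is 0'.

    Assumes x - t lies in [-1, 1]. Uses only additions and multiplications,
    so every step has a homomorphic counterpart.
    """
    d = x - t
    for _ in range(rounds):
        d = sign_step(d)       # d drifts toward -1 or +1
    return (d + 1.0) / 2.0     # map {-1, +1} to {0, 1}


def eval_tree(age: float, income: float) -> float:
    """Depth-2 decision tree written without any branching.

    Equivalent plaintext logic (hypothetical features and leaf values):
        if age > 0.5:
            return 3.0 if income > 0.3 else 2.0
        return 1.0
    """
    a = soft_threshold(age, 0.5)
    b = soft_threshold(income, 0.3)
    # Each leaf value is weighted by the product of the indicators on its path.
    return a * (b * 3.0 + (1.0 - b) * 2.0) + (1.0 - a) * 1.0


print(round(eval_tree(0.9, 0.8), 3))  # 3.0
print(round(eval_tree(0.9, 0.1), 3))  # 2.0
print(round(eval_tree(0.1, 0.8), 3))  # 1.0
```

Two caveats apply to this sketch. Inputs very close to the threshold converge slowly and yield a value between 0 and 1, and each extra round adds multiplicative depth, which is expensive under FHE. Choosing the degree of the approximation is therefore a trade-off between accuracy and speed.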

Which kinds of scenarios, exactly? In the public cloud use case, we have to assume very rigid separation of cryptographic keys: the cloud provider never gets access to the encryption keys for the data it holds. That’s one of the things that makes me skeptical about the viability of homomorphic encryption for this type of application.

The enterprise use case, on the other hand, allows more architectural flexibility, which we can use to speed up homomorphic computations. To do this, the proposed Intuit architecture includes a dedicated server called “the oracle” (no relation to Oracle the company): a stateless server that has access to the encryption key. Security folks can think of it as analogous to a hardware security module (HSM). Of course, we do not run the actual ML models on the oracle; this would get us back to the original architecture with the oracle serving as the application server, which in turn would pose serious security risks. Instead, we use the oracle only for specific calculations. The oracle is capable of a very limited set of operations, and only has access to aggregates of the data, rather than to individual data points. Despite these limitations and the overhead of calling a remote server, the oracle provides homomorphic operations with a significant speed-up.
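
The division of labor described here can be sketched roughly as follows. Everything in this example is an assumption made for illustration: the class names, the single `exceeds` operation, and the use of a toy Paillier scheme (additively homomorphic, tiny primes, not secure) in place of a real FHE library. The point is only the shape of the design: the ML server works on ciphertexts and combines them homomorphically, while the oracle holds the key, keeps no state between calls, and only ever sees an aggregate.

```python
import math
import random


class KeyPair:
    """Toy Paillier key (additively homomorphic); tiny primes, not secure."""

    def __init__(self, p: int = 104729, q: int = 1299709):
        self.n = p * q
        self.n2 = self.n * self.n
        self._lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
        self._mu = pow(self._lam, -1, self.n)

    def encrypt(self, m: int) -> int:
        r = random.randrange(1, self.n)
        while math.gcd(r, self.n) != 1:
            r = random.randrange(1, self.n)
        return (pow(self.n + 1, m, self.n2) * pow(r, self.n, self.n2)) % self.n2

    def decrypt(self, c: int) -> int:
        return (((pow(c, self._lam, self.n2) - 1) // self.n) * self._mu) % self.n


class Oracle:
    """Stateless helper that holds the key and exposes one narrow operation."""

    def __init__(self, key: KeyPair):
        self._key = key

    def exceeds(self, encrypted_aggregate: int, threshold: int) -> int:
        """Return Enc(1) if the aggregate is above the threshold, else Enc(0)."""
        total = self._key.decrypt(encrypted_aggregate)
        return self._key.encrypt(1 if total > threshold else 0)


class MLServer:
    """Holds ciphertexts and the public modulus only, never the secret key."""

    def __init__(self, n2: int, oracle: Oracle):
        self._n2 = n2
        self._oracle = oracle

    def total_exceeds(self, encrypted_rows, threshold: int) -> int:
        aggregate = 1  # 1 is a valid encryption of 0 in Paillier
        for c in encrypted_rows:
            aggregate = (aggregate * c) % self._n2  # homomorphic addition
        # Only the encrypted sum leaves the ML server, never a single row.
        return self._oracle.exceeds(aggregate, threshold)


key = KeyPair()
server = MLServer(key.n2, Oracle(key))
rows = [key.encrypt(v) for v in (120, 80, 300)]  # sum is 500
flag = server.total_exceeds(rows, 400)
print(key.decrypt(flag))  # 1
```

In a real deployment the oracle would sit behind a network call and an access-control layer; here both parties run in one process to keep the sketch short.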

Architecturally, the oracle is a separate component that sits alongside the ML server:

A similar approach results in another major benefit compared to other fully homomorphic storage solutions: it prevents what’s known as ciphertext blow-up. With standard FHE techniques, the size in bytes of the ciphertext is about 10,000 times the size of the original plaintext (for comparison, standard symmetric data encryption only adds a few bytes to the plaintext size). This is clearly unacceptable for bulk storage of big data of the type we usually see in AI applications. The oracle approach keeps the size of the stored data manageable and realistic for real-time operations. We recently published a note describing how the combination of oracle and blinding makes it possible to store bulk data with standard symmetric encryption and re-encrypt data on the fly into FHE with good security properties and practical performance. In particular, in this solution absolutely no information is leaked to the oracle, not even aggregate values.
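
A back-of-the-envelope calculation shows why the blow-up matters. The figures below are assumptions chosen for illustration: a hypothetical 1 TB data lake split into 1,000-byte records, and a per-record symmetric overhead of 28 bytes (a 12-byte nonce plus a 16-byte tag, as in a typical AES-GCM setup). Only the factor of roughly 10,000 comes from the text above.

```python
FHE_EXPANSION = 10_000      # "about 10,000 times", per the text above
SYMMETRIC_OVERHEAD = 28     # assumption: 12-byte nonce + 16-byte tag per record


def storage_estimate(plaintext_bytes: int, record_bytes: int):
    """Return (fhe_bytes, symmetric_bytes) for a data set of the given size."""
    records = plaintext_bytes // record_bytes
    fhe_bytes = plaintext_bytes * FHE_EXPANSION
    symmetric_bytes = plaintext_bytes + records * SYMMETRIC_OVERHEAD
    return fhe_bytes, symmetric_bytes


# Hypothetical data lake: 1 TB of plaintext in 1,000-byte records.
fhe, sym = storage_estimate(10**12, 1_000)
print(fhe // 10**15, "PB under FHE")                    # 10 PB under FHE
print(sym / 10**12, "TB under symmetric encryption")    # 1.028 TB under symmetric encryption
```

Under these assumptions the same data occupies about 10 PB when stored as FHE ciphertext versus just over 1 TB when stored with symmetric encryption, which is why converting to FHE only at computation time is attractive.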

We’re excited about the results of our research to date, but we’re not done yet. For example, the existing techniques still require a lot of custom work for each ML model to be adapted to the FHE environment. Streamlining this process will go a long way to enhance the practicality of this approach in widespread use.

The technological challenges around homomorphic encryption, and the opportunities it presents, are too big for any one company to take on. The tremendous progress made since Craig Gentry published the first FHE scheme in 2009 has been made possible by active collaboration across organizations. Nowadays, much of this work is centered around homomorphicencryption.org, an industry consortium in which Intuit is active.

My team is incorporating recent cryptographic research into Intuit’s multi-tiered security strategy to resolve real-life security and privacy problems. If you’d like to become part of these efforts, please reach out. We welcome your contributions!
