The Many Use Cases for Synthetic Data

Original article: https://towardsdatascience.com/the-many-use-cases-for-synthetic-data-60e0b0193afe

Data Science by Hazy

A 2016 study found that, after just 15 minutes of monitoring driver braking patterns, researchers were able to identify that driver with an accuracy of 87 percent. Turns out, the way that you press the brake pedal is almost entirely unique to you.

This sensitivity extends into every aspect of our lives. That fancy hipster coffee you buy at your favourite cafe also leaves a trail of behavioural data. Companies are champing at the bit to get hold of this data so that they can formulate new strategies to attract your business. This is why privacy protection laws like Europe’s GDPR are rapidly changing the data landscape: they prioritise consumer protection, give you the right to be forgotten, and control who has the right to own and access your data.

This is where the magic of synthetic data comes in. Synthetic data is generated using machine learning algorithms that ingest the real data, learn its patterns of behaviour, and then output entirely artificial data that retains the statistical characteristics of the original dataset. This should be distinguished from more traditional anonymised datasets, which are actually quite vulnerable to re-identification techniques. Since synthetic data is inherently artificial, this vulnerability does not apply.
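
The ingest, train and generate loop described above can be sketched in a few lines. This is only a toy illustration, not Hazy's generator: it assumes a two-column dataset, stands in a simple bivariate Gaussian for the machine learning model, and uses made-up column meanings (coffee spend and visits per month). The helper names `fit`, `generate` and `pearson` are hypothetical. A model this simple says nothing by itself about privacy guarantees; it only shows how artificial rows can preserve a statistical property, here the correlation between columns.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fit(rows):
    """'Train': summarise the real rows as means, spreads and a correlation."""
    xs, ys = zip(*rows)
    return {
        "mean_x": statistics.fmean(xs), "std_x": statistics.pstdev(xs),
        "mean_y": statistics.fmean(ys), "std_y": statistics.pstdev(ys),
        "corr": pearson(xs, ys),
    }


def generate(params, n, seed=0):
    """Draw n artificial rows from the fitted bivariate Gaussian."""
    rng = random.Random(seed)
    r = params["corr"]
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = params["mean_x"] + params["std_x"] * z1
        # Mixing z1 and z2 reproduces the fitted correlation between x and y.
        y = params["mean_y"] + params["std_y"] * (r * z1 + math.sqrt(1 - r * r) * z2)
        rows.append((x, y))
    return rows


# Hypothetical stand-in for sensitive data: (coffee spend, visits per month).
_rng = random.Random(42)
real = []
for _ in range(5000):
    spend = _rng.gauss(50, 10)
    real.append((spend, 0.1 * spend + _rng.gauss(0, 1)))

params = fit(real)
synthetic = generate(params, n=5000)

real_corr = pearson(*zip(*real))
synth_corr = pearson(*zip(*synthetic))
print(f"real corr={real_corr:.2f}, synthetic corr={synth_corr:.2f}")
```

No synthetic row is copied from the real data, yet the two correlations should come out close. Production generators use far richer models to capture non-linear and multi-column structure.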

Due to the privacy-preserving nature of synthetic data, it is not governed by the same data protection laws. Machine learning engineers and data scientists can confidently use this synthetic data for their analyses and modelling, knowing that it will behave in the same manner as the real data. This simultaneously protects customer privacy and mitigates risk for the companies that leverage it — all while unblocking data that is otherwise frozen behind compliance barriers… often for many months or even years.

Since the end of June, I have been a data science intern at Hazy synthetic data. The Hazy team has built a sophisticated synthetic data generator and enterprise platform that helps customers unlock their data’s full potential, increasing the speed at which they are able to innovate, while minimising risk exposure.

Synthetic data use cases

Now that you’ve been introduced to synthetic data and the high-level problems that it can help solve, let’s get into some more detailed synthetic data use cases.

Vendor evaluations

Picture this. You work at an organisation that is looking to outsource some work, such as app development, testing, data science, or analytics and business intelligence. As with any big purchase, you want to test drive before you buy. Often this means handing real, and highly sensitive, data to third parties, which is not only a security risk but can also take anywhere from 6 to 18 months to clear legal and procurement hurdles. That is a lot of hassle, considering that it’s all just to determine whether or not you want to partner with this vendor.

Because synthetic data is not sensitive, using it eliminates the lag in this process. AI-generated synthetic data can also be representative enough that, if you do choose to work with that vendor, you can keep building on artificial data alone and avoid exposing real data to security compromises down the line.

Sharing data with third-party services

In a similar vein to vendor evaluation, using third-party services such as online applications or cloud compute resources would require handing over sensitive data to that service. The same goes for sharing data with third parties for better, or at least external, analytics. Due to hardware limitations, a business may not be able to keep all of its data on-premises, and so needs an online storage platform or a faster cloud provider. However, compliance rules can dictate that this data must remain on-premises. Along with the usual headache of compliance, this can (and should) be a significant worry for companies, as a security breach can leave both your customers and your reputation vulnerable. With synthetic data it’s all Hakuna Matata.

Data monetisation

Many business models these days are entirely based around monetising the data that they collect from their user base. If you’re not paying for the product then it’s more than likely that this is the case. Companies can collect data, conduct analyses, and sell any of the insights on to external businesses that have a vested interest. Some organisations sell the raw data so that the external companies can conduct their own nuanced analyses, but this comes with many more regulatory compliance issues, and often the data is deemed too sensitive to do so.

With synthetic data, compliance and risk are no longer issues. Consequently, the value of that data, and the speed at which value can be generated from it, increase drastically. Companies may even be able to generate entirely new revenue streams. After all, the value of most data isn’t the personal information, but the insights gained from it. Plus, synthetic data is more flexible than real data, as it can be generated automatically, amplified and enriched without limit, opening up even more monetisation opportunities.

Cross-organisational data portability

Restrictions on the transfer of data are not limited to dealings with external companies. Within one organisation, there can be many compliance criteria that must be met before data can be passed between departments, and this can often take weeks. It takes even longer if it involves sharing across geographical boundaries and regulations.

Being able to create a safe, synthetic dataset means that organisations can have centralised data repositories, often called data pools, that can be managed by simple role-based access control. For example, banks have a particular wealth of data in their customers’ transaction histories. By pooling synthetic twins of this data, banks can share it safely among data scientists from multiple departments and across borders.
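
As a rough sketch of what "simple role-based access control" over a data pool could look like, the snippet below maps roles to the datasets they may read. The role names, dataset names and the `can_read` helper are all hypothetical and not taken from any particular product; a real deployment would normally rely on the access controls of its data platform.

```python
# Hypothetical roles and dataset names, for illustration only.
ROLE_PERMISSIONS = {
    "fraud_analyst": {"transactions_synthetic"},
    "marketing_analyst": {"transactions_synthetic", "campaigns_synthetic"},
    "compliance_officer": {"transactions_synthetic", "transactions_real"},
}


def can_read(role: str, dataset: str) -> bool:
    """Return True if the role is allowed to read the dataset."""
    # Unknown roles get an empty permission set, so access is denied by default.
    return dataset in ROLE_PERMISSIONS.get(role, set())


print(can_read("fraud_analyst", "transactions_synthetic"))  # synthetic twin
print(can_read("fraud_analyst", "transactions_real"))       # real data
```

The design point is that the synthetic twin can be opened to many roles, while the real dataset stays restricted to the few that genuinely need it.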

This unprecedented level of collaboration can be used for training on much larger datasets, which unearth more patterns for better money laundering and fraud detection algorithms. With the freedom to share information internally, enterprises can innovate and act on new data much faster, in areas ranging from personalised marketing to international crime. This gives businesses a significant edge over competitors that have more traditional data lifecycles and artificial barriers to innovation.

Data retention

Regulations also limit the amount of time a company is able to keep hold of personal data, making it very difficult to conduct longer-term analyses, such as trying to detect seasonality over several years. Remember that synthetic data is not governed by the same privacy protection laws: while it retains the customer usage patterns, it’s utterly artificial. With no risk of re-identification, companies are able to hold onto their synthetic data for as long as they wish, and can come back to it at any time in the future to conduct analyses that were not being carried out, or were not even technologically feasible, at the time of data collection.

Simulating unforeseen events

Preparation is usually better than a knee-jerk reaction. More and more companies are looking to use data to prepare for unforeseen circumstances, never more so than now in these unprecedented times. This kind of preparedness is now possible thanks to conditional synthetic data generation. It’s possible to take a ‘normal’ or precedent dataset, add conditions to the generator, and output a synthetic dataset that is representative of events that have never occurred before, which allows you to analyse, model and subsequently prepare for such circumstances.
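
To make "add conditions to the generator" concrete, here is a minimal sketch under strong assumptions. It treats the generator as a bivariate Gaussian whose parameters are hypothetical values, as if fitted on ‘normal’ history, and conditions on an input value far outside that history. The column meanings (a mobility index and daily in-store spend) and the `generate_given_x` helper are invented for illustration. Real conditional generators are far more expressive, and extrapolating this far from the observed data rests entirely on the assumed linear relationship.

```python
import math
import random
import statistics

# Hypothetical parameters, as if fitted on 'normal' history:
# x = mobility index (100 = typical), y = daily in-store spend.
params = {"mean_x": 100.0, "std_x": 5.0, "mean_y": 200.0, "std_y": 20.0, "corr": 0.8}


def generate_given_x(params, x, n, seed=0):
    """Sample n synthetic rows with x fixed to the scenario value.

    Uses the standard conditional distribution of a bivariate Gaussian:
    y | x ~ Normal(mean_y + r * std_y / std_x * (x - mean_x),
                   std_y * sqrt(1 - r^2)).
    """
    rng = random.Random(seed)
    r = params["corr"]
    slope = r * params["std_y"] / params["std_x"]
    cond_mean = params["mean_y"] + slope * (x - params["mean_x"])
    cond_std = params["std_y"] * math.sqrt(1 - r * r)
    return [(x, rng.gauss(cond_mean, cond_std)) for _ in range(n)]


# Scenario never seen in the 'normal' data: mobility drops to 60.
scenario = generate_given_x(params, x=60.0, n=10_000)
avg_spend = statistics.fmean(y for _, y in scenario)
print(f"average synthetic spend at mobility 60: {avg_spend:.1f}")
```

With these assumed parameters the conditional mean works out to 200 + 3.2 × (60 − 100) = 72, so the sampled average should land close to that value. The resulting rows can then feed the same analyses and models you would run on ordinary data.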

Conditional synthetic data use cases can range from predicting customer behaviour if there’s a second wave of this pandemic to the probability a type of cancer will metastasise to the effects of global heating. More generally, it could combine customer behaviour in one country with open public data sources to accurately predict how a product or service would perform in a completely new location.

Don’t be left behind

Ninety percent of the world’s data was created in the last two years, with 2.5 quintillion bytes of new data being captured every day. The data economy is already a highly regulated space, but with data’s current trajectory, it is likely to become even more so as governments and regulatory organisations rush to catch up with the unfathomable amount being collected.

Businesses that use synthetic data will be one step ahead of the competition. It will increase the speed at which you are able to develop new products, create fresh partnerships with third parties, and even generate entirely new revenue streams, all while substantially reducing your risk exposure.

