Create high-quality synthetic data in your cloud with Gretel.ai and Python

杨_明

Original article: https://towardsdatascience.com/create-high-quality-synthetic-data-in-your-cloud-with-gretel-ai-and-python-fff3c98addef

Whether your concern is HIPAA for healthcare, PCI for the financial industry, or GDPR or CCPA for protecting consumer data, being able to start building without first putting a data processing agreement (DPA) in place to work with SaaS services can significantly reduce the time it takes to start your project and begin creating value. Today we will walk through an example using Gretel.ai in a local (your cloud, or on-premises) configuration to generate high-quality synthetic models and datasets.

Set up your local environment

To get started you need just three things.

A dataset to synthesize, in CSV or Pandas DataFrame format
A Gretel.ai API key (it's free)
A local computer, VM, or cloud instance

Recommended setup. We recommend the following hardware configuration: CPU: 8+ vCPU cores recommended for synthetic record generation. GPU: Nvidia Tesla P4 with CUDA 10.x support recommended for training. RAM: 8GB+. Operating system: Ubuntu 18.04 for GPU support, or Mac OS X (no GPU support with Macs).

See TensorFlow's excellent setup guide for GPU acceleration. While a GPU is not required, training is generally at least 10x faster on a GPU than on a CPU. Or run on CPU and grab a ☕.
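
Before training, it can help to confirm what the machine actually offers. The sketch below is not part of Gretel's tooling: `check_hardware` is a hypothetical helper name, and the default of 8 CPUs simply mirrors the recommendation above. If TensorFlow is already installed, it asks `tf.config.list_physical_devices` for visible GPUs; otherwise it reports none.

```python
import os


def check_hardware(min_cpus=8):
    """Report CPU count and, if TensorFlow is importable, visible GPUs.

    Hypothetical helper for illustration; min_cpus=8 mirrors the
    recommendation in this article.
    """
    cpus = os.cpu_count() or 0  # os.cpu_count() may return None
    report = {
        "cpus": cpus,
        "meets_cpu_recommendation": cpus >= min_cpus,
        "tensorflow_installed": False,
        "gpus": [],
    }
    try:
        import tensorflow as tf  # imported lazily; can take a few seconds
    except ImportError:
        return report
    report["tensorflow_installed"] = True
    report["gpus"] = [d.name for d in tf.config.list_physical_devices("GPU")]
    return report


if __name__ == "__main__":
    print(check_hardware())
```

An empty `gpus` list on a machine that has a GPU usually points at the driver or CUDA setup rather than at the training code.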

Generate an API key

With an API key, you get free access to the Gretel public beta's premium features, which augment our open source library for synthetic data generation with improved field-to-field correlations, automated synthetic data record validation, and reporting for synthetic data quality.

Log in to Gretel.ai, or create a free account, with a GitHub or Google email. Click on your profile icon at the top right, then API Key. Generate a new API token and copy it to the clipboard.

gretel.ai synthetic data service with python SDKs — Generate an API key at https://console.gretel.cloud

Set up your system and install dependencies

We recommend setting up a virtual Python environment for your runtime to keep your system tidy and clean. In this example we will use the Anaconda package manager, as it has great support for TensorFlow, GPU acceleration, and thousands of data science packages. You can download and install Anaconda from https://www.anaconda.com/products/individual.

Create the virtual environment

conda install python=3.8
conda create --name synthetics python=3.8 
conda activate synthetics # activate your virtual environment
conda install jupyter # set up notebook environment
jupyter notebook # launch notebook in browser
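
Once the notebook is open, a quick check that the kernel is really running inside the new environment can save confusion later. This is a minimal sketch: `describe_runtime` is a hypothetical helper, and it assumes the notebook server was launched from the activated shell, so that it inherits the `CONDA_DEFAULT_ENV` variable that conda sets on activation.

```python
import os
import sys


def describe_runtime(expected_env="synthetics", expected_python=(3, 8)):
    """Summarize the interpreter that the notebook kernel is using."""
    # conda sets CONDA_DEFAULT_ENV when an environment is activated
    env_name = os.environ.get("CONDA_DEFAULT_ENV")
    return {
        "python": ".".join(str(part) for part in sys.version_info[:3]),
        "conda_env": env_name,
        "env_matches": env_name == expected_env,
        "python_matches": sys.version_info[:2] == tuple(expected_python),
    }


print(describe_runtime())
```

If `env_matches` is false, the kernel is most likely using a different interpreter than the one the packages will be installed into.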

Install required Python packages

Install dependencies such as gretel-synthetics, TensorFlow, Pandas, and Gretel helpers (API key required) into your new virtual environment. Add the code samples below directly into your notebook, or download the complete synthetics notebook from GitHub.

# iPython notebook cell
!pip install tensorflow==2.1.0
!pip install pandas
!pip install gretel-client
!pip install gretel-synthetics


import getpass
import os
from gretel_client import get_cloud_client


gretel_api_key = os.getenv("GRETEL_API_KEY") or getpass.getpass("Your Gretel API Key")
client = get_cloud_client("api", api_key=gretel_api_key)
client.install_packages()
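
A key pasted from the clipboard can pick up stray whitespace, and an empty value only fails later with a less obvious error. The sketch below wraps the same environment-variable-then-prompt lookup used above in a small function that trims and validates the key. `resolve_api_key` is a hypothetical name, not part of gretel-client, and its two parameters exist only so the logic can be exercised without a real prompt.

```python
import getpass
import os


def resolve_api_key(environ=None, prompt=getpass.getpass):
    """Return a non-empty API key from GRETEL_API_KEY or an interactive prompt."""
    environ = os.environ if environ is None else environ
    # Prefer the environment variable, as in the cell above
    key = (environ.get("GRETEL_API_KEY") or "").strip()
    if not key:
        # Fall back to a hidden prompt so the key is not echoed
        key = prompt("Your Gretel API Key: ").strip()
    if not key:
        raise ValueError("No API key provided")
    return key
```

The result can then be passed as `api_key=resolve_api_key()` in the `get_cloud_client` call shown above.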

Train the model and generate synthetic data

Load the source from CSV into a Pandas Dataframe, add or drop any columns, configure training parameters, and train the model. We recommend at least 5,000 rows of training data when possible.

# Load dataset and build synthetic model
from pathlib import Path

import pandas as pd
from gretel_helpers.synthetics import SyntheticDataBundle


# Specify dataset
dataset_path = 'https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/healthcare-analytics-vidhya/train_data.csv'
nrows = 10000


# Configure model training parameters
checkpoint_dir = str(Path.cwd() / "checkpoints")
config_template = {
    "checkpoint_dir": checkpoint_dir,
    "dp": True, # enable differential privacy in training
    "epochs": 25, # recommend 15-30 epochs to train production models
    "gen_lines": nrows, # number of lines to generate in first batch
    "vocab_size": 20000
}


# Gretel helpers to optimize the synthetic model
training_df = pd.read_csv(dataset_path, nrows=nrows)
bundle = SyntheticDataBundle(
    training_df=training_df,
    auto_validate=True, # build record validators that learn per-column, these are used to ensure generated records have the same composition as the original
    synthetic_config=config_template, # the config for Synthetics
)
bundle.build()
bundle.train()
bundle.generate(max_invalid=nrows)
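
Training can take a while, so it may be worth checking the setup against this article's own guidance (at least 5,000 training rows, 15-30 epochs) before calling `bundle.train()`. The function below is a hypothetical pre-flight check written for illustration; it is not part of gretel-synthetics or the Gretel helpers, and its thresholds are only the recommendations stated here.

```python
def review_training_setup(config, n_rows, min_rows=5000, epoch_range=(15, 30)):
    """Return a list of human-readable warnings; empty means nothing flagged.

    Thresholds mirror this article's recommendations only.
    """
    warnings = []
    if n_rows < min_rows:
        warnings.append(f"only {n_rows} training rows; {min_rows}+ recommended")
    epochs = config.get("epochs")
    low, high = epoch_range
    if epochs is None:
        warnings.append("no 'epochs' value set")
    elif not low <= epochs <= high:
        warnings.append(f"epochs={epochs} is outside the recommended {low}-{high}")
    if config.get("gen_lines", 0) <= 0:
        warnings.append("gen_lines should be a positive number of records")
    return warnings


# Same values as the config_template above, minus the checkpoint path
example_config = {"epochs": 25, "gen_lines": 10000, "dp": True, "vocab_size": 20000}
print(review_training_setup(example_config, n_rows=10000))  # prints []
```

In the notebook, the call would be `review_training_setup(config_template, len(training_df))`.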

Compare the source and synthetic datasets

Use Gretel.ai’s reporting functionality to verify that the synthetic dataset contains the same correlations and insights as the original source data.

# Preview the synthetic Dataframe
bundle.synthetic_df()

# Generate a synthetic data report
bundle.generate_report()

# Save the synthetic dataset to CSV
bundle.synthetic_df().to_csv('synthetic-data.csv', index=False)

Download your new synthetic dataset, and explore correlations and insights in the synthetic data report!

gretel.ai synthetic data correlations — Comparing insights between the source and synthetic datasets
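
To make "same correlations" concrete, one simple spot check is to compute the Pearson correlation between a pair of numeric columns in both datasets and look at the difference. The sketch below uses only the standard library and toy, made-up numbers. The function names and column names are hypothetical, and it is an independent illustration, not how Gretel's report is computed. With pandas DataFrames, `DataFrame.corr()` gives the full correlation matrix in one call.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


def correlation_gap(source, synthetic, col_a, col_b):
    """Absolute difference in corr(col_a, col_b) between two column dicts."""
    r_source = pearson(source[col_a], source[col_b])
    r_synth = pearson(synthetic[col_a], synthetic[col_b])
    return abs(r_source - r_synth)


# Toy, made-up numbers purely to show the call shape
source = {"age": [20, 30, 40, 50], "visits": [1, 2, 3, 4]}
synthetic = {"age": [22, 31, 39, 52], "visits": [1, 2, 3, 5]}
print(round(correlation_gap(source, synthetic, "age", "visits"), 3))
```

A gap near zero for the column pairs that matter to your analysis is a good sign; a large gap suggests the model has not captured that relationship.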

Want to run through end to end?

Download our walkthrough notebook on Github, load the notebook in your local notebook server, connect your API key, and start creating synthetic data!

Conclusion

At Gretel.ai we are super excited about the possibility of using synthetic data to augment training sets, creating ML and AI models that generalize better to unknown data and carry reduced algorithmic bias. We'd love to hear about your use cases. Feel free to reach out for a more in-depth discussion in the comments, on Twitter, or at hi@gretel.ai. Like gretel-synthetics? Give us a ⭐ on GitHub!
