Synthetic data generation with a general-purpose LLM

Latest recommended article published on 2025-06-11 18:10:46

技术与健康

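Category column: LLM. Article tags: python, machine learning, development languages

This is an original article by the blogger and may not be reproduced without the blogger's permission.

Article link: https://blog.csdn.net/Practicer2015/article/details/141068672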

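Synthetic data generation with large language models (LLMs) offers a powerful solution to a common problem: the availability of high-quality, diverse, and privacy-compliant data. It can be used in many scenarios, such as training a data science machine learning model (SVMs, decision trees, KNNs), fine-tuning a different GPT model on the data, solving the cold-start problem, building compelling demos/apps with realistic data, scenario testing, and so on.

There are a number of key drivers that may lead you to use synthetic data.

Human data may have privacy restrictions and/or contain identifiable data that we do not want to use.
Synthetic data can be much more structured than real data, and therefore easier to manipulate.
In domains where data is sparse, or where data for certain categories is sparse, we may want to augment the data.
When dealing with imbalanced datasets or datasets that lack diversity, we may want to create data to improve the richness of the dataset.

Unlike traditional data augmentation or manual data creation methods, LLMs can generate rich, nuanced, and contextually relevant datasets, which makes the data considerably more useful to businesses and developers.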

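We split this tutorial into two parts. This guide covers the following agenda:

1. CSV with a structured prompt
2. CSV with a Python program
3. Multitable CSV with a Python program
4. Simply creating textual data
5. Dealing with imbalanced or non-diverse textual data

In part 2, we will look at prompting strategies for getting better textual data.

The last two approaches in particular are useful for creating synthetic data to fine-tune another GPT model. For example, you can use higher-quality data produced by gpt-4o to fine-tune the cheaper and quicker gpt-3.5-turbo, improving performance while reducing cost.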

Getting set up
%pip install openai
%pip install pandas
%pip install scikit-learn
%pip install matplotlib

from openai import OpenAI
import os
import re
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import json
import matplotlib

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))

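1. CSV with a structured prompt

Here we create data in the simplest way. You can quickly generate data by addressing 3 key points: telling the model the format of the data (CSV), the schema, and useful information about how the columns relate to each other (the LLM can deduce this from the column names, but spelling it out improves performance).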

datagen_model = "gpt-4o-mini"
question = """
Create a CSV file with 10 rows of housing data.
Each row should include the following fields:
 - id (incrementing integer starting at 1)
 - house size (m^2)
 - house price
 - location
 - number of bedrooms

Make sure that the numbers make sense (i.e. more rooms is usually bigger size, more expensive locations increase price. more size is usually higher price etc. make sure all the numbers make sense). Also only respond with the CSV.
"""

response = client.chat.completions.create(
  model=datagen_model,
  messages=[
    {"role": "system", "content": "You are a helpful assistant designed to generate synthetic data."},
    {"role": "user", "content": question}
  ]
)
res = response.choices[0].message.content
print(res)

id,house_size_m2,house_price,location,number_of_bedrooms
1,50,150000,Suburb,2
2,75,250000,City Center,3
3,100,350000,Suburb,4
4,120,450000,Suburb,4
5,80,300000,City Center,3
6,90,400000,City Center,3
7,150,600000,Prime Area,5
8,200,750000,Prime Area,5
9,55,180000,Suburb,2
10,300,950000,Prime Area,6
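Before such a response is used downstream, it is worth checking that it really is well-formed CSV. The following is a minimal sketch using only Python's standard `csv` module; the helper name `parse_csv_response` and the `sample_csv` string (a stand-in for `res`) are illustrative assumptions, not part of the original notebook.

```python
import csv
import io

def parse_csv_response(text):
    """Parse CSV text returned by the model into a list of dicts.

    Raises ValueError if a row's field count differs from the header's.
    """
    # Drop blank lines and surrounding whitespace the model may add
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    reader = csv.reader(io.StringIO("\n".join(lines)))
    header = next(reader)
    rows = []
    for line_no, fields in enumerate(reader, start=2):
        if len(fields) != len(header):
            raise ValueError(
                f"line {line_no}: expected {len(header)} fields, got {len(fields)}"
            )
        rows.append(dict(zip(header, fields)))
    return rows

# Hypothetical sample standing in for the model response `res`
sample_csv = """id,house_size_m2,house_price,location,number_of_bedrooms
1,50,150000,Suburb,2
2,75,250000,City Center,3
"""
rows = parse_csv_response(sample_csv)
print(rows[1]["location"])  # → City Center
```

All values come back as strings, so numeric columns still need converting (for example with `int`) before any analysis.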

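2. CSV with a Python program

The issue with generating data directly is that the amount of data we can generate is limited by the context window. Instead, we can ask the LLM to generate a Python program that generates the synthetic data. This lets us scale to much more data, and inspecting the Python program also shows us how the data was generated.

We can then edit the Python program as we see fit, while having a good starting point.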

question = """
Create a Python program to generate 100 rows of housing data.
I want you to at the end of it output a pandas dataframe with 100 rows of data.
Each row should include the following fields:
 - id (incrementing integer starting at 1)
 - house size (m^2)
 - house price
 - location
 - number of bedrooms

Make sure that the numbers make sense (i.e. more rooms is usually bigger size, more expensive locations increase price. more size is usually higher price etc. make sure all the numbers make sense).
"""

response = client.chat.completions.create(
  model=datagen_model,
  messages=[
    {"role": "system", "content": "You are a helpful assistant designed to generate synthetic data."},
    {"role": "user", "content": question}
  ]
)
res = response.choices[0].message.content
print(res)

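Certainly! Below is a Python program that generates synthetic housing data according to your requirements. We will create a pandas DataFrame with the defined fields and characteristics.

import pandas as pd
import random

def generate_housing_data(num_rows):
    data = []

    locations = [
        ('City Center', 10000, 150),  # (location name, base price per m², base size)
        ('Suburban', 8000, 100),
        ('Countryside', 5000, 80),
        ('Coastal Region', 12000, 110),
        ('Urban Neighborhood', 9000, 130)
    ]

    for i in range(1, num_rows + 1):
        # Randomly pick a location
        location, base_price_per_m2, base_size = random.choice(locations)

        # Generate number of bedrooms (1 to 5)
        number_of_bedrooms = random.randint(1, 5)

        # Calculate house size based on the number of bedrooms
        house_size = base_size + (10 * number_of_bedrooms) + random.randint(-5, 15)  # Adding some noise

        # Calculate house price based on house size and location
        house_price = base_price_per_m2 * house_size + random.randint(-5000, 10000)  # Adding some noise

        # Append the generated data to the list
        data.append({
            'id': i,
            'house_size_m2': house_size,
            'house_price': house_price,
            'location': location,
            'number_of_bedrooms': number_of_bedrooms
        })

    # Create a pandas DataFrame
    df = pd.DataFrame(data)
    return df

# Generate 100 rows of housing data
housing_data_df = generate_housing_data(100)

# Show the result
print(housing_data_df)

### Explanation:

The generate_housing_data function creates synthetic housing data for a specified number of rows (num_rows).
We define different locations with a corresponding base price per square meter and average house size.
For each house, we randomly select a location and a number of bedrooms, then calculate the house size and price so that the values correlate sensibly.
Finally, we create a pandas DataFrame from the generated data and return it.

You can run this program in your Python environment, and it will output a DataFrame containing 100 rows of synthetic housing data.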

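We need to make sure we parse this output appropriately, as there is often surrounding text around the Python code. We can also explicitly ask the model to state all the assumptions it made about the data being generated; in this case, however, it told us automatically.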
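One way to separate the program from the surrounding text is sketched below. It assumes the model wraps its code in Markdown code fences, which is common but not guaranteed; `extract_python_code` is a hypothetical helper name, and the fallback simply returns the whole response when no fence is found.

```python
import re

FENCE = "`" * 3  # a Markdown code fence, built this way so it can be shown inside one

def extract_python_code(response_text):
    """Return the contents of all fenced code blocks, or the whole text if none are found."""
    pattern = re.compile(FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)
    blocks = pattern.findall(response_text)
    if not blocks:
        return response_text.strip()
    return "\n".join(block.strip() for block in blocks)

# Hypothetical response text, standing in for `res`
sample = (
    "Certainly! Here is the program.\n"
    + FENCE + "python\n"
    + "x = 1 + 1\nprint(x)\n"
    + FENCE + "\n"
    + "### Explanation\nThis prints 2.\n"
)
code = extract_python_code(sample)
print(code)
```

Generated code should be read before it is run, since the model can produce programs that are wrong or that do more than was asked.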

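3. Multitable CSV with a Python program

For more complex relationships, however, we need to specify a few more characteristics.

To create multiple datasets that relate to each other (for example housing, location, house type), we need to specify the format, the schema, and useful information as before. However, more useful information is now required to get good performance. What is needed is case-specific, but a good number of things to describe include: how the datasets relate to each other, the size of the datasets relative to one another, making sure foreign and primary keys are generated correctly, and ideally using previously generated datasets to populate new ones so that the actual data values match where necessary.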

question = """
Create a Python program to generate 3 different pandas dataframes.

1. Housing data
I want 100 rows. Each row should include the following fields:
 - id (incrementing integer starting at 1)
 - house size (m^2)
 - house price
 - location
 - number of bedrooms
 - house type
 + any relevant foreign keys

2. Location
Each row should include the following fields:
 - id (incrementing integer starting at 1)
 - country
 - city
 - population
 - area (m^2)
 + any relevant foreign keys

 3. House types
 - id (incrementing integer starting at 1)
 - house type
 - average house type price
 - number of houses
 + any relevant foreign keys

Make sure that the numbers make sense (i.e. more rooms is usually bigger size, more expensive locations increase price. more size is usually higher price etc. make sure all the numbers make sense).
Make sure that the dataframe generally follow common sense checks, e.g. the size of the dataframes make sense in comparison with one another.
Make sure the foreign keys match up and you can use previously generated dataframes when creating each consecutive dataframes.
You can use the previously generated dataframe to generate the next dataframe.
"""

response = client.chat.completions.create(
  model=datagen_model,
  messages=[
    {"role": "system", "content": "You are a helpful assistant designed to generate synthetic data."},
    {"role": "user", "content": question}
  ]
)
res = response.choices[0].message.content
print(res)

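Certainly! Below is a Python program that generates the three specified pandas DataFrames for housing data, location data, and house types. Each DataFrame will include the necessary fields, and the foreign keys will ensure proper relationships among them.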

import pandas as pd
import numpy as np

# Set a random seed for reproducibility
np.random.seed(0)

# Function to generate the location DataFrame
def generate_location_data(num_locations):
    locations = {
        "id": range(1, num_locations + 1),
        "country": np.random.choice(['USA', 'Canada', 'UK'], num_locations),
        "city": np.random.choice(['New York', 'Toronto', 'London', 'Vancouver', 'Manchester'], num_locations),
        "population": np.random.randint(50000, 1000000, num_locations),
        "area": np.random.randint(10000, 500000, num_locations)
    }
    return pd.DataFrame(locations)

# Function to generate the house types DataFrame
def generate_house_type_data(num_house_types):
    house_types = {
        "id": range(1, num_house_types + 1),
        "house_type": np.random.choice(['Detached', 'Semi-Detached', 'Terraced', 'Flat'], num_house_types),
        "average_house_type_price": np.random.randint(100000, 1000000, num_house_types),
        "number_of_houses": np.random.randint(10, 1000, num_house_types)
    }
    return pd.DataFrame(house_types)

# Function to generate the housing DataFrame
def generate_housing_data(num_houses, location_df, house_type_df):
    house_sizes = np.random.randint(50, 300, num_houses)  # size in m^2
    location_ids = np.random.choice(location_df['id'], num_houses)
    house_type_ids = np.random.choice(house_type_df['id'], num_houses)

    # Generate prices based on size, location, and house type
    house_prices = (house_sizes * np.random.randint(2000, 5000, num_houses) // 10) + \
                   (location_ids * 1000) + \
                   (house_type_df.loc[house_type_ids - 1, 'average_house_type_price'].values // 4)

    housing_data = {
        "id": range(1, num_houses + 1),
        "house_size": house_sizes,
        "house_price": house_prices,
        "location_id": location_ids,
        "bedrooms": np.random.randint(1, 6, num_houses),
        "house_type_id": house_type_ids
    }

    return pd.DataFrame(housing_data)

# Generate the DataFrames
num_locations = 10
num_house_types = 4
num_houses = 100

location_df = generate_location_data(num_locations)
house_type_df = generate_house_type_data(num_house_types)
housing_df = generate_housing_data(num_houses, location_df, house_type_df)

# Display the generated DataFrames
print("Location DataFrame:")
print(location_df.head(), "\n")

print("House Types DataFrame:")
print(house_type_df.head(), "\n")

print("Housing DataFrame:")
print(housing_df.head(), "\n")

# Print the DataFrame shapes
print(f"Shapes:\nLocation: {location_df.shape}, House Types: {house_type_df.shape}, Housing: {housing_df.shape}")

 
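### Explanation of the code:
1. Location DataFrame:
   - Generates random locations with attributes such as country, city, population, and area.

2. House Types DataFrame:
   - Generates different types of houses along with average prices and the quantity available.

3. Housing DataFrame:
   - Generates housing data with prices that increase with house size, location, and house type, while also ensuring foreign keys (IDs) for location and house type.

### Output:
The three generated DataFrames will be logically related to each other, with consistent data types and primary/foreign key relationships, giving a coherent representation of the housing dataset. The output shows the head of each DataFrame and their shapes for verification.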
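The claim that the foreign keys line up is cheap to verify rather than take on trust. Below is a minimal, library-free sketch of such a check; the function name `find_orphan_keys` and the tiny sample rows are hypothetical, and the same idea applies to the DataFrames above.

```python
def find_orphan_keys(child_rows, child_key, parent_rows, parent_key="id"):
    """Return the sorted foreign-key values in child_rows that have no matching parent row."""
    parent_ids = {row[parent_key] for row in parent_rows}
    return sorted({row[child_key] for row in child_rows} - parent_ids)

# Hypothetical miniature tables, for illustration only
locations = [{"id": 1, "city": "London"}, {"id": 2, "city": "Toronto"}]
houses = [
    {"id": 1, "location_id": 1},
    {"id": 2, "location_id": 2},
    {"id": 3, "location_id": 5},  # deliberately broken reference
]
print(find_orphan_keys(houses, "location_id", locations))  # → [5]
```

With pandas, an equivalent check would be `housing_df[~housing_df["location_id"].isin(location_df["id"])]`, which should come back empty when every key matches.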

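4. Simply creating textual data

Here we take a first look at creating textual data. This can be used, for example, to fine-tune another GPT model. In this case we imagine ourselves as a retailer trying to streamline the process of writing descriptions for the items it sells. We again need to specify the format of the data; in particular, we want a format that is easy to parse as output.

The example we consider below is one in which we want to create input-output training pairs for a GPT model to fine-tune on. The input is the product's name and the category it belongs to, and the output is a description.

Specifying the structure of the output explicitly, and giving an instruction not to deviate from it, helps enforce that structure. You can run this in a loop and append the data to generate more synthetic data. As before, we need to parse the data well so that our downstream code does not break.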

output_string = ""
for i in range(3):
  question = f"""
  I am creating input output training pairs to fine tune my gpt model. The usecase is a retailer generating a description for a product from a product catalogue. I want the input to be product name and category (to which the product belongs to) and output to be description.
  The format should be of the form:
  1.
  Input: product_name, category
  Output: description
  2.
  Input: product_name, category
  Output: description

  Do not add any extra characters around that formatting as it will make the output parsing break.
  Create as many training pairs as possible.
  """

  response = client.chat.completions.create(
    model=datagen_model,
    messages=[
      {"role": "system", "content": "You are a helpful assistant designed to generate synthetic data."},
      {"role": "user", "content": question}
    ]
  )
  res = response.choices[0].message.content
  output_string += res + "\n" + "\n"
print(output_string[:1000]) #displaying truncated response

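1.
Input: Wireless Bluetooth Headphones, Electronics
Output: Immerse yourself in high-quality sound with these Wireless Bluetooth Headphones, featuring active noise cancellation and a comfortable over-ear design for extended listening sessions.

2.
Input: Organic Green Tea, Beverages
Output: Enjoy a refreshing cup of Organic Green Tea, made from the finest leaves and packed with antioxidants, perfect for a healthy, invigorating boost at any time.

3.
Input: Stainless Steel Kitchen Knife, Kitchenware
Output: Cut with precision and ease using this Stainless Steel Kitchen Knife, designed with an ergonomic handle and a sharp blade for all your culinary tasks.

4.
Input: Hiking Backpack, Outdoor Gear
Output: Explore the outdoors with this durable Hiking Backpack, featuring multiple compartments for optimal organization and a breathable design for ultimate comfort on long treks.

5.
Input: Air Fryer, Kitchen Appliances
Output: Cook your favorite meals with less oil using this Air Fryer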

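Note: the above output is truncated. We can now parse it as below to get a list of products, categories, and their descriptions. For example, let's take a look at the products it generated.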

#regex to parse data
pattern = re.compile(r'Input:\s*(.+?),\s*(.+?)\nOutput:\s*(.+?)(?=\n\n|\Z)', re.DOTALL)
matches = pattern.findall(output_string)
products = []
categories = []
descriptions = []

for match in matches:
    product, category, description = match
    products.append(product.strip())
    categories.append(category.strip())
    descriptions.append(description.strip())
products

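['Wireless Bluetooth Headphones',
 'Organic Green Tea',
 'Stainless Steel Kitchen Knife',
 'Hiking Backpack',
 'Air Fryer',
 "Kids' Educational Tablet",
 'Bluetooth Speaker',
 'Yoga Mat',
 'Memory Foam Mattress',
 'Smartwatch',
 'Leather Wallet',
 'Portable Phone Charger',
 'Non-Stick Cookware Set',
 'Pet Dog Bed',
 'Fitness Tracker',
 'Wireless Earbuds',
 'Organic Green Tea',
 'Reusable Water Bottle',
 'Yoga Mat',
 'Leather Wallet',
 'Air Fryer',
 'Gaming Mouse',
 'Crochet Kit',
 'Hiking Boots',
 'Scented Candles',
 'Bluetooth Speaker',
 'Stainless Steel Cookware Set',
 'Fitness Tracker',
 'Decorative Throw Pillows',
 'Eco-Friendly Cleaning Supplies',
 'Wireless Noise Cancelling Headphones',
 'Organic Green Tea',
 'Adjustable Yoga Mat',
 'Bluetooth Smart Scale',
 'Stainless Steel Water Bottle',
 'Soft Cotton Bedding Set',
 'Multi-Functional Kitchen Blender',
 'Eco-Friendly Reusable Bags',
 'Portable Phone Charger',
 'Classic Leather Wallet',
 'Suede Chelsea Boots',
 'Non-Stick Cookware Set',
 'Pet-Friendly Indoor Plants',
 'High-Protein Snack Bars',
 'LED Desk Lamp with USB Port']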
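Since the stated goal is fine-tuning another GPT model, the parsed lists eventually have to be written out as training records. The sketch below assumes a chat-style JSONL layout, with one JSON object per line holding a `messages` list; the exact schema and the system prompt text are assumptions made here for illustration and should be checked against the fine-tuning documentation of the model you target. `to_finetune_jsonl` is a hypothetical helper name.

```python
import json

def to_finetune_jsonl(products, categories, descriptions,
                      system_prompt="You write product descriptions for a retailer."):
    """Build one JSON line per training pair in a chat-style 'messages' layout."""
    lines = []
    for product, category, description in zip(products, categories, descriptions):
        record = {
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": f"{product}, {category}"},
                {"role": "assistant", "content": description},
            ]
        }
        # ensure_ascii=False keeps any non-ASCII product names readable
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)

# Hypothetical single pair, for illustration only
jsonl = to_finetune_jsonl(
    ["Yoga Mat"], ["Fitness"], ["A non-slip mat for everyday practice."]
)
print(jsonl)
```

The list above also contains repeated products (for example 'Organic Green Tea' appears three times), so deduplicating before writing the file may be worthwhile.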

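5. Dealing with imbalanced or non-diverse textual data

Some of the most important aspects of generating high-quality synthetic data are accuracy (does the data make sense), consistency (are two separate data points for the same input roughly the same), and diversity (making sure our data distribution matches the distribution that exists in production as closely as possible).

To increase the diversity of our data, we start by clustering it. This tells us which clusters are underrepresented (an imbalanced dataset) and which data is not addressed at all (widening the data distribution). We then either suggest new clusters (using a self-reflection type call to GPT) or ask the next iteration of our synthetic generation calls to explicitly target the underrepresented clusters.

We can then run this generate-and-analyze clustering loop recursively to automatically generate diverse synthetic data.

For demonstration purposes, we explicitly prompt the LLM to generate information about 4 different topic areas: vehicle, clothing, toiletries, food. We will then cluster the data and see whether it manages to find these 4 topic areas.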

output_string = ""
for i in range(3):
  question = f"""
  I am creating input output training pairs to fine tune my gpt model. I want the input to be product name and category and output to be description. the category should be things like: mobile phones, shoes, headphones, laptop, electronic toothbrush, etc. and also more importantly the categories should come under 4 main topics: vehicle, clothing, toiletries, food)
  After the number of each example also state the topic area. The format should be of the form:
  1. topic_area
  Input: product_name, category
  Output: description

  Do not add any extra characters around that formatting as it will make the output parsing break.

  Here are some helpful examples so you get the style of output correct.

  1) clothing
  Input: "Shoe Name, Shoes"
  Output: "Experience unparalleled comfort. These shoes feature a blend of modern style and the traditional superior cushioning, perfect for those always on the move."
  """

  response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
      {"role": "system", "content": "You are a helpful assistant designed to generate synthetic data."},
      {"role": "user", "content": question}
    ]
  )
  res = response.choices[0].message.content
  output_string += res + "\n" + "\n"
print(output_string[:1000]) #displaying truncated response

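1. vehicle
Input: "Tesla Model 3, Electric Car"
Output: "The Tesla Model 3 is a revolutionary electric car with impressive range and cutting-edge technology, designed to provide an exhilarating driving experience while minimizing environmental impact."

2. clothing
Input: "Nike Air Max, Shoes"
Output: "Elevate your sneaker game with Nike Air Max. Combining iconic style with superior comfort and support, these shoes are perfect for both workouts and casual outings."

3. toiletries
Input: "Oral-B Pro 1000, Electronic Toothbrush"
Output: "Achieve a superior clean with the Oral-B Pro 1000. This electronic toothbrush features 3D cleaning action that pulsates and oscillates to remove more plaque than a regular manual toothbrush."

4. food
Input: "Chobani Greek Yogurt, Yogurt"
Output: "Indulge in a nutritious snack with Chobani Greek Yogurt. Packed with protein and delicious flavors, it's the perfect choice for a healthy breakfast or a satisfying treat anytime."

5. vehicle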

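Note: the above output is truncated. In the example above, we explicitly include the topic area as part of the response for each example, as it helps condition the subsequent output and tends to give better performance. We also give the model an actual example of what the output should look like, so that it gets the right idea of the style of output; this also helps enforce the structure.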

pattern = re.compile(r'(\d+)\.\s*(\w+)\s*Input:\s*"(.+?),\s*(.+?)"\s*Output:\s*"(.*?)"', re.DOTALL)
matches = pattern.findall(output_string)

topics = []
products = []
categories = []
descriptions = []

for match in matches:
    number, topic, product, category, description = match
    topics.append(topic)
    products.append(product)
    categories.append(category)
    descriptions.append(description)

products

['Tesla Model 3',
 'Nike Air Max',
 'Oral-B Pro 1000',
 'Chobani Greek Yogurt',
 'Ford F-150',
 "Levi's 511",
 'Philips Sonicare',
 'Quaker Oatmeal',
 'Toyota Camry',
 'Adidas Ultraboost',
 'Toyota Camry',
 'Nike Air Max',
 'Colgate Electric Toothbrush',
 'Blue Diamond Almonds',
 'Harley Davidson Fat Boy',
 'Adidas UltraBoost',
 'Dove Men Body Wash',
 'Quaker Oats',
 'Ford F-150',
 "Levi's 501 Jeans",
 'Tesla Model 3',
 'Nike Air Max',
 'Oral-B Pro 1000',
 'Organic Almond Butter',
 'Yamaha YZF-R3',
 'Adidas Ultraboost',
 'Philips Sonicare',
 'Organic Quinoa']

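We will now cluster the data to analyze it. We will use K-means clustering to segregate the data. An important parameter of K-means is K, the number of clusters.

We know there should be 4 clusters (4 topics), since we specified them in the prompt: vehicle, clothing, toiletries, food. In general, however, we do not know how many clusters exist in our data. We will therefore use the elbow method to find the optimal number of clusters.

In the elbow method, we iterate through a range of different values of K, storing the inertia each time. The inertia is the sum of the squared distances between each point in a cluster and that cluster's centroid, so it tells us how compact and well separated the clusters are. If we plot K against the inertia, we can see how the inertia drops, and we can set the optimal number of clusters at the point where the drop in inertia slows down (often producing an elbow shape).

First, let's store our data in a pandas DataFrame for ease of analysis.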
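Written out, and assuming $C_k$ denotes the set of points assigned to cluster $k$ and $\mu_k$ the centroid of that cluster (notation introduced here for illustration, not taken from the original text), the inertia for a given $K$ is:

```latex
\[
\mathrm{inertia}(K) \;=\; \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^{2},
\qquad
\mu_k \;=\; \frac{1}{\lvert C_k \rvert} \sum_{x \in C_k} x
\]
```

Because adding clusters generally shrinks this sum, the smallest inertia alone is not a useful criterion; the elbow method looks for the point where further clusters stop paying off.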

data = {
    'Product': products,
    'Category': categories,
    'Description': descriptions
}

df = pd.DataFrame(data)

Next, let's embed our data. The embeddings are what we will cluster, since items that are similar should lie close to each other in vector space.

def get_embedding(text, model="text-embedding-3-small"):
    text = text.replace("\n", " ")

    response = client.embeddings.create(input=[text], model=model)

    return response.data[0].embedding

embedding_model = "text-embedding-3-small"
df["embedding"] = df.Category.apply(lambda x: get_embedding(x, model=embedding_model))

# Ensure there are embeddings to concatenate
if len(df.embedding.values) > 0:
    matrix = np.vstack(df.embedding.values)
else:
    matrix = np.array([])  # Handle the case where there are no embeddings

df

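   | Product | Category | Description | embedding
---|---|---|---|---
0 | Tesla Model 3 | Electric Car | The Tesla Model 3 is a revolutionary electric car... | [0.003255360759794712, -0.039260633289813995, ...
1 | Nike Air Max | Shoes | Elevate your sneaker game with Nike Air Max. C... | [0.03943369910120964, 0.022045187652111053, -0...
2 | Oral-B Pro 1000 | Electronic Toothbrush | Achieve a superior clean with the Oral-B Pro 1... | [-0.003470012918114662, -0.01911414973437786, ...
3 | Chobani Greek Yogurt | Yogurt | Indulge in a nutritious snack with Chobani Gre... | [0.0208318829536438, -0.02645781636238098, -0....
4 | Ford F-150 | Pickup Truck | The Ford F-150 is the ultimate pickup truck, ... | [0.007467855699360371, -0.05288049206137657, -...
5 | Levi's 511 | Jeans | Step out in style with Levi's 511 jeans. Featu... | [0.0037206460256129503, 0.022772302851080894, ...
6 | Philips Sonicare | Electronic Toothbrush | Discover a new level of oral care with the Phi... | [-0.00724813062697649, -0.011600878089666367, ...
7 | Quaker Oatmeal | Breakfast Cereal | Start your day right with Quaker Oatmeal. This... | [-0.006529285106807947, 0.007865572348237038, ...
8 | Toyota Camry | Sedan | The Toyota Camry stands out in the sedan categ... | [-0.02088991366326809, -0.006191295105963945, ...
9 | Adidas Ultraboost | Running Shoes | Run like never before in the Adidas Ultraboost... | [0.02679188922047615, 0.014639599248766899, 8....
10 | Toyota Camry | Car | The Toyota Camry is a reliable midsize sedan ... | [0.008056452497839928, -0.007912316359579563, ...
11 | Nike Air Max | Shoes | Step up your sneaker game with the Nike Air Ma... | [0.03943241760134697, 0.02208484522998333, -0....
12 | Colgate Electric Toothbrush | Electronic Toothbrush | Transform your oral hygiene routine with the C... | [-0.003470012918114662, -0.01911414973437786, ...
13 | Blue Diamond Almonds | Nuts | Snack healthy with Blue Diamond Almonds. These... | [-0.013289917260408401, 0.036334190517663956, ...
14 | Harley Davidson Fat Boy | Motorcycle | Experience the thrill of the open road ... | [0.012365399859845638, 0.03552943095564842, -0...
15 | Adidas UltraBoost | Sneakers | Enjoy a perfect blend of comfort and performan... | [0.013107392005622387, 0.02963760495185852, -0...
16 | Dove Men Body Wash | Body Wash | Refresh and hydrate your skin with Dove Men... | [0.03760576993227005, -0.008475445210933685, -...
17 | Quaker Oats | Oats | Start your day right with Quaker Oats. Packed ... | [-0.00903365109115839, 0.00896345917135477, 0....
18 | Ford F-150 | Truck | The Ford F-150 is a durable and dependable veh... | [0.023461222648620605, -0.026651185005903244, ...
19 | Levi's 501 Jeans | Jeans | Discover the timeless style of Levi's 501 Jean... | [0.003762696636840701, 0.02275814116001129, -0...
20 | Tesla Model 3 | Mobile Phones | Explore the future of driving with the Tesla M... | [0.03703858703374863, 0.03407958149909973, 0.0...
21 | Nike Air Max | Shoes | Step up your game with the Nike Air Max. Desig... | [0.03943369910120964, 0.022045187652111053, -0...
22 | Oral-B Pro 1000 | Electronic Toothbrush | Achieve a superior clean with the Oral-B Pro 1... | [-0.003470012918114662, -0.01911414973437786, ...
23 | Organic Almond Butter | Food | Indulge in the creamy goodness of Organic Almo... | [-0.014613640494644642, -0.002179765608161688, ...
24 | Yamaha YZF-R3 | Mobile Phones | Introducing the Yamaha YZF-R3, the ultimate sp... | [0.03703858703374863, 0.03407958149909973, 0.0...
25 | Adidas Ultraboost | Shoes | Discover the Adidas Ultraboost, a shoe that ... | [0.03944042697548866, 0.022062409669160843, -0...
26 | Philips Sonicare | Electronic Toothbrush | Experience the dental care revolution with Phi... | [-0.003470012918114662, -0.01911414973437786, ...
27 | Organic Quinoa | Food | Nourish your body with Organic Quinoa, a nutri... | [-0.014613640494644642, -0.002179765608161688, ...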
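To make "close in vector space" concrete, here is a small sketch that compares vectors by cosine similarity. The 3-dimensional vectors are made-up stand-ins for real embeddings, which have far more dimensions, and `cosine_similarity` is written out by hand purely for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for category embeddings (made-up values)
shoes = [0.9, 0.1, 0.0]
running_shoes = [0.8, 0.2, 0.1]
yogurt = [0.0, 0.2, 0.9]

# Related categories should score higher than unrelated ones
print(cosine_similarity(shoes, running_shoes) > cosine_similarity(shoes, yogurt))  # → True
```

K-means itself works with Euclidean distance rather than cosine similarity; for vectors of unit length the two give the same ordering, since the squared distance equals 2 minus twice the cosine.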

We now perform the elbow method.

#Determine the optimal number of clusters using the elbow method
inertias = []
range_of_clusters = range(1, 13)  # Adjust the range as necessary

for n_clusters in range_of_clusters:
    kmeans = KMeans(n_clusters=n_clusters, init="k-means++", random_state=42, n_init=10)
    kmeans.fit(matrix)
    inertias.append(kmeans.inertia_)

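This gives us a chart, and we have to judge visually where the optimal number of clusters is. We can see below that the inertia decreases gradually rather than with a sharp drop, but the steepest decrease appears to occur around 3, 4, or 5 clusters, which lines up with our expectations given our prompt.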

#Plotting the elbow plot
plt.figure(figsize=(10, 6))
plt.plot(range_of_clusters, inertias, '-o')
plt.title('Elbow Method to Determine Optimal Number of Clusters')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.xticks(range_of_clusters)
plt.show()

Figure: elbow chart
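Judging the elbow by eye can be supplemented with a simple numeric heuristic. The sketch below picks the K at which the curve bends most sharply, using the discrete second difference of the inertias. This is one common heuristic among several, not the method used in the original notebook, and the inertia values are made up for illustration.

```python
def elbow_by_second_difference(ks, inertias):
    """Pick the K where the inertia curve bends most sharply.

    Uses the discrete second difference; needs at least three points.
    """
    if len(ks) != len(inertias) or len(ks) < 3:
        raise ValueError("need matching lists with at least three points")
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(ks) - 1):
        # Large positive values mean a steep drop followed by a flat stretch
        bend = inertias[i - 1] - 2 * inertias[i] + inertias[i + 1]
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k

# Made-up inertias: a steep drop until K=4, then a flat tail
ks = [1, 2, 3, 4, 5, 6]
made_up_inertias = [100.0, 70.0, 45.0, 20.0, 17.0, 15.0]
print(elbow_by_second_difference(ks, made_up_inertias))  # → 4
```

On a curve that declines gradually, as the text describes for this dataset, such a heuristic can be unstable, so it is better treated as a sanity check than as a replacement for looking at the plot.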

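For demonstration purposes, we will pick 5 as the optimal number of clusters, to show that it does not matter exactly where we pick it as long as we are approximately right. There are numerous correct ways to categorize data. We also store which cluster each data point belongs to.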

n_clusters = 5

kmeans = KMeans(n_clusters=n_clusters, init="k-means++", random_state=42)
kmeans.fit(matrix)
labels = kmeans.labels_
df["Cluster"] = labels

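We will now analyze the cluster data. There are two separate things we will look to address: 1. imbalanced data, 2. expanding the data distribution.

First, for imbalanced data, we count the number of examples in each cluster. Then we select a few examples from each cluster at random and ask the LLM what topics these map to.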

cluster_counts = df["Cluster"].value_counts().sort_index()
print(cluster_counts)

Cluster
0    5
1    7
2    8
3    6
4    2
Name: count, dtype: int64
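Spotting the underrepresented clusters can also be done programmatically. The sketch below flags clusters whose count falls below a fraction of the mean count; the function name and the `tolerance` threshold are arbitrary choices made here for illustration, and the counts are copied from the output above.

```python
def underrepresented_clusters(counts, tolerance=0.5):
    """Return cluster ids whose count is below tolerance * the mean count."""
    mean_count = sum(counts.values()) / len(counts)
    return sorted(c for c, n in counts.items() if n < tolerance * mean_count)

# Counts copied from the printed value_counts output
counts_from_output = {0: 5, 1: 7, 2: 8, 3: 6, 4: 2}
print(underrepresented_clusters(counts_from_output))  # → [4]
```

With the pandas Series from the notebook, `cluster_counts.to_dict()` would give a dictionary of the same shape.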

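We can see that the topics found here (Automotive, Personal Care, Footwear, and Food, as listed in the topic mapping further below) match our initial prompt (vehicle, clothing, toiletries, food) well, but not exactly.

Since we chose 5 clusters, the vehicle examples were split across two clusters: cluster 4 holds the two vehicles that were generated with the category "Mobile Phones". This should not affect us much downstream.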

df

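   | Product | Category | Description | embedding | Cluster
---|---|---|---|---|---
0 | Tesla Model 3 | Electric Car | The Tesla Model 3 is a revolutionary electric car... | [0.003255360759794712, -0.039260633289813995, ... | 1
1 | Nike Air Max | Shoes | Elevate your sneaker game with Nike Air Max. C... | [0.03943369910120964, 0.022045187652111053, -0... | 2
2 | Oral-B Pro 1000 | Electronic Toothbrush | Achieve a superior clean with the Oral-B Pro 1... | [-0.003470012918114662, -0.01911414973437786, ... | 1
3 | Chobani Greek Yogurt | Yogurt | Indulge in a nutritious snack with Chobani Gre... | [0.0208318829536438, -0.02645781636238098, -0.... | 3
4 | Ford F-150 | Pickup Truck | The Ford F-150 is the ultimate pickup truck, ... | [0.007467855699360371, -0.05288049206137657, -... | 0
5 | Levi's 511 | Jeans | Step out in style with Levi's 511 jeans. Featu... | [0.0037206460256129503, 0.022772302851080894, ... | 2
6 | Philips Sonicare | Electronic Toothbrush | Discover a new level of oral care with the Phi... | [-0.00724813062697649, -0.011600878089666367, ... | 1
7 | Quaker Oatmeal | Breakfast Cereal | Start your day right with Quaker Oatmeal. This... | [-0.006529285106807947, 0.007865572348237038, ... | 3
8 | Toyota Camry | Sedan | The Toyota Camry stands out in the sedan categ... | [-0.02088991366326809, -0.006191295105963945, ... | 0
9 | Adidas Ultraboost | Running Shoes | Run like never before in the Adidas Ultraboost... | [0.02679188922047615, 0.014639599248766899, 8.... | 2
10 | Toyota Camry | Car | The Toyota Camry is a reliable midsize sedan ... | [0.008056452497839928, -0.007912316359579563, ... | 0
11 | Nike Air Max | Shoes | Step up your sneaker game with the Nike Air Ma... | [0.03943241760134697, 0.02208484522998333, -0.... | 2
12 | Colgate Electric Toothbrush | Electronic Toothbrush | Transform your oral hygiene routine with the C... | [-0.003470012918114662, -0.01911414973437786, ... | 1
13 | Blue Diamond Almonds | Nuts | Snack healthy with Blue Diamond Almonds. These... | [-0.013289917260408401, 0.036334190517663956, ... | 3
14 | Harley Davidson Fat Boy | Motorcycle | Experience the thrill of the open road ... | [0.012365399859845638, 0.03552943095564842, -0... | 0
15 | Adidas UltraBoost | Sneakers | Enjoy a perfect blend of comfort and performan... | [0.013107392005622387, 0.02963760495185852, -0... | 2
16 | Dove Men Body Wash | Body Wash | Refresh and hydrate your skin with Dove Men... | [0.03760576993227005, -0.008475445210933685, -... | 1
17 | Quaker Oats | Oats | Start your day right with Quaker Oats. Packed ... | [-0.00903365109115839, 0.00896345917135477, 0.... | 3
18 | Ford F-150 | Truck | The Ford F-150 is a durable and dependable veh... | [0.023461222648620605, -0.026651185005903244, ... | 0
19 | Levi's 501 Jeans | Jeans | Discover the timeless style of Levi's 501 Jean... | [0.003762696636840701, 0.02275814116001129, -0... | 2
20 | Tesla Model 3 | Mobile Phones | Explore the future of driving with the Tesla M... | [0.03703858703374863, 0.03407958149909973, 0.0... | 4
21 | Nike Air Max | Shoes | Step up your game with the Nike Air Max. Desig... | [0.03943369910120964, 0.022045187652111053, -0... | 2
22 | Oral-B Pro 1000 | Electronic Toothbrush | Achieve a superior clean with the Oral-B Pro 1... | [-0.003470012918114662, -0.01911414973437786, ... | 1
23 | Organic Almond Butter | Food | Indulge in the creamy goodness of Organic Almo... | [-0.014613640494644642, -0.002179765608161688, ... | 3
24 | Yamaha YZF-R3 | Mobile Phones | Introducing the Yamaha YZF-R3, the ultimate sp... | [0.03703858703374863, 0.03407958149909973, 0.0... | 4
25 | Adidas Ultraboost | Shoes | Discover the Adidas Ultraboost, a shoe that ... | [0.03944042697548866, 0.022062409669160843, -0... | 2
26 | Philips Sonicare | Electronic Toothbrush | Experience the dental care revolution with Phi... | [-0.003470012918114662, -0.01911414973437786, ... | 1
27 | Organic Quinoa | Food | Nourish your body with Organic Quinoa, a nutri... | [-0.014613640494644642, -0.002179765608161688, ... | 3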

selected_examples = df.groupby('Cluster').apply(lambda x: x.sample(3, replace=True)).reset_index(drop=True)

#Format the selected examples
formatted_examples = "\n".join(
    f'Input: "{row["Product"]}, {row["Category"]}"\nOutput: "{row["Description"]}"\nCluster: "{row["Cluster"]}"'
    for _, row in selected_examples.iterrows()
)

topic_prompt = f"""
    I previously generated some examples of input output trainings pairs and then I clustered them based on category. From each cluster I picked 3 example data point which you can find below.
    I want you identify the broad topic areas these clusters belong to.
    Previous examples:
    {formatted_examples}


    Your output should be strictly of the format:
    Cluster: number, topic: topic
    Cluster: number, topic: topic
    Cluster: number, topic: topic

    Do not add any extra characters around that formatting as it will make the output parsing break.
    """

response = client.chat.completions.create(
  model=datagen_model,
  messages=[
    {"role": "system", "content": "You are a helpful assistant designed to analyze clustered data"},
    {"role": "user", "content": topic_prompt}
  ]
)
res = response.choices[0].message.content

pattern = r"Cluster: (\d+), topic: ([^\n]+)"
matches = re.findall(pattern, res)
clusters = [{"cluster": int(cluster), "topic": topic} for cluster, topic in matches]
json_output = json.dumps(clusters, indent=2)
print(json_output)

[
  {
    "cluster": 0,
    "topic": "Automotive"
  },
  {
    "cluster": 1,
    "topic": "Personal Care"
  },
  {
    "cluster": 2,
    "topic": "Footwear"
  },
  {
    "cluster": 3,
    "topic": "Food"
  },
  {
    "cluster": 4,
    "topic": "Automotive"
  }
]

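We now have the clusters and their counts, so we could prompt the LLM to generate more examples within the topics we want. For this example we won't take that further, as the data is already well split; you would simply follow the procedure above for prompting the model to generate data while passing in the underrepresented topics.

Next, we will try to increase the diversity of our data distribution.

We start in a similar way, by picking a few examples from each cluster at random and asking the LLM what topics these map to. In addition, in the same LLM call, we ask it to generate more topics to increase the diversity of our data. We do this in one call to save time and cost.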
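As a rough illustration of passing in the underrepresented topics, the sketch below builds an extra prompt fragment from per-topic counts. The helper name `build_targeted_prompt`, the wording of the fragment, and the counts are all hypothetical; the fragment would be appended to a generation prompt like the ones used earlier.

```python
def build_targeted_prompt(topic_counts, target_per_topic):
    """Ask for extra examples only for topics that are below the target count."""
    requests = [
        f"- {topic} (add {target_per_topic - count})"
        for topic, count in topic_counts.items()
        if count < target_per_topic
    ]
    if not requests:
        return ""  # nothing is underrepresented
    return (
        "Generate additional input output training pairs for these "
        "underrepresented topics:\n" + "\n".join(requests)
    )

# Hypothetical per-topic counts, for illustration only
prompt_addition = build_targeted_prompt({"Food": 6, "Footwear": 8, "Automotive": 7}, 8)
print(prompt_addition)
```

Stating the number of examples wanted per topic gives the model a concrete target, although the response should still be counted afterwards rather than trusted.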

selected_examples = df.groupby('Cluster').apply(lambda x: x.sample(3, replace=True)).reset_index(drop=True)

# Format the selected examples
formatted_examples = "\n".join(
    f'Input: "{row["Product"]}, {row["Category"]}"\nOutput: "{row["Description"]}"\nCluster: "{row["Cluster"]}"'
    for _, row in selected_examples.iterrows()
)

topic_prompt = f"""
    I previously generated some examples of input output trainings pairs and then I clustered them based on category. From each cluster I picked 3 example data point which you can find below.
    I want to promote diversity in my examples across categories so follow the procedure below:
    1. You must identify the broad topic areas these clusters belong to.
    2. You should generate further topic areas which don't exist so I can generate data within these topics to improve diversity.


    Previous examples:
    {formatted_examples}


    Your output should be strictly of the format:

    1. Cluster topic mapping
    Cluster: number, topic: topic
    Cluster: number, topic: topic
    Cluster: number, topic: topic

    2. New topics
    1. topic
    2. topic
    3. topic
    4. topic

    Do not add any extra characters around that formatting as it will make the output parsing break. It is very important you stick to that output format
    """

response = client.chat.completions.create(
  model=datagen_model,
  messages=[
    {"role": "system", "content": "You are a helpful assistant designed to analyze clustered data"},
    {"role": "user", "content": topic_prompt}
  ]
)
res = response.choices[0].message.content
print(res)

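1. Cluster topic mapping
Cluster: 0, topic: Automotive
Cluster: 1, topic: Personal Care
Cluster: 2, topic: Footwear
Cluster: 3, topic: Food
Cluster: 4, topic: Electric Vehicles

2. New topics
1. topic: Home Appliances
2. topic: Outdoor Equipment
3. topic: Smart Home Technology
4. topic: Fitness Equipment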

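We can see here again that we explicitly prompted the output structure the model should follow. We also told it the purpose of generating topics (to promote diversity) so that the model has the full context.

We then parse the data into a list of cluster-mapping JSON objects and a list of topics.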

parts = res.split("\n\n")
cluster_mapping_part = parts[0]
new_topics_part = parts[1]

#Parse cluster topic mapping
cluster_topic_mapping_lines = cluster_mapping_part.split("\n")[1:]  # Skip the first line (the section header)
cluster_topic_mapping = [{"cluster": int(line.split(",")[0].split(":")[1].strip()), "topic": line.split(":")[2].strip()} for line in cluster_topic_mapping_lines]

#Parse new topics
new_topics_lines = new_topics_part.split("\n")[1:]  # Skip the first line
new_topics = [line.split(". ")[1] for line in new_topics_lines]

cluster_topic_mapping, new_topics

([{'cluster': 0, 'topic': 'Automotive'},
  {'cluster': 1, 'topic': 'Personal Care'},
  {'cluster': 2, 'topic': 'Footwear'},
  {'cluster': 3, 'topic': 'Food'},
  {'cluster': 4, 'topic': 'Electric Vehicles'}],
 ['topic: Home Appliances',
  'topic: Outdoor Equipment',
  'topic: Smart Home Technology',
  'topic: Fitness Equipment'])
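Note that the new topics above still carry a leading `topic: ` prefix, and that splitting on blank lines breaks if the model adds or drops one. A more tolerant alternative, sketched below with regular expressions, is one possible approach rather than the notebook's own; `parse_topic_response` is a hypothetical name and the sample text is a shortened stand-in for `res`.

```python
import re

def parse_topic_response(text):
    """Parse the cluster mapping and the new-topic list from the model's reply."""
    mapping = [
        {"cluster": int(num), "topic": topic.strip()}
        for num, topic in re.findall(r"Cluster:\s*(\d+),\s*topic:\s*([^\n]+)", text)
    ]
    new_topics = []
    in_new_topics = False
    for line in text.splitlines():
        line = line.strip()
        if re.match(r"2\.\s*New topics", line):
            in_new_topics = True  # everything numbered after this header is a new topic
            continue
        # Numbered line; an optional "topic:" prefix is dropped
        match = re.match(r"\d+\.\s*(?:topic:\s*)?(.+)", line)
        if in_new_topics and match:
            new_topics.append(match.group(1).strip())
    return mapping, new_topics

# Shortened, hypothetical response text
sample = (
    "1. Cluster topic mapping\n"
    "Cluster: 0, topic: Automotive\n"
    "Cluster: 1, topic: Personal Care\n"
    "\n"
    "2. New topics\n"
    "1. topic: Home Appliances\n"
    "2. Outdoor Equipment\n"
)
mapping, topics = parse_topic_response(sample)
print(mapping)
print(topics)
```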

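Finally, we can use this information to further prompt the model to keep generating synthetic data. We do this by passing all the topics in the list of JSON objects to the prompt below.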

output_string = ""
for i in range(3):
  question = f"""
  I am creating input output training pairs to fine tune my gpt model. I want the input to be product name and category and output to be description. the category should be things like: mobile phones, shoes, headphones, laptop, electronic toothbrush, etc. and also more importantly the categories should come under some main topics: {[entry['topic'] for entry in cluster_topic_mapping]})
  After the number of each example also state the topic area. The format should be of the form:
  1. topic_area
  Input: product_name, category
  Output: description

  Do not add any extra characters around that formatting as it will make the output parsing break.

  Here are some helpful examples so you get the style of output correct.

  1) clothing
  Input: "Shoe Name, Shoes"
  Output: "Experience unparalleled comfort. These shoes feature a blend of modern style and the traditional superior cushioning, perfect for those always on the move."
  """

  response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
      {"role": "system", "content": "You are a helpful assistant designed to generate synthetic data."},
      {"role": "user", "content": question}
    ]
  )
  res = response.choices[0].message.content
  output_string += res + "\n" + "\n"
print(output_string)

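1. Automotive
Input: "Tesla Model S, Electric Vehicles"
Output: "The Tesla Model S delivers exhilarating performance with advanced electric technology, offering a sleek design, impressive range, and an industry-leading infotainment system."

2. Personal Care
Input: "Oral-B Pro 1000, Electronic Toothbrush"
Output: "The Oral-B Pro 1000 features a 3D cleaning action that oscillates, rotates, and pulsates to remove plaque, ensuring a deeper clean for healthier gums."

3. Footwear
Input: "Nike Air Max 270, Shoes"
Output: "Step into comfort and style with the Nike Air Max 270, designed with a large Max Air unit for superior cushioning and a breathable upper for a snug fit."

4. Electronics
Input: "Apple iPhone 12, Mobile Phones"
Output: "The Apple iPhone 12 combines powerful performance with stunning design, equipped with the A14 Bionic chip and an advanced camera system to capture every moment."

5. Food
Input: "Nature Valley Granola Bars, Snacks"
Output: "Nature Valley Granola Bars are made with simple, delicious ingredients and a crunchy texture, making them the perfect snack for an energy boost."

6. Automotive
Input: "Ford F-150, Electric Vehicles"
Output: "The Ford F-150 stands at the forefront of durability and innovation, with its powerful electric version setting new standards for strength and sustainability in the truck category."

7. Personal Care
Input: "Philips Sonicare, Electronic Toothbrush"
Output: "The Philips Sonicare delivers superior cleaning with dynamic technology that provides up to 31,000 strokes per minute for a healthier mouth and a brighter smile."

8. Footwear
Input: "Adidas Ultraboost, Shoes"
Output: "The Adidas Ultraboost is a game-changer in running footwear, featuring responsive cushioning and a knit upper for a snug, supportive fit that adapts to any run."

9. Electronics
Input: "Dell XPS 13, Laptop"
Output: "The Dell XPS 13 is a remarkable laptop with an ultra-thin design, featuring a stunning InfinityEdge display and powerful performance to meet your multitasking needs."

10. Food
Input: "Kraft Macaroni & Cheese, Instant Food"
Output: "Kraft Macaroni & Cheese offers quick and convenient comfort food, combining creamy cheese sauce with perfectly cooked pasta for a simple, satisfying meal."

1. Automotive
Input: "Toyota Camry, Mobile Phones"
Output: "The Toyota Camry is a midsize sedan that combines efficiency with modern technology. It offers a spacious interior and the latest features for an enjoyable driving experience."

2. Personal Care
Input: "Oral-B Pro 1000, Electronic Toothbrush"
Output: "The Oral-B Pro 1000 not only provides powerful cleaning action but also enhances your oral hygiene routine with its smart pressure sensor and various cleaning modes."

3. Footwear
Input: "Nike Air Max, Shoes"
Output: "Step into comfort with the Nike Air Max. With cutting-edge technology and a sleek design, these shoes are perfect for athletes and casual wearers alike."

4. Food
Input: "Nature's Valley Granola Bar, Food"
Output: "Savor the wholesome goodness of Nature's Valley Granola Bar, crafted with real ingredients to fuel your day with delicious flavor and crunchy satisfaction."

5. Electric Vehicles
Input: "Tesla Model 3, Mobile Phones"
Output: "The Tesla Model 3 is a revolutionary electric vehicle that combines performance with sustainability, featuring an intuitive interface and cutting-edge technology for an exceptional driving experience."

1. Automotive
Input: "Tesla Model 3, Electric Vehicles"
Output: "The Tesla Model 3 combines cutting-edge technology with eco-friendly driving. Enjoy a sleek design, impressive range, and top-notch safety features, making it the perfect electric car for the modern driver."

2. Personal Care
Input: "Oral-B Pro 1000, Electronic Toothbrush"
Output: "Achieve a superior clean with the Oral-B Pro 1000. Featuring advanced 3D cleaning action, this electronic toothbrush removes plaque effectively while being gentle on gums, helping you maintain optimum oral health."

3. Footwear
Input: "Nike Air Max, Shoes"
Output: "Step up your game with Nike Air Max shoes. Combining iconic cushioning technology and bold style, these shoes provide ultimate comfort and support, perfect for both casual wear and athletic performance."

4. Food
Input: "Oreo Cookies, Snacks"
Output: "Indulge in the classic taste of Oreo Cookies. With their irresistible cream filling sandwiched between two crunchy chocolate wafers, these treats are perfect for satisfying your sweet tooth at any time of day."

5. Personal Care
Input: "Garnier Micellar Water, Skincare"
Output: "Garnier Micellar Water gently removes makeup and impurities while hydrating the skin. This soothing formula is suitable for all skin types, making it a must-have in your daily skincare routine."

6. Automotive
Input: "Ford F-150, Truck"
Output: "The Ford F-150 is the quintessential pickup truck, combining power, reliability, and innovative technology. Equipped with advanced towing capabilities and a spacious interior, it's designed for both work and play."

7. Electronics
Input: "Samsung Galaxy S21, Mobile Phones"
Output: "Experience the future of mobile technology with the Samsung Galaxy S21. This smartphone features a stunning display, a powerful processor, and multiple camera options, perfect for capturing life's moments in high definition."

8. Footwear
Input: "Adidas Ultraboost, Shoes"
Output: "Run in style with Adidas Ultraboost shoes. Known for their comfort and performance, these shoes use responsive cushioning to give you unmatched energy return with every step you take."

9. Electronics
Input: "Dell XPS 13, Laptop"
Output: "The Dell XPS 13 redefines the laptop experience with its stunning InfinityEdge display, powerful performance, and sleek design. Ideal for professionals and students looking for portability and functionality."

10. Personal Care
Input: "Philips Sonicare, Electronic Toothbrush"
Output: "The Philips Sonicare electronic toothbrush uses advanced sonic technology to ensure a superior cleaning experience. This toothbrush not only helps remove plaque but also promotes healthier gums for a brighter smile."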

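You can run this in a loop to append to your previous data. In this way you can keep generating more textual synthetic data to train another GPT model, while making sure that you cater to imbalanced datasets and generate diverse data.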