LangChain v0.2 Documentation Translation: 2.12, Tutorials - Generating Synthetic Data

Latest recommended article published on 2024-08-20 09:32:47

Hugo_Hoo

Original link: https://python.langchain.com/v0.2/docs/tutorials/data_generation/

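Generating Synthetic Data

Introduction

Synthetic data is data that is generated artificially rather than collected from real-world events. It is used to mimic real data without compromising privacy or running into real-world constraints.

Benefits of synthetic data:

Privacy and security: no real personal data is at risk of being leaked.
Data augmentation: expands datasets for machine learning.
Flexibility: lets you create specific or rare scenarios.
Cost-effectiveness: usually cheaper than collecting real-world data.
Regulatory compliance: helps in dealing with strict data protection laws.
Model robustness: can lead to AI models that generalize better.
Rapid prototyping: allows quick testing without real data.
Controlled experimentation: simulates specific conditions.
Data access: an alternative when real data is unavailable.

Note: despite these benefits, synthetic data should be used with care, because it may not always capture the complexity of the real world.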

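Quick start

In this example, we walk through how to use the langchain library to generate synthetic medical billing records. This is particularly useful when you want to develop or test algorithms but would rather not use real patient data, whether because of privacy concerns or limited data availability.

Installation

First, install the langchain library and its dependencies. Because we use the OpenAI generator chain, we install langchain-openai as well. The generator lives in an experimental package, so langchain_experimental must also be included in the installation. We then import the necessary modules.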

# Install the required libraries
%pip install --upgrade --quiet langchain langchain_experimental langchain-openai
# Set the OPENAI_API_KEY environment variable, or load it from a .env file
# import dotenv
# dotenv.load_dotenv()

# Import the necessary modules
from langchain.prompts import FewShotPromptTemplate, PromptTemplate
from langchain_core.pydantic_v1 import BaseModel
from langchain_experimental.tabular_synthetic_data.openai import (
    OPENAI_TEMPLATE,
    create_openai_data_generator,
)
from langchain_experimental.tabular_synthetic_data.prompts import (
    SYNTHETIC_FEW_SHOT_PREFIX,
    SYNTHETIC_FEW_SHOT_SUFFIX,
)
from langchain_openai import ChatOpenAI
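
Since the generator needs the OPENAI_API_KEY environment variable mentioned above, it can help to check for it before making any calls. The following is a minimal sketch; the helper name has_openai_key is hypothetical and not part of langchain.

```python
import os


def has_openai_key(env=None):
    """Return True if a non-empty OPENAI_API_KEY is present in the mapping.

    `env` defaults to os.environ; passing a plain dict makes the check testable.
    This helper is hypothetical and only illustrates the check.
    """
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())


if not has_openai_key():
    print("OPENAI_API_KEY is not set; calls to the OpenAI model will fail to authenticate.")
```

The check only confirms that the variable is present and non-empty; it does not verify that the key is valid.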

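1. Define your data model

Every dataset has a structure, or "schema". The MedicalBilling class below serves as the schema for our synthetic data. By defining it, we tell the synthetic data generator the shape and nature of the data we expect.

# Define the data model for the synthetic data
class MedicalBilling(BaseModel):
    patient_id: int  # patient ID (integer)
    patient_name: str  # patient name (string)
    diagnosis_code: str  # diagnosis code (string)
    procedure_code: str  # medical procedure code (string)
    total_charge: float  # total charge (float)
    insurance_claim_amount: float  # insurance claim amount (float)

For example, every record will have an integer patient_id, a string patient_name, and so on.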
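
To make the idea of a schema concrete without any external dependency, here is a minimal sketch that checks a plain dict against the same field names and types. It is a stand-in for what the pydantic model does, not how pydantic is implemented; SCHEMA and schema_errors are hypothetical names introduced for illustration.

```python
# Field names and types mirror the MedicalBilling class; this dict is a
# hypothetical stand-in for the pydantic model.
SCHEMA = {
    "patient_id": int,
    "patient_name": str,
    "diagnosis_code": str,
    "procedure_code": str,
    "total_charge": float,
    "insurance_claim_amount": float,
}


def schema_errors(record):
    """Return a list of human-readable problems found in `record`."""
    errors = []
    for name, expected in SCHEMA.items():
        if name not in record:
            errors.append(f"missing field: {name}")
            continue
        value = record[name]
        # Accept ints where floats are expected, but never accept booleans.
        allowed = (int, float) if expected is float else (expected,)
        if isinstance(value, bool) or not isinstance(value, allowed):
            errors.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return errors


good = {
    "patient_id": 123456,
    "patient_name": "John Doe",
    "diagnosis_code": "J20.9",
    "procedure_code": "99203",
    "total_charge": 500.0,
    "insurance_claim_amount": 350.0,
}
bad = dict(good, patient_id="abc")
print(schema_errors(good))
print(schema_errors(bad))
```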

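2. Sample data

To guide the synthetic data generator, it is useful to provide a few realistic examples. These examples act as "seeds": they represent the kind of data you want, and the generator uses them to create more data that looks similar.

Here are some fictional medical billing records:

# Provide a few sample records as a reference for generating synthetic data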
examples = [
    {
        "example": "Patient ID: 123456, Patient Name: John Doe, Diagnosis Code: J20.9, Procedure Code: 99203, Total Charge: $500, Insurance Claim Amount: $350"
    },
    {
        "example": "Patient ID: 789012, Patient Name: Johnson Smith, Diagnosis Code: M54.5, Procedure Code: 99213, Total Charge: $150, Insurance Claim Amount: $120"
    },
    {
        "example": "Patient ID: 345678, Patient Name: Emily Stone, Diagnosis Code: E11.9, Procedure Code: 99214, Total Charge: $300, Insurance Claim Amount: $250"
    },
]
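
The seed examples are free-form strings, while the schema is a set of typed fields. The sketch below shows one way the two relate, by parsing a seed string into schema-shaped values. The parse_example helper is hypothetical and assumes the exact "Label: value, Label: value" layout used above; it is not something the generator does internally.

```python
EXAMPLE = (
    "Patient ID: 123456, Patient Name: John Doe, Diagnosis Code: J20.9, "
    "Procedure Code: 99203, Total Charge: $500, Insurance Claim Amount: $350"
)


def parse_example(text):
    """Parse one seed string into a dict keyed by the schema's field names.

    Assumes fields are separated by ", " and labels by ": ", as in the
    seed examples; values containing ", " would break this simple parser.
    """
    pairs = [part.split(": ", 1) for part in text.split(", ")]
    raw = {label.strip(): value.strip() for label, value in pairs}
    return {
        "patient_id": int(raw["Patient ID"]),
        "patient_name": raw["Patient Name"],
        "diagnosis_code": raw["Diagnosis Code"],
        "procedure_code": raw["Procedure Code"],
        "total_charge": float(raw["Total Charge"].lstrip("$")),
        "insurance_claim_amount": float(raw["Insurance Claim Amount"].lstrip("$")),
    }


parsed_example = parse_example(EXAMPLE)
print(parsed_example)
```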

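3. Craft a prompt template

The generator does not automatically know how to create our data; we need to guide it. We do this by creating a prompt template, which instructs the underlying language model to produce synthetic data in the desired format.

# Create a prompt template that guides the language model in generating synthetic data
# Note: this assignment shadows the OPENAI_TEMPLATE imported above
OPENAI_TEMPLATE = PromptTemplate(input_variables=["example"], template="{example}")

prompt_template = FewShotPromptTemplate(
    prefix=SYNTHETIC_FEW_SHOT_PREFIX,  # prefix; may contain guiding context or instructions
    examples=examples,  # the sample data defined earlier
    suffix=SYNTHETIC_FEW_SHOT_SUFFIX,  # suffix
    input_variables=["subject", "extra"],  # placeholders filled in dynamically later
    example_prompt=OPENAI_TEMPLATE,  # the format each example row should take
)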

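The FewShotPromptTemplate includes:

prefix and suffix: these may contain guiding context or instructions.
examples: the sample data we defined earlier.
input_variables: these variables ("subject", "extra") are placeholders that you can fill in dynamically later. For example, "subject" can be filled with "medical_billing" to guide the model further.
example_prompt: this prompt template is the format we want each example row to take in the prompt.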
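
The way these pieces combine can be sketched in plain Python. This is an illustrative approximation, not the library's implementation: the prefix and suffix strings are hypothetical stand-ins for SYNTHETIC_FEW_SHOT_PREFIX and SYNTHETIC_FEW_SHOT_SUFFIX, the seed examples are shortened, and joining the parts with blank lines is an assumption of this sketch.

```python
# Hypothetical stand-ins for the library's prefix and suffix constants.
HYPOTHETICAL_PREFIX = "Here are examples of synthetic data about {subject}:"
HYPOTHETICAL_SUFFIX = "Now generate one more record about {subject}. {extra}"


def build_few_shot_prompt(prefix, examples, suffix, example_template, **variables):
    """Render prefix, each example, and suffix, then join them with blank lines."""
    rendered_examples = [example_template.format(**example) for example in examples]
    parts = [
        prefix.format(**variables),   # input variables are filled into the prefix
        *rendered_examples,           # one rendered block per seed example
        suffix.format(**variables),   # and into the suffix
    ]
    return "\n\n".join(parts)


# Shortened versions of the seed examples, for readability.
seed_examples = [
    {"example": "Patient ID: 123456, Patient Name: John Doe, Diagnosis Code: J20.9"},
    {"example": "Patient ID: 789012, Patient Name: Johnson Smith, Diagnosis Code: M54.5"},
]

prompt_text = build_few_shot_prompt(
    HYPOTHETICAL_PREFIX,
    seed_examples,
    HYPOTHETICAL_SUFFIX,
    example_template="{example}",
    subject="medical_billing",
    extra="Pick an unusual patient name.",
)
print(prompt_text)
```

Seeing the fully rendered prompt makes it easier to judge whether the seed examples and the extra instruction give the model enough guidance.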

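4. Create the data generator

With the schema and the prompt template in place, the next step is to create the data generator. This object knows how to communicate with the underlying language model to obtain synthetic data.

# Create the synthetic data generator
synthetic_data_generator = create_openai_data_generator(
    output_schema=MedicalBilling,  # the schema of the output data
    llm=ChatOpenAI(  # language model instance; replace with your own model as needed
        temperature=1
    ),
    prompt=prompt_template,  # the prompt template created above
)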

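5. Generate synthetic data

Finally, let's get our synthetic data!

# Generate the synthetic data
synthetic_results = synthetic_data_generator.generate(
    subject="medical_billing",  # subject: the domain of the data to generate
    extra="the name must be chosen at random. Make it something you wouldn't normally choose.",  # extra guidance
    runs=10,  # number of records to generate
)

This call asks the generator to produce 10 synthetic medical billing records. The results are stored in synthetic_results, and the output is a list of MedicalBilling pydantic model instances.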
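
Because the output comes from a language model, it is worth sanity-checking the records before using them. The sketch below works on plain dicts (with pydantic v1 models, each result could first be converted with its .dict() method) and uses hypothetical sample records rather than real generator output. The rule that a claim should not exceed the total charge is an assumption made for illustration.

```python
import csv
import io

FIELDS = [
    "patient_id",
    "patient_name",
    "diagnosis_code",
    "procedure_code",
    "total_charge",
    "insurance_claim_amount",
]


def implausible_records(records):
    """Return records with a negative charge or a claim larger than the charge."""
    return [
        r
        for r in records
        if r["total_charge"] < 0 or r["insurance_claim_amount"] > r["total_charge"]
    ]


def records_to_csv(records, fieldnames=FIELDS):
    """Serialize records to a CSV string with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical records standing in for generator output.
sample_records = [
    {"patient_id": 1, "patient_name": "Zephyrine Quill", "diagnosis_code": "J20.9",
     "procedure_code": "99203", "total_charge": 500.0, "insurance_claim_amount": 350.0},
    {"patient_id": 2, "patient_name": "Orlo Vance", "diagnosis_code": "M54.5",
     "procedure_code": "99213", "total_charge": 150.0, "insurance_claim_amount": 200.0},
]

print(implausible_records(sample_records))
print(records_to_csv(sample_records))
```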

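Other implementations

The original documentation also shows other ways to generate synthetic data with langchain, using different fields, preferences, and styles.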

from langchain_experimental.synthetic_data import (
    DatasetGenerator,
    create_data_generation_chain,
)
from langchain_openai import ChatOpenAI

# Instantiate the language model
model = ChatOpenAI(model="gpt-3.5-turbo", temperature=0.7)
# Create the data generation chain
chain = create_data_generation_chain(model)

# Usage example
chain({"fields": ["blue", "yellow"], "preferences": {}})
# More examples follow...

{'fields': ['blue', 'yellow'],
 'preferences': {},
 'text': 'The vibrant blue sky contrasted beautifully with the bright yellow sun, creating a stunning display of colors that instantly lifted the spirits of all who gazed upon it.'}

chain(
    {
        "fields": {"colors": ["blue", "yellow"]},
        "preferences": {"style": "Make it in a style of a weather forecast."},
    }
)

{'fields': {'colors': ['blue', 'yellow']},
 'preferences': {'style': 'Make it in a style of a weather forecast.'},
 'text': "Good morning! Today's weather forecast brings a beautiful combination of colors to the sky, with hues of blue and yellow gently blending together like a mesmerizing painting."}

chain(
    {
        "fields": [
            {"actor": "Tom Hanks", "movies": ["Forrest Gump", "Green Mile"]},
            {"actor": "Mads Mikkelsen", "movies": ["Hannibal", "Another round"]},
        ],
        "preferences": {"minimum_length": 200, "style": "gossip"},
    }
)

{'fields': [{'actor': 'Tom Hanks', 'movies': ['Forrest Gump', 'Green Mile']},
  {'actor': 'Mads Mikkelsen', 'movies': ['Hannibal', 'Another round']}],
 'preferences': {'minimum_length': 200, 'style': 'gossip'},
 'text': 'Did you know that Tom Hanks, the beloved Hollywood actor known for his roles in "Forrest Gump" and "Green Mile", has shared the screen with the talented Mads Mikkelsen, who gained international acclaim for his performances in "Hannibal" and "Another round"? These two incredible actors have brought their exceptional skills and captivating charisma to the big screen, delivering unforgettable performances that have enthralled audiences around the world. Whether it\'s Hanks\' endearing portrayal of Forrest Gump or Mikkelsen\'s chilling depiction of Hannibal Lecter, these movies have solidified their places in cinematic history, leaving a lasting impact on viewers and cementing their status as true icons of the silver screen.'}

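As we can see, the generated examples are diverse and contain the information we wanted them to have. Their style also reflects the given preferences quite well.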
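
The claim that the generated text contains the requested information can be checked mechanically. The sketch below flattens the values in "fields" (dict keys such as "colors" are ignored) and reports any value that does not appear in the text, using a case-insensitive substring match. The helper names are hypothetical, and the result dict is copied from the first output shown above.

```python
def flatten_values(fields):
    """Yield every leaf value in a nested structure of dicts and lists as a string."""
    if isinstance(fields, dict):
        for value in fields.values():
            yield from flatten_values(value)
    elif isinstance(fields, (list, tuple)):
        for item in fields:
            yield from flatten_values(item)
    else:
        yield str(fields)


def missing_values(result):
    """Return the field values that do not occur in result['text']."""
    text = result["text"].lower()
    return [v for v in flatten_values(result["fields"]) if v.lower() not in text]


# Copied from the first chain output shown above.
result = {
    "fields": ["blue", "yellow"],
    "preferences": {},
    "text": "The vibrant blue sky contrasted beautifully with the bright yellow sun, "
    "creating a stunning display of colors that instantly lifted the spirits of "
    "all who gazed upon it.",
}
print(missing_values(result))
```

A substring match is a deliberately crude criterion; it would miss paraphrases and could accept accidental matches inside longer words.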

Generating a dataset for extraction benchmarking

inp = [
    {
        "Actor": "Tom Hanks",
        "Film": [
            "Forrest Gump",
            "Saving Private Ryan",
            "The Green Mile",
            "Toy Story",
            "Catch Me If You Can",
        ],
    },
    {
        "Actor": "Tom Hardy",
        "Film": [
            "Inception",
            "The Dark Knight Rises",
            "Mad Max: Fury Road",
            "The Revenant",
            "Dunkirk",
        ],
    },
]

generator = DatasetGenerator(model, {"style": "informal", "minimal length": 500})
dataset = generator(inp)

dataset

[{'fields': {'Actor': 'Tom Hanks',
   'Film': ['Forrest Gump',
    'Saving Private Ryan',
    'The Green Mile',
    'Toy Story',
    'Catch Me If You Can']},
  'preferences': {'style': 'informal', 'minimal length': 500},
  'text': 'Tom Hanks, the versatile and charismatic actor, has graced the silver screen in numerous iconic films including the heartwarming and inspirational "Forrest Gump," the intense and gripping war drama "Saving Private Ryan," the emotionally charged and thought-provoking "The Green Mile," the beloved animated classic "Toy Story," and the thrilling and captivating true story adaptation "Catch Me If You Can." With his impressive range and genuine talent, Hanks continues to captivate audiences worldwide, leaving an indelible mark on the world of cinema.'},
 {'fields': {'Actor': 'Tom Hardy',
   'Film': ['Inception',
    'The Dark Knight Rises',
    'Mad Max: Fury Road',
    'The Revenant',
    'Dunkirk']},
  'preferences': {'style': 'informal', 'minimal length': 500},
  'text': 'Tom Hardy, the versatile actor known for his intense performances, has graced the silver screen in numerous iconic films, including "Inception," "The Dark Knight Rises," "Mad Max: Fury Road," "The Revenant," and "Dunkirk." Whether he\'s delving into the depths of the subconscious mind, donning the mask of the infamous Bane, or navigating the treacherous wasteland as the enigmatic Max Rockatansky, Hardy\'s commitment to his craft is always evident. From his breathtaking portrayal of the ruthless Eames in "Inception" to his captivating transformation into the ferocious Max in "Mad Max: Fury Road," Hardy\'s dynamic range and magnetic presence captivate audiences and leave an indelible mark on the world of cinema. In his most physically demanding role to date, he endured the harsh conditions of the freezing wilderness as he portrayed the rugged frontiersman John Fitzgerald in "The Revenant," earning him critical acclaim and an Academy Award nomination. In Christopher Nolan\'s war epic "Dunkirk," Hardy\'s stoic and heroic portrayal of Royal Air Force pilot Farrier showcases his ability to convey deep emotion through nuanced performances. With his chameleon-like ability to inhabit a wide range of characters and his unwavering commitment to his craft, Tom Hardy has undoubtedly solidified his place as one of the most talented and sought-after actors of his generation.'}]

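Parser

Okay, let's see whether we can now extract structured output from the generated data, and how it compares with the input we started from!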

from typing import List
from langchain.chains import create_extraction_chain_pydantic
from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_openai import OpenAI
from pydantic import BaseModel, Field

# Define the data model the parser needs
class Actor(BaseModel):
    Actor: str = Field(description="actor's name")
    Film: List[str] = Field(description="list of films the actor starred in")

# Create the parser
llm = OpenAI()
parser = PydanticOutputParser(pydantic_object=Actor)

# Create the prompt template
prompt = PromptTemplate(
    template="Extract fields from a given text.\n{format_instructions}\n{text}\n",
    input_variables=["text"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

# Use the parser to extract data from the generated text
_input = prompt.format_prompt(text=dataset[0]["text"])
output = llm(_input.to_string())
parsed = parser.parse(output)
parsed

Actor(Actor='Tom Hanks', Film=['Forrest Gump', 'Saving Private Ryan', 'The Green Mile', 'Toy Story', 'Catch Me If You Can'])

# Verify that the parsed result matches the input
(parsed.Actor == inp[0]["Actor"]) & (parsed.Film == inp[0]["Film"])

True

extractor = create_extraction_chain_pydantic(pydantic_schema=Actor, llm=model)
extracted = extractor.run(dataset[1]["text"])
extracted

[Actor(Actor='Tom Hardy', Film=['Inception', 'The Dark Knight Rises', 'Mad Max: Fury Road', 'The Revenant', 'Dunkirk'])]

(extracted[0].Actor == inp[1]["Actor"]) & (extracted[0].Film == inp[1]["Film"])

True
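
The equality checks above are all-or-nothing and sensitive to the order of the film list. For benchmarking, a softer score can be more informative. The sketch below computes exact match together with set-based precision and recall; the film_scores helper and the imperfect extraction are hypothetical and only illustrate the scoring.

```python
def film_scores(expected, extracted):
    """Compare an extracted film list against the expected one.

    exact_match is order-sensitive list equality; precision and recall
    are computed on the sets of titles.
    """
    expected_set, extracted_set = set(expected), set(extracted)
    hits = len(expected_set & extracted_set)
    precision = hits / len(extracted_set) if extracted_set else 0.0
    recall = hits / len(expected_set) if expected_set else 0.0
    return {
        "exact_match": expected == extracted,
        "precision": precision,
        "recall": recall,
    }


expected_films = [
    "Inception",
    "The Dark Knight Rises",
    "Mad Max: Fury Road",
    "The Revenant",
    "Dunkirk",
]
# Hypothetical imperfect extraction: two correct titles and one wrong one.
imperfect = ["Inception", "Dunkirk", "Venom"]

print(film_scores(expected_films, expected_films))
print(film_scores(expected_films, imperfect))
```

Averaging these scores over every item in the generated dataset gives a simple benchmark figure for an extraction chain.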

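Summary

This article showed how to use Python and the langchain library to generate several kinds of synthetic data. Synthetic data can be used for privacy protection, data augmentation, model training, and other scenarios. Using medical billing records as the example, it walked through the steps of generating synthetic data: installing the libraries, defining the data model, providing sample data, creating the prompt template, creating the data generator, and generating the data. It also covered further examples of synthetic data generation, including a dataset for benchmarking extraction.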