LangChain v0.2 docs, translated: 3.3 How to return structured data from a model

It is often useful to have a model return output that matches a specific schema, for example to extract data from text and insert it into a database or pass it to another downstream system. This guide covers a few strategies for getting structured output from a model.

The .with_structured_output() method

This is the easiest and most reliable way to get structured output. The with_structured_output() method is implemented for models that provide native APIs for structuring output, such as tool/function calling or JSON mode, and it makes use of these capabilities under the hood.

This method takes a schema as input, which specifies the names, types, and descriptions of the desired output attributes. It returns a model-like object, except that instead of outputting strings or messages it outputs objects matching the given schema. The schema can be specified as a JSON Schema or as a Pydantic class. If a JSON Schema is used, a dict is returned; if a Pydantic class is used, Pydantic objects are returned.

Here is an example that has the model generate a joke about cats, separating the setup from the punchline:

# Install the LangChain OpenAI package
# pip install -qU langchain-openai

# Import the necessary libraries
import getpass
import os
from typing import Optional

from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_openai import ChatOpenAI

# Set an environment variable holding the OpenAI API key
os.environ["OPENAI_API_KEY"] = getpass.getpass()

# Create a ChatOpenAI instance, specifying which model to use
llm = ChatOpenAI(model="gpt-3.5-turbo-0125")

# Define the Joke class, which specifies the structure of the output
class Joke(BaseModel):
    """Joke to tell user."""
    setup: str = Field(description="The setup of the joke")
    punchline: str = Field(description="The punchline to the joke")
    rating: Optional[int] = Field(description="How funny the joke is, from 1 to 10")

# Use with_structured_output to get structured output
structured_llm = llm.with_structured_output(Joke)

# Invoke the model with a prompt and get a structured joke back
joke_response = structured_llm.invoke("Tell me a joke about cats")
print(joke_response)

Example output:

Joke(setup='Why was the cat sitting on the computer?', punchline='To keep an eye on the mouse!', rating=None)
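
Since the schema was given as a Pydantic class, the result is a Pydantic object, and its fields can be read as ordinary attributes. A minimal usage sketch:

# Fields of the returned Pydantic object are plain attributes
print(joke_response.setup)      # Why was the cat sitting on the computer?
print(joke_response.punchline)  # To keep an eye on the mouse!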

Beyond the structure of the Pydantic class, the name of the class, its docstring, and the names and provided descriptions of its parameters are also very important. Most of the time with_structured_output uses the model's function/tool-calling API, so you can effectively think of all of this information as being added to the model prompt.
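
To see exactly what information the model receives, you can inspect the tool definition LangChain derives from the class. A minimal sketch, assuming the convert_to_openai_tool utility from langchain_core is available:

from langchain_core.utils.function_calling import convert_to_openai_tool

# The class name becomes the tool name, the docstring its description,
# and each Field description annotates the corresponding parameter
print(convert_to_openai_tool(Joke))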

Using a JSON Schema dict

Instead of a Pydantic class, we can pass a JSON Schema dict to define the output structure. In that case the response is returned as a dict as well:

# Define a JSON Schema dict describing the desired output structure
json_schema = {
    "title": "joke",
    "description": "Joke to tell user.",
    "type": "object",
    "properties": {
        "setup": {
            "type": "string",
            "description": "The setup of the joke"
        },
        "punchline": {
            "type": "string",
            "description": "The punchline to the joke"
        },
        "rating": {
            "type": "integer",
            "description": "How funny the joke is, from 1 to 10"
        }
    },
    "required": ["setup", "punchline"]
}

# Get structured output using the JSON Schema dict
structured_llm = llm.with_structured_output(json_schema)

# Invoke the model and get a structured joke back
joke_response = structured_llm.invoke("Tell me a joke about cats")
print(joke_response)

Example output:

{'setup': 'Why was the cat sitting on the computer?',
 'punchline': 'Because it wanted to keep an eye on the mouse!',
 'rating': 8}

Choosing between multiple schemas

To let the model choose from multiple schemas, create a parent Pydantic class with a Union-typed attribute:

from typing import Union

# Define the ConversationalResponse class
class ConversationalResponse(BaseModel):
    """Respond in a conversational manner. Be kind and helpful."""
    response: str = Field(description="A conversational response to the user's query")

# Define the Response class, whose output attribute is a Union of Joke and ConversationalResponse
class Response(BaseModel):
    output: Union[Joke, ConversationalResponse]

# Get structured output using the Response class
structured_llm = llm.with_structured_output(Response)

# Invoke the model and get back different kinds of structured output
joke_response = structured_llm.invoke("Tell me a joke about cats")
print(joke_response)

conversation_response = structured_llm.invoke("How are you today?")
print(conversation_response)

Example output:

Response(output=Joke(setup='Why was the cat sitting on the computer?', punchline='To keep an eye on the mouse!', rating=8))

Response(output=ConversationalResponse(response="I'm just a digital assistant, so I don't have feelings, but I'm here and ready to help you. How can I assist you today?"))
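
Since every result is wrapped in a Response object, downstream code typically branches on the concrete type of its output attribute. A minimal sketch:

# Branch on which schema the model chose
result = structured_llm.invoke("How are you today?")
if isinstance(result.output, Joke):
    print(result.output.setup)
else:
    print(result.output.response)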

Streaming

We can stream outputs from our structured model when the output type is a dict (i.e., when the schema is specified as a JSON Schema dict):

# Get structured output using the JSON Schema dict
structured_llm = llm.with_structured_output(json_schema)

# Stream the output
for chunk in structured_llm.stream("Tell me a joke about cats"):
    print(chunk)

Example output (each partial state is printed as it arrives):

{}  # initially empty; the stream is just starting
{}
{'setup': ''}
{'setup': 'Why'}
{'setup': 'Why was'}
{'setup': 'Why was the'}
{'setup': 'Why was the cat'}
{'setup': 'Why was the cat sitting'}
{'setup': 'Why was the cat sitting on'}
{'setup': 'Why was the cat sitting on the'}
{'setup': 'Why was the cat sitting on the computer'}
{'setup': 'Why was the cat sitting on the computer?'}
{'setup': 'Why was the cat sitting on the computer?', 'punchline': ''}
{'setup': 'Why was the cat sitting on the computer?', 'punchline': 'Because'}
{'setup': 'Why was the cat sitting on the computer?', 'punchline': 'Because it'}
{'setup': 'Why was the cat sitting on the computer?', 'punchline': 'Because it wanted'}
{'setup': 'Why was the cat sitting on the computer?', 'punchline': 'Because it wanted to'}
{'setup': 'Why was the cat sitting on the computer?', 'punchline': 'Because it wanted to keep'}
{'setup': 'Why was the cat sitting on the computer?', 'punchline': 'Because it wanted to keep an'}
{'setup': 'Why was the cat sitting on the computer?', 'punchline': 'Because it wanted to keep an eye'}
{'setup': 'Why was the cat sitting on the computer?', 'punchline': 'Because it wanted to keep an eye on'}
{'setup': 'Why was the cat sitting on the computer?', 'punchline': 'Because it wanted to keep an eye on the'}
{'setup': 'Why was the cat sitting on the computer?', 'punchline': 'Because it wanted to keep an eye on the mouse'}
{'setup': 'Why was the cat sitting on the computer?', 'punchline': 'Because it wanted to keep an eye on the mouse!'}
{'setup': 'Why was the cat sitting on the computer?', 'punchline': 'Because it wanted to keep an eye on the mouse!', 'rating': 8}
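
As the output above shows, each chunk is the aggregated state so far rather than a delta, so the final chunk alone contains the complete object. A minimal sketch that keeps only the finished result:

# Each chunk is cumulative, so the last one holds the full object
final = None
for chunk in structured_llm.stream("Tell me a joke about cats"):
    final = chunk
print(final)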

Few-shot prompting

For more complex schemas it's very useful to add few-shot examples to the prompt. The easiest way to do this is to add the examples to a system message in the prompt:

from langchain_core.prompts import ChatPromptTemplate

# Define the system message, which contains the few-shot examples.
# Note the doubled braces: ChatPromptTemplate treats single braces as
# template variables, so literal JSON braces must be escaped as {{ and }}.
system = """
You are a hilarious comedian. Your specialty is knock-knock jokes.
Return a joke which has the setup (the response to "Who's there?") and the final punchline (the response to "<setup> who?").

Here are some examples of jokes:
example_user: Tell me a joke about planes
example_assistant: {{"setup": "Why don't planes ever get tired?", "punchline": "Because they have rest wings!", "rating": 2}}

example_user: Tell me another joke about planes
example_assistant: {{"setup": "Cargo", "punchline": "Cargo 'vroom vroom', but planes go 'zoom zoom'!", "rating": 10}}

example_user: Now about caterpillars
example_assistant: {{"setup": "Caterpillar", "punchline": "Caterpillar really slow, but watch me turn into a butterfly and steal the show!", "rating": 5}}
"""

# Create the prompt template
prompt = ChatPromptTemplate.from_messages([
    ("system", system),
    ("human", "{input}"),
])

# Combine the few-shot prompt with structured output
few_shot_structured_llm = prompt | structured_llm

# Invoke the model and get a structured joke back
response = few_shot_structured_llm.invoke("what's something funny about woodpeckers")
print(response)

Example output:

{'setup': 'Woodpecker', 'punchline': "Woodpecker goes 'knock knock', but don't worry, they never expect you to answer the door!", 'rating': 8}

When the underlying method for structuring outputs is tool calling, we can pass the examples in as explicit tool calls. You can check the API reference to see whether the model you're using makes use of tool calling.

from langchain_core.messages import AIMessage, HumanMessage, ToolMessage

examples = [
    HumanMessage("Tell me a joke about planes", name="example_user"),
    AIMessage(
        "",
        name="example_assistant",
        tool_calls=[
            {
                "name": "joke",
                "args": {
                    "setup": "Why don't planes ever get tired?",
                    "punchline": "Because they have rest wings!",
                    "rating": 2,
                },
                "id": "1",
            }
        ],
    ),
    # Most tool-calling models expect a ToolMessage(s) to follow an AIMessage with tool calls.
    ToolMessage("", tool_call_id="1"),
    # Some models also expect an AIMessage to follow any ToolMessages,
    # so you may need to add an AIMessage here.
    HumanMessage("Tell me another joke about planes", name="example_user"),
    AIMessage(
        "",
        name="example_assistant",
        tool_calls=[
            {
                "name": "joke",
                "args": {
                    "setup": "Cargo",
                    "punchline": "Cargo 'vroom vroom', but planes go 'zoom zoom'!",
                    "rating": 10,
                },
                "id": "2",
            }
        ],
    ),
    ToolMessage("", tool_call_id="2"),
    HumanMessage("Now about caterpillars", name="example_user"),
    AIMessage(
        "",
        tool_calls=[
            {
                "name": "joke",
                "args": {
                    "setup": "Caterpillar",
                    "punchline": "Caterpillar really slow, but watch me turn into a butterfly and steal the show!",
                    "rating": 5,
                },
                "id": "3",
            }
        ],
    ),
    ToolMessage("", tool_call_id="3"),
]
system = """You are a hilarious comedian. Your specialty is knock-knock jokes. \
Return a joke which has the setup (the response to "Who's there?") \
and the final punchline (the response to "<setup> who?")."""

prompt = ChatPromptTemplate.from_messages(
    [("system", system), ("placeholder", "{examples}"), ("human", "{input}")]
)
few_shot_structured_llm = prompt | structured_llm
few_shot_structured_llm.invoke({"input": "crocodiles", "examples": examples})

Example output:

{'setup': 'Crocodile',
 'punchline': "Crocodile 'see you later', but in a while, it becomes an alligator!",
 'rating': 7}

(Advanced) Specifying the method for structuring outputs

For models that support more than one way of structuring outputs (i.e., they support both tool calling and JSON mode), you can specify which one to use with the method= argument.

JSON mode

If you use JSON mode, you still have to specify the desired schema in the model prompt itself. The schema passed to with_structured_output is only used to parse the model output; it is not passed to the model the way it is with tool calling. Note that OpenAI's JSON mode requires the word "json" to appear somewhere in the messages, which is why the prompt below asks for JSON explicitly.

# Specify JSON mode
structured_llm = llm.with_structured_output(Joke, method="json_mode")

# Invoke the model, instructing it to return the joke as JSON
response = structured_llm.invoke(
    "Tell me a joke about cats, respond in JSON with `setup` and `punchline` keys"
)
print(response)

Example output:

Joke(setup='Why was the cat sitting on the computer?', punchline='Because it wanted to keep an eye on the mouse!', rating=None)

Prompting and parsing model outputs directly

Not all models support .with_structured_output(), since not all models support tool calling or JSON mode. For such models you'll need to prompt the model directly to use a specific format, and use an output parser to extract the structured response from the raw model output.

Using PydanticOutputParser

The following example uses the built-in PydanticOutputParser to parse the output of a chat model that has been prompted to match the given Pydantic schema. Note that we add format_instructions directly to the prompt from a method on the parser:

from typing import List

from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import ChatPromptTemplate

# Define the Person and People classes used for output parsing
class Person(BaseModel):
    """Information about a person."""
    name: str = Field(..., description="The name of the person")
    height_in_meters: float = Field(..., description="The height of the person expressed in meters.")

class People(BaseModel):
    """Identifying information about all people in a text."""
    people: List[Person]

# Set up the parser
parser = PydanticOutputParser(pydantic_object=People)

# Create the prompt
prompt = ChatPromptTemplate.from_messages([
    (
        "system",
        "Answer the user query. Wrap the output in `json` tags {format_instructions}",
    ),
    ("human", "{query}"),
]).partial(format_instructions=parser.get_format_instructions())

# Build the query and preview the prompt that will be sent to the model
query = "Anna is 23 years old and she is 6 feet tall"

print(prompt.invoke(query).to_string())

Example output:

System: Answer the user query. Wrap the output in `json` tags
The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:

{"description": "Identifying information about all people in a text.", "properties": {"people": {"title": "People", "type": "array", "items": {"$ref": "#/definitions/Person"}}}, "required": ["people"], "definitions": {"Person": {"title": "Person", "description": "Information about a person.", "type": "object", "properties": {"name": {"title": "Name", "description": "The name of the person", "type": "string"}, "height_in_meters": {"title": "Height In Meters", "description": "The height of the person expressed in meters.", "type": "number"}}, "required": ["name", "height_in_meters"]}}}

Human: Anna is 23 years old and she is 6 feet tall

Now let's invoke it:

chain = prompt | llm | parser

chain.invoke({"query": query})

Example output:

People(people=[Person(name='Anna', height_in_meters=1.8288)])

Note that the model converted the height from feet to meters (6 feet is exactly 1.8288 meters).

Creating a custom prompt and parser

You can also create a custom prompt and parser with the LangChain Expression Language (LCEL), using a plain function to parse the model output:

import json
import re
from typing import List

from langchain_core.messages import AIMessage
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field


class Person(BaseModel):
    """Information about a person."""

    name: str = Field(..., description="The name of the person")
    height_in_meters: float = Field(
        ..., description="The height of the person expressed in meters."
    )


class People(BaseModel):
    """Identifying information about all people in a text."""

    people: List[Person]


# Prompt
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "Answer the user query. Output your answer as JSON that  "
            "matches the given schema: ```json\n{schema}\n```. "
            "Make sure to wrap the answer in ```json and ```tags",
        ),
        ("human", "{query}"),
    ]
).partial(schema=People.schema())


# Custom parser
def extract_json(message: AIMessage) -> List[dict]:
    """Extract JSON content embedded between ```json and ``` tags.

    Parameters:
        message (AIMessage): The message containing the JSON content.

    Returns:
        list: A list of parsed JSON objects.
    """
    text = message.content
    # Define the regular expression pattern to match JSON blocks
    pattern = r"```json(.*?)```"

    # Find all non-overlapping matches of the pattern in the string
    matches = re.findall(pattern, text, re.DOTALL)

    # Return the list of matched JSON strings, stripping any leading or trailing whitespace
    try:
        return [json.loads(match.strip()) for match in matches]
    except Exception:
        raise ValueError(f"Failed to parse: {message}")
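
Before wiring the parser into a chain, you can sanity-check it on its own. A minimal sketch with a hand-built AIMessage (the sample content is hypothetical):

# Quick standalone check of the custom parser
sample = AIMessage(content='```json\n{"people": [{"name": "Anna", "height_in_meters": 1.8288}]}\n```')
print(extract_json(sample))  # [{'people': [{'name': 'Anna', 'height_in_meters': 1.8288}]}]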

Here is the prompt that is sent to the model:

query = "Anna is 23 years old and she is 6 feet tall"

print(prompt.format_prompt(query=query).to_string())
System: Answer the user query. Output your answer as JSON that  matches the given schema: ```json
{'title': 'People', 'description': 'Identifying information about all people in a text.', 'type': 'object', 'properties': {'people': {'title': 'People', 'type': 'array', 'items': {'$ref': '#/definitions/Person'}}}, 'required': ['people'], 'definitions': {'Person': {'title': 'Person', 'description': 'Information about a person.', 'type': 'object', 'properties': {'name': {'title': 'Name', 'description': 'The name of the person', 'type': 'string'}, 'height_in_meters': {'title': 'Height In Meters', 'description': 'The height of the person expressed in meters.', 'type': 'number'}}, 'required': ['name', 'height_in_meters']}}}
```. Make sure to wrap the answer in ```json and ``` tags
Human: Anna is 23 years old and she is 6 feet tall

And here's what it looks like when we invoke it:

chain = prompt | llm | extract_json

chain.invoke({"query": query})
[{'people': [{'name': 'Anna', 'height_in_meters': 1.8288}]}]

Background

  • LangChain: a Python library for building AI assistants and applications, providing interfaces for interacting with different AI models.
  • OpenAI API: an API service from OpenAI that lets developers use pretrained AI models in their own applications.
  • Pydantic: a Python library for data validation and settings management that uses Python type annotations to validate input data.
  • Union types: in Python type annotations, Union indicates that a value can be any one of several types.
  • Inheritance and polymorphism: in object-oriented programming, inheritance lets a new class (the subclass) reuse the attributes and methods of an existing class (the parent class); polymorphism lets objects of different classes respond to the same message, with the behavior determined by each object's actual type.
  • Streaming: a style of data processing in which data is handled incrementally as it arrives rather than all at once.
  • Few-shot learning: a machine-learning paradigm in which a model learns from a small number of examples and generalizes to new situations.
  • Output parser: a tool for extracting structured data from a model's raw output.
  • JSON Schema: a schema language for defining and describing JSON data structures, useful for validating and parsing JSON data.