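Returning structured data from a model
It is often useful to have a model return output that matches a specific schema. A common use case is extracting data from text to insert into a database or to pass to another downstream system. This guide covers several strategies for getting structured output from a model.
The .with_structured_output() method
This is the simplest and most reliable way to get structured output. with_structured_output() is implemented for models that offer a native API for structured output, such as tool/function calling or JSON mode, and it uses those capabilities under the hood.
The method takes a schema as input that specifies the names, types, and descriptions of the desired output attributes. It returns a model-like Runnable that, instead of outputting strings or messages, outputs objects corresponding to the given schema. The schema can be specified as a JSON Schema or as a Pydantic class. If a JSON Schema is used, the Runnable returns a dictionary; if a Pydantic class is used, it returns a Pydantic object.
As an example, let's have a model generate a joke and separate the setup from the punchline: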
# Install the langchain-openai package
pip install -qU langchain-openai
import getpass
import os
# Set the OPENAI_API_KEY environment variable
os.environ["OPENAI_API_KEY"] = getpass.getpass()
from langchain_openai import ChatOpenAI
# Create a ChatOpenAI instance
llm = ChatOpenAI(model="gpt-3.5-turbo-0125")
If we want the model to return a Pydantic object, we only need to pass in the desired Pydantic class:
from typing import Optional
from langchain_core.pydantic_v1 import BaseModel, Field
# Define the Joke schema
class Joke(BaseModel):
    """Joke to tell user."""

    setup: str = Field(description="The setup of the joke")
    punchline: str = Field(description="The punchline to the joke")
    rating: Optional[int] = Field(default=None, description="How funny the joke is, from 1 to 10")
# Wrap the model so that it returns Joke objects
structured_llm = llm.with_structured_output(Joke)
# Ask the model for a joke
structured_llm.invoke("Tell me a joke about cats")
# Example output
Joke(setup='Why was the cat sitting on the computer?', punchline='To keep an eye on the mouse!', rating=None)
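Tip
Beyond the structure of the Pydantic class, the class name, its docstring, and the names and descriptions of its fields matter a great deal. Most of the time with_structured_output uses the model's function/tool-calling API, so you can effectively treat all of this information as being added to the model prompt.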
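To make the tip concrete, here is a rough standard-library sketch of the text a schema class contributes. It uses a dataclass instead of Pydantic so that it runs without extra packages. The helper `to_tool_definition` and the layout of the dictionary it returns are assumptions made for illustration; they are not LangChain's actual conversion.

```python
import inspect
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class Joke:
    """Joke to tell user."""

    setup: str = field(metadata={"description": "The setup of the joke"})
    punchline: str = field(metadata={"description": "The punchline to the joke"})
    rating: Optional[int] = field(
        default=None,
        metadata={"description": "How funny the joke is, from 1 to 10"},
    )


def to_tool_definition(cls) -> dict:
    """Hypothetical helper: gather the names and descriptions a model would see."""
    return {
        "name": cls.__name__,               # the class name
        "description": inspect.getdoc(cls),  # the class docstring
        "parameters": {
            f.name: f.metadata.get("description", "") for f in fields(cls)
        },
    }


tool_definition = to_tool_definition(Joke)
print(tool_definition)
```

Renaming the class, rewording the docstring, or changing a field description changes this dictionary, which is why those choices can influence the model's output.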
If you don't want to use Pydantic, you can also pass in a JSON Schema dictionary. In that case, the response is also a dictionary:
# Define the JSON Schema
json_schema = {
    "title": "joke",
    "description": "Joke to tell user.",
    "type": "object",
    "properties": {
        "setup": {
            "type": "string",
            "description": "The setup of the joke",
        },
        "punchline": {
            "type": "string",
            "description": "The punchline to the joke",
        },
        "rating": {
            "type": "integer",
            "description": "How funny the joke is, from 1 to 10",
        },
    },
    "required": ["setup", "punchline"],
}
# Wrap the model so that it returns dictionaries matching the schema
structured_llm = llm.with_structured_output(json_schema)
# Ask the model for a joke
structured_llm.invoke("Tell me a joke about cats")
# Example output
{'setup': 'Why was the cat sitting on the computer?',
 'punchline': 'Because it wanted to keep an eye on the mouse!',
 'rating': 8}
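Because the response here is a plain dictionary and not a Pydantic object, you may want a light sanity check before handing it to a downstream system. The following is a minimal standard-library sketch. `check_against_schema` and `TYPE_MAP` are hypothetical helpers written for this illustration, not part of LangChain, and they only cover the `required` and `type` keywords used in the joke schema.

```python
from typing import List

# A trimmed copy of the joke schema (descriptions omitted for brevity).
json_schema = {
    "title": "joke",
    "type": "object",
    "properties": {
        "setup": {"type": "string"},
        "punchline": {"type": "string"},
        "rating": {"type": "integer"},
    },
    "required": ["setup", "punchline"],
}

# Hypothetical mapping from JSON Schema type names to Python types.
TYPE_MAP = {"string": str, "integer": int}


def check_against_schema(data: dict, schema: dict) -> List[str]:
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    for key in schema.get("required", []):
        if key not in data:
            problems.append(f"missing required key: {key}")
    for key, value in data.items():
        spec = schema["properties"].get(key)
        if spec is None:
            problems.append(f"unexpected key: {key}")
        elif not isinstance(value, TYPE_MAP[spec["type"]]):
            problems.append(f"wrong type for {key}: expected {spec['type']}")
    return problems


good = {
    "setup": "Why was the cat sitting on the computer?",
    "punchline": "Because it wanted to keep an eye on the mouse!",
    "rating": 8,
}
bad = {"setup": "Why was the cat sitting on the computer?", "rating": "8"}

print(check_against_schema(good, json_schema))
print(check_against_schema(bad, json_schema))
```

For anything beyond a quick check, a full JSON Schema validator is a better fit than this hand-rolled function.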
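Choosing between multiple schemas
The simplest way to let the model choose between multiple schemas is to create a parent Pydantic class with a Union-typed attribute: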
from typing import Union
# Define the ConversationalResponse schema
class ConversationalResponse(BaseModel):
    """Respond in a conversational manner. Be kind and helpful."""

    response: str = Field(description="A conversational response to the user's query")
# Define the parent Response schema
class Response(BaseModel):
    output: Union[Joke, ConversationalResponse]
# Wrap the model so that it returns Response objects
structured_llm = llm.with_structured_output(Response)
# Invoke the model with prompts that call for different response types
structured_llm.invoke("Tell me a joke about cats")
structured_llm.invoke("How are you today?")
Response(output=Joke(setup='Why was the cat sitting on the computer?', punchline='To keep an eye on the mouse!', rating=8))
Response(output=ConversationalResponse(response="I'm just a digital assistant, so I don't have feelings, but I'm here and ready to help you. How can I assist you today?"))
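Alternatively, if your chosen model supports it, you can use tool calling directly to let the model choose between options. This involves a bit more parsing and setup, but in some cases it leads to better performance because you don't have to use nested schemas.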
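The parsing side of that alternative can be sketched without calling a model. The example assumes each tool call arrives as a dictionary with `name`, `args`, and `id` keys, and it uses dataclasses as stand-ins for the Pydantic classes. `SCHEMAS` and `parse_tool_call` are hypothetical names introduced for this illustration.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Joke:
    setup: str
    punchline: str
    rating: Optional[int] = None


@dataclass
class ConversationalResponse:
    response: str


# Each schema is offered to the model as its own tool, so no parent
# class with a Union-typed field is needed.
SCHEMAS = {"Joke": Joke, "ConversationalResponse": ConversationalResponse}


def parse_tool_call(tool_call: dict) -> Union[Joke, ConversationalResponse]:
    """Build the object whose schema matches the tool the model picked."""
    name = tool_call["name"]
    if name not in SCHEMAS:
        raise ValueError(f"Unknown tool: {name}")
    return SCHEMAS[name](**tool_call["args"])


# Hand-written tool calls standing in for real model output.
joke_call = {
    "name": "Joke",
    "args": {"setup": "Why was the cat sitting on the computer?",
             "punchline": "To keep an eye on the mouse!"},
    "id": "1",
}
chat_call = {
    "name": "ConversationalResponse",
    "args": {"response": "I'm here and ready to help."},
    "id": "2",
}

print(parse_tool_call(joke_call))
print(parse_tool_call(chat_call))
```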
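Streaming
When the output type is a dictionary (that is, when the schema is specified as a JSON Schema dictionary), we can stream output from the structured model.
Note
What is yielded are already-aggregated chunks, not deltas.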
# Use the JSON Schema dictionary defined earlier
structured_llm = llm.with_structured_output(json_schema)
# Stream the output
for chunk in structured_llm.stream("Tell me a joke about cats"):
    print(chunk)
{}
{'setup': ''}
{'setup': 'Why'}
{'setup': 'Why was'}
{'setup': 'Why was the'}
{'setup': 'Why was the cat'}
{'setup': 'Why was the cat sitting'}
{'setup': 'Why was the cat sitting on'}
{'setup': 'Why was the cat sitting on the'}
{'setup': 'Why was the cat sitting on the computer'}
{'setup': 'Why was the cat sitting on the computer?'}
{'setup': 'Why was the cat sitting on the computer?', 'punchline': ''}
{'setup': 'Why was the cat sitting on the computer?', 'punchline': 'Because'}
{'setup': 'Why was the cat sitting on the computer?', 'punchline': 'Because it'}
{'setup': 'Why was the cat sitting on the computer?', 'punchline': 'Because it wanted'}
{'setup': 'Why was the cat sitting on the computer?', 'punchline': 'Because it wanted to'}
{'setup': 'Why was the cat sitting on the computer?', 'punchline': 'Because it wanted to keep'}
{'setup': 'Why was the cat sitting on the computer?', 'punchline': 'Because it wanted to keep an'}
{'setup': 'Why was the cat sitting on the computer?', 'punchline': 'Because it wanted to keep an eye'}
{'setup': 'Why was the cat sitting on the computer?', 'punchline': 'Because it wanted to keep an eye on'}
{'setup': 'Why was the cat sitting on the computer?', 'punchline': 'Because it wanted to keep an eye on the'}
{'setup': 'Why was the cat sitting on the computer?', 'punchline': 'Because it wanted to keep an eye on the mouse'}
{'setup': 'Why was the cat sitting on the computer?', 'punchline': 'Because it wanted to keep an eye on the mouse!'}
{'setup': 'Why was the cat sitting on the computer?', 'punchline': 'Because it wanted to keep an eye on the mouse!', 'rating': 8}
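Since each chunk already contains everything received so far, a consumer that wants only the newly arrived text has to compute the difference itself. The sketch below is one possible way to do that with the standard library. `iter_deltas` is a hypothetical helper, and it assumes that string values only ever grow by appending, as they do in the output shown here.

```python
from typing import Iterable, Iterator, Tuple


def iter_deltas(chunks: Iterable[dict]) -> Iterator[Tuple[str, object]]:
    """Yield (key, new_part) pairs from a stream of aggregated dictionaries."""
    previous: dict = {}
    for chunk in chunks:
        for key, value in chunk.items():
            old = previous.get(key)
            if isinstance(value, str):
                # Keep only the suffix that was not present in the last chunk.
                delta = value[len(old or ""):]
                if delta:
                    yield key, delta
            elif value != old:
                # Non-string values (such as the rating) arrive whole.
                yield key, value
        previous = chunk


# A shortened, hand-written stand-in for the streamed chunks.
chunks = [
    {},
    {"setup": ""},
    {"setup": "Why"},
    {"setup": "Why was"},
    {"setup": "Why was", "punchline": ""},
    {"setup": "Why was", "punchline": "Because"},
    {"setup": "Why was", "punchline": "Because", "rating": 8},
]

deltas = list(iter_deltas(chunks))
print(deltas)
```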
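Few-shot prompting
For more complex schemas, it is very useful to add few-shot examples to the prompt. This can be done in a few ways.
The simplest and most universal way is to add examples to a system message in the prompt: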
from langchain_core.prompts import ChatPromptTemplate
# Define the system message, including the examples.
# Literal braces are doubled so the prompt template does not treat them as variables.
system = """
You are a hilarious comedian. Your specialty is knock-knock jokes. \
Return a joke which has the setup (the response to "Who's there?") and the final punchline (the response to "<setup> who?").
Here are some examples of jokes:
example_user: Tell me a joke about planes
example_assistant: {{"setup": "Why don't planes ever get tired?", "punchline": "Because they have rest wings!", "rating": 2}}
example_user: Tell me another joke about planes
example_assistant: {{"setup": "Cargo", "punchline": "Cargo 'vroom vroom', but planes go 'zoom zoom'!", "rating": 10}}
example_user: Now about caterpillars
example_assistant: {{"setup": "Caterpillar", "punchline": "Caterpillar really slow, but watch me turn into a butterfly and steal the show!", "rating": 5}}
"""
# Create the prompt template
prompt = ChatPromptTemplate.from_messages([("system", system), ("human", "{input}")])
# Pipe the prompt into the structured model
few_shot_structured_llm = prompt | structured_llm
# Ask the model for a joke
few_shot_structured_llm.invoke("what's something funny about woodpeckers")
{'setup': 'Woodpecker',
 'punchline': "Woodpecker goes 'knock knock', but don't worry, they never expect you to answer the door!",
 'rating': 8}
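When the underlying method for structuring outputs is tool calling, we can pass in the examples as explicit tool calls. You can check in the API reference whether the model you are using relies on tool calling.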
from langchain_core.messages import AIMessage, HumanMessage, ToolMessage
examples = [
    HumanMessage("Tell me a joke about planes", name="example_user"),
    AIMessage(
        "",
        name="example_assistant",
        tool_calls=[
            {
                "name": "joke",
                "args": {
                    "setup": "Why don't planes ever get tired?",
                    "punchline": "Because they have rest wings!",
                    "rating": 2,
                },
                "id": "1",
            }
        ],
    ),
    # Most tool-calling models expect a ToolMessage(s) to follow an AIMessage with tool calls.
    ToolMessage("", tool_call_id="1"),
    # Some models also expect an AIMessage to follow any ToolMessages,
    # so you may need to add an AIMessage here.
    HumanMessage("Tell me another joke about planes", name="example_user"),
    AIMessage(
        "",
        name="example_assistant",
        tool_calls=[
            {
                "name": "joke",
                "args": {
                    "setup": "Cargo",
                    "punchline": "Cargo 'vroom vroom', but planes go 'zoom zoom'!",
                    "rating": 10,
                },
                "id": "2",
            }
        ],
    ),
    ToolMessage("", tool_call_id="2"),
    HumanMessage("Now about caterpillars", name="example_user"),
    AIMessage(
        "",
        tool_calls=[
            {
                "name": "joke",
                "args": {
                    "setup": "Caterpillar",
                    "punchline": "Caterpillar really slow, but watch me turn into a butterfly and steal the show!",
                    "rating": 5,
                },
                "id": "3",
            }
        ],
    ),
    ToolMessage("", tool_call_id="3"),
]
system = """You are a hilarious comedian. Your specialty is knock-knock jokes. \
Return a joke which has the setup (the response to "Who's there?") \
and the final punchline (the response to "<setup> who?")."""
prompt = ChatPromptTemplate.from_messages(
    [("system", system), ("placeholder", "{examples}"), ("human", "{input}")]
)
few_shot_structured_llm = prompt | structured_llm
few_shot_structured_llm.invoke({"input": "crocodiles", "examples": examples})
{'setup': 'Crocodile',
 'punchline': "Crocodile 'see you later', but in a while, it becomes an alligator!",
 'rating': 7}
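(Advanced) Specifying the method for structuring outputs
For models that support more than one way of structuring outputs (that is, they support both tool calling and JSON mode), you can choose which one to use with the method= argument.
JSON mode
If you use JSON mode, you still have to describe the desired schema in the model prompt. The schema you pass to with_structured_output is only used for parsing the model output; it is not passed to the model the way it is with tool calling.
To see whether the model you are using supports JSON mode, check its entry in the API reference.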
# Use JSON mode
structured_llm = llm.with_structured_output(Joke, method="json_mode")
# Ask the model for a joke, naming the expected keys in the prompt
structured_llm.invoke(
    "Tell me a joke about cats, respond in JSON with `setup` and `punchline` keys"
)
Joke(setup='Why was the cat sitting on the computer?', punchline='Because it wanted to keep an eye on the mouse!', rating=None)
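The reason the prompt has to name the keys can be illustrated without a model. In this simplified sketch, a dataclass stands in for the Pydantic class and `parse_joke` is a hypothetical parser, not LangChain's implementation. The schema is applied only after the JSON text comes back, so a reply that uses other key names is valid JSON but still fails to parse into a Joke.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Joke:
    setup: str
    punchline: str
    rating: Optional[int] = None


def parse_joke(raw: str) -> Joke:
    """Hypothetical parser: the schema is used here, after generation."""
    return Joke(**json.loads(raw))


# A reply from a model that was told which keys to use.
good = parse_joke('{"setup": "Why was the cat sitting on the computer?", '
                  '"punchline": "To keep an eye on the mouse!"}')
print(good)

# A reply from a model that was only told to "respond in JSON".
mismatch = None
try:
    parse_joke('{"question": "Why was the cat sitting on the computer?", '
               '"answer": "To keep an eye on the mouse!"}')
except TypeError as err:
    mismatch = str(err)
print(mismatch)
```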
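Prompting and parsing model outputs directly
Not all models support .with_structured_output(), since not all models support tool calling or JSON mode. For such models, you need to prompt the model directly to use a specific format, and then use an output parser to extract the structured response from the raw model output.
Using PydanticOutputParser
The following example uses the built-in PydanticOutputParser to parse the output of a chat model that has been prompted to match a given Pydantic schema. Note that we add format_instructions to the prompt directly from a method on the parser: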
from typing import List
from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field
class Person(BaseModel):
    """Information about a person."""

    name: str = Field(..., description="The name of the person")
    height_in_meters: float = Field(
        ..., description="The height of the person expressed in meters."
    )
class People(BaseModel):
    """Identifying information about all people in a text."""

    people: List[Person]
# Set up a parser
parser = PydanticOutputParser(pydantic_object=People)
# Prompt
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "Answer the user query. Wrap the output in `json` tags\n{format_instructions}",
        ),
        ("human", "{query}"),
    ]
).partial(format_instructions=parser.get_format_instructions())
Let's look at what information is sent to the model:
query = "Anna is 23 years old and she is 6 feet tall"
print(prompt.invoke(query).to_string())
System: Answer the user query. Wrap the output in `json` tags
The output should be formatted as a JSON instance that conforms to the JSON schema below.
As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.
Here is the output schema:
Human: Anna is 23 years old and she is 6 feet tall
Now let's invoke it:
chain = prompt | llm | parser
chain.invoke({"query": query})
People(people=[Person(name='Anna', height_in_meters=1.8288)])
Custom parsing
You can also create a custom prompt and parser with the LangChain Expression Language (LCEL), using a plain function to parse the output from the model:
import json
import re
from typing import List
from langchain_core.messages import AIMessage
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field
class Person(BaseModel):
    """Information about a person."""

    name: str = Field(..., description="The name of the person")
    height_in_meters: float = Field(
        ..., description="The height of the person expressed in meters."
    )
class People(BaseModel):
    """Identifying information about all people in a text."""

    people: List[Person]
# Prompt
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "Answer the user query. Output your answer as JSON that "
            "matches the given schema: ```json\n{schema}\n```. "
            "Make sure to wrap the answer in ```json and ```tags",
        ),
        ("human", "{query}"),
    ]
).partial(schema=People.schema())
# Custom parser
def extract_json(message: AIMessage) -> List[dict]:
    """Extracts JSON content from a message where JSON is embedded between ```json and ```tags.
    Parameters:
        message (AIMessage): The message whose content contains the JSON.
    Returns:
        list: A list of parsed JSON objects.
    """
    text = message.content
    # Define the regular expression pattern to match JSON blocks
    pattern = r"```json(.*?)```"
    # Find all non-overlapping matches of the pattern in the string
    matches = re.findall(pattern, text, re.DOTALL)
    # Parse each matched block, stripping any leading or trailing whitespace first
    try:
        return [json.loads(match.strip()) for match in matches]
    except Exception:
        raise ValueError(f"Failed to parse: {message}")
Here is the prompt sent to the model:
query = "Anna is 23 years old and she is 6 feet tall"
print(prompt.format_prompt(query=query).to_string())
System: Answer the user query. Output your answer as JSON that matches the given schema: ```json
{'title': 'People', 'description': 'Identifying information about all people in a text.', 'type': 'object', 'properties': {'people': {'title': 'People', 'type': 'array', 'items': {'$ref': '#/definitions/Person'}}}, 'required': ['people'], 'definitions': {'Person': {'title': 'Person', 'description': 'Information about a person.', 'type': 'object', 'properties': {'name': {'title': 'Name', 'description': 'The name of the person', 'type': 'string'}, 'height_in_meters': {'title': 'Height In Meters', 'description': 'The height of the person expressed in meters.', 'type': 'number'}}, 'required': ['name', 'height_in_meters']}}}
```. Make sure to wrap the answer in ```json and ```tags
Human: Anna is 23 years old and she is 6 feet tall
Here is what it looks like when we invoke it:
chain = prompt | llm | extract_json
chain.invoke({"query": query})
[{'people': [{'name': 'Anna', 'height_in_meters': 1.8288}]}]
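A parser like this can be exercised without calling a model, which is handy when adjusting the regular expression. The sketch below repeats the parsing logic and feeds it a hand-written reply. `FakeMessage` is a hypothetical stand-in that exposes only the `content` attribute the parser reads, and the three-backtick marker is built with `"`" * 3` purely to keep the listing easy to embed.

```python
import json
import re
from dataclasses import dataclass
from typing import List

FENCE = "`" * 3  # the three-backtick marker that wraps the JSON answer


@dataclass
class FakeMessage:
    """Hypothetical stand-in for AIMessage: only .content is used."""

    content: str


def extract_json(message) -> List[dict]:
    """Parse every JSON block wrapped in the json fence markers."""
    pattern = FENCE + r"json(.*?)" + FENCE
    matches = re.findall(pattern, message.content, re.DOTALL)
    try:
        return [json.loads(match.strip()) for match in matches]
    except Exception:
        raise ValueError(f"Failed to parse: {message}")


reply = FakeMessage(
    "Here is the answer:\n"
    + FENCE + "json\n"
    + '{"people": [{"name": "Anna", "height_in_meters": 1.8288}]}\n'
    + FENCE
)
parsed = extract_json(reply)
print(parsed)

# A reply without fence markers yields an empty list and no error.
unfenced = extract_json(FakeMessage('{"people": []}'))
print(unfenced)
```

The second call shows a design consequence worth knowing: when the model ignores the wrapping instruction, this parser returns an empty list, so callers may want to treat an empty result as a failure.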
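Summary and background concepts
This article showed how to get structured output from a language model, which is useful when inserting data into a database or integrating with other systems. We covered several approaches: the .with_structured_output() method, choosing between multiple schemas, streaming output, few-shot prompting, and prompting and parsing model outputs directly.
With these approaches, model output is no longer limited to free text; it can be converted into concrete data structures such as Pydantic objects or JSON objects, which are easier for programs to process and analyze.
Background concepts:
- Pydantic: a Python library for data validation and settings management that uses Python type annotations to validate the type and structure of input data.
- JSON Schema: a JSON-based format for describing the types, properties, and validation rules of JSON data.
- LangChain: a framework for building applications on top of language models, offering tools and interfaces for composing model calls into chains.
- Environment variables: a way to store configuration such as API keys outside the code, which keeps secrets out of source files and makes the code more flexible.
- Tool calling: a model capability in which the model emits a structured request to call an external tool or function, with arguments that match a declared schema.
- LangChain Expression Language (LCEL): LangChain's declarative syntax for composing prompts, models, and parsers into chains, for example prompt | llm | parser.
- Streaming: a technique for processing output incrementally as it is produced, instead of waiting for the complete result.
- Few-shot prompting: a technique that shows the model a handful of examples to steer it toward output in a specific format.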