LLMs and LangChain (Part 5) | Analyzing Unstructured Data with a LangChain Agent

wshzd

Article link: https://blog.csdn.net/wshzd/article/details/135834436

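Imagine you run a bakery and have sent out your confectionery intelligence team to gather data on the competition. They report back on what your competitors are doing, and they come back with plenty of great ideas you would like to apply to your own business. The data, however, is unstructured! How do you analyze it to understand what is most in demand and to plan the best next steps for your business? In Part 1, we use "PydanticOutputParser" to parse our data and add the structure we need. In Part 2, we create a LangChain Agent to analyze the data.

To explore this use case, a toy dataset [1] was created. Here is one sample from the dataset: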

At Velvet Frosting Cupcakes, our team learned about the unveiling of a seasonal pastry menu that changes monthly. Introducing a rotating seasonal menu at our bakery using the “SeasonalJoy” subscription platform and adding a special touch to our cookies with the “FloralStamp” cookie stamper could keep our offerings fresh and exciting for customers.

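Part 1: Extracting structured information from unstructured data

Method 1: create_extraction_chain

Define the structure of the data to extract, then use LangChain to create an extraction chain.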

from langchain.chains import create_extraction_chain
from langchain.chat_models import ChatOpenAI

# Schema
schema = {
    "properties": {
        "company": {"type": "string"},
        "offering": {"type": "string"},
        "advantage": {"type": "string"},
        "products_and_services": {"type": "string"},
        "additional_details": {"type": "string"},
    }
}

Define the test samples

# Inputs
in1 = """Sweet Delights Bakery introduced lavender-infused vanilla cupcakes with a honey buttercream frosting, using the "Frosting-Spreader-3000". This innovation could inspire our next cupcake creation"""
in2 = """Whisked Away Cupcakes introduced a dessert subscription service, ensuring regular customers receive fresh batches of various sweets. Exploring a similar subscription model using the "SweetSubs" program could boost customer loyalty."""
in3 = """At Velvet Frosting Cupcakes, our team learned about the unveiling of a seasonal pastry menu that changes monthly. Introducing a rotating seasonal menu at our bakery using the "SeasonalJoy" subscription platform and adding a special touch to our cookies with the "FloralStamp" cookie stamper could keep our offerings fresh and exciting for customers."""

inputs = [in1, in2, in3]

Create the chain

# Create the extraction chain
llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo")
chain = create_extraction_chain(schema, llm)

Run the chain

# "text" avoids shadowing the built-in input()
for text in inputs:
    print(chain.run(text))

The output is now structured as Python lists, one list per input:

[{'company': 'Sweet Delights Bakery', 'offering': 'lavender-infused vanilla cupcakes', 'advantage': 'inspiring next cupcake creation', 'products_and_services': 'Frosting-Spreader-3000'}]
[{'company': 'Whisked Away Cupcakes', 'offering': 'dessert subscription service', 'advantage': 'ensuring regular customers receive fresh batches of various sweets', 'products_and_services': '', 'additional_details': ''}, {'company': '', 'offering': 'subscription model using the "SweetSubs" program', 'advantage': 'boost customer loyalty', 'products_and_services': '', 'additional_details': ''}]
[{'company': 'Velvet Frosting Cupcakes', 'offering': 'rotating seasonal menu', 'advantage': 'fresh and exciting offerings', 'products_and_services': 'SeasonalJoy subscription platform, FloralStamp cookie stamper'}]
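The output above shows two things worth handling before any analysis: a single input can yield several records, and some records omit keys or carry empty strings. Below is a minimal sketch of normalizing such results in plain Python. The helper name `normalize_records` and the choice to map empty strings to `None` are assumptions made for illustration; they are not part of LangChain.

```python
from typing import Optional

# Keys taken from the schema defined earlier in this article.
SCHEMA_KEYS = [
    "company",
    "offering",
    "advantage",
    "products_and_services",
    "additional_details",
]


def normalize_records(records: list[dict]) -> list[dict[str, Optional[str]]]:
    """Give every record the full set of schema keys; treat '' as missing."""
    normalized = []
    for record in records:
        row = {}
        for key in SCHEMA_KEYS:
            value = record.get(key, "")
            row[key] = value if value else None
        normalized.append(row)
    return normalized


# Second result from the run above: one input text produced two records.
second_result = [
    {
        "company": "Whisked Away Cupcakes",
        "offering": "dessert subscription service",
        "advantage": "ensuring regular customers receive fresh batches of various sweets",
        "products_and_services": "",
        "additional_details": "",
    },
    {
        "company": "",
        "offering": 'subscription model using the "SweetSubs" program',
        "advantage": "boost customer loyalty",
        "products_and_services": "",
        "additional_details": "",
    },
]

rows = normalize_records(second_result)
print(len(rows))           # number of records extracted from one input
print(rows[1]["company"])  # the second record has no company name
```

Keeping every record, instead of only the first one per input, makes it visible when the chain splits one observation into several partial entries.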

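Next, we import the CSV that contains the competitive intelligence, run each entry through the extraction chain to parse and structure it, and merge the parsed information back into the original dataset. The Python code below does exactly that: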

import numpy as np  # needed for np.nan below
import pandas as pd
from langchain.chains import create_extraction_chain
from langchain.chat_models import ChatOpenAI

# Load in the data.csv (semicolon separated) file
df = pd.read_csv("data.csv", sep=';')

# Define Schema based on your data
schema = {
    "properties": {
        "company": {"type": "string"},
        "offering": {"type": "string"},
        "advantage": {"type": "string"},
        "products_and_services": {"type": "string"},
        "additional_details": {"type": "string"},
    }
}

# Create extraction chain
llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo")
chain = create_extraction_chain(schema, llm)

# ----------
# Add the data to a data frame
# ----------

# Extract information and create a DataFrame from the list of dictionaries
# (only the first record returned for each row is kept)
extracted_data = df['INTEL'].apply(lambda x: chain.run(x)[0]).apply(pd.Series)

# Replace missing values with NaN
extracted_data.replace('', np.nan, inplace=True)

# Concatenate the extracted_data DataFrame with the original df
df = pd.concat([df, extracted_data], axis=1)

# display the data frame
df.head()

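This run took about 15 seconds, but it did not find all of the information we asked for. Let's try a different approach.

Method 2: Pydantic

In the code below, Pydantic is used to define data models that represent the structure of the competitive intelligence. Pydantic is a data validation and parsing library for Python that lets you define simple or complex data structures using Python data types. Here we use two Pydantic models (Competitor and Company) to define the structure of the competitive intelligence data.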

import pandas as pd
from typing import Optional, Sequence
from langchain.llms import OpenAI
from langchain.output_parsers import PydanticOutputParser
from langchain.prompts import PromptTemplate
from pydantic import BaseModel

# Load data from CSV
df = pd.read_csv("data.csv", sep=';')

# Pydantic models for competitive intelligence
class Competitor(BaseModel):
    company: str
    offering: str
    advantage: str
    products_and_services: str
    additional_details: str

class Company(BaseModel):
    """Identifying information about all competitive intelligence in a text."""
    company: Sequence[Competitor]

# Set up a Pydantic parser and prompt template
parser = PydanticOutputParser(pydantic_object=Company)
prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

# Function to process each row and extract information
def process_row(row):
    _input = prompt.format_prompt(query=row['INTEL'])
    model = OpenAI(temperature=0)
    output = model(_input.to_string())
    result = parser.parse(output)

    # Convert Pydantic result to a dictionary
    competitor_data = result.model_dump()

    # Flatten the nested structure for DataFrame creation
    flat_data = {'INTEL': [], 'company': [], 'offering': [], 'advantage': [], 'products_and_services': [], 'additional_details': []}
    for entry in competitor_data['company']:
        flat_data['INTEL'].append(row['INTEL'])
        flat_data['company'].append(entry['company'])
        flat_data['offering'].append(entry['offering'])
        flat_data['advantage'].append(entry['advantage'])
        flat_data['products_and_services'].append(entry['products_and_services'])
        flat_data['additional_details'].append(entry['additional_details'])

    # Create a DataFrame from the flattened data
    df_cake = pd.DataFrame(flat_data)
    return df_cake

# Apply the function to each row and concatenate the results
intel_df = pd.concat(df.apply(process_row, axis=1).tolist(), ignore_index=True)

# Display the resulting DataFrame
intel_df.head()

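That was fast! Unlike create_extraction_chain, this approach found the details for every entry.

Part 1 summary:

We found PydanticOutputParser to be faster and more reliable. Each run takes about 1 second and 400 tokens, whereas a create_extraction_chain run takes about 2.5 seconds and 250 tokens.

We have managed to extract structured data from unstructured text! Part 2 focuses on analyzing this structured data with a LangChain Agent.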
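To make the parsing step less of a black box, here is a rough standard-library stand-in for what an output parser does conceptually: take the model's text output (JSON), check that every required field is present, and return typed objects. This is not Pydantic and not PydanticOutputParser; `parse_competitors` is a hypothetical helper, and the sample string is hand-written from the first example above rather than a real model response.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class Competitor:
    company: str
    offering: str
    advantage: str
    products_and_services: str
    additional_details: str


def parse_competitors(raw_output: str) -> list[Competitor]:
    """Parse a JSON string shaped like {"company": [ {...}, ... ]}."""
    payload = json.loads(raw_output)
    required = [f.name for f in fields(Competitor)]
    competitors = []
    for entry in payload["company"]:
        missing = [name for name in required if name not in entry]
        if missing:
            # A real validation library reports richer errors; this is a sketch.
            raise ValueError(f"missing fields: {missing}")
        competitors.append(Competitor(**{name: str(entry[name]) for name in required}))
    return competitors


# Hand-written stand-in for a model response (assumption, not real output).
raw_output = json.dumps({
    "company": [
        {
            "company": "Sweet Delights Bakery",
            "offering": "lavender-infused vanilla cupcakes",
            "advantage": "inspiring next cupcake creation",
            "products_and_services": "Frosting-Spreader-3000",
            "additional_details": "",
        }
    ]
})

parsed = parse_competitors(raw_output)
print(parsed[0].company)
```

Because every field is required, an incomplete response fails loudly instead of silently producing a partial row, which may be one reason the structured-parser approach feels more reliable.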
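The per-run figures above can be scaled up to get a feel for the trade-off between the two methods. The sketch below is simple arithmetic; the dataset size of 100 rows is a hypothetical number chosen for illustration, and real timings will vary with the model and network.

```python
def estimate_run(rows: int, seconds_per_row: float, tokens_per_row: int) -> tuple[float, int]:
    """Scale per-row time and token usage linearly to a whole dataset."""
    return rows * seconds_per_row, rows * tokens_per_row


ROWS = 100  # hypothetical dataset size, not from the article

# Per-run figures quoted in the summary above.
parser_seconds, parser_tokens = estimate_run(ROWS, 1.0, 400)
chain_seconds, chain_tokens = estimate_run(ROWS, 2.5, 250)

print(f"PydanticOutputParser: {parser_seconds:.0f} s, {parser_tokens} tokens")
print(f"create_extraction_chain: {chain_seconds:.0f} s, {chain_tokens} tokens")
```

Under these assumptions the parser approach finishes in less than half the time but consumes more tokens, so the better choice depends on whether latency or token cost matters more.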

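Part 2: Analyzing the structured data with a LangChain Agent

What is a LangChain Agent?

In LangChain, an Agent is a system that uses a language model to choose the sequence of actions to take. In a Chain, the sequence of actions is hard-coded. An Agent, by contrast, uses the language model as a "reasoning engine" that decides which actions to take and in what order.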
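The distinction can be illustrated with a toy sketch in plain Python. This is not how LangChain implements agents: `fake_reasoner` is a hypothetical keyword-matching stand-in for the language model, and the two tools are invented for illustration. The point is only that the chain's steps are fixed in code, while the agent asks a decision function what to do next.

```python
from typing import Callable

Record = dict[str, str]


def count_rows(records: list[Record]) -> str:
    return f"{len(records)} rows"


def list_companies(records: list[Record]) -> str:
    return ", ".join(sorted({r["company"] for r in records if r["company"]}))


TOOLS: dict[str, Callable[[list[Record]], str]] = {
    "count_rows": count_rows,
    "list_companies": list_companies,
}


def run_chain(records: list[Record]) -> list[str]:
    """Chain: the steps and their order are hard-coded."""
    return [count_rows(records), list_companies(records)]


def fake_reasoner(question: str, history: list[str]) -> str:
    """Stand-in for the LLM: return the next tool name, or 'finish'."""
    if history:
        return "finish"
    if "how many" in question.lower():
        return "count_rows"
    return "list_companies"


def run_agent(question: str, records: list[Record], max_steps: int = 5) -> list[str]:
    """Agent: a decision function picks each action until it says 'finish'."""
    history: list[str] = []
    for _ in range(max_steps):
        action = fake_reasoner(question, history)
        if action == "finish":
            break
        history.append(TOOLS[action](records))
    return history


# Company names taken from the Part 1 examples.
records = [
    {"company": "Sweet Delights Bakery"},
    {"company": "Whisked Away Cupcakes"},
    {"company": "Velvet Frosting Cupcakes"},
]

print(run_chain(records))
print(run_agent("How many records are there?", records))
print(run_agent("Which companies appear?", records))
```

The chain always runs both tools, whereas the agent runs only the tool its reasoner selects for the question at hand.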

Now let's use LangChain's CSV Agent to analyze our structured data:

Step 1: Create the Agent

First, load the required libraries:

from langchain.agents.agent_types import AgentType
from langchain_community.llms import OpenAI
from langchain_experimental.agents.agent_toolkits import create_csv_agent

Create the Agent

agent = create_csv_agent(
    OpenAI(temperature=0),
    "data/intel.csv",
    verbose=True,
    agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
)

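Now we can test our Agent with a few questions:

Step 2: Ask the Agent questions

When you ask a LangChain Agent a question, you can watch it reason about its own actions (the intermediate steps are printed because verbose=True).

Ask a general question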

agent.run("What insights can I get from this data?")

‘This dataframe contains information about different companies and their products/services, as well as additional details and potential opportunities for improvement.’

Ask about competitor advantages

agent.run("What are 3 specific areas of focus that you can obtain through analyzing the advantages offered by the competition?")

‘Three specific areas of focus that can be obtained through analyzing the advantages offered by the competition are: streamlining production processes, incorporating unique and distinctive flavors, and using sustainable and high-quality ingredients.’

Ask about key competitor themes

agent.run("What are some key themes that the competitors represented in the data are focusing on providing? Be specific with examples, and talk about the advantages of these")

‘The key themes that the competitors are focusing on providing are efficiency, unique flavors, and high-quality ingredients. For example, Coco candy co is using the 77Tyrbo Choco machine to coat their candy gummies, which streamlines the process and saves time. Cinnamon Bliss Bakery adds a secret touch of cinnamon in their chocolate brownies with the CinnaMagic ingredient, which adds a distinctive flavor. Choco Haven factory uses organic and locally sourced ingredients, including the EcoCocoa brand, to elevate the quality of their chocolates.’
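Agent answers like these are free-text summaries, so it can help to spot-check a claim with a direct query on the data. The sketch below uses only the standard library on a small in-memory sample built from the Part 1 outputs. The column names, the semicolon delimiter, and the helper `companies_mentioning` are assumptions for illustration; the real data/intel.csv may be laid out differently.

```python
import csv
import io

# Small sample built from the Part 1 extraction outputs (not the full dataset).
SAMPLE_CSV = """company;offering;advantage
Sweet Delights Bakery;lavender-infused vanilla cupcakes;inspiring next cupcake creation
Whisked Away Cupcakes;dessert subscription service;ensuring regular customers receive fresh batches of various sweets
Velvet Frosting Cupcakes;rotating seasonal menu;fresh and exciting offerings
"""


def companies_mentioning(keyword: str, csv_text: str, column: str = "offering") -> list[str]:
    """Return companies whose given column contains the keyword (case-insensitive)."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=";")
    return [row["company"] for row in reader if keyword.lower() in row[column].lower()]


print(companies_mentioning("subscription", SAMPLE_CSV))
print(companies_mentioning("fresh", SAMPLE_CSV, column="advantage"))
```

If a theme the Agent reports cannot be found by a simple search like this, that is a signal to reread the Agent's intermediate steps before acting on the answer.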