Building effective agents
This post is a translation of Anthropic's article "Building effective agents"; copyright belongs to the original company. A link to the original appears at the end.
December 20, 2024
Over the past year, we’ve worked with dozens of teams building large language model (LLM) agents across industries. Consistently, the most successful implementations weren’t using complex frameworks or specialized libraries. Instead, they were building with simple, composable patterns.
In this post, we share what we’ve learned from working with our customers and building agents ourselves, and give practical advice for developers on building effective agents.
What are agents?
“Agent” can be defined in several ways. Some customers define agents as fully autonomous systems that operate independently over extended periods, using various tools to accomplish complex tasks. Others use the term to describe more prescriptive implementations that follow predefined workflows. At Anthropic, we categorize all these variations as agentic systems, but draw an important architectural distinction between workflows and agents:
- Workflows are systems where LLMs and tools are orchestrated through predefined code paths.
- Agents, on the other hand, are systems where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks.
Below, we will explore both types of agentic systems in detail. In Appendix 1 (“Agents in Practice”), we describe two domains where customers have found particular value in using these kinds of systems.
When (and when not) to use agents
When building applications with LLMs, we recommend finding the simplest solution possible, and only increasing complexity when needed. This might mean not building agentic systems at all. Agentic systems often trade latency and cost for better task performance, and you should consider when this tradeoff makes sense.
When more complexity is warranted, workflows offer predictability and consistency for well-defined tasks, whereas agents are the better option when flexibility and model-driven decision-making are needed at scale. For many applications, however, optimizing single LLM calls with retrieval and in-context examples is usually enough.
When and how to use frameworks
There are many frameworks that make agentic systems easier to implement, including:
- LangGraph from LangChain;
- Amazon Bedrock’s AI Agent framework;
- Rivet, a drag and drop GUI LLM workflow builder; and
- Vellum, another GUI tool for building and testing complex workflows.
These frameworks make it easy to get started by simplifying standard low-level tasks like calling LLMs, defining and parsing tools, and chaining calls together. However, they often create extra layers of abstraction that can obscure the underlying prompts and responses, making them harder to debug. They can also make it tempting to add complexity when a simpler setup would suffice.
We suggest that developers start by using LLM APIs directly: many patterns can be implemented in a few lines of code. If you do use a framework, ensure you understand the underlying code. Incorrect assumptions about what’s under the hood are a common source of customer error.
See our cookbook for some sample implementations.
Building blocks, workflows, and agents
In this section, we’ll explore the common patterns for agentic systems we’ve seen in production. We’ll start with our foundational building block—the augmented LLM—and progressively increase complexity, from simple compositional workflows to autonomous agents.
Building block: The augmented LLM
The basic building block of agentic systems is an LLM enhanced with augmentations such as retrieval, tools, and memory. Our current models can actively use these capabilities—generating their own search queries, selecting appropriate tools, and determining what information to retain.
Figure: The augmented LLM
We recommend focusing on two key aspects of the implementation: tailoring these capabilities to your specific use case and ensuring they provide an easy, well-documented interface for your LLM. While there are many ways to implement these augmentations, one approach is through our recently released Model Context Protocol, which allows developers to integrate with a growing ecosystem of third-party tools with a simple client implementation.
For the remainder of this post, we’ll assume each LLM call has access to these augmented capabilities.
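To make this concrete, here is a minimal sketch of an augmented LLM call in Python, assuming the Anthropic Python SDK’s Messages API. The calculator tool schema, the model alias, and the idea of pasting retrieved context directly into the prompt are illustrative assumptions for this sketch, not recommendations from the original post.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Illustrative tool definition; the name and schema are assumptions made for this sketch.
calculator_tool = {
    "name": "calculator",
    "description": "Evaluate a basic arithmetic expression and return the result.",
    "input_schema": {
        "type": "object",
        "properties": {"expression": {"type": "string"}},
        "required": ["expression"],
    },
}

def augmented_llm_call(user_query: str, retrieved_context: str):
    """One LLM call with retrieval (context placed in the prompt) and a tool available."""
    # The returned message may contain text and/or tool_use blocks; the agent
    # section later in this post sketches a full tool-execution loop.
    return client.messages.create(
        model="claude-3-5-sonnet-latest",  # assumed model alias
        max_tokens=1024,
        tools=[calculator_tool],
        messages=[{
            "role": "user",
            "content": f"Context from retrieval:\n{retrieved_context}\n\nQuestion: {user_query}",
        }],
    )
```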
Workflow: Prompt chaining
Prompt chaining decomposes a task into a sequence of steps, where each LLM call processes the output of the previous one. You can add programmatic checks (see “gate” in the diagram below) on any intermediate steps to ensure that the process is still on track.
Figure: The prompt chaining workflow
**When to use this workflow:** This workflow is ideal for situations where the task can be easily and cleanly decomposed into fixed subtasks. The main goal is to trade off latency for higher accuracy, by making each LLM call an easier task.
Examples where prompt chaining is useful:
- Generating marketing copy, then translating it into a different language.
- Writing an outline of a document, checking that the outline meets certain criteria, then writing the document based on the outline.
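As a rough illustration of the first example above, here is a sketch of a two-step chain with a programmatic gate between the steps. The llm_call helper, model alias, word-count gate, and prompt wording are assumptions made for this sketch.

```python
import anthropic

client = anthropic.Anthropic()

def llm_call(prompt: str) -> str:
    """Single model call returning the text of the response (assumed helper)."""
    msg = client.messages.create(
        model="claude-3-5-sonnet-latest",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

def marketing_copy_pipeline(product_brief: str, target_language: str) -> str:
    # Step 1: generate the copy.
    copy = llm_call(f"Write short marketing copy for this product:\n{product_brief}")

    # Gate: a simple programmatic check before the chain continues.
    if len(copy.split()) > 200:
        copy = llm_call(f"Shorten this copy to under 150 words, keeping the key message:\n{copy}")

    # Step 2: translate the approved copy.
    return llm_call(f"Translate the following marketing copy into {target_language}:\n{copy}")
```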
Workflow: Routing
Routing classifies an input and directs it to a specialized followup task. This workflow allows for separation of concerns, and building more specialized prompts. Without this workflow, optimizing for one kind of input can hurt performance on other inputs.
Figure: The routing workflow
**When to use this workflow:** Routing works well for complex tasks where there are distinct categories that are better handled separately, and where classification can be handled accurately, either by an LLM or a more traditional classification model/algorithm.
Examples where routing is useful:
- Directing different types of customer service queries (general questions, refund requests, technical support) into different downstream processes, prompts, and tools.
- Routing easy/common questions to smaller models like Claude 3.5 Haiku and hard/unusual questions to more capable models like Claude 3.5 Sonnet to optimize cost and speed.
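One minimal way to wire this up is shown below: a cheaper model classifies the query, and a specialized prompt handles each category. The llm_call helper, category names, handler prompts, and model aliases are illustrative assumptions for this sketch.

```python
import anthropic

client = anthropic.Anthropic()

def llm_call(prompt: str, model: str = "claude-3-5-sonnet-latest") -> str:
    msg = client.messages.create(
        model=model, max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

# Specialized downstream prompts per category (contents are illustrative).
HANDLER_PROMPTS = {
    "general_question": "You are a friendly support agent. Answer this question:\n{query}",
    "refund_request": "You handle refund requests. Ask for the order ID and reason, then respond:\n{query}",
    "technical_support": "You are a technical support engineer. Troubleshoot this issue:\n{query}",
}

def route(query: str) -> str:
    # A smaller, cheaper model classifies the query into one of the known categories.
    category = llm_call(
        "Classify this customer query as exactly one of: "
        f"{', '.join(HANDLER_PROMPTS)}.\nQuery: {query}\nAnswer with the category name only.",
        model="claude-3-5-haiku-latest",
    ).strip()
    # Fall back to the general handler if the classifier answers unexpectedly.
    prompt = HANDLER_PROMPTS.get(category, HANDLER_PROMPTS["general_question"])
    return llm_call(prompt.format(query=query))
```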
Workflow: Parallelization
LLMs can sometimes work simultaneously on a task and have their outputs aggregated programmatically. This workflow, parallelization, manifests in two key variations:
- **Sectioning:** Breaking a task into independent subtasks run in parallel.
- **Voting:** Running the same task multiple times to get diverse outputs.
Figure: The parallelization workflow
**When to use this workflow:** Parallelization is effective when the divided subtasks can be parallelized for speed, or when multiple perspectives or attempts are needed for higher confidence results. For complex tasks with multiple considerations, LLMs generally perform better when each consideration is handled by a separate LLM call, allowing focused attention on each specific aspect.
Examples where parallelization is useful:
- Sectioning:
  - Implementing guardrails where one model instance processes user queries while another screens them for inappropriate content or requests. This tends to perform better than having the same LLM call handle both guardrails and the core response.
  - Automating evals for evaluating LLM performance, where each LLM call evaluates a different aspect of the model’s performance on a given prompt.
- Voting:
  - Reviewing a piece of code for vulnerabilities, where several different prompts review and flag the code if they find a problem.
  - Evaluating whether a given piece of content is inappropriate, with multiple prompts evaluating different aspects or requiring different vote thresholds to balance false positives and negatives.
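The sketch below shows both variations under the same assumptions as the earlier sketches: sectioning as two concurrent calls (one answers, one screens) and voting as several independent code reviews aggregated by majority. The llm_call helper, prompts, and thresholds are illustrative.

```python
import anthropic
from concurrent.futures import ThreadPoolExecutor

client = anthropic.Anthropic()

def llm_call(prompt: str) -> str:
    msg = client.messages.create(
        model="claude-3-5-sonnet-latest", max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

def guarded_answer(query: str) -> str:
    """Sectioning: answer the query and screen it in parallel with separate LLM calls."""
    with ThreadPoolExecutor() as pool:
        answer = pool.submit(llm_call, f"Answer the user's question:\n{query}")
        screen = pool.submit(
            llm_call,
            f"Does this query ask for anything inappropriate? Reply YES or NO only:\n{query}",
        )
    if screen.result().strip().upper().startswith("YES"):
        return "Sorry, I can't help with that request."
    return answer.result()

def is_code_vulnerable(code: str, votes: int = 3) -> bool:
    """Voting: several independent reviews; flag the code if a majority report a problem."""
    prompt = f"Review this code for security vulnerabilities. Reply FLAG or OK only:\n{code}"
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda _: llm_call(prompt), range(votes)))
    return sum(r.strip().upper().startswith("FLAG") for r in results) > votes / 2
```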
Workflow: Orchestrator-workers
In the orchestrator-workers workflow, a central LLM dynamically breaks down tasks, delegates them to worker LLMs, and synthesizes their results.
Figure: The orchestrator-workers workflow
**When to use this workflow:** This workflow is well-suited for complex tasks where you can’t predict the subtasks needed (in coding, for example, the number of files that need to be changed and the nature of the change in each file likely depend on the task). While it’s topographically similar, the key difference from parallelization is its flexibility—subtasks aren’t pre-defined, but determined by the orchestrator based on the specific input.
Examples where orchestrator-workers is useful:
- Coding products that make complex changes to multiple files each time.
- Search tasks that involve gathering and analyzing information from multiple sources for possible relevant information.
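Here is one way the pattern could be wired up, assuming the orchestrator returns its plan as a JSON array of subtask descriptions. The llm_call helper, prompts, and JSON convention are assumptions of this sketch; a real implementation would need more robust plan parsing and could run the workers in parallel.

```python
import json

import anthropic

client = anthropic.Anthropic()

def llm_call(prompt: str) -> str:
    msg = client.messages.create(
        model="claude-3-5-sonnet-latest", max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

def orchestrate(task: str) -> str:
    # The orchestrator decides which subtasks this particular input needs.
    plan = llm_call(
        "Break the following task into independent subtasks. "
        'Return only a JSON array of strings, e.g. ["subtask 1", "subtask 2"].\n'
        f"Task: {task}"
    )
    subtasks = json.loads(plan)  # assumes the model followed the JSON-only instruction

    # Worker calls handle each subtask.
    worker_outputs = [
        llm_call(f"Overall task: {task}\nYour subtask: {subtask}\nComplete only your subtask.")
        for subtask in subtasks
    ]

    # A final call synthesizes the workers' results into one answer.
    return llm_call(
        f"Task: {task}\nSubtask results:\n" + "\n---\n".join(worker_outputs)
        + "\nSynthesize these into a single coherent answer."
    )
```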
Workflow: Evaluator-optimizer
In the evaluator-optimizer workflow, one LLM call generates a response while another provides evaluation and feedback in a loop.
Figure: The evaluator-optimizer workflow
**When to use this workflow:** This workflow is particularly effective when we have clear evaluation criteria, and when iterative refinement provides measurable value. The two signs of good fit are, first, that LLM responses can be demonstrably improved when a human articulates their feedback; and second, that the LLM can provide such feedback. This is analogous to the iterative writing process a human writer might go through when producing a polished document.
Examples where evaluator-optimizer is useful:
- Literary translation where there are nuances that the translator LLM might not capture initially, but where an evaluator LLM can provide useful critiques.
- Complex search tasks that require multiple rounds of searching and analysis to gather comprehensive information, where the evaluator decides whether further searches are warranted.
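A sketch of the generate-evaluate loop for the translation example might look like the following; the llm_call helper, the APPROVED convention, and the round limit are illustrative assumptions.

```python
import anthropic

client = anthropic.Anthropic()

def llm_call(prompt: str) -> str:
    msg = client.messages.create(
        model="claude-3-5-sonnet-latest", max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

def translate_with_feedback(text: str, language: str, max_rounds: int = 3) -> str:
    # Generator: produce an initial draft.
    draft = llm_call(f"Translate the following into {language}, preserving tone and nuance:\n{text}")
    for _ in range(max_rounds):
        # Evaluator: critique the draft against the source.
        critique = llm_call(
            f"Source text:\n{text}\n\nTranslation:\n{draft}\n\n"
            "Critique the translation's accuracy, tone, and nuance. "
            "If no changes are needed, reply with exactly APPROVED."
        )
        if critique.strip() == "APPROVED":
            break
        # Optimizer: revise the draft using the evaluator's feedback.
        draft = llm_call(
            f"Source text:\n{text}\n\nCurrent translation:\n{draft}\n\n"
            f"Reviewer feedback:\n{critique}\n\nProduce an improved translation."
        )
    return draft
```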
Agents
Agents are emerging in production as LLMs mature in key capabilities—understanding complex inputs, engaging in reasoning and planning, using tools reliably, and recovering from errors. Agents begin their work with either a command from, or interactive discussion with, the human user. Once the task is clear, agents plan and operate independently, potentially returning to the human for further information or judgement. During execution, it’s crucial for the agents to gain “ground truth” from the environment at each step (such as tool call results or code execution) to assess their progress. Agents can then pause for human feedback at checkpoints or when encountering blockers. The task often terminates upon completion, but it’s also common to include stopping conditions (such as a maximum number of iterations) to maintain control.
Agents can handle sophisticated tasks, but their implementation is often straightforward. They are typically just LLMs using tools based on environmental feedback in a loop. It is therefore crucial to design toolsets and their documentation clearly and thoughtfully. We expand on best practices for tool development in Appendix 2 (“Prompt Engineering your Tools”).
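A bare-bones version of such a loop, using the Anthropic Messages API’s tool-use flow, could look like the sketch below. The read_file tool, the run_tool dispatcher, the model alias, and the turn limit are placeholders for illustration rather than a reference implementation.

```python
import anthropic

client = anthropic.Anthropic()

# A single illustrative tool; real agents would expose a carefully designed toolset.
TOOLS = [{
    "name": "read_file",
    "description": "Read a text file from the working directory and return its contents.",
    "input_schema": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
    },
}]

def run_tool(name: str, tool_input: dict) -> str:
    """Execute a tool call in the environment and return the result as text."""
    if name == "read_file":
        with open(tool_input["path"]) as f:
            return f.read()
    return f"Unknown tool: {name}"

def run_agent(task: str, max_turns: int = 10) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):  # stopping condition to maintain control
        response = client.messages.create(
            model="claude-3-5-sonnet-latest", max_tokens=4096,
            tools=TOOLS, messages=messages,
        )
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason != "tool_use":
            # No tool calls requested: treat the task as complete.
            return "".join(b.text for b in response.content if b.type == "text")
        # Execute each requested tool and feed the results ("ground truth") back in.
        tool_results = [{
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": run_tool(block.name, block.input),
        } for block in response.content if block.type == "tool_use"]
        messages.append({"role": "user", "content": tool_results})
    return "Stopped after reaching the maximum number of turns."
```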
Figure: Autonomous agent
**When to use agents:** Agents can be used for open-ended problems where it’s difficult or impossible to predict the required number of steps, and where you can’t hardcode a fixed path. The LLM will potentially operate for many turns, and you must have some level of trust in its decision-making. Agents’ autonomy makes them ideal for scaling tasks in trusted environments.
The autonomous nature of agents means higher costs, and the potential for compounding errors. We recommend extensive testing in sandboxed environments, along with the appropriate guardrails.
Examples where agents are useful:
The following examples are from our own implementations:
- A coding agent to resolve SWE-bench tasks, which involve edits to many files based on a task description;
- Our “computer use” reference implementation, where Claude uses a computer to accomplish tasks.
Figure: High-level flow of a coding agent
Combining and customizing these patterns
These building blocks aren’t prescriptive. They’re common patterns that developers can shape and combine to fit different use cases. The key to success, as with any LLM feature, is measuring performance and iterating on implementations. To repeat: you should consider adding complexity only when it demonstrably improves outcomes.
Summary
Success in the LLM space isn’t about building the most sophisticated system. It’s about building the right system for your needs. Start with simple prompts, optimize them with comprehensive evaluation, and add multi-step agentic systems only when simpler solutions fall short.
When implementing agents, we try to follow three core principles:
- Maintain simplicity in your agent’s design.
- Prioritize transparency by explicitly showing the agent’s planning steps.
- Carefully craft your agent-computer interface (ACI) through thorough tool documentation and testing.
Frameworks can help you get started quickly, but don’t hesitate to reduce abstraction layers and build with basic components as you move to production. By following these principles, you can create agents that are not only powerful but also reliable, maintainable, and trusted by their users.
Original post: https://www.anthropic.com/research/building-effective-agents