LLMs and Agents: Translation and Interpretation of "OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning"

Article link: https://blog.csdn.net/qq_41185868/article/details/147056018

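Overview: This paper introduces OctoTools, a training-free, extensible agentic framework for complex reasoning. It combines standardized tool cards, a planner, an executor, and a task-specific toolset optimization algorithm into a general-purpose approach, and reports sizable accuracy gains across 16 benchmarks. The paper backs its effectiveness claims with analyses and ablations, and offers practical guidance for future AI agent development.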

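>> Background and pain points: limitations of existing methods. Existing approaches to strengthening the complex reasoning of large language models (LLMs) have the following limitations:

● Domain restrictions: many methods are confined to a specific domain or a limited set of tool types.

● Dependence on training data: some methods need large amounts of annotated data for training, which makes them hard to adapt to new domains and tasks.

● Poor extensibility: adding a new tool usually means modifying the framework itself.

● Weak multi-step reasoning: some methods cannot easily support multi-step reasoning and planning.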

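>> Proposed solution: the OctoTools framework. OctoTools is a training-free agentic framework that addresses these problems with three core components:

● Tool cards: standardized wrappers around external tools, carrying metadata such as the tool description, input and output formats, usage constraints, and best practices. This makes adding, replacing, or extending tools straightforward.

● Planner: handles both high-level and low-level planning. High-level planning drafts a global plan; low-level planning refines the action at each step based on the current context.

● Executor: converts the planner's text-based action into an executable command, runs the corresponding tool, and writes the result back into the context.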

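>> Core workflow:

● Query Analyzer: analyzes the user query and identifies the required skills and relevant tools.

● Action Predictor: predicts the next action (tool selection and sub-goal) from the context and the global objective.

● Command Generator: converts the action into an executable command.

● Command Executor: runs the command, collects the result, and updates the context.

● Context Verifier: checks whether the current context is sufficient to complete the task and decides whether another step is needed.

● Solution Summarizer: produces the final answer from the full execution trajectory.

● Task-specific Toolset Optimization: a separate greedy search that selects a beneficial subset of tools for each task based on validation-set performance.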
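
The workflow above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the OctoTools implementation: every name here (`Action`, `Context`, `TOOLS`, `solve`, and the stage functions) is hypothetical, and each stage that would call an LLM in a real system is replaced by a simple deterministic stand-in so that the example runs on its own.

```python
import re
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Action:
    tool: str      # tool chosen for this step
    sub_goal: str  # what the step should achieve


@dataclass
class Context:
    query: str
    trajectory: list = field(default_factory=list)  # one dict per executed step


# Hypothetical toolbox: tool name -> Python callable.
TOOLS: dict[str, Callable[[list[float]], float]] = {
    "adder": lambda numbers: sum(numbers),
    "maximum": lambda numbers: max(numbers),
}


def analyze_query(query: str) -> list[str]:
    """Query Analyzer stand-in: keep tools whose name appears in the query."""
    mentioned = [name for name in TOOLS if name in query.lower()]
    return mentioned or list(TOOLS)


def predict_action(ctx: Context, candidates: list[str]) -> Action:
    """Action Predictor stand-in: pick the next candidate tool not yet used."""
    used = {step["action"].tool for step in ctx.trajectory}
    tool = next(name for name in candidates if name not in used)
    return Action(tool=tool, sub_goal=f"apply {tool} to the numbers in the query")


def generate_command(action: Action, ctx: Context) -> tuple[str, list[float]]:
    """Command Generator stand-in: turn the textual action into (tool, arguments)."""
    numbers = [float(tok) for tok in re.findall(r"-?\d+(?:\.\d+)?", ctx.query)]
    return action.tool, numbers


def execute_command(command: tuple[str, list[float]]) -> float:
    """Command Executor: run the selected tool on its arguments."""
    tool, args = command
    return TOOLS[tool](args)


def context_is_sufficient(ctx: Context, candidates: list[str]) -> bool:
    """Context Verifier stand-in: stop once every candidate tool has run."""
    return len(ctx.trajectory) >= len(candidates)


def summarize(ctx: Context) -> str:
    """Solution Summarizer stand-in: report results from the full trajectory."""
    return "; ".join(f"{s['action'].tool}={s['result']:g}" for s in ctx.trajectory)


def solve(query: str, max_steps: int = 5) -> str:
    ctx = Context(query=query)
    candidates = analyze_query(query)
    for _ in range(max_steps):
        action = predict_action(ctx, candidates)
        command = generate_command(action, ctx)
        result = execute_command(command)
        ctx.trajectory.append({"action": action, "command": command, "result": result})
        if context_is_sufficient(ctx, candidates):
            break
    return summarize(ctx)


print(solve("Use the adder on 3, 4 and 5"))  # adder=12
```

Keeping action prediction and command generation as separate functions mirrors the split between the planner and the executor described above: the planner decides what to do in text, and only the executor side touches tool arguments.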

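>> Advantages:

● Training-free: new tools can be integrated without any additional training or fine-tuning of the model.

● Extensible: the tool card mechanism makes it easy to integrate many kinds of tools.

● General: applicable to complex reasoning tasks across multiple domains and task types.

● Multi-step reasoning: the planner and executor together support multi-step reasoning and planning.

● Transparency and maintainability: separating planning from command generation makes the system more reliable and easier to maintain.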

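>> Conclusions and takeaways from the paper:

● Across 16 benchmarks, OctoTools outperforms the baselines: an average accuracy gain of 9.3% over zero-shot GPT-4o and 7.7% over chain-of-thought (CoT) prompting, and up to 10.6% over AutoGen, GPT-Functions, and LangChain when given the same tools.

● The gains are attributed to three factors: dynamic task planning, effective tool usage, and multi-step problem decomposition.

● Ablations show that allowing more steps and optimizing the toolset both improve accuracy, and that OctoTools keeps an average gain of 7.1% when a weaker LLM (GPT-4o-mini) serves as the base engine.

● The tool card mechanism simplifies tool integration and makes the framework more extensible and modular.

● Future work includes test-time inference at the query level, extending multi-agent collaboration, and exploring specialized domains.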

Translation and Interpretation of "OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning"

Link	Paper: https://arxiv.org/abs/2502.11271
Date	February 16, 2025
Authors	Stanford University

Abstract

Solving complex reasoning tasks may involve visual understanding, domain knowledge retrieval, numerical calculation, and multi-step reasoning. Existing methods augment large language models (LLMs) with external tools but are restricted to specialized domains, limited tool types, or require additional training data. In this paper, we introduce OctoTools, a training-free, user-friendly, and easily extensible open-source agentic framework designed to tackle complex reasoning across diverse domains. OctoTools introduces standardized tool cards to encapsulate tool functionality, a planner for both high-level and low-level planning, and an executor to carry out tool usage. We validate OctoTools' generality across 16 diverse tasks (including MathVista, MMLU-Pro, MedQA, and GAIA-Text), achieving substantial average accuracy gains of 9.3% over GPT-4o. Furthermore, OctoTools outperforms AutoGen, GPT-Functions and LangChain by up to 10.6% when given the same set of tools. Through comprehensive analysis and ablations, OctoTools demonstrates advantages in task planning, effective tool usage, and multi-step problem solving.

Figure 1: The framework of OctoTools. (1) Tool cards define tool-usage metadata and encapsulate tools, enabling training-free integration of new tools without additional training or framework refinement. (2) The planner governs both high-level and low-level planning to address the global objective and refine actions step by step. (3) The executor instantiates tool calls by generating executable commands and saves structured results in the context. The final answer is summarized from the full trajectory in the context. Furthermore, the task-specific toolset optimization algorithm learns to select a beneficial subset of tools for downstream tasks. See Figure 3 for an example.

1. Introduction

Large language models (LLMs) (Brown et al., 2020; Chowdhery et al., 2022; OpenAI, 2023b) have made rapid progress on tasks such as summarization, translation (Thoppilan et al., 2022), code generation (Nakano et al., 2021), and math problem solving (Shuster et al., 2022). However, complex reasoning tasks that involve multiple steps, logical decomposition, or specialized domain knowledge remain challenging. For example, solving a visual riddle may require fine-grained image understanding and text-based reasoning, while a math or chemistry question can require thorough computations or domain expertise. Existing prompting methods often fail to orchestrate these varied processes into a coherent chain of reasoning (Yao et al., 2022).

A promising direction to address these challenges is to augment LLMs with external tools. By offloading specialized subtasks (e.g., web queries, Python-based calculations, and specialized scientific tools) to dedicated modules, LLMs can focus on higher-level planning and synthesis. Several frameworks have explored such tool usage, from those relying on extensive supervised data and fine-tuning (Schick et al., 2023; Liu et al., 2023), to static solutions without refinement (Lu et al., 2023), and those limited to one specialized domain of tools (Nakano et al., 2021; Tao et al., 2023; Hu et al., 2024). Although these methods perform well on specific tasks, they still face challenges that hinder general widespread use. Many require substantial training with curated data, which limits their adaptability to new domains. Others are designed for a particular domain (Bran et al., 2023; Kang & Kim, 2024; Li et al., 2024a; Schmidgall et al., 2024) or cannot easily support multi-step problem-solving (Lu et al., 2023), restricting their generality.

In this paper, we propose OctoTools, a training-free (i.e., it does not require updating model weights), user-friendly, and extensible agentic framework for tackling complex reasoning tasks across diverse domains (Figure 1). A key feature of OctoTools is the concept of tool cards, standardized wrappers that encapsulate heterogeneous tools (e.g., Python calculators, web search APIs, and domain-specific modules), along with metadata such as input-output formats, usage constraints, and best practices that delineate ideal use cases. This standardized design enables easy integration, replacement, or expansion of tools—unlike approaches requiring painstaking re-engineering for each new tool (Lu et al., 2023; Hu et al., 2024).
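
To make the tool card idea concrete, here is a minimal sketch of what such a wrapper could look like. It is an illustration only and not the OctoTools API: the `ToolCard` class, its field names, the `REGISTRY` dictionary, the `register` and `describe_tools` helpers, and the `km_to_m` tool are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToolCard:
    """Hypothetical tool card: a callable plus the metadata a planner can read."""
    name: str
    description: str
    input_types: dict[str, str]   # argument name -> human-readable type
    output_type: str
    limitations: list[str]        # usage constraints
    best_practices: list[str]     # hints about ideal use cases
    run: Callable[..., object]    # the wrapped tool itself


# Hypothetical registry: adding a tool is one register() call, no framework changes.
REGISTRY: dict[str, ToolCard] = {}


def register(card: ToolCard) -> None:
    if card.name in REGISTRY:
        raise ValueError(f"tool {card.name!r} is already registered")
    REGISTRY[card.name] = card


def describe_tools() -> str:
    """Render the metadata as text, e.g. for inclusion in a planner prompt."""
    return "\n".join(
        f"- {card.name}: {card.description} "
        f"(inputs: {card.input_types}; output: {card.output_type})"
        for card in REGISTRY.values()
    )


register(ToolCard(
    name="km_to_m",
    description="Convert kilometres to metres.",
    input_types={"km": "float"},
    output_type="float",
    limitations=["Handles a single scalar value per call."],
    best_practices=["Use before comparing distances given in mixed units."],
    run=lambda km: km * 1000.0,
))

print(describe_tools())
print(REGISTRY["km_to_m"].run(km=2.5))  # 2500.0
```

The design point this sketch tries to capture is that the planner only ever sees the metadata, so swapping or adding a tool does not require touching the planning logic.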

Building on these tool cards, OctoTools employs a dedicated planner that governs both high-level and low-level planning. Given a user query, the planner proposes a tentative global plan for how various tools might be employed. At each step, it generates a text-based action (including sub-goals and tool selection) conditioned on the evolving context. A separate executor instantiates tool calls by converting this textual action into an executable command, running the corresponding tool, and updating the context with the results. By separating strategic planning from command generation, OctoTools reduces errors and increases transparency, making the system more reliable and easier to maintain.

An additional challenge in agentic systems is determining which subset of tools to enable for a given domain. Although providing many tools can be beneficial, enabling them all may introduce noise or slow performance (Lumer, 2024; Fore et al., 2024; Paramanayakam et al., 2024). To address this, we propose a lightweight toolset optimization algorithm that identifies a more useful subset of tools for each task based on validation performance, ultimately improving both accuracy and efficiency.
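
The text above does not spell out the optimization procedure, so the following is only one plausible greedy selection under stated assumptions: each candidate tool is evaluated together with a base toolset on validation data and kept if accuracy improves. The function `select_toolset`, the `validation_accuracy` callback, and the toy numbers in `TOY_GAINS` are all hypothetical; a real evaluator would run the agent on validation examples.

```python
from typing import Callable, Iterable


def select_toolset(
    base_tools: frozenset[str],
    candidate_tools: Iterable[str],
    validation_accuracy: Callable[[frozenset[str]], float],
) -> frozenset[str]:
    """Keep each candidate whose addition to the base toolset improves accuracy."""
    baseline = validation_accuracy(base_tools)
    selected = set(base_tools)
    for tool in candidate_tools:
        gain = validation_accuracy(base_tools | {tool}) - baseline
        if gain > 0:  # the tool helps on validation data, so enable it
            selected.add(tool)
    return frozenset(selected)


# Toy stand-in for a real evaluation: made-up per-tool accuracy deltas.
TOY_GAINS = {"calculator": 0.04, "web_search": -0.01, "image_captioner": 0.02}


def toy_accuracy(tools: frozenset[str]) -> float:
    return 0.50 + sum(TOY_GAINS.get(tool, 0.0) for tool in tools)


chosen = select_toolset(frozenset({"generalist"}), TOY_GAINS, toy_accuracy)
print(sorted(chosen))  # ['calculator', 'generalist', 'image_captioner']
```

This variant needs only one validation run per candidate tool, which is why it stays lightweight; it does not search over combinations of tools, so interactions between tools are ignored.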

While recent general agent frameworks also allow LLMs to use external tools autonomously, they often focus on high-level abstractions (LangChain, 2024), limited observability of intermediate decisions (OpenAI, 2023a), or multi-agent collaboration features (AutoGen, 2024), with less emphasis on enhancing complex reasoning and quantitatively benchmarking downstream task performance. In contrast, we systematically evaluate the entire agentic workflow of OctoTools across diverse tasks, providing in-depth analyses of when and how tool-based reasoning succeeds or fails in complex reasoning scenarios.

We conduct large-scale experiments across 16 diverse reasoning benchmarks, spanning general vision, mathematical, scientific, medical, and agentic domains. As summarized in Figure 2, OctoTools substantially outperforms other baselines, achieving an average accuracy gain of 9.3% over zero-shot prompting by GPT-4o and 7.7% over chain-of-thought (CoT) prompting, as well as up to 10.6% improvement compared to existing agentic frameworks when given the same tools (AutoGen, 2024; OpenAI, 2023a; LangChain, 2024). Detailed analyses show that OctoTools effectively combines multi-step planning and specialized tool usage, with each dimension providing distinct improvements. For tasks requiring intricate calculations or specialized knowledge, we found tool usage is particularly beneficial; for tasks requiring reasoning decomposition, we found multi-step planning offers significant gains.

Furthermore, our ablation studies offer insights into OctoTools’s performance under different conditions. Overall, the average accuracy tends to improve as the maximum number of steps increases. Without any toolset optimization, simply enabling all tools in the toolset yields 57.4% accuracy, which still surpasses the setup with only the base tool by 3.5%, suggesting a degree of generalization as the toolset expands. Learning the optimal toolset for specific tasks raises the overall performance to 58.9%, indicating the benefit of further optimization. Additionally, when using a weaker LLM (GPT-4o-mini) as the base engine, OctoTools maintains a strong average gain of 7.1% across 16 tasks.
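
The figures in this paragraph imply a few numbers that are not stated directly. The short calculation below derives them, assuming the reported 3.5% margin is an absolute difference in percentage points (the text does not say so explicitly); the variable names are ours.

```python
# Figures quoted in the paragraph above (average accuracy, in percent).
ALL_TOOLS = 57.4       # every tool enabled, no toolset optimization
GAIN_OVER_BASE = 3.5   # margin of ALL_TOOLS over the base-tool-only setup
OPTIMIZED = 58.9       # task-specific optimized toolset

# Assumption: margins are absolute percentage points, not relative gains.
base_only = round(ALL_TOOLS - GAIN_OVER_BASE, 1)
gain_from_optimization = round(OPTIMIZED - ALL_TOOLS, 1)
total_gain_over_base = round(OPTIMIZED - base_only, 1)

print(f"base tool only:         {base_only:.1f}%")               # 53.9%
print(f"optimization adds:      {gain_from_optimization:.1f} points")  # 1.5 points
print(f"optimized vs base only: {total_gain_over_base:.1f} points")    # 5.0 points
```

Under that assumption, most of the improvement over the base-tool-only setup comes from simply enabling more tools, with toolset optimization contributing a smaller additional margin.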

Our contributions are as follows: (1) We propose OctoTools, a training-free, extensible agentic framework that enables LLMs to call external tools in multiple steps, without the need for additional training or fine-tuning. (2) We introduce a comprehensive planner-executor paradigm with standardized tool cards, which can be easily customized or expanded for new domains. (3) We conduct large-scale experiments on 16 diverse benchmarks and show that OctoTools improves performance by a sizable margin compared to baseline prompting and other agentic frameworks. (4) We provide in-depth analyses and ablations on how multi-step reasoning and tool usage contribute to performance, offering practical guidance for future agent development.

Figure 2: Performance comparison across 16 benchmarks. Our OctoTools framework achieves an average accuracy gain of 9.3% over GPT-4o without function plugins and 7.3% over LangChain, using the same tools under the same configuration.

Figure 3: The demonstration of a self-contained example from Figure 1. We visualize the tool cards for selected tools, the initial plan generated by the planner, and two steps in which the planner and the executor orchestrate low-level planning and tool usage before arriving at the final answer. See §F.1 for details and §F for more examples. An interactive visualization of these examples is available at https://octotools.github.io/#visualization

Conclusion

In this paper, we introduced OctoTools, a training-free, extensible agentic framework for complex reasoning. OctoTools employs standardized tool cards to facilitate seamless integration of diverse tools and a dedicated planner-executor workflow that separates high-level planning over multiple steps from low-level planning and command generation within each step. Through extensive experiments on 16 diverse benchmarks, OctoTools consistently outperforms baselines, achieving average accuracy gains of up to 9.3% over GPT-4o and up to 10.6% over strong agentic frameworks. Our in-depth analysis shows that OctoTools’ improvements stem from dynamic task planning, effective tool usage, and multi-step problem decomposition.

Ablation studies highlight the benefits of allowing more steps and refining the toolset, and demonstrate robustness when OctoTools is deployed with a weaker LLM. By streamlining the integration of new or specialized modules through tool cards, OctoTools readily adapts to a broad range of tasks. We believe our findings open new ecosystems for building next-generation AI agents that are more transparent, modular, and effective at solving real-world problems. Future work includes test-time inference at the query level, extending multi-agent collaboration, and exploring specialized domains.
