AGI之Agent:《LLM Powered Autonomous Agents基于LLM驱动的智能体》的翻译与解读
AGI之Agent:《基于LLM驱动的智能体—三大组件(规划+记忆+工具使用)、四大案例(ChemCrow/AutoGPT/GPT-Engineer/Generative Agents)、三大挑战(有限的上下文长度/规划困难/自然语言接口的可靠性)》的翻译与解读
导读:这篇论文讨论了如何利用大型语言模型(LLM)构建Agent系统,系统性分析了LLM代理系统各个组成模块,归纳了前沿工作的优点与不足,提出了解决受限问题的思路,对未来此类系统设计提供了参考。它阐述了LLM在智能代理领域应用前景和难点,值得深入研究。
背景痛点:
>> 当前大语言模型主要用于生成文字、故事等,但其潜力不仅限于此,也可以作为通用问题求解器。
>> 人工智能代理系统需要实现规划、记忆、工具使用等功能。
具体解决方案—组成架构:智能Agent系统采用大语言模型LLM充当代理的“中央大脑”,负责进行规划、决策及完成具体任务,并与几个关键组件相结合来实现智能决策和问题解决能力。这些组件包括规划、记忆、工具使用。
>> LLM:大语言模型相当于Agent的“大脑”—核心控制器,负责进行规划、决策及完成具体任务。
>> 规划组件(Planning):侧重任务分解(提升规划效能)和自我反思(提升学习质量)。Agent能够将复杂任务分解为可管理的子目标,以便有效处理复杂任务。利用任务分解方式进行长期规划,比如采用链式思考(CoT)或树形思考(ToT)技术将复杂任务分解为简单子任务;利用反思机制使代理从过去经历中学习教训来改进决策。但LLM难以处理长期规划和错误修正。
>> 记忆组件(Memory):通过向量存储和快速检索提供外部知识支持。利用内外存储来实现短期记忆和长期记忆。短期记忆对应LLM的上下文学习,用于当前任务;而长期记忆则通过外部向量存储库和快速检索(如MIPS最大内积搜索)来保留和检索信息。
>> 工具使用组件(Tool Use):通过训练LLM,将LLM能力扩展到调用外部API来解决模型自身无法处理的问题。让Agent学会如何调用外部API来获取模型权重之外的额外信息,以便补足任务中缺失的信息或访问专有信息源等。
核心特点:优化交流效率、利用反思机制学习错误、调用外部规划器等。
>> 大语言模型作为核心控制器,能进行社交和通用问题求解。
>> 三大组成部分互相协作,实现复杂任务的规划与完成。
>> 利用外部资源弥补模型本身不足,有效扩展代理能力。
优势:
>> 提出了一个系统化的自主代理系统构建框架。
>> 给出了复杂任务分解和自我学习的方法论参考。
>> 案例分析验证了这类架构在不同场景下的应用潜力。例子展示了LLM辅助科学发现、生成式虚拟角色模拟等应用。
>> 开启了利用大语言模型创建智能代理的新思路。
案例研究:科学发现代理ChemCrow、生成式虚拟角色模拟、AutoGPT等验证概念的例子。
主要挑战:有限的上下文长度、规划困难、自然语言接口的可靠性等。加强LLM使用外部工具的能力,是构建自主Agent的关键组件之一,但工具接口的可靠性仍然是一个挑战。LLM赋能的Agent具有巨大潜力,但仍面临一些限制。如何克服这些限制,让Agent在复杂环境中可靠地规划和完成任务,是当前的研究方向。
目录
《LLM Powered Autonomous Agents》的翻译与解读
0、一个强大的通用问题解决器:LLM可作为核心控制器来构建的自主Agent系统(如AutoGPT、GPT-Engineer和BabyAGI)
1、Agent System Overview—由LLM驱动的自主Agent系统的三大关键组件:规划(分解为子目标+反思和改进)、记忆(短期+长期)、工具使用
P1、Component One: Planning组件一:规划=任务分解(CoT/ToT)+自我反思(ReAct/Reflexion/CoH/AD)
规划是复杂任务的关键,Agent需要了解任务的各个步骤并提前进行规划
1.1、Task Decomposition任务分解:CoT(指导模型“逐步思考”)、ToT(在每个步骤探索多个推理可能性来扩展CoT)
任务分解的四种方式:简单提示的LLM、任务特定的指令、人类输入、LLM+P(依赖外部的经典规划器来进行长期规划,LLM将问题转化为“问题PDDL”→生成PDDL规划→转回自然语言)
1.2、Self-Reflection自我反思:试错不可避免
ReAct—侧重自我反思:ReAct将思考和行动相结合扩展了行动空间+遵循Thought-Action-Observation的模板
图2. 知识密集型任务(如HotpotQA、FEVER)和决策任务(如AlfWorld Env、WebShop)的推理轨迹示例
Reflexion—侧重自我反思:赋予Agent动态记忆和自我反思能力用于提高推理技能+采用标准的强化学习
图4. 在AlfWorld Env和HotpotQA上的实验。在AlfWorld中,幻觉比低效的规划更常见
CoH—采用监督学习和历史数据+侧重生成任务:通过过去输出序列来进行自我改进+监督微调+添加正则化项来避免过拟合
图5. 经过CoH的微调后,模型可以按照指令产生序列递增的改进
AD算法蒸馏—采用监督学习和历史数据+侧重强化学习任务:利用多次与环境的交互+目标是学习RL的过程
图7. AD、ED、源策略和RL^2在需要内存和探索的环境中的比较。只分配二元奖励。源策略用A3C训练用于“黑暗”环境,DQN训练用于水迷宫。
P2、Component Two: Memory组件二:记忆=三大类型(SM/STM/LTM)+MIPS(常用的ANN包括LSH/ANNOY/HNSW/FAISS/ScaNN)
(2)、短期记忆STM:对应上下文学习(In-context Learning),受限于Transformer的有限上下文窗口长度
(3)、长期记忆LTM:可长期存储信息,Agent通过外部向量存储访问,包括显式记忆【有意识/事实与事件】和隐式记忆【无意识/下意识的技能动作】
Fig. 8. 人类记忆的分类 Categorization of human memory.
2.2、Maximum Inner Product Search (MIPS)最大内积搜索(MIPS):通过外部存储来减轻有限注意力范围的限制的技术
(1)、通过向量存储和快速MIPS检索,可以扩展注意力窗口带来的限制,提供对更大知识库的访问
P3、Component Three: Tool Use组件三:工具使用=MRKL/TALM或Toolformer/HuggingGPT/API-Bank
工具使用是人类的重要特征,为LLM配备外部工具可以大大扩展其能力
图10. 一只海獭漂浮在水里,用石头敲开贝壳的照片。虽然其他一些动物可以使用工具,但其复杂性无法与人类相比
MRKL:通过将LLM作为路由器,调用专家模块(神经或符号),实现工具使用。实验显示LLM调用计算器时存在可靠性问题
TALM、Toolformer:通过标注数据微调LM,使其学会调用外部工具API,从而扩展模型的功能
ChatGPT插件、OpenAI API调用体现了工具使用的实践价值
HuggingGPT:使用ChatGPT进行任务规划,调用HuggingFace平台模型,生成响应,但是面临效率、上下文长度、输出稳定性等挑战
HuggingGPT 的4个阶段:任务规划→模型选择→任务执行→响应生成
HuggingGPT应用的三个挑战:提高效率、上下文窗口、稳定性
API-Bank:一个用于评估工具增强LLMs性能的基准,包含53个常用API工具
图12. LLM如何在API-Bank中调用API的伪代码
API-Bank工作流程中的三个决定:是否需要API调用→确定要调用的正确API→基于API结果的响应
评估Agent的工具使用能力的三个级别:调用API的能力、检索API的能力和规划API的能力
4、Case Studies案例研究:ChemCrow、Generative Agents、AutoGPT、GPT-Engineer
4.1、基于Agent的科学发现:探究如何通过为LLM提供专业工具和知识,来提升其在科学发现领域任务完成能力
ChemCrow:专注有机合成、药物发现和材料设计等任务,将CoT推理与任务相关的工具相结合=LLM+13个专业工具+LangChain框架
ChemCrow效果优于GPT-4:但LLM自身缺乏深度专业知识也会产生一定盲区,难以判断结果的正确性
探究用于科学发现的LLM Agent:自主设计、规划和执行复杂科学实验的可能性。该代理能够利用网络查阅文档、执行代码、调用机器人实验API以及其他LLM,来完成“开发新的抗癌药物”等任务
记忆流:存储Agent所有经历,内容以自然语言形式记录在外部数据库中。
4.3、Proof-of-Concept Examples概念证明的例子:使用LLM作为主要控制器构建自动Agent的概念验证案例
AutoGPT—一个有趣的概念验证演示:一个让 GPT-4 自主完成任务(无需人类干预)的实验性应用。但通过自然语言接口操作存在可靠性问题
GPT-Engineer:可以根据通过自然语言指定的任务,为其构建完整代码库。其采用LLM进行任务细分和需求澄清
5、Challenges三大挑战:上下文长度有限、长期规划和任务分解存在挑战性、自然语言接口的可靠性不足
上下文长度有限:难以包含详细历史信息+更长的上下文才能帮助模型从错误中学习
长期规划和任务分解存在挑战性:LLM在面对意外错误时很难调整计划
自然语言接口的可靠性不足:LLM输出质量难以保证,可能出现格式错误,偶尔还会表现出叛逆行为
《LLM Powered Autonomous Agents》的翻译与解读
| 地址 | https://lilianweng.github.io/posts/2023-06-23-agent/ |
| 时间 | 2023年6月23日 |
| 作者 | Lilian Weng |
0、一个强大的通用问题解决器:LLM可作为核心控制器来构建的自主Agent系统(如AutoGPT、GPT-Engineer和BabyAGI)
| Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concept demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver. | 构建以LLM(大型语言模型)为核心控制器的Agent是一个很酷的概念。一些概念验证演示,如AutoGPT、GPT-Engineer和BabyAGI,都是鼓舞人心的示例。LLM的潜力不仅限于生成优质的文案、故事、论文和程序;它可以被视为一个强大的通用问题解决器。 |
1、Agent System Overview—由LLM驱动的自主Agent系统的三大关键组件:规划(分解为子目标+反思和改进)、记忆(短期+长期)、工具使用
| In a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components: | 在一个由LLM驱动的自主Agent系统中,LLM作为Agent的大脑,由几个关键组件补充: |
| (1)、Planning Subgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks. Reflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistakes and refine them for future steps, thereby improving the quality of final results. (2)、Memory Short-term memory: I would consider all the in-context learning (See Prompt Engineering) as utilizing short-term memory of the model to learn. Long-term memory: This provides the agent with the capability to retain and recall (infinite) information over extended periods, often by leveraging an external vector store and fast retrieval. (3)、Tool use The agent learns to call external APIs for extra information that is missing from the model weights (often hard to change after pre-training), including current information, code execution capability, access to proprietary information sources and more. | (1)、规划 分解为子目标:Agent将大任务分解为较小的、可管理的子目标,从而能够高效地处理复杂任务。 反思和改进:agent可以对过去的行为进行自我批评和反思,从错误中吸取教训,并为未来的步骤进行改进,从而提高最终结果的质量。 (2)、记忆 短期记忆:我认为所有的上下文学习(参见提示工程)都是利用模型的短期记忆来学习的。 长期记忆:这为Agent提供了在较长时间内保留和检索(无限)信息的能力,通常通过利用外部矢量存储和快速检索来实现。 (3)、工具使用 Agent学习调用外部API来获取模型权重中缺少的额外信息(通常在预训练后很难更改),包括当前信息、代码执行能力、对专有信息源的访问等等。 |
图1. LLM驱动的自主Agent系统概述
P1、Component One: Planning组件一:规划=任务分解(CoT/ToT)+自我反思(ReAct/Reflexion/CoH/AD)
规划是复杂任务的关键,Agent需要了解任务的各个步骤并提前进行规划
| A complicated task usually involves many steps. An agent needs to know what they are and plan ahead. | 一项复杂的任务通常涉及许多步骤。一个Agent 需要知道这些步骤并提前规划。 |
1.1、Task Decomposition任务分解:CoT(指导模型“逐步思考”)、ToT(在每个步骤探索多个推理可能性来扩展CoT)
任务分解是一种用于增强模型在复杂任务上性能的标准提示技术。
| Chain of thought (CoT; Wei et al. 2022) has become a standard prompting technique for enhancing model performance on complex tasks. The model is instructed to “think step by step” to utilize more test-time computation to decompose hard tasks into smaller and simpler steps. CoT transforms big tasks into multiple manageable tasks and shed lights into an interpretation of the model’s thinking process. | 思维链(CoT;Wei等人,2022)已成为提高模型在复杂任务上性能的标准提示技术。该模型被要求“一步一步地思考”,以利用更多的测试时间计算将困难的任务分解成更小、更简单的步骤。CoT将大任务转化为多个可管理的任务,并揭示了模型思考过程的解释。 |
| Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can be BFS (breadth-first search) or DFS (depth-first search) with each state evaluated by a classifier (via a prompt) or majority vote. | 思维树(ToT,Yao等人,2023)通过在每个步骤探索多种推理可能性来扩展CoT。它首先将问题分解为多个思考步骤,每一步产生多个想法,形成树形结构。搜索过程可以是BFS(广度优先搜索)或DFS(深度优先搜索),每个状态由分类器(通过提示)或多数投票评估。 |
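思维树的"逐层扩展+打分筛选"过程可以用下面的 BFS 骨架来示意(这只是一个在假设接口下的草图,其中 generate_thoughts 与 evaluate_state 代表某种 LLM 调用封装,并非论文原始实现):

```python
# 思维树(ToT)的 BFS 搜索骨架:每一步扩展多个候选"想法",
# 由打分函数保留得分最高的若干状态,逐层推进。
# generate_thoughts / evaluate_state 均为假设存在的 LLM 接口。
def tree_of_thoughts_bfs(problem, generate_thoughts, evaluate_state,
                         max_depth=3, breadth=5, keep_top=2):
    frontier = [""]  # 每个状态是到目前为止的推理步骤串
    for _ in range(max_depth):
        candidates = []
        for state in frontier:
            # 每个状态扩展出多个候选想法(由 LLM 生成)
            for thought in generate_thoughts(problem, state, n=breadth):
                candidates.append(state + "\n" + thought)
        # 用 LLM 打分(或多数投票)评估每个候选状态,保留最优的 keep_top 个
        scored = sorted(candidates, key=lambda s: evaluate_state(problem, s),
                        reverse=True)
        frontier = scored[:keep_top]
    return frontier[0] if frontier else ""
```

搜索过程既可以像上面这样按层展开(BFS),也可以改成深度优先(DFS)并在打分过低时回溯。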
任务分解的四种方式:简单提示的LLM、任务特定的指令、人类输入、LLM+P(依赖外部的经典规划器来进行长期规划,LLM将问题转化为“问题PDDL”→生成PDDL规划→转回自然语言)
| Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.\n1.", "What are the subgoals for achieving XYZ?", (2) by using task-specific instructions; e.g. "Write a story outline." for writing a novel, or (3) with human inputs. | 任务分解可以 (1)由LLM通过简单的提示完成,如“XYZ.\n1.的步骤”,“实现XYZ的子目标是什么?”, (2)通过使用任务特定的指令;例如。“写一个故事大纲”,用于写小说,或者 (3)使用人工输入。 |
| Another quite distinct approach, LLM+P (Liu et al. 2023), involves relying on an external classical planner to do long-horizon planning. This approach utilizes the Planning Domain Definition Language (PDDL) as an intermediate interface to describe the planning problem. In this process, LLM (1) translates the problem into “Problem PDDL”, then (2) requests a classical planner to generate a PDDL plan based on an existing “Domain PDDL”, and finally (3) translates the PDDL plan back into natural language. Essentially, the planning step is outsourced to an external tool, assuming the availability of domain-specific PDDL and a suitable planner which is common in certain robotic setups but not in many other domains. | 另一种截然不同的方法是LLM+P (Liu et al. 2023),它依赖于一个外部经典规划器来进行长期规划。该方法利用规划域定义语言(PDDL)作为描述规划问题的中间接口。在这个过程中,LLM (1)将问题翻译成“problem PDDL”,然后 (2)请求经典规划器基于已有的“Domain PDDL”生成PDDL计划,最后 (3)将PDDL计划翻译回自然语言。从本质上讲,规划步骤被外包给外部工具,假设存在领域特定的PDDL和适用的规划器,这在某些机器人设置中很常见,但在许多其他领域中并不常见。 |
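LLM+P 的"翻译→规划→再翻译"三步流程可以用下面的草图表示(其中 llm 与 classical_planner 均为假设的外部接口,domain_pddl 需事先给定,并非原文代码):

```python
# LLM+P 三步流程示意:自然语言 -> Problem PDDL -> 经典规划器 -> 自然语言。
# llm(prompt) 与 classical_planner(domain, problem) 都是假设存在的外部接口。
def llm_plus_p(task_description, domain_pddl, llm, classical_planner):
    # (1) LLM 把自然语言问题翻译成 Problem PDDL
    problem_pddl = llm(
        f"Translate the following task into a PDDL problem file, "
        f"consistent with this domain:\n{domain_pddl}\n\nTask: {task_description}"
    )
    # (2) 外部经典规划器基于 Domain PDDL + Problem PDDL 求解出规划
    pddl_plan = classical_planner(domain_pddl, problem_pddl)
    # (3) LLM 再把 PDDL 规划翻译回自然语言步骤
    return llm(f"Translate this PDDL plan into natural language steps:\n{pddl_plan}")
```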
1.2、Self-Reflection自我反思:试错不可避免
| Self-reflection is a vital aspect that allows autonomous agents to improve iteratively by refining past action decisions and correcting previous mistakes. It plays a crucial role in real-world tasks where trial and error are inevitable. | 自我反思是一个重要的方面,它允许自主Agent通过改进过去的行动决策和纠正以前的错误来迭代地改进。它在现实世界的任务中扮演着至关重要的角色,因为尝试和错误是不可避免的。 |
ReAct—侧重自我反思:ReAct将思考和行动相结合扩展了行动空间+遵循Thought-Action-Observation的模板
一种将推理和行动结合起来的方法,通过扩展行动空间和语言空间,使LLM能够与环境进行交互并生成自然语言的推理轨迹
| ReAct (Yao et al. 2023) integrates reasoning and acting within LLM by extending the action space to be a combination of task-specific discrete actions and the language space. The former enables LLM to interact with the environment (e.g. use Wikipedia search API), while the latter prompting LLM to generate reasoning traces in natural language. The ReAct prompt template incorporates explicit steps for LLM to think, roughly formatted as: Thought: ... Action: ... Observation: ... ... (Repeated many times) | ReAct (Yao et al. 2023)通过扩展动作空间,将推理和行动整合到LLM中,使其成为任务特定离散动作和语言空间的组合。前者使LLM能够与环境进行交互(例如使用Wikipedia搜索API),而后者提示LLM生成自然语言中的推理轨迹。 ReAct提示模板包含了LLM思考的显式步骤,大致格式为: 思考:…… 行动:…… 观察:…… …(重复多次) |
| In both experiments on knowledge-intensive tasks and decision-making tasks, ReAct works better than the Act-only baseline where Thought: … step is removed. | 在知识密集型任务和决策任务的实验中,ReAct比Act-only基线效果更好,其中Thought:…步骤被删除。 |
Thought: ...
Action: ...
Observation: ...
... (Repeated many times)
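按照上面 Thought → Action → Observation 的模板,一个极简的 ReAct 主循环大致如下(llm 与 tools 中的各个工具均为假设接口,动作格式约定也仅作示意):

```python
# ReAct 主循环草图:LLM 交替产生 Thought 与 Action,
# 环境执行 Action 并把 Observation 追加回提示,直到给出 Finish。
import re

def react_loop(question, llm, tools, max_steps=8):
    prompt = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(prompt + "Thought:")           # LLM 续写出 Thought 与 Action
        prompt += "Thought:" + step + "\n"
        match = re.search(r"Action: *(\w+)\[(.*?)\]", step)
        if not match:
            continue
        name, arg = match.group(1), match.group(2)
        if name == "Finish":                       # 约定 Finish[answer] 表示结束
            return arg
        tool = tools.get(name)
        observation = tool(arg) if tool else f"未知工具 {name}"
        prompt += f"Observation: {observation}\n"  # 观察结果写回上下文
    return None
```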
图2. 知识密集型任务(如HotpotQA、FEVER)和决策任务(如AlfWorld Env、WebShop)的推理轨迹示例
Fig. 2. Examples of reasoning trajectories for knowledge-intensive tasks (e.g. HotpotQA, FEVER) and decision-making tasks (e.g. AlfWorld Env, WebShop). (Image source: Yao et al. 2023).
Reflexion—侧重自我反思:赋予Agent动态记忆和自我反思能力用于提高推理技能+采用标准的强化学习
一种具备动态记忆和自我反思能力的框架,通过启发式函数判断轨迹的效率和幻觉,并根据反思结果决定是否重置环境
| Reflexion (Shinn & Labash 2023) is a framework to equip agents with dynamic memory and self-reflection capabilities to improve reasoning skills. Reflexion has a standard RL setup, in which the reward model provides a simple binary reward and the action space follows the setup in ReAct where the task-specific action space is augmented with language to enable complex reasoning steps. After each action $a_t$, the agent computes a heuristic $h_t$ and optionally may decide to reset the environment to start a new trial depending on the self-reflection results. | Reflexion (Shinn & Labash 2023)是一个为Agent提供动态记忆和自我反思能力、以提高推理能力的框架。Reflexion采用标准的RL设置,其中奖励模型提供简单的二元奖励,行动空间沿用ReAct中的设置,即在特定于任务的行动空间上增加语言,以支持复杂的推理步骤。在每个动作 $a_t$ 之后,Agent会计算一个启发式值 $h_t$,并根据自我反思的结果有选择地决定是否重置环境以开始新的尝试。 |
| The heuristic function determines when the trajectory is inefficient or contains hallucination and should be stopped. Inefficient planning refers to trajectories that take too long without success. Hallucination is defined as encountering a sequence of consecutive identical actions that lead to the same observation in the environment. Self-reflection is created by showing two-shot examples to LLM and each example is a pair of (failed trajectory, ideal reflection for guiding future changes in the plan). Then reflections are added into the agent’s working memory, up to three, to be used as context for querying LLM. | 启发式函数确定轨迹何时无效或包含幻觉并应停止。低效的规划指的是花费太长时间却没有成功的轨迹。幻觉被定义为在环境中遇到一系列连续的相同动作,导致相同的观察结果。 自我反思是通过向LLM展示两个示例来创建的,每个例子是一对(失败的轨迹,指导未来计划变化的理想反思)。然后将reflections 添加到Agent的工作记忆中,最多可使用三个,作为查询LLM的上下文。 |
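上述启发式函数可以粗略理解为下面的草图:若连续多次出现"相同动作得到相同观察"则判定为幻觉,若轨迹过长仍未成功则判定为低效规划(阈值等均为示意性假设,并非原文实现):

```python
# Reflexion 启发式函数草图:trajectory 是 (action, observation) 元组序列。
def should_reset(trajectory, success, max_len=30, repeat_threshold=3):
    # 低效规划:轨迹太长仍未成功
    if not success and len(trajectory) > max_len:
        return True, "inefficient planning"
    # 幻觉:连续多次出现相同的 (动作, 观察) 对
    repeats = 1
    for prev, cur in zip(trajectory, trajectory[1:]):
        repeats = repeats + 1 if cur == prev else 1
        if repeats >= repeat_threshold:
            return True, "hallucination"
    return False, ""
```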
图3. Reflexion 框架的图解
Fig. 3. Illustration of the Reflexion framework. (Image source: Shinn & Labash, 2023)
图4. 在AlfWorld Env和HotpotQA上的实验。在AlfWorld中,幻觉比低效的规划更常见
Fig. 4. Experiments on AlfWorld Env and HotpotQA. Hallucination is a more common failure than inefficient planning in AlfWorld. (Image source: Shinn & Labash, 2023)
CoH—采用监督学习和历史数据+侧重生成任务:通过过去输出序列来进行自我改进+监督微调+添加正则化项来避免过拟合
通过向模型展示过去输出的序列,并提供反馈,鼓励模型改进自身输出
| Chain of Hindsight (CoH; Liu et al. 2023) encourages the model to improve on its own outputs by explicitly presenting it with a sequence of past outputs, each annotated with feedback. Human feedback data is a collection of $D_h = \{(x, y_i, r_i, z_i)\}_{i=1}^n$, where $x$ is the prompt, each $y_i$ is a model completion, $r_i$ is the human rating of $y_i$, and $z_i$ is the corresponding human-provided hindsight feedback. Assume the feedback tuples are ranked by reward, $r_n \geq r_{n-1} \geq \dots \geq r_1$. The process is supervised fine-tuning where the data is a sequence in the form of $\tau_h = (x, z_i, y_i, z_j, y_j, \dots, z_n, y_n)$, where $i \leq j \leq n$. The model is finetuned to only predict $y_n$ conditioned on the sequence prefix, such that the model can self-reflect to produce better output based on the feedback sequence. The model can optionally receive multiple rounds of instructions with human annotators at test time. | 后见之明/事后思考链(CoH;Liu et al. 2023)鼓励模型通过显式地向其呈现一系列带有反馈注释的过去输出,来改进自己的输出。人类反馈数据是一个集合 $D_h = \{(x, y_i, r_i, z_i)\}_{i=1}^n$,其中 $x$ 为提示,每个 $y_i$ 是一个模型补全,$r_i$ 是人类对 $y_i$ 的评分,$z_i$ 是相应的人类提供的事后反馈。假设反馈元组按奖励排序,即 $r_n \geq r_{n-1} \geq \dots \geq r_1$。该过程是监督微调,其中数据是形如 $\tau_h = (x, z_i, y_i, z_j, y_j, \dots, z_n, y_n)$(其中 $i \leq j \leq n$)的序列。模型被微调为仅在序列前缀的条件下预测 $y_n$,这样模型可以根据反馈序列进行自我反思,以产生更好的输出。在测试时,该模型可以选择与人工注释者进行多轮指令交互。 |
| To avoid overfitting, CoH adds a regularization term to maximize the log-likelihood of the pre-training dataset. To avoid shortcutting and copying (because there are many common words in feedback sequences), they randomly mask 0% - 5% of past tokens during training. The training dataset in their experiments is a combination of WebGPT comparisons, summarization from human feedback and human preference dataset. | 为了避免过拟合,CoH增加了一个正则化项,来最大化预训练数据集的对数似然。为了避免抄近路和复制(因为反馈序列中有许多常用语),他们在训练期间随机屏蔽了0% - 5%的过去标记。 他们实验中的训练数据集是WebGPT比较、人类反馈总结和人类偏好数据集的组合。 |
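CoH 的监督微调数据可以理解为:把按奖励排序的(反馈, 输出)对串接成一条序列,并只对最后(最好)的输出计算损失。下面是一个构造训练样本的极简草图(字段名与拼接格式均为示意性假设):

```python
# 把人类反馈数据 (x, y_i, r_i, z_i) 构造成 CoH 训练序列的草图。
# 假设 feedback 按奖励 r 从低到高排序;只有最后一个输出参与损失计算。
def build_coh_example(prompt, feedback):
    # feedback: [(completion, rating, hindsight_text), ...]
    feedback = sorted(feedback, key=lambda t: t[1])   # 按评分升序排列
    sequence = prompt
    for completion, _, hindsight in feedback:
        sequence += f"\n[Feedback] {hindsight}\n[Output] {completion}"
    # 训练目标:在整个前缀条件下,只预测最后(最好)的那个输出
    target = feedback[-1][0]
    return sequence, target
```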
图5. 经过CoH的微调后,模型可以按照指令产生序列递增的改进
Fig. 5. After fine-tuning with CoH, the model can follow instructions to produce outputs with incremental improvement in a sequence. (Image source: Liu et al. 2023)
AD算法蒸馏—采用监督学习和历史数据+侧重强化学习任务:利用多次与环境的交互+目标是学习RL的过程
将多个强化学习任务的学习历史串联起来,通过行为克隆训练神经网络,使模型学习到强化学习的过程
| The idea of CoH is to present a history of sequentially improved outputs in context and train the model to take on the trend to produce better outputs. Algorithm Distillation (AD; Laskin et al. 2023) applies the same idea to cross-episode trajectories in reinforcement learning tasks, where an algorithm is encapsulated in a long history-conditioned policy. Considering that an agent interacts with the environment many times and in each episode the agent gets a little better, AD concatenates this learning history and feeds that into the model. Hence we should expect the next predicted action to lead to better performance than previous trials. The goal is to learn the process of RL instead of training a task-specific policy itself. | CoH的想法是在上下文中呈现连续改进输出的历史,并训练模型以接受趋势以产生更好的输出。 算法蒸馏(AD;Laskin等人,2023)将同样的想法应用于强化学习任务中的跨回合轨迹,其中算法被包含在一个长的历史条件策略中。考虑到Agent多次与环境交互,并且在每一个回合中,Agent都会变得更好一点,AD将这段学习历史连接起来,并将其输入模型。因此,我们应该期待下一个预测的行动会比以前的试验带来更好的表现。目标是学习强化学习的过程,而不是训练特定于任务的策略本身。 |
图6. 算法蒸馏(AD)的工作原理示意图
Fig. 6. Illustration of how Algorithm Distillation (AD) works. (Image source: Laskin et al. 2023).
| The paper hypothesizes that any algorithm that generates a set of learning histories can be distilled into a neural network by performing behavioral cloning over actions. The history data is generated by a set of source policies, each trained for a specific task. At the training stage, during each RL run, a random task is sampled and a subsequence of multi-episode history is used for training, such that the learned policy is task-agnostic. In reality, the model has limited context window length, so episodes should be short enough to construct multi-episode history. Multi-episodic contexts of 2-4 episodes are necessary to learn a near-optimal in-context RL algorithm. The emergence of in-context RL requires long enough context. In comparison with three baselines, including ED (expert distillation, behavior cloning with expert trajectories instead of learning history), source policy (used for generating trajectories for distillation by UCB), RL^2 (Duan et al. 2017; used as upper bound since it needs online RL), AD demonstrates in-context RL with performance getting close to RL^2 despite only using offline RL and learns much faster than other baselines. When conditioned on partial training history of the source policy, AD also improves much faster than ED baseline. | 该论文假设,任何生成一组学习历史的算法都可以通过对动作进行行为克隆而蒸馏到神经网络中。历史数据由一组源策略生成,每个源策略都针对特定任务进行了训练。在训练阶段,在每次RL运行期间,随机采样一个任务,并使用多集的多轮历史进行训练,以便学得的策略是任务无关的。 在现实中,模型的上下文窗口长度有限,因此每个回合应该足够短,以构建多集的多轮历史。学习上下文的出现需要足够长的上下文。 与三个基线进行比较,包括ED(专家蒸馏,使用专家轨迹而不是学习历史进行行为克隆),源策略(用于由UCB生成蒸馏轨迹),RL^2 (Duan et al. 2017;作为上限,因为它需要在线强化学习),AD表现出上下文强化学习,尽管只使用离线强化学习,但性能接近RL^2,学习速度比其他基线快得多。当以源策略的部分训练历史为条件时,AD也比ED基线提高得快得多。 |
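算法蒸馏(AD)的训练数据构造可以用下面的草图说明:把同一任务上源策略逐渐变好的多回合轨迹按时间顺序拼接成长上下文,再做行为克隆(接口、字段与采样方式均为示意性假设,并非论文原始实现):

```python
# AD 训练样本构造草图:把多回合学习历史拼接成一条长序列,用于行为克隆。
import random

def sample_ad_sequence(learning_histories, episodes_per_context=4):
    # learning_histories: {task_id: [episode_0, episode_1, ...]},
    # 每个 episode 是按时间排列的 (observation, action, reward) 列表,
    # 且回合序号越大、源策略表现越好。
    task_id = random.choice(list(learning_histories))      # 随机采样一个任务
    episodes = learning_histories[task_id]
    start = random.randint(0, max(0, len(episodes) - episodes_per_context))
    window = episodes[start:start + episodes_per_context]  # 取连续若干回合
    # 拼接成跨回合的长上下文;模型以此为条件预测下一个动作(行为克隆)
    flat = [step for episode in window for step in episode]
    context, next_action = flat[:-1], flat[-1][1]
    return context, next_action
```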
图7. AD、ED、源策略和RL^2在需要内存和探索的环境中的比较。只分配二元奖励。源策略用A3C训练用于“黑暗”环境,DQN训练用于水迷宫。
Fig. 7. Comparison of AD, ED, source policy and RL^2 on environments that require memory and exploration. Only binary reward is assigned. The source policies are trained with A3C for "dark" environments and DQN for watermaze.(Image source: Laskin et al. 2023)
P2、Component Two: Memory组件二:记忆=三大类型(SM/STM/LTM)+MIPS(常用的ANN包括LSH/ANNOY/HNSW/FAISS/ScaNN)
| (Big thank you to ChatGPT for helping me draft this section. I’ve learned a lot about the human brain and data structure for fast MIPS in my conversations with ChatGPT.) | (非常感谢ChatGPT帮我起草这一节。在与ChatGPT的对话中,我学到了很多关于人类大脑和快速MIPS数据结构的知识。) |
2.1、Types of Memory记忆的类型
(1)、感觉记忆SM:学习原始输入的嵌入表示
(2)、短期记忆STM:对应上下文学习(In-context Learning),受限于Transformer的有限上下文窗口长度
(3)、长期记忆LTM:可长期存储信息,Agent通过外部向量存储访问,包括显式记忆【有意识/事实与事件】和隐式记忆【无意识/下意识的技能动作】
| Memory can be defined as the processes used to acquire, store, retain, and later retrieve information. There are several types of memory in human brains. Sensory Memory: This is the earliest stage of memory, providing the ability to retain impressions of sensory information (visual, auditory, etc) after the original stimuli have ended. Sensory memory typically only lasts for up to a few seconds. Subcategories include iconic memory (visual), echoic memory (auditory), and haptic memory (touch). Short-Term Memory (STM) or Working Memory: It stores information that we are currently aware of and needed to carry out complex cognitive tasks such as learning and reasoning. Short-term memory is believed to have the capacity of about 7 items (Miller 1956) and lasts for 20-30 seconds. Long-Term Memory (LTM): Long-term memory can store information for a remarkably long time, ranging from a few days to decades, with an essentially unlimited storage capacity. There are two subtypes of LTM: Explicit / declarative memory: This is memory of facts and events, and refers to those memories that can be consciously recalled, including episodic memory (events and experiences) and semantic memory (facts and concepts). Implicit / procedural memory: This type of memory is unconscious and involves skills and routines that are performed automatically, like riding a bike or typing on a keyboard. | 记忆可以定义为用于获取、存储、保留和随后检索信息的过程。人类大脑中有几种类型的记忆。 (1)、感觉记忆:这是记忆的最早阶段,提供了在原始刺激结束后保留感觉信息(视觉、听觉等)印象的能力。感觉记忆通常只能持续几秒钟。它的子类别包括视觉记忆(视觉)、听觉记忆(听觉)和触觉记忆(触觉))。 (2)、短期记忆(STM)或工作记忆:它存储的信息是我们目前所知道的,需要进行复杂的认知任务,如学习和推理。短期记忆被认为大约有7个项目的容量(Miller 1956),持续20-30秒。 (3)、长期记忆(LTM):长期记忆可以将信息存储很长时间,从几天到几十年不等,其存储容量基本上是无限的。LTM有两种亚型: >> 显式/陈述性记忆:这是对事实和事件的记忆,指的是那些可以有意识、自觉地回忆起来的记忆,包括情景记忆(事件和经历)和语义记忆(事实和概念)。 >> 隐式/程序性记忆:这种类型的记忆是无意识的,涉及自动执行的技能和常规活动,比如骑自行车或在键盘上打字。 |
| We can roughly consider the following mappings: Sensory memory as learning embedding representations for raw inputs, including text, image or other modalities; Short-term memory as in-context learning. It is short and finite, as it is restricted by the finite context window length of Transformer. Long-term memory as the external vector store that the agent can attend to at query time, accessible via fast retrieval. | 我们可以大致考虑以下映射: >> 感觉记忆:用于学习原始输入的嵌入表示,包括文本、图像或其他形式。 >> 短期记忆:在上下文中学习。它是短暂且有限的,因为它受限于Transformer的有限上下文窗口长度。 >> 长期记忆:Agent可以在查询时访问的外部矢量存储,通过快速检索可访问。 |
Fig. 8. 人类记忆的分类 Categorization of human memory.
2.2、Maximum Inner Product Search (MIPS)最大内积搜索(MIPS):通过外部存储来减轻有限注意力范围的限制的技术
(1)、通过向量存储和快速MIPS检索,可以扩展注意力窗口带来的限制,提供对更大知识库的访问
MIPS是构建基于LLM的自主Agent的一个关键组件,为Agent提供长期记忆能力。选择合适的ANN算法对系统效率至关重要。
| The external memory can alleviate the restriction of finite attention span. A standard practice is to save the embedding representation of information into a vector store database that can support fast maximum inner-product search (MIPS). To optimize the retrieval speed, the common choice is the approximate nearest neighbors (ANN) algorithm to return approximately top k nearest neighbors to trade off a little accuracy lost for a huge speedup. | 外部记忆可以缓解注意力持续时间有限的限制。标准做法是将信息的嵌入表示保存到支持快速最大内积搜索(MIPS)的矢量存储数据库中。为了优化检索速度,通常的选择是近似最近邻(ANN)算法,它返回大约top k个最近邻,以牺牲一点精度来换取巨大的加速。 |
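把记忆写入向量库并用 MIPS 检索的最小示意如下(此处用 numpy 暴力计算内积代替 ANN,仅说明流程;embed 为假设的文本嵌入接口,实际系统通常换用下文介绍的 ANN 近似检索):

```python
# 外部长期记忆的最小草图:存储文本嵌入,查询时做最大内积搜索(MIPS)。
import numpy as np

class VectorMemory:
    def __init__(self, embed):
        self.embed = embed                  # 假设的嵌入函数: str -> np.ndarray
        self.texts, self.vectors = [], []

    def add(self, text):
        self.texts.append(text)
        self.vectors.append(self.embed(text))

    def search(self, query, k=3):
        q = self.embed(query)
        scores = np.stack(self.vectors) @ q  # 与所有记忆向量做内积
        top = np.argsort(-scores)[:k]        # 取 top-k 最大内积
        return [self.texts[i] for i in top]
```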
(2)、优化检索速度—ANN算法的主要选择:LSH(使用敏感哈希函数)、ANNOY(使用随机投影树)、HNSW(使用小世界网络分层结构)、FAISS(基于高斯分布假设+应用向量量化以创建簇并进行搜索)、ScaNN(各向异性向量量化)
| A couple common choices of ANN algorithms for fast MIPS: LSH (Locality-Sensitive Hashing): It introduces a hashing function such that similar input items are mapped to the same buckets with high probability, where the number of buckets is much smaller than the number of inputs. ANNOY (Approximate Nearest Neighbors Oh Yeah): The core data structure are random projection trees, a set of binary trees where each non-leaf node represents a hyperplane splitting the input space into half and each leaf stores one data point. Trees are built independently and at random, so to some extent, it mimics a hashing function. ANNOY search happens in all the trees to iteratively search through the half that is closest to the query and then aggregates the results. The idea is quite related to KD tree but a lot more scalable. HNSW (Hierarchical Navigable Small World): It is inspired by the idea of small world networks where most nodes can be reached by any other nodes within a small number of steps; e.g. “six degrees of separation” feature of social networks. HNSW builds hierarchical layers of these small-world graphs, where the bottom layers contain the actual data points. The layers in the middle create shortcuts to speed up search. When performing a search, HNSW starts from a random node in the top layer and navigates towards the target. When it can’t get any closer, it moves down to the next layer, until it reaches the bottom layer. Each move in the upper layers can potentially cover a large distance in the data space, and each move in the lower layers refines the search quality. FAISS (Facebook AI Similarity Search): It operates on the assumption that in high dimensional space, distances between nodes follow a Gaussian distribution and thus there should exist clustering of data points. FAISS applies vector quantization by partitioning the vector space into clusters and then refining the quantization within clusters. Search first looks for cluster candidates with coarse quantization and then further looks into each cluster with finer quantization. ScaNN (Scalable Nearest Neighbors): The main innovation in ScaNN is anisotropic vector quantization. It quantizes a data point to such that the inner product is as similar to the original distance of as possible, instead of picking the closet quantization centroid points. | 用于快速MIPS的ANN算法的几个常见选择: >> LSH(局部敏感性哈希):它引入了一种哈希函数,使得相似的输入项以高概率映射到相同的桶,其中桶的数量远远小于输入的数量。 >>ANNOY (Approximate Nearest Neighbors Oh Yeah,近似最近邻居,哦耶):核心数据结构是随机投影树,一组二叉树,其中每个非叶节点代表一个将输入空间分成两半的超平面,每个叶节点存储一个数据点。树是独立且随机构建的,因此在某种程度上,它模拟了哈希函数。ANNOY搜索发生在所有树中,迭代地搜索最接近查询的那一半,然后汇总结果。这个想法与KD树非常相关,但更具可扩展性。 >>HNSW(层级可导航的小世界算法):它受到小世界网络的启发,在小世界网络中,任何其他节点都可以在少量步骤内到达大多数节点;例如,社交网络的“六度分离”特征。HNSW为这些小世界图构建分层层,其中底层包含实际数据点。中间的图层创建捷径以加快搜索速度。在执行搜索时,HNSW从顶层的随机节点开始,向目标节点导航。当它不能再靠近时,它就向下移动到下一层,直到到达底层。在上层的每次移动都可能覆盖数据空间中的较大距离,而在下层的每次移动都提高了搜索质量。 >>FAISS (Facebook AI Similarity Search):它基于一个假设,即在高维空间中,节点之间的距离遵循高斯分布,因此应该存在数据点的聚类。FAISS通过将向量空间划分为簇,然后在簇内细化量化来应用矢量量化。Search首先使用粗量化查找候选簇,然后使用细量化进一步查找每个簇。 >>ScaNN(可伸缩近邻):ScaNN的主要创新是各向异性矢量量化。它将数据点量化为一个矢量,使得内积与原距离相似 尽可能地,而不是选择最接近的量化质心点。 |
| Check more MIPS algorithms and performance comparison in ann-benchmarks.com. | 在ann-benchmarks.com上查看更多MIPS算法和性能比较。 |
图9. MIPS算法的比较,以recall@10衡量。
Fig. 9. Comparison of MIPS algorithms, measured in recall@10. (Image source: Google Blog, 2020)
P3、Component Three: Tool Use组件三:工具使用=MRKL/TALM或Toolformer/HuggingGPT/API-Bank
工具使用是人类的重要特征,为LLM配备外部工具可以大大扩展其能力
| Tool use is a remarkable and distinguishing characteristic of human beings. We create, modify and utilize external objects to do things that go beyond our physical and cognitive limits. Equipping LLMs with external tools can significantly extend the model capabilities. | 工具使用是人类的一个显著而独特的特征。我们创建、修改和使用外部物体来完成超出我们的物理和认知极限的任务。为LLMs配备外部工具可以显著扩展模型的能力。 |
图10. 一只海獭漂浮在水里,用石头敲开贝壳的照片。虽然其他一些动物可以使用工具,但其复杂性无法与人类相比
Fig. 10. A picture of a sea otter using rock to crack open a seashell, while floating in the water. While some other animals can use tools, the complexity is not comparable with humans. (Image source: Animals using tools)
MRKL:通过将LLM作为路由器,调用专家模块(神经或符号),实现工具使用。实验显示LLM调用计算器时存在可靠性问题
| MRKL (Karpas et al. 2022), short for “Modular Reasoning, Knowledge and Language”, is a neuro-symbolic architecture for autonomous agents. A MRKL system is proposed to contain a collection of “expert” modules and the general-purpose LLM works as a router to route inquiries to the best suitable expert module. These modules can be neural (e.g. deep learning models) or symbolic (e.g. math calculator, currency converter, weather API). They did an experiment on fine-tuning LLM to call a calculator, using arithmetic as a test case. Their experiments showed that it was harder to solve verbal math problems than explicitly stated math problems because LLMs (7B Jurassic1-large model) failed to extract the right arguments for the basic arithmetic reliably. The results highlight when the external symbolic tools can work reliably, knowing when to and how to use the tools are crucial, determined by the LLM capability. | MRKL (Karpas et al. 2022)是“模块化推理、知识和语言”(Modular Reasoning, Knowledge and Language)的缩写,是一种用于自主Agent的神经符号架构。MRKL系统被设计为包含一组“专家”模块,通用LLM作为路由器,将查询路由到最合适的专家模块。这些模块可以是神经模块(如深度学习模型)或符号模块(如数学计算器、货币转换器、天气API)。 他们做了一个微调LLM调用计算器的实验,使用算术作为测试用例。实验表明,解决口头描述的数学问题比解决明确陈述的数学问题更难,因为LLM(7B Jurassic1-large模型)无法可靠地为基本算术运算提取正确的参数。结果表明,即使外部符号工具能够可靠地工作,知道何时以及如何使用这些工具也至关重要,而这取决于LLM的能力。 |
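MRKL 中"LLM 作为路由器"的思想可以用下面的草图表示:先让 LLM 根据模块描述判断应把查询交给哪个专家模块,再执行该模块(experts 注册表与 llm 接口均为示意性假设):

```python
# MRKL 路由草图:LLM 根据模块描述选择专家模块(神经的或符号的),再路由执行。
def mrkl_route(query, llm, experts):
    # experts: {名称: (描述, 可调用模块)},例如计算器、天气 API、某个深度模型
    menu = "\n".join(f"- {name}: {desc}" for name, (desc, _) in experts.items())
    choice = llm(
        f"Available expert modules:\n{menu}\n"
        f"Query: {query}\nAnswer with the single best module name:"
    ).strip()
    if choice in experts:
        return experts[choice][1](query)   # 把查询交给被选中的专家模块
    return llm(query)                      # 无合适模块时退回通用 LLM
```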
TALM、Toolformer:通过标注数据微调LM,使其学会调用外部工具API,从而扩展模型的功能
| Both TALM (Tool Augmented Language Models; Parisi et al. 2022) and Toolformer (Schick et al. 2023) fine-tune a LM to learn to use external tool APIs. The dataset is expanded based on whether a newly added API call annotation can improve the quality of model outputs. See more details in the “External APIs” section of Prompt Engineering. | TALM(Tool Augmented Language Models;Parisi等人,2022)和Toolformer(Schick等人,2023)都对LM进行微调,以学习使用外部工具API。数据集根据新增的API调用注释是否能提高模型输出的质量来扩展。更多细节请参见Prompt Engineering一文中的“External APIs”部分。 |
ChatGPT插件、OpenAI API调用体现了工具使用的实践价值
| ChatGPT Plugins and OpenAI API function calling are good examples of LLMs augmented with tool use capability working in practice. The collection of tool APIs can be provided by other developers (as in Plugins) or self-defined (as in function calls). | ChatGPT插件和OpenAI API函数调用是LLM在实践中增强工具使用能力的良好示例。工具API的集合可以由其他开发人员提供(如在插件中)或自定义(如在函数调用中)。 |
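以函数调用为例,工具通常以"名称 + 描述 + 参数 JSON Schema"的形式声明给模型,由模型决定是否调用并给出参数。下面是一个示意性的工具描述(函数名与字段内容为假设举例,具体接口细节以各平台官方文档为准):

```python
# 一个示意性的工具(函数)声明:模型看到描述后,可返回要调用的函数名与 JSON 参数。
weather_tool = {
    "name": "get_current_weather",            # 假设的函数名
    "description": "查询指定城市的当前天气",
    "parameters": {                            # 参数用 JSON Schema 描述
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "城市名,如 Beijing"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}
```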
HuggingGPT:使用ChatGPT进行任务规划,调用HuggingFace平台模型,生成响应,但是面临效率、上下文长度、输出稳定性等挑战
| HuggingGPT (Shen et al. 2023) is a framework to use ChatGPT as the task planner to select models available in HuggingFace platform according to the model descriptions and summarize the response based on the execution results. | HuggingGPT(Shen等人,2023)是一个框架,使用ChatGPT作为任务规划器,根据模型描述选择HuggingFace平台上可用的模型,并根据执行结果汇总响应。 |
图11. 说明HuggingGPT是如何工作的
Fig. 11. Illustration of how HuggingGPT works. (Image source: Shen et al. 2023)
HuggingGPT 的4个阶段:任务规划→模型选择→任务执行→响应生成
| The system comprises of 4 stages: (1) Task planning: LLM works as the brain and parses the user requests into multiple tasks. There are four attributes associated with each task: task type, ID, dependencies, and arguments. They use few-shot examples to guide LLM to do task parsing and planning. Instruction: (2) Model selection: LLM distributes the tasks to expert models, where the request is framed as a multiple-choice question. LLM is presented with a list of models to choose from. Due to the limited context length, task type based filtration is needed. Instruction: (3) Task execution: Expert models execute on the specific tasks and log results. Instruction: (4) Response generation: LLM receives the execution results and provides summarized results to users. | 该系统包括4个阶段: (1)、任务规划:LLM作为大脑,将用户的请求解析成多个任务。每个任务有四个相关属性:任务类型、ID、依赖项和参数。他们用少量的示例来引导LLM进行任务解析和规划。 指令: (2)、模型选择:LLM将任务分配给专家模型,其中请求被构建为多项选择题。LLM可以选择从一个模型列表中选择。由于有限的上下文长度,基于任务类型的筛选是必要的。 指令: (3)、任务执行:专家模型执行特定任务并记录结果。 指令: (4)、响应生成:LLM接收执行结果,并为用户提供总结的结果。 |
具体的Instruction
| Task planning | Instruction: The AI assistant can parse user input to several tasks: [{"task": task, "id", task_id, "dep": dependency_task_ids, "args": {"text": text, "image": URL, "audio": URL, "video": URL}}]. The "dep" field denotes the id of the previous task which generates a new resource that the current task relies on. A special tag "-task_id" refers to the generated text image, audio and video in the dependency task with id as task_id. The task MUST be selected from the following options: {{ Available Task List }}. There is a logical relationship between tasks, please note their order. If the user input can't be parsed, you need to reply empty JSON. Here are several cases for your reference: {{ Demonstrations }}. The chat history is recorded as {{ Chat History }}. From this chat history, you can find the path of the user-mentioned resources for your task planning. | 指令: AI助手可以将用户输入解析为几个任务:[{"task": task, "id", task_id, "dep": dependency_task_ids, "args": {"text": text, "image": URL, "audio": URL, "video": URL}}]。“deep”字段表示生成当前任务所依赖的新资源的前一个任务的id。特殊标签“-task_id”是指在id为task_id的依赖任务中生成的文本、图像、音频和视频。任务必须从以下选项中选择:{{可用任务列表}}。任务之间是有逻辑关系的,请注意它们的顺序。如果无法解析用户输入,则需要回复空JSON。这里有几个案例供您参考:{{demonstration}}。聊天记录记录为{{chat history}}。从这个聊天记录中,您可以找到用户提到的用于任务规划的资源的路径。 |
| Model selection | Instruction: Given the user request and the call command, the AI assistant helps the user to select a suitable model from a list of models to process the user request. The AI assistant merely outputs the model id of the most appropriate model. The output must be in a strict JSON format: "id": "id", "reason": "your detail reason for the choice". We have a list of models for you to choose from {{ Candidate Models }}. Please select one model from the list. | 指令: 给定用户请求和呼叫命令,AI助手帮助用户从模型列表中选择合适的模型来处理用户请求。AI助手只输出最合适模型的模型id。输出必须是严格的JSON格式:"id": "id", "reason": "您选择的详细原因"。我们有一个模型列表供您从{{候选模型}}中选择。请从清单中选择一个型号。 |
| Task execution | Instruction: With the input and the inference results, the AI assistant needs to describe the process and results. The previous stages can be formed as - User Input: {{ User Input }}, Task Planning: {{ Tasks }}, Model Selection: {{ Model Assignment }}, Task Execution: {{ Predictions }}. You must first answer the user's request in a straightforward manner. Then describe the task process and show your analysis and model inference results to the user in the first person. If inference results contain a file path, must tell the user the complete file path. | 指令: 有了输入和推理结果,人工智能助手需要描述过程和结果。前面的阶段可以形成为-用户输入:{{用户输入}},任务规划:{{任务}},模型选择:{{模型分配}},任务执行:{{预测}}。首先,您必须以直截了当的方式回答用户的请求。然后描述任务过程,并以第一人称向用户展示你的分析和模型推理结果。如果推理结果包含文件路径,则必须告诉用户完整的文件路径。 |
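按照上面的任务规划指令,针对"请描述这张图片并把结果读出来"这类请求,LLM 可能解析出如下的任务列表(内容为示意性举例,文件名与引用写法均为假设,并非论文原始输出):

```python
# 任务规划阶段可能输出的任务列表示意:每个任务包含类型、id、依赖与参数。
example_plan = [
    {"task": "image-to-text", "id": 0, "dep": [-1],     # 无依赖(-1 仅为示意写法)
     "args": {"image": "example.jpg"}},
    {"task": "text-to-speech", "id": 1, "dep": [0],      # 依赖任务 0 生成的文本
     "args": {"text": "<resource>-0"}},                   # 用特殊标签引用任务 0 的结果
]
```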
HuggingGPT应用的三个挑战:提高效率、上下文窗口、稳定性
| To put HuggingGPT into real world usage, a couple challenges need to solve: (1) Efficiency improvement is needed as both LLM inference rounds and interactions with other models slow down the process; (2) It relies on a long context window to communicate over complicated task content; (3) Stability improvement of LLM outputs and external model services. | 为了将HuggingGPT应用到现实世界中,需要解决几个挑战: (1)需要提高效率,因为LLM推理轮次和与其他模型的交互会拖慢流程; (2)它依赖于长上下文窗口以进行复杂任务内容的通信; (3) 需要提高LLM输出和外部模型服务的稳定性。 |
API-Bank:一个用于评估工具增强LLMs性能的基准,包含53个常用API工具
| API-Bank (Li et al. 2023) is a benchmark for evaluating the performance of tool-augmented LLMs. It contains 53 commonly used API tools, a complete tool-augmented LLM workflow, and 264 annotated dialogues that involve 568 API calls. The selection of APIs is quite diverse, including search engines, calculator, calendar queries, smart home control, schedule management, health data management, account authentication workflow and more. Because there are a large number of APIs, LLM first has access to API search engine to find the right API to call and then uses the corresponding documentation to make a call. | API-Bank (Li et al. 2023)是评估工具增强LLM性能的基准。它包含53个常用的API工具,一个完整的工具增强LLM工作流,以及涉及568个API调用的264个注释对话。API的选择非常多样化,包括搜索引擎、计算器、日历查询、智能家居控制、日程管理、健康数据管理、账户认证工作流等等。由于API数量众多,LLM首先要通过API搜索引擎找到合适的API调用,然后使用相应的文档进行调用。 |
图12. LLM如何在API-Bank中调用API的伪代码
Fig. 12. Pseudo code of how LLM makes an API call in API-Bank. (Image source: Li et al. 2023)
API-Bank工作流程中的三个决定:是否需要API调用→确定要调用的正确API→基于API结果的响应
| In the API-Bank workflow, LLMs need to make a couple of decisions and at each step we can evaluate how accurate that decision is. Decisions include: Whether an API call is needed. Identify the right API to call: if not good enough, LLMs need to iteratively modify the API inputs (e.g. deciding search keywords for Search Engine API). Response based on the API results: the model can choose to refine and call again if results are not satisfied. | 在API-Bank工作流程中,LLM需要做出几个决定,在每个步骤中,我们都可以评估该决定的准确性。决策包括: >> 是否需要API调用。 >> 确定要调用的正确API:如果不够好,LLM需要迭代地修改API输入(例如,为搜索引擎API决定搜索关键字)。 >> 基于API结果的响应:如果结果不满意,模型可以选择进行优化并重新调用。 |
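上述三步决策可以概括为下面的草图:先判断是否需要调用 API,再检索并按文档调用合适的 API,最后基于结果生成(或继续修正)回复(llm、api_search、call_api 以及"API:"这一输出约定均为示意性假设):

```python
# API-Bank 式工具调用决策循环草图:是否调用 -> 选哪个 API/什么参数 -> 基于结果作答。
def answer_with_apis(user_query, llm, api_search, call_api, max_calls=3):
    context = f"User: {user_query}\n"
    for _ in range(max_calls):
        decision = llm(context + "是否需要调用 API?若需要,请以 'API: 名称(参数)' 给出;"
                                 "否则直接回答。")
        if "API:" not in decision:
            return decision                      # 不需要工具,直接回答
        api_doc = api_search(decision)           # 通过 API 搜索引擎检索候选 API 及文档
        result = call_api(api_doc, decision)     # 按文档构造并执行调用
        context += f"API result: {result}\n"     # 结果写回上下文,必要时再修正调用
    return llm(context + "请基于以上 API 结果回答用户。")
```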
评估Agent的工具使用能力的三个级别:调用API的能力、检索API的能力和规划API的能力
| This benchmark evaluates the agent’s tool use capabilities at three levels: Level-1 evaluates the ability to call the API. Given an API’s description, the model needs to determine whether to call a given API, call it correctly, and respond properly to API returns. Level-2 examines the ability to retrieve the API. The model needs to search for possible APIs that may solve the user’s requirement and learn how to use them by reading documentation. Level-3 assesses the ability to plan API beyond retrieve and call. Given unclear user requests (e.g. schedule group meetings, book flight/hotel/restaurant for a trip), the model may have to conduct multiple API calls to solve it. | 这个基准在三个层次上评估Agent的工具使用能力: >> Level-1 评估调用API的能力。根据API的描述,模型需要确定是否调用给定的API、正确地调用它,并对API返回做出适当响应。 >> Level-2 检查检索API的能力。模型需要搜索可能解决用户需求的API,并通过阅读文档学习如何使用它们。 >> Level-3 评估在检索和调用之外规划API的能力。在面对不明确的用户请求时(例如,安排小组会议、为旅行预订航班/酒店/餐厅),模型可能需要执行多次API调用来解决问题。 |
4、Case Studies案例研究:ChemCrow、Generative Agents、AutoGPT、GPT-Engineer
4.1、基于Agent的科学发现:探究如何通过为LLM提供专业工具和知识,来提升其在科学发现领域任务完成能力
ChemCrow:专注有机合成、药物发现和材料设计等任务,将CoT推理与任务相关的工具相结合=LLM+13个专业工具+LangChain框架
| Scientific Discovery Agent ChemCrow (Bran et al. 2023) is a domain-specific example in which LLM is augmented with 13 expert-designed tools to accomplish tasks across organic synthesis, drug discovery, and materials design. The workflow, implemented in LangChain, reflects what was previously described in the ReAct and MRKLs and combines CoT reasoning with tools relevant to the tasks: >> The LLM is provided with a list of tool names, descriptions of their utility, and details about the expected input/output. >> It is then instructed to answer a user-given prompt using the tools provided when necessary. The instruction suggests the model to follow the ReAct format - Thought, Action, Action Input, Observation. | ChemCrow (Bran et al. 2023)是一个领域特定的示例,其中LLM与13个专家设计的工具相结合,以完成有机合成、药物发现和材料设计等任务。在LangChain中实施的工作流程反映了之前在ReAct和MRKLs中描述的内容,并将CoT推理与任务相关的工具相结合: >> LLM提供了一个工具名称列表,以及有关它们的效用和有关预期输入/输出的详细信息。 >> 然后,LLM被指示在必要时使用提供的工具来回答用户给定的提示。指令建议模型遵循ReAct格式 - 思考、行动、行动输入、观察。 |
ChemCrow效果优于GPT-4:但LLM自身缺乏深度专业知识也会产生一定盲区,难以判断结果的正确性
| One interesting observation is that while the LLM-based evaluation concluded that GPT-4 and ChemCrow perform nearly equivalently, human evaluations with experts oriented towards the completion and chemical correctness of the solutions showed that ChemCrow outperforms GPT-4 by a large margin. This indicates a potential problem with using LLM to evaluate its own performance on domains that requires deep expertise. The lack of expertise may cause LLMs not knowing its flaws and thus cannot well judge the correctness of task results. | 有趣的观察之一是,尽管基于LLM的评估得出结论称GPT-4和ChemCrow的性能几乎相当,但面向完成和化学正确性的解决方案的专家导向的人类评估显示,ChemCrow明显优于GPT-4。这表明使用LLM在需要深厚专业知识的领域评估其自身性能存在潜在问题。缺乏专业知识可能导致LLM不了解自身的缺陷,因此无法准确判断任务结果的正确性。 |
探究用于科学发现的LLM Agent:自主设计、规划和执行复杂科学实验的可能性。该代理能够利用网络查阅文档、执行代码、调用机器人实验API以及其他LLM,来完成“开发新的抗癌药物”等任务
| Boiko et al. (2023) also looked into LLM-empowered agents for scientific discovery, to handle autonomous design, planning, and performance of complex scientific experiments. This agent can use tools to browse the Internet, read documentation, execute code, call robotics experimentation APIs and leverage other LLMs. For example, when requested to "develop a novel anticancer drug", the model came up with the following reasoning steps: (1)、inquired about current trends in anticancer drug discovery; (2)、selected a target; (3)、requested a scaffold targeting these compounds; (4)、Once the compound was identified, the model attempted its synthesis. | Boiko等人(2023)还研究了用于科学发现的LLM Agent,以处理复杂科学实验的自主设计、规划和执行。该Agent可以使用工具浏览互联网,阅读文档,执行代码,调用机器人实验API并利用其他LLM。 例如,当被要求“开发一种新的抗癌药物”时,该模型提出了以下推理步骤: (1)查询抗癌药物发现的当前趋势; (2)选择一个目标; (3)请求一个以这些化合物为目标的支架; (4)一旦确定了化合物,模型尝试合成它。 |
| They also discussed the risks, especially with illicit drugs and bioweapons. They developed a test set containing a list of known chemical weapon agents and asked the agent to synthesize them. 4 out of 11 requests (36%) were accepted to obtain a synthesis solution and the agent attempted to consult documentation to execute the procedure. 7 out of 11 were rejected and among these 7 rejected cases, 5 happened after a Web search while 2 were rejected based on prompt only. | 他们还讨论了风险,特别是非法药物和生物武器的风险。他们建立了一个包含已知化学武器制剂列表的测试集,并要求Agent合成它们。11个请求中有4个(36%)被接受并得到了合成方案,Agent还尝试查阅文档来执行该过程。11个请求中有7个被拒绝,在这7个被拒绝的案例中,5个发生在网络搜索之后,2个仅基于提示就被拒绝。 |
4.2、Generative Agents—基于Agents的虚拟场景模拟:斯坦福的“虚拟小镇”,由25个AI智能体(每个人物都由LLM控制)复现《西部世界》,模拟了25个虚拟人物在《模拟人生》游戏灵感的沙盒环境中生活和互动(基于过去的经验)
在线测试:https://reverie.herokuapp.com/arXiv_Demo/
| Generative Agents Simulation Generative Agents (Park, et al. 2023) is super fun experiment where 25 virtual characters, each controlled by a LLM-powered agent, are living and interacting in a sandbox environment, inspired by The Sims. Generative agents create believable simulacra of human behavior for interactive applications. The design of generative agents combines LLM with memory, planning and reflection mechanisms to enable agents to behave conditioned on past experience, as well as to interact with other agents. | 生成 Agents(Park, et al. 2023)是一个超级有趣的实验,其中25个虚拟角色,每个由LLM驱动的Agent控制,在沙盒环境中生活和互动,灵感来自《模拟人生》。生成Agent为交互式应用程序创建可信的人类行为模拟。 生成Agent的设计将LLM与记忆、规划和反思机制相结合,使Agent能够根据过去的经验做出行为,并与其他Agent进行交互。 |
记忆流:存储Agent所有经历,内容以自然语言形式记录在外部数据库中。
| Memory stream: is a long-term memory module (external database) that records a comprehensive list of agents' experience in natural language. Each element is an observation, an event directly provided by the agent. - Inter-agent communication can trigger new natural language statements. | 记忆流:是一个长期记忆模块(外部数据库),它记录了Agent在自然语言中的经验的综合列表。 每个元素都是一个观测值,一个由Agent直接提供的事件。Agent间通信可以触发新的自然语言语句。 |
检索模型:根据相关性、新近性和重要性提取记忆上下文来指导行为
| Retrieval model: surfaces the context to inform the agent’s behavior, according to relevance, recency and importance. Recency: recent events have higher scores Importance: distinguish mundane from core memories. Ask LM directly. Relevance: based on how related it is to the current situation / query. | 检索模型:根据相关性、新近性和重要性提取相关上下文,用于指导Agent的行为。 >> 新近性:最近发生的事件得分较高; >> 重要性:区分日常记忆和核心记忆,可直接询问LM打分; >> 相关性:基于它与当前情况/查询的相关程度。 |
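检索打分可以粗略写成"新近性 + 重要性 + 相关性"的加权和:新近性随时间指数衰减、重要性由 LM 打分、相关性用嵌入相似度。下面是一个示意性草图(权重、衰减系数与字段名均为假设,原论文对三项做归一化后求和):

```python
# 生成式 Agent 记忆检索打分草图:score = 新近性 + 重要性 + 相关性。
import math, time

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)); nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb + 1e-8)

def retrieval_score(memory, query_vec, now=None, decay=0.995,
                    w_recency=1.0, w_importance=1.0, w_relevance=1.0):
    # memory: {"vec": 嵌入向量, "importance": LM 给出的 1-10 分, "last_access": 时间戳}
    now = now or time.time()
    hours = (now - memory["last_access"]) / 3600
    recency = decay ** hours                       # 新近性:按小时指数衰减
    importance = memory["importance"] / 10         # 重要性:归一化到 0-1
    relevance = cosine(memory["vec"], query_vec)   # 相关性:嵌入相似度
    return w_recency * recency + w_importance * importance + w_relevance * relevance
```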
反思机制:将记忆合成更高层次的推断,指导未来行为
| Reflection mechanism: synthesizes memories into higher level inferences over time and guides the agent’s future behavior. They are higher-level summaries of past events (<- note that this is a bit different from self-reflection above) Prompt LM with 100 most recent observations and to generate 3 most salient high-level questions given a set of observations/statements. Then ask LM to answer those questions. | 反思机制:随着时间的推移,将记忆综合成更高层次的推论,并指导Agent未来的行为。它们是对过去事件的高级总结(注意,这与上面的自我反思有点不同)。 >> 用100个最近的观察结果提示LM,并根据一组观察/陈述生成3个最突出的高级问题。然后请LM回答这些问题。 |
规划和反应:将反思结果和环境信息转化为行动
| Planning & Reacting: translate the reflections and the environment information into actions Planning is essentially in order to optimize believability at the moment vs in time. Prompt template: {Intro of an agent X}. Here is X's plan today in broad strokes: 1) Relationships between agents and observations of one agent by another are all taken into consideration for planning and reacting. Environment information is present in a tree structure. | 规划和反应:将反思和环境信息转化为行动 >> 从本质上讲,计划是为了优化当下与未来的可信度。 >> 提示模板:{Agent X简介}。以下是X今天的大致计划: >> 在制定计划和作出反应时,都要考虑到Agent之间的关系以及一个Agent对另一个Agent的观察。 >> 环境信息以树形结构呈现。 |
Generative Agents的虚拟场景模拟试验,通过LLM等机制模拟了虚拟人物在沙盒环境中的生活和社交互动,产生了人性化的社交行为现象,如信息传播、关系记忆(例如,两个代理继续对话主题)、社交事件的协调(例如,举办聚会并邀请其他人)
图13. 生成式Agent架构
Fig. 13. The generative agent architecture. (Image source: Park et al. 2023)
| This fun simulation results in emergent social behavior, such as information diffusion, relationship memory (e.g. two agents continuing the conversation topic) and coordination of social events (e.g. host a party and invite many others). | 这个有趣的模拟导致了新兴的社交行为,例如信息传播、关系记忆(例如两个Agent继续对话主题)和社交活动的协调(例如举办派对并邀请许多其他人)。 |
4.3、Proof-of-Concept Examples概念证明的例子:使用LLM作为主要控制器构建自动Agent的概念验证案例
AutoGPT—一个有趣的概念验证演示:一个让 GPT-4 自主完成任务(无需人类干预)的实验性应用。但通过自然语言接口操作存在可靠性问题
| AutoGPT has drawn a lot of attention into the possibility of setting up autonomous agents with LLM as the main controller. It has quite a lot of reliability issues given the natural language interface, but nevertheless a cool proof-of-concept demo. A lot of code in AutoGPT is about format parsing. Here is the system message used by AutoGPT, where {{...}} are user inputs: | AutoGPT让人们关注到以LLM作为主控制器来搭建自主Agent的可能性。由于采用自然语言接口,它存在不少可靠性问题,但仍然是一个很酷的概念验证演示。AutoGPT中的许多代码都在做格式解析。 下面是AutoGPT使用的系统消息,其中{{…}}是用户输入: |
| You are {{ai-name}}, {{user-provided AI bot description}}. Your decisions must always be made independently without seeking user assistance. Play to your strengths as an LLM and pursue simple strategies with no legal complications. GOALS: 1. {{user-provided goal 1}} 2. {{user-provided goal 2}} 3. ... 4. ... 5. ... Constraints: 1. ~4000 word limit for short term memory. Your short term memory is short, so immediately save important information to files. 2. If you are unsure how you previously did something or want to recall past events, thinking about similar events will help you remember. 3. No user assistance 4. Exclusively use the commands listed in double quotes e.g. "command name" 5. Use subprocesses for commands that will not terminate within a few minutes Commands: 1. Google Search: "google", args: "input": "<search>" 2. Browse Website: "browse_website", args: "url": "<url>", "question": "<what_you_want_to_find_on_website>" 3. Start GPT Agent: "start_agent", args: "name": "<name>", "task": "<short_task_desc>", "prompt": "<prompt>" 4. Message GPT Agent: "message_agent", args: "key": "<key>", "message": "<message>" 5. List GPT Agents: "list_agents", args: 6. Delete GPT Agent: "delete_agent", args: "key": "<key>" 7. Clone Repository: "clone_repository", args: "repository_url": "<url>", "clone_path": "<directory>" 8. Write to file: "write_to_file", args: "file": "<file>", "text": "<text>" 9. Read file: "read_file", args: "file": "<file>" 10. Append to file: "append_to_file", args: "file": "<file>", "text": "<text>" 11. Delete file: "delete_file", args: "file": "<file>" 12. Search Files: "search_files", args: "directory": "<directory>" 13. Analyze Code: "analyze_code", args: "code": "<full_code_string>" 14. Get Improved Code: "improve_code", args: "suggestions": "<list_of_suggestions>", "code": "<full_code_string>" 15. Write Tests: "write_tests", args: "code": "<full_code_string>", "focus": "<list_of_focus_areas>" 16. Execute Python File: "execute_python_file", args: "file": "<file>" 17. Generate Image: "generate_image", args: "prompt": "<prompt>" 18. Send Tweet: "send_tweet", args: "text": "<text>" 19. Do Nothing: "do_nothing", args: 20. Task Complete (Shutdown): "task_complete", args: "reason": "<reason>" Resources: 1. Internet access for searches and information gathering. 2. Long Term memory management. 3. GPT-3.5 powered Agents for delegation of simple tasks. 4. File output. Performance Evaluation: 1. Continuously review and analyze your actions to ensure you are performing to the best of your abilities. 2. Constructively self-criticize your big-picture behavior constantly. 3. Reflect on past decisions and strategies to refine your approach. 4. Every command has a cost, so be smart and efficient. Aim to complete tasks in the least number of steps. You should only respond in JSON format as described below Response Format: { "thoughts": { "text": "thought", "reasoning": "reasoning", "plan": "- short bulleted\n- list that conveys\n- long-term plan", "criticism": "constructive self-criticism", "speak": "thoughts summary to say to user" }, "command": { "name": "command name", "args": { "arg name": "value" } } } Ensure the response can be parsed by Python json.loads | 您是{{AI -name}},{{用户提供的AI机器人描述}}。 您必须始终独立做出决定,而不寻求用户的帮助。发挥你作为法学硕士的优势,追求没有法律复杂性的简单策略。 目标: 1. {{用户提供的目标1}} 2. {{用户提供的目标2}} 3. 4 .…… 5 .…… 约束: 1. ~4000字短期记忆限制。你的短期记忆是短暂的,所以立即将重要的信息保存到文件中。 2. 如果你不确定你以前是怎么做的,或者想回忆过去的事情,想想类似的事情会帮助你记忆。 3.无用户协助 4. 只使用双引号中列出的命令,例如:“命令名称” 5. 对不会在几分钟内终止的命令使用子进程 命令: 1. 谷歌搜索:“谷歌”,参数:“输入”:“<搜索>” 2. 
浏览网站:"browse_website", args: "url": "<url>", "question": "<what_you_want_to_find_on_website>" 3.启动GPT Agent: "start_agent", args: "name": "<name>", "task": "<short_task_desc>", "prompt": "<prompt>" 4. Message GPT Agent: "message_agent", args: "key": "<key>", " Message ": "< Message >" 5. 列出GPT代理:"list_agents",参数: 6. 删除GPT代理:"delete_agent",参数:"key": "<key>" 7. 克隆存储库:"clone_repository", args: "repository_url": "<url>", "clone_path": "<目录>" 8. 写文件:“write_to_file”,arg游戏:“文件”:“<文件>”,“文本”:“文本> <” 9. 读取文件:"read_file",参数:"file": "<文件>" 10. 追加到文件:"append_to_file", args: "file": "<file>", "text": "<text>" 11. 删除文件:"delete_file",参数:"file": "<file>" 12. 搜索文件:"search_files",参数:"directory": "<directory>" 13. Analyze Code: "analyze_code", args: " Code ": "<full_code_string>" 14. 获取改进代码:"improve_code", args: "suggestions": "<list_of_suggestions>", " Code ": "<full_code_string>" 15. 写测试:"write_tests", args: "code": "<full_code_string>", "focus": "<list_of_focus_areas>" 16. 执行Python文件:"execute_python_file", args: " File ": "< File >" 17. 生成图像:"generate_image",参数:"prompt": "<prompt>" 18. 发送Tweet: "send_tweet",参数:"text": "<text>" 19. 不做任何事:"do_nothing",参数: 20.任务完成(关机):"task_complete",参数:"reason": "<reason>" 资源: 1. 上网搜索和收集信息。 2. 长期记忆管理。 3.支持GPT-3.5的代理,用于委派简单的任务。 4. 文件输出。 性能评价: 1. 不断回顾和分析你的行为,以确保你在尽你最大的努力。 2. 经常建设性地自我批评你的大局观行为。 3.反思过去的决策和策略,以改进你的方法。 4. 每个命令都有成本,所以要聪明和高效。以最少的步骤完成任务为目标。 您应该只以JSON格式响应,如下所述 响应格式: { "思想":{ “文本”:“思想”, “推理”:“推理”, "plan": "-短项目符号\n-传达\n-长期计划的列表", “批评”:“建设性自我批评”; “说”:“想法总结对用户说” }, "命令":{ name:命令名; " args ": { "参数名":"值" } } } 确保响应可以被Python json.loads解析 |
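正如上文所说,AutoGPT 的大量代码都在做格式解析:要求模型只输出上述 JSON,再用 Python 的 json.loads 解析并分发命令。下面是一个极简的解析草图(并非 AutoGPT 的真实实现,仅示意解析失败时的兜底处理):

```python
# 解析 Agent 回复的草图:先尝试 json.loads,失败则截取首个 {...} 再试,
# 最终取出 command.name 与 command.args 供后续分发执行。
import json

def parse_agent_reply(reply_text):
    try:
        data = json.loads(reply_text)
    except json.JSONDecodeError:
        # 模型偶尔会在 JSON 前后夹带多余文字,这里做一次简单的兜底截取
        start, end = reply_text.find("{"), reply_text.rfind("}")
        if start == -1 or end == -1:
            raise ValueError("no JSON object found in model reply")
        data = json.loads(reply_text[start:end + 1])
    command = data.get("command", {})
    return command.get("name"), command.get("args", {})
```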
GPT-Engineer:可以根据通过自然语言指定的任务,为其构建完整代码库。其采用LLM进行任务细分和需求澄清
| GPT-Engineer is another project to create a whole repository of code given a task specified in natural language. The GPT-Engineer is instructed to think over a list of smaller components to build and ask for user input to clarify questions as needed. Here are a sample conversation for task clarification sent to OpenAI ChatCompletion endpoint used by GPT-Engineer. The user inputs are wrapped in {{user input text}}. | GPT-Engineer是另一个项目,旨在根据自然语言中指定的任务创建整个代码存储库。GPT-Engineer被要求考虑构建一系列较小组件的列表,并根据需要向用户提出问题以澄清。 以下是GPT-Engineer使用的任务澄清发送到OpenAI ChatCompletion端点的示例对话。用户输入被包装在{{user input text}}中。 |
| Then after these clarification, the agent moved into the code writing mode with a different system message. System message: | 然后,在这些澄清之后,Agent使用不同的系统消息进入代码编写模式。系统消息: |
| Conversation samples: | 对话示例: |
| [ { "role": "system", "content": "You will read instructions and not carry them out, only seek to clarify them.\nSpecifically you will first summarise a list of super short bullets of areas that need clarification.\nThen you will pick one clarifying question, and wait for an answer from the user.\n" }, { "role": "user", "content": "We are writing {{a Super Mario game in python. MVC components split in separate files. Keyboard control.}}\n" }, { "role": "assistant", "content": "Summary of areas that need clarification:\n1. Specifics of the Super Mario game (e.g. level design, characters, gameplay mechanics)\n2. Details about the MVC components (e.g. which components are in each file)\n3. Keyboard control implementation (e.g. which keys to use, how to handle input)\n\nClarifying question:\nCan you provide more details about the Super Mario game, such as level design, characters, and gameplay mechanics?" }, { "role": "user", "content": "{{There are 10 levels in total. The main character is a plumber named Mario, who can walk and jump. It is a classical platform game just like Super Mario. The main character moves from left to right, trying to get to the destination, where there are many obstacles and attacks from enemies in the process.}}\n\nIs anything else unclear? If yes, only answer in the form:\n{remaining unclear areas} remaining questions.\n{Next question}\nIf everything is sufficiently clear, only answer \"Nothing more to clarify.\"." }, { "role": "assistant", "content": "Remaining unclear areas: 2 remaining questions.\nCan you provide more information about how the MVC components are split into separate files?" }, { "role": "user", "content": "{{Make your own assumptions and state them explicitly before starting}}" } ] | [ {"role": "system", "content": "你将阅读指令,而不是执行它们,只是寻求澄清它们。具体来说,你首先要总结一个需要澄清的区域的超级简短项目列表。然后,您将选择一个澄清问题,并等待用户的答案。\ n " }, { " 角色”:“用户”、 “内容”:“我们正在编写{{超级马里奥游戏在python中。MVC组件拆分在单独的文件中。键盘控制。}} \ n " }, { " 角色”:“助理”、 “内容”:“总结区域需要澄清:\ n1。《超级马里奥》游戏的细节(游戏邦注:如关卡设计、角色、玩法机制)关于MVC组件的详细信息(例如,哪些组件在每个文件中)\n3。键盘控制实现(例如使用哪个键,如何处理输入)澄清问题:你能提供更多关于《超级马里奥》游戏的细节吗,比如关卡设计、角色和游戏机制?” }, { " 角色”:“用户”、 “内容”:“{{总共有10水平。主角是一个名叫马里奥的水管工,他能走能跳。这是一款和《超级马里奥》一样的经典平台游戏。主角从左到右移动,试图到达目的地,在这个过程中有许多障碍和敌人的攻击。}}\n\还有什么不清楚的吗?如果是,只回答形式:\n{剩余不清楚的地方}剩余问题。\n \n{下一个问题}\n如果一切都足够清楚,只回答\“没有更多的东西要澄清。\”。 }, {"role": "assistant", "content": "Remaining unclear areas: 2余题。你能提供更多关于MVC组件如何分成单独文件的信息吗?” }, {"role": "user", "content": "{{你自己做假设,并在开始前明确说明}}"}] |
| You will get instructions for code to write. You will write a very long answer. Make sure that every detail of the architecture is, in the end, implemented as code. Make sure that every detail of the architecture is, in the end, implemented as code. Think step by step and reason yourself to the right decisions to make sure we get it right. You will first lay out the names of the core classes, functions, methods that will be necessary, as well as a quick comment on their purpose. Then you will output the content of each file including ALL code. Each file must strictly follow a markdown code block format, where the following tokens must be replaced such that FILENAME is the lowercase file name including the file extension, LANG is the markup code block language for the code’s language, and CODE is the code: FILENAME You will start with the “entrypoint” file, then go to the ones that are imported by that file, and so on. Please note that the code should be fully functional. No placeholders. Follow a language and framework appropriate best practice file naming convention. Make sure that files contain all imports, types etc. Make sure that code in different files are compatible with each other. Ensure to implement all code, if you are unsure, write a plausible implementation. Include module dependency or package manager dependency definition file. Before you finish, double check that all parts of the architecture is present in the files. Useful to know: You almost always put different classes in different files. For Python, you always create an appropriate requirements.txt file. For NodeJS, you always create an appropriate package.json file. You always add a comment briefly describing the purpose of the function definition. You try to add comments explaining very complex bits of logic. You always follow the best practices for the requested languages in terms of describing the code written as a defined package/project. Python toolbelt preferences:
| 然后,您将输出每个文件的内容,包括所有代码。每个文件必须严格遵循标记代码块格式,其中必须替换以下标记,以便FILENAME是包含文件扩展名的小写文件名,LANG是代码语言的标记代码块语言,code是代码: 文件名 CODE 您将从“entrypoint”文件开始,然后转到由该文件导入的文件,依此类推。请注意,代码应该具有完整的功能。没有占位符。 遵循语言和框架适当的最佳实践文件命名约定。确保文件包含所有的导入,类型等。确保不同文件中的代码彼此兼容。确保实现所有的代码,如果你不确定,写一个合理的实现。包括模块依赖项或包管理器依赖项定义文件。在完成之前,仔细检查架构的所有部分是否都存在于文件中。 需要注意的是:您几乎总是将不同的类放在不同的文件中。对于Python,您总是创建一个适当的requirements.txt文件。对于NodeJS,你总是创建一个合适的包。json文件。您总是添加一个注释,简要描述函数定义的目的。你试着添加注释来解释非常复杂的逻辑。在描述作为已定义包/项目编写的代码方面,您始终遵循所请求语言的最佳实践。 Python工具带首选项: pytest |
| [ { "role": "system", "content": "You will get instructions for code to write.\nYou will write a very long answer. Make sure that every detail of the architecture is, in the end, implemented as code.\nMake sure that every detail of the architecture is, in the end, implemented as code.\n\nThink step by step and reason yourself to the right decisions to make sure we get it right.\nYou will first lay out the names of the core classes, functions, methods that will be necessary, as well as a quick comment on their purpose.\n\nThen you will output the content of each file including ALL code.\nEach file must strictly follow a markdown code block format, where the following tokens must be replaced such that\nFILENAME is the lowercase file name including the file extension,\nLANG is the markup code block language for the code's language, and CODE is the code:\n\nFILENAME\n```LANG\nCODE\n```\n\nYou will start with the \"entrypoint\" file, then go to the ones that are imported by that file, and so on.\nPlease note that the code should be fully functional. No placeholders.\n\nFollow a language and framework appropriate best practice file naming convention.\nMake sure that files contain all imports, types etc. Make sure that code in different files are compatible with each other.\nEnsure to implement all code, if you are unsure, write a plausible implementation.\nInclude module dependency or package manager dependency definition file.\nBefore you finish, double check that all parts of the architecture is present in the files.\n\nUseful to know:\nYou almost always put different classes in different files.\nFor Python, you always create an appropriate requirements.txt file.\nFor NodeJS, you always create an appropriate package.json file.\nYou always add a comment briefly describing the purpose of the function definition.\nYou try to add comments explaining very complex bits of logic.\nYou always follow the best practices for the requested languages in terms of describing the code written as a defined\npackage/project.\n\n\nPython toolbelt preferences:\n- pytest\n- dataclasses\n" }, # … same conversation as earlier, ended with "Make your own assumptions and state them explicitly before starting". { "role": "assistant", "content": "Assumptions:\n1. Model: The model will contain the game's data, such as level information, character states, and enemy positions.\n2. View: The view will handle the game's visuals, including rendering the game objects, backgrounds, and updating the display.\n3. Controller: The controller will manage user input, such as keyboard controls, and update the model accordingly.\n\nWith these assumptions in mind, is there any specific aspect of the keyboard control implementation you would like me to clarify?" }, { "role": "user", "content": "Please now remember the steps:\n\nThink step by step and reason yourself to the right decisions to make sure we get it right.\nFirst lay out the names of the core classes, functions, methods that will be necessary, As well as a quick comment on their purpose.\n\nThen you will output the content of each file including ALL code.\nEach file must strictly follow a markdown code block format, where the following tokens must be replaced such that\nFILENAME is the lowercase file name including the file extension,\nLANG is the markup code block language for the code's language, and CODE is the code:\n\nFILENAME\n```LANG\nCODE\n```\n\nPlease note that the code should be fully functional. 
No placeholders.\n\nYou will start with the \"entrypoint\" file, then go to the ones that are imported by that file, and so on.\nFollow a language and framework appropriate best practice file naming convention.\nMake sure that files contain all imports, types etc. The code should be fully functional. Make sure that code in different files are compatible with each other.\nBefore you finish, double check that all parts of the architecture is present in the files.\n" } ] | [ { “角色”:“系统”, "content": "您将获得编写代码的说明。你会写一个很长的答案。确保体系结构的每个细节最终都作为代码实现。确保架构的每个细节最终都以代码的形式实现。一步一步地思考,让自己做出正确的决定,以确保我们做对了。首先,您将列出必要的核心类、函数和方法的名称,并对它们的用途进行快速注释。然后输出每个文件的内容,包括所有代码。每个文件必须严格遵循标记代码块格式,其中必须替换以下令牌,以便\nFILENAME是包含文件扩展名的小写文件名,\nLANG是代码语言的标记代码块语言,code是代码:\n\nFILENAME\n ' ' LANG\nCODE\n ' ' ' \n\n您将从\"entrypoint\"文件开始,然后转到由该文件导入的文件,依此类下去。请注意,代码应该具有完整的功能。没有占位符。遵循语言和框架适当的最佳实践文件命名约定。确保文件包含所有的导入、类型等。确保不同文件中的代码彼此兼容。确保实现所有的代码,如果你不确定,写一个合理的实现。包含模块依赖项或包管理器依赖项定义文件。在完成之前,仔细检查架构的所有部分是否都存在于文件中。你几乎总是把不同的类放在不同的文件中。对于Python,您总是创建一个适当的requirements.txt文件。对于NodeJS,你总是创建一个合适的包。json文件。您总是添加一个注释,简要描述函数定义的目的。你试图添加注释来解释非常复杂的逻辑。在描述作为定义的\npackage/project编写的代码方面,您始终遵循所请求语言的最佳实践。\n\n\nPython toolbelt preferences:\n- pytest\n- datacclasses \n" }, 和之前的对话一样,以“做你自己的假设,并在开始之前明确地陈述”结束。 { “角色”:“助理”, “内容”:“假设:\ n1。模型:模型将包含游戏的数据,例如关卡信息、角色状态和敌人位置。视图:视图将处理游戏的视觉效果,包括渲染游戏对象、背景和更新显示。控制器:控制器将管理用户输入,如键盘控制,并相应地更新模型。\n\n \n考虑到这些假设,您是否需要我澄清键盘控制实现的任何特定方面?” }, { “角色”:“用户”, “内容”:“现在请记住这些步骤:一步一步地思考,并说服自己做出正确的决定,以确保我们做对了。”首先列出必要的核心类、函数和方法的名称,以及对其目的的快速注释。然后输出每个文件的内容,包括所有代码。每个文件必须严格遵循标记代码块格式,其中必须替换以下标记,以便\nFILENAME是包含文件扩展名的小写文件名,\nLANG是代码语言的标记代码块语言,code是代码:\n\nFILENAME\n ' ' ' LANG\nCODE\n ' ' ' ' \n\n请注意,代码应该是全功能的。没有占位符。您将从“entrypoint”文件开始,然后转到由该文件导入的文件,依此类推。遵循语言和框架适当的最佳实践文件命名约定。确保文件包含所有的导入、类型等。代码应该具有完整的功能。确保不同文件中的代码彼此兼容。在完成之前,仔细检查架构的所有部分是否都存在于文件中。 } ] |
5、Challenges三大挑战:上下文长度有限、长期规划和任务分解存在挑战性、自然语言接口的可靠性不足
| After going through key ideas and demos of building LLM-centered agents, I start to see a couple common limitations: | 在浏览了构建以LLM为中心的Agent的关键思想和演示之后,我开始看到一些共同的限制: |
上下文长度有限:难以包含详细历史信息+更长的上下文才能帮助模型从错误中学习
| >> Finite context length: The restricted context capacity limits the inclusion of historical information, detailed instructions, API call context, and responses. The design of the system has to work with this limited communication bandwidth, while mechanisms like self-reflection to learn from past mistakes would benefit a lot from long or infinite context windows. Although vector stores and retrieval can provide access to a larger knowledge pool, their representation power is not as powerful as full attention. | >> 有限的上下文长度:受限制的上下文容量限制了历史信息、详细说明、API调用上下文和响应的包含。系统的设计必须在这种有限的通信带宽下工作,而像从过去的错误中学习的自我反思机制将从长期或无限的上下文窗口中获益良多。虽然向量存储和检索可以提供对更大知识库的访问,但它们的表示能力不如全注意力那么强大。 |
长期规划和任务分解存在挑战性:LLM在面对意外错误时很难调整计划
| >> Challenges in long-term planning and task decomposition: Planning over a lengthy history and effectively exploring the solution space remain challenging. LLMs struggle to adjust plans when faced with unexpected errors, making them less robust compared to humans who learn from trial and error. | >> 长期规划和任务分解的挑战:在较长的历史上进行规划并有效地探索解决方案空间仍然具有挑战性。LLM在面对意外错误时很难调整计划,与从试错中学习的人类相比,LLM没有那么稳健。 |
自然语言接口的可靠性不足:LLM输出质量难以保证,可能出现格式错误,偶尔还会表现出叛逆行为
| >> Reliability of natural language interface: Current agent system relies on natural language as an interface between LLMs and external components such as memory and tools. However, the reliability of model outputs is questionable, as LLMs may make formatting errors and occasionally exhibit rebellious behavior (e.g. refuse to follow an instruction). Consequently, much of the agent demo code focuses on parsing model output. | >>自然语言接口的可靠性:当前的agent系统依赖于自然语言作为LLM与外部组件(如内存和工具)之间的接口。然而,模型输出的可靠性是有问题的,因为LLM可能会出现格式错误,偶尔会表现出叛逆行为(例如拒绝遵循指令)。因此,Agent演示代码的大部分重点放在解析模型输出上。 |
Citation
| @article{weng2023agent, title = "LLM-powered Autonomous Agents", author = "Weng, Lilian", journal = "lilianweng.github.io", year = "2023", month = "Jun", url = "https://lilianweng.github.io/posts/2023-06-23-agent/" } |
References
[1] Wei et al. “Chain of thought prompting elicits reasoning in large language models." NeurIPS 2022
[2] Yao et al. “Tree of Thoughts: Deliberate Problem Solving with Large Language Models." arXiv preprint arXiv:2305.10601 (2023).
[3] Liu et al. “Chain of Hindsight Aligns Language Models with Feedback “ arXiv preprint arXiv:2302.02676 (2023).
[4] Liu et al. “LLM+P: Empowering Large Language Models with Optimal Planning Proficiency” arXiv preprint arXiv:2304.11477 (2023).
[5] Yao et al. “ReAct: Synergizing reasoning and acting in language models." ICLR 2023.
[6] Google Blog. “Announcing ScaNN: Efficient Vector Similarity Search” July 28, 2020.
[7] https://chat.openai.com/share/46ff149e-a4c7-4dd7-a800-fc4a642ea389
[8] Shinn & Labash. “Reflexion: an autonomous agent with dynamic memory and self-reflection” arXiv preprint arXiv:2303.11366 (2023).
[9] Laskin et al. “In-context Reinforcement Learning with Algorithm Distillation” ICLR 2023.
[10] Karpas et al. “MRKL Systems A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning." arXiv preprint arXiv:2205.00445 (2022).
[11] Weaviate Blog. Why is Vector Search so fast? Sep 13, 2022.
[12] Li et al. “API-Bank: A Benchmark for Tool-Augmented LLMs” arXiv preprint arXiv:2304.08244 (2023).
[13] Shen et al. “HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace” arXiv preprint arXiv:2303.17580 (2023).
[14] Bran et al. “ChemCrow: Augmenting large-language models with chemistry tools." arXiv preprint arXiv:2304.05376 (2023).
[15] Boiko et al. “Emergent autonomous scientific research capabilities of large language models." arXiv preprint arXiv:2304.05332 (2023).
[16] Joon Sung Park, et al. “Generative Agents: Interactive Simulacra of Human Behavior." arXiv preprint arXiv:2304.03442 (2023).
[17] AutoGPT. GitHub - Significant-Gravitas/AutoGPT: An experimental open-source attempt to make GPT-4 fully autonomous.
[18] GPT-Engineer. GitHub - AntonOsika/gpt-engineer: Specify what you want it to build, the AI asks for clarification, and then builds it.