AGI之MFM:《多模态基础模型:从专家到通用助手》翻译与解读之与LLM协同工作的多模态智能体、结论和研究趋势


​AGI之MFM:《Multimodal Foundation Models: From Specialists to General-Purpose Assistants多模态基础模型:从专家到通用助手》翻译与解读之与LLM协同工作的多模态智能体、结论和研究趋势

目录

相关文章

AGI之MFM:《Multimodal Foundation Models: From Specialists to General-Purpose Assistants多模态基础模型:从专家到通用助手》翻译与解读之简介

AGI之MFM:《Multimodal Foundation Models: From Specialists to General-Purpose Assistants多模态基础模型:从专家到通用助手》翻译与解读之视觉理解、视觉生成

AGI之MFM:《Multimodal Foundation Models: From Specialists to General-Purpose Assistants多模态基础模型:从专家到通用助手》翻译与解读之统一的视觉模型、加持LLMs的大型多模态模型

AGI之MFM:《Multimodal Foundation Models: From Specialists to General-Purpose Assistants多模态基础模型:从专家到通用助手》翻译与解读之与LLM协同工作的多模态智能体、结论和研究趋势

6、Multimodal Agents:Chaining Tools with LLM— 与LLM协同工作的多模态智能体

提出新的建模范式:将多个工具或专家与LLMs协同链接以解决复杂的开放问题,不需要训练,只需要示例教导

6.1、Overview概述

建模范式的演变:特定任务模型→大型多模态模型→带有LLM的链接工具范式(无需任何训练+加持现有开源平台或API工具)

(1)、Evolution of modeling paradigm建模范式的演变:

特定任务的专用模型→预训练模型阶段(预训练+微调的两阶段范式,如NLP中的BERT系列、VL中的UNITER/OSCAR,仍是针对特定任务的微调)→

→基于通用的大型语言/多模态模型(比如GPT系列/PaLM/LLaMA系列/Flamingo)→基于通用模型构建指令跟随模型(比如Alpaca/LLaVA【视觉指令调优】)

(2)、New modeling paradigm:新的建模范式:与LLM链接的工具链

痛点:基本功能上的挑战(数学推理/信息检索)、通用局限性(能力依赖于过时训练数据的世界而无法及时更新信息)

提出语言建模一种新的建模范式(外部NLP工具补充LLMs,如计算器/搜索引擎/翻译系统/日历等)→未来的智能代理(使用工具启用LLMs对多模态信号进行感知)

6.2、Multimodal Agent多模态智能体

代表性作品:VISPROG(第一个利用编程语言将不同的视觉工具与LLM相结合的工作)、Visual ChatGPT(通过各种图像生成工具结合对话提问实现图像编辑)、MM-ReAct(体现LLM通过融合多个视觉专家完成复杂的跨模态行为和推理)

典型多模态智能体框架实现:通过用户与工具分配器的直接交互,由LLM担任分配器的大脑来规划使用单个或协同多个工具完成用户需求的步骤,执行后汇集结果输入到LLM中实现响应

多模态智能体的三个关键组件

Tools工具:提供LLM缺失的多模态信息,比如开源模型/API/代码解释器

Planning规划:将用户需求细化为可执行步骤调用工具

Execution执行:由LLM翻译计划调用工具得到结果与用户对话

6.3、Case Study: MM-REACT案例研究:MM-REACT

6.3.1、System Design系统设计:MM-REACT设计智能体范式,通过ChatGPT作为大脑结合多模态视觉专家,支持图像和视频等多模态输入输出实现多模态推理与行动能力

User prompt用户提示:MM-REACT利用图像文件路径作为ChatGPT输入,让其在规划阶段通过调用视觉工具来理解图像内容并回答用户问题

Planning规划:MM-REACT通过提示词与正则判断是否需要外部工具,并提供工具描述与使用示例指导 ChatGPT合理调用视觉专家完成任务

Execution执行:MM-REACT通过正则匹配解析ChatGPT的动作请求,调用相应工具完成各类视觉任务后,汇总结果与ChatGPT对话,解决用户提出的问题。

Response generation响应生成:MM-REACT实现了对话系统,通过判断是否需要调用外部工具,将所有观察信息分析总结给用户,或利用人名和边界框调用Bing搜索来回答未知详情的 follow-up 问题

6.3.2、Capabilities能力:MM-REACT 证明了多种代表性能力和应用场景

6.3.3、Extensibility可扩展性(工具链构建多模态智能体的优势):可扩展性的两大策略

(1)、Upgrading LLM升级LLM:MM-REACT的系统设计可在不需重训练的情况下将LLM升级为更强大的新模型,比如将ChatGPT升级为仅语言的GPT-4,以期媲美多模态GPT-4的性能

(2)、Plug-and-play 即插即用(添加更多工具):现有的多模态智能体通过即插即用机制整合工具(如HuggingGPT、Chameleon和RestGPT),允许在无需训练的情况下添加更多工具→扩展到数千乃至数百万个工具仍然具有挑战性(TaskMatrix.AI展示的潜力)→SAM可以作为一种工具,实现文本之外(点击、涂鸦、画框等)的多种人机交互方式

6.4、Advanced Topics高级主题

6.4.1、Comparison to Training with LLM in Chapter 5 与第五章中LLM训练的比较

构建基于LLM的高级多模态系统方向上的两个方法→融合两种范式优势的中间领域的可能性→探讨:是否可以用LLaVA替代LLM作为工具分配器+哪些能力需要借助工具来实现

T1、通过指令调整实现端到端模型+直接解释多模态输入中的丰富语义+但需要数据筛选和训练+成本较高

T2、通过将LLM与现成的工具链接+借助上下文中的少样本示例来教导LLM进行规划+但存在如何选择工具的问题且弱领域专家导致性能差

6.4.2、Improving Multimodal Agents提高多模态智能体的性能

痛点:当前主要依赖上下文内的少样本示例来教导LLM进行规划,导致不够可靠和不准确

Composing tools via code generation通过代码生成组合工具(代码仍由LLM生成导致准确性问题):探索使用编程语言来代替自然语言进行更准确的工具使用规划,基于自然语言指令利用GPT-3(Codex)生成Python代码,如视觉编程/ViperGPT 

Improving accuracy in tool using: self-assessment提高工具使用的准确性—自我评估:AssistGPT试图通过自我评估提升工具使用准确性

Improving accuracy in tool using: instruction tuning提高工具使用的准确性—指令调优:通过自我指导产生指令-API对数据集微调较小规模LLM,从而改善工具使用准确性

LMM 作为工具分配器?用LMM替换LLM作为系统中的多模态工具分配器,取消将工具输出统一为文本序列的需求,支持更自然的多模态工具交互,如多模态GPT-4

6.4.3、Diverse Applications of Multimodal Agents多模态智能体的多样应用

通过组合来自特定领域的工具的系统范式可以支持多样化的领域特定应用,比如图像合成、机器人执行、音频生成、3D场景生成、医学图像理解和视觉语言导航等

6.4.4、Evaluation of Multimodal Agents多模态智能体的评估

多模态工具使用能力广泛但其工具使用准确率尚无定量研究:API-Bank是系统评估工具增强型LLM的起点

Emergent capabilities新兴能力:现有的视觉语言基准未能考察到大型多模态智能体的涌现能力,研究人员已经开始设计全面的评估样本来促进LMM评估,比如MM-Vet定义的6个核心视觉语言能力

6.4.5、Tool Creation工具创建

NLP领域:探索通过编写代码或指令来即时创建工具以满足用户需求,比如CREATOR

多模态智能体领域:寻找方法来创建能够处理多模态输入的工具,比如ViperGPT/AutoML GPT

6.4.6、Retrieval-Augmented Multimodal Agents检索增强的多模态智能体

背景:大部分信息存储在数据库中,用户可能需要准确检索这些信息

NLP领域:通过外部数据以结构化语言和关系表示来增强LLMs,通过检索器检索相关文档并使用生成器生成预测,以解决无法将所有世界知识编码到预训练模型的权重中的问题

多模态智能体领域:受检索增强模型的启发,利用视觉和/或文本知识来提升视觉任务,通过检索和应用外部知识,为核心模型提供所需的额外信息来改善任务性能,比如RAC/K-LITE/REACT/RA-CM3

7、Conclusions and Research Trends结论和研究趋势

多模态基础模型在快速发展:共同的总体目标是—创建通用型模型能够遵循人类意图并执行各种域外视觉任务

7.1、Summary and Conclusions总结和结论:多模态基础模型研究的最新进展的两大类

特定用途的多模态基础模型:关注问题相关数据的预训练和零-少样本迁移

通用型助手:关注具有统一网络架构和接口的通用型助手模型研究,为视觉任务提供了类似于NLP中的通用助手的解决方案

7.2、Towards Building General-Purpose AI Agents迈向构建通用AI代理

专门的多模态基础模型→通用视觉助手:目前已出现强大的视觉助手(如Flamingo和multimodal GPT-4),但相比未来的多模态AI智能体仍处于初级阶段

Generalist agents with multi-modality多模态的通用代理:研究趋势是构建一个通用多模态智能体(融合多种通道与世界进行交互),感知和合成视觉信号(如Gato/PaLM-E),其中视觉感知是关键更是挑战

Alignment with human intents与人类意图的对齐:视觉提示在某些场景下比语言表达更精确、更便捷,基于视觉提示的多模态人机交互是解锁新场景的关键

AI智能体系统框架四大组成=基于LLM驱动的智能体大脑+三大组件(视觉模态的作用,规划【改进视觉理解】、记忆【利用上下文学习和交错多模态提示实现短期记忆+通过多模态向量空间快速检索实现长期记忆】和工具使用【智能体学习利用外部API来获取基础模型权重中缺少的知识】)

Acknowledgments


相关文章

AGI之MFM:《Multimodal Foundation Models: From Specialists to General-Purpose Assistants多模态基础模型:从专家到通用助手》翻译与解读之简介

AGI之MFM:《Multimodal Foundation Models: From Specialists to General-Purpose Assistants多模态基础模型:从专家到通用助手》翻译与解读之简介-CSDN博客

AGI之MFM:《Multimodal Foundation Models: From Specialists to General-Purpose Assistants多模态基础模型:从专家到通用助手》翻译与解读之视觉理解、视觉生成

AGI之MFM:《多模态基础模型:从专家到通用助手》翻译与解读之视觉理解、视觉生成_一个处女座的程序猿的博客-CSDN博客

AGI之MFM:《Multimodal Foundation Models: From Specialists to General-Purpose Assistants多模态基础模型:从专家到通用助手》翻译与解读之统一的视觉模型、加持LLMs的大型多模态模型

AGI之MFM:《多模态基础模型:从专家到通用助手》翻译与解读之统一的视觉模型、加持LLMs的大型多模态模型-CSDN博客

AGI之MFM:《Multimodal Foundation Models: From Specialists to General-Purpose Assistants多模态基础模型:从专家到通用助手》翻译与解读之与LLM协同工作的多模态智能体、结论和研究趋势

AGI之MFM:《多模态基础模型:从专家到通用助手》翻译与解读之与LLM协同工作的多模态智能体、结论和研究趋势-CSDN博客

6、Multimodal Agents:Chaining Tools with LLM— 与LLM协同工作的多模态智能体

提出新的建模范式:将多个工具或专家与LLMs协同链接以解决复杂的开放问题,不需要训练,只需要示例教导

Large Language Models (LLMs) (Chowdhery et al., 2022; OpenAI, 2023a) have shown intriguing properties generalizing to user prompts in various domains, and rapidly adapting to new scenarios, using in-context learning with a few examples. Inspired by such strong capabilities, researchers are now exploring a new modeling paradigm that shifts from standalone models for solving finite, pre-defined problems, into synergistically chaining multiple tools or experts with LLMs to solve complicated, open problems. Unlike what has been introduced in Chapter 5, such a system can be built without any training involved, just by using a few demonstration examples to teach the LLM to generate proper calling to existing tools.

大型语言模型(LLMs)(Chowdhery等人,2022;OpenAI,2023a)已经展示了一些有趣的特性,可以泛化到各个领域的用户提示,并通过少量示例的上下文学习快速适应新的场景。受到这种强大能力的启发,研究人员现在正在探索一种新的建模范式:从解决有限的、预定义问题的独立模型,转向将多个工具或专家与LLMs协同链接,以解决复杂、开放的问题。与第5章中介绍的不同,这样的系统可以在不涉及任何训练的情况下构建,只需使用少量示范示例来教导LLM生成对现有工具的适当调用即可。

In this chapter, we review the fast-evolving literature on chaining different multimodal experts with LLMs to solve complicated multimodal understanding problems, referred to as multimodal agents. We start with an overview on the evolution of this modeling paradigm in Section 6.1, highlighting the differences between traditional approaches and the new modeling paradigm of chaining tools with LLM. Section 6.2 gives a general overview of multimodal agents. Pivoting on an exemplary multimodal agent MM-REACT (Yang* et al., 2023), Section 6.3 comprehensively reviews how to build a multimodal agent, its emerging capabilities in multimodal understanding, and how it can be easily extended to incorporate the latest and strongest LLM and potentially millions of tools. Finally, in Section 6.4, we end the chapter with discussions on advanced topics, such as how to improve/evaluate multimodal agents, the diverse applications powered by multimodal agents.

在本章中,我们将回顾有关将不同多模态专家与LLMs协同链接、以解决复杂多模态理解问题的快速发展文献,这类系统被称为多模态智能体。

第6.1节对这种建模范式的演变进行概述,强调传统方法与"用LLM链接工具"这一新建模范式之间的差异。

第6.2节对多模态智能体进行总体概述。

第6.3节以典型的多模态智能体MM-REACT(Yang* et al., 2023)为中心,全面回顾了如何构建多模态智能体、它在多模态理解方面的新兴能力,以及如何轻松扩展以纳入最新和最强大的LLM以及潜在的数百万工具。

第6.4节以高级主题的讨论结束本章,例如如何改进/评估多模态智能体,以及由多模态智能体驱动的各种应用。

6.1、Overview概述

建模范式的演变:特定任务模型→大型多模态模型→带有LLM的链接工具范式(无需任何训练+加持现有开源平台或API工具)

We first revisit the evolution of modeling paradigms, from task-specific models to the most recent large multimodal models, which all require data curation and model training. We then introduce the new modeling paradigm of chaining tools with LLM, which may not require any training, but instead directly takes advantage of a pre-trained LLM and existing tools that are widely available through open-source platforms or APIs.

首先,我们重新审视了建模范式的演变,从特定任务模型到最新的大型多模态模型,所有这些都需要数据管理和模型训练。然后,我们引入了"用LLM链接工具"的新建模范式,它可能不需要任何训练,而是直接利用预训练的LLM,以及通过开源平台或API可广泛获取的现有工具。

(1)、Evolution of modeling paradigm建模范式的演变

特定任务的专用模型→预训练模型阶段(预训练+微调的两阶段范式,如NLP中的BERT系列、VL中的UNITER/OSCAR,仍是针对特定任务的微调)→

As summarized in Figure 6.1, we are witnessing the transition from task-specific models towards general-purpose assistants across language, vision, and multi-modal research.

We started with task-specific models that are trained on small-scale well-annotated data. This results in dedicated models (Anderson et al., 2018; Li et al., 2019a; Yu et al., 2019) for each task or even each dataset.

如图6.1所总结的那样,我们正在目睹语言、视觉和多模态研究从特定任务模型向通用助手的过渡。

我们从特定于任务的模型开始,这些模型是在小规模、注释良好的数据上训练的。这就产生了针对每个任务甚至每个数据集的专用模型(Anderson等人,2018;Li等人,2019a;Yu等人,2019)。

We then transitioned to the phase of pre-trained models, with the pretrain-then-finetune paradigm widely adopted across both NLP and vision-language (VL) research. During pre-training, the model can take advantages of large-scale, web-crawled noisy data, for example, millions to billions of image-text pairs (Chen et al., 2020d; Wang et al., 2022a), or billions of text tokens (Devlin et al., 2019; Liu et al., 2019). However, it is still mostly task-specific finetuned, requiring similarly small-scale, well-annotated data as the ones used in training task-specific models. This paradigm has led to many well-known models, such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019) in NLP, and UNITER (Chen et al., 2020d), OSCAR (Li et al., 2020b) in VL. These early VL foundation models were considered to be large-scale (trained with 10M image-text pairs), but may be of intermediate or even small size in today’s view (billions of pairs).

然后,我们过渡到了预训练模型的阶段,在NLP和视觉语言(VL)研究中广泛采用了预训练-微调的范式。在预训练过程中,模型可以利用大规模的网络抓取的嘈杂数据,例如数百万到数十亿的图像-文本对(Chen等人,2020d;Wang等人,2022a),或数十亿的文本token(Devlin等人,2019;Liu等人,2019)。但是,它仍然主要是针对特定任务的微调,需要与训练特定任务模型所用数据类似的小规模、注释良好的数据。这种范式产生了许多著名的模型,例如NLP中的BERT(Devlin等人,2019)、RoBERTa(Liu等人,2019),以及VL中的UNITER(Chen等人,2020d)、OSCAR(Li等人,2020b)。这些早期的VL基础模型曾被认为是大规模的(以1000万个图像-文本对训练),但以今天的标准(数十亿对)来看可能只是中等甚至较小的规模。

→基于通用的大型语言/多模态模型(比如GPT系列/PaLM/LLaMA系列/Flamingo)→基于通用模型构建指令跟随模型(比如Alpaca/LLaVA【视觉指令调优】)

Nowadays, we are entering a new era of generalist modeling, where the pre-training has been further scaled up to trillions of text tokens (Gao et al., 2023b). For downstream adaptation, these generalist models have shown strong performance with in-context few-shot learning on a few demonstration examples, or even zero-shot evaluation. These models are what we now refer as large language/multimodal models, including the GPT family (OpenAI, 2022, 2023a), PaLM family (Chowdhery et al., 2022; Driess et al., 2023), LLaMa (Touvron et al., 2023), Flamingo (Alayrac et al., 2022).

Based on the generalist models, the pipeline of building instruction-following models covered in Chapter 5, similarly follows the pretrain-then-finetune paradigm. For example, Alpaca (Taori et al., 2023), is built on top of the pre-trained LLaMa (Touvron et al., 2023), then finetuned on a smaller-scale instruction tuning dataset. Similarly, for instruction-following VL models (e.g. LLaVA (Li et al., 2023e)), an additional stage of image-text alignment pre-training is introduced to align the visual representations to the frozen LLM first, followed by visual instruction tuning.

如今,我们正在进入通用建模的新时代,其中预训练已经进一步扩展到数万亿的文本token(Gao等人,2023b)。在下游适应方面,这些通用模型只需在少量示范示例上进行上下文少样本学习,甚至零样本评估,就能展示出强大的性能。这些模型就是我们现在所说的大型语言/多模态模型,包括GPT系列(OpenAI,2022,2023a)、PaLM系列(Chowdhery等人,2022;Driess等人,2023)、LLaMa(Touvron等人,2023)、Flamingo(Alayrac等人,2022)。

在通用模型的基础上,第5章所介绍的构建指令跟随模型的流程同样遵循预训练然后微调的范式。例如,Alpaca(Taori等人,2023)是在预训练的LLaMa(Touvron等人,2023)之上构建的,然后在较小规模的指令调优数据集上进行微调。同样,对于指令跟随的VL模型(例如LLaVA(Li等人,2023e)),引入了额外的图像-文本对齐预训练阶段,先将视觉表示与冻结的LLM对齐,然后再进行视觉指令调优。

(2)、New modeling paradigm:新建模范式:与LLM链接的工具链

痛点:基本功能上的挑战(数学推理/信息检索)、通用局限性(能力依赖于过时训练数据的世界而无法及时更新信息)

New modeling paradigm: chaining tools with LLM. LLMs (Brown et al., 2020; Chowdhery et al., 2022; OpenAI, 2023a) have demonstrated exceptional abilities to tackle new tasks with only a few examples or textual instructions, showing the promise of serving as general-purpose foundations for many applications. Despite being versatile and impressive, they encounter challenges with the basic functionalities, such as mathematical reasoning and information retrieval. Furthermore, a fundamental limitation of not only LLMs but also other large-scale models nowadays, is that they only represent the world described by their training data, which will inevitably become outdated over time. Regularly re-training the model with the latest information is simply not feasible.

新的建模范式:与LLM链接的工具链。

LLMs(Brown等人,2020;Chowdhery等人,2022;OpenAI,2023a)已经展示出了只使用少量示例或文本指令就能处理新任务的卓越能力,显示出它们有望作为许多应用的通用基础。尽管它们用途广泛且令人印象深刻,但它们在数学推理信息检索等基本功能方面面临挑战。此外,不仅LLMs,而且现今的其他大规模模型的一个基本限制是,它们只代表其训练数据中描述的世界,这将不可避免地随着时间的推移而过时。用最新的信息定期重新训练模型是不可行的。

提出语言建模一种新的建模范式(外部NLP工具补充LLMs,如计算器/搜索引擎/翻译系统/日历等)→未来的智能代理(使用工具启用LLMs对多模态信号进行感知)

Meanwhile, many tasks with real-world impact cannot be readily tackled by LLMs alone. For example, accessing up-to-date information and performing computations, can be done via existing tools (e.g. , search engine or calculator). Hence, recent research in language modeling has explored a new modeling paradigm by supplementing LLMs with external NLP tools (Nakano et al., 2021; Huang et al., 2022b; Ahn et al., 2022), including calculators, search engines, translation systems, calendars, or even API calls on other models.

The above studies mainly focus on a single modality, i.e., language, in which the output of the tools are in text format, thereby can naturally be fed into LLMs as additional knowledge. However, we live in a multimodal world and a truly intelligent agent should be able to perform advanced multimodal reasoning and actions. How to enable LLMs with perception of multimodal signals via tool using, is the focus of the remaining part of this chapter.

与此同时,许多对现实世界有影响的任务不能轻易地由LLMs 单独解决。例如,访问最新信息和执行计算可以通过现有工具(例如,搜索引擎或计算器)来完成。因此,语言建模的最新研究探索了一种新的建模范式,用外部NLP工具补充LLMs(Nakano等人,2021;Huang等人,2022b;Ahn等人,2022),包括计算器、搜索引擎、翻译系统、日历,甚至其他模型的API调用。

上述研究主要集中在单一模态,即语言,其中工具的输出以文本格式呈现,因此可以自然地输入到LLMs中作为额外的知识。然而,我们生活在一个多模态的世界,一个真正智能的代理应该能够执行高级的多模态推理和行动。如何通过使用工具启用LLMs对多模态信号进行感知,是本章剩余部分的重点。

6.2、Multimodal Agent多模态智能体

代表性作品:VISPROG(第一个利用编程语言将不同的视觉工具与LLM相结合的工作)、Visual ChatGPT(通过各种图像生成工具结合对话提问实现图像编辑)、MM-ReAct(体现LLM通过融合多个视觉专家完成复杂的跨模态行为和推理)

There are several representative works on building multimodal agent with tool use of vision experts, including VISPROG (Gupta and Kembhavi, 2022b), Visual ChatGPT (Wu et al., 2023a) and MM-ReAct (Yang* et al., 2023). VISPROG is the very first work on using programming language to chain different vision tools with a LLM. Visual ChatGPT enables dialogue-based image editing by complementing ChatGPT (OpenAI, 2022) with various image generation tools. MM-ReAct shows that when collaborating various advanced vision experts, ChatGPT can perform complex multimodal actions and reasoning. Figure 6.2 presents the fast-evolving literature in multimodal agents from November 18, 2022 to July 26th, 2023. Among them, we include a few more exemplary multimodal agents in Table 6.1, along with two representative works in the NLP domain.

利用视觉专家工具构建多模态智能体的代表性作品包括VISPROG(Gupta和Kembhavi,2022b)、Visual ChatGPT(Wu等人,2023a)和MM-ReAct(Yang*等人,2023)。

>> VISPROG是第一个使用编程语言将不同的视觉工具与LLM链接起来的作品。

>> Visual ChatGPT通过结合ChatGPT(OpenAI,2022)和各种图像生成工具,实现了基于对话的图像编辑

>> MM-ReAct表明,当协作各种高级视觉专家时,ChatGPT可以执行复杂的多模态操作和推理。

图6.2展示了从2022年11月18日到2023年7月26日多模态智能体领域的快速发展文献。其中,我们在表6.1中列出了几个更具代表性的多模态智能体,以及NLP领域的两个代表性作品

典型多模态智能体框架实现:通过用户与工具分配器的直接交互,由LLM担任分配器的大脑来规划使用单个或协同多个工具完成用户需求的步骤,执行后汇集结果输入到LLM中实现响应

An overview of a typical multimodal agent framework is illustrated in Figure 6.3. The user directly interacts with the Tool Allocator, which functions as the brain of the agent. In current literature, the tool allocator is usually a LLM. To achieve the user’s goal, the LLM will outline all the steps necessary with either a single tool or collaborating multiple tools together. Subsequently, it will retrieve from all the candidate tools for the needed tools, and execute possibly multiple rounds of tools to fulfill the human requirement. Finally, the execution results from the tools are gathered as inputs of the LLM to generate a response to the user. Next, we cover the three key components of multimodal agents.

典型多模态智能体框架的概述如图6.3所示。用户直接与工具分配器交互,它充当智能体的大脑。在当前文献中,工具分配器通常是一个LLM。为了实现用户的目标,LLM会列出所需的全部步骤,既可以使用单个工具,也可以协同多个工具。随后,它将从所有候选工具中检索所需的工具,并可能执行多轮工具调用以满足人类需求。最后,工具的执行结果被收集起来作为LLM的输入,以生成对用户的响应。

接下来,我们将介绍多模态智能体的三个关键组成部分。

多模态智能体的三个关键组件

Tools工具:提供LLM缺失的多模态信息,比如开源模型/API/代码解释器

Tools. Tools are external modules that are callable by the LLM to obtain extra information that is missing from the model weights, including open-source models, public/private APIs, or code interpreters. As LLMs only accept language inputs, one must include tools that can process multimodal inputs to build a multimodal agent.

工具。工具是可以由LLM调用的外部模块,用于获取模型权重中缺失的额外信息,包括开源模型、公共/私有API代码解释器。由于LLMs只接受语言输入,因此必须包含能够处理多模态输入的工具来构建多模态智能体

Planning规划:将用户需求细化为可执行步骤调用工具

Planning. During planning, the LLM decomposes the user requests into smaller, manageable sub-problems, and outlines a step-by-step solution, each of which involves calling an external tool. There are two ways to teach LLMs for planning. One is to prompt the LLM with in-context few-shot examples of all candidate tools. This approach can extend the general model directly but is limited by the context length. The other approach relies on large amounts of annotated data to fine-tune the LLM, which most likely will damage the robustness and generalizability of the model.

规划。在规划过程中,LLM将用户的请求分解为较小、可管理的子问题,并概述了逐步解决方案,每个解决方案都涉及调用外部工具。

教导LLMs进行规划有两种方式。一种是使用所有候选工具的上下文少样本示例来提示LLM进行规划。这种方法可以直接扩展通用模型,但受上下文长度的限制。另一种方法依赖大量标注数据对LLM进行微调,这很可能会损害模型的稳健性和泛化能力。

Execution执行:由LLM翻译计划调用工具得到结果与用户对话

Execution. The generated plan is further translated into executable calls to the required tools, which can be done via regular expression matching (Yang* et al., 2023); directly prompting LLMs to generate executable programs (Surís et al., 2023); or leveraging in-context few-shot learning capability of LLMs by providing natural language instructions that describe the roles of each module together with a few calling examples (Lu et al., 2023b). The execution results are fed back to the LLM to generate a response to the user.

执行。生成的计划进一步转化为对所需工具的可执行调用,可以通过正则表达式匹配(Yang*等人,2023)来完成;或直接提示LLMs生成可执行程序(Surís等人,2023);或者通过提供描述每个模块角色的自然语言指令以及一些调用示例,来利用LLMs的上下文少样本学习能力(Lu等人,2023b)。执行结果反馈给LLM,以生成对用户的响应。
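下面给出一个极简的代码草图,把上述三个关键组件(工具、规划、执行)串成一个最小的智能体循环。其中 call_llm、image_captioning、ocr 等函数名与提示词格式均为本文为说明流程而虚构的占位,并非论文或任何具体系统的真实实现:

```python
import re

# ---- 工具(Tools):可被 LLM 调用的外部模块,此处均为虚构占位实现 ----
def image_captioning(image_path: str) -> str:
    return f"(示意)对 {image_path} 的整体图像描述"

def ocr(image_path: str) -> str:
    return f"(示意)从 {image_path} 中提取的场景文字"

TOOLS = {"Image Captioning": image_captioning, "OCR": ocr}

def call_llm(prompt: str) -> str:
    """占位函数:实际系统中这里是对 ChatGPT/GPT-4 等 LLM 的 API 调用。"""
    raise NotImplementedError

# ---- 规划(Planning):用工具描述作为提示词前缀,指示 LLM 产生行动请求 ----
PLANNING_PREFIX = (
    "你可以调用以下工具:\n"
    + "\n".join(f"- {name}: 输入图像文件路径,输出文本观察结果" for name in TOOLS)
    + "\n若需要工具,请回复:Assistant, <工具名>? <文件路径>\n"
)

# ---- 执行(Execution)与响应:解析行动请求、调用工具、把观察结果回灌给 LLM ----
ACTION_PATTERN = re.compile(r"Assistant,\s*(?P<tool>[^?]+)\?\s*(?P<path>\S+)")

def run_agent(user_query: str, image_path: str, max_rounds: int = 3) -> str:
    history = f"{PLANNING_PREFIX}\n用户: {user_query} <{image_path}>\n"
    for _ in range(max_rounds):
        reply = call_llm(history)
        match = ACTION_PATTERN.search(reply)
        if match is None:                        # 无行动请求:LLM 直接给出最终回答
            return reply
        tool = TOOLS[match.group("tool").strip()]   # 这里假设 LLM 输出的工具名合法
        observation = tool(match.group("path"))     # 调用工具得到观察结果
        history += f"{reply}\n观察结果: {observation}\n"  # 观察结果序列化为文本回灌
    return call_llm(history + "请总结以上观察结果并回答用户。")
```

实际系统中,规划前缀通常还会附带若干上下文对话示例,执行阶段也可能跨多轮调用多个工具,但整体的"规划→执行→汇总响应"流程与上面的草图一致。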

6.3、Case Study: MM-REACT案例研究:MM-REACT

We use MM-REACT (Yang* et al., 2023) as a case study to show how to build a multimodal agent, its emerging capabilities in multimodal understanding, and how it can be easily extended to incorporate the latest and strongest LLM and potentially millions of tools.

我们以MM-REACT(Yang*等人,2023)作为案例研究,展示如何构建多模态智能体,它在多模态理解方面的新兴能力,以及如何轻松扩展整合最新和最强大的LLM以及潜在的数百万工具。

6.3.1、System Design系统设计:MM-REACT设计智能体范式,通过ChatGPT作为大脑结合多模态视觉专家,支持图像和视频等多模态输入输出实现多模态推理与行动能力

MM-ReAct designs the system paradigm that composes numerous multimodal tools with ChatGPT (OpenAI, 2022) for multimodal reasoning and action. By augmenting the language-only ChatGPT with various multimodal tools, MM-REACT supports both inputs and outputs in multi-modalities, including text, image and video, as shown in Figure 6.4.

Figure 6.5 shows the system design of MM-REACT. The multimodal tools explored in MM-REACT are mainly computer vision models that take an image as input and interpret the image content from different perspectives. For instance, the image captioning model generates a natural description, the OCR model extracts the scene text in the image, the celebrity recognition model identifies the celebrity names, and the object detection model extracts the salient object with bounding box locations. LLMs such as ChatGPT serves as the brain of the agent, which plans on which tools to use, and in what order, based on the input image and the user intent. Next, with the example in Figure 6.5, we unfold the planning and execution of MM-REACT behind the scene.

MM-ReAct设计了一种系统范式,将多个多模态工具与ChatGPT(OpenAI, 2022)组合起来,用于多模态推理和行动。通过用各种多模态工具增强仅支持语言的ChatGPT,MM-REACT支持多种模态的输入和输出,包括文本、图像和视频,如图6.4所示。

图6.5显示了MM-REACT的系统设计。MM-REACT中探讨的多模态工具主要是计算机视觉模型,它们以图像作为输入并从不同角度解释图像内容。例如,图像字幕模型生成自然描述OCR模型提取图像中的场景文本名人识别模型识别名人姓名物体检测模型提取带有边界框位置的显著物体。像ChatGPT这样的LLM充当了代理的大脑,它根据输入图像和用户意图规划使用哪些工具以及以什么顺序使用。接下来,通过图6.5中的示例,我们将揭示MM-REACT在幕后的规划和执行过程。

User prompt用户提示:MM-REACT利用图像文件路径作为ChatGPT输入,让其在规划阶段通过调用视觉工具来理解图像内容并回答用户问题

User prompt. As ChatGPT only accepts language inputs, to enable image as inputs, we simply use the file path as the input to ChatGPT. The file path functions as a placeholder, allowing ChatGPT to treat it as a black box and later seek help from different tools during the planning stage. Besides the input image, the user can also provide the intent in text format (e.g. , a question about the input image). When there is no text input from the user, the goal is to get a general understanding about the image.

由于ChatGPT仅接受语言输入,为了启用图像作为输入,我们简单地使用文件路径作为ChatGPT的输入。文件路径充当占位符,允许ChatGPT将其视为黑盒,并在规划阶段寻求不同工具的帮助。除了输入图像,用户还可以以文本格式提供意图(例如关于输入图像的问题)。当用户没有提供文本输入时,目标是对图像有一个大致的了解

Planning规划:MM-REACT通过提示词与正则判断是否需要外部工具,并提供工具描述与使用示例指导 ChatGPT合理调用视觉专家完成任务

Planning. Upon receiving the input image and user prompt, ChatGPT plans for what tools to use. Inspired by REACT (Yao et al., 2022c), MM-REACT instructs ChatGPT to respond with certain watchwords, such as “Assistant, what objects are there in the image? <file path>”, if a specific tool is required (i.e., action request in Figure 6.5). In practice, one can tell whether a multimodal tool is needed by simply string-matching the keyword “Assistant,” in ChatGPT’s response.

规划。在接收到输入图像和用户提示后,ChatGPT规划使用哪些工具。受到REACT(Yao等人,2022c)的启发,MM-REACT指导ChatGPT在需要特定工具时(即图6.5中的操作请求)以特定的关键词回应,例如"助手,图像中有什么对象?<文件路径>"。在实践中,只需对ChatGPT的响应字符串匹配关键字"Assistant,",即可判断是否需要多模态工具。

MM-ReAct encourages ChatGPT to show the thought (reasoning) process to highlight why an external tool is needed, which has been proven beneficial in NLP studies (Yao et al., 2022c). In addition, for generating proper calling to each tool, both instructions and in-context examples are added as the prefix when prompting ChatGPT. Each tool is described with the model name, a general description of its capability, the input data format, and the output information. After describing each tool, a few in-context dialogue examples are also included for enhanced performance.

MM-ReAct鼓励ChatGPT展示思考(推理)过程,以突出为什么需要外部工具,这在NLP研究中已被证明是有益的(Yao等人,2022c)。此外,为了生成对每个工具的正确调用,当提示ChatGPT时,会将指令和上下文示例作为前缀添加。每个工具都用模型名称、其功能的一般描述、输入数据格式和输出信息来描述。在描述每个工具之后,还包括了一些上下文对话示例,以增强性能

Execution执行:MM-REACT通过正则匹配解析ChatGPT的动作请求,调用相应工具完成各类视觉任务后,汇总结果与ChatGPT对话,解决用户提出的问题。

Execution. Given the action request from ChatGPT, the tool name and the file path can be parsed via regular expression matching, which are used to invoke the tool (action execution).

Take the example shown in Figure 6.5, upon receiving the input image, ChatGPT first invokes a series of tools for a general understanding about the image. The invoked tools include image captioning for an overall description of the image; dense captioning to get the region-level, more detailed description about the objects in the image; object tagging to get the tags of the objects in the image; face detection to get the box coordinates of the two faces mentioned in the object tags. The outputs from the tools (i.e. observations) are serialized as text, and fed back to ChatGPT.

Combining observations with the chat history, ChatGPT can further invoke additional experts or return the final answer to the user. In this specific example, ChatGPT invokes a second round of thought-action-observation over the two faces detected in the image and calls celebrity recognition to get the names of these two persons.

执行。根据ChatGPT的行动请求,可以通过正则表达式匹配解析工具名称和文件路径,这些信息用于调用工具(操作执行)。

以图6.5中显示的示例为例,收到输入图像后,ChatGPT首先调用一系列工具以对图像进行一般性理解。所调用的工具包括:图像字幕,用于生成图像的总体描述;密集字幕,用于获取图像中物体的区域级、更详细的描述;对象标签,用于获取图像中物体的标签;人脸检测,用于获取对象标签中提到的两张脸的边界框坐标。工具的输出(即观察结果)被序列化为文本,并反馈给ChatGPT。

观察结果聊天历史结合起来,ChatGPT可以进一步调用其他专家或将最终答案返回给用户。在此特定示例中,ChatGPT在图像中检测到的两张脸上进行了调用第二轮的思考-行动-观察,并调用名人识别以获取这两位人物的姓名。

Response generation响应生成:MM-REACT实现了对话系统,通过判断是否需要调用外部工具,将所有观察信息分析总结给用户,或利用人名和边界框调用Bing搜索来回答未知详情的 follow-up 问题

When ChatGPT decides no external tools are needed, it takes consideration of all observations gathered and summarize them as the response to the user, which is “This image contains two celebrities, Kobe Bryant and Paul Pierce. They are both basketball players.” for the example shown in Figure 6.5.

If the user continues to interact with MM-REACT, it repeats the process described above, but with all observations and chat history available when planning for the tools needed. For instance, if the user then asks “how many championship rings did the player on the left win in his career”, it is not available in the existing observations nor chat history, but ChatGPT has the bounding boxes to decide who is on the left, and also the names of the players. It plans to invoke Bing Search to find the right answer, which should be 5.

当ChatGPT确定不需要外部工具时,它会考虑到收集到的所有观察结果,并将它们总结为向用户的响应,例如图6.5中所示的响应是“这张图像包含两位名人,科比·布莱恩特和保罗·皮尔斯。他们都是篮球运动员。”。

如果用户继续与MM-REACT进行互动,它将重复上述过程,但在规划所需工具时会考虑到所有观察结果和聊天历史。例如,如果用户接着问"左边的球员在他的职业生涯中赢得了多少个总冠军戒指",现有观察结果和聊天历史中没有该信息,但ChatGPT可以利用边界框判断谁在左边,并结合球员的名字,规划调用Bing Search来找到正确的答案,答案应该是5。

6.3.2、Capabilities能力:MM-REACT 证明了多种代表性能力和应用场景

Figure 6.6 shows the representative capabilities and application scenarios that MM-REACT demonstrates, including visual math and text reasoning, understanding visual-conditioned jokes/memes, spatial/coordinate understanding, visual planning and prediction, multi-image reasoning, multi-hop document understanding, open-world concept understanding, video analysis and summarization.

In addition, we show an example of the full response from MM-REACT on multi-image reasoning in Figure 6.7, which may not be easily achievable by visual instruction tuning in Chapter 5. For more comprehensive examples of all emerging capabilities of MM-REACT, we refer the reader to the original paper.

图6.6显示了MM-REACT展示的代表性能力应用场景,包括视觉数学和文本推理、理解视觉条件下的笑话/表情、空间/坐标理解、视觉规划和预测、多图像推理、多跳文档理解、开放世界概念理解、视频分析和总结。

此外,图6.7中展示了MM-REACT在多图像推理方面的完整响应示例,这可能在第5章的视觉指令调优中不容易实现。对于MM-REACT的所有新兴能力的更全面示例,我们建议读者参阅原始论文。

6.3.3、Extensibility可扩展性(工具链构建多模态智能体的优势):可扩展性的两大策略

One favorable property of tool chaining to build multimodal agents is that the system can be easily extended and enhanced, from two perspectives. One is to upgrade the core part of the system, the LLM, and the other is to expand the number of external tools.

工具链构建多模态智能体的一个有利特性是系统易于扩展和增强,从两个方面来看。一个是升级系统的核心部分LLM,另一个是扩展外部工具的数量

(1)、Upgrading LLM升级LLM:MM-REACT的系统设计可不需重训练就将LLM升级为更强大的新模型,比如ChatGPT升级到多模态能力的GLP-4

The system design of MM-REACT allows for upgrading the core part of the system, the LLM, to newer and more powerful models as they come out, without the need of re-training. We show an example in Figure 6.8 on upgrading ChatGPT to language-only GPT-4, which improves MM-REACT to potentially match the performance of multimodal GPT-4.

MM-REACT的系统设计允许在新模型推出时,将系统的核心部分LLM升级为更新、更强大的模型,而无需重新训练。我们在图6.8中展示了将ChatGPT升级为仅支持语言的GPT-4的示例,这使MM-REACT有望达到多模态GPT-4的性能。

(2)、Plug-and-play 即插即用(添加更多工具):现有的多模态智能体通过插拔式机制整合工具(如HuggingGPT、Chameleon和RestGPT)允许在无需训练的情况下添加更多工具→扩展到数千万个工具仍然具有挑战性(TaskMatrix.AI的潜力)→SAM可以作为一种工具来实现与多模态智能体的多种方式的人际互动

Plug-and-play (adding more tools). Existing multimodal agents incorporates tools via a plug-and-play mechanism, allowing adding more tools without training. One prominent work along this direction is HuggingGPT (Shen et al., 2023b), which proposes to leverage all open-source models hosted on huggingface. Chameleon (Lu et al., 2023b), incorporates not only huggingface models, but also open-source models from GitHub, Bing search API, and python compiler. RestGPT (Song et al., 2023) proposes a multi-level online planning framework that effectively handles the practical challenges associated with integrating LLMs with more than 100 RESTful APIs. However, it remains challenging in scaling this framework to thousands to millions of tools, which is the potential future demonstrated in TaskMatrix.AI (Liang et al., 2023b).

现有的多模态智能体通过即插即用的机制集成工具,允许在无需训练的情况下添加更多工具。沿着这个方向的一项重要工作是HuggingGPT(Shen等人,2023b),该工作提出利用托管在huggingface上的所有开源模型。Chameleon(Lu等人,2023b)不仅整合了huggingface模型,还整合了来自GitHub的开源模型、必应搜索API和Python编译器。RestGPT(Song等人,2023)提出了一个多层在线规划框架,有效处理了将LLMs与超过100个RESTful API集成的实际挑战。然而,将该框架扩展到数千乃至数百万个工具仍然具有挑战性,这是TaskMatrix.AI(Liang等人,2023b)所展示的未来潜力。

Moreover, one can leverage SAM (Kirillov et al., 2023) as a tool to allow for more types of human interaction with the multimodal agent other than text. Recall in MM-REACT, the user intent is all captured by the natural language query from the user. In InternGPT (Liu et al., 2023l), by connecting the tool SAM with GPT, it allows for more ways to interact with the system, for example, via clicks, scribbles, and drawing bounding boxes. These additional interactions, to some extent, are mimicking the action of finger-pointing when we humans are having a conversation.

此外,人们可以利用SAM(Kirillov等人,2023)作为一种工具,允许以文本之外的方式与多模态智能体进行更多类型的人机交互。回顾一下,在MM-REACT中,用户意图完全通过用户的自然语言查询来捕获。在InternGPT(Liu等人,2023l)中,通过将工具SAM与GPT连接,可以用更多方式与系统进行互动,例如通过点击、涂鸦和绘制边界框。在某种程度上,这些额外的互动方式模拟了我们人类对话时用手指指点的动作。
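下面是"即插即用"地向智能体注册新工具的一个示意草图:工具连同其自然语言描述登记到注册表后,只需重新拼接提示词前缀即可被LLM规划调用,无需任何训练。其中 segment_by_click 只是为说明接入SAM这类工具而虚构的包装函数,并非真实的SAM接口:

```python
# 即插即用地向现有智能体注册新工具的示意(无需任何训练)。
TOOL_REGISTRY = {}

def register_tool(name: str, description: str):
    """装饰器:把工具函数连同自然语言描述登记到注册表,供 LLM 规划时参考。"""
    def wrapper(fn):
        TOOL_REGISTRY[name] = {"fn": fn, "description": description}
        return fn
    return wrapper

@register_tool("Segment by Click", "输入图像路径与点击坐标 (x, y),返回被点中物体的分割结果描述")
def segment_by_click(image_path: str, x: int, y: int) -> str:
    # 虚构的包装函数:实际中这里会调用 SAM 一类的分割模型
    return f"(示意){image_path} 中坐标 ({x}, {y}) 处物体的分割掩码"

def build_tool_prompt() -> str:
    """把注册表中的工具描述拼成提示词前缀;新增工具后无需重训,重新拼接即可生效。"""
    return "\n".join(f"- {n}: {m['description']}" for n, m in TOOL_REGISTRY.items())

if __name__ == "__main__":
    print(build_tool_prompt())
```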

6.4、Advanced Topics高级主题

In this section, we discuss more advanced topics and shed light on potential future directions.

在本节中,我们将讨论更高级的主题,并探讨潜在的未来发展方向

6.4.1、Comparison to Training with LLM in Chapter 5 与第五章中LLM训练的比较

构建基于LLM的高级多模态系统方向上的两个方法→融合两种范式优势的中间领域的可能性→探讨:是否可以用LLaVA替代LLM作为工具分配器+哪些能力需要借助工具来实现
T1、通过指令调整实现端到端模型+直接解释多模态输入中的丰富语义+但需要数据筛选和训练+成本较高
T2、通过将LLM与现成的工具链接+借助上下文中的少样本示例来教导LLM进行规划+但存在如何选择工具的问题且弱领域专家导致性能差

We have covered two directions on building advanced multimodal systems based on LLMs. As the key distinction, the multimodal agents in this chapter leverages LLMs’ high-level planning abilities to allocate various multimodal tools, while training multimodal models with LLMs in Chapter 5 solely leverages LLMs for text generation conditioned on multimodal inputs.

Nonetheless, both of these methods exhibit their respective advantages and disadvantages. On one hand, instruction tuning enables an end-to-end model that directly interprets rich semantics in multimodal inputs, but requires data curation and training, hence more computationally expensive. However, limited instruction tuning data may limit its capabilities in certain scenarios, such as OCR. On the other hand, one can build a multimodal agent without any training by chaining LLMs with abundant off-the-shelf models/APIs/code interpreters as tools, and leveraging in-context few-shot examples to teach LLMs on planning. However, as there is no training, the system may fail to invoke the right tool. Moreover, weak domain experts may produce noisy outputs, that can confuse LLM on planning or reasoning, leading to weak performance.

我们已经介绍了两种基于LLM构建高级多模态系统方法。作为关键区别,本章中的多模态智能体利用了LLM的高级规划能力分配各种多模态工具,而第5章中使用LLM训练多模态模型仅利用LLM来生成基于多模态输入的文本

然而,这两种方法各有优缺点。一方面,指令调优使端到端模型能够直接解释多模态输入中的丰富语义,但需要数据管理和训练,因此计算成本更高;而且有限的指令调优数据可能会限制其在某些场景下的能力,例如OCR。另一方面,可以将LLM与大量现成的模型/API/代码解释器链接在一起作为工具,并利用上下文中的少样本示例来教导LLM进行规划,从而在无需任何训练的情况下构建多模态智能体。然而,由于没有训练,系统可能无法调用正确的工具。此外,弱领域专家可能会产生有噪声的输出,这会干扰LLM的规划或推理,导致性能较差。

Though these two approaches exhibit distinct variations, we envision the possibility of an intermediate domain that amalgamates the strengths of both paradigms, and raise the following questions. Now that we have open-source LMM such as LLaVA (Liu et al., 2023c), can we replace the LLM with LLaVA as the tool allocator? If so, what capabilities would require a tool to be enabled? And what problems can be solved by instruction tuning. These are interesting directions that may worth exploring in the near future.

尽管这两种方法存在明显差异,但我们设想了一种融合两种范式优势的中间领域的可能性,并提出以下问题:既然我们有像LLaVA(Liu等人,2023c)这样的开源LMM,是否可以将LLM替换为LLaVA作为工具分配器?如果可以,哪些能力需要借助工具来实现?又有哪些问题可以通过指令调优来解决?这些都是值得在不久的将来探索的有趣方向。

6.4.2、Improving Multimodal Agents提高多模态智能体的性能

痛点:当前主要依赖上下文内的少样本示例来教导LLM进行规划,导致不够可靠和不准确

Existing multimodal agents mainly rely on in-context few-shot examples to teach LLM on planning, which can be unreliable, leading to inaccurate tool using. To improve the accuracy in planning, several works have been proposed and we group them into three categories below.

现有的多模态智能体主要依赖于上下文中的少样本示例来教导LLM进行规划,这可能不可靠,导致工具使用不准确。为了提高规划的准确性,已经提出了一些方法,我们将它们分为以下三类。

Composing tools via code generation通过代码生成组合工具(代码仍由LLM生成导致准确性问题):探索使用编程语言来代替自然语言进行更准确的工具使用规划,基于自然语言指令利用GPT-3(Codex)生成Python代码,如视觉编程/ViperGPT 

Most existing multimodal agents uses natural language to prompt LLM for planning which tool to use. Researchers (Gupta and Kembhavi, 2023; Surís et al., 2023) have also been exploring using programming language for more accurate execution. Visual programming (Gupta and Kembhavi, 2023) is a prominent work along this direction, which uses the in-context learning ability of GPT-3 (Brown et al., 2020) to generate python-like modular programs from natural language instructions for compositional visual tasks. ViperGPT (Surís et al., 2023) instructs GPT-3 Codex (Chen et al., 2021a) to generate Python code to compose multimodal tools for a one-round query answering. However, as the codes are still generated by a LLM, the problem of inaccurate tool using still remains.

大多数现有的多模态智能体使用自然语言提示LLM规划使用哪个工具。研究人员(Gupta和Kembhavi,2023;Surís等人,2023)也一直在探索使用编程语言来实现更准确的执行。视觉编程(Gupta和Kembhavi,2023)是这个方向的一项突出工作,它利用GPT-3(Brown等人,2020)的上下文学习能力,从自然语言指令中生成类似Python的模块化程序,用于组合式视觉任务。ViperGPT(Surís等人,2023)指示GPT-3 Codex(Chen等人,2021a)生成Python代码,以组合多模态工具完成单轮查询回答。然而,由于代码仍然是由LLM生成的,工具使用不准确的问题依然存在。

Improving accuracy in tool using: self-assessment提高工具使用的准确性自我评估:AssistGPT试图通过自我评估提升工具使用准确性

A recent work AssistGPT (Gao et al., 2023a) tries to improve the accuracy in tool using via self-assessment. It adds a stage of inspection and learning loop into the system. When the round of plan and execution is finished, the system inspects the outcome, and determines whether the reasoning path of calling the tool is a success or not, if so, save it as an in-context example, to teach LLM for a more accurate tool calling in the future rounds.

最近的一项工作AssistGPT(Gao等人,2023a)尝试通过自我评估来提高工具使用的准确性。它在系统中添加了一个检查与学习循环的阶段。当一轮规划和执行完成后,系统检查结果,判断调用工具的推理路径是否成功;如果成功,则将其保存为上下文示例,以指导LLM在后续轮次中更准确地调用工具。
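下面用一个极简草图示意这种"自我评估"式的检查与学习循环:每轮规划—执行结束后由LLM判断推理路径是否成功,成功的轨迹被存入上下文示例库供后续复用。plan_and_execute 与 assess_llm 均为虚构占位函数,并非AssistGPT的真实实现:

```python
# "自我评估"思路的极简示意:成功的推理轨迹被保存为少样本示例,供后续规划复用。
SUCCESSFUL_TRACES = []          # 充当"上下文示例库"

def assess_llm(trace: str) -> bool:
    """占位:实际中由 LLM 判断该推理路径是否正确解决了用户问题。"""
    raise NotImplementedError

def plan_and_execute(query: str, in_context_examples: list) -> str:
    """占位:按 6.2 节的规划与执行流程运行一轮,返回完整的推理轨迹文本。"""
    raise NotImplementedError

def run_with_self_assessment(query: str) -> str:
    trace = plan_and_execute(query, SUCCESSFUL_TRACES)
    if assess_llm(trace):                     # 检查与学习循环
        SUCCESSFUL_TRACES.append(trace)       # 成功轨迹作为新的上下文示例
    return trace
```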

Improving accuracy in tool using: instruction tuning提高工具使用的准确性—指令调优通过自我指导产生指令-API对数据集微调较小规模LLM,从而改善工具使用准确性

Improving accuracy in tool using: instruction tuning. Another thread on improving accuracy in tool using is to combine the system with instruction tuning (Patil et al., 2023; Yang et al., 2023c). One can generate a dataset of instruction-API pairs via self-instruct to tune a smaller LLM (e.g. , Vicuna-7B (Vicuna, 2023)).

提高工具使用的准确性—指令调优。另一种提高工具使用准确性的方法是将系统与指令调优(Patil等人,2023;Yang等人,2023c)相结合。可以通过自我指导生成指令-API对的数据集,用来微调较小的LLM(例如,Vicuna-7B(Vicuna,2023))。
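下面的草图示意如何用自我指导(self-instruct)的思路生成"指令→API调用"训练对,再用于微调较小规模的LLM。种子示例、提示词与 llm() 均为虚构占位,仅用于说明数据构造流程:

```python
# 自我指导式地构造 指令-API 对数据集的示意;所得 JSONL 数据可用于微调较小的 LLM。
import json

SEED_EXAMPLES = [
    {"instruction": "这张图里有哪些物体?", "api_call": "object_detection(image_path)"},
]

def llm(prompt: str) -> str:
    """占位:实际中调用强 LLM,基于种子示例生成新的 指令/API 对(JSON 格式)。"""
    raise NotImplementedError

def generate_pairs(n: int):
    prompt = ("请仿照以下示例,生成新的指令与对应的 API 调用(JSON):\n"
              + json.dumps(SEED_EXAMPLES, ensure_ascii=False))
    pairs = []
    for _ in range(n):
        pairs.append(json.loads(llm(prompt)))   # 每条形如 {"instruction": ..., "api_call": ...}
    return pairs

# 生成的 pairs 写成 JSONL 后,即可作为指令调优数据去微调较小规模的 LLM(如 Vicuna-7B)。
```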

LMM 作为工具分配器?用LMM替换LLM作为系统中的多模态工具分配器,取消将工具输出统一为文本序列的需求,支持更自然的多模态工具交互,如多模态GPT-4

LMM as the tool allocator?

In addition, as LMMs evolve, we envision that the LLM can be replaced by a LMM as the tool allocator in the system, to enable even more advanced application scenarios. If the tool allocator can take multimodal inputs, there is no need to unify the outputs of tools into text sequence, allowing more natural interactions between the tool allocator and multi-modal tools, particularly those producing multimodal outputs. For instance, one can imagine using multimodal GPT-4 (OpenAI, 2023a) to coordinate various image or video generation tools to make a short movie by providing it with a sketch of the storyline and visual examples of the main characters.

此外,随着LMM的发展,我们设想LMM可以取代LLM作为系统中的工具分配器,以实现更高级的应用场景。如果工具分配器可以接受多模态输入,就无需将工具的输出统一为文本序列,从而允许工具分配器多模态工具之间进行更自然的交互,特别是那些生成多模态输出的工具。例如,可以想象使用多模态GPT-4(OpenAI,2023a)来协调各种图像或视频生成工具,通过提供故事情节的草图和主要角色的视觉示例来制作短片。

6.4.3、Diverse Applications of Multimodal Agents多模态智能体的多样应用

通过组合来自特定领域的工具的系统范式可以支持多样化的领域特定应用,比如图像合成、机器人执行、音频生成、3D场景生成、医学图像理解和视觉语言导航等

By composing tools from a specific domain, this new system paradigm can also support diverse domain-specific applications.

Yu et al. (2023b) composes LLMs with image synthesis tools and object-level/pixel-level image understanding tools to build a data synthesis pipeline to provide diverse annotations on synthesized image. Instruct2Act (Huang et al., 2023c) complements the LLM with robotic executors, to enable robotic actions based on multi-modal instructions. When chaining a pool of audio models with LLM, AudioGPT (Huang et al., 2023a) can understand and generate speech, music, sound and talking head. Similarly, WavJourney (Liu et al., 2023i) further supports compositional audio creation with storylines encompassing speech, music, and sound effects. With tracking, captioning, audio understanding models, ChatVideo (Wang et al., 2023c) enables ChatGPT to understand multi-channel videos. Other application scenarios include 3D scene generation (Lin et al., 2023; Feng et al., 2023), medical image understanding (Liu and Zuo, 2023; Sun et al., 2023c) and vision-language navigation (Zhou et al., 2023b).

通过组合来自特定领域的工具,这个新的系统范例还可以支持不同的特定于领域的应用程序。

Yu等人(2023b)将LLMs图像合成工具和物体级/像素级图像理解工具组合起来,构建了一个数据合成管道,为合成图像提供多种注释。

Instruct2Act(Huang等人,2023c)将LLM与机器人执行器结合使用,以实现基于多模态指令的机器人动作

在将一组音频模型与LLM链接后,AudioGPT(Huang等人,2023a)可以理解和生成语音、音乐、声音和说话人头像(talking head)。类似地,WavJourney(Liu等人,2023i)进一步支持带有故事情节的组合式音频创作,涵盖语音、音乐和音效。

借助跟踪、字幕、音频理解模型,ChatVideo(Wang等人,2023c)使ChatGPT能够理解多通道视频

其他应用场景包括3D场景生成(Lin等人,2023;Feng等人,2023)、医学图像理解(Liu和Zuo,2023;Sun等人,2023c)和视觉语言导航(Zhou等人,2023b)。

6.4.4、Evaluation of Multimodal Agents多模态智能体的评估

多模态工具使用能力广泛但其工具使用准确率尚无定量研究:API-Bank是系统评估工具增强型LLM的起点

Multimodal tool using. Although we have seen qualitative examples of new scenarios enabled by multimodal agents, it remains unclear how these agents perform in terms of the accuracy in tool using. API-Bank (Li et al., 2023k) is a starting point on building pipeline in systematically evaluating tool-augmented LLMs.

多模态工具使用。尽管我们已经看到了多模态智能体所启用的新场景的定性示例,但目前尚不清楚这些智能体在工具使用准确性方面的表现如何。API-Bank(Li等人,2023k)是构建系统性评估工具增强型LLM的流程的一个起点。


Emergent capabilities新兴能力:现有的视觉语言基准未能考察到大型多模态智能体的涌现能力,研究人员已经开始设计全面的评估样本来促进LMM评估,比如MM-Vet定义的6个核心视觉语言能力

Emergent capabilities. Existing VL benchmarks focus on specific capabilities of interest, such as visual recognition (Antol et al., 2015), image description (Chen et al., 2015; Agrawal et al., 2019), as well as other benchmarks for specialized capabilities such as scene text understanding (Sidorov et al., 2020; Gurari et al., 2018), commonsense reasoning (Zellers et al., 2019), outside knowledge (Schwenk et al., 2022). The intriguing abilities shown in large multimodal models and multi-modal agents are not examined by existing benchmarks, such as solving math problems written on the blackboard, reasoning about events and celebrities in news images, or explaining visual jokes. Furthermore, the long, chatty outputs from these systems poses challenges to today’s evaluation metrics. Researchers (Fu et al., 2023; Liu et al., 2023j) have started to design comprehensive evaluation samples to facilitate the LMM evaluation. As an attempt to test multimodal systems on integrated capabilities, MM-Vet (Yu et al., 2023d) defines 6 core VL capabilities and examines the 16 integrations of interest derived from the capability combination (Figure 6.10). In addition, to accommodate for the open-ended free-form text outputs, MM-Vet proposes an LLM-based evaluator to enable evaluation across different question types and answer styles.

现有的VL基准主要关注特定感兴趣的能力,例如视觉识别(Antol等人,2015)、图像描述(Chen等人,2015;Agrawal等人,2019),以及用于专门能力的其他基准,例如场景文本理解(Sidorov等人,2020;Gurari等人,2018)、常识推理(Zellers等人,2019)、外部知识(Schwenk等人,2022)。

大型多模态模型和多模态智能体中显示出的有趣能力没有被现有的基准所检验,比如解决黑板上写的数学问题,推理新闻图像中的事件和名人,或者解释视觉笑话。

此外,这些系统产生的冗长对话输出对今天的评估指标提出了挑战。研究人员(Fu等人,2023;Liu等人,2023j)已经开始设计全面的评估样本,以促进LMM的评估。作为在综合能力上测试多模态系统的尝试,MM-Vet(Yu等人,2023d)定义了6种核心VL能力,并检查了从能力组合中派生出的16种感兴趣的整合(图6.10)。此外,为了适应自由形式文本输出MM-Vet提出了一种基于LLM的评估器,以实现跨不同问题类型和答案风格的评估。
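下面给出"基于LLM的评估器"思路的一个极简草图:用LLM对比模型的自由形式回答与参考答案并输出分数。提示词模板与 judge_llm 均为虚构占位,并非MM-Vet的真实实现细节:

```python
# 基于 LLM 的评估器示意:对开放式自由文本回答打分,以适配不同题型与回答风格。
JUDGE_PROMPT = (
    "请比较模型回答与参考答案,给出 0 到 1 之间的分数(1 表示完全正确):\n"
    "问题: {question}\n参考答案: {reference}\n模型回答: {prediction}\n分数:"
)

def judge_llm(prompt: str) -> str:
    """占位:实际中调用 GPT-4 等 LLM 作为打分器。"""
    raise NotImplementedError

def score(question: str, reference: str, prediction: str) -> float:
    reply = judge_llm(JUDGE_PROMPT.format(
        question=question, reference=reference, prediction=prediction))
    return float(reply.strip())   # 多个样本的平均分即可作为某一能力维度上的得分
```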

6.4.5、Tool Creation工具创建

NLP领域:探索通过编写代码或指令来即时创建工具以满足用户需求,比如CREATOR
多模态智能体领域:寻找方法来创建能够处理多模态输入的工具,比如ViperGPT/AutoML GPT

Imagine if we have a completely new scenario without a robust tool to use. Can we create a tool based on the user need on-the-fly? In NLP, CREATOR (Qian et al., 2023) proposes to create tools by writing python code for math reasoning, as opposed to calling math solver API such as Wolfram Alpha. Cai et al. (2023) further explores the capabilities of LLMs to make tools, and experiment with two LLMs, one as the tool maker and the other as the tool user to collaboratively solve complicated tasks, such as scheduling a meeting. In terms of multimodal agents, the challenge is how to create a tool that can process multimodal inputs. One may follow ViperGPT (Surís et al., 2023) to instruct LLMs to generate python programs leveraging pre-existent python packages such as OpenCV. AutoML GPT (Zhang et al., 2023j) envisions that one can utilize LLMs to automate the model training pipeline. There may be potential to develop novel multimodal deep learning tools tailored to more effectively address the requirements of users.

想象一下,如果我们有一个全新的场景,没有一个强大的工具可以使用。我们能否根据用户的需求创建一个即时工具?

在NLP领域,CREATOR(Qian等人,2023)建议通过为数学推理编写Python代码来创建工具,而不是调用数学求解器API,例如Wolfram Alpha

Cai等人(2023)进一步探讨了LLM创建工具的能力,并用两个LLM进行了实验:一个作为工具制造者,另一个作为工具使用者,以协同解决复杂的任务,如安排会议。

多模态智能体方面,挑战是如何创建一个可以处理多模态输入的工具。可以借鉴ViperGPT(Sur´ıs等人,2023)的方法,指示LLM利用现有Python包(如Open-CV)生成python程序AutoML GPT(Zhang等人,2023j)设想可以利用LLM自动化模型训练管道。有可能开发出新的多模态深度学习工具,以更有效地满足用户的需求。

6.4.6、Retrieval-Augmented Multimodal Agents检索增强的多模态智能体

背景:大部分信息存储在数据库中,用户可能需要准确检索这些信息
NLP领域:通过外部数据以结构化语言和关系表示来增强LLMs,通过检索器检索相关文档并使用生成器生成预测,以解决无法将所有世界知识编码到预训练模型的权重中的问题

In real-life applications, a substantial portion of information resides within databases, and user needs may require accurate retrieval of such information. Meanwhile, it is infeasible to encode all the world knowledge into the weights of pre-trained models, particularly when it comes to the long-tail concepts and fast-evolving data.

In NLP, several works augment LLMs with external data encoded with structured language and relation representations (Peters et al., 2019; Guu et al., 2020; Lewis et al., 2020). Given input texts, such retrieved-augmented models utilize a retriever that retrieves relevant documents from an external memory, and uses a generator to generate predictions given the retrieved documents.

在实际应用中,大部分信息存储在数据库中,用户可能需要准确检索这些信息。与此同时,将所有世界知识都编码进预训练模型的权重中是不可行的,特别是对于长尾概念和快速演变的数据而言。

在NLP领域,一些工作用以结构化语言和关系表示编码的外部数据来增强LLMs(Peters等人,2019;Guu等人,2020;Lewis等人,2020)。给定输入文本,这类检索增强模型利用检索器从外部存储中检索相关文档,并使用生成器基于检索到的文档生成预测。

多模态智能体领域:受检索增强模型的启发,利用视觉和/或文本知识来提升视觉任务,通过检索和应用外部知识,为核心模型提供所需的额外信息来改善任务性能,比如RAC/K-LITE/REACT/RA-CM3

Motivated by retrieval-augmented models in NLP, several recent works leverage visual and / or textual knowledge to improve vision tasks, such as image classification (Long et al., 2022), captioning (Yang et al., 2023a), question answering (Wu et al., 2021; Marino et al., 2021; Yang et al., 2022d; Chen et al., 2022e), image generation (Blattmann et al., 2022; Sheynin et al., 2022; Chen et al., 2022f; Zhou et al., 2022c), and multi-modal tasks simultaneously (Yasunaga et al., 2022). RAC (Long et al., 2022) improves long-tail classification by retrieving from a non-parametric memory consisting of pre-encoded images and text. K-LITE (Shen et al., 2022a) enhances the text prompts with the retrieved external knowledge that is encoded in natural language. REACT (Liu et al., 2023d) retrieve from billions of the paired knowledge of image-text and aims to improve task transfer performance for core vision problems. Among them, RA-CM3 (Yasunaga et al., 2022) builds the first retrieval-augmented LMM with a multimodal retriever to retrieve multimodal documents, and a retrieval-augmented generator that can generate both text and image. Chaining tools with LLM shares a strong connection with the retrieval-augmented methods in that both leverage external knowledge to provide additional information for the core model to utilize. In the multimodal regime, the image itself can be used as the query to gain external knowledge, either retrieved from a knowledge base, or extracted from another pre-trained vision expert models.

受NLP中的检索增强模型的启发,最近的一些工作利用视觉和/或文本知识来提高视觉任务的性能,例如图像分类(Long等人,2022)、图像描述(Yang等人,2023a)、问答(Wu等人,2021;Marino等人,2021;Yang等人,2022d;Chen等人,2022e)、图像生成(Blattmann等人,2022;Sheynin等人,2022;Chen等人,2022f;Zhou等人,2022c)以及同时进行多模态任务(Yasunaga等人,2022)。

RAC(Long等人,2022)通过从由预先编码的图像和文本组成的非参数存储中进行检索,提高长尾分类性能。

K-LITE(Shen等人,2022a)利用检索到的以自然语言编码的外部知识增强文本提示

REACT(Liu等人,2023d)从数十亿的图像-文本对知识中检索,并旨在提高核心视觉问题的任务迁移性能

其中,RA-CM3(Yasunaga等人,2022)构建了第一个检索增强LMM,使用多模态检索器检索多模态文档,并使用检索增强生成器生成文本和图像。将工具与LLM链接与检索增强方法具有很强的联系,因为两者都利用外部知识为核心模型提供额外信息。在多模态模式下,图像本身可以作为查询来获取外部知识,或者从知识库中检索,或者从另一个预训练的视觉专家模型中提取。
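下面是检索增强的多模态智能体的一个极简草图:把图像编码进多模态向量空间,检索最相近的外部知识条目,再连同问题一起交给核心模型。embed_image、embed_text、core_model 以及知识条目均为虚构占位(实际中可由CLIP类编码器和向量数据库实现):

```python
# 检索增强示意:以图像为查询,从多模态向量空间检索外部知识,再交给核心模型生成回答。
import numpy as np

KNOWLEDGE_TEXTS = ["示例知识条目 A", "示例知识条目 B"]   # 虚构的外部知识库

def embed_text(text: str) -> np.ndarray:
    raise NotImplementedError   # 占位:文本编码器

def embed_image(image_path: str) -> np.ndarray:
    raise NotImplementedError   # 占位:图像编码器(与文本共享向量空间)

def core_model(prompt: str) -> str:
    raise NotImplementedError   # 占位:核心生成模型

def retrieve(image_path: str, top_k: int = 1):
    query = embed_image(image_path)
    keys = np.stack([embed_text(t) for t in KNOWLEDGE_TEXTS])
    # 余弦相似度排序,取最相近的 top_k 条知识
    scores = keys @ query / (np.linalg.norm(keys, axis=1) * np.linalg.norm(query) + 1e-8)
    return [KNOWLEDGE_TEXTS[i] for i in np.argsort(-scores)[:top_k]]

def answer(image_path: str, question: str) -> str:
    knowledge = "\n".join(retrieve(image_path))
    return core_model(f"外部知识:\n{knowledge}\n问题: {question}\n回答:")
```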

7、Conclusions and Research Trends结论和研究趋势

多模态基础模型在快速发展:共同的总体目标是—创建通用型模型能够遵循人类意图并执行各种域外视觉任务

Multimodal foundation models have garnered significant interest among scholars in the fields of computer vision and multimodal vision-language research. Although prevailing research topics, approaches and methodologies have been evolving – encompassing image self-supervised learning, language-image contrastive learning, text-to-image generation, unified vision modeling, and large language-and-vision assistants – they converge on a common overarching objective: the creation of general-purpose models and systems capable of following human intents and effortlessly executing a diverse array of vision and vision-language tasks in the wild. In this chapter, we provide a concise summary of what has been reviewed, and delve into the prevailing research tendencies in the field.

多模态基础模型在计算机视觉和多模态视觉语言研究领域引起了学者们的极大兴趣。尽管流行的研究主题、方法和方法学一直在不断发展,包括图像自监督学习语言-图像对比学习文本到图像生成统一视觉建模以及大规模语言和视觉助手,但它们都聚焦于一个共同的总体目标:创建通用模型和系统,能够遵循人类意图并轻松执行各种域外视觉和视觉语言任务。在本章中,我们对已经回顾过的内容进行了简要总结,并深入探讨了该领域的主要研究趋势。

7.1、Summary and Conclusions总结和结论:多模态基础模型研究的最新进展的两大类

特定用途的多模态基础模型:关注问题相关数据的预训练和零-少样本迁移

This paper surveys the most recent advances at the frontier of multimodal foundation model research, categorized into two classes discussed below.

>>Specific-purpose multimodal foundation models. There is a diverse set of problems to tackle in the computer vision community. To lay a comprehensive foundation for the introduction of general-purpose visual assistants, we have discussed many seminal papers in the era of pre-training. The major paradigm during this period is pre-training on a large amount of problem-related data, and then transferring to a number of real-world scenarios of the same problem type in a zero- or few-shot fashion. More specifically, we have presented two general topics: (i) Visual Understanding in Chapter 2: individual multimodal foundation models have developed to analyze the content of visual data at the image, region, and pixel levels, respectively. The language-augmented vision models are a popular family, contributing to the recent success of visual understanding tasks in the wild. (ii) Visual Generation in Chapter 3: text-to-image generation models have served the foundation for image synthesis, which has been successfully extended to allow user controllability and customization at more fine-grained manners. The availability and creation of large amount of problem-related data has played a key role in making these multimodal foundation models possible.

本文对多模态基础模型研究前沿的最新进展进行了调查,分为以下两类进行讨论。

>> 特定用途的多模态基础模型。在计算机视觉社区中有各种各样的问题需要解决。为了给通用视觉助手的引入奠定一个全面的基础,我们讨论了预训练时代的许多开创性论文。这一时期的主要范式是在大量与问题相关的数据上进行预训练,然后以零样本或少样本的方式迁移到相同问题类型的多种实际场景中。

更具体地说,我们提出了两个主题:

(i) 第2章中的视觉理解:分别面向图像、区域、像素级别的多模态基础模型相继发展起来,用于分析视觉数据的内容。语言增强的视觉模型是一个受欢迎的系列,为域外视觉理解任务近期的成功做出了贡献。

(ii) 第3章中的视觉生成文本到图像生成模型图像合成提供了基础,并已成功扩展到允许用户以更细粒度的方式进行可控性和自定义。问题相关数据的可用性和创建在使这些多模态基础模型成为可能方面发挥了关键作用。

通用型助手关注具有统一网络架构和接口的通用型助手模型研究,为视觉任务提供了类似于NLP中的通用助手的解决方案

>>General-purpose assistants. We have reviewed recently emerged literature on building general-purpose assistants, which often possess an unified network architecture, an unified input-output data format, and a general interface that facilitates easy interaction with humans. Inspired by the success in NLP that LLM such as ChatGPT/GPT-4 is a general assistant for a wide range of language tasks, researchers in computer vision have explored various solutions to their counterpart for vision tasks. Based on how LLM is leveraged in the methodology, existing works can be categorized into three topics: (i) Unified Vision Models in Chapter 4: The spirit of unifying modeling in LLM is borrowed to build unified vision models at different levels and across different tasks. (ii) Training with LLM in Chapter 5: Starting with a pre-trained LLM, visual data is connected to LLM for end-to-end training. (iii) Chaining with LLM in Chapter 6: By freezing LLM, existing vision experts can be triggered by prompt engineering LLM to complete specific vision tasks.

>> 通用助手。我们回顾了最近出现的关于构建通用助手的文献,这些助手通常具有统一的网络架构、统一的输入输出数据格式,以及便于与人类轻松交互的通用接口。受NLP中像ChatGPT/GPT-4这样的LLM已成为广泛语言任务的通用助手这一成功的启发,计算机视觉领域的研究人员也探索了面向视觉任务的各种对应解决方案。根据LLM在方法论中的运用方式,现有的工作可以分为三个主题:

(i) 第4章中的统一视觉模型:借鉴LLM中的统一建模精神,构建了不同层次和不同任务的统一视觉模型。

(ii) 第5章中的LLM训练:从预训练的LLM开始,将视觉数据与LLM连接,进行端到端的训练。

(iii) 第6章中的LLM链接:通过冻结LLM,可以通过提示工程LLM触发现有的视觉专家,以完成特定的视觉任务。

The comparisons among these models are summarized in Table 7.1.

这些模型之间的比较总结在表7.1中。

7.2、Towards Building General-Purpose AI Agents迈向构建通用AI代理

专门的多模态基础模型通用视觉助手目前已出现强大的视觉助手(如Flamingo和 multimodal GPT-4相),但相比未来的多模态AI智能体处于初级阶段

At the end of each chapter, we have discussed future trends for individual topics. The paper itself is organized to demonstrate the transition from specialist multimodal foundation models to general-purpose visual assistants. Though powerful, existing visual assistants such as Flamingo (Alayrac et al., 2022) and multimodal GPT-4 (OpenAI, 2023b) are in the preliminary form, compared with grand vision on building a general-purpose multimodal AI agent via foundation models. In what follows, we highlight a number of research trends towards this goal.

在每章的结尾,我们讨论了各个主题的未来趋势。本文自身的组织方式旨在展示从专门的多模态基础模型通用视觉助手的过渡。尽管像Flamingo(Alayrac等人,2022)和多模态GPT-4(OpenAI,2023b)这样的现有视觉助手已经非常强大,但与通过基础模型构建通用多模态AI智能体的宏伟愿景相比,还处于初级阶段。接下来,我们将重点介绍一些朝着实现这一目标的研究趋势。

Generalist agents with multi-modality多模态的通用代理:研究趋势是构建一个通用多模态智能体(融合多种通道与世界进行交互),感知合成视觉信号(Gato/PaLM-E),其中视觉感知是关键更是挑战

Generalist agents with multi-modality. This aligns with the grand goal of building a single generalist agent that interacts with world like humans through fusing multiple channels such as language, vision, speech and actions. From this perspective, the notion of multimodal foundation models becomes somewhat indistinct on its own. Instead, it serves as a crucial component of the agent for perceiving and synthesizing visual signals. For example, Gato (Reed et al., 2022) and PaLM-E (Driess et al., 2023) perform a wide range of language, multimodal and control tasks with a single set of model weights, where visual perception is a crucial component in understanding the environment. It also raises challenges in the effective and scalable pre-training objectives for unified vision and multimodal modeling.

这与构建一个像人类一样通过融合多个渠道(如语言、视觉、语音和行动)与世界互动的通用智能体的宏伟目标一致。从这个角度看,多模态基础模型的概念本身显得有些模糊;它更多是作为智能体的一个重要组成部分,用于感知与合成视觉信号。例如,Gato(Reed等人,2022)和PaLM-E(Driess等人,2023)使用同一组模型权重执行各种语言、多模态和控制任务,其中视觉感知是理解环境的关键组成部分。这也为统一视觉和多模态建模提出了如何设计有效且可扩展的预训练目标的挑战。

Alignment with human intents与人类意图的对齐视觉提示比语言表达更好,基于视觉提示的多模态人机交互解锁新场景的关键

Alignment with human intents. AI alignment research focuses on steering AI systems towards humans’ intended goals, values, or ethical guidelines. An AI system is deemed aligned when it effectively promotes the desired goals. Though language has exhibited its generality in expressing human intents, it is not always the best option. As demonstrated in SAM (Kirillov et al., 2023) and ControlNet/GLIGEN (Zhang and Agrawala, 2023; Li et al., 2023n), human intents can be more precisely and conveniently represented in visual prompts such as key points, bounding boxes and sketch drawing, for visual understanding and generation tasks, respectively. Building foundation models that are well equipped with such a multimodal human-machine interaction interface is a key step to unlock new use scenarios, where human intents are best represented visually. For example, the spatial arrangement of elements within a scene, as well as the artistic style and visual appeal of a piece of visual art.

AI对齐研究专注于将AI系统引导到人类预期的目标、价值观或道德准则上。当一个AI系统有效地促进所期望的目标时,AI系统被认为是对齐的。尽管语言在表达人类意图方面表现出了其通用性,但它并不总是最佳选择。如SAM(Kirillov等人,2023)和ControlNet/GLIGEN(Zhang和Agrawala,2023;Li等人,2023n)所示,人类意图可以更精确、更方便地以视觉提示的形式表示,如关键点、边界框和草图绘制,用于视觉理解和生成任务。

构建具备这种多模态人机交互接口基础模型,是解锁新的使用场景的关键步骤,其中人类意图最好以视觉方式表示,例如场景中元素的空间排列,以及视觉艺术品的艺术风格和视觉吸引力。

AI智能体系统框架四大组成=基于LLM驱动的智能体大脑+三大组件(视觉模态的作用,规划【改进视觉理解】、记忆【利用上下文学习和交错多模态提示实现短期记忆+通过多模态向量空间快速检索实现长期记忆】和工具使用智能体学习利用外部API来获取基础模型权重缺少的知识】)

Planning, memory, and tool use. It is highlighted in Weng (2023) that a LLM-powered autonomous agent system can be built, where LLM functions as the agent’s brain, complemented by several key components: planning, memory and tool use. Following the framework, we could foresee the role of multimodal foundation models in this AI agent system. (i) Planning. To complete complex tasks in real-world scenarios, the agent should be able to decompose large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks. In the ideal case, the AI agent possesses the self-improvement ability, engaging in self-assessment and introspection regarding previous actions, enabling it to learn from errors and enhance its approach for subsequent endeavors, ultimately leading to better outcomes. Visual modality is a common channel to represent state of the environment. To facilitate planning, it raises challenges in improving the capability of the current visual understanding models in perceiving more fine-grained visual details and longer sequence videos. (ii) Memory. For short-term memory, in-context learning (or prompt engineering) is utilized as short-term memory for the model to learn. Interleaved multimodal prompts can enable new scenarios to clarify the human intents. For long-term memory, it provides the agent with the capability to recall external knowledge over extended sessions, which can be implemented by fast retrieving from a multi-modal vector space (Liu et al., 2023d). In term of modeling, foundation models are required to learn the new skills to effectively leverage both types of memory. (iii) Tool use. The agent learns to utilize external APIs for knowledge that is missing from the foundation model weights. New capabilities are required to deal with the vision modality in several scenarios. For example, based on both the input visual signal and instructions, the model decides and plans whether certain external APIs are needed to complete the goal, such as code execution of detection/segmentation/OCR/generator experts.

Weng(2023)指出,可以构建一个由LLM驱动的自主智能体系统,其中LLM作为智能体的大脑,辅以几个关键组成部分:规划、记忆和工具使用。在这个框架下,我们可以预见多模态基础模型在这一AI智能体系统中的作用。

(i) 规划。为了在现实场景中完成复杂的任务,智能体应该能够将大型任务分解成较小、可管理的子目标,从而实现对复杂任务的高效处理。在理想情况下,AI智能体应具备自我改进的能力,进行自我评估和对先前行动的反思,使其能够从错误中学习,并增强其在后续任务中的方法,最终实现更好的结果。视觉模态是表示环境状态的常见通道。为了便于规划,这对当前视觉理解模型在感知更细粒度的视觉细节和更长的序列视频方面的能力提出了挑战

(ii) 记忆对于短期记忆,可以利用上下文学习(或提示工程)作为模型的短期记忆,以便学习交错的多模态提示可以启用新的场景来澄清人类的意图。对于长期记忆,它为智能体提供了在长时间会话中回忆外部知识的能力,这可以通过从多模态向量空间中快速检索来实现(Liu et al., 2023)。在建模方面,基础模型需要学习新的技能有效地利用这两种类型的记忆。

(iii) 工具使用。智能体学习利用外部API来获取基础模型权重中缺少的知识。在一些场景中,需要新的能力来处理视觉模态。例如,基于输入的视觉信号和指令,模型决定并规划是否需要某些外部API来完成目标,例如执行检测/分割/OCR/生成器等专家的代码。

The field of multimodal foundation models is evolving at a rapid speed, with new directions/methods emerging frequently. There are many important research topics that are not discussed in this paper, mostly due to the daily-updated research innovation. We are optimistic about the future of multimodal foundation models, not only because we are convinced that foreseeable exciting research innovations/ideas in individual areas are becoming reality by following the path of LLM in the near future, but also because connecting computer vision with the broader AI community, and building general-purpose AI agents is going to significantly advance the daily life of human being.

多模态基础模型领域正在快速发展,新的研究方向和方法不断涌现。由于研究创新每天都在更新,还有许多重要的研究主题本文未能讨论。我们对多模态基础模型的未来充满信心:不仅因为我们相信,在不久的将来,沿着LLM的发展路径,各个领域可预见的令人兴奋的研究创新和想法正在成为现实;还因为将计算机视觉与更广泛的AI社区联系起来、构建通用AI智能体,将大大改善人类的日常生活。

Acknowledgments

This book is largely based on our CVPR 2023 tutorial on vision foundation models. Many people have supported us and provided valuable feedback to the writing of this book. We thank all the authors who have contributed to the related papers, which makes the tutorial and book possible. We are also grateful to Mark de Jongh, the editor from the journal of Foundations and Trends® in Computer Graphics and Vision, for inspiring and encouraging us to write the book on multimodal foundation models.

本书主要基于我们在CVPR 2023上关于视觉基础模型的教程。许多人在书写本书过程中为我们提供了支持和宝贵的反馈意见。我们感谢所有为相关论文做出贡献的作者,这使得教程和书籍得以实现。我们还感谢《计算机图形与视觉基础与趋势》期刊的编辑Mark de Jongh,他启发并鼓励我们撰写关于多模态基础模型的书籍。
