AGI之MFM:《多模态基础模型:从专家到通用助手》翻译与解读之统一的视觉模型、加持LLMs的大型多模态模型


​AGI之MFM:《Multimodal Foundation Models: From Specialists to General-Purpose Assistants多模态基础模型:从专家到通用助手》翻译与解读之统一的视觉模型、加持LLMs的大型多模态模型

目录

相关文章

AGI之MFM:《Multimodal Foundation Models: From Specialists to General-Purpose Assistants多模态基础模型:从专家到通用助手》翻译与解读之简介

AGI之MFM:《Multimodal Foundation Models: From Specialists to General-Purpose Assistants多模态基础模型:从专家到通用助手》翻译与解读之视觉理解、视觉生成

AGI之MFM:《Multimodal Foundation Models: From Specialists to General-Purpose Assistants多模态基础模型:从专家到通用助手》翻译与解读之统一的视觉模型、加持LLMs的大型多模态模型

AGI之MFM:《Multimodal Foundation Models: From Specialists to General-Purpose Assistants多模态基础模型:从专家到通用助手》翻译与解读之与LLM协同工作的多模态智能体、结论和研究趋势

4、Unified Vision Models统一的视觉模型

4.1、Overview概述

NLP的发展:2018年之前(不同的NLP任务使用不同的任务特定模型解决,如翻译/语义解析/摘要生成)→2018年之后(GPT-style模型+下一词预测任务+指令微调,如ChatGPT)

Challenges挑战:两个方面

建模方面:差异很大的视觉任务=不同类型输入的任务+不同粒度的任务+输出也具有不同的格式

数据方面:标注成本差异大且粒度和语义各异+收集图像成本高且数量少

Towards a unified vision model朝着统一视觉模型迈进:三大类研究=衔接视觉与语言的桥梁(如CLIP)+统一多任务建模+类似LLM的可提示接口

4.2、From Closed-Set to Open-Set Models从封闭集模型到开放集模型

传统基于任务的定制模型(如图像分类/目标检测,痛点=很难迁移)→近期提出CLIP(通过引入对比语言图像预训练方法来训练开放集模型+不是学习输入到标签的映射,而是学习对齐的视觉-语义空间)

CLIP模型引领了使用大量文本-图像对进行不同粒度的视觉理解模型的发展:图像级(如图像分类/图像-文本检索/图像字幕生成)、区域级(目标检测/短语定位)、像素级(图像分割/指代分割)

维度1:Model initialization模型初始化:

CLIP初始化:主要采用预训练好的模型,基于预训练的ResNet作为视觉编码器+基于预训练的RPN提取区域特征(如OVR-CNN/RegionCLIP)→利用CLIP模型提取像素的密集标签(如MaskCLIP/FreeSeg)→利用CLIP中的冻结卷积网络ConvNeXt来编码各种分辨率的输入图像(如FC-CLIP)

CLIP增强:使用预训练的CLIP模型来辅助模型训练,知识蒸馏CLIP特征(如ViLD)、利用CLIP模型提供特征和分数(例如MaskCLIP和Mask-Adapted CLIP)

其他方法:使用监督预训练模型或从零开始训练(如GLIP/OpenSeeD)、采用联合训练的方法(如GroupViT)、利用预训练的稳定扩散模型提取紧凑的掩码(如ODISE)

维度2:Model design模型设计

两阶段模型:将目标定位和识别分离开+无需额外训练

端到端模型:将目标检测视为textual grounding(文本定位)+整体训练

维度3:Model pre-training模型预训练:三种学习方法

监督学习:利用现有标注转换为语言监督训练=将标签监督转化为语言监督+使用现有的标注数据来训练开放式模型

半监督学习:同时利用标注数据和未标注或弱标记数据+通过丰富的语义信息提高模型的泛化能力

弱监督学习:仅使用弱标注数据训练模型

4.2.1、Object Detection and Grounding目标检测和定位

目标检测(识别和定位感兴趣的对象):基于区域的方法(R-CNN/Fast R-CNN/Faster R-CNN)→提高实时性(YOLO系列)→基于Transformer架构(如DETR/DINO/Group DETR/Co-DETR)

基于开放集目标检测模型的三种主要的评估设置:零样本目标检测(评估模型的迁移能力)、严格的开放词汇目标检测(训练词汇不限,但不得覆盖任何测试类别)、通用的开放词汇目标检测(允许训练/测试词汇有交集+更接近实际应用场景)

目标定位(定位与输入的名词短语相关联的对象):一种广义的开放式目标检测任务,基于Transformer架构的方法(M-DETR/GLIP/DetCLIPv2/Grounding-DINO/Grounding-SAM)=联合学习目标检测和对象定位数据以处理开放式场景+使用语言条件来指导对象检测和定位

4.2.2、Image Segmentation and Referring图像分割和指代分割

图像分割:三个子任务(语义分割【每个像素的语义】+实例分割【相同语义含义的像素分组成对象】+全景分割【前二者】)

基于CNN的架构→基于Transformer的架构(如Mask2Former)

两阶段模型/单阶段模型→基于查询的方法

Open-Vocabulary Segmentation开放词汇分割:将基础模型的丰富视觉-语义知识转移到特定的分割任务中,如(LSeg/OpenSeg、GroupViT、DenseCLIP、MaskCLIP、FC-CLIP、ODISE)

重大挑战:缺乏带有语义标签的分割数据

Referring Segmentation指代分割(一种开放式词汇的任务):使用多模态融合策略设计的模型来处理目标数据集—如CLIPSeg(扩展文本查询)、LAVT(增强跨模态交互)→PolyFormer(将掩模转换为多边形)

Unified Segmentation统一分割:将所有分割任务统一到单一框架中,如X-Decoder(重定义任务+使用通用的编码器-解码器结构)、UNINEXT(早期融合策略来统一不同的分割任务)

Figure 4.6: (a) CV task landscape

4.3、From Task-Specific Models to Generic Models从特定任务模型到通用模型

背景(之前模型主要针对单个任务设计+未能利用不同粒度或领域任务之间的协同关系)、两大原因(视觉任务分类多样【空间+时间+模态】+数据量规模不同→使得建立统一模型面临重重困难)

基于transformers多功能性+致力于建立统一的通用视觉模型的两类探究:I/O统一(将各种视觉任务重新构建为序列到序列问题)、功能统一(使用一致的编码器-解码器架构+需要复杂的模型设计来适应各种任务)

图4.7展示了这两类统一方法的差异:

4.3.1、I/O Unification—I/O统一

类别1—Sparse and discrete outputs稀疏和离散输出

UniTab:通过特殊符号统一文本和坐标输出,通过符号代表坐标进行序列解码实现不同任务输出统一

Pix2SeqV2:统一指代分割和关键点检测,提出任务提示词来区分这两个任务

LLM增强:VisionLLM等则利用优秀的大型预训练语言模型来增强不同视觉语言任务的能力,如Kosmos-2、VisionLLM、DetGPT、GPT4ROI、BubaGPT、LISA和PaLI-X,它们通过融合LLMs的语言理解能力,使模型具备强大的视觉-语言推理能力

类别2—Dense and continuous outputs密集和连续输出

UViM:最早采用密集解码实现任务统一的工作之一

Unified-IO:采用多个VQ-VAE模型和Transformer编码器-解码器,预训练不同任务后端到端训练,来实现任务统一

扩散增强:使用已有的稳定扩散模型来构建通用的视觉模型,如Prompt Diffusion和InstructDiffusion

4.3.2、Functionality Unification功能统一

Multi-task learning多任务学习

Vision models视觉模型:探索使用CNN在不同视觉任务间学习(如Cross-stitch/UberNet),但都难以建立任务间协同关系来提升模型效果,Taskonomy通过学习视觉任务间关系提供深刻启发

Multi-modal models多模态模型:Transformer模型的兴起促进了多任务多模态发展,早期工作主要通过共享部分和任务专门头等方式将多个视觉语言任务联合,但未充分利用任务间协同关系。12in1(通过共享底层特征和专门任务头将多个视觉语言任务联合)、UniT/E2E-VLP(扩展到视觉任务+允许端到端训练)

Unified learning统一学习:借助Transformer和开放集模型的发展,任务间障碍渐渐淡化,使得不同模态输入可以学习共享语义空间

GLIPv2—强调预训练策略:通过预训练任务融合定位与匹配模块训练

X-Decoder—注重设计支持不同任务学习:采用编码器-解码器架构+三个关键设计,分离图像与文本编码器支持不同粒度任务,通过不同类型查询和输出支撑不同粒度任务学习

Uni-Perceiver-v2—侧重统一不同数据级别任务学习:引入提议网络编码预测框来统一处理定位与非定位任务训练

4.4、From Static to Promptable Models从静态到可提示模型

4.4.1、Multi-modal Prompting多模态提示

Spatial prompting空间提示

Visual prompting视觉提示

Others其他

4.4.2、In-context Prompting上下文提示

LLMs中的ICL能力使得模型可借助提示而无需更新模型参数,但视觉模型中的ICL能力的探究较少,两个尝试(如Flamingo和Kosmos-1)但只能生成文本输出

图像修复式视觉提示的方法:通过修复(inpainting)输入图像的方式来教导模型预测稠密输出

Painter→SegGPT :Painter通过预测连续像素输出实现不同任务统一,比如为分割任务用颜色表示不同个体、SegGPT基于Painter去专注图像分割应用

Hummingbird:利用目标与示例图像间的注意力机制聚合信息,为密集预测任务匹配示例图像的标签来进行预测。探索了一种新的视觉上下文学习实现机制,即利用示例与目标间的关联关系来传播语义信息

Discussion讨论:未来研究方向=单一模型能够在多模态输入下以上下文学习方式预测不同类型的输出

4.5、Summary and Discussion总结与讨论

视觉和语言间存在4大固有差异:数据标记处理、数据标签、数据多样性和存储成本

视觉和语言间的固有差异带来的三大探究:域外计算机视觉(无法涵盖世界全貌)、视觉中的规模定律(是否也会有类似LLMs的涌现能力)、以视觉为中心的模型(继续扩大模型规模还是采用适度规模去组合LLMs)

5、Large Multimodal Models: Training with LLM大型多模态模型:与LLM一起训练

5.1、Background 背景

5.1.1、Image-to-Text Generative Models图像到文本生成模型

目前的大型多模态模型(LMM)主要是一种图像到文本的生成模型:将图像作为输入/输出文本序列+采用编码器-解码器架构(图像编码器提取视觉特征/语言模型解码文本序列)+训练目标(通过文本自回归损失进行训练)

5.1.2、Case Studies案例研究

LMM网络结构的不同实践(在架构和预训练策略上有差异+但共同遵循自回归训练目标):基于图像-文本对训练的GIT、BLIP2,基于交错的图像-文本序列训练的Flamingo

Multimodal in-context-learning多模态上下文学习:Flamingo通过提供少量示例实现跨任务转移学习,这种引人注目的上下文学习能力使Flamingo成为多模态领域中的GPT-3时刻

5.1.3、OpenAI Multimodal GPT-4 and Research Gaps—OpenAI Multimodal GPT-4和研究差距:GPT-4引入了多模态输入的能力,这引发了如何在多模态空间进行指导和对齐研究的问题

揭示了当前LMM与GPT-4在多模态上下文学习和对话学习上的差距:GPT-3(体现上下文学习和链式推理)→ChatGPT和InstructGPT(凸显指令跟随与人类对齐的重要性)→GPT-4(突破语言范畴支持视觉输入)

5.2、Pre-requisite: Instruction Tuning in Large Language Models先决条件:大型语言模型中的指令调优

需要在训练中引入任务指令以提升模型通用性:传统的NLP数据表示使用seq2seq格式,但没有明确的任务指令,这导致模型难以在新任务上进行零样本迁移

指令语言数据:将任务指令明确加入模型训练,使模型能够通过任务组合在推理阶段执行多个任务,从而实现在未经训练的情况下解决新任务

5.2.1、Instruction Tuning指令调优:探索如何通过指令调优使大型语言模型(LLMs)能够遵循自然语言指令并完成现实世界的任务

两大方法:使用人类提供的任务指令和反馈进行微调、使用公共基准和数据集进行有监督微调

5.2.2、Self-Instruct and Open-Source LLMs自我指导和开源LLMs

开源社区涌现出众多开放的大型语言模型,ChatGPT和GPT-4的成功为通过指令调优来改进开源LLMs提供了巨大机会

Quick assessment of LLM chatbots—LLM聊天机器人的快速评估:开源模型的执行能力已接近于当下最先进的私有模型,基于开源的LLaMA家族+Vicuna-Instructions-80数据集+GPT-4进行评分

Further discussions进一步的讨论:三个研究方向,数据驱动AI、开源LLMs与专有LLMs之间的差距辩论、基础LLMs的发展

5.3、Instruction-Tuned Large Multimodal Models指导调整的大型多模态模型

如何利用开源资源构建多模态GPT-4的最小原型

Data Creation数据创建:提高模型的多模态能力=将图像转换为符号序列表示+采用图像的标题和边界框信息+三种类型的指令遵循数据

Network Architecture and Training网络架构和训练:LLaVA的网络架构是一个通用的图像到文本生成模型,通过将预训练的CLIP ViT-L/14视觉编码器和大型语言模型Vicuna连接起来,并采用两阶段指令调优过程进行训练(特征对齐的预训练+端到端微调)

Performance性能:LLaVA是一个多模态聊天模型,通过自我指导方法在多模态指令遵循数据上进行微调,在多个任务和领域中展现了良好的性能

5.4、Advanced Topics高级主题

近期指令调优多模态语言模型研究呈现出蓬勃发展的势头,涌现出多个新模型和研究方向,将对多模态语言理解和生成领域产生重要影响。

More Modalities (Beyond VL)更多的模态(超越VL):近期研究致力于将多模态语言模型框架扩展到包括更多的感知模态,如声音、图像、视频等,进一步拓展了多模态语言理解和生成的研究领域。

Improving Visual Instruction Data Quantity and Quality改进视觉指导数据的数量和质量:

Multitask Instruct with Established Academic Datasets/Tasks利用已有学术数据集/任务的多任务指令调优:指令调优可以通过两种不同的方式实现=使用人工标注的提示和反馈对多样化任务进行微调+使用公共基准和数据集进行监督微调

Multimodal In-Context-Learning多模态上下文学习:

Parameter-Efficient Training参数高效训练:精细调整成本过高→参数高效训练和模型量化是减小内存占用的有效方法

Benchmarks基准测试:通过对多个评估指标和数据集开展实验,结果显示开源模型在某些数据集上已与SOTA相当

Applications应用:通过在专业领域如医学等训练专项模型,采用自监督学习方法训练出能够开放回答图像研究问题的对话助手

5.5、How Close We Are To OpenAI Multimodal GPT-4?我们距离OpenAI多模态GPT-4有多近?

很大差距:需要继续提升能力和降低计算门槛

6、Multimodal Agents:Chaining Tools with LLM 多模态智能代理:与LLM协同工作

提出新的建模范式:将多个工具或专家与LLMs协同链接以解决复杂的开放问题,不需要训练,只需要示例教导


相关文章

AGI之MFM:《Multimodal Foundation Models: From Specialists to General-Purpose Assistants多模态基础模型:从专家到通用助手》翻译与解读之简介

AGI之MFM:《Multimodal Foundation Models: From Specialists to General-Purpose Assistants多模态基础模型:从专家到通用助手》翻译与解读之简介-CSDN博客

AGI之MFM:《Multimodal Foundation Models: From Specialists to General-Purpose Assistants多模态基础模型:从专家到通用助手》翻译与解读之视觉理解、视觉生成

AGI之MFM:《多模态基础模型:从专家到通用助手》翻译与解读之视觉理解、视觉生成_一个处女座的程序猿的博客-CSDN博客

AGI之MFM:《Multimodal Foundation Models: From Specialists to General-Purpose Assistants多模态基础模型:从专家到通用助手》翻译与解读之统一的视觉模型、加持LLMs的大型多模态模型

AGI之MFM:《多模态基础模型:从专家到通用助手》翻译与解读之统一的视觉模型、加持LLMs的大型多模态模型-CSDN博客

AGI之MFM:《Multimodal Foundation Models: From Specialists to General-Purpose Assistants多模态基础模型:从专家到通用助手》翻译与解读之与LLM协同工作的多模态智能体、结论和研究趋势

AGI之MFM:《多模态基础模型:从专家到通用助手》翻译与解读之与LLM协同工作的多模态智能体、结论和研究趋势-CSDN博客

4、Unified Vision Models统一的视觉模型

In this chapter, we discuss the unification of vision models. We start with an overview of the challenges in the unification of vision models and the most recent efforts towards this goal in Section 4.1. What follows are detailed discussions on (i) how to transform closed-set models to open-set ones in Section 4.2; (ii) how to unify different granularities of vision tasks in Section 4.3; and (iii) how to build a more promptable interface for vision in Section 4.4. Finally, we summarize the chapter and discuss future trends in Section 4.5.

在本章中,我们讨论了视觉模型的统一。我们首先在第4.1节中概述了统一视觉模型面临的挑战以及最近朝着这一目标努力的工作。接下来,我们将在以下方面进行详细讨论:

(i)如何将封闭集模型转化为开放集模型(第4.2节);

(ii)如何统一不同粒度的视觉任务(第4.3节);

(iii)如何构建更具提示性的视觉接口(第4.4节)。最后,在第4.5节中,我们总结了本章,并讨论了未来的趋势

4.1、Overview概述

NLP的发展:2018年之前(不同的NLP任务使用不同的任务特定模型解决,如翻译/语义解析/摘要生成)→2018年之后(GPT-style模型+下一词预测任务+指令微调,如ChatGPT)

Before talking about general-purpose unified vision systems, we revisit how language models and natural language processing (NLP) have evolved in the past years. Before 2018, different NLP tasks are addressed with different task-specific models, such as translation (Bahdanau et al., 2015), semantic parsing (Berant et al., 2013), summarization (Allahyari et al., 2017), and so on. With the emergence of the transformer architecture (Vaswani et al., 2017), language models for different NLP tasks are unified with a decoder-only architecture, e.g., the GPT models (Brown et al., 2020). Afterwards, the GPT models learned using the next word prediction task are further finetuned to follow human instructions. This leads to ChatGPT 1, which fundamentally changes our expectations on what AI systems can do. The evolution as depicted in Figure 1.1 motivates us to wonder whether we can build a general-purpose vision system in a similar manner.

在讨论通用统一视觉系统之前,我们回顾了自然语言处理(NLP)和语言模型在过去几年中的发展。在2018年之前,不同的NLP任务使用不同的任务特定模型来解决,比如翻译(Bahdanau等人,2015)、语义解析(Berant等人,2013)、摘要生成(Allahyari等人,2017)等等。随着transformer架构(Vaswani等人,2017)的出现,不同NLP任务的语言模型采用了仅解码器的统一架构,例如GPT模型(Brown等人,2020)。随后,使用下一个词预测任务训练的GPT模型进一步微调以遵循人类指令。这促成了ChatGPT的诞生,从根本上改变了我们对AI系统可以做什么的期望。如图1.1所示的进化过程促使我们思考是否可以用类似的方式构建一个通用的视觉系统。

Challenges挑战:两个方面

建模方面:差异很大的视觉任务=不同类型输入的任务+不同粒度的任务+输出也具有不同的格式
数据方面:标注成本差异大且粒度和语义各异+收集图像成本高且数量少

That computer vision tasks vary greatly presents a great challenge to build a unified vision model. First, vision tasks have different types of inputs, ranging from static images (Russakovsky et al., 2015) to sequential videos (Miech et al., 2019), from pure vision inputs such as image dehazing (He et al., 2010) to multi-modality inputs that include e.g., vision and language Antol et al. (2015). Second, different granularities are required for different tasks, such as image-level tasks like image classification (He et al., 2016) and captioning (Vinyals et al., 2016), region-level tasks like object detection (Girshick, 2015) and grounding (Plummer et al., 2015), and pixel-level tasks like image segmentation (He et al., 2017), super-resolution (Wang et al., 2020), etc. As a result, the outputs of vision systems are also of different formats, such as spatial information like edges, boxes, and masks, semantic information like class labels, multi-label tags, or detailed descriptions. In addition to the challenges in modeling, there are also challenges with data. First, the cost of annotation varies greatly among different types of labels. As shown in Figure 4.6, these labels are at different levels of granularity and semantic richness, ranging from whole images, regions (box annotations), to masks (pixel annotations). Second, it is in general much more costly to collect image data than text data. So, the scale of vision data is often much smaller than that of text corpora.

计算机视觉任务的差异很大,这对构建统一视觉模型提出了巨大的挑战。

首先,视觉任务具有不同类型的输入,从静态图像(Russakovsky等人,2015)到序列视频(Miech等人,2019),从纯视觉输入(如图像去雾)(He等人,2010)到多模态输入,包括视觉和语言(Antol等人,2015)。

其次,不同的任务需要不同的粒度,例如图像级任务,如图像分类(He等人,2016)和字幕生成(Vinyals等人,2016),区域级任务,如目标检测(Girshick,2015)和grounding (Plummer等人,2015),像素级任务,如图像分割(He等人,2017)、超分辨率(Wang等人,2020)等。

因此,视觉系统的输出也具有不同的格式,如边缘、框和掩模等的空间信息,类别标签、多标签标记或详细描述等的语义信息。除了建模方面的挑战,数据方面也存在挑战。

首先,不同类型的标签的注释成本差异很大。如图4.6所示,这些标签具有不同的粒度和语义丰富度,从整个图像、区域(边界框注释)到掩码(像素注释)。其次,通常情况下,收集图像数据比文本数据要昂贵得多。因此,视觉数据的规模通常比文本语料库要小得多。

Towards a unified vision model朝着统一视觉模型迈进:三大类研究=衔接视觉与语言的桥梁(如CLIP)+统一多任务建模+类似LLM的可提示接口

Despite these challenges, there is a growing interest in the computer vision community to develop a general-purpose, unified vision system, in particular for visual understanding tasks. As illustrated in Figure 4.1, we group these efforts in three categories:

>> Bridging vision and language. By extending closed-set classification to open-world recognition, the contrastive language-image models like CLIP (Radford et al., 2021) demonstrate impressive zero-shot transferability for different vision tasks. These models learn the mapping between raw visual signals and rich semantics and can power various open-vocabulary vision recognition tasks (Zhong et al., 2022b; Gu et al., 2022; Li et al., 2022f; Ghiasi et al., 2022b).

>> Unified multi-task modeling. Traditional task-specific vision models are trained using task-specific data. It is often prohibitively expensive to develop a model for a new task. Thus, it is desirable to develop a unified vision model that can perform well across many vision tasks (Yang et al., 2022c; Lu et al., 2022a; Zou et al., 2023a; Chen et al., 2022c).

>> LLM-like promptable interface. LLMs can take different language and in-context prompts as inputs and produce user-desired outputs without finetuning. A general-purpose vision model should have possessed the same in-context learning capability to align the output to various user intents without changing its model parameters (Bar et al., 2022; Kirillov et al., 2023; Zou et al., 2023b; Wang et al., 2023j; Balažević et al., 2023).

尽管存在这些挑战,计算机视觉界对开发通用的、统一的视觉系统,尤其是用于视觉理解任务的兴趣正在增长。如图4.1所示,我们将这些努力分为三类:

>> 衔接视觉与语言的桥梁。通过将封闭集分类扩展到开放集识别,对比语言-图像模型如CLIP(Radford等人,2021)展示了对不同视觉任务的令人印象深刻的零样本可迁移性。这些模型学习了原始视觉信号和丰富语义之间的映射关系,可以支持各种开放词汇的视觉识别任务(Zhong等人,2022b;Gu等人,2022;Li等人,2022f;Ghiasi等人,2022)。

>> 统一多任务建模。传统的任务特定视觉模型是使用任务特定数据训练的,为一项新任务开发模型的成本往往高得令人望而却步。因此,开发一种能够在许多视觉任务中表现良好的统一视觉模型是可取的(Yang等人,2022c;Lu等人,2022a;Zou等人,2023a;Chen等人,2022c)。

>> 类似LLM的可提示接口。LLMs可以接受不同的语言和上下文提示作为输入,并在不微调的情况下产生用户期望的输出。通用视觉模型应该具备相同的上下文学习能力,以便在不改变模型参数的情况下,将输出与各种用户意图对齐(Bar等人,2022;Kirillov等人,2023;Zou等人,2023b;Wang等人,2023j;Balažević等人,2023)。

In what follows, we will elaborate the detailed techniques and methods in each category.

接下来,我们将详细讨论每个类别中的技术和方法。

4.2、From Closed-Set to Open-Set Models从封闭集模型到开放集模型

传统基于任务的定制模型(如图像分类/目标检测,痛点=很难迁移)→近期提出CLIP(通过引入对比语言图像预训练方法来训练开放集模型+不是学习输入到标签的映射,而是学习对齐的视觉-语义空间)

Traditionally, visual recognition is formulated as a classification problem that maps raw visual data (e.g., images) to discrete text labels. For example, image classification predicts a label from a pre-defined close set for a whole image (Deng et al., 2009), and object detection identifies the objects, defined in a close set, within an image (Lin et al., 2014). However, such closed-set models can hardly transfer to other tasks where the close set (or vocabulary) is insufficient. For example, it is difficult to apply an object detector trained using the Microsoft COCO object set to detect Minecraft objects. Recently, CLIP (Radford et al., 2021) addresses the limitation of closed-set models by introducing a contrastive language-image pre-training method to train an open-set model. As illustrated in Figure 4.2 (a), instead of learning the mapping from input to labels, CLIP learns an aligned visual-semantic space using hundreds of millions of image-text pairs. Mathematically, the traditional vision tasks optimize the log-likelihood of assigning label y = c to an image, often represented as a feature vector u ∈ R^P:

$$p(y=c \mid u) = \frac{\exp(w_c^{\top} u)}{\sum_{k=1}^{K} \exp(w_k^{\top} u)}, \tag{4.1}$$

where w ∈ R^{K×P} is the projection matrix. Instead of using a pre-determined projection matrix w, the CLIP method uses a text encoder Enc_text for the projection:

$$v_c = \mathrm{Enc}_{\text{text}}(\text{concept}_c), \qquad p(y=c \mid u) = \frac{\exp(v_c^{\top} u)}{\sum_{k=1}^{K} \exp(v_k^{\top} u)},$$

where v plays the role of w in Eq. (4.1). The reason why a text encoder can help achieve open-set recognition is that all textual concepts are embedded in the same feature space through large-scale pre-training, and the feature distributions are coherent to the semantic meanings without the need of a pre-defined vocabulary. As such, the aligned visual-semantic space can be easily transferred to a wide range of image recognition tasks in a zero-shot manner. Please refer to Chapter 2 for a detailed discussion. In the following, we focus our discussion on the region-level and pixel-level models.

传统上,视觉识别被表述为一个分类问题,它将原始视觉数据(例如图像)映射到离散的文本标签。例如,图像分类从预定义的封闭集中为整个图像预测一个标签(Deng等人,2009),目标检测则在图像中识别出封闭集中定义的对象(Lin等人,2014)。然而,这样的封闭集模型很难迁移到封闭集(或词汇表)不够用的其他任务。例如,使用Microsoft COCO对象集训练的目标检测器很难用于检测Minecraft中的对象。

最近,CLIP(Radford等人,2021)通过引入对比语言图像预训练方法来训练开放集模型,从而解决了封闭集模型的局限性。如图4.2(a)所示,CLIP不是学习从输入到标签的映射,而是使用数亿个图像-文本对来学习对齐的视觉-语义空间

在数学上,传统的视觉任务优化将标签y=c分配给图像(图像通常表示为特征向量u∈R^P)的对数似然:

$$p(y=c \mid u) = \frac{\exp(w_c^{\top} u)}{\sum_{k=1}^{K} \exp(w_k^{\top} u)} \tag{4.1}$$

其中w∈R^{K×P}是投影矩阵。CLIP方法不使用预先确定的投影矩阵w,而是使用文本编码器Enc_text进行投影:

$$v_c = \mathrm{Enc}_{\text{text}}(\text{concept}_c), \qquad p(y=c \mid u) = \frac{\exp(v_c^{\top} u)}{\sum_{k=1}^{K} \exp(v_k^{\top} u)}$$

其中v在式(4.1)中扮演了w的角色。

文本编码器之所以能够帮助实现开放集识别,是因为所有文本概念都通过大规模预训练嵌入到相同的特征空间中,特征分布与语义含义一致,无需预定义的词汇表。因此,对齐的视觉-语义空间可以以零样本方式轻松迁移到各种图像识别任务中。有关详细讨论,请参阅第2章。接下来,我们将重点讨论区域级和像素级模型。
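下面给出一个最小示意代码(仅为说明思路的草图,并非CLIP官方实现):它展示了式(4.1)中固定投影矩阵w如何被文本编码器产生的类别嵌入v替代,从而实现开放词汇的零样本分类。其中text_encoder、image_feat等均为假设的接口/输入,温度系数100只是常见的示意取值。

```python
# 最小示意(非官方实现):用文本编码器生成的类别嵌入 v 替换式(4.1)中固定的投影矩阵 w,
# 实现 CLIP 风格的零样本(开放词汇)分类。image_feat / text_encoder 为假设的已对齐接口。
import torch
import torch.nn.functional as F

def zero_shot_classify(image_feat: torch.Tensor,   # [P],由视觉编码器得到的图像特征 u
                       class_names: list,
                       text_encoder) -> int:
    # 1) 将每个类别名套入提示模板,再用文本编码器得到类别嵌入 v_c(充当 w_c 的角色)
    prompts = [f"a photo of a {name}" for name in class_names]
    text_feats = torch.stack([text_encoder(p) for p in prompts])   # [K, P]

    # 2) 归一化后计算余弦相似度并做 softmax,对应 exp(v_c^T u) / Σ_k exp(v_k^T u)
    u = F.normalize(image_feat, dim=-1)
    v = F.normalize(text_feats, dim=-1)
    logits = 100.0 * v @ u                 # 温度系数 100 为示意性假设
    probs = logits.softmax(dim=-1)         # [K]

    # 3) 词表 class_names 可在推理时任意更换而无需重新训练——这正是开放集识别的关键
    return int(probs.argmax())
```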

CLIP模型引领了使用大量文本-图像对进行不同粒度的视觉理解模型的发展:图像级(如图像分类/图像-文本检索/图像字幕生成)、区域级(目标检测/短语定位)、像素级(图像分割/指代分割)

After the release of the CLIP model (Radford et al., 2021), a number of open-set vision models have been developed using large amounts of text-image pairs for visual understanding at different levels of granularity (Yang et al., 2022b; Zhang et al., 2023e; Li et al., 2022f; Ghiasi et al., 2022a), ranging from image-level tasks (e.g., image classification Deng et al. (2009), image-text retrieval, image captioning Chen et al. (2015)), region-level localization (e.g., object detection and phrase grounding Plummer et al. (2015)), to pixel-level grouping tasks (e.g., image segmentation and referring segmentation Long et al. (2015); Kirillov et al. (2019); Hafiz and Bhat (2020)). These models can be categorized along the following three dimensions: model initialization, design and training.

CLIP模型(Radford等人,2021)发布后,使用大量文本-图像对进行不同粒度的视觉理解的开放集视觉模型得到了发展(Yang等人,2022b;Zhang等人,2023e;Li等人,2022f;Ghiasi等人,2022a),包括图像级任务(例如图像分类Deng等人(2009)、图像-文本检索、图像字幕生成Chen等人(2015))、区域级定位(例如目标检测和短语定位Plummer等人(2015))、以及像素级分组任务(例如图像分割和指代分割Long等人(2015);Kirillov等人(2019);Hafiz和Bhat(2020))。

这些模型可以沿以下三个维度进行分类:模型初始化、设计和训练。

维度1:Model initialization模型初始化
CLIP初始化:主要采用预训练好的模型,基于预训练ResNet作为视觉编码器+基于预训练RPN提取区域特征(如OVR-CNN/RegionCLIP)→利用CLIP模型提取像素的密集标签(如MaskCLIP/FreeSeg)→用CLIP中的冻结卷积网络ConvNeXt编码各种分辨率的输入图像(如FC-CLIP)

There are different initialization methods for open-set model training.

>> CLIP initialized. Many recent open-set models are trained by using a pre-trained model such as CLIP for initialization since a pre-trained model already provides a well-aligned (but often coarse-grained) visual-semantic feature space. For example, OVR-CNN (Zareian et al., 2021) and RegionCLIP (Zhong et al., 2022b) use a CLIP-style pre-trained ResNet (He et al., 2016) as the vision encoder and a pre-trained RPN (Ren et al., 2015) to extract regional features. Likewise, MaskCLIP (Zhou et al., 2022a) and FreeSeg (Qin et al., 2023b) exploit the CLIP model to extract dense labels for pixels. FC-CLIP (Yu et al., 2023a) uses a frozen convolution network ConvNeXt (Liu et al., 2022b) in CLIP to encode input images of various resolutions.

对于开放集模型的训练,有不同的初始化方法。

>> CLIP初始化。最近的许多开放集模型是通过使用预训练模型(如CLIP)进行初始化而训练的,因为预训练模型已经提供了一个良好对齐(但通常是粗粒度的)的视觉-语义特征空间。例如,OVR-CNN(Zareian等人,2021)和RegionCLIP(Zhong等人,2022b)使用类CLIP预训练的ResNet(He等人,2016)作为视觉编码器,并使用预训练的RPN(Ren等人,2015)提取区域特征。同样,MaskCLIP(Zhou等人,2022a)和FreeSeg(Qin等人,2023b)利用CLIP模型提取像素的密集标签。FC-CLIP(Yu等人,2023a)则使用CLIP中的冻结卷积网络ConvNeXt(Liu等人,2022b)来编码各种分辨率的输入图像。

CLIP增强:使用预训练的CLIP模型来辅助模型训练,知识蒸馏CLIP特征(如ViLD)、利用CLIP模型提供特征和分数(例如MaskCLIP和Mask-Adapted CLIP)

例如通过知识蒸馏(knowledge-distillation)将模型与对齐的CLIP特征相结合(例如ViLD),或者在模型训练过程中依赖预训练的CLIP模型提供特征和分数(例如MaskCLIP和Mask-Adapted CLIP)。

>> CLIP augmented. Instead of initializing a model with CLIP parameters, other methods initialize the model parameters as usually (e.g., setting random values to model parameters), but use the pre-trained CLIP to help model training. For example, ViLD (Gu et al., 2022) augments the model with aligned CLIP features via knowledge-distillation. MaskCLIP (Ding et al., 2022b) and Mask-Adapted CLIP Liang et al. (2023a) rely on the pre-trained CLIP model to provide features and scores, respectively, during the course of model training.

>> CLIP增强。与使用CLIP参数初始化模型不同,其他方法按常规方式初始化模型参数(例如随机初始化),但使用预训练的CLIP来帮助模型训练。例如,ViLD(Gu等人,2022)通过知识蒸馏将对齐的CLIP特征用于增强模型。MaskCLIP(Ding等人,2022b)和Mask-Adapted CLIP(Liang等人,2023a)在模型训练过程中依赖预训练的CLIP模型,分别提供特征和分数。

其他方法:使用监督预训练模型或从零开始训练(如GLIP/OpenSeeD//)、采用联合训练的方法(如GroupViT)、利用预训练的稳定扩散模型提取紧凑的掩码(如ODISE)

总结了不同方法用于学习视觉语义特征空间的方式,包括使用监督预训练模型或从零开始训练,其中一些使用预训练的文本和图像编码器,而另一些则采用联合训练的方法,同时涉及语义分割和图像文本对齐任务,并提到了一种利用预训练稳定扩散模型来提取紧凑掩模的方法。

>> Other works learn a visual-semantic feature space using supervised pre-trained models or from scratch. For example, GLIP (Li et al., 2022f) and OpenSeeD (Zhang et al., 2023e) use a pre-trained BERT model (Devlin et al., 2019) and the CLIP text encoder, respectively, and use a vision backbone pre-trained on ImageNet for image encoding. Though these separately pre-trained image and text encoders do not explicitly learn the alignment between image and language, it turns out that these models still give good representations for images and texts, and are instrumental to efficient model training. Differently, GroupViT (Xu et al., 2022a) is trained jointly using an open-set semantic segmentation task and a global image-text alignment task from scratch. ODISE (Xu et al., 2023a) exploits pre-trained Stable Diffusion models (Rombach et al., 2022) to extract compact masks.

>> 其他方法使用监督预训练模型或从头开始学习视觉-语义特征空间。例如,GLIP(Li等人,2022f)和OpenSeeD(Zhang等人,2023e)分别使用了预训练的BERT模型(Devlin等人,2019)和CLIP文本编码器,以及在ImageNet上预训练的视觉骨干进行图像编码。尽管这些分别预训练的图像和文本编码器并未明确学习图像和语言之间的对齐,但事实证明,这些模型仍然为图像和文本提供了良好的表示,并有助于高效的模型训练。

与此不同,GroupViT(Xu等人,2022a)则是从零开始,联合开放集语义分割任务和全局图像-文本对齐任务进行训练的。ODISE(Xu等人,2023a)利用预训练的稳定扩散模型(Rombach等人,2022)提取紧凑的掩码。

维度2:Model design模型设计
两阶段模型:将目标定位和识别分离开+无需额外训练

Two-stage models通常分离了定位和识别,采用预训练的网络进行目标定位和提取掩模,然后使用预训练的CLIP模型度量视觉内容和语言概念之间的相似性,其优势在于能够继承开放式语义理解能力,无需额外训练,从而将建模训练集中在优秀的定位网络上。

Open-set models can be either multi-stage or end-to-end.

>> Two-stage models. These models usually follow the design of the pre-DETR based models (Ren et al., 2015; He et al., 2017), which decouples localization and recognition. For object detection, a region proposal network is typically pre-trained for localizing the object of interest (Zhong et al., 2022b; Gu et al., 2021), and a mask proposal network for extracting masks (Ghiasi et al., 2022a; Yao et al., 2022a). Given the localization results, a pre-trained CLIP model is used to measure the similarity between visual contents and language concepts. A clear advantage for two-stage models is that they can inherit the open-set semantic understanding capacity without additional training so as to devote modeling training to requiring a well-performed localization network.

开放集模型可以是多阶段的也可以是端到端的。

>> 两阶段模型。这些模型通常遵循基于pre-DETR的模型(Ren等人,2015;He等人,2017)的设计,将定位和识别解耦。对于目标检测,通常会预训练一个区域建议网络来定位感兴趣的目标(Zhong等人,2022b;Gu等人,2021),并用一个掩码提议网络来提取掩码(Ghiasi等人,2022a;Yao等人,2022a)。在获得定位结果后,使用预训练的CLIP模型来度量视觉内容与语言概念之间的相似性。两阶段模型的明显优势在于,它们可以在无需额外训练的情况下继承开放集语义理解能力,从而将模型训练专注于获得性能良好的定位网络。
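下面是两阶段开放集检测思路的一个最小示意(假设性草图,并非上述任何论文的官方实现):第一阶段由预训练的区域建议网络给出类别无关的候选框,第二阶段用CLIP风格的图文相似度为每个区域打上任意词表中的标签。proposal_net、clip_image_enc、clip_text_enc均为假设的接口。

```python
# 两阶段开放集检测的最小示意(假设性伪实现):
# 先用预训练的区域建议网络得到候选框,再用 CLIP 风格的图文相似度给每个区域打开放词汇标签。
import torch
import torch.nn.functional as F

def open_set_detect(image, vocabulary, proposal_net, clip_image_enc, clip_text_enc, topk=100):
    boxes = proposal_net(image)[:topk]                           # 第一阶段:类别无关的候选框
    text_emb = F.normalize(clip_text_enc(vocabulary), dim=-1)    # [K, D],任意词表的文本嵌入

    results = []
    for box in boxes:
        crop = image.crop(box)                                   # 裁剪区域(假设 image 提供 crop 接口)
        region_emb = F.normalize(clip_image_enc(crop), dim=-1)   # [D]
        scores = (text_emb @ region_emb).softmax(dim=-1)         # 第二阶段:区域-文本相似度
        c = int(scores.argmax())
        results.append((box, vocabulary[c], float(scores[c])))
    return results
```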

端到端模型:将目标检测视为textual grounding(文本定位)+整体训练

End-to-end models与两阶段模型不同,采用DETR等单阶段模型,直接在图像文本对上进行端到端训练,将目标检测形式化为文本定位,可进一步增强视觉语言交互或采用DETR样式的模型设计,适用于目标检测和分割任务。

>> End-to-end models. Different from two-stage models, the end-to-end models follow the DETR-based methods (Carion et al., 2020; Cheng et al., 2022) or other one-stage models (Dai et al., 2021). GLIP (Li et al., 2022f) is one of the representative works. GLIP formulates object detection as textual grounding and is trained end-to-end on image-text pairs with detection and grounding labels. Follow-up works enhance GLIP by enabling deeper vision-language interactions (Liu et al., 2023h) or using DETR-like model design (Zang et al., 2022; Minderer et al., 2022). For segmentation, both ZegFormer (Ding et al., 2022a) and OpenSeeD (Zhang et al., 2023e) exploit a DETR-like architecture and predict the masks and categories based on the outputs of their decoders.

>> 端到端模型。与两阶段模型不同,端到端模型遵循基于DETR的方法(Carion等人,2020;Cheng等人,2022)或其他单阶段模型(Dai等人,2021)。GLIP(Li等人,2022f)是代表性的工作之一。GLIP将目标检测形式化为文本定位,并在带有检测和定位标签的图像-文本对上进行端到端训练。后续工作通过使视觉-语言交互更加深入(Liu等人,2023h)或使用类似DETR的模型设计(Zang等人,2022;Minderer等人,2022)来增强GLIP。对于分割,ZegFormer(Ding等人,2022a)和OpenSeeD(Zhang等人,2023e)都利用了类似DETR的架构,并根据它们的解码器的输出来预测掩码和类别。

维度3:Model pre-training模型预训练:三种学习方法
监督学习:利用现有标注转换为语言监督训练=将标签监督转化为语言监督+使用现有的标注数据来训练开放式模型

There are mainly three learning methods for pre-training open-set vision models.

>> Supervised learning. By converting label supervision to language supervision, many works directly leverage the existing supervised annotations for training open-set models. For example, OVR-CNN (Zareian et al., 2021) trains a model with COCO categories and then evaluates its performance on novel categories. Likewise, ViLD (Gu et al., 2021) trains and evaluates two separate models on COCO and LVIS datasets, respectively. Following a similar protocol, many works train the open-set segmentation models on a subset of annotated segmentation data and evaluate the generalization ability on held-out data (Ding et al., 2022a,b; Zhang et al., 2023e; Xu et al., 2023a).

对于预训练开放集视觉模型,主要有三种学习方法

>> 监督学习。通过将标签监督转换为语言监督,许多工作直接利用现有的监督注释来训练开放集模型。例如,OVR-CNN(Zareian等人,2021)训练了一个带有COCO类别的模型,然后评估其在新颖类别上的性能。同样,ViLD(Gu等人,2021)在COCO和LVIS数据集上分别训练和评估了两个单独的模型。遵循类似的协议,许多工作在带注释的分割数据子集上训练开集分割模型,并评估在保留数据上的泛化能力(Ding等人,2022a,b;Zhang等人,2023e;Xu等人,2023a)。

半监督学习:同时利用标注数据和未标注或弱标记数据+通过丰富的语义信息提高模型的泛化能力

>> Semi-supervised learning. One might use both annotated data and unlabeled or weakly-labeled data. For example, both RegionCLIP (Zhong et al., 2022b) and GLIP (Li et al., 2022f) use a teacher model to extract fine-grained region-text alignments from image-text pairs to augment the training data for better open-set detection performance. Differently, OpenSeg (Ghiasi et al., 2022b) exploits Localized Narrative datasets (Pont-Tuset et al., 2020) as weakly-labeled data, which provides coarse correspondence between language phrases and strokes in images. Empirically, such semi-supervised learning methods often help improve models’ generalization ability because they can effectively leverage rich semantics from noisy data.

>> 半监督学习。可以同时使用带注释的数据和未标记或弱标记的数据。例如,RegionCLIP(Zhong等人,2022b)和GLIP(Li等人,2022f)都使用教师模型从图像-文本对中提取细粒度的区域-文本对齐来增强训练数据,以获得更好的开放集检测性能。与之不同,OpenSeg(Ghiasi等人,2022b)利用本地化叙事数据集(Pont-Tuset等人,2020)作为弱标记数据,提供图像中语言短语与笔画之间的粗略对应关系。从经验上看,这些半监督学习方法通常有助于提高模型的泛化能力,因为它们可以有效地利用来自嘈杂数据的丰富语义信息。

弱监督学习:仅使用弱标注数据训练模型

>> Weakly-supervised learning. Some works solely use weakly-labeled data for modeling. For example, GroupViT (Xu et al., 2022a) uses a contrastive learning method where all supervisions for model training are from positive and negative image-text pairs. Following the same contrastive learning method, SegCLIP (Luo et al., 2023b) uses a gathering mechanism to learn to merge image patches through the training on image-text pairs.

>> 弱监督学习。一些工作仅使用弱标记数据进行建模。例如,GroupViT(Xu等人,2022a)使用对比学习方法,其中所有模型训练的监督都来自正负图像-文本对。遵循相同的对比学习方法,SegCLIP(Luo等人,2023b)使用聚合机制,通过在图像-文本对上的训练来学习合并图像块。

Below, we review recent models developed for region-level and pixel-level tasks.

在下面,我们将回顾为区域级和像素级任务开发的最新模型。

4.2.1、Object Detection and Grounding目标检测和定位

目标检测(识别和定位感兴趣的对象):基于区域的方法(R-CNN/Fast R-CNN/Faster R-CNN)→提高实时性(YOLO系列)→基于Transformer架构(如DETR/DINO/Group DETR/Co-DETR)

Object detection is a fundamental task in computer vision that involves identifying and localizing objects of interest within an image or a video sequence (Viola and Jones, 2001). Over the years, various techniques and algorithms have been developed to improve the accuracy and efficiency of object detection. In the past, region-based approaches such as R-CNN Girshick et al. (2015), Fast R-CNN (Girshick, 2015) and Faster R-CNN (Ren et al., 2015) have been fostering the development of advanced techniques for object detection. To improve real-time performance, YOLO (Redmon et al., 2016) proposes a single neural network that simultaneously predicts object classes and bounding box coordinates. Some improvements are made by either using multiple feature maps at different scales (Liu et al., 2016) or introducing a focal loss to address the class imbalance problem in dense object detection scenarios (Lin et al., 2017). After the emergence of Transformer (Vaswani et al., 2017), DETR (Carion et al., 2020) applies the transformer architecture to object detection, treating it as a set prediction problem. Since DETR, a number of methods have been proposed to improve transformer-based detection models from various aspects, such as DINO (Zhang et al., 2022a), Group DETR (Chen et al., 2022b), and Co-DETR (Zong et al., 2023).

目标检测是计算机视觉中的一个基本任务,涉及在图像或视频序列中识别和定位感兴趣的对象(Viola和Jones,2001)。多年来,人们开发了各种技术和算法,以提高目标检测的准确性和效率。

过去,基于区域的方法,如R-CNN(Girshick等人,2015)、Fast R-CNN(Girshick,2015)和Faster R-CNN(Ren等人,2015)一直在促进目标检测的高级技术的发展。

为了提高实时性能,YOLO(Redmon等人,2016)提出了一个同时预测对象类别和边界框坐标的单一神经网络。一些改进是通过在不同尺度上使用多个特征图(Liu等人,2016)或引入焦点损失来解决密集目标检测场景中的类别不平衡问题(Lin等人,2017)而实现的。

Transformer(Vaswani等人,2017)出现之后,DETR(Carion等人,2020)将Transformer架构应用于目标检测,将其视为一个集合预测问题。自从DETR以来,已经提出了许多方法,以从各个方面改进基于Transformer的检测模型,例如DINO(Zhang等人,2022a)、Group DETR(Chen等人,2022b)和Co-DETR(Zong等人,2023)。

基于开放集目标检测模型的三种主要评估设置:零样本目标检测(评估模型的迁移能力)、严格的开放词汇目标检测(训练词汇不限,但不得覆盖任何测试类别)、通用的开放词汇目标检测(允许训练/测试词汇有交集+更接近实际应用场景)

Open-set object detection models aim to detect arbitrary concepts beyond the vocabulary provided in training data. Three main evaluation settings have been developed in the literature:

>> Zero-shot object detection. Similar to zero-shot image classification (Xian et al., 2018), zero-shot object detection restricts the object classes used for training, and evaluates models’ transferrability to novel classes. Methods falling in this category mainly focus on evaluating how a model leverages pre-trained concept embeddings (e.g., word2vec (Mikolov et al., 2013)) and learns good visual-semantic alignments (Bansal et al., 2018; Rahman et al., 2020; Zhu et al., 2019, 2020).

>> Strict open-vocabulary object detection. First introduced in OV-RCNN (Zareian et al., 2021), this setting differs from zero-shot object detection in that there is no limit on the training vocabulary as long as it does not cover any target classes. Under this protocol, some representative works are ViLD (Gu et al., 2021), RegionCLIP (Zhong et al., 2022a) which leverage large-scale language-image models (Radford et al., 2021; Jia et al., 2021), and Detic (Zhou et al., 2022b) that learns from image-label data.

>> Generalized open-vocabulary object detection. Some recent works like GLIP (Li et al., 2022f), and OWL-VIT (Minderer et al., 2022) advocate a more flexible setting to evaluate the dataset or task transferrability for object detection models. This setting allows vocabulary overlap between training and test sets, e.g., Objects365 for training while COCO for evaluation. This is arguably a more practical setting than the two settings described above in that models can be trained using any arbitrary set of training data and their detection performance evaluated in the wild (Li et al., 2022b).

开放集目标检测模型旨在检测训练数据中未提供的任意概念。文献中已经发展了三种主要的评估设置:

>> 零样本目标检测。类似于零样本图像分类(Xian等人,2018),零样本目标检测限制了用于训练的对象类别,并评估模型对新类别的可转移性。属于这一类别的方法主要关注模型如何利用预训练的概念嵌入(例如,word2vec(Mikolov等人,2013))并学习良好的视觉-语义对齐(Bansal等人,2018;Rahman等人,2020;Zhu等人,2019,2020)。

>> 严格的开放词汇目标检测。首次在OV-RCNN(Zareian等人,2021)中引入,该设置与零样本目标检测的不同之处在于,只要不涵盖任何目标类别,训练词汇就没有限制。在此协议下,一些代表性的工作包括ViLD(Gu等人,2021)、RegionCLIP(Zhong等人,2022a),它们利用了大规模的语言-图像模型(Radford等人,2021;Jia等人,2021),以及从图像标签数据中学习的Detic(Zhou等人,2022b)。

>> 通用的开放词汇目标检测。一些最近的工作,如GLIP(Li等人,2022f)和OWL-VIT(Minderer等人,2022),提倡一种更灵活的设置来评估目标检测模型的数据集或任务可转移性。该设置允许训练集和测试集之间存在词汇重叠,例如在训练时使用Objects365,而在评估时使用COCO。这可以说是比上述两种设置更实用的设置,因为模型可以使用任意一组训练数据进行训练,并在实际场景中评估其检测性能(Li等人,2022b)。

目标定位(定位与输入的名词短语相关联的对象):一种广义的开放式目标检测任务,基于Transformer架构的方法(M-DETR/GLIP/DetCLIPv2/Grounding-DINO/Grounding-SAM)=联合学习目标检测和对象定位数据以处理开放式场景+使用语言条件来指导对象检测和定位

总结了目标定位可以被视为一种广义的开放式目标检测任务,其中模型以句子和图像作为输入,定位与名词短语相关联的对象,介绍了一系列采用Transformer架构的方法,包括M-DETR、GLIP、DetCLIPv2、Grounding-DINO和Grounding-SAM,它们通过联合学习目标检测和对象定位数据以处理开放式场景,并使用语言条件来指导对象检测和定位。

Object grounding can be considered as a generalized open-set object detection task (Plummer et al., 2015; Kazemzadeh et al., 2014; Chen et al., 2019; Deng et al., 2018). In this task, models take a sentence and an image as input and localize objects that are associated with the noun phrases. Recently, M-DETR (Kamath et al., 2021) employs a transformer-based architecture to build an end-to-end modulated detector to detect objects in an image given a raw text query. Unlike previous works where models are trained on specific datasets, the network is pre-trained with 1.3M pairs of text and images, sourced from multi-modal datasets where the connections between text phrases and corresponding image objects are labeled. Inspired by M-DETR, GLIP (Li et al., 2022f) casts object detection as a grounding problem, and jointly learns a model using object detection and grounding data for open-set scenarios. Following this line of research, DetCLIPv2 (Yao et al., 2023) proposes a simple joint learning method where multiple tasks are converted into a word-region alignment task, and then a model is trained end-to-end on a corpus consisting of object detection data, grounding data and image-text pairs. Grounding-DINO (Liu et al., 2023h) is a state-of-the-art grounded object detection method, where the object detector is composed of components: a backbone, a neck, and a head, and inject language conditions at every stage. A combined text and image backbone is employed to extract features at multiple scales, which are then passed on to the neck. The text and image characteristics generated by the neck are subsequently used for language-driven query selection. Grounding-SAM is developed by combining Grounding-DINO with SAM (Kirillov et al., 2023). As shown in Figure 4.4, an image and a group of concepts are first fed into Grounding-DINO to produce the boxes, and then the boxes are used as prompts for SAM to predict masks for each box.

目标定位可以被视为广义的开放词汇目标检测任务(Plummer等人,2015;Kazemzadeh等人,2014;Chen等人,2019;Deng等人,2018)。在这个任务中,模型接受一个句子和一张图像作为输入,并定位与名词短语相关联的对象

最近,M-DETR(Kamath等人,2021)采用了基于Transformer的架构,构建了一个端到端的调制探测器,以在给定原始文本查询的情况下检测图像中的物体。与之前在特定数据集上训练模型的工作不同,该网络是用130万对文本和图像进行预训练的,这些文本和图像来自多模态数据集,其中文本短语和相应图像对象之间的关系被标记

受到M-DETR的启发,GLIP(Li等人,2022f)将目标检测视为一个定位问题,并联合学习用于开放集场景的目标检测和定位数据的模型。

在这一研究领域的基础上,DetCLIPv2(Yao等人,2023)提出了一种简单的联合学习方法,其中多个任务被转化为一个单词-区域对齐任务,然后模型在由目标检测数据、定位数据和图像-文本对组成的语料库上进行端到端训练

Grounding-DINO(Liu等人,2023h)是一种最先进的定位目标检测方法,其中目标检测器由多个组件组成:主干、颈部和头部,并在每个阶段注入语言条件。结合文本和图像主干提取多个尺度的特征,然后传递到颈部。颈部生成的文本和图像特征随后用于语言驱动的查询选择。

Grounding-SAM是通过将Grounding-DINO与SAM(Kirillov等人,2023)结合使用而开发的。如图4.4所示,首先将图像和一组概念输入Grounding-DINO,以生成框,然后将框用作SAM的提示,以预测每个框的掩码
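下面用一个最小示意(假设性草图,非官方代码)概括图4.4所示的Grounding-SAM流程:先由开放集检测器根据文本概念产生候选框,再把框作为提示交给SAM逐框生成掩码。detect_with_text与segment_with_box是假设的封装函数,box_thresh为示意性阈值。

```python
# Grounding-SAM 流程的最小示意(非官方代码):文本概念 → Grounding-DINO 产生候选框 →
# 候选框作为提示输入 SAM 得到掩码。detect_with_text / segment_with_box 均为假设的封装函数。
def grounded_segmentation(image, concepts, detect_with_text, segment_with_box, box_thresh=0.35):
    # 1) 将一组概念拼成文本提示,交给开放集检测器(如 Grounding-DINO)得到框、短语和分数
    prompt = " . ".join(concepts)
    detections = detect_with_text(image, prompt)          # [(box, phrase, score), ...]

    # 2) 过滤低置信度框,再把每个框作为 SAM 的 box prompt,逐框预测掩码
    outputs = []
    for box, phrase, score in detections:
        if score < box_thresh:
            continue
        mask = segment_with_box(image, box)               # SAM 支持以框作为提示
        outputs.append({"phrase": phrase, "box": box, "mask": mask, "score": score})
    return outputs
```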

4.2.2、Image Segmentation and Referring图像分割和指代分割

图像分割:三个子任务(语义分割【每个像素的语义】+实例分割【相同语义含义的像素分组成对象】+全景分割【前二者】)
基于CNN的架构→基于Transformer的架构(如Mask2Former)
两阶段模型/单阶段模型→基于查询的方法

Image segmentation is a long-standing and challenging vision problem. There are mainly three sub-tasks, including semantic (Long et al., 2015), instance (Hafiz and Bhat, 2020), and panoptic (Kirillov et al., 2019) segmentation. Semantic segmentation cares about the per-pixel semantic within an image (Long et al., 2015; Chen et al., 2017, 2022j), whereas instance segmentation groups pixels of the same semantic meaning into objects. Models for both tasks have evolved from CNN-based architectures (Long et al., 2015) to transformer-based ones (Chen et al., 2022j), and from two-stage models (He et al., 2017) and one-stage models (Bolya et al., 2019; Tian et al., 2020b) to the recent query-based approaches (Dong et al., 2021; Zou et al., 2022). With the capability of per-pixel and instance-level understanding, a natural step was taken to formulate panoptic segmentation (Kirillov et al., 2019; Wang et al., 2021a; Cheng et al., 2022). Most recently, Mask2Former (Cheng et al., 2022) proposed to address all three tasks with a unified encoder-decoder architecture. Nevertheless, all these works cope with a limited number of categories. In the following, we will review the most recent works on open-set image segmentation and referring segmentation.

图像分割是一个长期存在且具有挑战性的视觉问题。主要有三个子任务,包括语义分割(Long等人,2015)、实例分割(Hafiz和Bhat,2020)和全景分割(Kirillov等人,2019)。语义分割关注图像中每个像素的语义,而实例分割将具有相同语义含义的像素分组成对象。

这两个任务的模型已经从基于CNN的架构(Long等人,2015)发展到基于Transformer的架构(Chen等人,2022j),从两阶段模型(He等人,2017)和单阶段模型(Bolya等人,2019;Tian等人,2020b)发展到最近的基于查询的方法(Dong等人,2021;Zou等人,2022)。在具备逐像素和实例级别的理解能力之后,全景分割(Kirillov等人,2019;Wang等人,2021a;Cheng等人,2022)的提出便是自然而然的一步。

最近,Mask2Former(Cheng等人,2022)提出了使用统一的编码器-解码器架构来处理这三个任务。然而,所有这些工作都只涉及有限数量的类别。接下来,我们将回顾最新的开放词汇图像分割和指代分割工作。

Open-Vocabulary Segmentation开放词汇分割:将基础模型的丰富视觉-语义知识转移到特定的分割任务中,如(LSeg/OpenSeg、GroupViT、DenseCLIP、MaskCLIP、FC-CLIP、ODISE)

Open-Vocabulary Segmentation. Recently, a number of methods have been proposed to transfer or distill the rich visual-semantic knowledge from foundation models (Radford et al., 2021; Jia et al., 2021) to specific segmentation tasks. Prominent examples include LSeg (Li et al., 2022a), OpenSeg (Ghiasi et al., 2022a), and Huynh et al. (2022). Instead of using existing models, GroupViT Xu et al. (2022a) performs language-image pre-training from scratch with a bottom-up grouping ViT (Dosovitskiy et al., 2021), while DenseCLIP (Rao et al., 2022) demonstrates the superiority of foundation models in finetuning settings compared with supervised models. Recently, MaskCLIP (Ding et al., 2022b) is proposed to tackle open-vocabulary panoptic and semantic segmentation simultaneously by leveraging CLIP, and achieves impressive performance on ADE20K (Zhou et al., 2017) and PASCAL (Mottaghi et al., 2014; Everingham and Winn, 2011). Instead of using the ViT backbone, a recent work called FC-CLIP (Yu et al., 2023a) exploits a convolutional CLIP backbone (i.e., ConvNeXt trained by OpenCLIP (Ilharco et al., 2021)) as both a feature extractor and a vision encoder. Based on a simplified pipeline, FC-CLIP shows plausible efficiency and lifts the state of the art on various open-vocabulary segmentation benchmarks. Rather than only using CLIP, a recent work ODISE (Xu et al., 2023a) leverages text-to-image diffusion models, and shows that the latent features in the pre-trained UNet can provide useful compact segmentation information for open-vocabulary segmentation.

开放词汇分割。最近,已经提出了许多方法,以将基础模型(Radford等人,2021;Jia等人,2021)的丰富视觉-语义知识转移到特定的分割任务中。杰出的例子包括LSeg(Li等人,2022a)、OpenSeg(Ghiasi等人,2022a)和Huynh 等人(2022)。GroupViT(Xu等人,2022a)不是使用现有的模型,而是从头开始进行语言-图像预训练,采用自下而上的分组ViT(Dosovitskiy等人,2021),而DenseCLIP(Rao等人,2022)则在微调设置中展示了基础模型相对于监督模型的优越性

最近,MaskCLIP(Ding等人,2022b)被提出,以利用CLIP来同时处理开放词汇全景分割和语义分割,并在ADE20K(Zhou等人,2017)和PASCAL(Mottaghi等人,2014;Everingham和Winn,2011)上取得了令人印象深刻的性能。

与使用ViT主干不同,最近的一项名为FC-CLIP(Yu等人,2023a)的工作利用了卷积CLIP主干(即由OpenCLIP训练的ConvNeXt(Ilharco等人,2021))同时作为特征提取器和视觉编码器。基于简化的流程,FC-CLIP展示出了可观的效率,并在各种开放词汇分割基准上刷新了最新水平。

与仅使用CLIP不同,最近的工作ODISE(Xu等人,2023a)利用了文本到图像扩散模型,并显示出预训练UNet中的潜在特征可以为开放词汇分割提供有用的紧凑分割信息。

重大挑战:缺乏带有语义标签的分割数据

A big challenge in open-vocabulary segmentation is the lack of segmentation data annotated with semantic labels. Thus far, most of the works are still using COCO segmentation annotations. A few recent works attempt to leverage object detection data as the extra supervision to augment the training of segmentation models, such as OpenSeeD (Zhang et al., 2023e) (shown in Figure 4.5) and DataSeg (Gu et al., 2023). In addition to these new modeling techniques, new datasets have been developed to mitigate this problem, including curating multi-domain segmentation datasets (Lambert et al., 2020), collecting high-quality annotations (Lu et al., 2023c) or scaling up to billions of masks (Kirillov et al., 2023).

在开放词汇分割中的一个重大挑战是缺乏带有语义标签的分割数据。到目前为止,大多数工作仍然使用COCO分割注释。最近的一些工作尝试利用目标检测数据作为额外的监督来增强分割模型的训练,例如OpenSeeD(Zhang等人,2023e)和DataSeg(Gu等人,2023)。除了这些新的建模技术外,还开发了新的数据集来缓解这个问题,包括策划多领域分割数据集(Lambert等人,2020)、收集高质量的注释(Lu等人,2023c)或扩展到数十亿个掩码(Kirillov等人,2023)。

Referring Segmentation指代分割(一种开放式词汇的任务):使用多模态融合策略设计的模型来处理目标数据集—CLIPSeg(扩展文本查询)、LAVT(增强跨模态交互)→PolyFormer(将掩模转换为多边形)

引用分割本身就是开放词汇的任务,相关工作主要通过多模态融合提升效果,如利用视觉查询网络进行查询分割,或者在视觉变压器结构中增强交叉模态交互等方法。近年来,一些工作也将掩码表示方式从像素转化为多边形,或利用端到端语言驱动模型联合进行对象检测和分割,取得了目前最优的 referring segmentation 性能。

指代分割是一种开放式词汇的任务,涉及使用多模态融合策略设计的模型来处理目标数据集,并介绍了一系列方法如CLIPSeg、LAVT、PolyFormer等,它们采用不同的方法来处理指代分割,包括扩展文本查询、增强跨模态交互以及将掩模转换为多边形等。

Referring Segmentation by design is open-vocabulary. Models are usually designed specifically to learn from target datasets using various multimodal fusion strategies (Hu et al., 2016; Liu et al., 2017; Margffoy-Tuay et al., 2018; Ye et al., 2019a; Yu et al., 2016; Wu et al., 2022a). CLIPSeg (Lüddecke and Ecker, 2022) extends a textual query to a visual query and shows superior performance not only on referring segmentation but also on semantic segmentation.

Since the emergence of vision transformers, works like LAVT (Yang et al., 2022e) enhance the cross-modal interactions from the very beginning, which leads to a decent performance on RefCOCO (Yu et al., 2016), RefCOCO+ (Yu et al., 2016) and G-Ref (Mao et al., 2016; Nagaraja et al., 2016). Differently, PolyFormer (Liu et al., 2023e) converts masks into polygons and asks the transformer decoder to decode a sequence of polygon coordinates. Inspired by Pix2Seq (Chen et al., 2022c), a similar method in object detection, PolyFormer presents an alternative way to represent masks for state-of-the-art referring segmentation. As we discussed earlier, one can also compose Grounding DINO (Liu et al., 2023h) with SAM (Kirillov et al., 2023) for referring segmentation.

设计上,指代分割是开放词汇的。通常,模型专门设计用于从目标数据集中使用各种多模态融合策略进行学习(Hu et al., 2016;Liu et al., 2017;Margffoy-Tuay等人,2018;Ye et al., 2019a;Yu et al., 2016;Wu et al., 2022a)。CLIPSeg(Lüddecke和Ecker,2022)将文本查询扩展为视觉查询,并在指代分割以及语义分割方面表现出卓越性能。

自从视觉transformers出现以来,像LAVT(Yang等,2022e)这样的工作从一开始就增强了跨模态交互,从而在RefCOCO(Yu等,2016)、RefCOCO+(Yu等,2016)和G-Ref(Mao等,2016;Nagaraja等,2016)上取得了不错的性能。与此不同,PolyFormer(Liu等,2023e)将掩码转换为多边形,并要求transformer解码器解码一系列多边形坐标。受到目标检测中类似方法Pix2Seq(Chen等,2022c)的启发,PolyFormer提供了一种替代的掩码表示方式,在指代分割上达到了最先进的水平。正如我们之前讨论的那样,还可以将Grounding DINO(Liu等,2023h)与SAM(Kirillov等,2023)组合用于指代分割。
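下面给出"掩码→多边形坐标序列"这一表示方式的最小示意(仅为说明思路,并非PolyFormer官方实现):用OpenCV提取掩码轮廓并做多边形近似,得到可供序列解码器逐点预测的坐标序列。eps_ratio为示意性的简化阈值参数。

```python
# 将二值掩码转换为多边形坐标序列的最小示意(假设性实现,非 PolyFormer 官方代码):
# 这种"掩码→多边形→坐标序列"的表示方式,可供序列解码器逐点生成/预测。
import cv2
import numpy as np

def mask_to_polygon(mask: np.ndarray, eps_ratio: float = 0.01):
    """mask: HxW 的 0/1 二值掩码;返回扁平化的 [x1, y1, x2, y2, ...] 顶点序列。"""
    m = mask.astype(np.uint8) * 255
    contours, _ = cv2.findContours(m, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return []
    contour = max(contours, key=cv2.contourArea)       # 取面积最大的轮廓
    eps = eps_ratio * cv2.arcLength(contour, True)      # 按周长比例设定简化阈值
    polygon = cv2.approxPolyDP(contour, eps, True)      # 多边形近似,减少顶点数
    return polygon.reshape(-1, 2).flatten().tolist()
```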

Unified Segmentation统一分割:将所有分割任务统一到单一框架中,如X-Decoder(重定义任务+使用通用的编码器-解码器结构)、UNINEXT(早期融合策略来统一不同的分割任务)

如何将各种分割任务统一在一个框架中,提出了两个方法:X-Decoder和UNINEXT,它们分别使用通用的编码器-解码器结构和早期融合策略来统一不同的分割任务,其中X-Decoder将指代分割任务重新定义为条件全景分割,而UNINEXT尝试统一图像和视频中的所有实例级别分割任务。

Given the above methods for open-vocabulary and referring segmentation, an open question is how to unify all segmentation tasks in a single framework. Recently, X-Decoder (Zou et al., 2023a) uses a generalized encoder-decoder architecture to unify all these segmentation tasks. The referring segmentation task is reformulated as a conditioned panoptic segmentation that takes some textual phrases as input to the decoder. UNINEXT (Yan et al., 2023) is another work that attempts to unify all instance-level segmentation in images and videos. Different from X-Decoder, UNINEXT uses early fusion to fuse the various prompts and vision features, which are then fed to the transformer encoder-decoder.

鉴于上述用于开放词汇分割和指代分割的方法,一个悬而未决的问题是如何将所有分割任务统一到单一框架中。

最近,X-Decoder(Zou等,2023a)使用了一个广义的编码器-解码器架构来统一所有这些分割任务。指代分割任务被重新定义为一个条件全景分割,它将一些文本短语作为解码器的输入。

UNINEXT(Yan等,2023)是另一个试图统一图像和视频中所有实例级别分割的工作。与X-Decoder不同,UNINEXT使用早期融合来融合各种提示和视觉特征,然后将其输入到transformer编码器-解码器中。

Figure 4.6: (a) CV task landscape

Figure 4.6: (a) CV task landscape: CV tasks can span different axes, including modality, space and time, which renders significant challenges to unify all of them in a single model. Image credit: Yuan et al. (2021). (b) The data scale pyramid: In particular, datasets in different tasks usually contain different types of supervision. Image-level datasets like ImageNet (Deng et al., 2009) and LAION Schuhmann et al. (2021) have annotations that have rich semantics coverage but are coarse-grained, while pixel-level datasets like COCO panoptic segmentation (Chen et al., 2015) provides fine-grained annotations but with limited concepts.

图4.6:(a)CV任务景观:CV任务可以涵盖不同的轴,包括模态、空间和时间,这使得在单一模型中统一所有这些任务面临重大挑战。图片来源:Yuan等人(2021)。

(b)数据规模金字塔:特别是,不同任务中的数据集通常包含不同类型的监督。像ImageNet(Deng等人,2009年)和LAION Schuhmann等人(2021年)这样的图像级数据集具有丰富的语义覆盖,但粒度较粗,而像COCO全景分割(Chen等人,2015年)这样的像素级数据集提供了精细的标注,但概念有限

4.3、From Task-Specific Models to Generic Models从特定任务模型到通用模型

背景(之前模型主要针对单个任务设计+未能利用不同粒度或领域任务之间的协同关系)、两大原因(视觉任务分类多样【空间+时间+模态】+数据量规模不同→使得建立统一模型面临重重困难)

讨论了将封闭集模型转变为开放集模型以进行检测和分割的最新努力,同时指出了导致视觉任务难以统一的两个主要原因:

1) 视觉任务的碎片化,涵盖不同领域、粒度和模态的任务,难以开发一个统一的模型;

2) 数据规模的不同,不同任务的人工标注数据规模差异巨大,导致统一模型的构建具有挑战性。

Above we have discussed the recent efforts of transforming closed-set models to open-set ones for detection and segmentation. Until recently, however, most vision tasks have been separately tackled with specialized model designs, preventing the synergy of tasks across different granularities or domains from being exploited. This is arguably due to two reasons:

>> Vision tasks are fragmented. As shown in Figure 4.6 (a), computer vision tasks span across different axes including space, time, and modality. From the space aspect, it can be image-level, region-level and pixel-level tasks as we discussed before. Along the time axis, we need to tackle not only static images but also temporal video sequences. Regarding the modality, the inputs and outputs can be images, texts, or other types (e.g., human pose, depth map). Such diverse task formats significantly impede the development of a unified model for all tasks.

>> Data scales are different. In addition to the complicated task landscape, the scarcity of human annotations and their different scales for different tasks also make building a unified model challenging. In Figure 4.6 (b), we can see a clear pyramid of data scale, where different layers of human annotations have different semantics. More specifically, image-text datasets like LAION Schuhmann et al. (2021) contain up to 2B samples, while object detection datasets like Objects365 (Shao et al., 2019) have 1.7M images in total. A more significant gap is observed in segmentation datasets due to the high cost of annotating masks.

在前面,我们已经讨论了将封闭集模型转变为开放集模型以进行检测和分割的最新工作。然而,直到最近,大多数视觉任务仍是通过专门的模型设计单独处理的,使得不同粒度或不同领域的任务之间的协同作用无法得到发挥。这可能是由于两个原因:

>> 视觉任务是碎片化的。如图4.6(a)所示,计算机视觉任务跨越不同的轴,包括空间时间模态

从空间的角度来看,可以是图像级、区域级和像素级任务,正如我们之前讨论的那样。

在时间轴上,我们不仅需要处理静态图像,还需要处理时间视频序列。

关于模态,输入和输出可以是图像、文本或其他类型(例如,人体姿态、深度图)。这种多样化的任务格式显著阻碍了为所有任务开发统一模型的发展。

>> 数据规模不同。除了复杂的任务环境之外,人工标注的稀缺性和不同任务的标注规模也给统一模型的构建带来了挑战。在图4.6(b)中,我们可以看到一个清晰的数据规模金字塔,其中不同层次的人工注释具有不同的语义。更具体地说,像LAION Schuhmann等人(2021)这样的图像-文本数据集包含高达20亿个样本,而像Objects365(Shao等人,2019)这样的目标检测数据集总共有170万张图像。由于标注掩码的成本较高,在分割数据集上观察到更显著的差距。

基于transformers多功能性+致力于建立统一的通用视觉模型的两类探究:I/O统一(将各种视觉任务重新构建为序列到序列问题)、功能统一(使用一致的编码器-解码器架构+需要复杂的模型设计来适应各种任务)

Despite the aforementioned challenges, we are now witnessing a growing interest in building unified, general-purpose models that can learn from and be applied to a diverse set of vision and vision- language tasks, thanks to the versatility of transformers (Vaswani et al., 2017). These attempts can be grouped into two main categories:

>> I/O Unification. Following the development of unified LLMs, a number of recent works reformulate many vision tasks as a sequence-to-sequence problem (Wang et al., 2022b; Yang et al., 2022c; Chen et al., 2022d; Lu et al., 2022a). They typically use a tokenizer to tokenize the original inputs and outputs (I/O) in different modalities used in various tasks into a coherent sequence of (visual or text) tokens and then exploit a unified, sequence-to-sequence model.

>> Functionality Unification. In addition to I/O unification, one might build a generic model via functionality unification. Extending multi-task learning methods (Lu et al., 2020; Gupta et al., 2022a; Hu and Singh, 2021a), many recent works use a coherent encoder-decoder architecture (Yu et al., 2022a; Zhang et al., 2022b; Zou et al., 2023a). This line of work usually does not need task-specific or modality-specific tokenizers but requires a sophisticated model design to accommodate various tasks.

尽管存在上述挑战,但由于transformers的多功能性,我们现在看到人们对构建统一的通用模型越来越感兴趣,这些模型可以学习并应用于各种视觉和视觉语言任务(Vaswani et al., 2017)。这些尝试可以分为两大类:

>>输入/输出统一(I/O统一)。随着统一LLM的发展,最近的一些工作将许多视觉任务重新形式化为序列到序列问题(Wang等人,2022b;Yang等人,2022c;Chen等人,2022d;Lu等人,2022a)。它们通常使用一个标记器将不同任务中使用的原始输入和输出(I/O)转化为连贯的序列token(视觉或文本),然后利用统一的序列到序列模型。

>>功能统一。除了I/O统一,还可以通过功能统一构建通用模型。扩展多任务学习方法(Lu等人,2020;Gupta等人,2022a;Hu和Singh,2021a),许多最近的工作使用一致的编码器-解码器体系结构(Yu等人,2022a;Zhang等人,2022b;Zou等人,2023a)。这类工作通常不需要任务特定或模态特定的标记器,但需要复杂的模型设计来适应各种任务。

图4.7展示了这两类统一方法的差异

Figure 4.7 illustrates the difference between the two categories of unification methods. For I/O unification, the I/O unification module always generates a sequence of tokens, and exploits a separate decoder to decode the final outputs for different tasks. For functionality unification, the functional unification module generates heterogeneous outputs for different tasks, e.g., semantic outputs and spatial outputs. Then, these different types of outputs are combined to produce the final task-specific outputs. Both unification methods strive to make use of synergy across tasks with different levels of granularity. For example, coarse-grained data is expected to contribute to rich semantic understanding required by fine-grained tasks, while fine-grained data is expected to enhance the grounding ability for coarse-grained tasks. In the following, we review some recent works of these two categories.

图4.7展示了这两类统一方法的差异。对于I/O统一,I/O统一模块总是生成一系列token,并利用单独的解码器解码不同任务的最终输出。对于功能统一,功能统一模块为不同任务生成异构输出,例如语义输出和空间输出。然后,这些不同类型的输出组合在一起生成最终的任务特定输出。这两种统一方法都致力于利用不同粒度任务之间的协同作用。例如,粗粒度数据有望为需要丰富语义理解的细粒度任务做出贡献,而细粒度数据则可以增强粗粒度任务的定位(grounding)能力。接下来,我们将回顾这两个类别的一些最新工作。

4.3.1、I/O Unification—I/O统一

This line of work is mainly inspired by LLMs that unify many NLP tasks as sequential modeling. In the vision domain, the methods of building generic models via I/O unification can be grouped into two categories depending on the tasks of interest and output formats.

这一类工作主要受到LLMs的启发,LLMs将许多NLP任务统一为序列建模。在视觉领域,通过I/O统一构建通用模型的方法可以根据感兴趣的任务和输出格式分为两类。

类别1—Sparse and discrete outputs稀疏和离散输出

近年来的一些工作试图通过序列解码或利用预训练语言模型来统一文本和定位输出,实现不同视觉和视觉语言任务的统一,其中UniTab通过特殊符号统一文本和坐标输出,Pix2SeqV2统一指代分割和关键点检测,VisionLLM等则利用大型预训练语言模型增强具体任务能力。

For vision tasks that produce sparse or discrete token outputs, we can easily exploit a language tokenizer, such as byte-pair encoding (BPE) (Sennrich et al., 2016), for I/O unification. In contrast, spatial outputs like boxes, masks, or human skeletons can be formulated as a sequence of numeric coordinates which are then tokenized into discrete tokens (Cho et al., 2021; Yang et al., 2022c; Liu et al., 2023e). As a result, the decoded output tokens are interleaved with organic textual tokens and numeric textual tokens to support a wide range of tasks. Without the loss of generality, the decoding process is formulated as auto-regressive generation and the model trained with the objective function defined as:

$$\max_{\theta} \sum_{t=1}^{T} \log p_{\theta}(y_t \mid y_{<t}, v),$$

where y = [y_1, ..., y_T] is the discrete token sequence of length T, and v is the visual feature. Below, we review some representative works.

对于产生稀疏或离散token输出的视觉任务,我们可以轻松利用语言标记器,例如字节对编码(BPE)(Sennrich等人,2016年),进行I/O统一。相比之下,像边界框、掩码或人体骨骼这样的空间输出可以被形式化为一系列数字坐标,然后被标记化为离散token(Cho等人,2021年;Yang等人,2022c;Liu等人,2023e)。因此,解码的输出token由普通文本token和数值文本token交错排列,以支持广泛的任务。在不失一般性的前提下,解码过程被表述为自回归生成,并使用如下目标函数训练模型:

$$\max_{\theta} \sum_{t=1}^{T} \log p_{\theta}(y_t \mid y_{<t}, v)$$

其中,y = [y_1, …, y_T] 是长度为T的离散token序列,v是视觉特征。以下,我们将回顾一些代表性的工作。
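下面是上述自回归目标函数的一个最小示意实现(decoder为假设的、以视觉特征为条件的自回归解码器接口):采用teacher forcing,对每个位置的下一token计算交叉熵,等价于最小化 -Σ_t log p(y_t | y_<t, v)。

```python
# 自回归目标函数的最小示意(decoder 为假设接口):
# 采用 teacher forcing,对序列中每个位置计算下一 token 的交叉熵,即 -Σ_t log p(y_t | y_<t, v)。
import torch
import torch.nn.functional as F

def autoregressive_loss(decoder, visual_feat: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """visual_feat: [B, N, D] 视觉特征 v;tokens: [B, T] 离散 token 序列 y(文本与坐标 token 交错)。"""
    inputs, targets = tokens[:, :-1], tokens[:, 1:]       # y_<t 作为输入,y_t 作为监督
    logits = decoder(inputs, visual_feat)                 # [B, T-1, V],V 为词表大小
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))           # 负对数似然(按 token 取均值)
```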

UniTab通过特殊符号统一文本和坐标输出,通过符号代表坐标进行序列解码实现不同任务输出统一

UniTab (Yang et al., 2022c) unifies text and box output in a sequence decoding manner. As shown in Figure 4.8 (a), the box coordinates are represented by numerical numbers with <> and then a special token <obj> is used to encompass the location information. In this way, the model can unify a variety of tasks that require textual and location outputs, including image captioning (Chen et al., 2015), grounded captioning (Plummer et al., 2015), visual grounding, object localization and visual question answering (Antol et al., 2015). The model is trained in three stages: pre-training, multi-task finetuning, and task-specific finetuning.

UniTab(Yang等人,2022c)以序列解码方式统一了文本和边界框输出。如图4.8(a)所示,边界框坐标用<>中的数字表示,然后使用特殊token<obj>来包含位置信息。通过这种方式,模型可以统一各种需要文本和位置输出的任务,包括图像字幕(Chen等人,2015年)、带定位的字幕生成(grounded captioning)(Plummer等人,2015年)、视觉定位、物体定位和视觉问答(Antol等人,2015年)。该模型经过三个阶段的训练:预训练、多任务微调和任务特定微调。
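下面用一个最小示意(token名称、bin数量均为示意性假设,并非UniTab官方实现)说明"坐标离散化+特殊token包裹"的做法:把框坐标量化为有限个离散bin token,再与普通文本token拼接成一条序列。

```python
# 坐标离散化并与文本 token 混合的最小示意(token 名称与 bin 数量均为示意性假设,非官方实现):
# 思路与 UniTab / Pix2Seq 类似:把框坐标量化为有限个离散 bin,再用特殊 token 包裹位置信息。
def box_to_tokens(box, image_w, image_h, num_bins=1000):
    """box: (xmin, ymin, xmax, ymax) 像素坐标;返回可拼入文本序列的 token 列表。"""
    xmin, ymin, xmax, ymax = box

    def quantize(v, size):
        return f"<bin_{min(num_bins - 1, int(v / size * num_bins))}>"

    coord_tokens = [quantize(xmin, image_w), quantize(ymin, image_h),
                    quantize(xmax, image_w), quantize(ymax, image_h)]
    return ["<obj>"] + coord_tokens + ["</obj>"]

# 用法示例(示意):把位置 token 直接插入描述文本的 token 序列中
# tokens = ["a", "dog"] + box_to_tokens((40, 30, 200, 180), 640, 480) + ["running"]
```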

Pix2SeqV2统一指代分割和关键点检测,提出任务提示词来区分不同任务

Pix2SeqV2 (Chen et al., 2022d) slightly differs from UniTab in that it unifies two different vision tasks: referring segmentation and keypoint detection. Following Pix2Seq (Chen et al., 2022c), Pix2SeqV2 represents objects in an image as [ymin, xmin, ymax, xmax, text]. Then, it introduces a unique task prompt for each task, which contains task type information or a combination of task types and specific locations. For mask decoding, a mask contour is converted into a polygon and then its coordinates extracted from the polygon (Castrejon et al., 2017). A similar strategy is also used for referring segmentation, as in Polyformer (Liu et al., 2023e).

Pix2SeqV2(Chen等人,2022d)与UniTab略有不同,它统一了两种不同的视觉任务:指代分割和关键点检测。在Pix2Seq(Chen等人,2022c)之后,Pix2SeqV2将图像中的对象表示为[ymin、xmin、ymax、xmax、text]。然后,它为每个任务引入了一个独特的任务提示,其中包含任务类型信息或任务类型和特定位置的组合。对于掩码解码,将掩码轮廓转换为多边形,然后从多边形中提取坐标(Castrejon等人,2017)。与Polyformer(Liu等人,2023e)一样,也使用了类似的策略进行指代分割。

LLM增强VisionLLM等则利用优秀的大型预训练语言模型来增强不同视觉语言任务的能力,如Kosmos-2、VisionLLM、DetGPT、GPT4ROI、BubaGPT、LISA和PaLI-X,它们通过融合LLMs的语言理解能力,使模型具备强大的视觉-语言推理能力

LLM-augmented.

Recent works have also explored building a generic decoding interface based on LLMs, which are pre-trained on large amounts of text data and human instructions. Kosmos-2 (Peng et al., 2023b) exploits the pretrained LLMs of Kosmos-1 (Huang et al., 2023b) and augments the grounded multi-modal data by collecting a web-scale grounded image-text pair dataset (GRIT) consisting of 91M images. VisionLLM (Wang et al., 2023h) appends an even larger LLM (e.g., LLaMa (Touvron et al., 2023)) on top of an image tokenizer, as shown in Figure 4.9. The resultant model exhibits a very strong vision-language reasoning capacity and decent localization ability for object detection, segmentation, etc. Some other works that combine LLMs with grounding are DetGPT (Pi et al., 2023) and GPT4ROI (Zhang et al., 2023k). To further equip the model with the segmentation capability, both BubaGPT (Zhao et al., 2023c) and LISA (Lai et al., 2023) use an extra referring segmentation model to segment images by taking texts or embeddings as input, respectively. PaLI-X (Chen et al., 2023g) is by far the largest unified model that can cope with multilingual vision and vision-language tasks.

LLM增强。

最近的工作还探索了基于LLMs构建通用解码接口,这些LLMs是在大量文本数据和人类指令上预训练得到的。Kosmos-2(Peng等人,2023b)利用Kosmos-1(Huang等人,2023b)的预训练LLM,并通过收集一个包含9100万图像的Web规模带定位标注的图像-文本对数据集(GRIT)来扩充带定位(grounded)的多模态数据。VisionLLM(Wang等人,2023h)在图像标记器的基础上添加了一个更大的LLM(例如LLaMa(Touvron等人,2023)),如图4.9所示。所得模型展示了非常强的视觉语言推理能力,以及对目标检测、分割等任务的良好定位能力。其他将LLM与定位结合使用的工作包括DetGPT(Pi等人,2023)和GPT4ROI(Zhang等人,2023k)。为了进一步赋予模型分割能力,BubaGPT(Zhao等人,2023c)和LISA(Lai等人,2023)都使用了额外的指代分割模型,分别以文本或嵌入作为输入来分割图像。PaLI-X(Chen等人,2023g)是迄今为止最大的统一模型,可以处理多语言视觉和视觉语言任务。

类别2—Dense and continuous outputs密集和连续输出

以下工作主要从模型结构、预训练策略、定向学习等角度,探索如何利用单一框架处理各种视觉任务,取得一定进展。但整体来说,相比专门设计的模型,通用模型在单个任务上的表现还有差距。

There are also some tasks that require dense and continuous outputs, such as image segmentation (He et al., 2017), depth estimation (Mertan et al., 2022), image inpainting and editing (Elharrouss et al., 2020; Brooks et al., 2023). Except for segmentation masks which can be approximated by polygons (Liu et al., 2023e; Chen et al., 2022d), most dense and continuous outputs cannot be easily converted into discrete tokens due to the high-dimensional space. Thus, we have to resort to an image-oriented tokenizer. Akin to the language tokenizer, an image tokenizer encodes raw images and extracts discrete tokens spanning the visual feature space. The most representative work is VQ-VAE (Oord et al., 2017; Razavi et al., 2019). As shown in Figure 4.10 (a), VQ-VAE learns an encoder ze, a decoder zq and a discrete codebook e = {e1, ..., eK} consisting of K embeddings. Given the input x, the posterior categorical probability q(z|x) is defined as:

$$q(z = k \mid x) = \begin{cases} 1 & \text{for } k = \arg\min_j \lVert z_e(x) - e_j \rVert_2, \\ 0 & \text{otherwise,} \end{cases}$$

where the decoder zq takes x (or its representation ek) as input to predict the class label. As a variant of VQ-VAE, VQ-GAN uses a discriminator and the perceptual loss (Larsen et al., 2016; Lamb et al., 2016) to maintain a good balance between output quality and model efficiency (via a high compression rate). In Figure 4.10 (b), we see that the discriminator is applied at the patch level to regularize the decoding of images at high resolution. Below, we discuss some of the most recent works that attempt to unify different vision and multi-modal tasks that involve dense outputs.

还有一些任务需要密集和连续的输出,例如图像分割(He等人,2017)、深度估计(Mertan等人,2022)、图像修复和编辑(Elharrouss等人,2020;Brooks等人,2023)。除了可以用多边形近似的分割掩码(Liu等人,2023e;Chen等人,2022d)之外,由于输出空间维度很高,大多数密集和连续的输出难以直接转换为离散token。因此,我们必须借助面向图像的标记器。类似于语言标记器,图像标记器对原始图像进行编码,并提取跨越视觉特征空间的离散token。最具代表性的工作是VQ-VAE(Oord等人,2017;Razavi等人,2019)。如图4.10 (a)所示,VQ-VAE学习一个编码器ze、一个解码器zq,以及一个由K个嵌入组成的离散码本e = {e1, …, eK}。给定输入x,后验分类概率q(z|x)定义为:

其中解码器zq以x(或其表示ek)作为输入来预测类标号。作为VQ-VAE的变体,VQ-GAN使用鉴别器和感知损失(Larsen等人,2016;Lamb等人,2016),以在输出质量和模型效率(通过高压缩率)之间保持良好的平衡。如图4.10(b)所示,鉴别器作用于patch级别,以规范高分辨率图像的解码。下面,我们将讨论一些最近的工作,这些工作试图统一涉及密集输出的不同视觉和多模态任务。
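下面是对上述向量量化步骤的一个最小PyTorch示意(码本大小、嵌入维度等超参数均为示意性假设,并非论文实现):每个编码器输出被"吸附"到最近的码本嵌入上,并通过直通估计(straight-through estimator)把梯度传回编码器。

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-neighbor codebook lookup used in VQ-VAE (illustrative sketch)."""

    def __init__(self, num_codes=512, dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)   # e = {e_1, ..., e_K}
        self.beta = beta

    def forward(self, z_e):                             # z_e: (B, N, dim) encoder outputs
        # L2 distance between each encoder vector and each codebook entry
        codes = self.codebook.weight.unsqueeze(0).expand(z_e.shape[0], -1, -1)
        dist = torch.cdist(z_e, codes)                  # (B, N, K)
        idx = dist.argmin(dim=-1)                       # hard assignment = the one-hot posterior
        z_q = self.codebook(idx)                        # quantized vectors fed to the decoder

        # codebook + commitment losses, plus straight-through gradient to the encoder
        loss = ((z_q - z_e.detach()) ** 2).mean() + self.beta * ((z_e - z_q.detach()) ** 2).mean()
        z_q = z_e + (z_q - z_e).detach()
        return z_q, idx, loss

# usage: quantize 16x16 = 256 latent vectors of a single image
vq = VectorQuantizer()
z_q, idx, loss = vq(torch.randn(1, 256, 64))
print(z_q.shape, idx.shape, loss.item())
```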

UViM第一个采用密集解码实现任务统一

UViM (Kolesnikov et al., 2022) is one of the first works that employ a dense decoding process to unify various core vision tasks, including panoptic segmentation, depth estimation and colorization. The learning process consists of two stages: (i) Base encoder-decoder f and restricted oracle Ω are learned to predict outputs given input images, where f takes raw image as input and Ω takes the desired output as input to decode the oracle code; (ii) Instead of using the desired output as input to the oracle Ω, the model learns a language model to produce the oracle code for the input raw image. Notably, the encoder-decoder model used here is trained with VQ-VAE objectives. As the first step to unify vision tasks with a single model, UViM shows promising results on three vision tasks.

UViM(Kolesnikov等人,2022)是第一个采用密集解码过程来统一各种核心视觉任务的工作之一,包括全景分割、深度估计和着色。学习过程包括两个阶段

(i)基本编码器-解码器f和受限的oracle Ω被学习来根据输入图像预测输出,其中f以原始图像作为输入,Ω以所需的输出作为输入来解码oracle代码;

(ii)模型不使用所需的输出作为oracle Ω的输入,而是学习一个语言模型来为输入原始图像生成oracle代码。值得注意的是,这里使用的编码器-解码器模型是通过VQ-VAE目标函数进行训练的。作为统一不同任务的第一步,UViM在三个视觉任务上展现出了令人鼓舞的结果。

Unified-IO:采用多个VQ-VAE模型和Transformer编码器-解码器,预训练不同任务后端到端训练,来实现任务统一

Unified-IO (Lu et al., 2022a) is another representative work. Compared to UViM, it scales to many more vision tasks and datasets. Unlike the training procedure of UViM, Unified-IO first trains different VQ-VAE models for different tasks, as depicted in Figure 4.11 left. After obtaining all VQ-VAE encoder-decoders, 90 datasets are combined to train another transformer encoder-decoder end-to-end, as shown on the right side. Similar to previous works, it also uses a language decoder to obtain the organic and numeric texts to generate coordinate outputs. After the second-stage pre-training, the model achieves state of the art on the GRIT benchmark (Gupta et al., 2022c) and exhibits compelling compositionality, although the performance still lags behind the strongest models on common tasks. As a follow-up, a soft-token strategy is proposed in Ning et al. (2023) to improve the accuracy for next token decoding. In addition, a masked modeling strategy is proposed to learn robust representations. Evaluated on instance segmentation and depth estimation, the model achieves state-of-the-art performance on NYUv2 (Silberman et al., 2012) and competitive performance on segmentation. A recent work uses image inpainting as the general task to unify different pixel-level vision tasks (Bar et al., 2022). Given the target discrete tokens produced by VQ-GAN, the method exploits a masked autoencoder to decode the missed image regions, using the task input-output examples as prompts. Painter (Wang et al., 2023i) extends this pipeline to facilitate more vision tasks and obtains competitive performance on various standard benchmarks.

Unified-IO(Lu等人,2022a)是另一个代表性的工作。与UViM相比,它扩展到了更多的视觉任务和数据集。与UViM的训练过程不同,Unified-IO首先为不同的任务训练不同的VQ-VAE模型,如图4.11左侧所示。在获得所有VQ-VAE编码器-解码器后,将90个数据集组合起来,如图右侧所示,端到端地训练另一个transformer编码器-解码器。与之前的工作类似,它还使用语言解码器来获取自然文本和数字文本,以生成坐标输出。在第二阶段的预训练之后,该模型在GRIT基准测试(Gupta等人,2022c)上取得了最先进的性能,并展现出令人信服的组合性,尽管在常见任务上的性能仍然落后于最强的模型。作为后续工作,Ning等人(2023)提出了一种软token(soft-token)策略,以提高下一个token解码的准确性。此外,还提出了一种掩码建模策略来学习鲁棒的表示。在实例分割和深度估计上进行评估,该模型在NYUv2(Silberman等人,2012)上取得了最先进的性能,并在分割上取得了有竞争力的表现。

最近的一项工作,将图像修复作为统一不同像素级视觉任务的通用任务(Bar等人,2022)。根据VQ-GAN产生的目标离散token,该方法利用了掩码自编码器来解码缺失的图像区域,使用任务输入输出示例作为提示。Painter(Wang等人,2023i)将这一流程扩展到更多的视觉任务,并在各种标准基准测试中取得了竞争性能。

扩散增强:使用已有的稳定扩散模型来构建通用的视觉模型,如Prompt Diffusion和InstructDiffusion

Diffusion-augmented. Unlike the above works that learn their own decoding models, some recent works utilize the off-the-shelf stable diffusion model to build generalist vision models. For example, Prompt Diffusion (Wang et al., 2023m) initializes a model using Stable Diffusion and ControlNet (Zhang and Agrawala, 2023), and trains the in-context image-to-image model jointly on six different vision-language tasks, including segmentation, depth estimation, etc. InstructDiffusion (Geng et al., 2023) also uses the diffusion model but explicitly introduces task-specific instructions to the diffusion process. Moreover, it uses task-specific training and human alignment training to enable a generalist interface for vision tasks.

扩散增强。与上述自行学习解码模型的工作不同,一些最近的工作利用现成的稳定扩散模型来构建通用视觉模型。例如,Prompt Diffusion(Wang等人,2023m)使用Stable Diffusion和ControlNet(Zhang和Agrawala,2023)初始化模型,并在六个不同的视觉语言任务(包括分割、深度估计等)上联合训练上下文图像到图像模型。InstructDiffusion(Geng等人,2023)也使用扩散模型,但在扩散过程中显式引入了任务特定的指令。此外,它使用任务特定训练和人类对齐训练,来为视觉任务提供通用接口。

4.3.2、Functionality Unification功能统一

Unlike I/O unification, functionality unification attempts to unify different tasks based on the task characteristics, with the awareness that they are neither fully isolated nor fully aligned. At a high level, vision tasks produce three types of outputs: (i) location outputs, (ii) semantic outputs, and (iii) pixel-level outputs. For example, both object detection and phrase grounding need to localize objects in the image, while both generic segmentation and referring segmentation produce masks. On the other hand, many tasks require semantic (or text) outputs to represent either concept names or textual descriptions.

与I/O统一不同,功能统一试图根据任务特性来统一不同的任务,同时意识到它们既不完全隔离也不完全一致。从高层次来看,视觉任务产生三种类型的输出:

(i)位置输出,

(ii)语义输出和

(iii)像素级输出。

例如,目标检测和短语定位都需要在图像中定位对象,而通用分割和指代分割都生成掩码。另一方面,许多任务需要语义(或文本)输出,以表示概念名称或文本描述。

Multi-task learning多任务学习

Some early works explore multi-task learning methods for unifying different vision or vision- language tasks.

一些早期的工作探索了用于统一不同视觉或视觉语言任务多任务学习方法

Vision models视觉模型:探索使用CNN在不同视觉任务间学习(如Cross-stitch/UberNet),但都难以建立任务间协同关系来提升模型效果,Taskonomy通过学习视觉任务间关系提供深刻启发

介绍了几项使用CNN处理多任务学习的工作。Cross-stitch Networks和UberNet尝试设计CNN结构适应不同任务,但难将任务间关系整合提升效果。Taskonomy通过学习每个任务特定模型,然后将其映射到潜空间来研究任务间关系,发现表面法线估计等任务间存在紧密关联。它以任务内在关系为导向,为多任务视觉建模提供深入见解。总体来说,这些工作侧重CNN结构设计,但难构建任务协同学习机制。Taskonomy通过任务本身关联性研究,在一定程度上弥补了此不足。

A few works explore using CNNs for learning with different vision tasks at different levels. For example, Cross-stitch Networks (Misra et al., 2016) develops a strategy to split different numbers of layers from the top in CNNs so as to adapt to different vision tasks. Results show that the best-performing multi-task architecture depends on the tasks of interest and can hardly generalize to new tasks. UberNet (Kokkinos, 2017) takes one step further to use a single universal CNN architecture and sophisticatedly design a routing mechanism to save the memory and computing cost, as shown in Figure 4.12 (a). Both works require some tweaking to the CNN architecture so that they can adapt to different levels of tasks and loss types. But they unfortunately fail to build the synergy across tasks to improve model performance. Taskonomy (Zamir et al., 2018) specifically studies the relationship among vision tasks. It first trains task-specific models for each individual task and then performs transfer modeling across tasks in the latent space. The task affinity is then calculated in the latent space, providing us with the taskonomy. The result shows that vision tasks have different affinities for different groups, as shown in Figure 4.12 (b). For example, surface normal estimation is heavily related to reshaping and point matching. Curvature extraction is related to image segmentation tasks. This study provides deep insights for multi-task vision modeling (Xu et al., 2018; Crawshaw, 2020).

一些工作探索了使用卷积神经网络(CNNs)来学习不同层次的不同视觉任务。例如,

Cross-stitch Networks(Misra等人,2016)开发了一种策略,从CNNs的顶部分割不同数量的层,以适应不同的视觉任务。结果显示,最佳的多任务架构依赖于感兴趣的任务,很难推广到新的任务。

UberNet(Kokkinos,2017)进一步使用了单一通用CNN架构,并精心设计路由机制,以节省内存和计算成本,如图4.12(a)所示。

这两项工作都需要对CNN架构进行一些调整,以便适应不同层次的任务和损失类型。但遗憾的是,它们未能在任务之间建立协同关系以提高模型性能

Taskonomy(Zamir等人,2018)专门研究了视觉任务之间的关系。它首先为每个单独的任务训练任务特定的模型,然后在潜在空间中进行跨任务的迁移建模。随后在潜在空间中计算任务间的亲和度,从而得到任务分类体系(taskonomy)。结果显示,视觉任务在不同分组上表现出不同的亲和性,如图4.12(b)所示。例如,表面法线估计与重塑(reshaping)和点匹配密切相关,曲率提取与图像分割任务相关。这项研究为多任务视觉建模提供了深刻的见解(Xu等人,2018;Crawshaw,2020)。

Multi-modal models多模态模型:Transformer模型的兴起促进了多任务多模态发展,早期工作主要通过共享部分和任务专门头等方式将多个视觉语言任务联合,但未充分利用任务间协同关系。12in1(通过共享底层特征和专门任务头将多个视觉语言任务联合)、UniT/E2E-VLP(扩展到视觉任务+允许端到端训练)

The emergence of Transformers significantly facilitates the advancement of multi-task multi-modal learning. Among them, 12in1 (Lu et al., 2020) is one of the pioneering works that combine 12 vision-language tasks in a single BERT-based architecture. It uses task-specific heads for individual tasks and a commonly shared trunk ViLBERT (Lu et al., 2019). Results show that multi-task learning can achieve substantial improvements over single-task learning while reducing the model parameters significantly. Later on, UniT (Hu and Singh, 2021b) exploits an encoder-decoder architecture and expands to vision-only tasks like object detection. Additionally, it allows end-to-end training on the task pool without relying on pre-trained detectors. Similar to 12in1, it also uses a task-specific head for each task, motivated by the empirical result that sharing the same head usually hurts performance. Likewise, E2E-VLP (Xu et al., 2021) proposes an end-to-end pipeline for both localization tasks and text generation. Both UniT and E2E-VLP demonstrate the versatility of the encoder-decoder architecture of DETR (Carion et al., 2020). Following the same spirit, GPV (Gupta et al., 2022b) proposes an end-to-end task-agnostic architecture for different vision and vision-language tasks. It uses DETR to extract boxes and region features and then exploits a cross-attention module for fusion, followed by a vision decoder and a language decoder for decoding different outputs.

The above vision and multi-modal models unify different tasks by incorporating different modules or heads designed to cope with different tasks, and can hardly achieve synergy across tasks. In the following, we discuss recent model unification research that aims to make the best use of synergy among various vision and multi-modal tasks.

Transformer的出现极大地促进了多任务多模态学习的发展。其中,12in1(Lu等人,2020)是将12个视觉语言任务组合到单个基于BERT的体系结构中的开创性工作之一。它为各个任务使用任务特定的头部,并共享一个ViLBERT主干(Lu等人,2019)。结果显示,相比单任务学习,多任务学习可以在显著减少模型参数的同时取得实质性的性能提升。

后来,UniT(Hu和Singh,2021b)采用编码器-解码器架构,并扩展到目标检测等纯视觉任务。此外,它允许在不依赖预训练检测器的情况下,对整个任务池进行端到端训练。与12in1类似,它也为每个任务使用任务特定的头部,其动机来自经验观察:共用同一个头部通常会损害性能。类似地,E2E-VLP(Xu等人,2021)提出了同时用于定位任务和文本生成的端到端流程。

UniTE2E-VLP都展示了DETR(Carion等人,2020)的编码器-解码器架构的多才多艺。遵循同样的精神,GPV (Gupta等人,2022b)针对不同的视觉和视觉语言任务提出了一种端到端任务无关的架构。它使用DETR来提取边界框和区域特征,然后利用交叉注意模块进行融合,然后使用视觉解码器和语言解码器对不同的输出进行解码

上述视觉和多模态模型通过整合不同的模块或头部来统一不同的任务,几乎无法实现任务之间的协同效应。接下来,我们将讨论最近的模型统一研究,旨在充分利用各种视觉和多模态任务之间的协同作用。

Unified learning统一学习:借助Transformer和开放集模型的发展,任务间障碍渐渐淡化,使得不同模态输入可以学习共享语义空间

The barrier across tasks is gradually blurred thanks to the use of Transformers (Vaswani et al., 2017) and the development of open-set models as we discussed earlier. It is now possible to bind inputs from different modalities to learn a shared semantic space. A number of works (Zhang et al., 2022b; Zou et al., 2023a; Li et al., 2023g) have recently been proposed to unify vision and vision-language tasks by using one model for all. After pre-training, the single model can be applied to tackle all tasks in a zero-shot manner and the performance can be further improved via task-specific finetuning. Note that unified learning in this context differs from previous works of large-scale pre-training. Like GPT which serves as a universal language interface after pre-training, a unified vision model is not only a representation learning engine but also an interface that supports as many tasks as possible in a zero-shot manner. Below, we review a few representative works.

由于前面讨论的Transformers(Vaswani等人,2017)和开放集模型的使用,任务之间的界限逐渐变得模糊。现在可以将来自不同模态的输入绑定到一个共享的语义空间中进行学习。

最近提出了一些工作(Zhang等人,2022b;Zou等人,2023a;Li等人,2023g),旨在通过一个模型来统一视觉和视觉语言任务。在预训练之后,可以以零样本方式应用单一模型来处理所有任务,并通过任务特定的微调进一步提高性能。需要注意的是,这里的统一学习不同于以前的大规模预训练工作。就像GPT在预训练后充当通用语言接口一样,统一视觉模型既是一个表征学习引擎,也是一个以零样本方式支持尽可能多任务的接口。下面,我们回顾几个代表性的工作。

GLIPv2—强调预训练策略:通过预训练任务融合定位与匹配模块训练

GLIPv2 (Zhang et al., 2022b) is proposed by extending GLIP (Li et al., 2022f) to support a wide range of vision and vision-language tasks, including grounded captioning, visual question answering, etc. GLIPv2 seamlessly integrates localization pre-training and Vision-Language Pre-training (VLP) through three distinct pre-training tasks: (i) phrase grounding, which serves as a vision-language adaptation of detection tasks; (ii) region-word contrastive learning, introducing a novel contrastive task at the region-word level; and (iii) masked language modeling. In a zero-shot manner, this pre-trained model can be applied to different tasks and attain plausible performance across the board. Unlike previous works (e.g., GPV (Gupta et al., 2022b)), it merges the localization module and vision-language matching module in a coherent manner, which makes model training from fused data much more efficient and effective.

GLIPv2(Zhang等人,2022b)是通过扩展GLIP(Li等人,2022f)提出的,以支持广泛的视觉和视觉语言任务,包括带定位的字幕生成(grounded captioning)、视觉问答等。GLIPv2通过三个不同的预训练任务,将定位预训练和视觉-语言预训练(VLP)无缝整合在一起:

(i)短语定位,作为检测任务的视觉语言适应;

(ii)区域-词对比学习,在区域-词级别引入了一项新的对比任务;

(iii)掩码语言建模。在零样本方式下,这个预训练模型可以应用于不同的任务,并在各项任务上取得合理的性能。与之前的工作(例如GPV(Gupta等人,2022b))不同,它以一种连贯的方式融合了定位模块和视觉-语言匹配模块,从而使基于融合数据的模型训练更加高效和有效。
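下面是区域-词对比学习思想的一个示意性PyTorch草图(仅用于说明思路,特征维度、温度系数等均为假设,并非GLIPv2的真实实现):

```python
import torch
import torch.nn.functional as F

def region_word_contrastive_loss(region_feats, word_feats, match, temperature=0.07):
    """
    Schematic region-word contrastive objective (not GLIPv2's exact formulation).
      region_feats: (R, D) pooled region features from the image branch
      word_feats:   (W, D) token features from the text branch
      match:        (R, W) binary matrix, 1 where a region is described by a word
    """
    region_feats = F.normalize(region_feats, dim=-1)
    word_feats = F.normalize(word_feats, dim=-1)
    logits = region_feats @ word_feats.t() / temperature          # (R, W) alignment scores
    targets = match / match.sum(dim=-1, keepdim=True).clamp(min=1)
    # cross-entropy between the predicted word distribution and the ground-truth matches
    return -(targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

# toy usage: 3 regions, 5 words; region 0 matches word 1, region 1 matches word 3
match = torch.zeros(3, 5); match[0, 1] = 1; match[1, 3] = 1
loss = region_word_contrastive_loss(torch.randn(3, 256), torch.randn(5, 256), match)
print(loss.item())
```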

X-Decoder—注重设计支持不同任务学习:采用编码器-解码器架构+三个关键设计,分离图像与文本编码器支持不同粒度任务,通过不同类型查询和输出支撑不同粒度任务学习

X-Decoder (Zou et al., 2023a) follows the generic design of encoder-decoder architecture. Given an input image, it first uses an image encoder to extract features at multiple scales. Afterward, a text encoder is used to encode a textual query into a sequence of embeddings. The visual features, textual queries and the non-semantic or latent queries are fed to a decoder to predict the outputs. Three critical designs are proposed to empower the generalization ability of X-Decoder to a variety of vision and vision-language tasks: (i) It defines two types of queries and outputs. Specifically, the queries for the decoder are categorized into latent queries and text queries, which undertake generic vision and vision-language tasks, respectively. Likewise, the output is categorized into pixel-level masks and semantic embeddings; (ii) A single text encoder is exploited to encode the textual corpus from all tasks. The common text encoder is used to encode referring phrases, text descriptions, and image captions in the task of referring segmentation, image-text retrieval and image captioning, respectively; (iii) It fully decouples the image and text encoder, and use all the outputs as queries. As such, it can learn from both intra-image supervisions and inter-image ones, which is essential to learn stronger pixel-level representations and support different granularity of tasks. As shown in Figure 4.13, the pre-trained model can support different tasks by taking different routing while sharing the same suite of parameters.

X-Decoder(Zou等人,2023a)遵循编码器-解码器架构的通用设计。给定输入图像,它首先使用图像编码器在多个尺度上提取特征。然后,使用文本编码器将文本查询编码为一系列嵌入。视觉特征、文本查询以及非语义(潜在)查询被馈送到解码器以预测输出。为了增强X-Decoder在各种视觉和视觉语言任务中的泛化能力,提出了三个关键设计:

(i)它定义了两种类型的查询和输出。具体来说,解码器的查询被分类为潜在查询和文本查询,分别用于通用视觉和视觉语言任务。同样,输出被分类为像素级掩码和语义嵌入;

(ii)利用单一文本编码器对所有任务的文本语料进行编码。该通用文本编码器分别对指代分割、图像-文本检索和图像字幕生成任务中的指代短语、文本描述和图像标题进行编码;

(iii)完全解耦图像和文本编码器,并使用所有输出作为查询。因此,它既可以学习图像内监督,也可以学习图像间监督,这对于学习更强的像素级表示和支持不同粒度的任务至关重要。如图4.13所示,预训练模型可以在共享同一组参数的情况下,采用不同的路由方式来支持不同的任务。
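下面给出一个高度简化的示意代码,说明"潜在查询/文本查询→像素级掩码/语义嵌入"这种双查询、双输出的路由思路;其中的维度与模块均为假设,并非X-Decoder的真实架构:

```python
import torch
import torch.nn as nn

class TinyGenericDecoder(nn.Module):
    """Illustrative sketch of decoding two query types into two output types
    (pixel-level masks vs. semantic embeddings); not X-Decoder's real architecture."""

    def __init__(self, dim=256, num_latent_queries=100):
        super().__init__()
        self.latent_queries = nn.Parameter(torch.randn(num_latent_queries, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.mask_head = nn.Linear(dim, dim)          # produces mask embeddings

    def forward(self, visual_feats, pixel_feats, text_queries):
        # visual_feats: (B, HW, dim) encoder features used as memory
        # pixel_feats:  (B, dim, H, W) high-res features for mask prediction
        # text_queries: (B, T, dim) embeddings from a shared text encoder
        B = visual_feats.shape[0]
        latent = self.latent_queries.unsqueeze(0).expand(B, -1, -1)
        queries = torch.cat([latent, text_queries], dim=1)
        out = self.decoder(queries, visual_feats)                 # (B, Q+T, dim)
        latent_out, text_out = out[:, :latent.shape[1]], out[:, latent.shape[1]:]
        masks = torch.einsum("bqd,bdhw->bqhw", self.mask_head(latent_out), pixel_feats)
        return masks, text_out    # pixel-level masks + semantic embeddings

dec = TinyGenericDecoder()
masks, sem = dec(torch.randn(2, 64, 256), torch.randn(2, 256, 32, 32), torch.randn(2, 4, 256))
print(masks.shape, sem.shape)     # (2, 100, 32, 32), (2, 4, 256)
```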

Uni-Perceiver-v2—侧重统一不同数据级别任务学习:引入提议网络编码预测框来统一处理定位与非定位任务训练

Uni-Perceiver-v2 (Li et al., 2023g) is another generalist model that unifies vision and vision-language tasks. Similar to X-Decoder, the model exploits a vision encoder, a text encoder and a general decoder. Differently, it introduces a region proposal network on top of the vision backbone to explicitly predict the boxes and masks, which are then encoded as "queries" for the general decoder. To jointly train on datasets at different levels, it introduces a unified max-likelihood estimation strategy for tasks with localization and without localization.

Uni-Perceiver-v2(李等人,2023g)是另一个统一视觉和视觉语言任务的通用模型。与X-Decoder类似,该模型利用了一个视觉编码器、一个文本编码器和一个通用解码器。不同之处在于,它在视觉主干的顶部引入了一个区域建议网络,以明确预测边界框和掩码,然后将其编码为通用解码器的“查询”。为了在不同层次的数据集上进行联合训练,它引入了一种统一的最大似然估计策略,用于有定位和没有定位的任务。

4.4、From Static to Promptable Models从静态到可提示模型

The success of Large Language Models (LLMs) such as ChatGPT (OpenAI, 2023b) has shown the importance of modern AI models in interacting with humans, and has provided a glimpse of AGI (Bubeck et al., 2023). The ability to interact with humans requires a user-friendly interface that can take as many types of human inputs as possible and generate responses that humans can easily understand. In NLP, such a universal interaction interface has emerged and evolved for a while from early models like GPT (Brown et al., 2020) and T5 (Raffel et al., 2020), to more advanced techniques like prompting (Shin et al., 2020; Zhao et al., 2021; Li and Liang, 2021) and chain-of-thought (Wei et al., 2022a; Kojima et al., 2022; Schick et al., 2023). However, most vision models are still static in that they are less flexible than LLMs to various prompts. Most recently, a number of works have proposed to enhance the static vision models with the capabilities to support: (i) multi-modal prompting; (ii) in-context prompting.

ChatGPT (OpenAI, 2023b)等大型语言模型(LLMs)的成功表明了现代人工智能模型在与人类交互中的重要性,并提供了AGI的一瞥(Bubeck等人,2023)。与人类互动的能力需要一个用户友好的接口,可以接受尽可能多类型的人类输入,并生成人类容易理解的响应。

在NLP中,这样一个通用的交互界面已经出现并发展了一段时间,从早期的模型,如GPT(Brown等人,2020)和T5(Raffel等人,2020),到更先进的技术,如提示(Shin等人,2020;Zhao等人,2021;Li和Liang,2021)和思维链(Wei等人,2022a;Kojima等人,2022;Schick等人,2023)。然而,大多数视觉模型仍然是静态的,因为它们对各种提示的灵活性不如LLMs。最近,一些工作提出增强静态视觉模型的能力,以支持:

(i)多模态提示;

(ii)上下文提示。

4.4.1、Multi-modal Prompting多模态提示

Vision is different from language by nature. To enable a smooth interaction between humans and AI, a model requires not only language prompts but also other types of prompts to complement the missing information or resolve the ambiguity in language. Recently, a number of works have explored how to combine or augment language prompts with other types of prompts, such as spatial prompts (Kirillov et al., 2023), visual prompts (Zou et al., 2023b) and other modalities (Girdhar et al., 2023; Liu et al., 2023f). In the following, we review some representative works.

视觉与语言有本质的区别。为了实现人与AI之间的顺畅交互,模型不仅需要语言提示,还需要其他类型的提示来补充缺失的信息或解决语言中的歧义。

最近,一些研究探索了如何将语言提示与其他类型的提示结合或相互补充,如空间提示(Kirillov等人,2023)、视觉提示(Zou等人,2023b)以及其他模态(Girdhar等人,2023;Liu等人,2023f)。下面,我们回顾一些代表性工作。

Spatial prompting空间提示

Vision is rooted in the physical world, and as such it is not only semantic but also spatial by nature. Spatial prompting can be considered as a way to modulate the vision models through the inputs of location information, which could be a point, a box, or an arbitrary stroke, etc. Such clues have been heavily used in UI designs of computers (e.g., mouse) and mobile devices (e.g., touch screen). In computer vision, interactive segmentation (Mortensen and Barrett, 1998; McGuinness and O’connor, 2010; Chen et al., 2021c, 2022i) naturally requires such capability so that the model can take multiple clicks from users and gradually refine the segmentation mask. However, most of these works are still designed task-specifically and lack enough flexibility to support different types of spatial prompts.

SAM (Kirillov et al., 2023) is one of the pioneering works that propose a convenient spatial prompting interface and learn a foundation model for image segmentation. As shown in Figure 4.14, the model can take points or boxes as the prompts, and segment images in arbitrary granularity. The ability to segment images following the user instructions from humans makes it readily a foundation to build many more models and applications (Zhang et al., 2023c). To name a few, a number of works (Ma and Wang, 2023; Roy et al., 2023) start with SAM and train a promptable segmentation model for the medical domain. Spatial prompting is particularly beneficial in that the textual annotations for medical images are usually limited and hard to interpret. Similar cases also happen in other industry domains (Tang et al., 2023a). To further improve point prompting, SAMAug (Dai et al., 2023a) proposes to refine the points using the max entropy criterion and saliency map, which can help to determine the most informative locations the model should look at.

视觉根植于物理世界,因此它不仅具有语义性,而且天然具有空间性。空间提示可以被看作是通过位置信息输入来调制视觉模型的一种方式,这些位置信息可以是点、框或任意笔划等。这些线索在电脑(如鼠标)和移动设备(如触摸屏)的UI设计中被大量使用。在计算机视觉中,交互式分割(Mortensen和Barrett,1998;McGuinness和O'Connor,2010;Chen等人,2021c,2022i)天然需要这种能力,以便模型可以接受用户的多次点击,并逐渐细化分割掩码。然而,这些工作大多仍是针对特定任务设计的,缺乏足够的灵活性来支持不同类型的空间提示。

SAM(Kirillov等人,2023)是提出便捷的空间提示接口并学习图像分割基础模型的开创性工作之一。如图4.14所示,该模型可以采用点或框作为提示,并以任意粒度分割图像。按照人类用户指令分割图像的能力,使其很容易成为构建更多模型和应用的基础(Zhang等人,2023c)。例如,一些工作(Ma和Wang,2023;Roy等人,2023)以SAM为基础,训练面向医疗领域的可提示分割模型。空间提示在这里尤其有用,因为医学图像的文本标注通常有限且难以解读。类似的情况也出现在其他行业领域(Tang等人,2023a)。

为了进一步改进点提示,SAMAug(Dai等人,2023a)提出使用最大熵准则和显著性图来细化点,这有助于确定模型应该关注信息量最大的位置
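下面是使用公开发布的segment-anything包进行空间提示(点/框)的一个示意用法(权重文件名与图像路径仅为占位符,需按实际情况替换):

```python
# Sketch of spatial prompting with the released segment-anything package
# (checkpoint file name and image path are placeholders).
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)                       # compute the image embedding once

# a single positive click (x, y) as the spatial prompt; label 1 = foreground
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,                       # return masks at several granularities
)

# a box prompt (xmin, ymin, xmax, ymax) can be used instead of, or together with, points
box_masks, _, _ = predictor.predict(box=np.array([100, 100, 400, 380]), multimask_output=False)
print(masks.shape, box_masks.shape)
```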

Visual prompting视觉提示

In many cases, textual descriptions of objects are not necessarily clear to convey the information. For example, given an unrecognizable or indescribable object, people may fail to express themselves clearly about the object. In this case, showing one or a few examples would be more informative and straightforward. With this idea, a lineup of works have studied exemplar-based visual modeling, such as image-to-image retrieval (Yoon et al., 2021; Datta et al., 2008; Zhang et al., 2018), image co-segmentation (Joulin et al., 2010; Jerripothula et al., 2016) and visual object tracking (Yilmaz et al., 2006; Luo et al., 2021; Wu et al., 2013). Most recently, this strategy has been formulated as visual prompting in that different types of visual inputs are usually encoded to some unified format and then fed into a Transformer architecture, as shown in LLMs.

在许多情况下,对对象的文本描述不一定能够清晰地传达信息。例如,对于一个难以识别或无法描述的对象,人们可能难以用语言把它说清楚。在这种情况下,展示一个或几个示例会更加直观、信息量也更大。基于这个想法,一系列工作研究了基于示例的视觉建模,例如图像到图像检索(Yoon等人,2021;Datta等人,2008;Zhang等人,2018)、图像协同分割(Joulin等人,2010;Jerripothula等人,2016)和视觉目标跟踪(Yilmaz等人,2006;Luo等人,2021;Wu等人,2013)。最近,这种策略被表述为视觉提示:不同类型的视觉输入通常被编码为某种统一格式,然后馈送到Transformer架构中,就像在LLMs中那样。

SEEM (Zou et al., 2023b) is one of the representative works that enable visual prompting to a vision model for image segmentation. As shown in Figure 4.15, SEEM differs from the aforementioned SAM and can take visual prompts by drawing points, boxes, and strokes on an image that can be the target image or another reference image. It develops a new module called a visual sampler that can extract visual features from an image according to the locations specified by users. Based on the visual sampler, the model can even take another reference image as input without any additional training. As a result, it shows impressive performance not only for various image segmentation tasks but also for video object segmentation in a zero-shot manner.

SEEM(Zou等人,2023b)是让视觉模型支持视觉提示以进行图像分割的代表性工作之一。如图4.15所示,SEEM与前述的SAM不同,它可以通过在图像上绘制点、框和笔画来获取视觉提示,这个图像可以是目标图像,也可以是另一幅参考图像。它开发了一个称为"视觉采样器"的新模块,可以根据用户指定的位置从图像中提取视觉特征。在视觉采样器的基础上,模型甚至可以在不进行任何额外训练的情况下,将另一幅参考图像作为输入。因此,它不仅在各种图像分割任务中表现出令人印象深刻的性能,而且能以零样本方式完成视频对象分割。

PerSAM (Zhang et al., 2023h) develops a personalized segmentation model on top of SAM and takes one shot as the input. It learns a specific model that takes a source image plus a mask as input and then predicts the mask for a target image. To extract the visual prompts, mask pooling is taken and used as the input tokens to the decoder of PerSAM. It also proposes a way to extract the positive and negative priors based on feature matching to facilitate pre-trained SAM models with comprehensive clues. Like most prompt learning methods in LLMs, a plausible feature for PerSAM is that it can be easily attained by some off-the-shelf models like SAM. SAM-PT (Rajič et al., 2023) further applies this strategy to video object segmentation. Inspired by the spatial prompting in SAM, it exploits a point-tracking system (Harley et al., 2022) to track different points (both positive and negative ones) and then ask SAM to segment the image given the points. It exhibits strong point tracking performance as well as segmentation.

PerSAM(Zhang等人,2023h)在SAM之上开发了一个个性化分割模型,并以单个样本(one shot)为输入。它学习一个特定的模型,该模型将源图像加上掩码作为输入,然后预测目标图像的掩码。为了提取视觉提示,采用掩码池化(mask pooling),并将其结果用作PerSAM解码器的输入token。它还提出了一种基于特征匹配提取正、负先验的方法,为预训练的SAM模型提供更全面的线索。与LLMs中的大多数提示学习方法一样,PerSAM的一个可取之处在于,它可以很容易地基于SAM这样的现成模型得到。

SAM-PT(Rajič等人,2023)将此策略进一步应用于视频对象分割。受SAM中的空间提示启发,它利用了一个点跟踪系统(Harley等人,2022)来跟踪不同的点(正点和负点),然后要求SAM对给定点的图像进行分割。它表现出了强大的点跟踪性能以及分割性能。

Others其他

Some other works combine a wide range of visual prompting types. For example, Painter (Wang et al., 2023i) reformulates different vision tasks (e.g., depth estimation, image segmentation) all as prompting and learns a decoder in an in-context learning manner. The prompts are combinations of raw images and the corresponding dense annotations (e.g., depth or segmentation maps). In contrast, Prismer (Liu et al., 2023f) makes use of many off-the-shelf vision models to extract different information from the raw images and then feed the information to a vision-language model. To facilitate the interplay across multiple modalities, ImageBind (Girdhar et al., 2023) learns a universal alignment among image/video, language, audio and depth. Once the embedding space is learned, it can be used to compose different types of prompts by simply doing the summations.

其他一些工作结合了广泛的视觉提示类型。例如,Painter(Wang等人,2023i)将不同的视觉任务(如深度估计、图像分割)都重新表述为提示,并以上下文学习的方式学习解码器。提示是原始图像和相应的密集标注(例如深度图或分割图)的组合。

相比之下,Prismer(Liu等人,2023f)利用了许多现成的视觉模型来从原始图像中提取不同的信息,然后将这些信息馈送到视觉语言模型中。

为了促进多模态之间的互动,ImageBind(Girdhar等人,2023)学习了图像/视频、语言、音频和深度之间的通用对齐。一旦学习了嵌入空间,就可以通过简单的求和来组合不同类型的提示

4.4.2、In-context Prompting上下文提示

LLMs中的ICL能力使得模型可借助提示而无需更新模型参数,但视觉模型中的ICL能力的探究较少,两个尝试(如Flamingo和Kosmos-1)但只能生成文本输出

The in-context learning capability has been observed in many LLMs such as GPT-3 (Brown et al., 2020), which makes the model more configurable via prompting without any model parameter updates. In contrast, till now, the in-context learning capability for vision models is still less studied. Flamingo (Alayrac et al., 2022) is one of the pioneering works that demonstrate in-context language generation for multi-modal inputs, which is acquired by learning from interleaved image-text pair data. Likewise, Kosmos-1 (Huang et al., 2023b) is another work that takes visual inputs as a foreign language so that the in-context learning ability in LLMs can be naturally translated to multi-modal inputs. However, both methods take multi-modal data as inputs but merely generate texts as outputs. As we discussed earlier, vision tasks require outputs of different types beyond texts. How to endow the in-context learning ability for vision systems is still an open question. Below, we review some recent attempts toward the goal.

许多LLMs(如GPT-3(Brown等人,2020))已被观察到具有上下文学习能力,这使得模型可以通过提示进行配置,而无需更新任何模型参数。相比之下,迄今为止,对视觉模型上下文学习能力的研究仍然较少。

Flamingo(Alayrac等人,2022)是展示针对多模态输入进行上下文语言生成的开创性工作之一,这种能力是通过学习交错的图像-文本对数据获得的。同样,Kosmos-1(Huang等人,2023b)是另一项工作,它将视觉输入视为一种外语,使LLMs中的上下文学习能力可以自然地迁移到多模态输入上。然而,这两种方法都以多模态数据作为输入,但仅生成文本作为输出。正如我们之前讨论的,视觉任务需要文本之外的多种类型的输出。如何赋予视觉系统上下文学习能力仍然是一个悬而未决的问题。下面,我们将回顾一些最近的尝试。

修复进行视觉提示的方法:通过修复输入图像的方式来教导模型预测稠密输出的方法

研究提出了一种通过修复输入图像的方式来教导模型预测稠密输出的方法,使用预训练的模型来编码原始图像并预测掩码区域,以确保模型理解图像上下文。这个方法的目的是让模型学会预测图像中的缺失区域。

Visual prompting via inpainting is proposed in Bar et al. (2022) to teach the model to predict dense outputs, such as edges, masks, depths, etc. as shown in Figure 4.16. Given an input image x ∈ R^{H×W×3} and a binary mask m ∈ {0,1}^{H×W}, an inpainting model is to predict the missing region y = f(x, m). The authors exploit a pre-trained VQ-GAN to encode the original image into discrete tokens and ask another ViT encoder to predict the masked regions. To make sure the model understands the visual "context" in the images, the authors collected a new dataset called the Computer Vision Figures dataset, which consists of 88k images from Arxiv papers. After pre-training, the model is used to predict the content at the bottom-right corner.

Bar等人(2022)提出了通过图像修复进行视觉提示的方法,以教模型预测密集输出,例如边缘、掩码、深度等,如图4.16所示。给定输入图像x ∈ R^{H×W×3}和二值掩码m ∈ {0,1}^{H×W},修复模型用于预测缺失区域y = f(x, m)。作者利用预训练的VQ-GAN将原始图像编码成离散token,并让另一个ViT编码器预测被掩盖的区域。为了确保模型理解图像中的视觉"上下文",作者收集了一个名为"计算机视觉图表数据集"(Computer Vision Figures)的新数据集,其中包含来自Arxiv论文的88,000张图像。在预训练之后,该模型被用于预测右下角的内容。
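下面用一个简短的NumPy示意说明这种"补全式"视觉提示的构造方式:把任务示例与查询图像拼成2×2的网格,并遮住右下角,由修复模型补全该区域作为预测结果(仅为思路示意,尺寸与拼接方式为假设):

```python
import numpy as np

def build_inpainting_prompt(example_input, example_output, query_input):
    """
    Arrange a task example and a query into one 2x2 'figure', masking the bottom-right cell.
    The inpainting model is then asked to fill in the missing cell, which plays the role of
    the prediction (e.g., the segmentation map or depth map of the query image).
    All images are assumed to be HxWx3 uint8 arrays resized to the same size.
    """
    blank = np.zeros_like(query_input)
    top = np.concatenate([example_input, example_output], axis=1)   # input | output
    bottom = np.concatenate([query_input, blank], axis=1)           # query | (to be inpainted)
    canvas = np.concatenate([top, bottom], axis=0)

    h, w = example_input.shape[:2]
    mask = np.zeros(canvas.shape[:2], dtype=np.uint8)
    mask[h:, w:] = 1                                                # 1 marks the masked region
    return canvas, mask

ex_in = np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8)
ex_out = np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8)
query = np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8)
canvas, mask = build_inpainting_prompt(ex_in, ex_out, query)
print(canvas.shape, mask.sum())    # (448, 448, 3), 224*224
```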

Painter→SegGPT :Painter通过预测连续像素输出实现不同任务统一,比如为分割任务用颜色表示不同个体、SegGPT基于Painter去专注图像分割应用

Concurrently, Painter (Wang et al., 2023i) extends a similar idea of visual in-context learning to more diverse datasets and benchmarks. Unlike Bar et al. (2022), it predicts the output in the continuous pixel space instead of discrete tokens. For different tasks, the authors define rules to convert the output spaces into image spaces. For example, it uses different colors to represent different individual instances in the image for the segmentation task. After unifying the input and output format, the authors use vanilla ViT as the encoder and masked image modeling (He et al., 2022a). A follow-up work called SegGPT (Wang et al., 2023j) is built on top of Painter and designed specifically for image segmentation tasks. The pre-trained model can be easily extended for exemplar-based image segmentation tasks.

与此同时,Painter(Wang等人,2023i)将类似的视觉上下文学习思想扩展到更多样的数据集和基准。与Bar等人(2022)不同,它在连续像素空间中预测输出,而不是离散token。对于不同的任务,作者定义了将输出空间转换为图像空间的规则。例如,对于分割任务,它使用不同的颜色来表示图像中不同的个体实例。在统一输入和输出格式之后,作者使用标准ViT作为编码器,并采用掩码图像建模(He等人,2022a)进行训练。后续工作SegGPT(Wang等人,2023j)建立在Painter之上,专门为图像分割任务设计。预训练模型可以很容易地扩展到基于示例的图像分割任务。

Hummingbird:利用目标与示例图像间的注意力机制聚合信息,为密集预测任务匹配示例图像的标签来进行预测。探索了一种新的视觉上下文学习实现机制,即利用示例与目标间的关联关系来传播语义信息

Hummingbird (Balažević et al., 2023) resorts to a different method for in-context visual learning. Instead of using masked modeling, the authors propose to leverage attention across target and source images to aggregate the information. As shown in Figure 4.18, the models take multiple input images (first row) and corresponding semantic label maps (second row). Given a query image, it first finds the nearest neighbor feature locations in the prompt images for the query points and then projects the same matches to the semantic label maps so as to aggregate the label for the target query. This strategy is akin to earlier works that build classification models based on K-nearest-neighbor but differently applied to dense prediction tasks.

Hummingbird(Balažević等人,2023)采用了一种不同的上下文视觉学习方法。作者没有使用掩码建模,而是提出利用目标图像和源图像之间的注意力来聚合信息。如图4.18所示,模型接受多个输入图像(第一行)和相应的语义标签图(第二行)。给定一个查询图像,它首先在提示图像中为查询点找到最近邻的特征位置,然后将这些匹配投影到语义标签图上,从而为目标查询聚合标签。这种策略类似于早期基于K最近邻构建分类模型的工作,但这里被应用于密集预测任务。
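下面是基于最近邻的密集标签聚合思路的一个示意性PyTorch草图(按上文描述的精神实现,k值、温度等均为假设,并非论文的实际代码):

```python
import torch
import torch.nn.functional as F

def knn_label_aggregation(query_feats, prompt_feats, prompt_labels, k=5, temperature=0.1):
    """
    Schematic nearest-neighbor label propagation for dense prediction.
      query_feats:   (Nq, D) features of query-image locations
      prompt_feats:  (Np, D) features of prompt-image locations
      prompt_labels: (Np, C) one-hot semantic labels of the prompt locations
    Returns (Nq, C) aggregated label distributions for the query locations.
    """
    q = F.normalize(query_feats, dim=-1)
    p = F.normalize(prompt_feats, dim=-1)
    sim = q @ p.t()                                    # (Nq, Np) cosine similarities
    topk_sim, topk_idx = sim.topk(k, dim=-1)           # keep only the k nearest prompt locations
    weights = F.softmax(topk_sim / temperature, dim=-1)
    neighbor_labels = prompt_labels[topk_idx]          # (Nq, k, C)
    return (weights.unsqueeze(-1) * neighbor_labels).sum(dim=1)

# toy usage: 100 query pixels, 500 prompt pixels, 21 classes
labels = F.one_hot(torch.randint(0, 21, (500,)), num_classes=21).float()
pred = knn_label_aggregation(torch.randn(100, 256), torch.randn(500, 256), labels)
print(pred.shape)    # (100, 21)
```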

Discussion讨论:未来研究方向=单一模型能够在多模态输入下以上下文学习方式预测不同类型的输出

In-context learning is arguably an appealing feature. On one hand, there are a number of works that attempt to bridge vision with LLM so as to inherit the in-context learning capability such as Flamingo (Alayrac et al., 2022) and Kosmos-1 (Huang et al., 2023b). On the other hand, researchers resort to pure vision-based in-context learning to address vision-specific tasks such as image segmentation, depth estimation, etc. Thus far, there is no single model that can take multi-modal inputs and predict different types of outputs as well in an in-context learning manner, which may render a promising future direction along this line.

上下文学习无疑是一个吸引人的特性。一方面,有许多工作试图将视觉与LLMs连接起来,以继承上下文学习的能力,例如Flamingo(Alayrac等人,2022)和Kosmos-1(Huang等人,2023b)。另一方面,研究人员则借助纯粹的基于视觉的上下文学习来解决视觉特定任务,例如图像分割、深度估计等。到目前为止,还没有单一模型可以以上下文学习的方式接受多模态输入并预测不同类型的输出,这可能是未来的一个有前途的方向。

4.5、Summary and Discussion总结与讨论

视觉和语言间存在4大固有差异:数据标记处理、数据标签、数据多样性和存储成本

In the end, an illustrative summary of the works that have been covered in this chapter is shown in Figure 4.19. There is a clear trend in the vision community to build open-world, unified and interactive vision models. Nevertheless, there are still some intrinsic differences between vision and language. First, vision differs from language in that it captures the physical world with raw signals. We need to develop some sophisticated tokenization methods to compress the raw data into compact "tokens". In the language domain, this can be easily done by using some well-established heuristic tokenizers (Sennrich et al., 2016). Second, unlike language, vision data itself is not labeled and thus difficult to convey information or knowledge. It always requires human labor to annotate the visual contents in either a semantic or spatial manner. Third, language data is homogeneous while vision data and tasks are heterogeneous. Last but not least, storing vision data is much more costly than language data. For example, GPT-3 consumes 45 TB of training data, while the ImageNet dataset, which contains 1.3M images, already costs hundreds of gigabytes of storage. When it comes to video data like Howto100M (Miech et al., 2019), the storage cost already exceeds that of the training corpus for GPT-3. All these differences cast some open questions that need to be addressed in the vision community, detailed below.

最后,本章所涵盖的作品的说明性总结如图4.19所示。构建开放世界、统一、互动的视觉模型在视觉界有明显的趋势。然而,视觉和语言之间仍然存在一些固有的差异

首先,视觉不同于语言,因为它通过原始信号捕捉物理世界。我们需要开发一些复杂的标记方法,将原始数据压缩成紧凑的“标记”。在语言领域,这可以通过使用一些完善的启发式标记器轻松完成(Sennrich et al., 2016)。

其次,与语言不同,视觉数据本身没有标签,因此很难传达信息或知识。通常需要人工以语义或空间的方式对视觉内容进行标注。

第三,语言数据是同质的,而视觉数据和任务是异质的。

最后,但并非最不重要的是,存储视觉数据比语言数据要昂贵得多。例如,GPT-3使用45TB的训练数据,而仅包含130万张图像的ImageNet数据集就已占用数百GB的存储。当涉及到像Howto100M(Miech等人,2019)这样的视频数据时,存储成本已经超过了GPT-3训练语料库的存储成本。

所有这些差异都引发了一些需要在视觉社区中解决的开放性问题,如下所述。

视觉和语言间固有差异带来的三大探究:域外计算机视觉(无法涵盖世界全貌)、视觉中的规模定律(是否也会有类似LLMs的涌现能力)、以视觉为中心的模型(继续扩大模型规模还是采用适度规模去组合LLMs)

>> Computer vision in the wild. Due to the heterogeneous nature, the current vision data we use for training models can hardly cover the full picture of the physical world. Despite the effort in building open-set vision models, we are still facing significant challenges in coping with novel or long-tail scenarios.

>> Scaling law in vision. As discussed in Kaplan et al. (2020); Hoffmann et al. (2022), the performance of large language models improves smoothly with the increase of model size, data scale, and amount of computes. As the scale increases, some intriguing emerging properties are further observed in LLMs. In contrast, it is still not clear what is the right path to scale vision models, not to mention the emerging properties in such models.

>> Vision-centric or language-centric models. Currently, the boundary between vision and language is gradually dismissed. However, due to intrinsic differences between vision and language, it is still not clear whether we should further scale up the vision models and integrate language models or the combination of moderate vision models and LLMs is sufficient to address most (if not all) of the problems.

>> 域外计算机视觉。由于视觉数据的异质性,我们目前用于训练模型的视觉数据几乎无法涵盖物理世界的全貌。尽管在构建开放集视觉模型方面已付出诸多努力,我们在应对新颖或长尾场景方面仍然面临重大挑战。

>> 视觉中的规模定律。如Kaplan等人(2020)和Hoffmann等人(2022)所讨论的,大型语言模型的性能随着模型规模、数据规模和计算量的增加而平稳提高。随着规模的扩大,LLMs中还观察到一些有趣的新特性。相比之下,仍然不清楚扩大视觉模型的正确路径是什么,更不用说这些模型中的新特性了。

>> 以视觉为中心或以语言为中心的模型。目前,视觉和语言之间的边界逐渐消失。然而,由于视觉和语言之间的固有差异,仍然不清楚我们是否应该进一步扩大视觉模型并集成语言模型,或者将适度的视觉模型与LLMs的组合已足以解决大多数(如果不是所有)问题。

With that being said, we are close yet still far away from an intelligent vision system that can perceive the world like humans. We hope the literature review in this chapter could provide an overall picture of the existing efforts, and inspire the pursuit of next-generation vision models.

总之,我们离能像人类一样感知世界的智能视觉系统已经很近,但仍然有很长的路要走。我们希望本章的文献综述能够提供现有工作的整体图景,并激发对下一代视觉模型的追求。

5、Large Multimodal Models: Training with LLM大型多模态模型:与LLM一起训练

In this chapter, we comprehensively explore large multimodal models (Alayrac et al., 2022; OpenAI, 2023a). We begin with Section 5.1 to delve into the background of such models, with the focus on the basics of image-to-text generative models and their representative model instances in various case studies. We also discuss the state-of-the-art OpenAI Multimodal GPT-4 (OpenAI, 2023a) and identify the existing research gaps in the field. To better understand the process of instruction tuning in large language models, Section 5.2 examines its importance and its role in self-instruct and open-source LLMs. Moving forward, we explore instruction-tuned large multimodal models in Section 5.3, shedding light on their basics, significance and applications. Additionally, Section 5.4 touches upon advanced topics in the realm of multimodal models to provide a deeper understanding of the subject. Finally, we assess the current progress in the field by evaluating how close we are to achieving the OpenAI Multimodal GPT-4 in Section 5.5, a major milestone in AI research.

在本章中,我们全面探讨了大型多模态模型(Alayrac等人,2022;OpenAI,2023a)。我们从

第5.1节开始深入探讨这些模型的背景,重点关注图像到文本生成模型的基础知识以及各种案例研究中的代表性模型实例。我们还讨论了最新的OpenAI Multimodal GPT-4(OpenAI,2023a)并确定了该领域的现有研究差距

为了更好地理解大型语言模型中的指令调优过程,第5.2节探讨了其重要性以及在自我指导和开源LLMs中的作用

接下来,我们在第5.3节中探讨了经过指令调优的大型多模态模型,阐明了它们的基础知识、重要性和应用。

此外,第5.4节涉及多模态模型领域的高级主题,以便更深入地了解这一主题。最后,我们在第5.5节中通过评估我们离实现OpenAI Multimodal GPT-4还有多远,来衡量该领域的当前进展,这是AI研究的一个重要里程碑。

5.1、Background 背景

5.1.1、Image-to-Text Generative Models图像到文本生成模型

目前LMM主要是一种图像到文本的生成模型:将图像作为输入/输出文本序列+采用编码器-解码器架构(图像编码器提取视觉特征/语言模型解码文本序列)+训练目标(通过文本自回归损失进行训练)

LMMs in their current form are primarily image-to-text generative models, which take images as input and output a text sequence. One example is illustrated in Figure 5.1 (a) Left. All of the model variants share a very similar model architecture and training objective.

>> Model Architecture. As illustrated in Figure 5.1 (a) Right, the model typically consists of an image encoder to extract visual features, and a language model to decode the text sequence. The vision and language modalities can be optionally connected by a trainable connection module. The image encoder and language model can be either trained from scratch or initialized from pre-trained models.

>> Training Objective. As illustrated in Figure 5.1 (b), it typically employs an auto-regressive loss on the output text tokens. For the attention map in the Transformers (Vaswani et al., 2017), image tokens can attend to each other, and the current text token attends to all image tokens and the previous text tokens.

当前形式下的LMM主要是一种图像到文本生成模型,它将图像作为输入,并输出文本序列。一个示例如图5.1(a)左所示。所有模型变体都共享非常相似的模型架构和训练目标。

>> 模型架构。如图5.1(a)右所示,该模型通常包括一个用于提取视觉特征的图像编码器,以及一个用于解码文本序列的语言模型。视觉和语言模态可以选择性地通过可训练的连接模块连接。图像编码器和语言模型可以从头开始训练,也可以从预训练模型初始化。

>> 训练目标。如图5.1(b)所示,通常会在输出文本标记上使用自回归损失。在Transformers的注意力映射中,图像标记可以相互关注,当前文本标记会关注所有图像标记和前面的文本标记。
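下面用一个简短的PyTorch示意来说明上述训练目标:图像token之间双向可见,文本token对图像token全部可见、对之前的文本token因果可见,并仅在文本token上计算自回归损失(仅为思路示意,并非任何具体模型的实现):

```python
import torch
import torch.nn.functional as F

def lmm_attention_mask(num_image_tokens, num_text_tokens):
    """
    Build the attention pattern described above: image tokens attend to each other
    (bidirectional), while each text token attends to all image tokens and to the
    text tokens before it (causal). Returns a boolean mask, True = allowed to attend.
    """
    n = num_image_tokens + num_text_tokens
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:num_image_tokens, :num_image_tokens] = True                 # image <-> image
    text = torch.arange(num_text_tokens)
    causal = text.unsqueeze(1) >= text.unsqueeze(0)                   # lower-triangular
    mask[num_image_tokens:, :num_image_tokens] = True                 # text -> image
    mask[num_image_tokens:, num_image_tokens:] = causal               # text -> earlier text
    return mask

def autoregressive_text_loss(logits, text_targets):
    """Next-token prediction loss on the text part only (logits: (B, T, V), targets: (B, T))."""
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.shape[-1]), text_targets[:, 1:].reshape(-1)
    )

print(lmm_attention_mask(3, 4).int())
```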

5.1.2、Case Studies案例研究

LMM的网络结构的不同实践(在架构和预训练策略上有差异+但共同遵循自回归训练目标):基于图像-文本对训练的GIT、BLIP2、基于交错的图像-文本序列训练的Flamingo

GIT从零开始训练语言模型,而BLIP2冻结了预训练的图像编码器和语言模型权重,并训练轻量级的Q-former模块来连接它们。

Flamingo,它通过添加新的架构组件来连接冻结的预训练图像编码器和语言模型,并在训练过程中使用Perceiver Sampler模块降低计算复杂性,并使用Gated Transformer模块稳定训练。 Flamingo通过简单的少量样本学习可以直接适应视觉任务,无需额外的任务特定调整。

We use some prominent LMMs as examples to illustrate how the aforementioned network archi- tecture can be instantiated in different models, while maintaining the same auto-regressive training objective.

我们使用一些杰出的LMMs作为示例,以说明如何在不同模型中实例化上述网络架构,同时保持相同的自回归训练目标。

Case study I: LMM trained with image-text pairwise instances. Most LMMs are trained on a large number of image-text pairs, where each training sample is a pair. GIT (Wang et al., 2022a) and BLIP2 (Li et al., 2023h) are two large models that achieve state-of-the-art (SoTA) performance on many datasets. The comparisons are shown in Figure 5.2(a). GIT initializes the image encoder with the contrastively pre-trained Florence model (Yuan et al., 2021), and trains the language model from scratch. On the other hand, BLIP2 freezes the weights of a pre-trained image encoder and a pre-trained language model, while training a lightweight Q-former module to connect the image encoder and the language model.

案例研究I:使用图像-文本成对实例训练的LMM。大多数LMMs是在大量图像-文本对上训练的,其中每个训练样本都是一对。GIT(Wang等人,2022a)和BLIP2(Li等人,2023h)是两个在许多数据集上都取得了最先进(SoTA)性能的大型模型。比较如图5.2(a)所示。GIT使用对比性预训练的Florence模型(Yuan等人,2021)初始化图像编码器,并从头开始训练语言模型。另一方面,BLIP2冻结了预训练的图像编码器和预训练的语言模型的权重,同时训练了一个轻量级的Q-former模块,以连接图像编码器和语言模型。

Case study II: LMM trained with interleaved image-text sequence instances. We use Flamingo (Alayrac et al., 2022) as an example, shown in Figure 5.2(b). It connects the frozen pre-trained image encoder and language model by adding novel architectural components in between. Specifically, the Perceiver Sampler module helps reduce computational complexity, and the Gated Transformer module helps to stabilize training during the initial stage. Flamingo is trained on a mixture of complementary large-scale multimodal data coming only from the web, without using any data annotated for machine learning purposes. After this training is done, Flamingo can be directly adapted to vision tasks via simple few-shot learning without any additional task-specific tuning.

案例研究II:使用交错的图像-文本序列实例进行训练的LMM。我们以Flamingo(Alayrac等人,2022)为例,如图5.2(b)所示。它通过在冻结的预训练图像编码器和语言模型之间添加新的架构组件来连接它们。具体来说,Perceiver Sampler模块有助于降低计算复杂性,而Gated Transformer模块有助于在初始阶段稳定训练。Flamingo是在只来自网络的互补大规模多模态数据的混合数据上进行训练的,而不使用任何为机器学习目的注释的数据。完成这一训练后,Flamingo可以通过简单的少样本学习直接适应视觉任务,而无需进行任何额外的任务特定调整。

Multimodal in-context-learning多模态上下文学习:Flamingo通过提供少量示例实现跨任务迁移学习,这种引人注目的上下文学习能力使Flamingo成为多模态领域中的GPT-3时刻

Besides the SoTA performance on dozens of academic benchmarks, probably the most appealing aspect of Flamingo is the emerging property: Multimodal In-Context-Learning. Specifically, given a couple of image-text pairs as examples, Flamingo can zero-shot task transfer to unseen problems, such as solving visual math problems. This means Flamingo can tackle a number of difficult problems with just a handful of task-specific examples, without any additional training required. For example in Figure 5.3, two new tasks are presented to Flamingo. The top row provides two image-text pairs as the context in the prompt, where the text describes the name of the animal in the image, followed by the geographical information of the animal. Flamingo is able to understand the patterns presented in the examples, and output the corresponding information for a new image. In the bottom row, the text first shows the optical character recognition (OCR) result of the image, followed by the answer to the math problem. Flamingo follows the task instruction illustrated in the multimodal context, and outputs the correct answer for a new math problem in the third image. This intriguing in-context learning capability makes Flamingo the GPT-3 moment (Brown et al., 2020) in the multimodal domain.

除了在数十个学术基准上表现出的SoTA性能之外,Flamingo最吸引人的特性可能是其新兴属性:多模态上下文学习。具体而言,给定几个图像-文本对作为示例,Flamingo可以零样本地将任务迁移到未见过的问题上,例如解决视觉数学问题。这意味着Flamingo仅凭少量任务特定示例就可以解决许多困难的问题,无需任何额外的训练。例如,在图5.3中,向Flamingo呈现了两个新任务。顶部一行提供了两个图像-文本对作为提示中的上下文,文本描述了图像中动物的名称,然后是该动物的地理信息。Flamingo能够理解示例中呈现的模式,并为新图像输出相应的信息。在底部一行,文本首先显示了图像的光学字符识别(OCR)结果,然后是数学问题的答案。Flamingo遵循多模态上下文中所示的任务指令,并为第三张图像中的新数学问题输出正确答案。这种有趣的上下文学习能力使Flamingo成为多模态领域的GPT-3时刻(Brown等人,2020)。
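下面用一小段Python示意多模态上下文学习提示的组织方式:将若干(图像, 文本)示例与查询图像交错排列,让模型接着生成文本;其中的占位图像路径与数据结构均为假设,并非Flamingo的真实接口:

```python
# Illustrative construction of a multimodal few-shot prompt: interleave (image, text)
# example pairs followed by a query image, and let the model complete the text.
# The image paths and the segment structure below are hypothetical placeholders.

few_shot_examples = [
    ("flamingo.jpg", "This is a flamingo. They are found in the Caribbean and South America."),
    ("shiba.jpg", "This is a shiba inu. They are very popular in Japan."),
]
query_image = "panda.jpg"

prompt_segments = []
for image_path, text in few_shot_examples:
    prompt_segments.append({"image": image_path})
    prompt_segments.append({"text": text})
prompt_segments.append({"image": query_image})
prompt_segments.append({"text": "This is"})   # the model continues from here

print(prompt_segments)
```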

5.1.3、OpenAI Multimodal GPT-4 and Research Gaps—OpenAI Multimodal GPT-4和研究差距:GPT-4引入了多模态输入的能力,这引发了如何在多模态空间进行指导和对齐研究的问题

揭示了当前LMM与GPT-4在多模态指令遵循和对齐上的差距:GPT-3(体现上下文学习和链式推理)→ChatGPT和InstructGPT(凸显指令跟随与人类对齐的重要性)→GPT-4(突破语言范畴支持视觉输入)

In March 2023, OpenAI released GPT-4 (OpenAI, 2023a), with impressive capability in visual understanding and reasoning. Though the model details are not revealed, there is no doubt that GPT-4 enables many new scenarios, based on the examples highlighted in the technique report. For instance, two popular visual examples are illustrated in Figure 5.4. The first one identifies the uncommon visual region and exhibits strong complex reasoning performance. The second one recognizes text in the image and captures the meme across image and text. For a while, the research community had no clue how this new ability is achieved (probably because they are not tied to any established academic tasks/datasets), but all agreed that these are exciting results. It naturally raises a question: how can we build Multimodal GPT-4 like models?

2023年3月,OpenAI发布了GPT-4(OpenAI,2023a),具有令人印象深刻的视觉理解和推理能力。尽管没有透露模型的细节,但毫无疑问,从技术报告中突出展示的示例来看,GPT-4能够支持许多新场景。例如,图5.4展示了两个流行的视觉示例。第一个识别出图中不寻常的视觉区域,并展示了强大的复杂推理能力。第二个识别了图像中的文本,并理解了图文结合所表达的梗(meme)。有一段时间,研究界不清楚这种新能力是如何实现的(可能是因为它们没有与任何已建立的学术任务/数据集相关联),但大家都认同这些是令人兴奋的结果。这自然引出了一个问题:我们如何构建类似Multimodal GPT-4的模型?

To answer it, we start to review the big models from OpenAI, by highlighting the most appealing properties for each model in Figure 5.5. There are several key observations: (i) GPT-2 (Radford et al., 2019) is the auto-regressive counterpart in the BERT era (Devlin et al., 2019) for the pre-train-then-finetune paradigm. Compared with GPT-2, GPT-3 (Brown et al., 2020) is a 175B model trained on web-scale text corpus, which exhibits two emerging properties with a frozen model: in-context-learning (Brown et al., 2020) and chain-of-thoughts (CoT) reasoning (Wei et al., 2022a). This means, without any additional training, the model can tackle a wide range of new problems with just a few task-specific examples and by properly prompting it step-by-step, respectively. It further leads to the modeling paradigm from task-specific finetuning to prompting frozen models, where the latter shows higher generalizability and lower adaptation cost in task transfer. (ii) ChatGPT and InstructGPT (Ouyang et al., 2022) show the importance of instruction-following and alignment with human intents for LLMs, by finetuning the base language model GPT-3/GPT-3.5 on high quality instruction-following data, and improving them with a reward model via reinforcement learning with human feedback. (iii) GPT-4 not only improves the language ability of previous models, but also allows visual signals as additional input for visual understanding and reasoning. We see that the newer generation model maintains/improves the existing properties of the previous ones, and enable new properties.

In other words, from GPT-3 to GPT-4, we see two new properties: instruction-following and multi-modal input. This reveals the gap between existing LMMs (e.g., Flamingo) and multimodal GPT-4: how to perform instruction-following and alignment research in the multimodal space, which is the focus of this chapter.

为了回答这个问题,我们开始回顾OpenAI的大型模型,突出每个模型在图5.5中最吸引人的特性。有几个关键观察:

(i)GPT-2(Radford等人,2019)是BERT时代(Devlin等人,2019)预训练-微调范式下的自回归对应物。与GPT-2相比,GPT-3(Brown等人,2020)是一个在网络规模文本语料库上训练的175B模型,它在模型冻结的情况下展现出两个新兴属性:上下文学习(Brown等人,2020)和思维链推理(CoT)(Wei等人,2022a)。这意味着,无需任何额外训练,模型分别只需少量任务特定示例、或通过恰当的逐步提示,就可以解决各种新问题。这进一步促使建模范式从任务特定微调转向提示冻结模型,后者在任务迁移中具有更高的泛化能力和更低的适应成本。
(ii)ChatGPT和InstructGPT(Ouyang等人,2022)通过在高质量的指令遵循数据上微调基础语言模型GPT-3/GPT-3.5,并借助奖励模型通过人类反馈强化学习加以改进,展示了LLMs的指令遵循以及与人类意图对齐的重要性。

(iii)GPT-4不仅改进了先前模型的语言能力,还允许将视觉信号作为额外输入进行视觉理解和推理。我们看到新一代模型维护/改进了以前模型的现有属性,并启用了新属性。

换句话说,从GPT-3到GPT-4,我们看到了两个新属性:指令遵循和多模态输入。这揭示了现有LMMs(例如Flamingo)与多模态GPT-4之间的差距:如何在多模态空间中执行指令遵循和一致性研究,这是本章的重点。

5.2、Pre-requisite: Instruction Tuning in Large Language Models先决条件:大型语言模型中的指令调优

需要在训练中引入任务指令以提升模型通用性:传统的NLP数据表示使用seq2seq格式,但没有明确的任务指令,这导致模型难以在新任务上进行零样本迁移

Note that instruction-following is a notion originated in NLP. To study the intuition behind it and have a full picture of its history, we first revisit instruction tuning with LLMs.

Traditional language data. As a typical data instance in NLP, sequence-to-sequence (seq2seq) representation is widely adopted for many language tasks: each data instance consists of two parts: one sequence as the input and another sequence as the output. We provide two examples in Figure 5.6 (a). Without any task instruction specified, we know they are translation and summarization tasks, respectively.

This seq2seq representation is also the conventional data format in NLP research, where task instructions are implicit. Based on each data domain, individual models are trained. Or sometimes one model is trained with multi-task objectives over multiple data domains without specifying the task instructions. For both cases, the models are hard to generalize to new tasks in a zero-shot fashion, as they are not trained to understand task instructions, thus cannot distinguish and generalize what task to perform during testing time.

请注意,遵循指令是一个源自自然语言处理(NLP)的概念。为了研究其背后的直觉并全面了解其历史,我们首先重新审视了大型语言模型中的指令调优。

传统语言数据。作为NLP中的典型数据实例,序列到序列(seq2seq)表示广泛用于许多语言任务:每个数据实例包含两个部分:一个序列作为输入,另一个序列作为输出。我们在图5.6(a)中提供了两个示例。在没有指定任何任务指令的情况下,我们知道它们分别是翻译和摘要任务。

这种seq2seq表示也是NLP研究中的传统数据格式,其中任务指令是隐式的。通常会针对每个数据域分别训练单独的模型,或者有时在多个数据域上以多任务目标训练一个模型,但不指定任务指令。对于这两种情况,模型都很难以零样本的方式推广到新任务,因为它们没有经过训练来理解任务指令,因此在测试时无法区分并泛化到要执行的任务。

指令语言数据:将任务指令明确加入模型训练,使模型能够通过任务组合在推理阶段执行多个任务,从而实现在未经训练的情况下解决新任务

Instructional language data.

Recently, researchers have started to explicitly add task instructions into the model training, as shown in Figure 5.6 (b). Interestingly, the task instruction of most NLP tasks can be expressed in natural language as well. It leads to a new data format: instruction-input-output triplets. Based on the new format, one single model can be trained to perform multiple tasks, each with its specific instructions. Since models have observed many task instructions and many instances for each task during training, it is more natural and easier for them to generalize to new tasks by task composition in the inference stage.

For example, in the evaluation stage, a new task that requires both summarization and translation is provided in Figure 5.6 (c). Though the model has never seen this new task during training, it has observed each individual task, and learns to perform the new task by composing them. Note that we humans are always creating new tasks in our daily life, and presumably these new tasks would never have been observed by models. It is thus appealing if a model is able to solve thousands of new tasks in the wild without training. This is partially why ChatGPT is becoming popular and prevalent so quickly.

指令语言数据。

最近,研究人员开始明确将任务指令添加到模型训练中,如图5.6(b)所示。有趣的是,大多数NLP任务的任务指令也可以用自然语言表达。这导致了一种新的数据格式:指令-输入-输出三元组。基于新格式,可以训练一个单一模型来执行多个任务,每个任务都有其具体的指令。由于模型在训练过程中观察到了许多任务指令和每个任务的许多实例,因此它们更容易在推理阶段通过任务组合来推广到新任务。

例如,在评估阶段,图5.6(c)提供了一个同时需要摘要和翻译的新任务。尽管模型在训练期间从未见过这个新任务,但它分别见过这两个基础任务,因此能够通过组合它们来完成新任务。请注意,我们人类在日常生活中总是在创建新任务,而且可以假设这些新任务从未被模型观察过。因此,如果一个模型能够在没有训练的情况下解决成千上万个真实世界中的新任务,那将是非常吸引人的。这也是ChatGPT迅速流行和普及的部分原因。
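下面用简短的Python字典示例对比上述两种数据格式(内容仅为示意):

```python
# Contrast between the two data formats discussed above (contents are illustrative examples).

# (a) traditional seq2seq instances: the task is implicit in the data domain
seq2seq_examples = [
    {"input": "I love Paris.", "output": "J'aime Paris."},                          # translation
    {"input": "<a long news article> ...", "output": "<a one-sentence summary>"},   # summarization
]

# (b) instruction-input-output triplets: the task is stated explicitly,
# so one model can be trained on many tasks and composed into new ones at inference time
instruction_examples = [
    {
        "instruction": "Translate the input from English to French.",
        "input": "I love Paris.",
        "output": "J'aime Paris.",
    },
    {
        "instruction": "Summarize the input article in one sentence, then translate the summary to French.",
        "input": "<a long news article> ...",
        "output": "<résumé en une phrase>",
    },
]
```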

5.2.1、Instruction Tuning指令调优:探索如何通过指令调优来使大型语言模型(LLMs)能够遵循自然语言指令并完成现实世界的任务

两大方法:使用人类提供的任务指令和反馈进行微调使用公共基准和数据集进行有监督微调

How can we collect a diverse set of high-quality instruction-following data? There are two general schemes. One is through human-human interaction, where humans (task providers) provide the annotation statement and requirements, based on which another group of humans complete the annotation tasks. Such a scheme is typically costly and time consuming. The other scheme is via human-machine interaction, where similarly humans provide the annotation statement and requirements, but it is now the machines/models that complete the annotation tasks.

To enable LLMs to follow natural language instructions and complete real-world tasks, researchers have been exploring methods to instruction-tune LLMs. This is implemented by either finetuning the model on a wide range of tasks using human-annotated prompts and feedback (Ouyang et al., 2022), or supervised finetuning using public benchmarks and datasets augmented with manually or automatically generated instructions (Wang et al., 2022f). Among these methods, Self-instruct tuning (Wang et al., 2022e) is a simple and effective method of aligning LLMs to human intent, by learning from instruction-following data generated by SoTA LLMs. It turns out that the line of instruction-tuning research has produced effective means to improve the zero-shot and few-shot generalization abilities of LLMs. Self-instruct leverages the in-context-learning ability of LLMs. The pipeline is illustrated in Figure 5.7. Humans create a few examples (i.e., seed examples) as the context, and ask an LLM such as GPT-3 or GPT-4 to create more instructions and responses that follow the requirements stated in the prompt. The machine-generated instruction-following data can be further selected to construct the prompt for in-context-learning in the next data generation iteration. The procedure iterates until a given number of samples are collected. Due to the relatively lower cost and higher response speed of API calls (compared with human annotations), self-instruct is becoming more favorable in the research community.

我们如何收集多样化的高质量指令遵循数据?有两种一般方案。一种是通过人与人之间的互动,其中人类(任务提供者)提供注释语句和要求,然后另一组人类完成注释任务。这种方案通常成本高且耗时。另一种方案是通过人与机器之间的互动,类似地,人类提供注释语句和要求,但现在是机器/模型完成注释任务。

为了使LLMs能够遵循自然语言指令并完成现实世界的任务,研究人员一直在探索指导调整LLMs的方法。这可以通过使用人工注释的提示和反馈(Ouyang等人,2022)在广泛的任务上微调模型来实现,也可以通过使用手动或自动生成的指令扩充的公共基准和数据集进行监督微调(Wang等人,2022f)来实现。在这些方法中,自我指导调整(Self-instruct tuning)(Wang等人,2022e)是一种简单有效的方法,通过学习由SoTA LLMs生成的指令遵循数据来使LLMs与人的意图一致。流程如图5.7所示。人们创建一些示例(即种子示例)作为上下文,并要求像GPT-3或GPT-4这样的LLMs创建更多的指令和响应,以遵循提示中的要求。机器生成的指令遵循数据可以进一步选择以构建在下一个数据生成迭代中的上下文学习。该过程迭代直至收集到一定数量的样本。由于与人类注释相比,API调用的成本相对较低,响应速度更快,因此自我指导调整在研究界越来越受欢迎。
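下面是自我指导(self-instruct)数据生成流程的一个最小Python示意:其中generate_fn代表对教师LLM(如GPT-3.5/GPT-4)API的调用,这里用占位函数代替;过滤步骤也简化为简单去重,仅用于说明流程:

```python
import json
import random

def self_instruct(seed_tasks, generate_fn, num_target=1000, num_in_context=3):
    """
    Minimal sketch of the self-instruct loop described above. `generate_fn` stands for a call
    to a strong teacher LLM (e.g., GPT-3.5/GPT-4 behind an API) and is a placeholder here;
    the filtering step is reduced to simple de-duplication for illustration.
    """
    pool = list(seed_tasks)
    while len(pool) < num_target:
        demos = random.sample(pool, k=min(num_in_context, len(pool)))
        prompt = (
            "You are asked to come up with a new task. Here are some examples:\n"
            + "\n".join(json.dumps(d) for d in demos)
            + "\nWrite one new example in the same JSON format with keys "
              '"instruction", "input", "output".'
        )
        try:
            candidate = json.loads(generate_fn(prompt))
        except (json.JSONDecodeError, TypeError):
            continue                                   # discard malformed generations
        if all(candidate.get("instruction") != d.get("instruction") for d in pool):
            pool.append(candidate)                     # keep only novel instructions
    return pool

# toy run with a dummy "teacher" that returns a fresh JSON task each call
dummy = lambda prompt: json.dumps(
    {"instruction": f"Task {random.random():.6f}", "input": "", "output": ""}
)
seeds = [{"instruction": "Translate the input to French.", "input": "Hello", "output": "Bonjour"}]
data = self_instruct(seeds, dummy, num_target=5)
print(len(data))
```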

5.2.2、Self-Instruct and Open-Source LLMs自我指导和开源LLMs

开源社区涌现出众多开源LLMs:ChatGPT和GPT-4的成功为通过指令调优来改进开源LLMs提供了巨大机会

开源LLMs的崛起:开源社区涌现出许多开源的大型语言模型(LLMs)。

ChatGPT和GPT-4的成功:ChatGPT和GPT-4的成功为改进开源LLMs提供了重要机会。

LLaMA项目:LLaMA是一系列开源LLMs,性能匹配专有LLMs如GPT-3,并采用Self-instruct调整方法来教导LLaMA遵循指令。

指令调优研究:研究人员进行了多次尝试,使用不同数量和质量的指令遵循样本来进行指令调优研究。

前沿研究:一些前沿研究使用GPT-4作为教师模型来生成指令遵循样本的回应,并不断改进指令遵循数据以提高开源LLMs在对话中的对齐质量。

综合比较:进行了多项综合比较,以评估LLMs在多个基准测试上的性能。

The open-source community has witnessed a surge of open LLMs. The success of ChatGPT (OpenAI, 2022) and GPT-4 (OpenAI, 2023a) offers tremendous opportunities to improve open-source LLMs using instruction-tuning. Figure 5.8 compares several open-source instruction-tuned LLMs. LLaMA (Touvron et al., 2023) is a series of open-sourced LLMs, which match the performance of proprietary LLMs such as GPT-3. To teach LLaMA to follow instructions, Self-instruct tuning has been quickly adopted, given its superior performance and low cost. For example, to name a few early attempts in this line of research, Stanford Alpaca (Taori et al., 2023) uses 52K instruction-following samples generated by GPT-3.5, while Vicuna (Vicuna, 2023) uses around 500K high-quality instruction-following samples (150K conversations) between users and GPT (ShareGPT, 2023). To advance the SoTA of instruction-tuning for LLMs, Peng et al. (2023a) uses GPT-4 as the teacher to generate the responses to the Alpaca instructions. Many follow-up works (Zhang et al., 2023i) improve the instruction-following data to enable the open LLMs with better alignment quality in chat. For a comprehensive review, we refer the readers to a recent paper (Wang et al., 2023k), where an LLM, Tulu, is trained on a mix of several high-quality instruction data, and comprehensive comparisons are conducted across multiple benchmarks.

开源社区见证了开放LLMs的激增。ChatGPT(OpenAI,2022)和GPT-4(OpenAI,2023a)的成功为使用指令调优改进开源LLMs提供了巨大机会。图5.8比较了几个开源的经过指令调优的LLMs。LLaMA(Touvron等人,2023)是一系列开源LLMs,其性能可与GPT-3等专有LLMs相匹敌。为了教LLaMA遵循指令,自我指导调整因其出色的效果和较低的成本而被迅速采用。例如,这方面的早期尝试包括:Stanford Alpaca(Taori等人,2023)使用由GPT-3.5生成的5.2万条指令遵循样本,而Vicuna(Vicuna,2023)使用了约50万条用户与GPT之间的高质量指令遵循样本(15万段对话)(ShareGPT,2023)。为了进一步推进LLMs指令调优的最新水平,Peng等人(2023a)使用GPT-4作为教师,为Alpaca指令生成回复。许多后续工作(Zhang等人,2023i)改进了指令遵循数据,使开源LLMs在对话中具有更好的对齐质量。如需全面的综述,读者可参考最近的一篇论文(Wang等人,2023k),其中LLM模型Tulu在多个高质量指令数据的混合上进行训练,并在多个基准上进行了全面比较。

Quick assessment of LLM chatbotsLLM聊天机器人的快速评估:开源模型的表现已接近当下最先进的私有模型,基于开源的LLaMA家族+Vicuna-Instructions-80数据集+GPT-4进行评分

To study the quality of LLM Chatbots, we consider Vicuna-Instructions-80 (Vicuna, 2023), a dataset with 80 questions that baseline models (Touvron et al., 2023) find challenging. Besides generic instructions, the instructions fall into 8 categories, including knowledge, math, Fermi, counterfactual, roleplay, generic, coding, writing and common-sense. To quantitatively compare the performance, GPT-4 is used to rate the response from score 1 to 10 for any two given chatbots, then compute the relative score. Surprisingly, it turns out this evaluation metric is quite consistent across different settings. The open-source LLaMA family seems to perform closely to SoTA proprietary chatbots.

为了研究LLM聊天机器人的质量,我们采用Vicuna-Instructions-80(Vicuna,2023)这个数据集,其中包含80个基线模型(Touvron等人,2023)认为具有挑战性的问题。除通用指令外,这些指令还分为8个类别:知识、数学、费米问题、反事实、角色扮演、编程、写作和常识。为了定量比较性能,我们使用GPT-4对任意两个聊天机器人的回应分别打1到10分,然后计算相对分数。令人惊讶的是,这个评估指标在不同设置下相当一致。开源的LLaMA系列在性能上似乎已接近SoTA专有聊天机器人。
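下面用一个极简的Python片段示意这种相对评分的计算方式。这里按两个模型在同一组问题上得分之和的比值进行聚合,聚合细节以原论文设置为准;分数均为虚构示例,仅用于演示。

```python
def relative_score(scores_a, scores_b):
    """给定 GPT-4 对两个聊天机器人在同一组问题上的打分(1~10),计算 A 相对于 B 的分数。"""
    assert len(scores_a) == len(scores_b), "两组分数应一一对应同一组问题"
    return sum(scores_a) / sum(scores_b)   # 按总分之比聚合,仅为示意

# 用法示例(分数为虚构数据)
open_model  = [8, 9, 7, 8]
proprietary = [9, 9, 8, 9]
print(f"relative score = {relative_score(open_model, proprietary):.1%}")
```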

Further discussions进一步的讨论:三个值得关注的方向=以数据为中心的AI、开源LLMs与专有LLMs差距之辩、基础LLMs的发展

There are several important topics on LLMs that we have not covered in this chapter, but are worthwhile future exploring.

>> Data-centric AI. We emphasize that the development of these open-source LLMs is data-centric (Mazumder et al., 2022), rather than model-centric, and we hope the readers keep this perspective in mind when discussing the topic. As the training objectives and network architectures are becoming similar or even identical to GPT-like models, the key differentiating factor is data. For example, behaviors of the aforementioned LLMs are determined by the instruction tuning data.

>> False promise? There is a debate on whether the claim that open LLMs can catch up with proprietary LLMs is a false promise (Gudibande et al., 2023). To align the discussions, we argue that there are two distinctive abilities for LLMs: the instruction-following ability to know which task to perform, and massive knowledge storage to complete the task with high quality. Imitation models are good at the former, mimicking ChatGPT's style, but perform poorly in terms of factuality in their responses. In Gudibande et al. (2023), the authors conclude that there exists a substantial capabilities gap between open and closed LLMs that, with current methods, can only be bridged using an unwieldy amount of imitation data or by using more capable base LLMs. They also advocate that the highest-leverage action for improving open-source models is to tackle the difficult challenge of developing better base LLMs. Unfortunately, however, the resources to train such base LLMs are only available in a few industry labs. It seems more promising for most academic research labs to explore the opportunities in alignment research with affordable resources, or to explore techniques that reduce the compute barriers.

>> Base LLMs. Developing more capable or commercial usable LLMs is of great value. Besides LLaMA, the open-source community has developed variants of base LLMs such as LLaMA-2, OpenLLaMA (Geng and Liu, 2023), MPT (Team, 2023) and Falcon (Penedo et al., 2023), or released the training recipe (Computer, 2023).

关于LLMs,有一些重要的主题我们在本章中没有涵盖,但值得未来探讨。

>> 以数据为中心的AI。我们强调,这些开源LLMs的发展是以数据为中心的(Mazumder等人,2022),而不是以模型为中心的,因此希望读者在讨论这一话题时保持这一视角。由于训练目标和网络架构正变得与类GPT模型相似甚至相同,关键的差异化因素是数据。例如,上述LLMs的行为就是由指令调优数据决定的。

>> 虚假承诺?关于"开源LLMs能够赶上专有LLMs"是否只是一个虚假承诺,目前存在争论(Gudibande等人,2023)。为了统一讨论口径,我们认为LLMs有两种不同的能力:一是遵循指令的能力,即知道要执行哪个任务;二是海量的知识储备,用以高质量地完成任务。模仿模型擅长前者,能够模仿ChatGPT的风格,但在回答的真实性方面表现不佳。Gudibande等人(2023)得出结论,开源与闭源LLMs之间存在实质性的能力差距,用当前的方法只能通过海量的模仿数据或更强大的基础LLMs来弥合。他们还主张,改进开源模型的最高杠杆举措是解决开发更好基础LLMs这一困难挑战。然而遗憾的是,训练此类基础LLMs的资源只有少数工业实验室具备。对于大多数学术研究实验室而言,利用可负担的资源探索对齐研究,或探索降低计算门槛的技术,似乎更有希望。

>> 基础LLMs。开发更强大或商业可用的LLMs具有巨大的价值。除了LLaMA之外,开源社区还开发了基础LLMs的变种,如LLaMA-2、OpenLLaMA(Geng和Liu,2023)、MPT(Team,2023)和Falcon(Penedo等人,2023),或者发布了训练配方(Computer,2023)。

5.3、Instruction-Tuned Large Multimodal Models指导调整的大型多模态模型

如何利用开源资源构建多模态GPT-4的最小原型

In this section, we illustrate how to build a minimum prototype of multimodal GPT-4 with open-source resources. Specifically, we use LLaVA (Liu et al., 2023c) as the running example; a similar idea is also proposed in its concurrent work MiniGPT-4 (Zhu et al., 2023a).

The research in the multimodal space has often been inspired by the latest advances in NLP in recent years. One successful recipe is to explore what would happen if the most intriguing and successful NLP ideas are borrowed for the vision-and-language community, for example, self-instruct. However, the unique challenge with self-instruct in multimodal research is that there is no strong multimodal teacher publicly available. Therefore, the research question becomes: how can we use language models such as the language-only GPT-4 to create multimodal instruction-following data?

在本节中,我们将演示如何使用开源资源构建多模态GPT-4的最小原型。特别地,我们以LLaVA(Liu等人,2023c)为运行示例,类似的想法也在其并行工作MiniGPT-4(Zhu等人,2023a)中提出。

近年来,多模态领域的研究常常受到自然语言处理(NLP)最新进展的启发。一个成功的方法是探索如果最引人注目和成功的NLP思想被借用到视觉与语言社区中会发生什么,例如自我指导。然而,多模态研究中自我指导面临的独特挑战是没有强大的多模态教师公开可用。因此,研究问题变成了:我们如何使用纯文本的GPT-4之类的语言模型来创建多模态的指导遵循数据。

Data Creation数据创建:提高模型的多模态能力=将图像转换为符号序列表示+采用图像的标题和边界框信息+三种类型的指令遵循数据

Instead of directly feeding images into OpenAI GPT-4, we use their symbolic sequence representations shown in Figure 5.9 (a). In LLaVA, both captions and bounding boxes are considered, due to the following reasons: (i) it is empirically found that GPT-4 can understand both well, in contrast to the poor performance of ChatGPT in understanding bounding box coordinates; (ii) they are often complementary to each other and hence can represent the image as informatively as possible.

As shown in Figure 5.9 (b), three types of instruction-following data are considered: (i) multi-turn conversations so that users can chat with the model; (ii) detailed descriptions so that long-form responses can be generated by the model; and (iii) complex reasoning, which is more about the implication of the image rather than the image content. For example, "what challenge do these people face?" requires the model to first recognize that the image is about an SUV in a parking area with quite a few pieces of luggage placed on the ground, and then to infer that the challenge is how the luggage can be packed into the SUV given the tight space of the trunk. In total, 158K samples are collected over the three types. To summarize, the spirit is that whatever tasks one wants the model to perform in the serving stage, it is important to create the corresponding instruction-following data for training.

与直接将图像馈送到OpenAI GPT-4不同,我们使用它们在图5.9(a)中显示的符号序列表示。在LLaVA中,考虑了标题和边界框,原因如下:(i)经验表明GPT-4可以很好地理解两者,而ChatGPT在理解边界框坐标方面性能较差。 (ii)它们通常互补,并且可以尽可能详细地表示图像。

如图5.9(b)所示,这里考虑了三种类型的指令遵循数据:(i)多轮对话,以便用户可以与模型聊天;(ii)详细描述,以便模型能够生成长篇回应;(iii)复杂推理,更多涉及图像的含义,而不仅仅是图像内容。例如,"这些人面临什么挑战?"这一问题需要模型首先识别出图像是关于停车区域的一辆SUV,并且地面上放着不少行李,然后推断出:由于后备箱空间有限,挑战在于如何把这些行李装进SUV。三种类型总共收集了15.8万个样本。总结来说,其核心思想是:无论希望模型在服务阶段执行什么任务,都应为训练创建相应的指令遵循数据。
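下面给出"图像符号化表示 + 三类指令数据"的一个极简示意:把标题与边界框拼成纯文本上下文,再按数据类型选择不同的提示模板交给纯文本LLM生成。其中的提示模板为随手编写的示例,并非LLaVA使用的原始提示。

```python
def image_to_symbolic_context(captions, boxes):
    """把图像转成纯文本的符号序列表示:标题 + 目标类别及其归一化边界框坐标。"""
    cap_part = "\n".join(captions)
    box_part = "\n".join(f"{name}: [{x1:.3f}, {y1:.3f}, {x2:.3f}, {y2:.3f}]"
                         for name, (x1, y1, x2, y2) in boxes)
    return f"图像描述:\n{cap_part}\n\n目标及边界框:\n{box_part}"

# 三类指令数据对应三种提示模板(仅为示意)
PROMPT_TEMPLATES = {
    "conversation": "请基于上述图像信息,设计一段用户与助手之间的多轮问答对话。",
    "detailed_description": "请基于上述图像信息,写一段尽可能详细的图像描述。",
    "complex_reasoning": "请基于上述图像信息,提出一个需要推理图像含义的问题并给出回答。",
}

def build_instruction_prompt(captions, boxes, data_type):
    return image_to_symbolic_context(captions, boxes) + "\n\n" + PROMPT_TEMPLATES[data_type]

# 用法示例(标题与坐标均为虚构)
print(build_instruction_prompt(
    captions=["一辆黑色SUV停在路边,旁边的地面上放着几件行李。"],
    boxes=[("suv", (0.10, 0.20, 0.80, 0.90)), ("suitcase", (0.05, 0.70, 0.25, 0.95))],
    data_type="complex_reasoning",
))
```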

Network Architecture and Training网络架构和训练:LLaVA的网络架构是一个通用的图像到文本生成模型,通过将预训练的CLIP ViT-L/14视觉编码器和大型语言模型Vicuna连接起来,并采用两阶段指令调优过程进行训练(特征对齐的预训练+端到端微调)

As illustrated in Figure 5.10, the network architecture of LLaVA is an instantiation of the general image-to-text generative model framework introduced in Figure 5.1 of Section 5.1. Specifically, LLaVA connects the pre-trained CLIP ViT-L/14 visual encoder (Radford et al., 2021) and the large language model Vicuna (Vicuna, 2023) via a simple projection matrix (i.e., a linear projection layer). A two-stage instruction-tuning procedure is adopted to train the model. (i) Stage 1: pre-training for feature alignment. Only the projection matrix is updated, based on a subset of CC3M (Changpinyo et al., 2021). (ii) Stage 2: finetuning end-to-end. Both the projection matrix and the LLM are updated on the proposed multimodal instruction-following data for daily user-oriented applications.

如图5.10所示,LLaVA的网络架构是第5.1节图5.1中介绍的通用图像到文本生成模型框架的一个实例。具体来说,LLaVA通过一个简单的投影矩阵(即线性投影层),连接了预训练的CLIP ViT-L/14视觉编码器(Radford等人,2021)和大型语言模型Vicuna(Vicuna,2023)。模型采用两阶段指令调优过程进行训练。

(i)第一阶段:特征对齐的预训练。仅更新投影矩阵,基于CC3M的子集(Changpinyo等人,2021)。

(ii)第二阶段:端到端的微调。在为面向日常用户的应用提出的多模态指导遵循数据上,同时更新投影矩阵和LLM。
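结合图5.10的描述,下面用PyTorch给出该架构与两阶段训练方式的一个极简示意:视觉编码器与语言模型均用随机初始化的小模块占位(真实模型分别为CLIP ViT-L/14与Vicuna),维度也仅为演示取值,重点在于说明两个阶段各自"冻结什么、更新什么"。

```python
import torch
import torch.nn as nn

class TinyLLaVA(nn.Module):
    """LLaVA 式结构的极简示意:冻结的视觉编码器 + 线性投影层 + 语言模型。"""
    def __init__(self, vision_dim=64, llm_dim=128, vocab_size=1000):
        super().__init__()
        self.vision_encoder = nn.Linear(3 * 32 * 32, vision_dim)    # 占位:真实模型为 CLIP ViT-L/14
        self.projector = nn.Linear(vision_dim, llm_dim)             # 线性投影层(阶段一只训练它)
        layer = nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True)
        self.llm = nn.TransformerEncoder(layer, num_layers=2)       # 占位:真实模型为 Vicuna
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, image, text_embeds):
        v = self.vision_encoder(image.flatten(1))                   # [B, vision_dim]
        v = self.projector(v).unsqueeze(1)                          # 投影为一个视觉 token,进入 LLM 嵌入空间
        h = self.llm(torch.cat([v, text_embeds], dim=1))            # 视觉 token 与文本 token 拼接后送入 LLM
        return self.lm_head(h)

model = TinyLLaVA()

# 阶段一:特征对齐的预训练——冻结视觉编码器与 LLM,只更新投影矩阵
for p in model.parameters():
    p.requires_grad = False
for p in model.projector.parameters():
    p.requires_grad = True

# 阶段二:端到端微调——投影矩阵与 LLM(含输出头)一同更新
for p in model.llm.parameters():
    p.requires_grad = True
for p in model.lm_head.parameters():
    p.requires_grad = True

# 前向示例:1 张 32x32 的"图像"与 5 个文本 token 嵌入(均为随机张量)
logits = model(torch.randn(1, 3, 32, 32), torch.randn(1, 5, 128))
print(logits.shape)   # torch.Size([1, 6, 1000])
```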

Performance性能:LLaVA是一个多模态聊天模型,通过自我指导方法在多模态指令遵循数据上进行微调,在多个任务和领域中展现了良好的性能

Visual chat: towards building a multimodal GPT-4 level chatbot. LLaVA is finetuned on the generated multimodal instruction-following data, which contains a diverse set of task instructions and responses for daily user-oriented applications. It is empirically observed that finetuning the linear projection layer only is sufficient for the chat demo/scenarios, though it requires longer training time. To evaluate the model performance, an evaluation dataset named LLaVA-Bench is constructed, with two subsets: (i) LLaVA-Bench (COCO): 30 unseen COCO images with 90 new language-image instructions; (ii) LLaVA-Bench (In-the-Wild): 24 images with 60 questions. Each image can be associated with three types of instructions: conversation, detailed description and complex reasoning. The ground-truth answers are collected by manually re-writing GPT-4 output. We test LLaVA and use language-only GPT-4 to rate its responses from score 1 to 10. Overall, LLaVA achieves an 85.1% relative score compared with the ground-truth on LLaVA-Bench (COCO), and 73.5% on LLaVA-Bench (In-the-Wild). On the latter dataset, Google Bard (July 19, 2023) and Microsoft BingChat (June 29, 2023) achieve 77.8% and 71.5%, respectively. This indicates the effectiveness of the proposed self-instruct method in multimodal settings. One example is shown in Table 5.1.

Science QA: new SoTA with the synergy of LLaVA and GPT-4. LLaVA is finetuned on a multimodal reasoning dataset in the science domain (Lu et al., 2022b). LLaVA alone achieves 90.92% in accuracy. We further explore using language-only GPT-4 as the judge, to predict the final answer based on its own previous answers and the LLaVA answers. This "GPT-4 as judge" scheme yields a new SoTA of 92.53%.

OCR in the wild: an emerging property. LLaVA has never been explicitly trained on OCR data, i.e., images that contain scene text described in the corresponding caption. Surprisingly, the model shows strong zero-shot OCR task transfer ability in the wild.

视觉对话:朝着构建多模态GPT-4级别的聊天机器人迈进。LLaVA在生成的多模态指令遵循数据上进行了微调,这些数据包含面向日常用户应用的多样化任务指令与回应。经验观察表明,仅微调线性投影层对于聊天演示/场景已经足够,不过需要更长的训练时间。为了评估模型性能,构建了一个名为LLaVA-Bench的评估数据集,包括两个子集:(i)LLaVA-Bench(COCO):30张未见过的COCO图像,带有90条新的语言-图像指令;(ii)LLaVA-Bench(In-the-Wild):24张图像,带有60个问题。每张图像可以与三种类型的指令相关联:对话、详细描述和复杂推理。标准答案(ground-truth)是通过人工改写GPT-4的输出收集的。我们测试了LLaVA,并使用纯文本GPT-4对其回应打1到10分。总体而言,LLaVA在LLaVA-Bench(COCO)上相对于标准答案取得了85.1%的相对分数,在LLaVA-Bench(In-the-Wild)上取得了73.5%。在后一个数据集上,Google Bard(2023年7月19日)和Microsoft BingChat(2023年6月29日)分别达到了77.8%和71.5%,这表明所提出的自我指导方法在多模态环境中的有效性。表5.1中展示了一个示例。

科学问答(Science QA):LLaVA与GPT-4的协同带来新的SoTA。LLaVA在科学领域的多模态推理数据集(Lu等人,2022b)上进行了微调。LLaVA单独即可达到90.92%的准确率。我们进一步尝试以纯文本GPT-4作为评判者,让它根据自己先前的答案和LLaVA的答案来预测最终答案。这种"GPT-4作为评判者"的方案取得了92.53%的新SoTA。

真实场景OCR(OCR in the wild):一种涌现能力。LLaVA从未在OCR数据(即包含场景文字且相应标题对其有所描述的图像)上进行过显式训练。令人惊讶的是,该模型在真实场景中展现出强大的零样本OCR任务迁移能力。

5.4、Advanced Topics高级主题

近期指令调优多模态语言模型研究呈现出蓬勃发展的势头,涌现出多个新模型和研究方向,将对多模态语言理解和生成领域产生重要影响。

The history of recent instruction-tuned LMMs is illustrated in Figure 5.11 (a). Due to the popularity of ChatGPT and GPT-4, instruction-tuned LMMs emerged as a new line of research in the three months after GPT-4 was proposed. Alpaca (Taori et al., 2023) and Vicuna (Vicuna, 2023) were proposed to make LLaMA more instruction-following in the language domain in March. In two weeks, MiniGPT-4 (Zhu et al., 2023a) and LLaVA (Liu et al., 2023c) were proposed to make Vicuna see and chat about the visual world. In ten days, LLaMA-Adapter v2 (Gao et al., 2023b) and mPlug-OWL (Ye et al., 2023b) started to compare performance with MiniGPT-4/LLaVA, indicating the beginning of model evolution. The data points in April are relatively sparse. In May, a large number of LMM papers appeared on arXiv, improving this line of research from many different aspects. The momentum was still going in June.

It is easy to lose track of all the recent papers for the readers, so as well in our literature re- view. To better organize the literature, we group them based on specific research topics, shown in Figure 5.11 (b). The early LMMs with billions of parameters include GPT-4 (OpenAI, 2023a), Flamingo (Alayrac et al., 2022), PaLM-E (Driess et al., 2023) and KOSMOS-1 (Huang et al., 2023b). In contrast to these proprietary LMMs, LLaVA and MiniGPT-4 open the opportunities to build LMMs with open-source resource. We will discuss several topics as below, in addition to the extensions of RLHF (Gunjal et al., 2023), dense prediction (Wang et al., 2023h; Zang et al., 2023; Chen et al., 2023d), video (Zhang et al., 2023f; Luo et al., 2023c; Li et al., 2023i), image generation (Koh et al., 2023) and embodied agent (Mu et al., 2023).

近期指导调整的大型多模态模型的发展历程如图5.11(a)所示。由于ChatGPT和GPT-4的流行,指导调整的大型多模态模型在GPT-4提出后的三个月中迅速成为一个新兴的研究方向。3月,Alpaca(Taori等人,2023)和Vicuna(Vicuna,2023)被提出,以使LLaMA在语言领域更好地遵循指令。两周后,MiniGPT-4(Zhu等人,2023a)和LLaVA(Liu等人,2023c)被提出,让Vicuna能够"看见"并就视觉世界进行对话。又过了十天,LLaMA-Adapter v2(Gao等人,2023b)和mPlug-OWL(Ye等人,2023b)开始与MiniGPT-4/LLaVA比较性能,标志着模型演进的开始。4月份的数据点相对稀疏。到了5月,arXiv上出现了大量LMM论文,从许多不同的方面推进了这一研究方向。这一势头一直延续到6月。

对于读者来说,很容易跟不上所有最新的论文;在我们撰写文献综述时也是如此。为了更好地组织文献,我们根据特定的研究主题对它们进行了分组,如图5.11(b)所示。拥有数十亿参数的早期LMM包括GPT-4(OpenAI,2023a)、Flamingo(Alayrac等人,2022)、PaLM-E(Driess等人,2023)和KOSMOS-1(Huang等人,2023b)。与这些专有的LMM不同,LLaVA和MiniGPT-4为使用开源资源构建LMM打开了机会。除了RLHF(Gunjal等人,2023)、密集预测(Wang等人,2023h;Zang等人,2023;Chen等人,2023d)、视频(Zhang等人,2023f;Luo等人,2023c;Li等人,2023i)、图像生成(Koh等人,2023)和具身智能体(Mu等人,2023)等扩展之外,我们将讨论以下几个主题。

More Modalities (Beyond VL)更多的模态(超越VL):近期研究致力于将多模态语言模型框架扩展到包括更多的感知模态,如声音、图像、视频等,进一步拓展了多模态语言理解和生成的研究领域。

While LMM extends LLM by adding the vision modality, it is natural to further extend the framework to include more modalities beyond vision and language. Following this spirit, several attempts have been made, including ChatBridge (Zhao et al., 2023e), PandaGPT (Su et al., 2023), SpeechGPT (Zhang et al., 2023d) and X-LLM (Chen et al., 2023c). PandaGPT leverages ImageBind to add more modalities into LMMs. The ImageBind model (Girdhar et al., 2023) learns a single, shared representation space for text, image/video, audio and sensors that record depth (3D), thermal (infrared radiation), or inertial measurement units (IMU), which calculate motion and position. ImageBind provides a holistic understanding of the visual world that connects objects in a photo with how they will sound, their 3D shape, how warm or cold they are, and how they move. By training a projection layer for one modality in LMM, the model can zero-shot transfer to infer over other modalities, thanks to the shared multimodal embedding space. Another representative model is SpeechGPT, where language and speech modalities are enabled for both inputs and outputs. Despite the rich model variations, the idea of connecting diverse modalities is similar to that of LMM, which adds images into LLMs. NExT-GPT (Wu et al., 2023c) connects an LLM with multimodal adaptors and different diffusion decoders, enabling NExT-GPT to perceive inputs and generate outputs in arbitrary combinations of text, images, videos, and audio. The LMM framework has also been successfully extended to speech (Zhao et al., 2023c), 3D (Wang et al., 2023l; Hong et al., 2023), and point cloud (Xu et al., 2023c).

虽然LMM通过添加视觉模态扩展了LLM,但自然而然地可以进一步扩展框架,纳入视觉和语言之外的更多模态。循着这一思路,已经出现了一些尝试,包括ChatBridge(Zhao等人,2023e)、PandaGPT(Su等人,2023)、SpeechGPT(Zhang等人,2023d)和X-LLM(Chen等人,2023c)。PandaGPT利用ImageBind将更多模态引入LMM。ImageBind模型(Girdhar等人,2023)为文本、图像/视频、音频以及记录深度(3D)、热(红外辐射)或惯性测量单元(IMU,用于计算运动和位置)的传感器学习了一个单一的共享表示空间。ImageBind提供了对视觉世界的整体理解,将照片中的对象与它们的声音、3D形状、冷热程度以及运动方式联系起来。得益于共享的多模态嵌入空间,只需为LMM中的一种模态训练一个投影层,模型就可以零样本迁移到其他模态上进行推理。另一个代表性模型是SpeechGPT,其中语言和语音模态都可用于输入和输出。尽管模型变化丰富,但连接不同模态的想法与将图像添加到LLM中的LMM是类似的。NExT-GPT(Wu等人,2023c)将LLM与多模态适配器和不同的扩散解码器连接起来,使其能够以文本、图像、视频和音频的任意组合感知输入并生成输出。LMM框架还成功地扩展到了语音(Zhao等人,2023c)、3D(Wang等人,2023l;Hong等人,2023)和点云(Xu等人,2023c)领域。
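下面的片段示意"共享多模态嵌入空间 + 单一投影层"为何能够零样本迁移:只要不同模态(图像、音频等)的编码器输出落在同一嵌入空间中,训练时只针对图像学到的投影层,推理时也可以直接套用于音频嵌入。这里用随机向量代替真实的ImageBind编码器输出,维度也仅为示例取值。

```python
import torch
import torch.nn as nn

EMBED_DIM, LLM_DIM = 1024, 4096          # 共享嵌入空间维度 / LLM 输入维度(示例取值)
projector = nn.Linear(EMBED_DIM, LLM_DIM)  # 仅用图像-文本对训练的投影层

def to_llm_tokens(modality_embedding: torch.Tensor) -> torch.Tensor:
    # modality_embedding 可来自共享空间中任意模态的编码器(此处用随机向量代替)
    return projector(modality_embedding).unsqueeze(1)   # 作为一个"软 token"接入 LLM

image_emb = torch.randn(1, EMBED_DIM)   # 训练时见过的模态
audio_emb = torch.randn(1, EMBED_DIM)   # 训练时未见过、但位于同一共享嵌入空间的模态
print(to_llm_tokens(image_emb).shape, to_llm_tokens(audio_emb).shape)   # 两者走同一条投影路径
```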

Improving Visual Instruction Data Quantity and Quality改进视觉指导数据的数量和质量

Given the convergence of model architectures to GPT-like networks, the performance of an LMM is primarily determined by its training data. Therefore, it is crucial to improve the quantity and quality of visual instruction tuning data. SVIT (Zhao et al., 2023a) follows the same data generation pipeline as in LLaVA, but further includes region descriptions to prompt GPT-4, in addition to the caption and box data shown in Figure 5.9 (a). The data is scaled up to 3.2 million samples, which is 20 times larger than the data used in LLaVA.

Unlike existing studies that primarily focus on positive instruction samples, LRV-Instruction (Liu et al., 2023a) includes both positive and negative instructions for more robust instruction-tuning. Other examples along this line include LLaVAR (Zhang et al., 2023o), which adds OCR-related instruction-tuning data for text-rich image understanding, and StableLLaVA (Li et al., 2023o), which considers model-synthesized images for image-dialogue data. Polite Flamingo (Chen et al., 2023b) trains an LLM to re-write the instruction data. Instead of leveraging GPT-4 for data generation, VIGC (Wang et al., 2023a) utilizes LMMs to generate instruction-tuning data and progressively enhance its quality on-the-fly. Similar to the "less is more" observation in LIMA (Zhou et al., 2023a) from the NLP domain, InstructionGPT-4 shows that the quality of the instruction-tuning data is more important than its quantity: a better version of MiniGPT-4 is finetuned with 200 high-quality samples (6%) selected from the 3,500 samples used in the original MiniGPT-4.

鉴于模型架构趋于类似于GPT的网络,LMM的性能主要由其训练数据决定。因此,改进视觉指导调整数据的数量和质量至关重要。SVIT(Zhao等人,2023a)遵循与LLaVA相同的数据生成流水线,但进一步包括区域描述以提示GPT-4,除了图5.9(a)中显示的标题和框数据外。数据扩展到了320万个,比LLaVA中使用的数据大20倍。

不同于现有研究主要关注正向指令样本,LRV-Instruction(Liu等人,2023a)同时包含正向和负向指令,以实现更鲁棒的指令调优。这一方向的其他例子包括LLaVAR(Zhang等人,2023o),它为富文本图像理解添加了与OCR相关的指令调优数据,以及StableLLaVA(Li等人,2023o),它考虑用模型合成的图像构造图像对话数据。Polite Flamingo(Chen等人,2023b)训练LLM来改写指令数据。与利用GPT-4生成数据不同,VIGC(Wang等人,2023a)利用LMM生成指令调优数据,并在运行时逐步提高其质量。与NLP领域LIMA(Zhou等人,2023a)中"少即是多"的观察类似,InstructionGPT-4表明指令调优数据的质量比数量更重要:他们从原始MiniGPT-4所用的3500个样本中挑选了200个高质量样本(6%),微调出了一个更好的MiniGPT-4版本。

Multitask Instruct with Established Academic Datasets/Tasks基于现有学术数据集/任务的多任务指令调优:指令调优可以通过两种不同的方式实现=使用人工标注的提示和反馈在多样化任务上微调模型+使用公共基准和数据集进行监督微调

As discussed earlier in Section 5.2, instruction tuning in the language domain is implemented in two different ways: finetuning the model on a wide range of tasks using human-annotated prompts and feedback (Ouyang et al., 2022), or supervised finetuning using public benchmarks and datasets augmented with manually or automatically generated instructions (Wang et al., 2022f). The former is good at user-oriented daily life tasks, while the latter is good at achieving decent performance on established benchmarks. LLaVA and MiniGPT-4 fall into the former class. Several other works either target the latter class or combine both, including MultiInstruct (Xu et al., 2022b), mPlug-OWL (Ye et al., 2023b), InstructBLIP (Dai et al., 2023b), Multimodal-GPT (Gong et al., 2023), Instruction-ViT (Xiao et al., 2023) and Qwen-VL (Bai et al., 2023a).

For example, MultiInstruct is an early attempt, predating open-source LLaMA, at instruction tuning with multimodal datasets. InstructBLIP is a recent work that combines chat and benchmark instruction-following data. As shown in Figure 5.12, InstructBLIP transforms 26 publicly available datasets, covering a wide variety of tasks and capabilities, into instruction tuning format. Trained on 13 held-in datasets, InstructBLIP attains SoTA zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and larger Flamingo models. Qwen-VL scales up both the image-text pair data for pre-training and the academic datasets for multi-task pre-training, and achieves excellent performance on many tasks.

正如第5.2节前面所讨论的,语言领域的指令调优有两种不同的实现方式:使用人工标注的提示和反馈在广泛的任务上微调模型(Ouyang等人,2022),或者使用经人工或自动生成指令增强的公共基准和数据集进行监督微调(Wang等人,2022f)。前者擅长面向用户的日常生活任务,后者擅长在既有基准测试上取得不错的性能。LLaVA和MiniGPT-4属于前一类。其他一些工作或针对后一类,或将两类结合,包括MultiInstruct(Xu等人,2022b)、mPlug-OWL(Ye等人,2023b)、InstructBLIP(Dai等人,2023b)、Multimodal-GPT(Gong等人,2023)、Instruction-ViT(Xiao等人,2023)和Qwen-VL(Bai等人,2023a)。例如,MultiInstruct是在开源LLaMA出现之前就利用多模态数据集进行指令调优的早期尝试。InstructBLIP是一项将聊天数据与基准指令调优数据相结合的近期工作。如图5.12所示,InstructBLIP将26个公开可用的数据集转化为指令调优格式,涵盖各种各样的任务和能力。在13个保留(held-in)数据集上训练后,InstructBLIP在全部13个留出(held-out)数据集上取得了SoTA零样本性能,大幅超越BLIP-2和更大的Flamingo模型。Qwen-VL同时扩大了用于预训练的图像-文本对数据和用于多任务预训练的学术数据集的规模,并在许多任务上取得了出色的性能。
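下面用一个极简的Python片段示意"把现有学术数据集转成指令调优格式"这一步:给每条标注随机套用一个指令模板,得到(图像、指令、回答)三元组。模板为随手编写的示例,并非InstructBLIP等工作的原始模板。

```python
import random

INSTRUCTION_TEMPLATES = [
    "请看图回答问题:{question}",
    "根据图像内容,简要回答:{question}",
    "{question} 请用一个短语作答。",
]

def vqa_to_instruction(sample):
    """把一条 VQA 标注转成指令调优格式的 (图像, 指令, 回答) 三元组。"""
    template = random.choice(INSTRUCTION_TEMPLATES)   # 随机选择模板以增加指令多样性
    return {"image": sample["image"],
            "instruction": template.format(question=sample["question"]),
            "response": sample["answer"]}

# 用法示例(样本内容为虚构)
print(vqa_to_instruction({"image": "000123.jpg", "question": "图中有几只猫?", "answer": "两只"}))
```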

Multimodal In-Context-Learning多模态上下文学习

Similar to the behavior of LLMs, which can address a language task by processing examples of the task in their text prompt, multimodal in-context-learning refers to a visual and text interface that can steer the model towards solving a multimodal task. Given a few example pairs of visual inputs and expected text responses composed in the multimodal prompt, the model can be queried with a question about a new image or video, and then generate an answer. The direction of extending in-context-learning from language to multi-modalities has been explored, including OpenFlamingo (Awadalla et al., 2023), Otter (Li et al., 2023d), M3IT (Li et al., 2023j), MetaVL (Monajatipoor et al., 2023) and Sparkles (Huang et al., 2023d). OpenFlamingo (Awadalla et al., 2023) is an open-source version of DeepMind's Flamingo model, trained on the Multimodal C4 dataset (Zhu et al., 2023b), a billion-scale corpus of interleaved image-text data. To explicitly enhance the multimodal in-context-learning ability of LMMs, the MIMIC-IT (Li et al., 2023c) dataset is constructed, which contains 2.4M multimodal instruction instances with in-context examples. By tuning OpenFlamingo on MIMIC-IT, a new model Otter is obtained with a stronger instruction-following ability. Using two image-text pairs as the context, Otter learns the concise answer style demonstrated by the examples; otherwise, a tedious response is generated.

与LLMs可以通过处理文本提示中的任务示例来解决语言任务的行为类似,多模态上下文学习指的是一种视觉与文本接口,可以引导模型去解决多模态任务。在多模态提示中给出几组视觉输入与期望文本回应的示例后,就可以针对一张新图像或一段新视频向模型提问,并让它生成答案。将上下文学习从语言扩展到多模态的方向已经得到探索,包括OpenFlamingo(Awadalla等人,2023)、Otter(Li等人,2023d)、M3IT(Li等人,2023j)、MetaVL(Monajatipoor等人,2023)和Sparkles(Huang等人,2023d)。OpenFlamingo(Awadalla等人,2023)是DeepMind的Flamingo模型的开源版本,在Multimodal C4数据集(Zhu等人,2023b)上训练,这是一个数十亿规模的图文交错语料库。为了显式增强LMMs的多模态上下文学习能力,研究者构建了MIMIC-IT(Li等人,2023c)数据集,其中包含240万条带有上下文示例的多模态指令实例。通过在MIMIC-IT上微调OpenFlamingo,得到了指令遵循能力更强的新模型Otter。以两个图像-文本对作为上下文时,Otter能学习示例中展示的简洁回答风格,否则会生成冗长的回应。
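下面给出多模态上下文学习提示构造的一个极简示意:把若干(图像, 问题, 简洁回答)示例与新图像的问题交织拼接。其中 <image:...> 只是文本占位符,实际模型(如OpenFlamingo/Otter)会在相应位置插入图像特征;示例内容均为虚构。

```python
def build_icl_prompt(examples, query_image, query_question):
    """构造图文交错的上下文学习提示:先给示例,再给待回答的新图像和问题。"""
    parts = []
    for img, question, answer in examples:
        parts.append(f"<image:{img}> 问题:{question} 回答:{answer}")
    parts.append(f"<image:{query_image}> 问题:{query_question} 回答:")   # 留空由模型补全
    return "\n".join(parts)

# 两个图像-文本对作为上下文,示范简洁的回答风格
demos = [("img_001.jpg", "图中主要物体是什么?", "一只猫。"),
         ("img_002.jpg", "图中主要物体是什么?", "一辆自行车。")]
print(build_icl_prompt(demos, "img_003.jpg", "图中主要物体是什么?"))
```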

Parameter-Efficient Training参数高效训练:精细调整成本过高→参数高效训练和模型量化是减小内存占用的有效方法

While finetuning very large models often leads to high performance, it is prohibitively expensive; For example, regular 16-bit finetuning of a LLaMA-65B model (Touvron et al., 2023) requires more than 780 GB of GPU memory (Dettmers et al., 2023). Therefore, it is critical to reduce the memory footprint of LLMs/LMMs, especially when it comes to improve the accessibility of large models to a wider community.

Parameter-efficient training is an effective approach for LMM adaptation. It freezes most of the model parameters and only allows a fraction of trainable parameters to update with domain-specific data. For example, LLaMA Adapter v2 (Gao et al., 2023b) and LAVIN (Luo et al., 2023a) only have 14M and 3.8M trainable parameters, compared with 7B/13B LLM parameters, respectively. Another efficient training method is quantization. The recent QLoRA (Dettmers et al., 2023) finetunes a 65B LLaMA for 24 hours on a single GPU, achieving 99.3% of the performance level of ChatGPT. Since instruction tuning typically involves a small amount of data, parameter-efficient training or model quantization becomes the practical approach, especially with limited GPU resources. Both LoRA (Hu et al., 2021) and QLoRA are supported in the LLaVA codebase to allow LMM training with fewer GPUs. It is empirically shown in Lu et al. (2023d) that LoRA/QLoRA can achieve performance similar to full-model finetuning when scaling LLaVA to 33B and 65B, training with around 150K instruction data and evaluating with LLaVA-Bench.

尽管对非常大的模型进行微调通常会带来高性能,但其成本高得令人难以承受;例如,对LLaMA-65B模型(Touvron等人,2023)进行常规的16位微调需要超过780 GB的GPU内存(Dettmers等人,2023)。因此,降低LLMs/LMMs的内存占用至关重要,尤其是要让更广泛的社区能够使用大型模型。参数高效训练是一种有效的LMM适配方法。它冻结大部分模型参数,只允许一小部分可训练参数随领域特定数据更新。例如,LLaMA Adapter v2(Gao等人,2023b)和LAVIN(Luo等人,2023a)分别只有1400万和380万可训练参数,而对应的LLM参数量为70亿/130亿。另一种高效的训练方法是量化。最近的QLoRA(Dettmers等人,2023)在单个GPU上对65B的LLaMA微调24小时,达到了ChatGPT性能水平的99.3%。由于指令调优通常只涉及少量数据,参数高效训练或模型量化便成为实际可行的方法,尤其是在GPU资源有限的情况下。LLaVA代码库同时支持LoRA(Hu等人,2021)和QLoRA,从而允许用更少的GPU进行LMM训练。Lu等人(2023d)的实验证明,在将LLaVA扩展到33B和65B、使用约15万条指令数据训练并用LLaVA-Bench评估时,LoRA/QLoRA可以达到与全模型微调相近的性能。
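以HuggingFace peft库中的LoRA为例,下面给出参数高效训练配置的一个简单示意:冻结基座权重,只在注意力投影上注入少量低秩适配器参数。模型名、target_modules 与各超参数仅为演示用取值,并非文中LLaVA/QLoRA的实际设置。

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# 加载一个小型因果语言模型作为演示基座(实际可换成 LLaMA/Vicuna 等)
base = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")

config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # 仅在注意力的 Q/V 投影上注入低秩适配器
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)       # 基座权重被冻结,只有 LoRA 参数可训练
model.print_trainable_parameters()         # 可训练参数通常只占总参数的很小一部分
```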

Benchmarks基准测试:通过对多个评估指标和数据集开展实验,结果显示开源模型在某些数据集上已与SOTA相当

While LMMs have shown excellent visual recognition and reasoning in an open-set manner with free-form text across many scenarios, the evaluation of LMMs is becoming an urgent and challenging problem. Several related benchmarks have been developed to evaluate various aspects of LMMs, ranging from specific abilities, including OCR (Liu et al., 2023k), hallucination (POPE (Li et al., 2023l) and HaELM (Wang et al., 2023d)) and adversarial robustness (Zhao et al., 2023d), to comprehensive evaluation such as LAMM (Yin et al., 2023) and LVLM-eHub (Xu et al., 2023b). We summarize the LMM evaluation benchmarks in Table 5.2. Among them, LLaVA-Bench is the first attempt to design an open-world visual chat benchmark specifically for LMMs. Recently, early multimodal experiments have been conducted to compare open-source LMMs with commercial ones such as BingChat and Bard, in LLaVA-Bench (Liu et al., 2023c) and LVLM-eHub (Shao et al., 2023).

It is surprising that LMMs show strong zero-shot OCR performance in the wild, without being explicitly trained on text recognition data. To shed light on the hidden mystery of OCR in LMMs, a comprehensive empirical study is conducted in Liu et al. (2023k) to compare open-source LMMs on 24 academic text recognition datasets, shown in Figure 5.13. Three observations are highlighted: (i) LLaVA consistently outperforms MiniGPT-4 on 21 out of 24 datasets, despite the training data in LLaVA being an order of magnitude smaller. (ii) Training with significantly more training data leads to higher OCR performance, as demonstrated by BLIP2 (Li et al., 2023h) and mPLUG-Owl. (iii) In most cases, supervised SoTA results significantly outperform zero-shot LMMs. However, it is worth noting that on the WordArt dataset (Xie et al., 2022a), which primarily features challenging artistic text, BLIP2 surpasses the supervised SoTA. This reveals the potential of LMMs in recognizing more complex text types.

尽管LMMs已在许多场景中以开放集、自由文本的方式展现出出色的视觉识别和推理能力,但如何评估LMMs正成为一个紧迫且具有挑战性的问题。目前已经开发了若干相关基准,用于评估LMMs的各个方面:既有针对特定能力的,包括OCR(Liu等人,2023k)、幻觉(POPE(Li等人,2023l)和HaELM(Wang等人,2023d))以及对抗鲁棒性(Zhao等人,2023d);也有综合评估,如LAMM(Yin等人,2023)、LVLM-eHub(Xu等人,2023b)。我们在表5.2中总结了LMM评估基准。其中,LLaVA-Bench是专门为LMM设计开放世界视觉对话基准的首次尝试。最近,已经在LLaVA-Bench(Liu等人,2023c)和LVLM-eHub(Shao等人,2023)中开展了早期的多模态实验,将开源LMM与BingChat、Bard等商业模型进行比较。

令人惊讶的是,LMMs在没有明确使用文本识别数据训练的情况下,在真实场景中表现出强大的零样本OCR性能。为了揭开LMMs中OCR能力的谜团,Liu等人(2023k)开展了一项全面的实证研究,在24个学术文本识别数据集上比较了各开源LMMs,如图5.13所示。研究强调了三个观察结果:(i)尽管LLaVA的训练数据规模比MiniGPT-4小一个数量级,但LLaVA在24个数据集中的21个上始终优于MiniGPT-4;(ii)使用显著更多的训练数据会带来更高的OCR性能,BLIP2(Li等人,2023h)和mPLUG-Owl便是例证;(iii)在大多数情况下,有监督的SoTA结果明显优于零样本LMM。然而值得注意的是,在主要包含具有挑战性的艺术字的WordArt数据集(Xie等人,2022a)上,BLIP2超越了有监督SoTA,这显示了LMM在识别更复杂文本类型方面的潜力。

Applications应用:ChatGPT/GPT-4的成功激发了在医学等垂直领域构建专用助手的兴趣;借助自我指导(self-instruct)方法,可以训练出能够开放式回答图像相关研究问题的对话助手

The success of ChatGPT/GPT-4 in the general domain has inspired interests in building assistants in vertical domains such as medicine, gaming and education. Such domain-specific assistants can have several advantages over their general-domain counterparts: (i) training with high-quality domain-specific data makes the assistants more helpful; (ii) the model size can be smaller, with lower serving cost; and (iii) the sensitive user prompt data can be maintained internally by serving the model locally, to avoid privacy issues.

ChatGPT/GPT-4在通用领域的成功激发了在医学、游戏和教育等垂直领域构建助手的兴趣。这些特定领域的助手相对于通用领域的助手具有以下几个优势:(i)使用高质量的领域特定数据进行训练可以使助手更有帮助;(ii)模型规模可以更小,服务成本更低;(iii)通过在本地部署模型,敏感的用户提示数据可以保留在内部,从而避免隐私问题。

To improve the text recognition ability of LMMs, OCR-specific models have been developed, including BLIVA (Hu et al., 2023), LLaVAR (Zhang et al., 2023o) and mPLUG-DocOwl (Ye et al., 2023a). LMMs have been recently explored in the biomedical domain (Sun et al., 2023c; Zhang et al., 2023m; Li et al., 2023e), where conversational generative AI has demonstrated remarkable promise for empowering biomedical practitioners. LLaVA-Med (Li et al., 2023e) is a cost-efficient approach for training a vision-language conversational assistant that can answer open-ended research questions about biomedical images. The key idea is to leverage a large-scale, broad-coverage biomedical figure-caption dataset extracted from PubMed Central, use GPT-4 to self-instruct open-ended instruction-following data from the captions, and then finetune a large general-domain vision-language model LLaVA using a novel curriculum learning method. Specifically, the model first learns to align biomedical vocabulary using the image-caption pairs as is, then learns open-ended conversational semantics using GPT-4 generated instruction-following data, broadly mimicking how a layperson gradually acquires biomedical knowledge. In Figure 5.14, we provide examples of biomedical visual conversations with different chatbots. LLaVA-Med precisely answers the questions requiring biomedical knowledge, while LLaVA behaves like a layperson, hallucinating based on common sense. LLaVA-Med has inspired several generalist biomedical AI models, including Google Med-PaLM-M (Tu et al., 2023), Stanford Med-Flamingo (Moor et al., 2023) and a radiology generalist (Wu et al., 2023b).

为了提高LMM的文本识别能力,已经开发了OCR专用的模型,包括BLIVA(Hu等人,2023)、LLaVAR(Zhang等人,2023o)和mPLUG-DocOwl(Ye等人,2023a)。LMMs最近也在生物医学领域得到了探索(Sun等人,2023c;Zhang等人,2023m;Li等人,2023e),在该领域,对话式生成AI已经展示出赋能生物医学从业者的巨大潜力。LLaVA-Med(Li等人,2023e)是一种训练视觉-语言对话助手的高性价比方法,该助手可以回答关于生物医学图像的开放式研究问题。其关键思想是利用从PubMed Central中提取的大规模、广覆盖的生物医学图像-标题数据集,使用GPT-4从标题中自我指导地生成开放式指令遵循数据,然后用一种新的课程学习方法微调大规模通用领域视觉-语言模型LLaVA。具体来说,模型首先直接使用图像-标题对学习对齐生物医学词汇,然后使用GPT-4生成的指令遵循数据学习开放式对话语义,大体上模仿普通人逐渐获得生物医学知识的过程。在图5.14中,我们提供了不同聊天机器人进行生物医学视觉对话的示例。LLaVA-Med能够准确回答需要生物医学知识的问题,而LLaVA则表现得像一个普通人,基于常识产生幻觉。LLaVA-Med启发了一些通用生物医学AI模型,包括Google Med-PaLM-M(Tu等人,2023)、Stanford Med-Flamingo(Moor等人,2023)和放射学通用模型(Wu等人,2023b)。

5.5、How Close We Are To OpenAI Multimodal GPT-4?我们距离OpenAI多模态GPT-4有多近?

仍有很大差距:需要继续提升能力并降低计算门槛

尽管开源社区已初步实现建立多模态语言模型的最小原型,但与GPT-4在能力规模和水平上还有很大差距,需要继续提升能力和降低计算门槛,同时各自根据实力开展技能扩展与性能优化才能进一步推动这一领域的发展。

With all the works mentioned above, are we close to (or even surpassing) OpenAI multimodal GPT-4? It is encouraging to see that the open-source community has quickly developed a variety of models and prototypes for various new capabilities. For example, LLaVA/MiniGPT-4 paves the way towards building multimodal chatbots, with some examples that reproduce the results in the OpenAI GPT-4 technical report; CM3leon (Yu et al., 2023), Emu (Sun et al., 2023a) and GILL (Koh et al., 2023) extend LMMs to end-to-end image generation, which, to the best of our knowledge, is a capability that the current GPT-4 does not exhibit. From the perspective of enabling new capabilities with minimum prototypes, the open-source community seems close to OpenAI multimodal GPT-4, by exploring the baby steps towards building the general-purpose multimodal assistant.

However, there is still a clear large gap in terms of scaling a given capability, e.g., the visual reasoning capability that we have observed in LLaVA. Consider two more visual examples from the OpenAI technical report: to correctly answer the questions, models are required to understand multiple high-resolution images and the long sequence of text depicted in them, as well as to respond with domain knowledge. This requires much more compute and more powerful language models, which are not available to most people.

通过上述提到的所有工作,我们是否接近(甚至超越)OpenAI多模态GPT-4?令人鼓舞的是,开源社区迅速开发了各种不同新功能的模型和原型。例如,LLaVA/Mini-GPT4为构建多模态聊天机器人铺平了道路,其中一些示例重现了OpenAI GPT-4技术报告中的结果;CM3leon(Yu等人,2023)、Emu(Sun等人,2023a)、GILL(Koh等人,2023)扩展了LMM以实现端到端的图像生成,据我们所知,这是当前的GPT-4没有展示的能力。从通过最小原型来启用新功能的角度来看,开源社区似乎接近OpenAI多模态GPT-4,通过探索朝着构建通用多模态助手迈出的初步步伐。

然而,在扩展某项给定能力(例如我们在LLaVA中观察到的视觉推理能力)的规模方面,仍然存在明显的巨大差距。OpenAI技术报告中还有另外两个视觉示例:要正确回答这些问题,模型需要理解多幅高分辨率图像以及图像中描绘的长序列文本,并结合领域知识作答。这需要更多的计算资源和更强大的语言模型,而这些是大多数人无法获得的。

In summary, we have presented the background and strong capabilities of LMMs, reviewed instruction tuning in LLMs, and showed how to build prototypes such as LLaVA and MiniGPT-4 using open-source resources. We also summarized the most recent papers on this line of research to help those who are interested gain the momentum to start the journey of LMM research. Regarding the next steps to work on as a community, one sustainable suggestion is that those with resources can continue focusing on the scaling success and study new emerging properties, while others focus on prototypes for new functionalities and evaluation, as well as on developing techniques to reduce the computational barriers and thus allow easier accessibility to large models.

总之,我们介绍了LMM的背景和强大能力,回顾了LLMs中的指令调优,并展示了如何使用开源资源构建LLaVA和MiniGPT-4这样的原型。我们还总结了这一研究方向上最新涌现的论文,以帮助有兴趣者获得动力、开启LMM研究之旅。关于社区接下来应当做什么,一个可持续的建议是:拥有资源的人可以继续关注规模扩展的成功并研究新的涌现特性,而其他人则专注于新功能与评估的原型,并开发降低计算门槛的技术,从而让大型模型更容易被使用。

6、Multimodal Agents:Chaining Tools with LLM 多模态智能代理:与LLM协同工作

提出新的建模范式将多个工具或专家与LLMs协同链接以解决复杂的开放问题,不需要训练,只需要示例教导

Large Language Models (LLMs) (Chowdhery et al., 2022; OpenAI, 2023a) have shown intriguing properties generalizing to user prompts in various domains, and rapidly adapting to new scenarios, using in-context learning with a few examples. Inspired by such strong capabilities, researchers are now exploring a new modeling paradigm that shifts from standalone models for solving finite, pre-defined problems, into synergistically chaining multiple tools or experts with LLMs to solve complicated, open problems. Unlike what has been introduced in Chapter 5, such a system can be built without any training involved, just by using a few demonstration examples to teach the LLM to generate proper calling to existing tools.

大型语言模型(LLMs)(Chowdhery等人,2022;OpenAI,2023a)已经展示出有趣的特性:能够泛化到各个领域的用户提示,并通过少量示例的上下文学习快速适应新场景。受这种强大能力的启发,研究人员正在探索一种新的建模范式:从用独立模型解决有限的、预定义的问题,转向将多个工具或专家与LLMs协同链接起来,以解决复杂、开放的问题。与第5章中介绍的方法不同,这样的系统无需任何训练即可构建,只需用少量示范示例教会LLM对现有工具生成恰当的调用。
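下面用一个极简的Python示意这种"无需训练、用LLM链接工具"的范式:LLM根据提示决定调用哪个工具,工具输出作为观察结果反馈给LLM,直到给出最终回答。其中 call_llm 与两个工具均为假设的占位实现,调用格式也只是示例,并非MM-REACT等系统的原始设计。

```python
TOOLS = {
    "image_caption": lambda path: f"(示意)对 {path} 的描述:一只狗在草地上奔跑。",
    "ocr": lambda path: f"(示意)在 {path} 中检测到的文字:HELLO",
}

def call_llm(prompt: str) -> str:
    # 占位:实际应调用 ChatGPT/GPT-4;这里用简单规则模拟"先调用工具、看到观察结果后给出最终回答"
    if "观察:" in prompt:
        return "最终回答:照片里有一只狗在草地上奔跑。"
    return "动作:image_caption(photo.jpg)"

def run_agent(user_query: str, max_steps: int = 3) -> str:
    # 少量示范示例会写进提示,教 LLM 以"动作:工具名(参数)"的格式调用工具(示范内容此处省略)
    history = f"用户:{user_query}\n可用工具:{', '.join(TOOLS)}\n"
    for _ in range(max_steps):
        reply = call_llm(history)
        if reply.startswith("动作:"):
            name, arg = reply[len("动作:"):].rstrip(")").split("(", 1)
            observation = TOOLS[name.strip()](arg.strip())
            history += f"{reply}\n观察:{observation}\n"        # 工具输出作为观察结果反馈给 LLM
        else:
            return reply                                        # LLM 不再调用工具,直接给出最终回答
    return history

print(run_agent("这张照片 photo.jpg 里有什么?"))
```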

In this chapter, we review the fast-evolving literature on chaining different multimodal experts with LLMs to solve complicated multimodal understanding problems, referred to as multimodal agents. We start with an overview on the evolution of this modeling paradigm in Section 6.1, highlighting the differences between traditional approaches and the new modeling paradigm of chaining tools with LLM. Section 6.2 gives a general overview of multimodal agents. Pivoting on an exemplary multimodal agent MM-REACT (Yang* et al., 2023), Section 6.3 comprehensively reviews how to build a multimodal agent, its emerging capabilities in multimodal understanding, and how it can be easily extended to incorporate the latest and strongest LLM and potentially millions of tools. Finally, in Section 6.4, we end the chapter with discussions on advanced topics, such as how to improve/evaluate multimodal agents, the diverse applications powered by multimodal agents.

在本章中,我们将回顾把不同多模态专家与LLMs链接起来以解决复杂多模态理解问题(即多模态智能体)的快速发展的文献。我们在第6.1节中概述这种建模范式的演变,强调传统方法与"用LLM链接工具"这一新建模范式之间的差异。第6.2节给出多模态智能体的总体概述。以一个典型的多模态智能体MM-REACT(Yang*等人,2023)为中心,第6.3节全面回顾了如何构建多模态智能体、它在多模态理解方面的新兴能力,以及它如何能够轻松扩展以纳入最新、最强的LLM乃至潜在的数百万种工具。最后,在第6.4节中,我们以高级主题的讨论结束本章,例如如何改进和评估多模态智能体,以及由多模态智能体驱动的各类应用。
