AGI and MFM: Translation and Commentary on "Multimodal Foundation Models: From Specialists to General-Purpose Assistants" (Unified Vision Models; Large Multimodal Models Trained with LLMs)




4、Unified Vision Models统一的视觉模型






Towards a unified vision model: three lines of research, namely bridging vision and language (e.g., CLIP), unified multi-task modeling, and LLM-like promptable interfaces

4.2、From Closed-Set to Open-Set Models从封闭集模型到开放集模型



维度1:Model initialization模型初始化:


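CLIP augmented: use a pre-trained CLIP model to assist training, either by distilling CLIP features (e.g., ViLD) or by having CLIP supply features and scores (e.g., MaskCLIP and Mask-Adapted CLIP)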


维度2:Model design模型设计


End-to-end models: cast object detection as textual grounding and train the whole model end to end

维度3:Model pre-training模型预训练:三种学习方法




4.2.1、Object Detection and Grounding目标检测和定位

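Object detection (identifying and localizing objects of interest): region-based methods (R-CNN/Fast R-CNN/Faster R-CNN) → real-time detectors (YOLO family) → Transformer-based architectures (e.g., DETR/DINO/Group DETR/Co-DETR)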



4.2.2、Image Segmentation and Referring图像分割和引用




Open-Vocabulary Segmentation: transfer the rich visual-semantic knowledge of foundation models to specific segmentation tasks (e.g., LSeg/OpenSeg, GroupViT, DenseCLIP, MaskCLIP, FC-CLIP, ODISE)


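Referring Segmentation (open-vocabulary by design): models use multimodal fusion strategies to learn from target datasets, e.g., CLIPSeg (extends a textual query to a visual query) and LAVT (enhances cross-modal interaction) → PolyFormer (converts masks into polygons)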

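Unified Segmentation: unify all segmentation tasks in a single framework, e.g., X-Decoder (reformulates the tasks and uses a generalized encoder-decoder architecture) and UNINEXT (uses early fusion to unify instance-level segmentation tasks)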

Figure 4.6: (a) CV task landscape

4.3、From Task-Specific Models to Generic Models从特定任务模型到通用模型




4.3.1、I/O Unification—I/O统一

类别1—Sparse and discrete outputs稀疏和离散输出




类别2—Dense and continuous outputs密集和连续输出



Diffusion augmented: build generalist vision models on top of an existing pre-trained Stable Diffusion model, e.g., Prompt Diffusion and InstructDiffusion

4.3.2、Functionality Unification功能统一

Multi-task learning多任务学习

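Vision models: CNN-based multi-task learning across vision tasks (e.g., Cross-stitch/UberNet) struggled to build synergy between tasks; Taskonomy offered insight by studying the relationships among vision tasks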

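Multi-modal models: the rise of Transformers spurred multi-task, multi-modal learning. Early work joined several vision-language tasks through shared components and task-specific heads, without fully exploiting cross-task synergy, e.g., 12in1 (shared low-level features plus task-specific heads) and UniT/E2E-VLP (extended to vision tasks and trained end to end)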

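Unified learning: with Transformers and open-set models, the barriers between tasks are fading, so inputs from different modalities can be mapped into a shared semantic space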




4.4、From Static to Promptable Models从静态到可提示模型

4.4.1、Multi-modal Prompting多模态提示

Spatial prompting空间提示

Visual prompting视觉提示


4.4.2、In-context Prompting上下文提示



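Painter→SegGPT: Painter unifies different tasks by predicting continuous pixel outputs (e.g., using colors to represent individual instances for segmentation); SegGPT builds on Painter and focuses on image segmentation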



4.5、Summary and Discussion总结与讨论



5、Large Multimodal Models: Training with LLM大型多模态模型:与LLM一起训练

5.1、Background 背景

5.1.1、Image-to-Text Generative Models图像到文本生成模型


5.1.2、Case Studies案例研究


Multimodal in-context learning: Flamingo transfers across tasks given only a few examples; this striking in-context learning ability makes Flamingo the GPT-3 moment of the multimodal domain

5.1.3、OpenAI Multimodal GPT-4 and Research Gaps: GPT-4 accepts multimodal input, which raises the question of how to study instruction following and alignment in the multimodal space


5.2、Pre-requisite: Instruction Tuning in Large Language Models先决条件:大型语言模型中的指令调优



5.2.1、Instruction Tuning: how instruction tuning enables large language models (LLMs) to follow natural-language instructions and complete real-world tasks


5.2.2、Self-Instruct and Open-Source LLMs自我指导和开源LLMs


Quick assessment of LLM chatbots: open-source models now come close to state-of-the-art proprietary models, based on the open-source LLaMA family, the Vicuna-Instructions-80 dataset, and GPT-4 as the rater

Further discussions: three directions, namely data-centric AI, the debate over the gap between open-source and proprietary LLMs, and the development of base LLMs

5.3、Instruction-Tuned Large Multimodal Models指导调整的大型多模态模型


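Data Creation: to improve multimodal ability, images are converted into symbolic sequence representations (captions and bounding boxes), from which three types of instruction-following data are produced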

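Network Architecture and Training: LLaVA is a general image-to-text generative model that connects the pre-trained CLIP ViT-L/14 visual encoder to the Vicuna LLM and is trained with a two-stage instruction-tuning procedure (pre-training for feature alignment, then end-to-end fine-tuning)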


5.4、Advanced Topics高级主题


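More Modalities (Beyond VL): recent work extends the multimodal LLM framework to further sensing modalities such as audio, images, and video, broadening research on multimodal language understanding and generation.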

Improving Visual Instruction Data Quantity and Quality改进视觉指导数据的数量和质量:

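Multitask Instruct with Established Academic Datasets/Tasks: instruction tuning can be done in two ways, either by fine-tuning on diverse tasks with human-annotated prompts and feedback, or by supervised fine-tuning on public benchmarks and datasets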

Multimodal In-Context-Learning多模态上下文学习:

Parameter-Efficient Training: full fine-tuning is too costly → parameter-efficient training and model quantization are effective ways to reduce the memory footprint



5.5、How Close We Are To OpenAI Multimodal GPT-4?我们距离OpenAI多模态GPT-4有多近?


6、Multimodal Agents:Chaining Tools with LLM 多模态智能代理:与LLM协同工作



4、Unified Vision Models统一的视觉模型

In this chapter, we discuss the unification of vision models. We start with an overview of the challenges in the unification of vision models and the most recent efforts towards this goal in Section 4.1. What follows are detailed discussions on (i) how to transform closed-set models to open-set ones in Section 4.2; (ii) how to unify different granularities of vision tasks in Section 4.3; and (iii) how to build a more promptable interface for vision in Section 4.4. Finally, we summarize the chapter and discuss future trends in Section 4.5.







Before talking about general-purpose unified vision systems, we revisit how language models and natural language processing (NLP) have evolved in the past years. Before 2018, different NLP tasks were addressed with different task-specific models, such as translation (Bahdanau et al., 2015), semantic parsing (Berant et al., 2013), summarization (Allahyari et al., 2017), and so on. With the emergence of the transformer architecture (Vaswani et al., 2017), language models for different NLP tasks were unified with a decoder-only architecture, e.g., the GPT models (Brown et al., 2020). Afterwards, the GPT models learned using the next word prediction task were further finetuned to follow human instructions. This led to ChatGPT, which fundamentally changed our expectations of what AI systems can do. The evolution as depicted in Figure 1.1 motivates us to wonder whether we can build a general-purpose vision system in a similar manner.

在讨论通用统一视觉系统之前,我们回顾了自然语言处理(NLP)和语言模型在过去几年中的发展。在2018年之前,不同的NLP任务使用不同的任务特定模型来解决,比如翻译(Bahdanau等人,2015)、语义解析(Berant等人,2013)、摘要生成(Allahyari等人,2017)等等。随着transformer架构(Vaswani等人,2017)的出现,不同NLP任务的语言模型采用了仅解码器的统一架构,例如GPT模型(Brown等人,2020)。随后,使用下一个词预测任务训练的GPT模型进一步微调以遵循人类指令。这导致了ChatGPT的诞生,从根本上改变了我们对AI系统可以做什么的期望。如图1.1所示的进化过程促使我们思考是否可以用类似的方式构建一个通用的视觉系统。



The great variety of computer vision tasks makes building a unified vision model challenging. First, vision tasks have different types of inputs, ranging from static images (Russakovsky et al., 2015) to sequential videos (Miech et al., 2019), and from pure vision inputs, as in image dehazing (He et al., 2010), to multi-modality inputs that include, e.g., vision and language (Antol et al., 2015). Second, different granularities are required for different tasks, such as image-level tasks like image classification (He et al., 2016) and captioning (Vinyals et al., 2016), region-level tasks like object detection (Girshick, 2015) and grounding (Plummer et al., 2015), and pixel-level tasks like image segmentation (He et al., 2017), super-resolution (Wang et al., 2020), etc. As a result, the outputs of vision systems also come in different formats: spatial information such as edges, boxes, and masks, and semantic information such as class labels, multi-label tags, or detailed descriptions. In addition to the challenges in modeling, there are also challenges with data. First, the cost of annotation varies greatly among different types of labels. As shown in Figure 4.6, these labels are at different levels of granularity and semantic richness, ranging from whole images and regions (box annotations) to masks (pixel annotations). Second, it is in general much more costly to collect image data than text data, so the scale of vision data is often much smaller than that of text corpora.


首先,视觉任务具有不同类型的输入,从静态图像(Russakovsky等人,2015)到序列视频(Miech等人,2019),从纯视觉输入(如图像去雾)(He等人,2010)到包括视觉和语言的多模态输入(Antol等人,2015)。

其次,不同的任务需要不同的粒度,例如图像级任务,如图像分类(He等人,2016)和字幕生成(Vinyals等人,2016),区域级任务,如目标检测(Girshick,2015)和grounding (Plummer等人,2015),像素级任务,如图像分割(He等人,2017)、超分辨率(Wang等人,2020)等。



Towards a unified vision model: three lines of research, namely bridging vision and language (e.g., CLIP), unified multi-task modeling, and LLM-like promptable interfaces

Despite these challenges, there is a growing interest in the computer vision community to develop a general-purpose, unified vision system, in particular for visual understanding tasks. As illustrated in Figure 4.1, we group these efforts in three categories:

>> Bridging vision and language. By extending closed-set classification to open-world recognition, contrastive language-image models like CLIP (Radford et al., 2021) demonstrate impressive zero-shot transferability for different vision tasks. These models learn the mapping between raw visual signals and rich semantics and can power various open-vocabulary vision recognition tasks (Zhong et al., 2022b; Gu et al., 2022; Li et al., 2022f; Ghiasi et al., 2022b).

>> Unified multi-task modeling. Traditional task-specific vision models are trained using task-specific data. It is often prohibitively expensive to develop a model for a new task. Thus, it is desirable to develop a unified vision model that can perform well across many vision tasks (Yang et al., 2022c; Lu et al., 2022a; Zou et al., 2023a; Chen et al., 2022c).

>> LLM-like promptable interface. LLMs can take different language and in-context prompts as inputs and produce user-desired outputs without finetuning. A general-purpose vision model should possess the same in-context learning capability to align the output to various user intents without changing its model parameters (Bar et al., 2022; Kirillov et al., 2023; Zou et al., 2023b; Wang et al., 2023j; Balažević et al., 2023).


>> 衔接视觉与语言的桥梁。通过将封闭集分类扩展到开放集识别,对比语言-图像模型如CLIP(Radford等人,2021)展示了对不同视觉任务的令人印象深刻的零样本可迁移性。这些模型学习了原始视觉信号和丰富语义之间的映射关系,可以支持各种开放词汇的视觉识别任务(Zhong等人,2022b;Gu等人,2022;Li等人,2022f;Ghiasi等人,2022)。

>> 统一多任务建模。传统的任务特定视觉模型是使用任务特定数据训练的。为一项新任务开发模型的成本往往高得令人望而却步。因此,开发一种能够在许多视觉任务中表现良好的统一视觉模型是可取的(Yang等人,2022c;Lu等人,2022a;Zou等人,2023a;Chen等人,2022c)。

>> 类似LLM的可提示接口。LLMs可以接受不同的语言和上下文提示作为输入,并在不微调的情况下产生用户期望的输出。通用视觉模型应该具备相同的上下文学习能力,以将输出与各种用户意图对齐,而不改变其模型参数(Bar等人,2022;Kirillov等人,2023;Zou等人,2023b;Wang等人,2023j;Balažević等人,2023)。

In what follows, we elaborate on the techniques and methods in each category.


4.2、From Closed-Set to Open-Set Models从封闭集模型到开放集模型


Traditionally, visual recognition is formulated as a classification problem that maps raw visual data (e.g., images) to discrete text labels. For example, image classification predicts a label from a pre-defined closed set for a whole image (Deng et al., 2009), and object detection identifies the objects, defined in a closed set, within an image (Lin et al., 2014). However, such closed-set models can hardly transfer to other tasks where the closed set (or vocabulary) is insufficient. For example, it is difficult to apply an object detector trained using the Microsoft COCO object set to detect Minecraft objects. Recently, CLIP (Radford et al., 2021) addresses the limitation of closed-set models by introducing a contrastive language-image pre-training method to train an open-set model. As illustrated in Figure 4.2 (a), instead of learning the mapping from input to labels, CLIP learns an aligned visual-semantic space using hundreds of millions of image-text pairs. Mathematically, the traditional vision tasks optimize the log-likelihood of assigning label $y = c$ to an image, often represented as a feature vector $u \in \mathbb{R}^{P}$:

$$p(y = c \mid u) = \frac{\exp(w_c^{\top} u)}{\sum_{k=1}^{K} \exp(w_k^{\top} u)} \tag{4.1}$$

where $w \in \mathbb{R}^{K \times P}$ is the projection matrix and $w_c$ denotes its $c$-th row. Instead of using a pre-determined projection matrix $w$, the CLIP method uses a text encoder $\mathrm{Enc}_{\text{text}}$ for the projection:

$$v = \mathrm{Enc}_{\text{text}}(\text{Concept}), \qquad p(y = c \mid u) = \frac{\exp(v_c^{\top} u)}{\sum_{k=1}^{K} \exp(v_k^{\top} u)} \tag{4.2}$$

where $v$ plays the role of $w$ in Eq. (4.1). The reason why a text encoder can help achieve open-set recognition is that all textual concepts are embedded in the same feature space through large-scale pre-training, and the feature distributions are coherent with the semantic meanings without the need for a pre-defined vocabulary. As such, the aligned visual-semantic space can be easily transferred to a wide range of image recognition tasks in a zero-shot manner. Please refer to Chapter 2 for a detailed discussion. In the following, we focus our discussion on the region-level and pixel-level models.

传统上,视觉识别被制定为一个分类问题,它将原始视觉数据(例如图像)映射到离散的文本标签。例如,图像分类从预定义的封闭集中预测一个整个图像的标签(Deng等人,2009),目标检测则在图像中识别出封闭集中定义的对象(Lin等人,2014)。然而,这样的封闭集模型很难迁移到封闭集(或词汇表)不足的其他任务。例如,使用Microsoft COCO对象集训练的目标检测器很难用于检测Minecraft中的对象。
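The difference between a fixed projection matrix (Eq. 4.1) and one produced by a text encoder (Eq. 4.2) can be illustrated with a minimal sketch. Everything here is a toy assumption: `toy_text_encoder` is a hypothetical lookup table standing in for a real text encoder, and the 3-dimensional features are made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def classify(u, projection):
    """p(y = c | u) for every row c of `projection` (w in Eq. 4.1, v in Eq. 4.2)."""
    return softmax([dot(row, u) for row in projection])


def toy_text_encoder(concept):
    """Hypothetical stand-in for Enc_text: maps a concept name to a vector."""
    table = {
        "cat": [1.0, 0.0, 0.0],
        "dog": [0.0, 1.0, 0.0],
        "creeper": [0.0, 0.0, 1.0],
    }
    return table[concept]


u = [0.1, 0.2, 2.0]  # toy image feature

# Closed set: a fixed K x P matrix w, here with K = 2 labels (cat, dog).
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
closed_probs = classify(u, w)

# Open set: v is built on the fly from whatever vocabulary is supplied.
vocabulary = ["cat", "dog", "creeper"]
v = [toy_text_encoder(c) for c in vocabulary]
open_probs = classify(u, v)
prediction = vocabulary[open_probs.index(max(open_probs))]
```

In the closed-set case the image can only ever be labeled cat or dog; in the open-set case, adding a new concept only requires encoding its name, with no change to the image side.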







After the release of the CLIP model (Radford et al., 2021), a number of open-set vision models have been developed using large amounts of text-image pairs for visual understanding at different levels of granularity (Yang et al., 2022b; Zhang et al., 2023e; Li et al., 2022f; Ghiasi et al., 2022a), ranging from image-level tasks (e.g., image classification (Deng et al., 2009), image-text retrieval, image captioning (Chen et al., 2015)) and region-level localization (e.g., object detection and phrase grounding (Plummer et al., 2015)) to pixel-level grouping tasks (e.g., image segmentation and referring segmentation (Long et al., 2015; Kirillov et al., 2019; Hafiz and Bhat, 2020)). These models can be categorized along the following three dimensions: model initialization, design and training.



维度1:Model initialization模型初始化

There are different initialization methods for open-set model training.

>> CLIP initialized. Many recent open-set models are trained by using a pre-trained model such as CLIP for initialization since a pre-trained model already provides a well-aligned (but often coarse-grained) visual-semantic feature space. For example, OVR-CNN (Zareian et al., 2021) and RegionCLIP (Zhong et al., 2022b) use a CLIP-style pre-trained ResNet (He et al., 2016) as the vision encoder and a pre-trained RPN (Ren et al., 2015) to extract regional features. Likewise, MaskCLIP (Zhou et al., 2022a) and FreeSeg (Qin et al., 2023b) exploit the CLIP model to extract dense labels for pixels. FC-CLIP (Yu et al., 2023a) uses a frozen convolution network ConvNeXt (Liu et al., 2022b) in CLIP to encode input images of various resolutions.


>> CLIP初始化。最近的许多开放集模型是通过使用预训练模型(如CLIP)进行初始化而训练的,因为预训练模型已经提供了一个良好对齐(但通常是粗粒度的)的视觉-语义特征空间。例如,OVR-CNN(Zareian等人,2021)和RegionCLIP(Zhong等人,2022b)使用类似CLIP的预训练ResNet(He等人,2016)作为视觉编码器,并使用预训练的RPN(Ren等人,2015)提取区域特征。同样,MaskCLIP(Zhou等人,2022a)和FreeSeg(Qin等人,2023b)利用CLIP模型提取像素的密集标签FC-CLIP(Yu等人,2023a)使用CLIP中的冻结卷积网络ConvNeXt(Liu等人,2022b)来编码各种分辨率的输入图像

CLIP augmented: use a pre-trained CLIP model to assist training, either by augmenting the model with aligned CLIP features through knowledge distillation (e.g., ViLD) or by relying on CLIP for features and scores during training (e.g., MaskCLIP and Mask-Adapted CLIP)

>> CLIP augmented. Instead of initializing a model with CLIP parameters, other methods initialize the model parameters as usual (e.g., setting random values to model parameters), but use the pre-trained CLIP to help model training. For example, ViLD (Gu et al., 2022) augments the model with aligned CLIP features via knowledge distillation. MaskCLIP (Ding et al., 2022b) and Mask-Adapted CLIP (Liang et al., 2023a) rely on the pre-trained CLIP model to provide features and scores, respectively, during the course of model training.

>> CLIP增强。与使用CLIP参数初始化模型参数不同,其他方法通常初始化模型参数(例如,将模型参数设置为随机值),但使用预训练的CLIP来帮助模型训练。例如,ViLD(Gu等人,2022)通过知识蒸馏使用与CLIP特征对齐的方法来增强模型。MaskCLIP(Ding等人,2022b)和Mask-Adapted CLIP(Liang等人,2023a)在模型训练过程中依赖于预训练的CLIP模型提供特征和分数



>> Other works learn a visual-semantic feature space using supervised pre-trained models or from scratch. For example, GLIP (Li et al., 2022f) and OpenSeeD (Zhang et al., 2023e) use a pre-trained BERT model (Devlin et al., 2019) and the CLIP text encoder, respectively, and use a vision backbone pre-trained on ImageNet for image encoding. Though these separately pre-trained image and text encoders do not explicitly learn the alignment between image and language, it turns out that these models still give good representations for images and texts, and are instrumental to efficient model training. Differently, GroupViT (Xu et al., 2022a) is trained jointly using an open-set semantic segmentation task and a global image-text alignment task from scratch. ODISE (Xu et al., 2023a) exploits pre-trained Stable Diffusion models (Rombach et al., 2022) to extract compact masks.

>> 其他方法使用监督预训练模型或从头开始学习视觉-语义特征空间。例如,GLIP(Li等人,2022f)和OpenSeeD(Zhang等人,2023e)分别使用了预训练的BERT模型(Devlin等人,2019)和CLIP文本编码器,以及在ImageNet上预训练的视觉骨干进行图像编码。尽管这些分别预训练的图像和文本编码器并未明确学习图像和语言之间的对齐,但事实证明,这些模型仍然为图像和文本提供了良好的表示,并有助于高效的模型训练。


维度2:Model design模型设计

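Two-stage models typically decouple localization from recognition: a pre-trained network localizes objects or extracts masks, and a pre-trained CLIP model then measures the similarity between the visual content and language concepts. Their advantage is that they inherit open-set semantic understanding without additional training, so training effort can focus on a strong localization network.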

Open-set models can be either multi-stage or end-to-end.

>> Two-stage models. These models usually follow the design of pre-DETR models (Ren et al., 2015; He et al., 2017), which decouple localization and recognition. For object detection, a region proposal network is typically pre-trained for localizing the objects of interest (Zhong et al., 2022b; Gu et al., 2021), and a mask proposal network for extracting masks (Ghiasi et al., 2022a; Yao et al., 2022a). Given the localization results, a pre-trained CLIP model is used to measure the similarity between visual contents and language concepts. A clear advantage of two-stage models is that they inherit the open-set semantic understanding capacity without additional training, so that training effort can be devoted to obtaining a well-performing localization network.


>> 两阶段模型。这些模型通常遵循基于pre-DETR的模型(Ren等人,2015;He等人,2017)的设计,将定位和识别解耦。对于目标检测,通常区域建议网络进行预训练定位感兴趣的目标(Zhong等人,2022b;Gu等人,2021),以及用于提取掩码掩码提议网络(Ghiasi等人,2022a;Yao等人,2022a)。在获得定位结果后,通常会使用预训练的CLIP模型来度量视觉内容语言概念之间的相似性。两阶段模型的明显优势在于,它们可以继承开放集语义理解能力无需额外的训练,从而将建模训练专注于要求性能良好的定位网络。
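The two-stage recipe described above can be sketched as follows. This is only an illustration under stated assumptions: the region features are assumed to come from some proposal network plus a CLIP-like image encoder, the text embeddings from a CLIP-like text encoder, and the `min_score` threshold for rejecting background regions is a hypothetical choice rather than something taken from any of the cited papers.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0


def label_regions(proposals, text_embeddings, min_score=0.5):
    """Stage 2: name each proposed region by its most similar concept.

    proposals: list of (box, region_feature) pairs produced by stage 1.
    text_embeddings: dict mapping concept name -> embedding.
    Regions whose best score falls below `min_score` become "background".
    """
    results = []
    for box, feature in proposals:
        scores = {name: cosine(feature, emb) for name, emb in text_embeddings.items()}
        best = max(scores, key=scores.get)
        label = best if scores[best] >= min_score else "background"
        results.append((box, label, scores[best]))
    return results


# Stage 1 output (toy values): boxes with pooled region features.
proposals = [
    ((10, 10, 50, 50), [0.9, 0.1]),
    ((60, 20, 90, 80), [0.1, 0.8]),
    ((0, 0, 5, 5), [-1.0, -1.0]),
]
# The vocabulary is supplied at inference time; no retraining is involved.
text_embeddings = {"zebra": [1.0, 0.0], "giraffe": [0.0, 1.0]}

detections = label_regions(proposals, text_embeddings)
labels = [label for _, label, _ in detections]
```

Only the localization stage needs task-specific training in this setup; the recognition stage is a similarity lookup against whatever concepts are provided.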

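End-to-end models: unlike two-stage models, these adopt DETR-style or other one-stage designs and are trained end to end on image-text pairs, with detection formulated as textual grounding. Follow-up work deepens the vision-language interaction or uses DETR-like designs, and the approach applies to both detection and segmentation.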

>> End-to-end models. Different from two-stage models, the end-to-end models follow the DETR-based methods (Carion et al., 2020; Cheng et al., 2022) or other one-stage models (Dai et al., 2021). GLIP (Li et al., 2022f) is one of the representative works. GLIP formulates object detection as textual grounding and is trained end-to-end on image-text pairs with detection and grounding labels. Follow-up works enhance GLIP by enabling deeper vision-language interactions (Liu et al., 2023h) or using DETR-like model design (Zang et al., 2022; Minderer et al., 2022). For segmentation, both ZegFormer (Ding et al., 2022a) and OpenSeeD (Zhang et al., 2023e) exploit a DETR-like architecture and predict the masks and categories based on the outputs of their decoders.

>> 端到端模型。与两阶段模型不同,端到端模型遵循DETR- based方法(Carion等人,2020;Cheng等人,2022)或其他单阶段模型(Dai等人,2021)。GLIP(Li等人,2022f)是代表性的作品之一。GLIP将目标检测形式化为文本定位,并在图像-文本对上进行端到端训练,包括检测和定位标签。后续工作通过使视觉-语言交互加深入(Liu等人,2023h)或使用DETR-like模型设计(Zang等人,2022;Minderer等人,2022)来增强GLIP。对于分割,ZegFormer(Ding等人,2022a)和OpenSeeD(Zhang等人,2023e)都利用了DETR-like架构,并根据它们的解码器的输出来预测掩码和类别。
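Formulating detection as textual grounding amounts to scoring every candidate region against every token of the text prompt. The sketch below shows that idea with plain dot products; the 2-dimensional features are toy values, and averaging token scores over a phrase span is an assumption made for illustration, not a description of GLIP's actual training objective.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def alignment_scores(region_feats, token_feats):
    """Region-by-token score matrix: scores[i][j] = <region_i, token_j>."""
    return [[dot(r, t) for t in token_feats] for r in region_feats]


def ground_phrases(region_feats, token_feats, phrase_spans):
    """Pick, for each phrase, the region with the highest mean token score.

    phrase_spans: dict mapping phrase -> (start, end) token indices, end exclusive.
    """
    scores = alignment_scores(region_feats, token_feats)
    grounded = {}
    for phrase, (start, end) in phrase_spans.items():
        per_region = [sum(row[start:end]) / (end - start) for row in scores]
        grounded[phrase] = per_region.index(max(per_region))
    return grounded


# Prompt "person holding umbrella" with one toy feature per token.
token_feats = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
phrase_spans = {"person": (0, 1), "umbrella": (2, 3)}
# Two candidate regions from the detector.
region_feats = [[0.9, 0.1], [0.2, 0.8]]

grounded = ground_phrases(region_feats, token_feats, phrase_spans)
```

Because class names are just tokens in the prompt, a standard detection dataset can be fed to the same scoring function by concatenating its category names into one text query.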

维度3:Model pre-training模型预训练:三种学习方法

There are mainly three learning methods for pre-training open-set vision models.

>> Supervised learning. By converting label supervision to language supervision, many works directly leverage the existing supervised annotations for training open-set models. For example, OVR-CNN (Zareian et al., 2021) trains a model with COCO categories and then evaluates its performance on novel categories. Likewise, ViLD (Gu et al., 2021) trains and evaluates two separate models on COCO and LVIS datasets, respectively. Following a similar protocol, many works train the open-set segmentation models on a subset of annotated segmentation data and evaluate the generalization ability on held-out data (Ding et al., 2022a,b; Zhang et al., 2023e; Xu et al., 2023a).


>> 监督学习。通过将标签监督转换为语言监督,许多工作直接利用现有的监督注释来训练开放集模型。例如,OVR-CNN(Zareian等人,2021)训练了一个带有COCO类别的模型,然后评估其在新颖类别上的性能。同样,ViLD(Gu等人,2021)在COCO和LVIS数据集上分别训练和评估了两个单独的模型。遵循类似的协议,许多工作在带注释的分割数据子集上训练开集分割模型,并评估在保留数据上的泛化能力(Ding等人,2022a,b;Zhang等人,2023e;Xu等人,2023a)。


>> Semi-supervised learning. One might use both annotated data and unlabeled or weakly-labeled data. For example, both RegionCLIP (Zhong et al., 2022b) and GLIP (Li et al., 2022f) use a teacher model to extract fine-grained region-text alignments from image-text pairs to augment the training data for better open-set detection performance. Differently, OpenSeg (Ghiasi et al., 2022b) exploits Localized Narrative datasets (Pont-Tuset et al., 2020) as weakly-labeled data, which provides coarse correspondence between language phrases and strokes in images. Empirically, such semi-supervised learning methods often help improve models' generalization ability because they can effectively leverage rich semantics from noisy data.

>> 半监督学习。可以同时使用带注释的数据和未标记或弱标记的数据。例如,RegionCLIP(Zhong等人,2022b)和GLIP(Li等人,2022f)都使用教师模型从图像-文本对中提取细粒度的区域-文本对齐来增强训练数据,以获得更好的开放集检测性能。不同地,OpenSeg(Ghiasi等人,2022b)利用本地化叙事数据集(Pont-Tuset等人,2020)作为弱标记数据,提供图像中语言短语笔画之间粗略对应关系。从经验上看,这些半监督学习方法通常有助于提高模型的泛化能力,因为它们可以有效地利用来自嘈杂数据的丰富语义信息。


>> Weakly-supervised learning. Some works solely use weakly-labeled data for modeling. For example, GroupViT (Xu et al., 2022a) uses a contrastive learning method where all supervisions for model training are from positive and negative image-text pairs. Following the same contrastive learning method, SegCLIP (Luo et al., 2023b) uses a gathering mechanism to learn to merge image patches through the training on image-text pairs.

>> 弱监督学习。一些工作仅使用弱标记数据进行建模。例如,GroupViT(Xu等人,2022a)使用对比学习方法,其中所有模型训练的监督都来自正负图像-文本对。遵循相同的对比学习方法,SegCLIP(Luo等人,2023b)使用收集机制来学习通过图像-文本对的训练来合并图像块
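When the only supervision is which image goes with which text, a symmetric contrastive loss is a common way to use it. The following is a minimal sketch of that generic loss, not the exact objective of GroupViT or SegCLIP; the features are assumed to be unit-normalized and the temperature value is illustrative.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_softmax(row):
    """Numerically stable log-softmax of one row of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric image-to-text and text-to-image cross-entropy.

    Pair k (image_feats[k], text_feats[k]) is the positive; every other
    combination in the batch acts as a negative.
    """
    n = len(image_feats)
    sim = [[dot(i, t) / temperature for t in text_feats] for i in image_feats]
    sim_t = [list(col) for col in zip(*sim)]  # transpose for text-to-image
    i2t = -sum(log_softmax(sim[k])[k] for k in range(n)) / n
    t2i = -sum(log_softmax(sim_t[k])[k] for k in range(n)) / n
    return (i2t + t2i) / 2


images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]   # text k matches image k
shuffled_texts = [[0.0, 1.0], [1.0, 0.0]]  # text k matches the other image

loss_aligned = contrastive_loss(images, aligned_texts)
loss_shuffled = contrastive_loss(images, shuffled_texts)
```

The loss is near zero when matched pairs are the most similar in the batch and grows when a mismatched text scores higher, which is the only training signal such weakly-supervised models receive.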

Below, we review recent models developed for region-level and pixel-level tasks.


4.2.1、Object Detection and Grounding目标检测和定位

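Object detection (identifying and localizing objects of interest): region-based methods (R-CNN/Fast R-CNN/Faster R-CNN) → real-time detectors (YOLO family) → Transformer-based architectures (e.g., DETR/DINO/Group DETR/Co-DETR)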

Object detection is a fundamental task in computer vision that involves identifying and localizing objects of interest within an image or a video sequence (Viola and Jones, 2001). Over the years, various techniques and algorithms have been developed to improve the accuracy and efficiency of object detection. In the past, region-based approaches such as R-CNN (Girshick et al., 2015), Fast R-CNN (Girshick, 2015) and Faster R-CNN (Ren et al., 2015) have been fostering the development of advanced techniques for object detection. To improve real-time performance, YOLO (Redmon et al., 2016) proposes a single neural network that simultaneously predicts object classes and bounding box coordinates. Some improvements are made by either using multiple feature maps at different scales (Liu et al., 2016) or introducing a focal loss to address the class imbalance problem in dense object detection scenarios (Lin et al., 2017). After the emergence of Transformer (Vaswani et al., 2017), DETR (Carion et al., 2020) applies the transformer architecture to object detection, treating it as a set prediction problem. Since DETR, a number of methods have been proposed to improve transformer-based detection models from various aspects, such as DINO (Zhang et al., 2022a), Group DETR (Chen et al., 2022b), and Co-DETR (Zong et al., 2023).
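The focal loss mentioned above down-weights easy examples so that the many easy background locations in dense detection do not dominate training. A minimal sketch of the binary form, FL(p_t) = -α_t (1 - p_t)^γ log(p_t), is given below; the default values of `gamma` and `alpha` are commonly used settings and are included here as assumptions.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one prediction.

    p: predicted probability of the positive class, in (0, 1).
    y: ground-truth label, 1 for positive and 0 for negative.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy negative (confidently predicted as background) contributes
# far less than a hard negative (wrongly predicted as an object).
easy_negative = focal_loss(0.01, 0)
hard_negative = focal_loss(0.90, 0)

# With gamma = 0 the modulating factor disappears and the loss reduces
# to alpha-weighted cross-entropy.
weighted_ce = focal_loss(0.3, 1, gamma=0.0, alpha=0.5)
```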


过去,基于区域的方法,如R-CNN(Girshick等人,2015)、Fast R-CNN(Girshick,2015)和Faster R-CNN(Ren等人,2015)一直在促进目标检测的高级技术的发展。


Transformer(Vaswani等人,2017)出现之后,DETR(Carion等人,2020)将Transformer架构应用于目标检测,将其视为一个集合预测问题。自从DETR以来,已经提出了许多方法,以从各个方面改进基于Transformer的检测模型,例如DINO(Zhang等人,2022a)、Group DETR(Chen等人,2022b)和Co-DETR(Zong等人,2023)。
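Treating detection as set prediction means the model emits a fixed number of predictions and each ground-truth object is matched to exactly one of them before the loss is computed. The sketch below finds the minimum-cost one-to-one matching by brute force over permutations, which is only practical for tiny examples; the cost (negative class probability plus L1 box distance) is a simplified assumption, and real implementations use an efficient assignment solver and additional cost terms.

```python
from itertools import permutations


def match_cost(pred, target):
    """Simplified pairwise cost: low when the class is likely and boxes are close."""
    class_cost = -pred["probs"][target["label"]]
    box_cost = sum(abs(p - t) for p, t in zip(pred["box"], target["box"]))
    return class_cost + box_cost


def best_matching(preds, targets):
    """Return {target_index: prediction_index} minimising the total cost."""
    if len(targets) > len(preds):
        raise ValueError("need at least as many predictions as targets")
    best_perm, best_cost = (), float("inf")
    for perm in permutations(range(len(preds)), len(targets)):
        cost = sum(match_cost(preds[p], targets[t]) for t, p in enumerate(perm))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return {t: p for t, p in enumerate(best_perm)}


# Three predictions, two ground-truth objects; boxes are (cx, cy, w, h).
preds = [
    {"probs": {"cat": 0.10, "dog": 0.80}, "box": [0.5, 0.5, 0.2, 0.2]},
    {"probs": {"cat": 0.10, "dog": 0.10}, "box": [0.9, 0.9, 0.1, 0.1]},
    {"probs": {"cat": 0.90, "dog": 0.05}, "box": [0.2, 0.3, 0.1, 0.1]},
]
targets = [
    {"label": "cat", "box": [0.2, 0.3, 0.1, 0.1]},
    {"label": "dog", "box": [0.5, 0.5, 0.25, 0.2]},
]

matching = best_matching(preds, targets)
# Predictions left unmatched are supervised as "no object".
no_object = [i for i in range(len(preds)) if i not in matching.values()]
```

Because every object is assigned to a single prediction, duplicate detections are penalized during training, which is why this formulation can dispense with post-hoc duplicate removal.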


Open-set object detection models aim to detect arbitrary concepts beyond the vocabulary provided in training data. Three main evaluation settings have been developed in the literature:

>> Zero-shot object detection. Similar to zero-shot image classification (Xian et al., 2018), zero-shot object detection restricts the object classes used for training, and evaluates models' transferability to novel classes. Methods falling in this category mainly focus on evaluating how a model leverages pre-trained concept embeddings (e.g., word2vec (Mikolov et al., 2013)) and learns good visual-semantic alignments (Bansal et al., 2018; Rahman et al., 2020; Zhu et al., 2019, 2020).

>> Strict open-vocabulary object detection. First introduced in OVR-CNN (Zareian et al., 2021), this setting differs from zero-shot object detection in that there is no limit on the training vocabulary as long as it does not cover any target classes. Under this protocol, some representative works are ViLD (Gu et al., 2021) and RegionCLIP (Zhong et al., 2022a), which leverage large-scale language-image models (Radford et al., 2021; Jia et al., 2021), and Detic (Zhou et al., 2022b), which learns from image-label data.

>> Generalized open-vocabulary object detection. Some recent works like GLIP (Li et al., 2022f) and OWL-ViT (Minderer et al., 2022) advocate a more flexible setting to evaluate the dataset or task transferability of object detection models. This setting allows vocabulary overlap between training and test sets, e.g., Objects365 for training and COCO for evaluation. This is arguably a more practical setting than the two settings described above in that models can be trained using any arbitrary set of training data and their detection performance evaluated in the wild (Li et al., 2022b).


>> 零样本目标检测。类似于零样本图像分类(Xian等人,2018),零样本目标检测限制了用于训练的对象类别,并评估模型对新类别的可转移性。属于这一类别的方法主要关注模型如何利用预训练的概念嵌入(例如,word2vec(Mikolov等人,2013))并学习良好的视觉-语义对齐(Bansal等人,2018;Rahman等人,2020;Zhu等人,2019,2020)。

>> 严格的开放词汇目标检测。首次在OV-RCNN(Zareian等人,2021)中引入,该设置与零样本目标检测不同之处在于,只要不涵盖任何目标类,训练词汇就没有限制。在此协议下,一些代表性的工作包括ViLD(Gu等人,2021)、RegionCLIP(Zhong等人,2022a),它们利用了大规模的语言-图像模型(Radford等人,2021;Jia等人,2021),以及从图像标签数据中学习Detic(Zhou等人,2022b)。

>> 通用的开放词汇目标检测。一些最近的工作,如GLIP(Li等人,2022f)和OWL-VIT(Minderer等人,2022),提倡一种更灵活的设置来评估目标检测模型的数据集或任务可转移性。该设置允许训练集和测试集之间存在词汇重叠,例如在训练时使用Objects365,而在评估时使用COCO。这可以说是比上述两种设置实用的设置,因为模型可以使用任意一组训练数据进行训练,并在实际场景中评估其检测性能(Li等人,2022b)。



Object grounding can be considered as a generalized open-set object detection task (Plummer et al., 2015; Kazemzadeh et al., 2014; Chen et al., 2019; Deng et al., 2018). In this task, models take a sentence and an image as input and localize the objects associated with the noun phrases. Recently, M-DETR (Kamath et al., 2021) employs a transformer-based architecture to build an end-to-end modulated detector that detects objects in an image given a raw text query. Unlike previous works where models are trained on specific datasets, the network is pre-trained with 1.3M pairs of text and images, sourced from multi-modal datasets in which the connections between text phrases and corresponding image objects are labeled. Inspired by M-DETR, GLIP (Li et al., 2022f) casts object detection as a grounding problem, and jointly learns a model using object detection and grounding data for open-set scenarios. Following this line of research, DetCLIPv2 (Yao et al., 2023) proposes a simple joint learning method where multiple tasks are converted into a word-region alignment task, and then a model is trained end-to-end on a corpus consisting of object detection data, grounding data and image-text pairs. Grounding-DINO (Liu et al., 2023h) is a state-of-the-art grounded object detection method whose detector is composed of three components, a backbone, a neck, and a head, with language conditions injected at every stage. A combined text and image backbone is employed to extract features at multiple scales, which are then passed on to the neck. The text and image features generated by the neck are subsequently used for language-driven query selection. Grounding-SAM is developed by combining Grounding-DINO with SAM (Kirillov et al., 2023). As shown in Figure 4.4, an image and a group of concepts are first fed into Grounding-DINO to produce the boxes, and then the boxes are used as prompts for SAM to predict masks for each box.
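The composition described for Grounding-SAM, text to boxes and then boxes to masks, can be sketched as a function that chains two interchangeable components. The `toy_detect` and `toy_segment` functions below are hypothetical stand-ins written for this example; they are not the APIs of Grounding-DINO or SAM.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), x1 and y1 exclusive
Mask = List[List[int]]           # H x W grid of 0/1 values


def grounded_segmentation(
    image: List[List[int]],
    concepts: List[str],
    detect: Callable[[List[List[int]], List[str]], List[Tuple[str, Box]]],
    segment: Callable[[List[List[int]], Box], Mask],
) -> List[Dict]:
    """Run a text-prompted detector, then use each box as a segmenter prompt."""
    results = []
    for concept, box in detect(image, concepts):
        results.append({"concept": concept, "box": box, "mask": segment(image, box)})
    return results


def toy_detect(image, concepts):
    """Hypothetical detector: returns a fixed box for concepts it 'knows'."""
    known = {"dog": (1, 1, 3, 3)}
    return [(c, known[c]) for c in concepts if c in known]


def toy_segment(image, box):
    """Hypothetical segmenter: fills the whole box as the mask."""
    x0, y0, x1, y1 = box
    height, width = len(image), len(image[0])
    return [
        [1 if x0 <= x < x1 and y0 <= y < y1 else 0 for x in range(width)]
        for y in range(height)
    ]


image = [[0] * 6 for _ in range(4)]  # blank 4 x 6 image
results = grounded_segmentation(image, ["dog", "cat"], toy_detect, toy_segment)
mask_area = sum(sum(row) for row in results[0]["mask"])
```

Keeping the two stages behind plain callables reflects the point of the composition: either component can be swapped without retraining the other.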







4.2.2、Image Segmentation and Referring图像分割和引用


Image segmentation is a long-standing and challenging vision problem. There are mainly three subtasks, including semantic (Long et al., 2015), instance (Hafiz and Bhat, 2020), and panoptic (Kirillov et al., 2019) segmentation. Semantic segmentation cares about the per-pixel semantic within an image (Long et al., 2015; Chen et al., 2017, 2022j), whereas instance segmentation groups pixels of the same semantic meaning into objects. Models for both tasks have evolved from CNN-based architectures (Long et al., 2015) to transformer-based ones (Chen et al., 2022j), and from two-stage models (He et al., 2017) and one-stage models (Bolya et al., 2019; Tian et al., 2020b) to the recent query-based approaches (Dong et al., 2021; Zou et al., 2022). With the capability of per-pixel and instance-level understanding, a natural step was taken to formulate panoptic segmentation (Kirillov et al., 2019; Wang et al., 2021a; Cheng et al., 2022). Most recently, Mask2Former (Cheng et al., 2022) proposed to address all three tasks with a unified encoder-decoder architecture. Nevertheless, all these works cope with a limited number of categories. In the following, we will review the most recent works on open-set image segmentation and referring segmentation.




Open-Vocabulary Segmentation: transfer the rich visual-semantic knowledge of foundation models to specific segmentation tasks (e.g., LSeg/OpenSeg, GroupViT, DenseCLIP, MaskCLIP, FC-CLIP, ODISE)

Open-Vocabulary Segmentation. Recently, a number of methods have been proposed to transfer or distill the rich visual-semantic knowledge from foundation models (Radford et al., 2021; Jia et al., 2021) to specific segmentation tasks. Prominent examples include LSeg (Li et al., 2022a), OpenSeg (Ghiasi et al., 2022a), and Huynh et al. (2022). Instead of using existing models, GroupViT (Xu et al., 2022a) performs language-image pre-training from scratch with a bottom-up grouping ViT (Dosovitskiy et al., 2021), while DenseCLIP (Rao et al., 2022) demonstrates the superiority of foundation models in finetuning settings compared with supervised models. Recently, MaskCLIP (Ding et al., 2022b) is proposed to tackle open-vocabulary panoptic and semantic segmentation simultaneously by leveraging CLIP, and achieves impressive performance on ADE20K (Zhou et al., 2017) and PASCAL (Mottaghi et al., 2014; Everingham and Winn, 2011). Instead of using the ViT backbone, a recent work called FC-CLIP (Yu et al., 2023a) exploits a convolutional CLIP backbone (i.e., ConvNeXt trained by OpenCLIP (Ilharco et al., 2021)) as both a feature extractor and a vision encoder. Based on a simplified pipeline, FC-CLIP shows plausible efficiency and lifts the state of the art on various open-vocabulary segmentation benchmarks. Rather than only using CLIP, a recent work ODISE (Xu et al., 2023a) leverages text-to-image diffusion models, and shows that the latent features in the pre-trained UNet can provide useful compact segmentation information for open-vocabulary segmentation.

开放词汇分割。最近,已经提出了许多方法,以将基础模型(Radford等人,2021;Jia等人,2021)的丰富视觉-语义知识转移到特定的分割任务中。杰出的例子包括LSeg(Li等人,2022a)、OpenSeg(Ghiasi等人,2022a)和Huynh 等人(2022)。GroupViT(Xu等人,2022a)不是使用现有的模型,而是从头开始进行语言-图像预训练,采用自下而上的分组ViT(Dosovitskiy等人,2021),而DenseCLIP(Rao等人,2022)则在微调设置中展示了基础模型相对于监督模型的优越性





A big challenge in open-vocabulary segmentation is the lack of segmentation data annotated with semantic labels. Thus far, most of the works are still using COCO segmentation annotations. A few recent works attempt to leverage object detection data as the extra supervision to augment the training of segmentation models, such as OpenSeeD (Zhang et al., 2023e) (shown in Figure 4.5) and DataSeg (Gu et al., 2023). In addition to these new modeling techniques, new datasets have been developed to mitigate this problem, including curating multi-domain segmentation datasets (Lambert et al., 2020), collecting high-quality annotations (Lu et al., 2023c) or scaling up to billions of masks (Kirillov et al., 2023).


Referring Segmentation (open-vocabulary by design): work in this area mainly improves results through multimodal fusion, e.g., CLIPSeg extends a textual query to a visual query and LAVT strengthens cross-modal interaction inside a vision transformer. More recent work changes the mask representation from pixels to polygons (PolyFormer) or composes language-driven detection with segmentation.


Referring Segmentation by design is open-vocabulary. Models are usually designed specifically to learn from target datasets using various multimodal fusion strategies (Hu et al., 2016; Liu et al., 2017; Margffoy-Tuay et al., 2018; Ye et al., 2019a; Yu et al., 2016; Wu et al., 2022a). CLIPSeg (Lüddecke and Ecker, 2022) extends a textual query to a visual query and shows superior performance not only on referring segmentation but also on semantic segmentation.

Since the emergence of vision transformers, works like LAVT (Yang et al., 2022e) enhance the cross-modal interactions from the very beginning, which leads to decent performance on RefCOCO (Yu et al., 2016), RefCOCO+ (Yu et al., 2016) and G-Ref (Mao et al., 2016; Nagaraja et al., 2016). In contrast, PolyFormer (Liu et al., 2023e) converts masks into polygons and asks the transformer decoder to decode a sequence of polygon coordinates. Inspired by Pix2Seq (Chen et al., 2022c), a similar method in object detection, PolyFormer presents an alternative way to represent masks for state-of-the-art referring segmentation. As we discussed earlier, one can also compose Grounding DINO (Liu et al., 2023h) with SAM (Kirillov et al., 2023) for referring segmentation.


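Unified Segmentation: unifying all segmentation tasks in a single framework, e.g., X-Decoder (reformulates the tasks and uses a generalized encoder-decoder architecture) and UNINEXT (uses early fusion to unify different segmentation tasks)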


Given the above methods for open-vocabulary and referring segmentation, an open question is how to unify all segmentation tasks in a single framework. Recently, X-Decoder (Zou et al., 2023a) uses a generalized encoder-decoder architecture to unify all these segmentation tasks. The referring segmentation task is reformulated as a conditioned panoptic segmentation that takes some textual phrases as input to the decoder. UNINEXT (Yan et al., 2023) is another work that attempts to unify all instance-level segmentation in images and videos. Different from X-Decoder, UNINEXT uses early fusion to fuse the various prompts and vision features, which are then fed to the transformer encoder-decoder.





Figure 4.6: (a) CV task landscape: CV tasks can span different axes, including modality, space and time, which poses significant challenges for unifying all of them in a single model. Image credit: Yuan et al. (2021). (b) The data scale pyramid: datasets in different tasks usually contain different types of supervision. Image-level datasets like ImageNet (Deng et al., 2009) and LAION (Schuhmann et al., 2021) have annotations with rich semantic coverage but are coarse-grained, while pixel-level datasets like COCO panoptic segmentation (Chen et al., 2015) provide fine-grained annotations but cover limited concepts.



4.3、From Task-Specific Models to Generic Models



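1) Vision tasks are fragmented: they span different domains, granularities, and modalities, which makes it hard to develop a single unified model;

2) Data scales differ: the amount of human-annotated data varies enormously across tasks, which makes building a unified model challenging.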

Above we have discussed the recent efforts of transforming closed-set models to open-set ones for detection and segmentation. Until recently, however, most vision tasks have been separately tackled with specialized model designs, preventing the synergy of tasks across different granularities or domains from being exploited. This is arguably due to two reasons:

>> Vision tasks are fragmented. As shown in Figure 4.6 (a), computer vision tasks span across different axes including space, time, and modality. From the space aspect, it can be image-level, region-level and pixel-level tasks as we discussed before. Along the time axis, we need to tackle not only static images but also temporal video sequences. Regarding the modality, the inputs and outputs can be images, texts, or other types (e.g., human pose, depth map). Such diverse task formats significantly impede the development of a unified model for all tasks.

>> Data scales are different. In addition to the complicated task landscape, the scarcity of human annotations and their different scales for different tasks also make building a unified model challenging. In Figure 4.6 (b), we can see a clear pyramid of data scale, where different layers of human annotations have different semantics. More specifically, image-text datasets like LAION (Schuhmann et al., 2021) contain up to 2B samples, while object detection datasets like Objects365 (Shao et al., 2019) have 1.7M images in total. An even more significant gap is observed in segmentation datasets due to the high cost of annotating masks.








Despite the aforementioned challenges, we are now witnessing a growing interest in building unified, general-purpose models that can learn from and be applied to a diverse set of vision and vision-language tasks, thanks to the versatility of transformers (Vaswani et al., 2017). These attempts can be grouped into two main categories:

>> I/O Unification. Following the development of unified LLMs, a number of recent works reformulate many vision tasks as a sequence-to-sequence problem (Wang et al., 2022b; Yang et al., 2022c; Chen et al., 2022d; Lu et al., 2022a). They typically use a tokenizer to convert the original inputs and outputs (I/O) of the different modalities used in various tasks into a coherent sequence of (visual or text) tokens, and then exploit a unified sequence-to-sequence model.

>> Functionality Unification. In addition to I/O unification, one might build a generic model via functionality unification. Extending multi-task learning methods (Lu et al., 2020; Gupta et al., 2022a; Hu and Singh, 2021a), many recent works use a coherent encoder-decoder architecture (Yu et al., 2022a; Zhang et al., 2022b; Zou et al., 2023a). This line of work usually does not need task-specific or modality-specific tokenizers but requires a sophisticated model design to accommodate various tasks.





Figure 4.7 illustrates the difference between the two categories of unification methods. For I/O unification, the I/O unification module always generates a sequence of tokens, and exploits a separate decoder to decode the final outputs for different tasks. For functionality unification, the functional unification module generates heterogeneous outputs for different tasks, e.g., semantic outputs and spatial outputs. These different types of outputs are then combined to produce the final task-specific outputs. Both unification methods strive to make use of synergy across tasks with different levels of granularity. For example, coarse-grained data is expected to contribute to the rich semantic understanding required by fine-grained tasks, while fine-grained data is expected to enhance the grounding ability for coarse-grained tasks. In the following, we review some recent works of these two categories.


4.3.1、I/O Unification

This line of work is mainly inspired by LLMs that unify many NLP tasks as sequential modeling. In the vision domain, the methods of building generic models via I/O unification can be grouped into two categories depending on the tasks of interest and output formats.


Category 1: Sparse and discrete outputs


For vision tasks that produce sparse or discrete token outputs, we can easily exploit a language tokenizer, such as byte-pair encoding (BPE) (Sennrich et al., 2016), for I/O unification. Spatial outputs like boxes, masks, or human skeletons, in contrast, can be formulated as a sequence of numeric coordinates which are then tokenized into discrete tokens (Cho et al., 2021; Yang et al., 2022c; Liu et al., 2023e). As a result, the decoded output sequence interleaves organic textual tokens with numeric textual tokens to support a wide range of tasks. Without loss of generality, the decoding process is formulated as auto-regressive generation, and the model is trained with the objective function defined as:

$$\mathcal{L}(\theta) = \sum_{t=1}^{T} \log p(y_t \mid y_{<t}, v; \theta),$$

where $y = \{y_1, \ldots, y_T\}$ is the discrete token sequence of length T, and v is the visual feature. Below, we review some representative works.
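As a rough illustration of the auto-regressive objective described above, the following sketch sums the log-probabilities of the target tokens one decoding step at a time. It assumes the decoder has already produced per-step logits conditioned on the previous tokens and the visual feature v; the function names and the toy three-token vocabulary are hypothetical and not taken from any of the cited models.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one step's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def sequence_log_likelihood(step_logits, targets):
    """Sum over t of log p(y_t | y_<t, v).

    step_logits[t] is assumed to be the decoder output at step t, already
    conditioned on the previous tokens y_<t and the visual feature v.
    """
    return sum(log_softmax(logits)[y_t] for logits, y_t in zip(step_logits, targets))


# Toy example: vocabulary of 3 tokens, target sequence of length T = 2.
step_logits = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
targets = [1, 0]
log_lik = sequence_log_likelihood(step_logits, targets)
loss = -log_lik  # in practice one minimizes the negative log-likelihood
```

In a real model the logits at each step would come from a transformer decoder under teacher forcing; only the summation over steps is shown here.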




UniTab (Yang et al., 2022c) unifies text and box output in a sequence decoding manner. As shown in Figure 4.8 (a), the box coordinates are represented by numerical numbers with <> and then a special token <obj> is used to encompass the location information. In this way, the model can unify a variety of tasks that require textual and location outputs, including image captioning (Chen et al., 2015), grounded captioning (Plummer et al., 2015), visual grounding, object localization and visual question answering (Antol et al., 2015). The model is trained in three stages: pre-training, multi-task finetuning, and task-specific finetuning.
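The idea of writing box coordinates as discrete tokens can be sketched as follows. This is a minimal illustration, not UniTab's actual tokenizer: the number of bins, the order of phrase and coordinate tokens inside `<obj> ... </obj>`, and all function names are assumptions made for the example.

```python
NUM_BINS = 200  # assumed number of quantization bins per axis


def quantize(value, size, num_bins=NUM_BINS):
    """Map a pixel coordinate in [0, size] to an integer bin index."""
    idx = int(value / size * num_bins)
    return min(num_bins - 1, max(0, idx))


def box_to_tokens(box, width, height):
    """Turn an (x0, y0, x1, y1) pixel box into four coordinate tokens like '<50>'."""
    x0, y0, x1, y1 = box
    bins = [
        quantize(x0, width),
        quantize(y0, height),
        quantize(x1, width),
        quantize(y1, height),
    ]
    return [f"<{b}>" for b in bins]


def grounded_phrase(phrase, box, width, height):
    """Wrap a phrase and its box tokens with the special <obj> markers."""
    return ["<obj>"] + phrase.split() + box_to_tokens(box, width, height) + ["</obj>"]


tokens = grounded_phrase("a dog", (50, 100, 150, 200), width=200, height=400)
```

Because text and coordinates end up in one token sequence, the same decoder can emit captions, boxes, or both, which is what lets a single model cover captioning, grounding and question answering.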



Pix2SeqV2 (Chen et al., 2022d) slightly differs from UniTab in that it unifies two different vision tasks: referring segmentation and keypoint detection. Following Pix2Seq (Chen et al., 2022c), Pix2SeqV2 represents objects in an image as [ymin, xmin, ymax, xmax, text]. Then, it introduces a unique task prompt for each task, which contains task type information or a combination of task types and specific locations. For mask decoding, a mask contour is converted into a polygon and its coordinates are then extracted from the polygon (Castrejon et al., 2017). A similar strategy is also used for referring segmentation, as in PolyFormer (Liu et al., 2023e).




Recent works have also explored building a generic decoding interface based on LLMs, which are pre-trained on large amounts of text data and human instructions. Kosmos-2 (Peng et al., 2023b) exploits the pretrained LLMs of Kosmos-1 (Huang et al., 2023b) and augments the grounded multi-modal data by collecting a web-scale grounded image-text pair dataset (GRIT) consisting of 91M images. VisionLLM (Wang et al., 2023h) appends an even larger LLM (e.g., LLaMA (Touvron et al., 2023)) on top of an image tokenizer, as shown in Figure 4.9. The resultant model exhibits a very strong vision-language reasoning capacity and decent localization ability for object detection, segmentation, etc. Some other works that combine LLMs with grounding are DetGPT (Pi et al., 2023) and GPT4ROI (Zhang et al., 2023k). To further equip the model with the segmentation capability, both BubaGPT (Zhao et al., 2023c) and LISA (Lai et al., 2023) use an extra referring segmentation model to segment images by taking texts or embeddings as input, respectively. PaLI-X (Chen et al., 2023g) is by far the largest unified model that can cope with multilingual vision and vision-language tasks.



Category 2: Dense and continuous outputs


There are also some tasks that require dense and continuous outputs, such as image segmentation (He et al., 2017), depth estimation (Mertan et al., 2022), image inpainting and editing (Elharrouss et al., 2020; Brooks et al., 2023). Except for segmentation masks, which can be approximated by polygons (Liu et al., 2023e; Chen et al., 2022d), most dense and continuous outputs cannot be easily converted into discrete tokens due to the high-dimensional space. Thus, we have to resort to an image-oriented tokenizer. Akin to the language tokenizer, an image tokenizer encodes raw images and extracts discrete tokens spanning the visual feature space. The most representative work is VQ-VAE (Oord et al., 2017; Razavi et al., 2019). As shown in Figure 4.10 (a), VQ-VAE learns an encoder ze, a decoder zq and a discrete codebook e = {e1, ..., eK} consisting of K embeddings. Given the input x, the posterior categorical probability q(z|x) is defined as:

$$q(z = k \mid x) = \begin{cases} 1 & \text{if } k = \arg\min_{j} \lVert z_e(x) - e_j \rVert_2, \\ 0 & \text{otherwise,} \end{cases}$$

where the decoder zq takes x (or its representation ek) as input to predict the class label. As a variant of VQ-VAE, VQ-GAN uses a discriminator and the perceptual loss (Larsen et al., 2016; Lamb et al., 2016) to maintain a good balance between output quality and model efficiency (via high compression rate). In Figure 4.10 (b), we see that the discriminator is applied at the patch level to regularize the decoding of images at high resolution. Below, we discuss some of the most recent works that attempt to unify different vision and multi-modal tasks that involve dense outputs.
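The nearest-neighbour codebook lookup behind the VQ-VAE posterior can be sketched in a few lines. This is an illustrative toy with a hand-written two-dimensional codebook, not a trained tokenizer; the function names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_code(z_e, codebook):
    """Return (k, e_k): index and embedding of the codebook entry closest to z_e."""
    k = min(range(len(codebook)), key=lambda j: squared_distance(z_e, codebook[j]))
    return k, codebook[k]


def posterior(z_e, codebook):
    """One-hot categorical posterior q(z = k | x) given the encoder output z_e(x)."""
    k, _ = nearest_code(z_e, codebook)
    return [1.0 if j == k else 0.0 for j in range(len(codebook))]


# Toy codebook with K = 3 embeddings of dimension 2.
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
z_e = [0.9, 1.2]  # stand-in for the encoder output at one spatial location
index, embedding = nearest_code(z_e, codebook)
q = posterior(z_e, codebook)
```

Applying this lookup at every spatial location of the encoder output turns an image into a grid of integer indices, which is the discrete token sequence a language-model-style decoder can then predict.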



UViM (Kolesnikov et al., 2022) is one of the first works that employ a dense decoding process to unify various core vision tasks, including panoptic segmentation, depth estimation and colorization. The learning process consists of two stages: (i) Base encoder-decoder f and restricted oracle Ω are learned to predict outputs given input images, where f takes raw image as input and Ω takes the desired output as input to decode the oracle code; (ii) Instead of using the desired output as input to the oracle Ω, the model learns a language model to produce the oracle code for the input raw image. Notably, the encoder-decoder model used here is trained with VQ-VAE objectives. As the first step to unify vision tasks with a single model, UViM shows promising results on three vision tasks.




Unified-IO (Lu et al., 2022a) is another representative work. Compared to UViM, it scales to many more vision tasks and datasets. Unlike the training procedure of UViM, Unified-IO first trains different VQ-VAE models for different tasks, as depicted in Figure 4.11 left. After obtaining all VQ-VAE encoder-decoders, 90 datasets are combined to train another transformer encoder-decoder end-to-end, as shown on the right side. Similar to previous works, it also uses a language decoder to obtain the organic and numeric texts to generate coordinate outputs. After the second-stage pre-training, the model achieves state of the art on the GRIT benchmark (Gupta et al., 2022c) and exhibits compelling compositionality, although the performance still lags behind the strongest models on common tasks.

As a follow-up, a soft-token strategy is proposed in Ning et al. (2023) to improve the accuracy of next-token decoding. In addition, a masked modeling strategy is proposed to learn robust representations. Evaluated on instance segmentation and depth estimation, the model achieves state-of-the-art performance on NYUv2 (Silberman et al., 2012) and competitive performance on segmentation.

A recent work uses image inpainting as the general task to unify different pixel-level vision tasks (Bar et al., 2022). Given the target discrete tokens produced by VQ-GAN, the method exploits a masked autoencoder to decode the missing image regions, using the task input-output examples as prompts. Painter (Wang et al., 2023i) extends this pipeline to facilitate more vision tasks and obtains competitive performance on various standard benchmarks.



Diffusion-augmented: building generalist vision models on top of an off-the-shelf Stable Diffusion model, e.g., Prompt Diffusion and InstructDiffusion

Diffusion-augmented. Unlike the above works that learn their own decoding models, some recent works utilize the off-the-shelf Stable Diffusion model to build generalist vision models. For example, Prompt Diffusion (Wang et al., 2023m) initializes a model using Stable Diffusion and ControlNet (Zhang and Agrawala, 2023), and trains the in-context image-to-image model jointly on six different vision-language tasks, including segmentation, depth estimation, etc. InstructDiffusion (Geng et al., 2023) also uses the diffusion model but explicitly introduces task-specific instructions to the diffusion process. Moreover, it uses task-specific training and human alignment training to enable a generalist interface for vision tasks.


4.3.2、Functionality Unification

Unlike I/O unification, functionality unification attempts to unify different tasks based on the task characteristics, with the awareness that they are neither fully isolated nor fully aligned. At a high level, vision tasks produce three types of outputs: (i) location outputs, (ii) semantic outputs, and (iii) pixel-level outputs. For example, both object detection and phrase grounding need to localize objects in the image, while both generic segmentation and referring segmentation produce masks. On the other hand, many tasks require semantic (or text) outputs to represent either concept names or textual descriptions.






Multi-task learning

Some early works explore multi-task learning methods for unifying different vision or vision-language tasks.


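Vision models: early works explore CNNs for learning across different vision tasks (e.g., Cross-stitch Networks, UberNet) but fail to build cross-task synergy that improves performance; Taskonomy offers deep insights by studying the relationships among vision tasks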

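Several works use CNNs for multi-task learning. Cross-stitch Networks and UberNet try to design CNN architectures that adapt to different tasks, but struggle to exploit the relationships among tasks to improve performance. Taskonomy trains a task-specific model for each task and then maps between tasks in the latent space to study their relationships, finding that tasks such as surface normal estimation are closely tied to certain other tasks. Guided by these intrinsic relationships, it provides deep insights for multi-task vision modeling. Overall, these works focus on CNN architecture design and fall short of building a mechanism for synergistic learning across tasks; Taskonomy partly addresses this gap by studying the affinity among the tasks themselves.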

A few works explore using CNNs for learning with different vision tasks at different levels. For example, Cross-stitch Networks (Misra et al., 2016) develops a strategy to split different numbers of layers from the top in CNNs so as to adapt to different vision tasks. Results show that the best-performing multi-task architecture depends on the tasks of interest and can hardly generalize to new tasks. UberNet (Kokkinos, 2017) takes one step further by using a single universal CNN architecture with a carefully designed routing mechanism to save memory and computing cost, as shown in Figure 4.12 (a). Both works require some tweaking of the CNN architecture so that they can adapt to different levels of tasks and loss types. Unfortunately, they fail to build synergy across tasks to improve model performance. Taskonomy (Zamir et al., 2018) specifically studies the relationship among vision tasks. It first trains task-specific models for each individual task and then performs transfer modeling across tasks in the latent space. The task affinity is then calculated in the latent space, providing us with the taskonomy. The result shows that vision tasks have different affinities for different groups, as shown in Figure 4.12 (b). For example, surface normal estimation is heavily related to reshaping and point matching, and curvature extraction is related to image segmentation tasks. This study provides deep insights for multi-task vision modeling (Xu et al., 2018; Crawshaw, 2020).






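Multi-modal models: the rise of Transformers has advanced multi-task multi-modal learning. Early works mainly combine multiple vision-language tasks through a shared trunk and task-specific heads, but do not fully exploit cross-task synergy. Examples include 12in1 (a shared trunk with task-specific heads for multiple vision-language tasks) and UniT/E2E-VLP (extend to vision-only tasks and allow end-to-end training)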

The emergence of Transformers significantly facilitates the advancement of multi-task multi-modal learning. Among them, 12in1 (Lu et al., 2020) is one of the pioneering works that combine 12 vision-language tasks in a single BERT-based architecture. It uses task-specific heads for individual tasks and a commonly shared trunk, ViLBERT (Lu et al., 2019). Results show that multi-task learning can achieve substantial improvements over single-task learning while reducing the model parameters significantly. Later on, UniT (Hu and Singh, 2021b) exploits an encoder-decoder architecture and expands to vision-only tasks like object detection. Additionally, it allows end-to-end training on the task pool without relying on pre-trained detectors. Similar to 12in1, it also uses a task-specific head for each task, motivated by the empirical result that sharing the same head usually hurts performance. Likewise, E2E-VLP (Xu et al., 2021) proposes an end-to-end pipeline for both localization tasks and text generation. Both UniT and E2E-VLP demonstrate the versatility of the encoder-decoder architecture of DETR (Carion et al., 2020). Following the same spirit, GPV (Gupta et al., 2022b) proposes an end-to-end task-agnostic architecture for different vision and vision-language tasks. It uses DETR to extract boxes and region features and then exploits a cross-attention module for fusion, followed by a vision decoder and a language decoder for decoding different outputs.

The above vision and multi-modal models unify different tasks by incorporating different modules or heads designed to cope with different tasks, and can hardly achieve synergy across tasks. In the following, we discuss recent model unification research that aims to make the best use of synergy among various vision and multi-modal tasks.





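Unified learning: thanks to Transformers and the development of open-set models, the barriers across tasks are gradually fading, so that inputs from different modalities can be bound to learn a shared semantic space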

The barrier across tasks is gradually blurred thanks to the use of Transformers (Vaswani et al., 2017) and the development of open-set models as we discussed earlier. It is now possible to bind inputs from different modalities to learn a shared semantic space. A number of works (Zhang et al., 2022b; Zou et al., 2023a; Li et al., 2023g) have recently been proposed to unify vision and vision-language tasks by using one model for all. After pre-training, the single model can be applied to tackle all tasks in a zero-shot manner and the performance can be further improved via task-specific finetuning. Note that unified learning in this context differs from previous works of large-scale pre-training. Like GPT, which serves as a universal language interface after pre-training, a unified vision model is not only a representation learning engine but also an interface that supports as many tasks as possible in a zero-shot manner. Below, we review a few representative works.




GLIPv2 (Zhang et al., 2022b) extends GLIP (Li et al., 2022f) to support a wide range of vision and vision-language tasks, including grounded captioning, visual question answering, etc. GLIPv2 seamlessly integrates localization pre-training and Vision-Language Pre-training (VLP) through three distinct pre-training tasks: (i) phrase grounding, which serves as a vision-language adaptation of detection tasks; (ii) region-word contrastive learning, introducing a novel contrastive task at the region-word level; and (iii) masked language modeling. In a zero-shot manner, this pre-trained model can be applied to different tasks and attain reasonable performance across the board. Unlike previous works (e.g., GPV (Gupta et al., 2022b)), it merges the localization module and vision-language matching module in a coherent manner, which makes model training from fused data much more efficient and effective.
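To give a feel for what a contrastive task at the region-word level involves, the sketch below scores every region feature against every word feature with a dot product and applies a cross-entropy over the words for each region. It is a simplified stand-in under stated assumptions, not GLIPv2's actual loss: the feature vectors are hand-written, each region is assumed to match exactly one word, and the function names are hypothetical.

```python
import math


def alignment_scores(region_feats, word_feats):
    """Dot-product score between every region feature and every word feature."""
    return [[sum(r * w for r, w in zip(region, word)) for word in word_feats]
            for region in region_feats]


def region_to_word_loss(scores, target_words):
    """Mean cross-entropy where each region should select its matching word."""
    total = 0.0
    for row, target in zip(scores, target_words):
        m = max(row)
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        total += log_z - row[target]
    return total / len(scores)


# Two regions and two words; region i is assumed to match word i.
region_feats = [[1.0, 0.0], [0.0, 1.0]]
word_feats = [[2.0, 0.0], [0.0, 2.0]]
scores = alignment_scores(region_feats, word_feats)
loss = region_to_word_loss(scores, target_words=[0, 1])
```

The loss shrinks as matched region-word pairs score higher than mismatched ones, which is the behaviour a region-word contrastive objective is meant to encourage.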






X-Decoder (Zou et al., 2023a) follows the generic design of encoder-decoder architecture. Given an input image, it first uses an image encoder to extract features at multiple scales. Afterward, a text encoder is used to encode a textual query into a sequence of embeddings. The visual features, textual queries and the non-semantic or latent queries are fed to a decoder to predict the outputs. Three critical designs are proposed to empower the generalization ability of X-Decoder to a variety of vision and vision-language tasks: (i) It defines two types of queries and outputs. Specifically, the queries for the decoder are categorized into latent queries and text queries, which undertake generic vision and vision-language tasks, respectively. Likewise, the output is categorized into pixel-level masks and semantic embeddings; (ii) A single text encoder is exploited to encode the textual corpus from all tasks. The common text encoder is used to encode referring phrases, text descriptions, and image captions in the tasks of referring segmentation, image-text retrieval and image captioning, respectively; (iii) It fully decouples the image and text encoders, and uses all the outputs as queries. As such, it can learn from both intra-image supervisions and inter-image ones, which is essential to learn stronger pixel-level representations and support different granularities of tasks. As shown in Figure 4.13, the pre-trained model can support different tasks by taking different routing while sharing the same suite of parameters.






Uni-Perceiver-v2 (Li et al., 2023g) is another generalist model that unifies vision and vision-language tasks. Similar to X-Decoder, the model exploits a vision encoder, a text encoder and a general decoder. Differently, it introduces a region proposal network on top of the vision backbone to explicitly predict the boxes and masks, which are then encoded as “queries” for the general decoder. To jointly train on datasets at different levels, it introduces a unified max-likelihood estimation strategy for tasks with localization and without localization.


4.4、From Static to Promptable Models

The success of Large Language Models (LLMs) such as ChatGPT (OpenAI, 2023b) has shown the importance of modern AI models in interacting with humans, and has provided a glimpse of AGI (Bubeck et al., 2023). The ability to interact with humans requires a user-friendly interface that can take as many types of human inputs as possible and generate responses that humans can easily understand. In NLP, such a universal interaction interface has emerged and evolved for a while, from early models like GPT (Brown et al., 2020) and T5 (Raffel et al., 2020) to more advanced techniques like prompting (Shin et al., 2020; Zhao et al., 2021; Li and Liang, 2021) and chain-of-thought (Wei et al., 2022a; Kojima et al., 2022; Schick et al., 2023). However, most vision models are still static in that they are less flexible than LLMs in handling various prompts. Most recently, a number of works have proposed to enhance static vision models with the capabilities to support: (i) multi-modal prompting; (ii) in-context prompting.




4.4.1、Multi-modal Prompting

Vision is different from language by nature. To enable a smooth interaction between humans and AI, a model requires not only language prompts but also other types of prompts to complement the missing information or resolve the ambiguity in language. Recently, a number of works have explored how to combine or augment language prompts with other types of prompts, such as spatial prompts (Kirillov et al., 2023), visual prompts (Zou et al., 2023b) and other modalities (Girdhar et al., 2023; Liu et al., 2023f). In the following, we review some representative works.



Spatial prompting

Vision is rooted in the physical world, and as such it is not only semantic but also spatial by nature. Spatial prompting can be considered as a way to modulate vision models through inputs of location information, which could be a point, a box, or an arbitrary stroke, etc. Such clues have been heavily used in UI designs of computers (e.g., mouse) and mobile devices (e.g., touch screen). In computer vision, interactive segmentation (Mortensen and Barrett, 1998; McGuinness and O'Connor, 2010; Chen et al., 2021c, 2022i) naturally requires such capability so that the model can take multiple clicks from users and gradually refine the segmentation mask. However, most of these works are still task-specific by design and lack enough flexibility to support different types of spatial prompts.

SAM (Kirillov et al., 2023) is one of the pioneering works that propose a convenient spatial prompting interface and learn a foundation model for image segmentation. As shown in Figure 4.14, the model can take points or boxes as the prompts, and segment images in arbitrary granularity. The ability to segment images following user instructions makes it a ready foundation on which to build many more models and applications (Zhang et al., 2023c). To name a few, a number of works (Ma and Wang, 2023; Roy et al., 2023) start with SAM and train a promptable segmentation model for the medical domain. Spatial prompting is particularly beneficial there, in that the textual annotations for medical images are usually limited and hard to interpret. Similar cases also happen in other industry domains (Tang et al., 2023a). To further improve point prompting, SAMAug (Dai et al., 2023a) proposes to refine the points using the max entropy criterion and saliency map, which can help to determine the most informative locations the model should look at.




Visual prompting

In many cases, textual descriptions of objects are not necessarily clear enough to convey the information. For example, given an unrecognizable or indescribable object, people may fail to express themselves clearly about the object. In this case, showing one or a few examples would be more informative and straightforward. With this idea, a line of works has studied exemplar-based visual modeling, such as image-to-image retrieval (Yoon et al., 2021; Datta et al., 2008; Zhang et al., 2018), image co-segmentation (Joulin et al., 2010; Jerripothula et al., 2016) and visual object tracking (Yilmaz et al., 2006; Luo et al., 2021; Wu et al., 2013). Most recently, this strategy has been formulated as visual prompting, in that different types of visual inputs are usually encoded to some unified format and then fed into a Transformer architecture, as in LLMs.


SEEM (Zou et al., 2023b) is one of the representative works that enable visual prompting to a vision model for image segmentation. As shown in Figure 4.15, unlike the aforementioned SAM, SEEM can take visual prompts in the form of points, boxes, and strokes drawn on an image, which can be the target image or another reference image. It develops a new module called a visual sampler that can extract visual features from an image according to the locations specified by users. Based on the visual sampler, the model can even take another reference image as input without having been trained on such inputs. As a result, it shows impressive performance not only for various image segmentation tasks but also for video object segmentation in a zero-shot manner.


PerSAM (Zhang et al., 2023h) develops a personalized segmentation model on top of SAM and takes one shot as the input. It learns a specific model that takes a source image plus a mask as input and then predicts the mask for a target image. To extract the visual prompts, mask pooling is applied and the pooled features are used as the input tokens to the decoder of PerSAM. It also proposes a way to extract positive and negative priors based on feature matching, to provide pre-trained SAM models with comprehensive clues. Like most prompt learning methods in LLMs, an appealing feature of PerSAM is that it can be easily built on off-the-shelf models like SAM. SAM-PT (Rajič et al., 2023) further applies this strategy to video object segmentation. Inspired by the spatial prompting in SAM, it exploits a point-tracking system (Harley et al., 2022) to track different points (both positive and negative ones) and then asks SAM to segment the image given the points. It exhibits strong point tracking performance as well as segmentation.

PerSAM(Zhang等人,2023h)在SAM之上开发了一个个性化分割模型,并以one shot为输入。它学习一个特定的模型,该模型将源图像加上掩码作为输入,然后预测目标图像的掩码。为了提取视觉提示,采用掩码池并将其用作PerSAM解码器的输入令牌。提出了一种基于特征匹配提取正先验和负先验的方法,以便为预训练的SAM模型提供全面的线索。与LLMs中的大多数提示学习方法一样,PerSAM的一个合理特点是,它可以很容易地通过像SAM这样的现成模型获得
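The mask pooling and feature matching steps described above can be sketched as follows. This is an illustrative toy under stated assumptions, not PerSAM's code: features are small nested lists rather than encoder outputs, similarity is plain cosine similarity, the positive and negative priors are simply the most and least similar target locations, and all function names are hypothetical.

```python
import math


def masked_average_pool(features, mask):
    """Average the feature vectors at positions where mask == 1."""
    dim = len(features[0][0])
    total, count = [0.0] * dim, 0
    for feat_row, mask_row in zip(features, mask):
        for vec, m in zip(feat_row, mask_row):
            if m:
                total = [t + v for t, v in zip(total, vec)]
                count += 1
    if count == 0:
        raise ValueError("mask selects no positions")
    return [t / count for t in total]


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def location_priors(prototype, target_features):
    """Return (most similar location, least similar location) in the target image."""
    scored = [
        (cosine(prototype, vec), (r, c))
        for r, row in enumerate(target_features)
        for c, vec in enumerate(row)
    ]
    return max(scored)[1], min(scored)[1]


# One-shot reference: 2x2 feature map with a mask over the top row (toy values).
ref_features = [[[1.0, 0.0], [1.0, 0.2]], [[0.0, 1.0], [0.0, 1.0]]]
ref_mask = [[1, 1], [0, 0]]
prototype = masked_average_pool(ref_features, ref_mask)

target_features = [[[0.0, 1.0], [0.1, 1.0]], [[1.0, 0.1], [0.5, 0.5]]]
positive, negative = location_priors(prototype, target_features)
```

In this reading, `positive` and `negative` would serve as point prompts for the frozen segmentation model.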



Some other works combine a wide range of visual prompting types. For example, Painter (Wang et al., 2023i) reformulates different vision tasks (e.g., depth estimation, image segmentation) all as prompting and learns a decoder in an in-context learning manner. The prompts are combinations of raw images and the corresponding dense annotations (e.g., depth or segmentation maps). In contrast, Prismer (Liu et al., 2023f) makes use of many off-the-shelf vision models to extract different information from the raw images and then feeds the information to a vision-language model. To facilitate the interplay across multiple modalities, ImageBind (Girdhar et al., 2023) learns a universal alignment among image/video, language, audio and depth. Once the embedding space is learned, it can be used to compose different types of prompts by simply summing their embeddings.

其他一些作品结合了广泛的视觉提示类型。例如,Painter (Wang et al., 2023i)将不同的视觉任务(如深度估计、图像分割)都重新表述为提示,并以上下文学习的方式学习解码器。提示是原始图像和相应的密集注释(例如深度或分割地图)的组合。
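Composition by summation in a shared embedding space can be illustrated with a toy retrieval example. The three-dimensional vectors below are made-up stand-ins for real encoder outputs, and renormalizing after the sum is an assumption of this sketch rather than something the text specifies.

```python
import math


def normalize(v):
    """Scale a non-zero vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def compose(*embeddings):
    """Sum embeddings from different modalities, then renormalize (assumption)."""
    summed = [sum(vals) for vals in zip(*embeddings)]
    return normalize(summed)


def retrieve(query, gallery):
    """Return the gallery key whose embedding has the highest dot product with the query."""
    return max(gallery, key=lambda k: sum(q * g for q, g in zip(query, gallery[k])))


# Hypothetical embeddings in a shared space.
image_of_beach = normalize([1.0, 0.0, 0.2])
audio_of_dog = normalize([0.0, 1.0, 0.1])
gallery = {
    "beach": normalize([1.0, 0.0, 0.0]),
    "dog indoors": normalize([0.0, 1.0, 0.0]),
    "dog on a beach": normalize([1.0, 1.0, 0.0]),
}
result = retrieve(compose(image_of_beach, audio_of_dog), gallery)
```

Because both prompts live in the same space, the summed query lands closest to the item that combines both concepts.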



4.4.2、In-context Prompting上下文提示


The in-context learning capability has been observed in many LLMs such as GPT-3 (Brown et al., 2020), which makes the model more configurable via prompting without any model parameter updates. In contrast, the in-context learning capability of vision models is still less studied. Flamingo (Alayrac et al., 2022) is one of the pioneering works that demonstrate in-context language generation for multi-modal inputs, an ability acquired by learning from interleaved image-text data. Likewise, Kosmos-1 (Huang et al., 2023b) treats visual inputs as a foreign language so that the in-context learning ability of LLMs transfers naturally to multi-modal inputs. However, both methods take multi-modal data as inputs but generate only text as outputs. As discussed earlier, vision tasks require outputs of different types beyond text. How to endow vision systems with in-context learning ability is still an open question. Below, we review some recent attempts toward this goal.





Visual prompting via inpainting is proposed in Bar et al. (2022) to teach the model to predict dense outputs, such as edges, masks, depths, etc., as shown in Figure 4.16. Given an input image $x \in \mathbb{R}^{H \times W \times 3}$ and a binary mask $m \in \{0, 1\}^{H \times W}$, an inpainting model predicts the missing region $y = f(x, m)$. The authors exploit a pre-trained VQ-GAN to encode the original image into discrete tokens and ask another ViT encoder to predict the masked regions. To make sure the model understands the visual “context” in the images, the authors collected a new dataset called the Computer Vision Figures dataset, which consists of 88k images from arXiv papers. After pre-training, the model is used to predict the content at the bottom-right corner.

Bar等人(2022)提出了通过修复进行视觉提示的方法,以教模型预测密集输出,例如边缘、掩码、深度等,如图4.16所示。给定一个输入图像x∈RH×W×3和一个二进制掩码m∈{0,1}H×W,修复模型是为了预测缺失区域y = f (x, m)。作者利用一个预训练的VQ-GAN将原始图像编码成离散的标记,并要求另一个ViT编码器预测掩码区域。为了确保模型理解图像中的视觉“上下文”,作者收集了一个名为“计算机视觉图表数据集”的新数据集,其中包含来自Arxiv论文的88,000张图像。在预训练之后,该模型用于预测底部右下角的内容。
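A minimal sketch of how such an inpainting prompt could be assembled, assuming a 2x2 grid layout: an example input and its output on the top row, the query at the bottom left, and a masked bottom-right cell for the model to fill. The grid layout, the single-channel pixels, the convention that 1 marks the missing region, and the function name are all assumptions for illustration.

```python
def build_prompt_canvas(example_in, example_out, query):
    """Tile three H x W images into a 2H x 2W canvas x and build the mask m."""
    h, w = len(query), len(query[0])
    for img in (example_in, example_out, query):
        if len(img) != h or any(len(row) != w for row in img):
            raise ValueError("all three images must share the same H x W size")
    blank = [[0] * w for _ in range(h)]
    top = [a + b for a, b in zip(example_in, example_out)]
    bottom = [q + z for q, z in zip(query, blank)]
    canvas = top + bottom
    # m[r][c] == 1 marks the region the inpainting model must predict.
    mask = [[1 if (r >= h and c >= w) else 0 for c in range(2 * w)]
            for r in range(2 * h)]
    return canvas, mask


example_in = [[1, 2], [3, 4]]    # toy grayscale "image"
example_out = [[0, 1], [1, 0]]   # its dense annotation
query = [[5, 6], [7, 8]]
x, m = build_prompt_canvas(example_in, example_out, query)
```

The pair `(x, m)` corresponds to the inputs of $y = f(x, m)$ in the text; the prediction for the query is read off the bottom-right cell.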

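Painter→SegGPT: Painter unifies different tasks by predicting continuous pixel outputs (for example, using different colors to represent different instances in segmentation); SegGPT builds on Painter and focuses on image segmentation.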

Concurrently, Painter (Wang et al., 2023i) extends a similar idea of visual in-context learning to more diverse datasets and benchmarks. Unlike Bar et al. (2022), it predicts the output in the continuous pixel space instead of as discrete tokens. For different tasks, the authors define rules to convert the output spaces into image spaces. For example, it uses different colors to represent different individual instances in the image for the segmentation task. After unifying the input and output formats, the authors use a vanilla ViT as the encoder and train it with masked image modeling (He et al., 2022a). A follow-up work called SegGPT (Wang et al., 2023j) is built on top of Painter and designed specifically for image segmentation tasks. The pre-trained model can be easily extended to exemplar-based image segmentation tasks.

与此同时,Painter(Wang等人,2023i)将视觉上下文学习的类似思想扩展到更多不同的数据集和基准。与Bar等人(2022)不同,它预测连续像素空间的输出,而不是离散符号。对于不同的任务,作者定义了将输出空间转换为图像空间的规则。例如,它使用不同的颜色来表示图像中不同的个体实例,以用于分割任务。在统一输入和输出格式之后,作者使用普通的ViT作为编码器和掩码图像建模(He等人,2022a)。后续工作SegGPT (Wang et al., 2023j)建立在Painter之上,专门为图像分割任务设计。预训练模型可以很容易地扩展到基于样本的图像分割任务。
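The rule of converting a segmentation output into image space by assigning one color per instance can be sketched as below. The palette and the nearest-color decoding step are hypothetical choices for illustration; the source only states that different colors represent different instances.

```python
# Hypothetical palette: index 0 is background, 1..3 are instance ids.
PALETTE = [(0, 0, 0), (255, 0, 0), (0, 255, 0), (0, 0, 255)]


def encode_instances(instance_map):
    """Turn an H x W map of instance ids into an H x W RGB image."""
    return [[PALETTE[i] for i in row] for row in instance_map]


def decode_instances(rgb_image):
    """Map each (possibly noisy) predicted pixel back to the nearest palette color."""
    def nearest(pixel):
        return min(
            range(len(PALETTE)),
            key=lambda i: sum((p - q) ** 2 for p, q in zip(pixel, PALETTE[i])),
        )
    return [[nearest(px) for px in row] for row in rgb_image]


instance_map = [[0, 1, 1], [2, 2, 3]]
target_image = encode_instances(instance_map)        # what the model is trained to paint
noisy_prediction = [[(10, 5, 0), (240, 10, 5), (250, 0, 20)],
                    [(3, 230, 9), (0, 255, 0), (12, 8, 199)]]
recovered = decode_instances(noisy_prediction)
```

Decoding by nearest color is what lets a continuous pixel prediction be read back as a discrete segmentation.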


Hummingbird (Balažević et al., 2023) resorts to a different method for in-context visual learning. Instead of using masked modeling, the authors propose to leverage attention across target and source images to aggregate the information. As shown in Figure 4.18, the model takes multiple input images (first row) and the corresponding semantic label maps (second row). Given a query image, it first finds the nearest-neighbor feature locations in the prompt images for the query points and then projects the same matches onto the semantic label maps so as to aggregate the label for the target query. This strategy is akin to earlier works that build classification models based on K-nearest-neighbor, but here it is applied to dense prediction tasks.
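The nearest-neighbor retrieval described above can be sketched as a per-location k-NN vote. This is a simplified illustration, not Hummingbird's implementation: features are flattened lists of toy vectors, similarity is cosine similarity, and labels are aggregated by a similarity-weighted vote, which is an assumption of this sketch.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def propagate_labels(query_feats, prompt_feats, prompt_labels, k=3):
    """Label each query location from its k most similar prompt locations."""
    predictions = []
    for q in query_feats:
        neighbours = sorted(
            ((cosine(q, p), label) for p, label in zip(prompt_feats, prompt_labels)),
            reverse=True,
        )[:k]
        votes = defaultdict(float)
        for similarity, label in neighbours:
            votes[label] += similarity  # similarity-weighted vote (assumption)
        predictions.append(max(votes, key=votes.get))
    return predictions


# Flattened prompt-image features and their per-location labels (toy values).
prompt_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
prompt_labels = ["sky", "sky", "grass", "grass"]
query_feats = [[0.8, 0.2], [0.2, 0.8]]
predicted = propagate_labels(query_feats, prompt_feats, prompt_labels, k=3)
```

No weights are updated here; changing the prompt images and label maps changes the task, which is the in-context aspect.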



In-context learning is arguably an appealing feature. On one hand, a number of works attempt to bridge vision with LLMs so as to inherit their in-context learning capability, such as Flamingo (Alayrac et al., 2022) and Kosmos-1 (Huang et al., 2023b). On the other hand, researchers resort to purely vision-based in-context learning to address vision-specific tasks such as image segmentation, depth estimation, etc. Thus far, there is no single model that can take multi-modal inputs and also predict different types of outputs in an in-context learning manner, which may be a promising future direction along this line.


4.5、Summary and Discussion总结与讨论


To conclude, an illustrative summary of the works covered in this chapter is shown in Figure 4.19. There is a clear trend in the vision community toward building open-world, unified and interactive vision models. Nevertheless, there are still some intrinsic differences between vision and language. First, vision differs from language in that it captures the physical world with raw signals. We need to develop sophisticated tokenization methods to compress the raw data into compact “tokens”. In the language domain, this can be done easily with well-established heuristic tokenizers (Sennrich et al., 2016). Second, unlike language, vision data itself is not labeled, which makes it difficult to convey information or knowledge. It always requires human labor to annotate the visual contents in either a semantic or spatial manner. Third, language data is homogeneous while vision data and tasks are heterogeneous. Last but not least, storing vision data is much more costly than storing language data. For example, GPT-3 consumes 45 TB of training data, while the ImageNet dataset, which contains only 1.3M images, already takes up hundreds of gigabytes. When it comes to video data like Howto100M (Miech et al., 2019), the storage cost already exceeds that of the training corpus for GPT-3. All these differences raise open questions that need to be addressed in the vision community, detailed below.


首先,视觉不同于语言,因为它通过原始信号捕捉物理世界。我们需要开发一些复杂的标记方法,将原始数据压缩成紧凑的“标记”。在语言领域,这可以通过使用一些完善的启发式标记器轻松完成(Sennrich et al., 2016)。






>> Computer vision in the wild. Due to the heterogeneous nature, the current vision data we use for training models can hardly cover the full picture of the physical world. Despite the effort in building open-set vision models, we are still facing significant challenges in coping with novel or long-tail scenarios.

>> Scaling law in vision. As discussed in Kaplan et al. (2020) and Hoffmann et al. (2022), the performance of large language models improves smoothly with increases in model size, data scale, and amount of compute. As the scale increases, some intriguing emergent properties are further observed in LLMs. In contrast, it is still not clear what the right path to scaling vision models is, not to mention the emergent properties of such models.

>> Vision-centric or language-centric models. Currently, the boundary between vision and language is gradually dissolving. However, due to the intrinsic differences between vision and language, it is still not clear whether we should further scale up vision models and integrate language models into them, or whether the combination of moderate vision models and LLMs is sufficient to address most (if not all) of the problems.

>> 域外计算机视觉。由于视觉数据的异质性质,我们用于训练模型的当前视觉数据几乎无法涵盖物理世界的全貌。尽管我们在构建开放世界的视觉模型方面付出了努力,但在应对新颖或长尾场景方面仍然面临重大挑战。

>> 视觉中的规模定律。如Kaplan等人(2020)和Hoffmann等人(2022)所讨论的,大型语言模型的性能随着模型规模、数据规模和计算量的增加而平稳提高。随着规模的扩大,LLMs中还观察到一些有趣的新特性。相比之下,仍然不清楚扩大视觉模型的正确路径是什么,更不用说这些模型中的新特性了。

>> 以视觉为中心或以语言为中心的模型。目前,视觉和语言之间的边界逐渐消失。然而,由于视觉和语言之间的固有差异,仍然不清楚我们是否应该进一步扩大视觉模型并集成语言模型,或者将适度的视觉模型与LLMs的组合已足以解决大多数(如果不是所有)问题。

With that being said, we are close yet still far away from an intelligent vision system that can perceive the world like humans. We hope the literature review in this chapter could provide an overall picture of the existing efforts, and inspire the pursuit of next-generation vision models.


5、Large Multimodal Models: Training with LLM大型多模型:与LLM一起训练

In this chapter, we comprehensively explore large multimodal models (Alayrac et al., 2022; OpenAI, 2023a). We begin with Section 5.1 to delve into the background of such models, with the focus on the basics of image-to-text generative models and their representative model instances in various case studies. We also discuss the state-of-the-art OpenAI Multimodal GPT-4 (OpenAI, 2023a) and identify the existing research gaps in the field. To better understand the process of instruction tuning in large language models, Section 5.2 examines its importance and its role in self-instruct and open-source LLMs. Moving forward, we explore instruction-tuned large multimodal models in Section 5.3, shedding light on their basics, significance and applications. Additionally, Section 5.4 touches upon advanced topics in the realm of multimodal models to provide a deeper understanding of the subject. Finally, we assess the current progress in the field by evaluating how close we are to achieving the OpenAI Multimodal GPT-4 in Section 5.5, a major milestone in AI research.


第5.1节开始深入探讨这些模型的背景,重点关注图像到文本生成模型的基础知识以及各种案例研究中的代表性模型实例。我们还讨论了最新的OpenAI Multimodal GPT-4(OpenAI,2023a)并确定了该领域的现有研究差距



此外,第5.4节涉及多模型模型领域的高级主题,以更深入地了解这一主题。最后,我们通过评估我们在第5.5节中是否接近实现OpenAI Multimodal GPT-4来评估该领域的当前进展,这是AI研究的一个重要里程碑。

5.1、Background 背景

5.1.1、Image-to-Text Generative Models图像到文本生成模型


LMMs in their current form are primarily image-to-text generative models, which take images as input and output a text sequence. One example is illustrated in Figure 5.1 (a) Left. All of the model variants share a very similar model architecture and training objective.

>> Model Architecture. As illustrated in Figure 5.1 (a) Right, the model typically consists of an image encoder to extract visual features, and a language model to decode the text sequence. The vision and language modalities can optionally be connected by a trainable connection module. The image encoder and language model can be either trained from scratch or initialized from pre-trained models.

>> Training Objective. As illustrated in Figure 5.1 (b), it typically employs an auto-regressive loss on the output text tokens. For the attention map in the Transformers (Vaswani et al., 2017), image tokens can attend to each other, and the current text token attends to all image tokens and the previous text tokens.


>> 模型架构。如图5.1(a)右所示,该模型通常包括一个用于提取视觉特征的图像编码器,以及一个用于解码文本序列的语言模型。视觉和语言模态可以选择性地通过可训练的连接模块连接。图像编码器和语言模型可以从头开始训练,也可以从预训练模型初始化。

>> 训练目标。如图5.1(b)所示,通常会在输出文本标记上使用自回归损失。在Transformers的注意力映射中,图像标记可以相互关注,当前文本标记会关注所有图像标记和前面的文本标记。
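The attention pattern in the training objective can be written out as a boolean mask. The sketch below assumes image tokens come first in the sequence and that each text token also attends to itself, as in a standard causal language model; `build_attention_mask` is a hypothetical helper, not code from any of the cited models.

```python
def build_attention_mask(num_image_tokens, num_text_tokens):
    """mask[q][k] is True when query position q may attend to key position k."""
    n = num_image_tokens + num_text_tokens
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if q < num_image_tokens:
                # Image tokens attend to all image tokens (bidirectional).
                mask[q][k] = k < num_image_tokens
            else:
                # Text tokens attend to every image token and to text up to themselves.
                mask[q][k] = k <= q
    return mask


mask = build_attention_mask(num_image_tokens=2, num_text_tokens=3)
# The auto-regressive loss would be computed on the text positions only (2, 3, 4 here).
text_positions = list(range(2, 5))
```

With this mask, the prediction for each text token depends on the whole image and on the text generated so far, but never on future text.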

5.1.2、Case Studies案例研究



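Flamingo connects a frozen pre-trained image encoder and a language model by adding new architectural components between them; during training, the Perceiver Sampler module reduces computational complexity and the Gated Transformer module stabilizes training. Flamingo can be adapted directly to vision tasks via simple few-shot learning, without additional task-specific tuning.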

We use some prominent LMMs as examples to illustrate how the aforementioned network architecture can be instantiated in different models, while maintaining the same auto-regressive training objective.


Case study I: LMM trained with image-text pairwise instances. Most LMMs are trained on a large number of image-text pairs, where each training sample is a pair. GIT (Wang et al., 2022a) and BLIP2 (Li et al., 2023h) are two large models that achieve state-of-the-art (SoTA) performance on many datasets. The comparisons are shown in Figure 5.2(a). GIT initializes the image encoder with the contrastively pre-trained Florence model (Yuan et al., 2021), and trains the language model from scratch. On the other hand, BLIP2 freezes the weights of a pre-trained image encoder and a pre-trained language model, and trains a lightweight Q-former module to connect the image encoder and the language model.


Case study II: LMM trained with interleaved image-text sequence instances. We use Flamingo (Alayrac et al., 2022) as an example, shown in Figure 5.2(b). It connects the frozen pre-trained image encoder and language model by adding novel architectural components in between. Specifically, the Perceiver Sampler module helps reduce computational complexity, and the Gated Transformer module helps stabilize training during the initial stage. Flamingo is trained on a mixture of complementary large-scale multimodal data coming only from the web, without using any data annotated for machine learning purposes. After this training is done, Flamingo can be directly adapted to vision tasks via simple few-shot learning without any additional task-specific tuning.

案例研究II:使用交错的图像-文本序列实例进行训练的LMM。我们以Flamingo(Alayrac等人,2022)为例,如图5.2(b)所示。它通过在冻结的预训练图像编码器和语言模型之间添加新的架构组件来连接它们。具体来说,Perceiver Sampler模块有助于降低计算复杂性,而Gated Transformer模块有助于在初始阶段稳定训练。Flamingo是在只来自网络的互补大规模多模态数据的混合数据上进行训练的,而不使用任何为机器学习目的注释的数据。完成这一训练后,Flamingo可以通过简单的少样本学习直接适应视觉任务,而无需进行任何额外的任务特定调整。

Multimodal in-context learning: given a few examples, Flamingo transfers across tasks; this striking in-context learning ability makes Flamingo the GPT-3 moment of the multimodal domain.

Besides the SoTA performance on dozens of academic benchmarks, probably the most appealing aspect of Flamingo is an emergent property: multimodal in-context learning. Specifically, given a couple of image-text pairs as examples, Flamingo can perform zero-shot task transfer to unseen problems, such as solving visual math problems. This means Flamingo can tackle a number of difficult problems with just a handful of task-specific examples, without any additional training required. For example, in Figure 5.3, two new tasks are presented to Flamingo. The top row provides two image-text pairs as the context in the prompt, where the text describes the name of the animal in the image, followed by the geographical information of the animal. Flamingo is able to understand the patterns presented in the examples, and output the corresponding information for a new image. In the bottom row, the text first shows the optical character recognition (OCR) result of the image, followed by the answer to the math problem. Flamingo follows the task instruction illustrated in the multimodal context, and outputs the correct answer for a new math problem in the third image. This intriguing in-context learning capability makes Flamingo the GPT-3 moment (Brown et al., 2020) of the multimodal domain.


5.1.3、OpenAI Multimodal GPT-4 and Research Gaps: GPT-4 introduces multimodal input, raising the question of how to carry out instruction-following and alignment research in the multimodal space


In March 2023, OpenAI released GPT-4 (OpenAI, 2023a), with impressive capability in visual understanding and reasoning. Though the model details are not revealed, there is no doubt that GPT-4 enables many new scenarios, based on the examples highlighted in the technical report. For instance, two popular visual examples are illustrated in Figure 5.4. The first one identifies the uncommon visual region and exhibits strong complex reasoning performance. The second one recognizes text in the image and captures the relationship between the image and the text. For a while, the research community had no clue how this new ability was achieved (probably because the examples are not tied to any established academic tasks/datasets), but all agreed that these are exciting results. It naturally raises a question: how can we build Multimodal GPT-4 like models?

2023年3月,OpenAI发布了GPT-4(OpenAI,2023a),具有印象深刻的视觉理解和推理能力。尽管没有透露模型的详细信息,但毫无疑问,GPT-4可以基于技术报告中突出显示的示例实现许多新场景。例如,图5.4中示例了两个流行的视觉示例。第一个识别了不寻常的视觉区域,并展示了强大的复杂推理性能。第二个识别了图像中的文本并捕捉了图像-文本之间的关系。有一段时间,研究界不知道这种新能力是如何实现的(可能是因为它们没有与任何已建立的学术任务/数据集相关联),但所有人都确定这些是令人兴奋的结果。这自然引发了一个问题:我们如何构建类似于Multimodal GPT-4的模型?

To answer it, we start by reviewing the big models from OpenAI, highlighting the most appealing properties of each model in Figure 5.5. There are several key observations: (i) GPT-2 (Radford et al., 2019) is the auto-regressive counterpart in the BERT era (Devlin et al., 2019) for the pre-train-then-finetune paradigm. Compared with GPT-2, GPT-3 (Brown et al., 2020) is a 175B model trained on a web-scale text corpus, which exhibits two emergent properties with a frozen model: in-context learning (Brown et al., 2020) and chain-of-thought (CoT) reasoning (Wei et al., 2022a). This means that, without any additional training, the model can tackle a wide range of new problems with just a few task-specific examples and by properly prompting it step by step, respectively. It further shifts the modeling paradigm from task-specific finetuning to prompting frozen models, where the latter shows higher generalizability and lower adaptation cost in task transfer. (ii) ChatGPT and InstructGPT (Ouyang et al., 2022) show the importance of instruction-following and alignment with human intents for LLMs, by finetuning the base language model GPT-3/GPT-3.5 on high-quality instruction-following data, and improving it with a reward model via reinforcement learning with human feedback. (iii) GPT-4 not only improves the language ability of previous models, but also allows visual signals as additional input for visual understanding and reasoning. We see that each newer generation maintains or improves the existing properties of the previous ones, and enables new properties.

In other words, from GPT-3 to GPT-4, we see two new properties: instruction-following and multimodal input. This reveals the gap between existing LMMs (e.g., Flamingo) and multimodal GPT-4: how to perform instruction-following and alignment research in the multimodal space, which is the focus of this chapter.


  1. GPT-2(Radford等人,2019)是BERT时代(Devlin等人,2019)的自回归对应物,用于预训练然后微调的范式。与GPT-2相比,GPT-3(Brown等人,2020)是一个训练在Web规模文本语料库上的175B模型,展示出冻结模型的两个新兴属性:多模态内上下文学习(Brown等人,2020)和思维链推理(CoT)(Wei等人,2022a)。这意味着,无需进行任何额外的训练,模型可以通过仅仅使用一些任务特定示例并逐步适应良好的提示来解决各种新问题。这进一步导致了从任务特定微调的建模范式转向提示冻结模型,后者在任务转移中具有更高的通用性和更低的适应成本。



5.2、Pre-requisite: Instruction Tuning in Large Language Models先决条件:大型语言模型中的指令调优


Note that instruction-following is a notion that originated in NLP. To study the intuition behind it and get a full picture of its history, we first revisit instruction tuning with LLMs.

Traditional language data. As a typical data instance in NLP, the sequence-to-sequence (seq2seq) representation is widely adopted for many language tasks: each data instance consists of two parts, one sequence as the input and another sequence as the output. We provide two examples in Figure 5.6 (a). Even without any task instruction specified, we know they are translation and summarization tasks, respectively.

This seq2seq representation is also the conventional data format in NLP research, where task instructions are implicit. Individual models are trained for each data domain, or sometimes one model is trained with multi-task objectives over multiple data domains without specifying the task instructions. In both cases, the models struggle to generalize to new tasks in a zero-shot fashion, as they are not trained to understand task instructions and thus cannot tell which task to perform at test time.





Instructional language data. Recently, researchers have started to explicitly add task instructions into model training, as shown in Figure 5.6 (b). Interestingly, the task instructions of most NLP tasks can be expressed in natural language as well. This leads to a new data format: instruction-input-output triplets. Based on the new format, one single model can be trained to perform multiple tasks, each with its specific instructions. Since models have observed many task instructions and many instances of each task during training, it is more natural and easier for them to generalize to new tasks by task composition at the inference stage.

For example, in the evaluation stage, a new task that requires both summarization and translation is provided in Figure 5.6 (c). Though the model has never seen this new task during training, it has seen the individual tasks the new one is composed of, and learns to perform the new task. Note that we humans are always creating new tasks in our daily life, and presumably these new tasks would never have been observed by models. It is thus appealing if a model is able to solve thousands of new tasks in the wild without training. This is partially why ChatGPT is becoming popular and prevalent so quickly.
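The difference between an implicit seq2seq pair and an instruction-input-output triplet can be shown with a small data structure. The class name, field names and prompt template below are hypothetical; the example sentences are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class InstructionExample:
    instruction: str   # the task stated in natural language
    input_text: str
    output_text: str

    def to_prompt(self) -> str:
        """Render the part the model conditions on; the output is the training target."""
        return f"Instruction: {self.instruction}\nInput: {self.input_text}\nOutput:"


def from_seq2seq(instruction, pair):
    """Make the implicit task of a seq2seq (input, output) pair explicit."""
    source, target = pair
    return InstructionExample(instruction, source, target)


translation = from_seq2seq("Translate the input into French.", ("Hello", "Bonjour"))

# An unseen task at test time, composed from two tasks seen during training.
composed = InstructionExample(
    instruction="Summarize the input, then translate the summary into French.",
    input_text="A long English article ...",
    output_text="",  # to be generated by the model
)
prompt = translation.to_prompt()
```

Because the task is now part of the input, one model can be trained on many such triplets and prompted with instructions it has never seen verbatim.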




5.2.1、Instruction Tuning: how instruction tuning enables large language models (LLMs) to follow natural-language instructions and complete real-world tasks


How can we collect a diverse set of high-quality instruction-following data? There are two general schemes. One is through human-human interaction, where humans (task providers) provide the annotation statement and requirements, based on which another group of humans complete the annotation tasks. Such a scheme is typically costly and time consuming. The other scheme is via human-machine interaction, where humans similarly provide the annotation statement and requirements, but it is now the machines/models that complete the annotation tasks.

To enable LLMs to follow natural language instructions and complete real-world tasks, researchers have been exploring methods to instruction-tune LLMs. This is implemented by either finetuning the model on a wide range of tasks using human-annotated prompts and feedback (Ouyang et al., 2022), or supervised finetuning using public benchmarks and datasets augmented with manually or automatically generated instructions (Wang et al., 2022f). Among these methods, Self-instruct tuning (Wang et al., 2022e) is a simple and effective method of aligning LLMs to human intent, by learning from instruction-following data generated by SoTA LLMs. It turns out that this line of instruction-tuning research has produced effective means of improving the zero-shot and few-shot generalization abilities of LLMs. Self-instruct leverages the in-context learning ability of LLMs. The pipeline is illustrated in Figure 5.7. Humans create a few examples (i.e., seed examples) as the context, and ask an LLM such as GPT-3 or GPT-4 to create more instructions and responses that follow the requirements stated in the prompt. The machine-generated instruction-following data can be further filtered, and selected samples are used to construct the in-context prompt for the next data generation iteration. The procedure iterates until a given number of samples has been collected. Due to the relatively lower cost and higher response speed of API calls (compared with human annotations), self-instruct is becoming more favorable in the research community.


为了使LLMs能够遵循自然语言指令并完成现实世界的任务,研究人员一直在探索指导调整LLMs的方法。这可以通过使用人工注释的提示和反馈(Ouyang等人,2022)在广泛的任务上微调模型来实现,也可以通过使用手动或自动生成的指令扩充的公共基准和数据集进行监督微调(Wang等人,2022f)来实现。在这些方法中,自我指导调整(Self-instruct tuning)(Wang等人,2022e)是一种简单有效的方法,通过学习由SoTA LLMs生成的指令遵循数据来使LLMs与人的意图一致。流程如图5.7所示。人们创建一些示例(即种子示例)作为上下文,并要求像GPT-3或GPT-4这样的LLMs创建更多的指令和响应,以遵循提示中的要求。机器生成的指令遵循数据可以进一步选择以构建在下一个数据生成迭代中的上下文学习。该过程迭代直至收集到一定数量的样本。由于与人类注释相比,API调用的成本相对较低,响应速度更快,因此自我指导调整在研究界越来越受欢迎。
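The iterative self-instruct loop can be sketched as follows. The LLM is replaced by a deterministic stand-in (`toy_generate`), since the actual API call is outside the scope of this text; the function names, the duplicate filter and the sampling of in-context examples are assumptions of this sketch, not the pipeline of Wang et al.

```python
import itertools
import random


def self_instruct(seed_examples, generate, target_size, num_in_context=2, seed=0):
    """Grow a pool of (instruction, response) pairs until target_size is reached."""
    rng = random.Random(seed)
    pool = list(seed_examples)
    while len(pool) < target_size:
        # Sample in-context examples from everything collected so far.
        context = rng.sample(pool, min(num_in_context, len(pool)))
        added = 0
        for candidate in generate(context):
            # Minimal filter: keep only samples not already in the pool.
            if candidate not in pool and len(pool) < target_size:
                pool.append(candidate)
                added += 1
        if added == 0:  # the generator produced nothing new; stop instead of looping forever
            break
    return pool


_counter = itertools.count(1)


def toy_generate(context):
    """Stand-in for an LLM call; a real system would send `context` as a few-shot prompt."""
    return [(f"{instruction} (variant {next(_counter)})", f"{response} (variant)")
            for instruction, response in context]


seeds = [("Translate 'hello' into French.", "Bonjour"),
         ("Summarize: the cat sat on the mat.", "A cat sat.")]
data = self_instruct(seeds, toy_generate, target_size=6)
```

In practice the filter would also check quality and diversity, and `generate` would wrap a call to a strong teacher model.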

5.2.2、Self-Instruct and Open-Source LLMs自我指导和开源LLMs








The open-source community has witnessed a surge of open LLMs. The success of ChatGPT (OpenAI, 2022) and GPT-4 (OpenAI, 2023a) offers tremendous opportunities to improve open-source LLMs using instruction-tuning. Figure 5.8 compares several open-source instruction-tuned LLMs. LLaMA (Touvron et al., 2023) is a series of open-sourced LLMs, which match the performance of proprietary LLMs such as GPT-3. To teach LLaMA to follow instructions, Self-instruct tuning has been quickly adopted given its superior performance and low cost. For example, to name a few early attempts in this line of research, Stanford Alpaca (Taori et al., 2023) uses 52K instruction-following samples generated by GPT-3.5, while Vicuna (Vicuna, 2023) uses around 500K high-quality instruction-following samples (150K conversations) between users and GPT (ShareGPT, 2023). To advance the SoTA of instruction-tuning for LLMs, Peng et al. (2023a) use GPT-4 as the teacher to generate the responses to the Alpaca instructions. Many follow-up works (Zhang et al., 2023i) improve the instruction-following data to give open LLMs better alignment quality in chat. For a comprehensive review, we refer the readers to a recent paper (Wang et al., 2023k), where an LLM, Tulu, is trained on a mix of several high-quality instruction datasets, and comprehensive comparisons are conducted across multiple benchmarks.

开源社区见证了开放LLMs的激增。ChatGPT(OpenAI,2022)和GPT-4(OpenAI,2023a)的成功为使用指令调优改进开源LLMs提供了巨大机会。图5.8比较了几个开源经过指令调优的LLMs。LLaMA(Touvron等人,2023)是一系列开源LLMs,可以与专有LLMs(如GPT-3)的性能相匹敌。为了教LLaMA遵循指令,Self-instruct调整方法已经被用于许多LLMs,以提高它们的指令遵循质量。

Quick assessment of LLM chatbots: open-source models now perform close to state-of-the-art proprietary models, based on the open-source LLaMA family, the Vicuna-Instructions-80 dataset, and GPT-4 as the rater.

To study the quality of LLM chatbots, we consider Vicuna-Instructions-80 (Vicuna, 2023), a dataset with 80 questions that baseline models (Touvron et al., 2023) find challenging. Besides generic instructions, the instructions fall into 8 categories: knowledge, math, Fermi, counterfactual, roleplay, coding, writing and common-sense. To quantitatively compare the performance, GPT-4 is used to rate the responses of any two given chatbots on a scale from 1 to 10, and the relative score is then computed. Surprisingly, it turns out this evaluation metric is quite consistent across different settings. The open-source LLaMA family seems to perform closely to SoTA proprietary chatbots.
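As a worked illustration of the relative score, the sketch below assumes it is the ratio of the candidate's summed ratings to the reference chatbot's summed ratings, expressed as a percentage. The text does not spell out the exact formula, so this definition and the sample ratings are assumptions.

```python
def relative_score(candidate_scores, reference_scores):
    """Percentage of the reference chatbot's total rating achieved by the candidate."""
    if len(candidate_scores) != len(reference_scores):
        raise ValueError("both chatbots must be rated on the same questions")
    for score in list(candidate_scores) + list(reference_scores):
        if not 1 <= score <= 10:
            raise ValueError("ratings are expected on a 1-10 scale")
    return 100.0 * sum(candidate_scores) / sum(reference_scores)


# Hypothetical judge ratings on three questions.
open_model = [8, 7, 9]
proprietary_model = [10, 8, 10]
score = relative_score(open_model, proprietary_model)  # 100 * 24 / 28
```

A score near 100 would mean the candidate is rated about as highly as the reference on this question set.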


Further discussions: three directions, namely data-centric AI, the debate over the gap between open-source and proprietary LLMs, and the development of base LLMs.

There are several important topics on LLMs that we have not covered in this chapter, but that are worth exploring in the future.

>> Data-centric AI. We emphasize that the development of these open-source LLMs is data-centric (Mazumder et al., 2022), rather than model-centric, and we hope the readers can align with this perspective when discussing the topic. As the training objectives and network architectures are becoming similar or even identical to GPT-like models, the key differentiating factor is data. For example, the behaviors of the aforementioned LLMs are determined by the instruction tuning data.

>> False promise? There is a debate over whether the claim that open LLMs can catch up with proprietary LLMs is a false promise (Gudibande et al., 2023). To align the discussions, we argue that there are two distinct abilities for LLMs: the instruction-following ability to know which task to perform, and massive knowledge storage to complete the task with high quality. Imitation models are good at the former, mimicking ChatGPT’s style, but perform poorly in terms of the factuality of their responses. In Gudibande et al. (2023), the authors conclude that there exists a substantial capabilities gap between open and closed LLMs that, with current methods, can only be bridged using an unwieldy amount of imitation data or by using more capable base LLMs. They also advocate that the highest-leverage action for improving open-source models is to tackle the difficult challenge of developing better base LLMs. Unfortunately, the resources to train such base LLMs are only available in a few industry labs. It seems more promising for most academic research labs to explore the opportunities in alignment research with affordable resources, or to explore techniques that reduce the compute barriers.

>> Base LLMs. Developing more capable or commercially usable LLMs is of great value. Besides LLaMA, the open-source community has developed variants of base LLMs such as LLaMA-2, OpenLLaMA (Geng and Liu, 2023), MPT (Team, 2023) and Falcon (Penedo et al., 2023), or released the training recipe (Computer, 2023).


>> 数据中心的AI。我们强调,这些开源LLMs的发展是以数据为中心的(Mazumder等人,2022),而不是以模型为中心的,因此我们希望读者在讨论这个主题时能够与这一观点保持一致。由于训练目标和网络架构正在变得相似甚至相同于类似GPT的模型,因此关键的区别因素是数据。例如,上述LLMs的行为由指令调优数据决定。

>> 虚假承诺?有人辩称,开源LLMs可以赶上专有LLMs是虚假的承诺(Gudibande等人,2023)。为了使讨论保持一致,我们认为LLMs有两个明显的能力:遵循指令的能力,知道执行哪个任务,以及大规模知识存储的能力,以高质量完成任务。模仿模型在前者方面表现出色,通过模仿ChatGPT的风格,但在响应的真实性方面表现不佳。在Gudibande等人(2023)中,作者得出结论,开源和封闭LLMs之间存在实质性的能力差距,目前的方法只能通过大量的模仿数据或使用更强大的基础LLMs来弥合这一差距。他们还主张,改进开源模型的最有价值的举措是解决开发更好的基础LLMs的困难挑战。然而,不幸的是,用于训练此类基础LLMs的资源只在少数行业实验室中可用。对于大多数学术研究实验室来说,探索使用可负担得起的资源进行对齐研究或探索降低计算壁垒的技术的机会似乎更有希望。

>> 基础LLMs。开发更强大或商业可用的LLMs具有巨大的价值。除了LLaMA之外,开源社区还开发了基础LLMs的变种,如LLaMA-2、OpenLLaMA(Geng和Liu,2023)、MPT(Team,2023)和Falcon(Penedo等人,2023),或者发布了训练配方(Computer,2023)。

5.3、Instruction-Tuned Large Multimodal Models指导调整的大型多模态模型


In this section, we illustrate how to build a minimum prototype of multimodal GPT-4 with open-source resources. Specifically, we use LLaVA (Liu et al., 2023c) as the running example; a similar idea is also proposed in the concurrent work MiniGPT-4 (Zhu et al., 2023a).

Research in the multimodal space has often been inspired by the latest advances in NLP in recent years. One successful recipe is to explore what would happen if the most intriguing and successful NLP ideas, for example self-instruct, were borrowed for the vision-and-language community. However, the unique challenge with self-instruct in multimodal research is that there is no strong multimodal teacher publicly available. Therefore, the research question becomes: how can we use language models such as language-only GPT-4 to create multimodal instruction-following data?



Data Creation数据创建:提高模型的多模态能力=将图像转换为符号序列表示+采用图像的标题和边界框信息+三种类型的指令遵循数据

Instead of directly feeding images into OpenAI GPT-4, we use their symbolic sequence representa- tions shown in Figure 5.9 (a). In LLaVA, both captions and bounding boxes are considered, due to the following reasons: (i) it is empirically found that GPT-4 can understand both well, in contrast to the poor performance of ChatGPT in understanding bounding box coordinates. (ii) They are often complementary to each other and hence can represent the image as informative as possible.

As shown in Figure 5.9 (b), three types of instruction-following data are considered: (i) multi-turn conversations so that users can chat with the model; (ii) detailed descriptions so that long-form responses can be generated from the model; and (iii) complex reasoning, which is more about the implication of the image than the image content. For example, answering “what challenge do these people face?” requires first recognizing that the image shows an SUV in a parking area with quite a few pieces of luggage placed on the ground, and then inferring that the challenge is how to pack the luggage into the SUV given the tight space of the trunk. In total, 158K samples are collected over the three types. To summarize, the spirit is that whatever tasks one wants the model to perform in the serving stage, it is important to create the corresponding instruction-following data for training.

与直接将图像馈送到OpenAI GPT-4不同,我们使用它们在图5.9(a)中显示的符号序列表示。在LLaVA中,考虑了标题和边界框,原因如下:(i)经验表明GPT-4可以很好地理解两者,而ChatGPT在理解边界框坐标方面性能较差。 (ii)它们通常互补,并且可以尽可能详细地表示图像。
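The data-creation idea above (describe the image to a language-only model through captions and bounding boxes, then ask for one of three response types) can be sketched as follows. This is a minimal illustration, not the LLaVA prompt: the text layout, the `build_symbolic_context` and `build_request` helpers, and the wording of each task request are all assumptions made for the example.

```python
from typing import List, Tuple

# (label, x1, y1, x2, y2), coordinates assumed to be normalized to [0, 1]
Box = Tuple[str, float, float, float, float]


def build_symbolic_context(captions: List[str], boxes: List[Box]) -> str:
    """Render captions and bounding boxes as plain text for a language-only model."""
    lines = ["Captions:"]
    lines += [f"- {c}" for c in captions]
    lines.append("Objects (label: x1, y1, x2, y2):")
    for label, x1, y1, x2, y2 in boxes:
        lines.append(f"- {label}: {x1:.3f}, {y1:.3f}, {x2:.3f}, {y2:.3f}")
    return "\n".join(lines)


# The three response types named in the text; the request wording is illustrative.
RESPONSE_TYPES = {
    "conversation": "Write a multi-turn Q&A between a user and an assistant about the image.",
    "detailed_description": "Write a detailed description of the image.",
    "complex_reasoning": "Ask and answer a question that requires reasoning beyond the visible content.",
}


def build_request(captions: List[str], boxes: List[Box], response_type: str) -> str:
    """Combine the symbolic image context with the requested response type."""
    context = build_symbolic_context(captions, boxes)
    return f"{context}\n\nTask: {RESPONSE_TYPES[response_type]}"


example = build_request(
    captions=["A group of people standing next to a black SUV with luggage."],
    boxes=[("person", 0.681, 0.242, 0.774, 0.694), ("suitcase", 0.432, 0.694, 0.574, 0.916)],
    response_type="complex_reasoning",
)
print(example)
```

The resulting string would be sent to the text-only teacher model; the image itself never enters the prompt.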


Network Architecture and Training网络架构和训练:LLaVA的网络架构是一个通用的图像到文本生成模型,通过将预训练的CLIP ViT-L/14视觉编码器和大型语言模型Vicuna连接起来,并采用两阶段指令调优过程进行训练(特征对齐的预训练+端到端微调)

As illustrated in Figure 5.10, the network architecture of LLaVA is an instantiation of the general image-to-text generative model framework introduced in Figure 5.1 of Section 5.1. Specifically, LLaVA connects the pre-trained CLIP ViT-L/14 visual encoder (Radford et al., 2021) and the large language model Vicuna (Vicuna, 2023) via a simple projection matrix (i.e., a linear projection layer). A two-stage instruction-tuning procedure is adopted to train the model. (i) Stage 1: pre-training for feature alignment. Only the projection matrix is updated, based on a subset of CC3M (Changpinyo et al., 2021). (ii) Stage 2: finetuning end-to-end. Both the projection matrix and the LLM are updated on the proposed multimodal instruction-following data for daily user-oriented applications.

如图5.10所示,LLaVA的网络架构是图5.1中介绍的通用图像到文本生成模型框架的一个实例,该框架在第5.1节中介绍。具体来说,LLaVa通过一个简单的投影矩阵(即线性投影层)连接了经过预训练的CLIP ViT-L/14视觉编码器(Radford等人,2021)和大型语言模型Vicuna(Vicuna,2023)。采用了两阶段指导调整过程来训练模型。
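To make the architecture description concrete, here is a toy sketch of the two pieces it names: a single linear projection that maps visual features into the LLM embedding space, and the per-stage choice of which parameter groups are updated. It uses pure Python lists and tiny made-up dimensions; the function names and the `TRAINABLE` table are assumptions for illustration, not the LLaVA codebase.

```python
import random
from typing import Dict, List

Vector = List[float]


def project(visual_features: List[Vector], weight: List[Vector]) -> List[Vector]:
    """Map each visual feature (dim d_v) into the LLM embedding space (dim d_llm).

    `weight` has shape d_v x d_llm; this is the single linear projection layer.
    """
    d_llm = len(weight[0])
    return [
        [sum(f[i] * weight[i][j] for i in range(len(f))) for j in range(d_llm)]
        for f in visual_features
    ]


def build_llm_input(visual_tokens: List[Vector], text_embeddings: List[Vector]) -> List[Vector]:
    """Prepend projected visual tokens to the text token embeddings."""
    return visual_tokens + text_embeddings


# Which parameter groups receive updates in each stage, as read from the text above.
TRAINABLE: Dict[str, Dict[str, bool]] = {
    "stage1_feature_alignment": {"vision_encoder": False, "projection": True, "llm": False},
    "stage2_end_to_end": {"vision_encoder": False, "projection": True, "llm": True},
}

random.seed(0)
d_v, d_llm, n_patches, n_text = 4, 6, 3, 2  # toy sizes, not the real model dimensions
W = [[random.uniform(-1, 1) for _ in range(d_llm)] for _ in range(d_v)]
patches = [[random.uniform(-1, 1) for _ in range(d_v)] for _ in range(n_patches)]
text = [[0.0] * d_llm for _ in range(n_text)]

sequence = build_llm_input(project(patches, W), text)
print(len(sequence), len(sequence[0]))  # 5 tokens, each of dimension 6
```

In a real framework the `TRAINABLE` table would translate into freezing or unfreezing the corresponding parameter groups before each stage.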




Visual chat: towards building multimodal GPT-4 level chatbot. LLaVA is finetuned on the generated multimodal instruction-following data, which contains a diverse set of task instructions and responses for daily user-oriented applications. It is empirically observed that finetuning only the linear projection layer is sufficient for the chat demo/scenarios, though it requires longer training time. To evaluate the model performance, an evaluation dataset named LLaVA-Bench is constructed, with two subsets: (i) LLaVA-Bench (COCO): 30 unseen COCO images with 90 new language-image instructions; (ii) LLaVA-Bench (In-the-Wild): 24 images with 60 questions. Each image can be associated with three types of instructions: conversation, detailed description and complex reasoning. The ground-truth answers are collected by manually re-writing GPT-4 output. We test LLaVA and use language-only GPT-4 to rate its responses on a scale from 1 to 10. Overall, LLaVA achieves an 85.1% relative score compared with the ground truth on LLaVA-Bench (COCO), and 73.5% on LLaVA-Bench (In-the-Wild). On the latter dataset, Google Bard (July 19, 2023) and Microsoft BingChat (June 29, 2023) achieve 77.8% and 71.5%, respectively. This indicates the effectiveness of the proposed self-instruct method in multimodal settings. One example is shown in Table 5.1.

Science QA: New SoTA with the synergy of LLaVA with GPT-4. LLaVA is finetuned on a multimodal reasoning dataset in the science domain (Lu et al., 2022b). LLaVA alone achieves 90.92% accuracy. We further explore using language-only GPT-4 as the judge, to predict the final answer based on its own previous answers and the LLaVA answers. This “GPT-4 as judge” scheme yields a new SoTA of 92.53%.

OCR in the wild: An emerging property. LLaVA has never been explicitly trained on OCR data, i.e., images that contain scene text described in the corresponding caption. Surprisingly, the model shows strong zero-shot OCR task transfer ability in the wild.

视觉对话:朝着构建多模态GPT-4级聊天机器人。LLaVA在生成的多模态指导遵循数据上进行了微调,其中包含了各种各样的面向日常用户应用的任务指导和响应。经验观察表明,仅微调线性投影层对于聊天演示/场景已经足够,尽管需要更长的训练时间。为了评估模型性能,构建了一个名为LLaVA-Bench的评估数据集,包括两个子集:(i)LLaVA-Bench(COCO):30个未见过的COCO图像,带有90个新的语言-图像指令;(ii)LLaVA-Bench(In-the-Wild):24个图像,带有60个问题。每个图像可以与三种类型的指令相关联:对话、详细说明和复杂推理。地面真实答案是通过手动重写GPT-4输出来收集的。我们测试了LLaVA,并使用仅语言的GPT-4来为其响应评分,从1到10。总体而言,LLaVA在LLaVA-Bench(COCO)上相对分数达到了85.1%,在LLaVA-Bench(In-the-Wild)上达到了73.5%。在后一个数据集上,Google Bard(2023年7月19日)和Microsoft BingChat(2023年6月29日)分别达到了77.8%和71.5%,这表明了在多模态环境中提出的自我指导方法的有效性。表5.1中显示了一个示例。
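The relative score reported for LLaVA-Bench compares judge ratings of the model's answers against ratings of the reference answers. A minimal sketch of one way to compute it is shown below. The aggregation (ratio of summed ratings) is an assumption made for illustration, and the ratings are invented toy values, not benchmark numbers.

```python
from typing import List, Tuple


def relative_score(ratings: List[Tuple[float, float]]) -> float:
    """Percentage of the reference's total rating achieved by the candidate.

    Each item is (candidate_rating, reference_rating), both on the 1-10
    scale assigned by the judge model.
    """
    candidate_total = sum(c for c, _ in ratings)
    reference_total = sum(r for _, r in ratings)
    if reference_total == 0:
        raise ValueError("reference ratings must not sum to zero")
    return 100.0 * candidate_total / reference_total


# Made-up ratings for three questions; these are not numbers from the benchmark.
toy = [(8, 9), (6, 8), (7, 8)]
print(f"{relative_score(toy):.1f}%")  # 21 / 25 -> 84.0%
```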



5.4、Advanced Topics高级主题


The history of recent instruction-tuned LMMs is illustrated in Figure 5.11 (a). Due to the popularity of ChatGPT and GPT-4, instruction-tuned LMMs emerged as a line of research in the three months after GPT-4 was proposed. Alpaca (Taori et al., 2023) and Vicuna (Vicuna, 2023) were proposed in March to make LLaMA more instruction-following in the language domain. Within two weeks, MiniGPT-4 (Zhu et al., 2023a) and LLaVA (Liu et al., 2023c) were proposed to enable Vicuna to see and chat about the visual world. Within ten days, LLaMA-Adapter v2 (Gao et al., 2023b) and mPlug-OWL (Ye et al., 2023b) started to compare performance with MiniGPT-4/LLaVA, indicating the beginning of model evolution. The data points in April are relatively sparse. In May, a large number of LMM papers appeared on arXiv, improving this line of research in many different aspects. The momentum was still going in June.

It is easy for readers to lose track of all the recent papers, and the same is true for our literature review. To better organize the literature, we group the papers by specific research topics, shown in Figure 5.11 (b). The early LMMs with billions of parameters include GPT-4 (OpenAI, 2023a), Flamingo (Alayrac et al., 2022), PaLM-E (Driess et al., 2023) and KOSMOS-1 (Huang et al., 2023b). In contrast to these proprietary LMMs, LLaVA and MiniGPT-4 open up opportunities to build LMMs with open-source resources. We discuss several topics below, in addition to the extensions of RLHF (Gunjal et al., 2023), dense prediction (Wang et al., 2023h; Zang et al., 2023; Chen et al., 2023d), video (Zhang et al., 2023f; Luo et al., 2023c; Li et al., 2023i), image generation (Koh et al., 2023) and embodied agents (Mu et al., 2023).

近期指导调整的大型多模态模型的历史在图5.11(a)中有所说明。由于ChatGPT和GPT-4的流行,指导调整的大型多模态模型在GPT-4提出后的过去三个月中成为一个新兴的研究领域。在3月份,Alpaca(Taori等人,2023)和Vicuna(Vicuna,2023)提出了使LLaMA在语言领域更符合指导的方法。两周后,MiniGPT-4(Zhu等人,2023a)和LLaVA(Liu等人,2023c)提出了使Vicuna能够看到并与视觉世界进行交互的方法。十天后,LLaMA-Adapter v2(Gao等人,2023b)和mPlug-OWL(Ye等人,2023b)开始与MiniGPT-4/LLaVA进行性能比较,标志着模型演进的开始。4月份的数据点相对稀疏。到了5月,arXiv上出现了大量关于LMM的论文,从许多不同的方面改进了这一研究领域。这个势头一直持续到了6月。


More Modalities (Beyond VL)更多的模态(超越VL):近期研究致力于将多模态语言模型框架扩展到包括更多的感知模态,如声音、图像、视频等,进一步拓展了多模态语言理解和生成的研究领域。

While LMM extends LLM by adding the vision modality, it is natural to further extend the framework to include more modalities beyond vision and language. Following this spirit, several attempts have been made, including ChatBridge (Zhao et al., 2023e), PandaGPT (Su et al., 2023), SpeechGPT (Zhang et al., 2023d) and X-LLM (Chen et al., 2023c). PandaGPT leverages ImageBind to add more modalities into LMMs. The ImageBind model (Girdhar et al., 2023) learns a single, shared representation space for text, image/video, audio and sensors that record depth (3D), thermal (infrared radiation), or inertial measurement units (IMU), which calculate motion and position. ImageBind provides a holistic understanding of the visual world that connects objects in a photo with how they will sound, their 3D shape, how warm or cold they are, and how they move. By training a projection layer for one modality in LMM, the model can zero-shot transfer to infer over other modalities, thanks to the shared multimodal embedding space. Another representative model is SpeechGPT, where language and speech modalities are enabled for both inputs and outputs. Despite the rich model variations, the idea of connecting diverse modalities is similar to how LMM adds images into LLMs. NExT-GPT (Wu et al., 2023c) connects an LLM with multimodal adaptors and different diffusion decoders, enabling NExT-GPT to perceive inputs and generate outputs in arbitrary combinations of text, images, videos, and audio. The LMM framework has also been successfully extended to speech (Zhao et al., 2023c), 3D (Wang et al., 2023l; Hong et al., 2023), and point cloud (Xu et al., 2023c).


Improving Visual Instruction Data Quantity and Quality改进视觉指导数据的数量和质量

Given the convergence of model architectures to GPT-like networks, the performance of an LMM is primarily determined by its training data. Therefore, it is crucial to improve the quantity and quality of visual instruction tuning data. SVIT (Zhao et al., 2023a) follows the same data generation pipeline as LLaVA, but further includes region descriptions to prompt GPT-4, in addition to the caption and box data shown in Figure 5.9 (a). The data is scaled up to 3.2 million samples, which is 20 times larger than the data used in LLaVA.

Unlike existing studies that primarily focus on positive instruction samples, LRV-Instruction (Liu et al., 2023a) includes both positive and negative instructions for more robust instruction tuning. Other examples along this line include LLaVAR (Zhang et al., 2023o), which adds OCR-related instruction-tuning data for text-rich image understanding, and StableLLaVA (Li et al., 2023o), which considers model-synthesized images for image-dialogue data. Polite Flamingo (Chen et al., 2023b) trains an LLM to re-write the instruct data. Instead of leveraging GPT-4 for data generation, VIGC (Wang et al., 2023a) uses an LMM to generate instruction-tuning data and progressively enhances its quality on the fly. Similar to the “less is more” observation in LIMA (Zhou et al., 2023a) from the NLP domain, InstructionGPT-4 shows that the quality of the instruction-tuning data is more important than its quantity: it finetunes a better version of MiniGPT-4 with 200 high-quality samples (6%) selected from the 3500 samples used in the original MiniGPT-4.


不同于现有研究主要关注正面指导样本的研究,LRV-Instruction(Liu等人,2023a)包括正面和负面指导,以实现更强大的指导调整。其他沿这一方向的示例包括LLaVAR(Zhang等人,2023o),它为文本丰富的图像理解添加了与OCR相关的指导调整数据,以及StableLLaVA(Li等人,2023o),它考虑了模型合成的图像用于图像对话数据。Polite Flamingo(Chen等人,2023b)训练LLM以重新编写指导数据。与利用GPT-4进行数据生成不同,VIGC(Wang等人,2023a)考虑了利用LMM生成指导调整数据并在运行时逐步提高其质量。与NLP领域的LIMA中的“少即是多”的观察类似,InstructionGPT-4表明指导调整数据的质量比其数量更重要,他们使用了200个高质量样本(6%)来微调更好的MiniGPT-4版本,这些样本从原始的MiniGPT-4中选择而来(共有3500个样本)。
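As a quick sanity check on the two size comparisons quoted in this subsection, the snippet below recomputes them from the figures given in the text (158K LLaVA samples, 3.2 million SVIT samples, and 200 out of 3500 MiniGPT-4 samples). The variable names are only for illustration.

```python
llava_samples = 158_000    # "158K samples" from the data-creation paragraph
svit_samples = 3_200_000   # "3.2 million" in SVIT
ratio = svit_samples / llava_samples
print(round(ratio, 1))     # 20.3, i.e. roughly 20 times larger

selected, pool = 200, 3500  # InstructionGPT-4 subset vs. original MiniGPT-4 set
share = 100 * selected / pool
print(round(share, 1))      # 5.7, which the text rounds to 6%
```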

Multitask Instruct with Established Academic Datasets/Tasks利用已有学术数据集/任务的多任务指令:指令调优可以通过两种不同的方式实现=使用人工注释的提示和反馈在多样化任务上微调模型+使用公共基准和数据集进行监督微调

As discussed earlier in Section 5.2, instruction tuning in the language domain is implemented in two different ways: finetuning the model on a wide range of tasks using human-annotated prompts and feedback (Ouyang et al., 2022), or supervised finetuning using public benchmarks and datasets augmented with manually or automatically generated instructions (Wang et al., 2022f). The former is good at user-oriented daily life tasks, and the latter is good at achieving decent performance on established benchmarks. LLaVA and MiniGPT-4 fall into the former class. Several other works either target the latter class or combine both classes, including MultiInstruct (Xu et al., 2022b), mPlug-OWL (Ye et al., 2023b), InstructBLIP (Dai et al., 2023b), Multimodal-GPT (Gong et al., 2023), Instruction-ViT (Xiao et al., 2023) and Qwen-VL (Bai et al., 2023a).

For example, MultiInstruct is an early attempt, predating open-source LLaMA, at instruction tuning with multimodal datasets. InstructBLIP is a recent work that combines chat and benchmark instruction-following data. As shown in Figure 5.12, InstructBLIP transforms 26 publicly available datasets, covering a wide variety of tasks and capabilities, into instruction tuning format. Trained on 13 held-in datasets, InstructBLIP attains SoTA zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and larger Flamingo models. Qwen-VL scales up both image-text pair data for pre-training and academic datasets for multi-task pre-training, and achieves excellent performance on many tasks.
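Transforming an academic dataset into instruction-tuning format usually means wrapping each annotated sample in a natural-language instruction template. The sketch below shows the general idea under stated assumptions: the `TEMPLATES` strings, the sample field names (`image`, `question`, `target`) and the `to_instruction` helper are hypothetical and are not taken from InstructBLIP or any other system named above.

```python
import random
from typing import Dict, List

# Hypothetical templates; real projects typically write many per task.
TEMPLATES: Dict[str, List[str]] = {
    "vqa": [
        "Question: {question} Short answer:",
        "Look at the image and answer briefly: {question}",
    ],
    "captioning": [
        "Describe the image in one sentence.",
        "Write a short caption for this picture.",
    ],
}


def to_instruction(task: str, sample: Dict[str, str], rng: random.Random) -> Dict[str, str]:
    """Wrap one annotated sample as an (instruction, response) pair."""
    template = rng.choice(TEMPLATES[task])
    return {
        "image": sample["image"],
        # str.format ignores unused keyword arguments, so extra fields are harmless.
        "instruction": template.format(**sample),
        "response": sample["target"],
    }


rng = random.Random(0)
vqa_sample = {"image": "img_001.jpg", "question": "What color is the bus?", "target": "red"}
print(to_instruction("vqa", vqa_sample, rng))
```

Sampling a template at random per example is one simple way to avoid overfitting to a single phrasing of the instruction.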


Multimodal In-Context-Learning多模态上下文学习

Similar to the behavior of LLMs, which can address a language task by processing examples of the task in their text prompt, multimodal in-context learning refers to a visual and text interface that can steer the model towards solving a multimodal task. Given a few example pairs of visual inputs and expected text responses composed in the multimodal prompt, the model can be queried with a question about a new image or video, and then generate an answer. The direction of extending in-context learning from language to multiple modalities has been explored in works including OpenFlamingo (Awadalla et al., 2023), Otter (Li et al., 2023d), M3IT (Li et al., 2023j), MetaVL (Monajatipoor et al., 2023) and Sparkles (Huang et al., 2023d). OpenFlamingo (Awadalla et al., 2023) is an open-source version of DeepMind’s Flamingo model, trained on the Multimodal C4 dataset (Zhu et al., 2023b), which is a billion-scale corpus of interleaved image-text data. To explicitly enhance the multimodal in-context learning ability of LMMs, the MIMIC-IT dataset (Li et al., 2023c) is constructed, which consists of 2.4M multimodal instruction instances with in-context examples. By tuning OpenFlamingo on MIMIC-IT, a new model, Otter, is obtained with a stronger instruction-following ability. Using two image-text pairs as the context, Otter learns the concise answer style demonstrated by the examples; without them, a tedious response is generated.

与LLMs的行为类似(LLMs可以通过处理文本提示中的任务示例来解决语言任务),多模态上下文学习是指一种视觉和文本接口,可以引导模型解决多模态任务。给定由若干视觉输入和预期文本响应示例对组成的多模态提示,可以向模型提问关于新图像或视频的问题,然后生成答案。从语言到多模态的上下文学习扩展方向已经被探索,包括OpenFlamingo(Awadalla等人,2023)、Otter(Li等人,2023d)、M3IT(Li等人,2023j)、MetaVL(Monajatipoor等人,2023)和Sparkles(Huang等人,2023d)。OpenFlamingo(Awadalla等人,2023)是DeepMind的Flamingo模型的开源版本,在Multimodal C4数据集(Zhu等人,2023b)上训练,这是一个十亿级规模的图文交错数据语料库。为了明确提高LMMs的多模态上下文学习能力,构建了MIMIC-IT(Li等人,2023c)数据集,其中包含240万个带有上下文示例的多模态指令实例。通过在MIMIC-IT上微调OpenFlamingo,得到了一个具有更强指令遵循能力的新模型Otter。使用两个图像-文本对作为上下文时,Otter学会了示例中展示的简洁答案风格,否则将生成冗长的响应。
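A multimodal in-context prompt interleaves a few (image, answer) demonstrations with the final query. The sketch below builds such a prompt as text plus an ordered image list. The placeholder strings `IMAGE_TOKEN` and `END_OF_CHUNK` and the "Question/Answer" layout are assumptions chosen for illustration; an actual model defines its own special tokens and prompt format.

```python
from typing import List, Tuple

IMAGE_TOKEN = "<image>"           # assumed placeholder marking where image features go
END_OF_CHUNK = "<|endofchunk|>"   # assumed separator between demonstrations


def build_icl_prompt(
    examples: List[Tuple[str, str]], query_image: str, question: str
) -> Tuple[str, List[str]]:
    """Interleave (image, answer) demonstrations, then append the query.

    Returns the text prompt and the ordered list of images it refers to.
    """
    parts: List[str] = []
    images: List[str] = []
    for image, answer in examples:
        images.append(image)
        parts.append(f"{IMAGE_TOKEN}Question: {question} Answer: {answer}{END_OF_CHUNK}")
    images.append(query_image)
    parts.append(f"{IMAGE_TOKEN}Question: {question} Answer:")  # left open for the model
    return "".join(parts), images


prompt, images = build_icl_prompt(
    examples=[("cat.jpg", "A cat."), ("dog.jpg", "A dog.")],
    query_image="bird.jpg",
    question="What is in the image?",
)
print(prompt)
print(images)
```

With two short demonstrations as context, the open-ended final `Answer:` nudges the model toward the same concise style, which is the behavior the text describes for Otter.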

Parameter-Efficient Training参数高效训练:精细调整成本过高→参数高效训练和模型量化是减小内存占用的有效方法

While finetuning very large models often leads to high performance, it is prohibitively expensive. For example, regular 16-bit finetuning of a LLaMA-65B model (Touvron et al., 2023) requires more than 780 GB of GPU memory (Dettmers et al., 2023). Therefore, it is critical to reduce the memory footprint of LLMs/LMMs, especially when it comes to improving the accessibility of large models to a wider community.

Parameter-efficient training is an effective approach for LMM adaptation. It freezes most of the model parameters, and only allows a small fraction of trainable parameters to be updated with domain-specific data. For example, LLaMA-Adapter v2 (Gao et al., 2023b) and LAVIN (Luo et al., 2023a) have only 14M and 3.8M trainable parameters, respectively, compared with 7B/13B LLM parameters. Another efficient training method is quantization. The recent QLoRA (Dettmers et al., 2023) finetunes 65B LLaMA for 24 hours on a single GPU, achieving 99.3% of the performance level of ChatGPT. Since instruction tuning typically involves a small amount of data, parameter-efficient training or model quantization becomes the practical approach, especially with limited GPU resources. Both LoRA (Hu et al., 2021) and QLoRA are supported in the LLaVA codebase to allow LMM training with fewer GPUs. It is empirically shown in Lu et al. (2023d) that LoRA/QLoRA can achieve performance similar to full-model tuning when scaling LLaVA to 33B and 65B, when training with around 150K instruct data and evaluating with LLaVA-Bench.

尽管对非常大的模型进行微调通常会带来高性能,但其成本高得令人望而却步;例如,对LLaMA-65B模型(Touvron等人,2023)进行常规的16位微调需要超过780 GB的GPU内存(Dettmers等人,2023)。因此,降低LLMs/LMMs的内存占用至关重要,尤其是在提高大型模型对更广泛社区的可访问性方面。参数高效训练是一种有效的LMM适配方法。它冻结大部分模型参数,只允许一小部分可训练参数随领域特定数据更新。例如,LLaMA-Adapter v2(Gao等人,2023b)和LAVIN(Luo等人,2023a)分别只有1400万和380万可训练参数,而LLM参数为70亿/130亿。另一种高效的训练方法是量化。最近的QLoRA(Dettmers等人,2023)在单个GPU上对65B LLaMA进行了24小时的微调,达到了ChatGPT性能水平的99.3%。由于指令微调通常只涉及少量数据,因此在GPU资源有限的情况下,参数高效训练或模型量化是切实可行的方法。LLaVA代码库同时支持LoRA(Hu等人,2021)和QLoRA,允许以更少的GPU进行LMM训练。Lu等人(2023d)的实验表明,在将LLaVA扩展到33B和65B、使用约15万条指令数据训练并用LLaVA-Bench评估时,LoRA/QLoRA可以达到与全模型微调相近的性能。
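The memory figure and the appeal of low-rank adapters can both be checked with back-of-the-envelope arithmetic. The sketch below assumes a common per-parameter breakdown for 16-bit finetuning with Adam (2 bytes of weights, 2 bytes of gradients, 8 bytes of optimizer state, activations ignored); this breakdown is an assumption that happens to reproduce the 780 GB figure, not something stated in the text. The LoRA count uses the standard rank-r factorization of a weight update; the 8192 x 8192 matrix and rank 16 are toy values.

```python
def full_finetune_memory_gb(n_params: float, bytes_per_param: int = 12) -> float:
    """Rough memory for full finetuning, ignoring activations.

    Assumed breakdown per parameter: 2 bytes for 16-bit weights, 2 bytes for
    16-bit gradients, and 8 bytes for two 32-bit Adam moment estimates.
    """
    return n_params * bytes_per_param / 1e9


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA learns the update of a d_out x d_in matrix as B (d_out x r) @ A (r x d_in)."""
    return rank * (d_in + d_out)


print(full_finetune_memory_gb(65e9))  # 780.0

full = 8192 * 8192  # one square weight matrix (toy size)
lora = lora_trainable_params(8192, 8192, rank=16)
print(lora, f"{100 * lora / full:.2f}% of the full matrix")  # 262144 0.39% of the full matrix
```

The second result illustrates why adapters with only millions of trainable parameters are feasible next to a frozen multi-billion-parameter LLM.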


While LMMs have shown excellent visual recognition and reasoning in an open-set manner with free-form text across many scenarios, the evaluation of LMMs is becoming an urgent and challenging problem. Several related benchmarks have been developed to evaluate various aspects of LMMs, ranging from specific abilities, including OCR (Liu et al., 2023k), hallucination (POPE (Li et al., 2023l) and HaELM (Wang et al., 2023d)) and adversarial robustness (Zhao et al., 2023d), to comprehensive evaluation such as LAMM (Yin et al., 2023) and LVLM-eHub (Xu et al., 2023b). We summarize the LMM evaluation benchmarks in Table 5.2. Among them, LLaVA-Bench is the first attempt to design an open-world visual chat benchmark specifically for LMMs. Recently, early multimodal experiments have been conducted to compare open-source LMMs with commercial ones such as BingChat and Bard on LLaVA-Bench (Liu et al., 2023c) and LVLM-eHub (Shao et al., 2023).

It is surprising that LMMs show strong zero-shot OCR performance in the wild, without being explicitly trained on text recognition data. To shed light on the hidden mystery of OCR in LMMs, a comprehensive empirical study is conducted in Liu et al. (2023k) to compare open-source LMMs on 24 academic text recognition datasets, shown in Figure 5.13. Three observations are highlighted: (i) LLaVA outperforms MiniGPT-4 on 21 out of 24 datasets, even though the training data in LLaVA is an order of magnitude smaller. (ii) Training with significantly more data leads to higher OCR performance, as demonstrated by BLIP2 (Li et al., 2023h) and mPLUG-Owl. (iii) In most cases, supervised SoTA results significantly outperform zero-shot LMMs. However, it is worth noting that on the WordArt dataset (Xie et al., 2022a), which primarily features challenging artistic text, BLIP2 surpasses supervised SoTA. This reveals the potential of LMMs in recognizing more complex text types.


令人惊讶的是,LMMs在真实场景中表现出强大的零样本OCR性能,而没有明确在文本识别数据上进行训练。为了揭示LMMs中OCR的奥秘,Liu等人(2023k)进行了一项综合的实证研究,在24个学术文本识别数据集上比较了开源LMMs,如图5.13所示。重点有三个观察结果:(i)尽管LLaVA的训练数据规模比MiniGPT-4小一个数量级,但LLaVA在24个数据集中的21个上优于MiniGPT-4。(ii)使用明显更多的训练数据会带来更高的OCR性能,正如BLIP2(Li等人,2023h)和mPLUG-Owl所示。(iii)在大多数情况下,监督SoTA的结果明显优于零样本LMM。然而,值得注意的是,在主要包含具有挑战性的艺术文本的WordArt数据集(Xie等人,2022a)上,BLIP2超越了监督SoTA,这显示了LMM在识别更复杂的文本类型方面的潜力。


The success of ChatGPT/GPT-4 in the general domain has inspired interest in building assistants in vertical domains such as medicine, gaming and education. Such domain-specific assistants can have several advantages over their general-domain counterparts: (i) training with high-quality domain-specific data makes the assistants more helpful; (ii) the model size can be smaller, with a lower serving cost; and (iii) sensitive user prompt data can be kept internal by serving the model locally, avoiding privacy issues.


To improve the text recognition ability of LMMs, OCR-specific models have been developed, including BLIVA (Hu et al., 2023), LLaVAR (Zhang et al., 2023o) and mPLUG-DocOwl (Ye et al., 2023a). LMMs have recently been explored in the biomedical domain (Sun et al., 2023c; Zhang et al., 2023m; Li et al., 2023e), where conversational generative AI has demonstrated remarkable promise for empowering biomedical practitioners. LLaVA-Med (Li et al., 2023e) is a cost-efficient approach for training a vision-language conversational assistant that can answer open-ended research questions about biomedical images. The key idea is to leverage a large-scale, broad-coverage biomedical figure-caption dataset extracted from PubMed Central, use GPT-4 to self-instruct open-ended instruction-following data from the captions, and then finetune a large general-domain vision-language model, LLaVA, using a novel curriculum learning method. Specifically, the model first learns to align biomedical vocabulary using the image-caption pairs as is, then learns open-ended conversational semantics using GPT-4 generated instruction-following data, broadly mimicking how a layperson gradually acquires biomedical knowledge. In Figure 5.14, we provide examples of biomedical visual conversations with different chatbots. LLaVA-Med precisely answers the questions requiring biomedical knowledge, while LLaVA behaves like a layperson and hallucinates based on common sense. LLaVA-Med has inspired several generalist biomedical AI models, including Google Med-PaLM-M (Tu et al., 2023), Stanford Med-Flamingo (Moor et al., 2023) and a radiology generalist (Wu et al., 2023b).

为了提高LMM的文本识别能力,已经开发了OCR特定的模型,包括BLIVA(Hu等人,2023)、LLaVAR(Zhang等人,2023o)和mPLUG-DocOwl(Ye等人,2023a)。LMMs最近在生物医学领域得到了探索(Sun等人,2023c;Zhang等人,2023m;Li等人,2023e),在这个领域,对话式生成AI已经展示出了赋能生物医学从业者的巨大潜力。LLaVA-Med(Li等人,2023e)是一种低成本的训练视觉-语言对话助手的方法,该助手可以回答关于生物医学图像的开放性研究问题。其关键思想是利用从PubMed Central中提取的大规模、广覆盖的生物医学图示-标题数据集,使用GPT-4从标题中自我指导生成开放性指令遵循数据,然后使用一种新的课程学习方法对大规模的通用领域视觉-语言模型LLaVA进行微调。具体来说,模型首先直接使用图像-标题对来学习对齐生物医学词汇,然后使用GPT-4生成的指令遵循数据来学习开放性对话语义,大致模仿了外行人逐渐获得生物医学知识的方式。在图5.14中,我们提供了不同聊天机器人的生物医学视觉对话示例。LLaVA-Med能够准确回答需要生物医学知识的问题,而LLaVA则表现得像一个外行人,根据常识产生幻觉。LLaVA-Med启发了一些通用生物医学AI模型,包括Google Med-PaLM-M(Tu等人,2023)、Stanford Med-Flamingo(Moor等人,2023)和放射学通用模型(Wu等人,2023b)。

5.5、How Close We Are To OpenAI Multimodal GPT-4?我们距离OpenAI多模态GPT-4有多近?



With all the works mentioned above, are we close to (or even surpassing) OpenAI Multimodal GPT-4? It is encouraging to see that the open-source community has quickly developed a variety of models and prototypes for various new capabilities. For example, LLaVA/MiniGPT-4 pave the way towards building multimodal chatbots, with some examples that reproduce the results in the OpenAI GPT-4 technical report; CM3leon (Yu et al., 2023), Emu (Sun et al., 2023a) and GILL (Koh et al., 2023) extend LMMs to end-to-end image generation, which, to the best of our knowledge, is a capability that the current GPT-4 does not exhibit. From the perspective of enabling new capabilities with minimum prototypes, the open-source community seems close to OpenAI Multimodal GPT-4, by exploring the baby steps towards building a general-purpose multimodal assistant.

However, there is still a clear, large gap in terms of scaling a given capability, e.g., the visual reasoning capability that we have observed in LLaVA. There are two more visual examples from the OpenAI technical report; to answer the questions correctly, models need to understand multiple high-resolution images and long text sequences depicted in the images, and to respond with domain knowledge. This requires much more compute and more powerful language models, which are not available to most people.

通过上述提到的所有工作,我们是否接近(甚至超越)OpenAI多模态GPT-4?令人鼓舞的是,开源社区迅速开发了各种不同新功能的模型和原型。例如,LLaVA/Mini-GPT4为构建多模态聊天机器人铺平了道路,其中一些示例重现了OpenAI GPT-4技术报告中的结果;CM3leon(Yu等人,2023)、Emu(Sun等人,2023a)、GILL(Koh等人,2023)扩展了LMM以实现端到端的图像生成,据我们所知,这是当前的GPT-4没有展示的能力。从通过最小原型来启用新功能的角度来看,开源社区似乎接近OpenAI多模态GPT-4,通过探索朝着构建通用多模态助手迈出的初步步伐。


In summary, we have presented the background and strong capabilities of LMMs, reviewed instruction tuning in LLMs, and shown how to build a prototype such as LLaVA and MiniGPT-4 using open-source resources. We have also summarized the most recent papers that have emerged in this line of research, to help those who are interested gain the momentum to start their journey in LMM research. As for the next steps to work on as a community, one sustainable suggestion is that those with resources continue focusing on scaling success and studying new emerging properties, while others focus on prototypes for new functionalities and evaluation, as well as on developing techniques that reduce the computational barriers and thus make large models more accessible.


6、Multimodal Agents:Chaining Tools with LLM 多模态智能代理:与LLM协同工作


Large Language Models (LLMs) (Chowdhery et al., 2022; OpenAI, 2023a) have shown intriguing properties: generalizing to user prompts in various domains, and rapidly adapting to new scenarios using in-context learning with a few examples. Inspired by such strong capabilities, researchers are now exploring a new modeling paradigm that shifts from standalone models for solving finite, pre-defined problems to synergistically chaining multiple tools or experts with LLMs to solve complicated, open problems. Unlike what has been introduced in Chapter 5, such a system can be built without any training involved, just by using a few demonstration examples to teach the LLM to generate proper calls to existing tools.
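The training-free tool-chaining idea can be sketched in a few lines: a prompt carries a demonstration of the calling convention, the LLM's reply is parsed for a tool call, and the named tool is executed. Everything here is a hypothetical stand-in for illustration: the `CALL name(arg)` convention, the `TOOLS` registry, and `fake_llm` (used so the sketch runs offline) are assumptions, not the design of any system discussed in this chapter.

```python
import re
from typing import Callable, Dict

# Toy stand-ins for vision experts; real tools would run models on the image.
TOOLS: Dict[str, Callable[[str], str]] = {
    "caption": lambda image: f"a caption for {image}",
    "ocr": lambda image: f"text found in {image}",
}

# One in-context demonstration that teaches the (assumed) calling convention.
DEMONSTRATIONS = (
    "User: What does the sign in photo.jpg say?\n"
    "Assistant: CALL ocr(photo.jpg)\n"
)

CALL_PATTERN = re.compile(r"CALL (\w+)\(([^)]*)\)")


def run_agent(user_request: str, llm: Callable[[str], str]) -> str:
    """Ask the LLM for an action; if it names a tool, run it and return the observation."""
    prompt = f"{DEMONSTRATIONS}User: {user_request}\nAssistant:"
    reply = llm(prompt)
    match = CALL_PATTERN.search(reply)
    if match is None:
        return reply.strip()  # the LLM answered directly
    tool_name, argument = match.groups()
    if tool_name not in TOOLS:
        return f"unknown tool: {tool_name}"
    return TOOLS[tool_name](argument)


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM so the sketch runs offline."""
    return " CALL caption(beach.png)"


print(run_agent("Describe beach.png", fake_llm))  # a caption for beach.png
```

A fuller agent would feed the tool's observation back to the LLM and loop until it produces a final answer; this sketch stops after a single call to keep the mechanism visible.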


In this chapter, we review the fast-evolving literature on chaining different multimodal experts with LLMs to solve complicated multimodal understanding problems, referred to as multimodal agents. We start with an overview of the evolution of this modeling paradigm in Section 6.1, highlighting the differences between traditional approaches and the new modeling paradigm of chaining tools with LLMs. Section 6.2 gives a general overview of multimodal agents. Pivoting on an exemplary multimodal agent, MM-REACT (Yang* et al., 2023), Section 6.3 comprehensively reviews how to build a multimodal agent, its emerging capabilities in multimodal understanding, and how it can be easily extended to incorporate the latest and strongest LLMs and potentially millions of tools. Finally, in Section 6.4, we end the chapter with discussions on advanced topics, such as how to improve and evaluate multimodal agents and the diverse applications they power.



第6.2节概述了多模态智能体的总体概述。以典型的多模式代理MM-REACT (Yang* et al., 2023)为中心,



