LAMM: Label Alignment for Multi-Modal Prompt Learning

Paper Reading Series Index



Paper Link

Title Meaning

LAMM: Label Alignment for Multi-Modal Prompt Learning, i.e., aligning the class-label representations used in multi-modal prompt learning.

Abstract

With the success of pre-trained visual-language (VL) models such as CLIP in visual representation tasks, transferring pre-trained models to downstream tasks has become a crucial paradigm. Recently, the prompt tuning paradigm, which draws inspiration from natural language processing (NLP), has made significant progress in the VL field. However, preceding methods mainly focus on constructing prompt templates for text and visual inputs, neglecting the gap in class label representations between the VL models and downstream tasks. To address this challenge, we introduce an innovative label alignment method named LAMM, which can dynamically adjust the category embeddings of downstream datasets through end-to-end training. Moreover, to achieve a more appropriate label distribution, we propose a hierarchical loss, encompassing the alignment of the parameter space, feature space, and logits space. We conduct experiments on 11 downstream vision datasets and demonstrate that our method significantly improves the performance of existing multi-modal prompt learning models in few-shot scenarios, with an average accuracy improvement of 2.31% over state-of-the-art methods in the 16-shot setting. Moreover, our methodology outperforms other prompt tuning methods in continual learning. Importantly, our method is synergistic with existing prompt tuning methods and can boost the performance on top of them. Our code and dataset will be publicly available at https://github.com/gaojingsheng/LAMM.

Introduction

  1. Building machines to comprehend multi-modal information in real-world environments is one of the primary goals of artificial intelligence, where vision and language are the two crucial modalities (Du et al. 2022). One effective implementation method is to pre-train a foundational vision-language (VL) model on a large-scale visual-text dataset and then transfer it to downstream application scenarios (Radford et al. 2021; Jia et al. 2021). Typically, VL models employ two separate encoders to encode image and text features, followed by the design of an appropriate loss function for training. However, fine-tuning such extensively trained models is costly and intricate, which makes the question of how to effectively transfer pre-trained VL models to downstream tasks an inspiring and valuable issue.

  2. Prompt learning provides an effective solution to this problem: it supplies downstream tasks with corresponding textual descriptions based on human prior knowledge and can effectively enhance the zero-shot and few-shot recognition capability of VL models. Through trainable templates with a small number of task-specific parameters, the process of constructing templates is further automated via gradient descent instead of manual construction (Lester, Al-Rfou, and Constant 2021). Specifically, existing multi-modal prompt tuning methods (Zhou et al. 2022b,a; Khattak et al. 2022) use the frozen CLIP (Radford et al. 2021) model and design trainable prompts separately for the textual and visual encoders. These approaches ensure that VL models can be better transferred to downstream tasks without any changes to the VL model's parameters. However, they mainly focus on a prompt template that is shared by all categories, overlooking the feature representation of each category. A minimal sketch of this trainable-template idea follows.
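To make the trainable prompt template concrete, here is a minimal, hedged PyTorch sketch in the spirit of CoOp: a set of shared context vectors is learned by gradient descent and prepended to the frozen class-name token embeddings before the sequence passes through the frozen CLIP text encoder. The module name `PromptLearner`, the tensor shapes, and the initialization scale are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class PromptLearner(nn.Module):
    """Minimal CoOp-style prompt learner (sketch): a shared, trainable context
    is prepended to frozen class-name token embeddings."""

    def __init__(self, n_ctx, ctx_dim, class_token_embeds):
        super().__init__()
        # Trainable context vectors, shared across all classes.
        self.ctx = nn.Parameter(torch.randn(n_ctx, ctx_dim) * 0.02)
        # Frozen token embeddings of the class names: (n_cls, n_tok, ctx_dim).
        self.register_buffer("cls_embeds", class_token_embeds)

    def forward(self):
        n_cls = self.cls_embeds.size(0)
        # Broadcast the shared context to every class and concatenate:
        # [ctx_1 ... ctx_M, class-name tokens] for each class.
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)
        return torch.cat([ctx, self.cls_embeds], dim=1)
```

The per-class prompt sequences produced here would then be encoded by the frozen text encoder to yield one text feature per class; only `self.ctx` receives gradients.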

  3. The class token in the text template is crucial in classifying an image into the proper category. For example, as depicted in Figure 1, llamas and alpacas are two animals that resemble each other closely. CLIP has a propensity to misclassify a llama as an alpaca owing to the overrepresentation of alpaca data in the pre-training dataset. By refining the text embedding position, CLIP can distinguish between these two species within the trained feature space. Hence, identifying an optimal representation for each category of the downstream task within the VL model is crucial. In NLP, the soft verbalizer (Cui et al. 2022) enables the model to predict on its own the label representation in the text template that stands for the category of the original sentence. Unlike in NLP, it is infeasible to task the text encoder of the VL model with predicting the image category directly. Nevertheless, we can optimize the category embeddings of the various categories in the downstream datasets to increase the similarity between each image and its corresponding category description, as sketched below.
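The following is a hedged sketch of that idea: each downstream class keeps a trainable label embedding, initialized from its class-name token embedding, and prediction is done by cosine similarity between the image feature and the encoded class descriptions. The names `LabelAlignHead` and `clip_logits`, and the assumed shapes, are illustrative, not the paper's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LabelAlignHead(nn.Module):
    """Sketch of label alignment: every downstream class owns a trainable
    embedding initialized from its class-name token embedding."""

    def __init__(self, init_label_embeds):
        super().__init__()
        # (n_cls, n_label_tok, dim): initialized from the tokenized class names,
        # then updated end-to-end while the rest of CLIP stays frozen.
        self.label_embeds = nn.Parameter(init_label_embeds.clone())

    def build_prompts(self, template_embeds):
        # template_embeds: frozen embedding of e.g. "a photo of a", (n_tmpl_tok, dim).
        n_cls = self.label_embeds.size(0)
        tmpl = template_embeds.unsqueeze(0).expand(n_cls, -1, -1)
        return torch.cat([tmpl, self.label_embeds], dim=1)

def clip_logits(image_feat, text_feat, logit_scale=100.0):
    # Cosine-similarity logits between image features and all class descriptions.
    image_feat = F.normalize(image_feat, dim=-1)
    text_feat = F.normalize(text_feat, dim=-1)
    return logit_scale * image_feat @ text_feat.t()
```

Training would maximize the logit of the ground-truth class for each image, so the label embeddings drift toward positions that separate confusable categories such as llama and alpaca.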

  4. Consequently, we introduce a label alignment technique named LAMM, which automatically searches for optimal category embeddings through gradient optimization. To the best of our knowledge, this is the first time a trainable category token has been proposed for pre-trained VL models. Simultaneously, to prevent the semantic features of the entire prompt template from deviating too far, we introduce a hierarchical loss during the training phase. The hierarchical loss aligns the category representations in the parameter, feature, and logits spaces. With these operations, the generalization ability of the CLIP model is preserved in LAMM, which lets LAMM better distinguish different categories in downstream tasks while preserving the semantics of the original category descriptions. Furthermore, since LAMM solely fine-tunes the label embeddings of the downstream dataset, it does not suffer from the catastrophic forgetting that conventional methods typically encounter during continual learning. A sketch of such a three-level regularizer follows.
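The section only names the three alignment levels, so the following is a heavily hedged sketch of one plausible instantiation: an L2 term in parameter space, a cosine term in feature space, and a KL term in logits space, each anchoring the tuned representations to the original hand-crafted ones. The weights, distances, and temperature are assumptions and may differ from LAMM's actual formulation.

```python
import torch
import torch.nn.functional as F

def hierarchical_loss(tuned_label_embeds, orig_label_embeds,
                      tuned_text_feats, orig_text_feats,
                      tuned_logits, orig_logits,
                      w_param=1.0, w_feat=1.0, w_logit=1.0, temperature=1.0):
    """Hedged sketch of a three-level alignment regularizer; the exact form
    used by LAMM may differ."""
    # Parameter space: keep tuned class embeddings near their initializations.
    loss_param = F.mse_loss(tuned_label_embeds, orig_label_embeds)
    # Feature space: keep encoded class descriptions close to the hand-crafted ones.
    loss_feat = 1.0 - F.cosine_similarity(tuned_text_feats, orig_text_feats, dim=-1).mean()
    # Logits space: keep the predicted distribution close to zero-shot CLIP's.
    loss_logit = F.kl_div(
        F.log_softmax(tuned_logits / temperature, dim=-1),
        F.softmax(orig_logits / temperature, dim=-1),
        reduction="batchmean",
    )
    return w_param * loss_param + w_feat * loss_feat + w_logit * loss_logit
```

The total training objective would then combine the usual image-text classification loss with this regularizer, which is what keeps the tuned labels semantically close to the original category descriptions.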

  5. We conduct experiments on 11 datasets, covering a range of downstream recognition scenarios. In terms of models, we test vanilla CLIP, CoOp (Zhou et al. 2022b), and MaPLe (Khattak et al. 2022), which currently perform best in multi-modal prompt learning. Extensive experiments demonstrate the effectiveness of the proposed method in few-shot learning and illuminate its merits in both domain generalization and continual learning. Furthermore, our approach is compatible with prevailing multi-modal prompt techniques and amplifies their efficacy across downstream datasets, ensuring consistent enhancement.

Related Work

Vision Language Models

In recent years, the development of Vision-Language Pre-Trained Models (VL-PTMs) has made tremendous progress, as evidenced by models such as CLIP (Radford et al. 2021), ALIGN (Jia et al. 2021), LiT (Zhai et al. 2022) and FILIP (Yao et al. 2022). These VL-PTMs are pre-trained on large-scale image-text corpora and learn universal cross-modal representations, which are beneficial for achieving strong performance on downstream VL tasks. For instance, CLIP is pre-trained on massive collections of image-caption pairs sourced from the internet, utilizing a contrastive loss that brings the representations of matching image-text pairs closer while pushing those of non-matching pairs further apart. After the pre-training stage, CLIP has demonstrated exceptional performance in learning universal cross-modal representations for image recognition (Gao et al. 2021), object detection (Zang et al. 2022), image segmentation (Li et al. 2022) and visual question answering (Sung, Cho, and Bansal 2022).
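As a reference point for the contrastive pre-training described above, here is a minimal sketch of CLIP's symmetric contrastive objective over a batch of already-encoded image and text features; the function name and the fixed `logit_scale` are illustrative simplifications (CLIP learns the temperature).

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_feats, text_feats, logit_scale=100.0):
    """Sketch of CLIP's symmetric contrastive loss: matching image-text pairs
    (the diagonal of the similarity matrix) are pulled together, all other
    pairings are pushed apart."""
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = logit_scale * image_feats @ text_feats.t()   # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)           # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)       # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)
```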


Further reading: an easy-to-follow explanation of Zero-Shot, One-Shot, and Few-Shot learning: https://zhuanlan.zhihu.com/p/696054303
