论文翻译《Semantic Prompt for Few-Shot Image Recognition》

论文地址:https://arxiv.org/pdf/2303.14123.pdf
论文代码:https://github.com/WentaoChen0813/SemanticPrompt

Abstract

Few-shot learning is a challenging problem since only a few examples are provided to recognize a new class. Several recent studies exploit additional semantic information, e.g. text embeddings of class names, to address the issue of rare samples through combining semantic prototypes with visual prototypes. However, these methods still suffer from the spurious visual features learned from the rare support samples, resulting in limited benefits. In this paper, we propose a novel Semantic Prompt (SP) approach for few-shot learning. Instead of the naive exploitation of semantic information for remedying classifiers, we explore leveraging semantic information as prompts to tune the visual feature extraction network adaptively. Specifically, we design two complementary mechanisms to insert semantic prompts into the feature extractor: one is to enable the interaction between semantic prompts and patch embeddings along the spatial dimension via self-attention, another is to supplement visual features with the transformed semantic prompts along the channel dimension. By combining these two mechanisms, the feature extractor presents a better ability to attend to the class-specific features and obtains more generalized image representations with merely a few support samples. Through extensive experiments on four datasets, the proposed approach achieves promising results, improving the 1-shot learning accuracy by 3.67% on average.

小样本学习是一个具有挑战性的问题,因为只有很少的例子可以识别一个新的类。最近的一些研究利用额外的语义信息,例如类名的文本嵌入,通过结合语义原型和视觉原型来解决稀少样本的问题。然而,这些方法仍然会受到从稀少的支持样本中学习到的虚假视觉特征的影响,从而导致有限的性能增益。在本文中,我们提出了一种新颖的面向小样本学习的语义提示(Semantic Prompt,SP)方法。并不是简单地利用语义信息对分类器进行补救,而是探索利用语义信息作为提示,对视觉特征提取网络进行自适应调整。具体来说,我们设计了两种互补的机制将语义提示插入到特征提取器中:一种是通过自注意力机制在空间维度上实现语义提示与块嵌入之间的交互,另一种是在通道维度上用转换后的语义提示补充视觉特征。通过结合这两种机制,特征提取器表现出更好的关注类特定特征的能力,并在仅有少量支持样本的情况下获得更一般化的图像表示。通过在四个小样本图像分类数据集上的大量实验,所提出的方法取得了良好的效果,平均提高了3.67 %的1-shot学习精度。

1. Introduction

Few-shot learning (FSL) [21] is a fundamental and challenging task and remains largely unsolved as it aims to predict a new class with rare samples. To address this problem, most effective FSL approaches leverage the prior knowledge learned from a large labeled base dataset, and encode the prior knowledge as a set of initial network parameters [12, 37, 42], or a fixed embedding function shared by all classes [16, 45, 46, 49].

Few-shot learning(FSL)[21]是一项基础且具有挑战性的任务:由于其目标是仅凭极少的样本识别新的类别,该问题在很大程度上仍未得到解决。为了解决这个问题,大多数有效的FSL方法利用从大型有标注基础数据集中学习到的先验知识,并将先验知识编码为一组初始网络参数[12,37,42],或所有类别共享的固定嵌入函数[16,45,46,49]。

As the labeled images of novel classes are scarce, a straightforward alternative is to use auxiliary information from other modalities, e.g. natural language, to assist in learning new concepts, which has been extensively studied in zero-shot learning [13,26,40,43]. These methods usually directly use textual embeddings as the image classifiers for novel classes. Following this idea, a recent FSL study [52] proposes to infer textual prototypes from class names and combine them with the visual prototypes (i.e., classifiers) extracted from the rare support images. Others [32, 53] improve this work by introducing more sophisticated textual prototype predictors (e.g. Graph Convolutional Network) or producing more accurate textual prototypes through leveraging the benefits of large-scale pre-trained language models.

由于新类别的标注图像稀缺,一种直接的替代方法是使用其他模态(例如自然语言)的辅助信息来辅助学习新概念,这在零样本学习[13,26,40,43]中得到了广泛的研究。这些方法通常直接使用文本嵌入作为新类别的图像分类器。遵循这一思路,最近的一项FSL研究[52]提出从类名推断文本原型,并将其与从稀少支持图像中提取的视觉原型(即分类器)相结合。其他工作[32,53]通过引入更复杂的文本原型预测器(例如图卷积网络,Graph Convolutional Network),或借助大规模预训练语言模型生成更准确的文本原型,来改进这项工作。

In spite of their success, most of the above methods for directly inferring class prototypes from textual features ignore the information gap between textual and visual features. Specifically, the textual features may contain the semantic relationship between a novel class and known classes. However, they fail to provide the exact discriminative visual features of the new class because of lacking interaction with the underlying visual representations. As a result, the rich semantic information has derived limited benefit for recognizing novel classes when directly injecting it into classifiers. Moreover, with only limited support images, the learned visual features still suffer from spurious features, such as background clutters, and struggles to produce an accurate class prototype. For example, as illustrated in Figure 1, given one support image of a novel class ‘unicycle’, the feature extractor may capture image features containing both unicycles and other distractors, like riders and tile roofs, and fail to recognize the unicycle in other environments. Actually, human perception system has a unique visual perceptual mechanism, called cognitive penetrability [30], which uses linguistic prior knowledge to tune ongoing visual perceptual processing to category-relevant stimulus features, promoting the learning of novel objects. Hence, it is necessary to develop a new architecture for effectively leveraging textual information to remedy the defective representation caused by rare samples.

尽管这些方法取得了成功,但大多数直接从文本特征推断类原型的方法都忽略了文本特征和视觉特征之间的信息差距。具体来说,文本特征可能包含新类和已知类之间的语义关系,然而由于缺乏与底层视觉表征的交互,它们无法提供新类确切的判别性视觉特征。因此,将丰富的语义信息直接注入分类器时,其对识别新类别的帮助有限。此外,在只有有限支持图像的情况下,学习到的视觉特征仍然会受到虚假特征(例如背景杂乱)的影响,难以产生准确的类原型。例如,如图1所示,给定一张新类别"独轮车"的支持图像,特征提取器可能捕获同时包含独轮车和其他干扰物(如骑手和瓦片屋顶)的图像特征,从而无法识别其他环境中的独轮车。实际上,人类感知系统具有一种独特的视觉感知机制,称为认知可渗透性(cognitive penetrability)[30],它利用语言先验知识将正在进行的视觉感知加工调整到与类别相关的刺激特征上,从而促进对新物体的学习。因此,有必要开发一种新的架构来有效地利用文本信息,以弥补由稀少样本导致的缺陷表示。


Figure 1. Given only one image about a new class ‘unicycle’, the feature extractor is easily confused by the spurious features, such as the rider on the unicycle, and fails to obtain generalized image representations about the new class. In this paper, we propose Semantic Prompt, a new method to condition the feature extraction on rich semantic prior knowledge, such that the feature extractor captures the intrinsic class-specific features about the novel class.

图1所示。仅给定一张关于新类别“独轮车”的图像,特征提取器很容易被虚假特征混淆,例如骑独轮车的人,并且无法获得关于新类别的广义图像表示。本文提出了一种新的语义提示方法,将丰富的语义先验知识作为特征提取的条件,使特征提取器能够捕捉到新类固有的类特有特征。

In this paper, we propose Semantic Prompt, a novel approach that leverages textual information of class names to significantly improve the representation ability of visual features for few-shot learning. Instead of directly inferring prototypes from textual features, we explore leveraging the textual features as semantic prompts to adaptively tune the feature extraction network for the rare support samples. As shown in Figure 1, with the guidance of semantic prompts, the feature extractor is expected to capture the intrinsic class-specific features for the novel class rather than other background clutters. Moreover, the advent of largescale training has produced a cornucopia of powerful Natural Language Processing (NLP) models, such as BERT [9] and GPT [36], which bootstrap extracting rich textual information from class names. Through the interaction between semantic prompts and visual features, such semantically rich representations have powerful potential to provide the feature extractor with additional discriminative visual features about the new class, and subsequently produce more generalized class prototypes.

在本文中,我们提出了语义提示(Semantic Prompt)这一新颖方法,利用类名的文本信息显著提升视觉特征在小样本学习中的表征能力。与直接从文本特征推断原型不同,我们探索将文本特征作为语义提示,针对稀少的支持样本自适应地调整特征提取网络。如图1所示,在语义提示的引导下,特征提取器有望捕获新类别固有的类别特异性特征,而不是其他背景杂波。此外,大规模训练的出现产生了大量强大的自然语言处理(Natural Language Processing,NLP)模型,如BERT[9]和GPT[36],为从类名中提取丰富的文本信息提供了有力支撑。通过语义提示和视觉特征之间的交互,这种语义丰富的表示有潜力为特征提取器提供关于新类别的额外判别性视觉特征,从而产生更具泛化性的类原型。

To condition the visual feature extraction on semantic prompts, we propose two complementary mechanisms to inject semantic information into the feature extractor, which allow the interaction between semantic prompts and visual features on the spatial and the channel dimensions, respectively. Specifically, to facilitate the interaction on the spatial dimension, we extend the image patch sequence with semantic prompts and feed them into a Transformer encoder. Through self-attention layers, the semantic prompts can inform the feature extractor to attend to the class-specific features while suppressing other distractors. For the interaction on the channel dimension, we first concatenate the semantic prompts with the visual context extracted from all patches, and then feed them into an MLP module. The extracted feature vector is added to each patch token to modulate and augment the visual features channel-by-channel. By combining the two interaction mechanisms, the proposed Semantic Prompt approach (SP) can effectively leverage the textual information in class names to boost FSL. Through comprehensive experiments on four benchmarks, the proposed SP presents consistent performance improvements with different types of text encoders and architecture designs, demonstrating its strong generality for the FSL problem.

为了使视觉特征提取以语义提示为条件,我们提出了两种互补机制,将语义信息注入特征提取器,使语义提示和视觉特征分别在空间和通道维度上进行交互。具体来说,为了促进空间维度上的交互,我们用语义提示扩展图像块序列,并将其输入到Transformer编码器中。通过自注意力层,语义提示可以引导特征提取器关注类别特异性特征,同时抑制其他干扰因素。对于通道维度上的交互,我们首先将语义提示与从所有图像块中提取的视觉上下文进行拼接,然后将其输入到MLP模块中。将提取的特征向量添加到每个patch token(图像块特征)中,以实现对视觉特征逐通道的调整和增强。通过结合这两种交互机制,所提出的语义提示方法(SP)可以有效地利用类名中的文本信息来增强FSL。通过在四个小样本图像分类数据集上的综合实验,所提出的SP在不同类型的文本编码器和架构设计下都实现了一致的性能提升,证明了它在FSL问题上的强大通用性。

In summary, our contributions are three-fold:

总之,我们的贡献有三个方面:

We propose a novel Semantic Prompt approach to leveraging textual information in class names for few-shot image recognition, which is inspired by the top-down cognitive penetrability effect in human perception and aims to adaptively tune the feature extraction to class-specific features according to the semantic prompts.

我们提出了一种新颖的语义提示方法来利用类名中的文本信息进行小样本图像识别,该方法受到人类感知中自上而下的认知渗透效应的启发,旨在根据语义提示将特征提取自适应地调整为类特定的特征。

To condition visual feature extraction on semantic prompts, we propose two complementary mechanisms to inject semantic prompts into the visual feature extractor, which allow the interaction on the spatial and the channel dimensions, respectively.

为了使视觉特征提取以语义提示为条件,我们提出了两种互补的机制将语义提示注入到视觉特征提取器中,分别允许在空间维度和通道维度上进行交互。

The proposed method achieves remarkable performance on four FSL benchmarks, improving the FSL accuracy by 3.67% on average under the challenging 1-shot setting.

所提出的方法在四个FSL小样本图像分类数据集上取得了显著的性能,在具有挑战性的1-shot设置下,FSL准确率平均提高了3.67%。

2. Related work

Few-shot learning. FSL aims to recognize novel classes given only a few examples for each class. Previous work usually adopts a meta-learning paradigm, in which a learner is trained on a sequence of few-shot training tasks (named episodes) sampled from a large base dataset in order to rapidly adapt to unseen testing tasks. In particular, optimization-based methods [12,37,42] aim to learn a set of optimal initial parameters shared by all tasks with fast adaptation ability. Metric learning-based methods [16,45,46,49] learn a fixed embedding function, which maps input images into a low-dimension embedding space and classifies unlabeled queries according to certain distances to the support samples, e.g., Euclidean distance [45], cosine-similarity distance [31], and Earth Mover’s Distance [56].

小样本学习。 FSL的目的是在每个类别只给出少量样例的情况下识别新的类别。先前的工作通常采用元学习范式,即学习器在从大型基础数据集采样的一系列小样本训练任务(称为episodes)上进行训练,以快速适应未见过的测试任务。特别地,基于优化的方法[12,37,42]旨在学习一组由所有任务共享的、具有快速适应能力的最优初始参数。基于度量学习的方法[16,45,46,49]学习一个固定的嵌入函数,将输入图像映射到一个低维嵌入空间中,并根据与支持样本之间的某种距离对未标注的查询样本进行分类,例如欧氏距离[45]、余弦相似度[31]和推土机距离(Earth Mover's Distance)[56]。

Few-shot learning with language. To leverage additional information from other modalities (especially language) to help recognize novel classes, a line of recent studies [3, 24, 32, 52] propose to integrate both visual features and auxiliary text features to represent a novel class. For example, Xing et al. [52] propose an adaptive fusion mechanism to combine a visual prototype with a semantic prototype obtained by the word embedding of the class label. Peng et al. [32] adopt a Graph Convolutional Network [58] as the predictor to incorporate additional knowledge from a knowledge graph. Yan et al. [54] propose a word vectorguided attention mechanism to obtain label prototypes for the multi-label few-shot learning problem. Different from previous work that leverages semantic information at the level of classifiers or class prototypes, we explore the auxiliary information as a kind of semantic prompt to enhance the feature extraction for the limited support samples.

结合语言的小样本学习。 为了利用来自其他模态(尤其是语言)的额外信息来帮助识别新类别,最近的一系列研究[3, 24, 32, 52]提出整合视觉特征和辅助文本特征来表示新类别。例如,Xing等人[52]提出了一种自适应融合机制,将视觉原型与通过类标签的词嵌入获得的语义原型结合起来。Peng等人[32]采用图卷积网络[58]作为预测器,从知识图谱中引入额外的知识。Yan等人[54]提出了一种词向量引导的注意力机制,用于获取多标签小样本学习问题的标签原型。与以往在分类器或类原型层面上利用语义信息的工作不同,我们将辅助信息作为一种语义提示,用来增强对有限支持样本的特征提取。

Transformer and prompt-based learning. Transformer is a general network architecture for NLP tasks [5, 9, 36, 55], and has also demonstrated great potential to deal with computer vision tasks [11, 28, 44, 59]. Specifically, Dosovitskiy et al. [11] propose a simple Vision Transformer (ViT) architecture that regards image patches as a sequence and inputs them into a Transformer encoder to extract visual features. Due to the limited inductive bias in its architecture design, Transformer usually requires a lot of data to learn a new task. To address this problem, prompt-based methods [5, 34] have been proposed to adapt a pre-trained language model to down-stream tasks in a data-efficient way. For example, Brown et al. [5] wrap the input sentence with several hand-crafted prompt words, which inform the model of the task prior knowledge and modulate the model's behavior to the desired mode. Other studies [23, 25, 57] propose to replace the discrete prompt words with continuous prompt vectors that are easier to optimize than the discrete prompts. Recently, Tsimpoukelli et al. [48] propose a cross-modal prompt approach, which regards image features as the prompts for language inputs to perform multimodal few-shot learning. In this paper, we propose to regard textual features as the semantic prompts for image inputs, which can tune the ongoing visual feature extraction to class-specific features and facilitate learning novel classes with few examples. As far as we know, this is the first attempt to adopt semantic features as prompts to tune visual feature extractors for few-shot learning.

Transformer 和基于提示的学习。 Transformer是用于NLP任务的通用网络架构[5,9,36,55],并且在处理计算机视觉任务方面也显示出巨大的潜力[11,28,44,59]。特别地,Dosovitskiy等人[11]提出了一种简单的视觉Transformer(Vision Transformer,ViT)架构,将图像patch视为一个序列,输入到Transformer编码器中提取视觉特征。由于其架构设计中的归纳偏置有限,Transformer通常需要大量数据来学习新任务。为了解决这个问题,研究者提出了基于提示的方法[5,34],以数据高效的方式使预训练的语言模型适应下游任务。例如,Brown等人[5]用几个手工设计的提示词包裹输入句子,从而将任务先验知识告知模型,并将模型的行为调节到所需的模式。其他研究[23,25,57]提出用连续提示向量代替离散提示词,前者比离散提示更容易优化。最近,Tsimpoukelli等人[48]提出了一种跨模态提示方法,将图像特征作为语言输入的提示,进行多模态小样本学习。在本文中,我们提出将文本特征作为图像输入的语义提示,它可以将正在进行的视觉特征提取调整到类别特异性特征上,便于用较少的示例学习新类别。据我们所知,这是首次采用语义特征作为提示来调整视觉特征提取器以进行小样本学习。

3. Problem formulation

The FSL problem is usually defined as an $N$-way $K$-shot classification task, where a model should classify a query sample $x^q$ from the query set $Q$ into one of $N$ classes $C_{novel}$, based on a few labeled examples $(x^s_i, y^s_i)^{N\times K}_{i=1}$ from the support set $S$. Since it is very difficult to train a model from scratch with the small support set $S$, a large labeled dataset $D_{base}$ is provided to pre-train the model before performing few-shot learning. Previous work usually adopts a meta-training strategy [49] to split the base dataset into multiple $N$-way $K$-shot episodes. Each episode also contains a support set and a query set, mimicking the few-shot learning problem during testing. Note that the base classes $C_{base}$ do not overlap with the novel classes, i.e., $C_{base} \cap C_{novel} = \phi$. Therefore, the model is expected to acquire the ability to generalize to unseen classes after meta-training.

FSL 问题通常被定义为一个 $N$-way $K$-shot 分类任务:模型需要根据支持集 $S$ 中的少量带标注样例 $(x^s_i, y^s_i)^{N\times K}_{i=1}$,将查询集 $Q$ 中的查询样本 $x^q$ 分类到 $N$ 个新类别 $C_{novel}$ 之一。由于仅凭较小的支持集 $S$ 从头训练模型非常困难,因此在进行小样本学习之前,会提供一个大型标注数据集 $D_{base}$ 来预训练模型。以往的工作通常采用元训练策略[49],将基础数据集划分成多个 $N$-way $K$-shot episodes。每个 episode 同样包含一个支持集和一个查询集,以模拟测试过程中的小样本学习问题。需要注意的是,基础类 $C_{base}$ 与新类别不重叠,即 $C_{base} \cap C_{novel} = \phi$。因此,该模型有望在元训练后获得泛化到未见类别的能力。

Variant: In most previous work, the image label $y$ is usually represented as a one-hot vector, e.g. $y = [0, 1, 0, 0, ...]$. However, this representation erases the semantic relationships among object concepts and ignores the valuable linguistic information contained in the textual labels. In this paper, we retain text labels (e.g. 'cat', 'dog') besides the one-hot labels in order to extract semantics from text labels. We denote $y^{text}$ as the text label to distinguish it from the one-hot label $y$.

变体:在以前的大多数工作中,图像标签 $y$ 通常表示为 one-hot 向量,例如 $y = [0, 1, 0, 0, ...]$。然而,这种表示抹去了对象概念之间的语义关系,忽略了文本标签中蕴含的有价值的语言信息。在本文中,除了 one-hot 标签,我们还保留了文本标签(如"猫"、"狗"),以便从文本标签中提取语义。我们用 $y^{text}$ 表示文本标签,以与 one-hot 标签 $y$ 区分开来。
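A minimal sketch of how an $N$-way $K$-shot episode could be sampled from the base dataset under this formulation (not the official implementation; the `(image, label, text_label)` data layout and helper names are assumptions for illustration):

```python
import random
from collections import defaultdict

def sample_episode(dataset, n_way=5, k_shot=1, n_query=15):
    """Sample one N-way K-shot episode from a list of (image, label, text_label) triples.

    The (image, label, text_label) layout is an assumption for illustration; the text
    label y_text is kept alongside the class index so it can later drive the semantic prompt.
    """
    by_class = defaultdict(list)
    for img, label, text in dataset:
        by_class[(label, text)].append(img)

    episode_classes = random.sample(list(by_class.keys()), n_way)
    support, query = [], []
    for episode_label, (label, text) in enumerate(episode_classes):
        images = random.sample(by_class[(label, text)], k_shot + n_query)
        # first k_shot images form the support set, the rest form the query set;
        # classes are re-indexed to 0..N-1 inside the episode
        support += [(img, episode_label, text) for img in images[:k_shot]]
        query += [(img, episode_label, text) for img in images[k_shot:]]
    return support, query
```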

4. Method

Following [6], our approach consists of two training stages. In the first stage, we pre-train a feature extractor $f$ by classifying all images in the base set $D_{base}$. In the second stage, we fine-tune $f$ with Semantic Prompt (SP) under the meta-learning paradigm, such that $f$ acquires the ability to extract generalized and class-relevant visual features for data-scarce scenarios.

根据文献[6],我们的方法包括两个训练阶段。在第一阶段,我们通过对基础数据集 $D_{base}$ 中的所有图像进行分类来预训练一个特征提取器 $f$。在第二阶段,我们在元学习范式下使用语义提示(Semantic Prompt,SP)对 $f$ 进行微调,使得 $f$ 获得针对数据稀缺场景提取通用且类别相关的视觉特征的能力。


Figure 2. Framework of the proposed Semantic Prompt approach. The support image is split into small patches and fed into Transformer layers to extract visual features, which however may contain both class-specific features and other clutters. To address this problem, we leverage textual features extracted from class names as semantic prompts to adaptively tune the visual feature extraction. The semantic prompts can interact with visual features along the spatial and the channel dimensions, and guide the feature extractor to capture the intrinsic discriminative features about the new class.

图2。提出的语义提示方法的框架。支持图像被分割成小块并送入Transformer层以提取视觉特征,但这些特征可能同时包含类别特异性特征和其他杂乱特征。为了解决这个问题,我们利用从类名中提取的文本特征作为语义提示,自适应地调整视觉特征提取。语义提示可以沿着空间维度和通道维度与视觉特征进行交互,引导特征提取器捕获关于新类的内在判别性特征。

4.1. Pre-training

Learning a general feature extractor is the key to transfer knowledge to down-stream learning tasks [15, 19, 35], including few-shot learning [47]. Given the labeled base dataset $D_{base}$, we adopt a simple supervised learning paradigm to learn the feature extractor. A linear classification head $[W, b]$ is added on the top of the feature extractor, which maps the input feature vector $f(x)$ into one of the base classes. We jointly train the feature extractor and the classification head by minimizing the standard cross entropy loss:
$$\mathcal{L}_{pre}=\frac{1}{|D_{base}|}\sum_{(x,y)\in D_{base}}-\log\frac{\exp(W_y^T f(x)+b_y)}{\sum_i\exp(W_i^T f(x)+b_i)}$$
where $W_i$, $b_i$ denote the classifier weight and the bias for the class $i$.

学习一个通用的特征提取器是将知识迁移到下游学习任务[15,19,35](包括小样本学习[47])的关键。给定有标注的基础数据集 $D_{base}$,我们采用简单的监督学习范式来学习特征提取器。在特征提取器的顶部添加一个线性分类头 $[W, b]$,将输入的特征向量 $f(x)$ 映射为其中的一个基类。我们通过最小化标准交叉熵损失来联合训练特征提取器和分类头:
$$\mathcal{L}_{pre}=\frac{1}{|D_{base}|}\sum_{(x,y)\in D_{base}}-\log\frac{\exp(W_y^T f(x)+b_y)}{\sum_i\exp(W_i^T f(x)+b_i)}$$
其中,$W_i$、$b_i$ 分别表示分类器中第 $i$ 类的权重和偏置。
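The pre-training stage (feature extractor plus a linear head trained with standard cross entropy) can be sketched as follows; this is an illustrative sketch rather than the official code, and the feature extractor is passed in as a placeholder module:

```python
import torch
import torch.nn as nn

class PretrainModel(nn.Module):
    def __init__(self, feature_extractor: nn.Module, feat_dim: int, num_base_classes: int):
        super().__init__()
        self.f = feature_extractor                          # feature extractor f (e.g. a Visformer)
        self.head = nn.Linear(feat_dim, num_base_classes)   # linear classification head [W, b]

    def forward(self, x):
        return self.head(self.f(x))                         # logits W_i^T f(x) + b_i

def pretrain_step(model, images, labels, optimizer):
    # one optimization step of the standard cross-entropy loss L_pre
    criterion = nn.CrossEntropyLoss()
    loss = criterion(model(images), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```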

Backbone: To facilitate the following interaction between visual features and semantic prompts, we adopt the Vision Transformers as the image feature extractor $f$. Specifically, an input image $x \in \mathbb{R}^{H\times W\times C}$ is first divided into a sequence of $M$ image patches $X = \{x^1_p, x^2_p, ..., x^M_p\}$, where $x^i_p \in \mathbb{R}^{P\times P\times C}$ is an image patch and $P$ is the patch size. Then, each patch is mapped into an embedding vector and added with a learnable position embedding. The preprocessed image patches for the Transformer input can be written as $Z_0 = [z^1_0, z^2_0, ..., z^M_0]$, where $z^i_0 \in \mathbb{R}^{C_z}$ is the patch token at the position $i$ and $C_z$ is the number of channels of each token.

主干:为了便于后续视觉特征与语义提示的交互,我们采用 Vision Transformer 作为图像特征提取器 $f$。具体来说,首先将输入图像 $x \in \mathbb{R}^{H\times W\times C}$ 划分为 $M$ 个图像块组成的序列 $X = \{x^1_p, x^2_p, ..., x^M_p\}$,其中 $x^i_p \in \mathbb{R}^{P\times P\times C}$ 为一个图像块,$P$ 为图像块的大小。然后,将每个图像块映射为一个嵌入向量,并加上一个可学习的位置嵌入。预处理后的 Transformer 输入图像块序列可以写成 $Z_0 = [z^1_0, z^2_0, ..., z^M_0]$,其中 $z^i_0 \in \mathbb{R}^{C_z}$ 是位置 $i$ 处的 patch token(图像块特征),$C_z$ 是每个 token 的通道数。
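The patch splitting and embedding step is commonly implemented with a convolution whose kernel size and stride both equal the patch size. A minimal sketch (hyper-parameters are illustrative assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an H x W x C image into M patches of size P x P and map each to a C_z-dim token."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=384):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2               # M
        # a conv with kernel = stride = P is equivalent to "split into patches + linear projection"
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, x):                      # x: (B, C, H, W)
        z = self.proj(x)                       # (B, C_z, H/P, W/P)
        z = z.flatten(2).transpose(1, 2)       # (B, M, C_z), i.e. the sequence Z_0
        return z + self.pos_embed              # add the learnable position embedding
```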

The patch tokens are fed into $L$ Transformer layers to extract visual features, each of which consists of multi-head self-attention (MSA), an MLP block, Layernorm (LN), and residual connections. (Please refer to the appendix for more details.) At the top layer $L$, we average all embedding vectors in the sequence as the extracted image features:
$$f(x)=\frac{1}{M}\sum_{i=1}^M z_L^i$$
where $z^i_L$ is the $i$-th embedding vector at the layer $L$.

patch token(图像块特征)被送入 $L$ 个 Transformer 层以提取视觉特征,每个 Transformer 层由多头自注意力(MSA)模块、一个 MLP 块、Layernorm(LN)和残差连接组成。(更多细节请参考附录。)在顶层 $L$,我们将序列中所有嵌入向量的平均值作为提取的图像特征:
$$f(x)=\frac{1}{M}\sum_{i=1}^M z_L^i$$
其中,$z^i_L$ 是第 $L$ 层的第 $i$ 个嵌入向量。

Note that self-attention has quadratic computation costs with respect to the sequence length. To reduce computation costs, we adopt the Visformer [7], a variant of the original ViT [11], in our implementation, which replaces the first seven Transformer layers with convolutional blocks and adopts pooling among stages to reduce the sequence length.

需要注意的是,自注意力的计算成本与序列长度成二次方关系。为了降低计算成本,我们在实现中采用了Visformer [7](原始ViT [11]的一个变体),它用卷积块替换了前七个Transformer层,并在各阶段之间采用池化来缩短序列长度。

4.2. Semantic Prompt

After pre-trained on the base dataset, the feature extractor $f$ can extract substantial visual features from the input images. However, due to the semantic shift between novel classes and the base dataset, the feature extractor is limited in its ability to generalize the knowledge to novel concepts with only a few labeled examples, especially when spurious correlations appear in novel class images [3, 50]. For example, given an image of an unseen bird standing in a tree, the model may treat both bird features and other visual features (e.g. leaves, twigs) to represent the concept of the bird, and fails to recognize the bird in other environments.

经过在基础数据集上的预训练,特征提取器 $f$ 可以从输入图像中提取丰富的视觉特征。然而,由于新类别和基础数据集之间存在语义偏移,特征提取器在仅有少量标注样本的情况下,将知识泛化到新概念的能力受到限制,尤其是当新类别图像中出现虚假相关[3, 50]时。例如,给定一张从未见过的、立在树上的鸟的图像,该模型可能同时用鸟类特征和其他视觉特征(如树叶、树枝)来表示鸟类这一概念,从而无法识别其他环境中的鸟。

To mitigate this problem, we explore additional semantic information as prompts to guide the visual feature network to obtain intrinsic and discriminative class prototypes under rare support samples, so that query images can be classified easily in terms of their distances to theses prototypes. Specifically, textual data of class names is adopted as prior knowledge for novel classes, due to its strong ability to describe semantics. Moreover, we use the NLP models with large-scale pre-training [33, 35, 38] to extract textual features. The prior knowledge from a large bank of pre-trained NLP models benefits textual feature extraction from class names.

为了缓解这个问题,我们探索了额外的语义信息作为提示,以指导视觉特征网络在稀少的支持样本下获得内在的、有判别力的类原型,从而可以根据查询图像与这些原型的距离轻松地进行分类。具体来说,类名的文本数据由于其强大的语义描述能力被用作新类的先验知识。此外,我们使用大规模预训练[33、35、38]的NLP模型来提取文本特征。来自大量预训练NLP模型的先验知识有利于从类名中提取文本特征。

To accommodate the model to semantic prompts, we adopt the meta-training strategy [49] to fine-tune the feature extractor associated with semantic prompts on a series of training episodes. The framework of our approach is illustrated in Figure 2. Specifically, given a support image $x^s$ in a training episode, we feed its class name $y^{text}$ into a pre-trained language model $g(\cdot)$ to extract semantic features, i.e., $g(y^{text})$. The semantic features are used to modulate the feature extraction for the rare support samples. We denote $f_g(x^s)=f(x^s|g(y^{text}))$ as the conditional feature extraction process, which will be described in the following section. The obtained support features are averaged within each class to compute class prototypes. Let $p_i$ denote the prototype for the class $i$, then
$$p_i=\frac{1}{K}\sum_{j=1}^K f_g(x_j^s)$$
where $x^s_j$ is the $j$-th support image of the class $i$.

为了让模型适应语义提示,我们采用元训练策略[49],在一系列训练 episode 上对结合语义提示的特征提取器进行微调。我们的方法框架如图2所示。具体来说,给定训练 episode 中的一张支持图像 $x^s$,我们将其类名 $y^{text}$ 输入预训练好的语言模型 $g(\cdot)$,以提取语义特征,即 $g(y^{text})$。语义特征用于调节稀少支持样本的特征提取。我们用 $f_g(x^s)=f(x^s|g(y^{text}))$ 表示条件特征提取过程,这将在下一节中描述。将得到的支持特征在每个类内求平均,从而计算出类原型。令 $p_i$ 表示第 $i$ 类的原型,则
$$p_i=\frac{1}{K}\sum_{j=1}^K f_g(x_j^s)$$
其中,$x^s_j$ 是类 $i$ 的第 $j$ 张支持图像。

During meta-training, we freeze the text encoder $g(\cdot)$ and fine-tune other parameters by maximizing the feature similarities between query samples and their prototypes with a cross-entropy loss:
$$\mathcal{L}_{meta}=-\mathbb{E}_{S,Q}\mathbb{E}_{x^q}\log\frac{\exp(s(f(x^q),p_{y^q})/\tau)}{\sum_{i=1}^N\exp(s(f(x^q),p_i)/\tau)}$$
where $s$ denotes the cosine similarity, $p_{y^q}$ is the prototype of the class $y^q$, and $\tau$ is a temperature hyper-parameter.

在元训练期间,我们冻结文本编码器 $g(\cdot)$,并通过交叉熵损失最大化查询样本与其原型之间的特征相似性来微调其他参数:
$$\mathcal{L}_{meta}=-\mathbb{E}_{S,Q}\mathbb{E}_{x^q}\log\frac{\exp(s(f(x^q),p_{y^q})/\tau)}{\sum_{i=1}^N\exp(s(f(x^q),p_i)/\tau)}$$
其中,$s$ 表示余弦相似度,$p_{y^q}$ 是类 $y^q$ 的原型,$\tau$ 是温度超参数。
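Prototype computation and the cosine-similarity meta-training loss above can be sketched as follows (an illustrative sketch, not the official code; `support_feats` are assumed to be the prompt-conditioned features $f_g(x^s)$ arranged class by class):

```python
import torch
import torch.nn.functional as F

def meta_loss(support_feats, query_feats, query_labels, n_way, k_shot, tau=0.2):
    """support_feats: (N*K, D) prompt-conditioned support features, grouped by class.
    query_feats: (Q, D); query_labels: (Q,) with values in 0..N-1."""
    # class prototypes: average the support features within each class
    prototypes = support_feats.view(n_way, k_shot, -1).mean(dim=1)   # (N, D)

    # cosine similarity s(f(x^q), p_i) scaled by the temperature tau
    q = F.normalize(query_feats, dim=-1)
    p = F.normalize(prototypes, dim=-1)
    logits = q @ p.t() / tau                                         # (Q, N)

    # cross entropy over the similarities matches the negative log-likelihood form of L_meta
    return F.cross_entropy(logits, query_labels)
```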

4.2.1 Interaction on the spatial dimension

We first take inspiration from the prompt methods in NLP [5, 34] to concatenate prompt vectors with the input sequence and feed them together into Transformer layers. Given the semantic features $g(y^{text})$ and the input sequence of patch embeddings $Z_{l-1} = [z^1_{l-1}, z^2_{l-1}, ..., z^M_{l-1}] \in \mathbb{R}^{M\times C_z}$ at the layer $l$, we obtain a new sequence $\hat{Z}_{l-1} \in \mathbb{R}^{(M+1)\times C_z}$ by extending $Z_{l-1}$ with the projected semantic features:
$$\hat{Z}_{l-1}=[z^0, z_{l-1}^1, ..., z_{l-1}^M]$$
where $z^0=h_s(g(y^{text}))\in\mathbb{R}^{C_z}$ is the projected semantic embedding for spatial interaction and $h_s$ is the projector that keeps the dimension of the semantic embedding the same as the patch embeddings.

我们首先从 NLP [5, 34] 中的提示方法得到启发,将提示向量与输入序列拼接,一起送入 Transformer 层。给定语义特征 $g(y^{text})$ 和第 $l$ 层的 patch 嵌入输入序列 $Z_{l-1} = [z^1_{l-1}, z^2_{l-1}, ..., z^M_{l-1}] \in \mathbb{R}^{M\times C_z}$,利用投影后的语义特征对 $Z_{l-1}$ 进行扩展,得到新的序列 $\hat{Z}_{l-1} \in \mathbb{R}^{(M+1)\times C_z}$:
$$\hat{Z}_{l-1}=[z^0, z_{l-1}^1, ..., z_{l-1}^M]$$
其中,$z^0=h_s(g(y^{text}))\in\mathbb{R}^{C_z}$ 是用于空间交互的投影后语义嵌入,$h_s$ 是使语义嵌入维度与 patch 嵌入维度保持一致的投影器。

Then, the extended sequence $\hat{Z}_{l-1}$ is fed into the remaining Transformer layers, which contain multi-head self-attention modules (MSA) to allow the interaction between semantic prompts and patch tokens along the spatial dimension. Specifically, letting $\hat{Z}_{l-1}$ be the input sequence to a MSA module at the layer $l$, MSA first maps each token into three vectors, $q, k, v \in \mathbb{R}^{N_h\times(M+1)\times C_h}$, with linear projection parameterized by $W_{qkv}$, i.e.,
$$[q,k,v]=\hat{Z}_{l-1} W_{qkv}$$
where $N_h$ is the number of heads and $C_h$ is the number of channels for each head. It then computes the attention weights $A \in \mathbb{R}^{N_h\times(M+1)\times(M+1)}$ by taking the inner product between $q$ and $k$ and performing softmax along the spatial dimension:
$$A=\mathrm{softmax}(qk^T/C_h^{\frac{1}{4}})$$
The attention weights are used to choose and aggregate information from different positions. The final output is obtained by concatenating outputs of all heads and performing linear projection parameterized by $W_{out}$:
$$\mathrm{MSA}(\hat{Z}_{l-1})=(Av)W_{out}$$

然后,将扩展序列 $\hat{Z}_{l-1}$ 输入到其余的 Transformer 层中,这些层包含多头自注意力模块(MSA),允许语义提示和 patch token(图像块特征)沿空间维度进行交互。具体来说,设 $\hat{Z}_{l-1}$ 为第 $l$ 层 MSA 模块的输入序列,MSA 首先将每个 token 映射为三个向量 $q, k, v \in \mathbb{R}^{N_h\times(M+1)\times C_h}$,其线性投影由 $W_{qkv}$ 参数化,即:
$$[q,k,v]=\hat{Z}_{l-1} W_{qkv}$$
其中,$N_h$ 为头数,$C_h$ 为每个头的通道数。然后,通过取 $q$ 和 $k$ 之间的内积并沿空间维度执行 softmax 来计算注意力权重 $A \in \mathbb{R}^{N_h\times(M+1)\times(M+1)}$:
$$A=\mathrm{softmax}(qk^T/C_h^{\frac{1}{4}})$$
注意力权重用于选择和聚合来自不同位置的信息。将所有头的输出拼接起来,并经过由 $W_{out}$ 参数化的线性投影,得到最终输出:
$$\mathrm{MSA}(\hat{Z}_{l-1})=(Av)W_{out}$$
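Spatial interaction amounts to projecting the semantic feature with $h_s$, prepending it to the patch sequence as an extra token, and letting standard self-attention layers handle the rest. A minimal sketch (not the official code; PyTorch's built-in `TransformerEncoderLayer` stands in for the MSA+MLP block described above, and the dimensions are illustrative assumptions):

```python
import torch
import torch.nn as nn

class SpatialPromptBlock(nn.Module):
    def __init__(self, sem_dim=512, embed_dim=384, num_heads=6):
        super().__init__()
        self.h_s = nn.Linear(sem_dim, embed_dim)   # projector h_s: maps g(y^text) to C_z dims
        self.layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)

    def forward(self, patch_tokens, sem_feat):
        # patch_tokens: (B, M, C_z); sem_feat: (B, sem_dim), i.e. g(y^text)
        prompt = self.h_s(sem_feat).unsqueeze(1)        # (B, 1, C_z), the prompt token z^0
        z = torch.cat([prompt, patch_tokens], dim=1)    # extended sequence of length M+1
        z = self.layer(z)                               # self-attention lets the prompt and patches interact
        return z[:, 1:], z[:, :1]                       # updated patch tokens, updated prompt token
```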

4.2.2 Interaction on the channel dimension

Besides spatial interaction via MSA, we propose another interaction mechanism that allows modulating and augmenting visual features channel-by-channel according to the input semantic prompts. Given the input sequence of patch embeddings $Z_{l-1} = [z^1_{l-1}, z^2_{l-1}, ..., z^M_{l-1}] \in \mathbb{R}^{M\times C_z}$ at the layer $l$, we first obtain a global visual context vector $z^c_{l-1} \in \mathbb{R}^{C_z}$ by averaging all patch tokens:
$$z_{l-1}^c=\frac{1}{M}\sum_{i=1}^M z_{l-1}^i$$
The visual context $z^c_{l-1}$ is then concatenated with the projected semantic vector $z^0=h_c(g(y^{text}))\in\mathbb{R}^{C_z}$, and fed into a 2-layer MLP module to obtain a modulating vector $\beta_{l-1} \in \mathbb{R}^{C_z}$:
$$\beta_{l-1}=\sigma(W_2\,\sigma(W_1[z^0;z_{l-1}^c]+b_1)+b_2)$$
where $W_1$, $b_1$, $W_2$, $b_2$ are the parameters of the MLP module, $\sigma$ is the sigmoid activation function, and $h_c$ is the projector for the channel interaction.

除了通过 MSA 进行空间交互外,我们还提出了另一种交互机制,根据输入的语义提示逐通道地调节和增强视觉特征。给定第 $l$ 层的 patch 嵌入输入序列 $Z_{l-1} = [z^1_{l-1}, z^2_{l-1}, ..., z^M_{l-1}] \in \mathbb{R}^{M\times C_z}$,我们首先通过对所有 patch token(图像块特征)求平均,获得一个全局视觉上下文向量 $z^c_{l-1} \in \mathbb{R}^{C_z}$:
$$z_{l-1}^c=\frac{1}{M}\sum_{i=1}^M z_{l-1}^i$$
然后,将视觉上下文 $z^c_{l-1}$ 与投影后的语义向量 $z^0=h_c(g(y^{text}))\in\mathbb{R}^{C_z}$ 拼接,并送入两层的 MLP 模块,得到调制向量 $\beta_{l-1} \in \mathbb{R}^{C_z}$:
$$\beta_{l-1}=\sigma(W_2\,\sigma(W_1[z^0;z_{l-1}^c]+b_1)+b_2)$$
其中,$W_1$、$b_1$、$W_2$、$b_2$ 为 MLP 模块的参数,$\sigma$ 为 sigmoid 激活函数,$h_c$ 为用于通道交互的投影器。

We finally add the modulating vector to all patch tokens such that it can tune the visual features at each channel. The modulated sequence $\tilde{Z}_{l-1}\in\mathbb{R}^{M\times C_z}$ can be written as:
$$\tilde{Z}_{l-1}=[z_{l-1}^i+\beta_{l-1}],\quad i=1,2,...,M$$

最后,我们将调制向量加到所有 patch token(图像块特征)上,使其可以在每个通道上调整视觉特征。调制后的序列 $\tilde{Z}_{l-1}\in\mathbb{R}^{M\times C_z}$ 可以写成:
$$\tilde{Z}_{l-1}=[z_{l-1}^i+\beta_{l-1}],\quad i=1,2,...,M$$
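The channel interaction can be implemented almost verbatim from the equations above: average the patch tokens into a context vector, concatenate it with the projected semantic vector, pass the result through a 2-layer MLP, and add the modulating vector back to every token. An illustrative sketch (not the official code; dimensions are assumptions):

```python
import torch
import torch.nn as nn

class ChannelPromptBlock(nn.Module):
    def __init__(self, sem_dim=512, embed_dim=384, hidden_dim=384):
        super().__init__()
        self.h_c = nn.Linear(sem_dim, embed_dim)        # projector h_c
        self.mlp = nn.Sequential(                       # 2-layer MLP; sigma is the sigmoid activation
            nn.Linear(2 * embed_dim, hidden_dim), nn.Sigmoid(),
            nn.Linear(hidden_dim, embed_dim), nn.Sigmoid())

    def forward(self, patch_tokens, sem_feat):
        # patch_tokens: (B, M, C_z); sem_feat: (B, sem_dim)
        context = patch_tokens.mean(dim=1)                    # visual context z^c_{l-1}
        z0 = self.h_c(sem_feat)                               # projected semantic vector z^0
        beta = self.mlp(torch.cat([z0, context], dim=-1))     # modulating vector beta_{l-1}: (B, C_z)
        return patch_tokens + beta.unsqueeze(1)               # add beta to every patch token
```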

5. Experiments

5.1. Datasets and implementation details

miniImageNet and tieredImageNet. The miniImageNet dataset is proposed in [49] to benchmark the few-shot learning problem. It contains a subset of 100 classes in the ImageNet [41] dataset, where 64 classes are used as base classes for pre-training and meta-training, 16 classes are used for validation, and 20 classes are used for testing. The tieredImageNet dataset [39] is also derived from ImageNet and contains more classes: 351 classes used for training, 97 classes used for validation, and 160 classes used for testing. The semantic difference between base classes and novel classes in the tieredImageNet dataset is much larger than miniImageNet.

miniImageNet and tieredImageNet. 文献[49]提出了miniImageNet数据集,作为小样本学习问题的基准。它包含ImageNet [41]数据集中100个类的子集,其中64个类作为基类用于预训练和元训练,16个类用于验证,20个类用于测试。tieredImageNet数据集[39]也来源于ImageNet,包含更多的类:351个类用于训练,97个类用于验证,160个类用于测试。在tieredImageNet数据集中,基类和新类之间的语义差异远大于miniImageNet。

CIFAR-FS and FC100. These two datasets are derived from the CIFAR-100 [20] dataset with different partition modes. CIFAR-FS [22] randomly splits 100 classes into 64 training classes, 16 validation classes and 20 testing classes. In contrast, FC100 [31] divides classes based on their semantic superclasses, where 60 classes from 20 superclasses are used for training, 20 classes from 4 superclasses are used for validation, 20 classes from 4 superclasses are used for testing. The large semantic gap makes FC100 more difficult than CIFAR-FS.

CIFAR-FS and FC100. 这两个数据集来自于CIFAR - 100 [20]数据集,具有不同的划分模式。CIFAR-FS [22]将100个类随机分为64个训练类,16个验证类和20个测试类。相比之下,FC100 [31]基于语义超类划分类,其中20个超类中的60个类用于训练,4个超类中的20个类用于验证,4个超类中的20个类用于测试。较大的语义鸿沟使得FC100比CIFAR-FS更加困难。

Text encoders. To extract rich semantic features from class names, we adopt three types of text encoders, i.e., CLIP [35], SBERT [38], and GloVe [33], which are pretrained on large-scale corpora and are available for public use. For CLIP, we only use its text encoder, and extend the input class name with a text template: A photo of a {class name}. For SBERT and Glove, we directly feed class names into their encoders and average the output word vectors if there are multiple words in a name.

Text encoders. 为了从类名中提取丰富的语义特征,我们采用了三种文本编码器,即CLIP [35]、SBERT [38]和GloVe [33],它们都在大规模语料库上进行过预训练并可公开获取。对于CLIP,我们只使用它的文本编码器,并用一个文本模板扩展输入类名:"A photo of a {class name}"。对于SBERT和GloVe,我们直接将类名输入到它们的编码器中;如果类名包含多个词,则对输出的词向量取平均。
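As an example of extracting semantic features from class names with the CLIP text encoder (a sketch that assumes the OpenAI `clip` package and a particular CLIP variant; the prompt template follows the text above):

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)   # the specific CLIP variant here is an assumption

class_names = ["golden retriever", "unicycle", "harvestman"]
prompts = [f"A photo of a {name}" for name in class_names]

with torch.no_grad():
    tokens = clip.tokenize(prompts).to(device)
    text_features = model.encode_text(tokens)     # g(y^text); the text encoder stays frozen during meta-training
```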

Implementation details. We adopt Visformer-Tiny [7] as the feature extractor and resize the input image to $224\times224$ by default. Other input resolutions are validated in Section 5.3.5. Images are augmented with RandomResizedCrop, RandAug [8] and RepeatAug [4]. During pre-training, we use the AdamW optimizer [29] with a learning rate of 5e-4 and a weight decay of 5e-2. We pre-train the model for 800 epochs on miniImageNet, CIFAR-FS and FC100, and for 300 epochs on tieredImageNet. During meta-training, we reduce the learning rate of the feature extractor to 1e-6 and set the learning rate of the projectors as 5e-4. The model is meta-trained for 100 epochs on all datasets. The hyper-parameter $\tau$ is set as 0.2 according to validation accuracy. We conduct experiments with a TITAN Xp server and training can be done with one GPU.

Implementation details. 我们采用Visformer-Tiny [7]作为特征提取器,并默认将输入图像缩放到 $224\times224$。其他输入分辨率在第5.3.5节进行了验证。使用RandomResizedCrop、RandAug [8]和RepeatAug [4]对图像进行增强。在预训练时,使用AdamW优化器[29],学习率为5e-4,权重衰减为5e-2。在miniImageNet、CIFAR-FS和FC100上预训练800个epoch,在tieredImageNet上预训练300个epoch。在元训练过程中,我们将特征提取器的学习率降低到1e-6,并将投影器的学习率设置为5e-4。该模型在所有数据集上进行100个epoch的元训练。超参数 $\tau$ 根据验证精度设置为0.2。我们使用一台TITAN Xp服务器进行实验,训练可以在一块GPU上完成。

During evaluation, we randomly sample 2,000 test episodes from the novel classes. For 1-shot learning, we use the cosine classifier for prediction as in Eq. 4. For 5-shot learning, we adopt logistic regression classifiers with random crop augmentation. We finally report the average accuracy with 95% confidence intervals.

在评估过程中,我们从新类中随机抽取了2000个测试episode。对于1-shot学习,我们使用余弦分类器进行预测,如公式4所示。对于5-shot学习,我们采用了随机裁剪增强的逻辑回归分类器。最后,我们报告了95%置信区间的平均准确率。
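The two evaluation classifiers can be sketched as follows (an illustrative sketch: a cosine nearest-prototype classifier for 1-shot and scikit-learn's logistic regression for 5-shot; the random-crop augmentation mentioned above is omitted):

```python
import torch
import torch.nn.functional as F
from sklearn.linear_model import LogisticRegression

def predict_1shot(query_feats, prototypes):
    # cosine classifier: assign each query to the most similar class prototype
    q = F.normalize(query_feats, dim=-1)
    p = F.normalize(prototypes, dim=-1)
    return (q @ p.t()).argmax(dim=-1)

def predict_5shot(support_feats, support_labels, query_feats):
    # fit a linear logistic regression classifier on the support features
    # (the random-crop augmentation of support images is omitted here)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(support_feats.cpu().numpy(), support_labels.cpu().numpy())
    return clf.predict(query_feats.cpu().numpy())
```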

5.2. Comparison with the state-of-the-art

To evaluate the effectiveness of our approach, we conduct extensive experiments on four datasets , and compare the results with previous state-of-the-art methods in Table 1 and Table 2.

为了评估我们方法的有效性,我们在四个数据集上进行了广泛的实验,并将结果与表1和表2中的最新方法进行了比较。

Table 1. Comparison with previous work on miniImageNet and tieredImageNet. Methods in the top rows do not use semantic information, and methods in the middle rows leverage semantic information from class names [24, 32, 52] or descriptions [53]. Accuracies are reported with 95% confidence intervals.

表1。与前人在miniImageNet和tieredImageNet上的工作进行了比较。位于顶行的方法不使用语义信息,位于中间行的方法利用来自类名[24、32、52]或描述的语义信息[53]。以95 %的置信区间报告准确性。


Table 2. Comparison with previous work on CIFAR-FS [22] and FC100 [31].

表2。与之前的工作CIFAR-FS [22]和FC100 [31]进行比较。

Compared with previous methods that leverage semantic information (KTN [32], AM3 [52], TRAML [24], DeepEMD-BERT [53]), our method improves 1-shot accuracy by 5.21% on miniImageNet and by 4.27% on tieredImageNet. DeepEMD-BERT achieves better 5-shot accuracy than ours on miniImageNet, but requires multiple forward passes and an additional inner optimization step to obtain reliable local feature similarities. Note that previous methods usually adopt CNNs as the backbone, except a recently proposed method SUN [10] that also adopts the Visformer backbone. Nevertheless, our method outperforms SUN by 2.46% on average over three datasets.

与以往利用语义信息的方法(KTN [32]、AM3 [52]、TRAML [24]、DeepEMD-BERT [53])相比,本文方法在miniImageNet上的1-shot准确率提高了5.21%,在tieredImageNet上提高了4.27%。DeepEMD-BERT在miniImageNet上取得了比我们更好的5-shot精度,但需要多次前向传播和额外的内部优化步骤才能获得可靠的局部特征相似度。注意,除了最近提出的SUN [10]也采用Visformer骨干网络外,以前的方法通常采用CNN作为骨干网络。尽管如此,我们的方法在三个数据集上平均比SUN高出2.46%。

When using different text encoders to extract semantic features, the proposed SP presents consistent improvements over the pre-training baseline. Specifically, we can see that SP with CLIP achieves better on 1-shot than SBERT and GloVe, probably because CLIP’s multi-modal pre-training results in better alignment of semantic embeddings with visual concepts. In 5-shot, the performance difference decreases as the model performance is dominated by visual features when support images are sufficient. In the following experiments, we use CLIP as the default text encoder.

当使用不同的文本编码器来提取语义特征时,所提出的SP在预训练基线上表现出一致的改进。具体来说,我们可以看到带有CLIP的SP比SBERT和GloVe在1-shot上取得了更好的效果,这可能是因为CLIP的多模态预训练导致语义嵌入与视觉概念更好的对齐。在5-shot中,当支持图像足够多时,由于模型性能由视觉特征主导,性能差异减小。在接下来的实验中,我们使用CLIP作为默认的文本编码器。

5.3. Model analysis

5.3.1 Ablation study

The ablation study results are shown in Table 3. By extending the standard RandomResizedCrop with RandAug and RepeatAug, the 1-shot accuracy of the pre-trained feature extractor is improved by 2.45% on average over four datasets. To validate the effectiveness of SP, we fine-tune the feature extractor with three different interaction mechanisms, including SI (spatial interaction), CI (channel interaction) and SI+CI. As shown in Table 3, both SI and CI are very effective, improving average 1-shot accuracy on 4 datasets by 5.89% and 5.43%, respectively. Furthermore, by combing them together, the 1-shot learning accuracy is further improved on all four datasets. These results indicate that the proposed SP is an effective approach to leveraging semantic information for few-shot learning.

消融研究结果如表 3 所示。通过在标准的RandomResizedCrop基础上加入RandAug和RepeatAug,预训练特征提取器的1-shot准确率在四个数据集上平均提高了2.45%。为了验证SP的有效性,我们用三种不同的交互机制对特征提取器进行了微调,包括SI(空间交互)、CI(通道交互)和SI+CI。如表3所示,SI和CI都非常有效,在4个数据集上的平均1-shot准确率分别提高了5.89%和5.43%。此外,将两者结合后,所有四个数据集上的1-shot学习准确率得到进一步提升。这些结果表明,所提出的SP是一种利用语义信息进行小样本学习的有效方法。

Table 3. Ablation study on four datasets under the 1-shot setting. SI means spatial interaction, and CI means channel interaction.

表 3。对四个数据集在 1 -shot设置下进行的消融研究。SI 表示空间相互作用机制,CI 表示通道相互作用机制。

5.3.2 Layer selection

Theoretically, the semantic prompt in this work can be inserted into the feature extractor at any layer. However, we find that the layer selection has a significant impact on the performance. In Figure 3, we can see that inserting prompts at higher layers improves accuracies, while inserting prompts at lower layers leads to performance drop. Considering that prompt vectors are class-specific, these results indicate that class-specific features should be extracted at higher network layers, while features at lower layers should better be shared among classes. When looking into the performance of each layer, we can see that while the optimal layer selection varies slightly for different datasets, SP at all layers of the third stage improves accuracy consistently. To simplify architecture design, we choose the layer3-2 as default in our experiments.

理论上,本工作中的语义提示可以插入到任何层的特征提取器中。然而,我们发现层的选择对性能有显著的影响。在图3中,我们可以看到在较高层插入提示可以提高准确性,而在较低层插入提示会导致性能下降。考虑到提示向量是特定于类的,这些结果表明,应该在较高的网络层提取特定于类的特征,而较低的网络层的特征应该在不同的类之间共享。当观察每一层的性能时,我们可以看到,虽然不同数据集的最佳层选择略有不同,但第三阶段所有层的SP一致地提高了精度。为了简化架构设计,我们在实验中选择layer3-2作为默认值。


Figure 3. Accuracy vs. different layers to insert prompts. We report 5-way 1-shot accuracy (%) on the validation sets of miniImageNet and CIFAR-FS along the meta-training process. The feature extractor has three stages and multiple Transformer layers in each stage.

图3。准确率 vs. 插入提示的不同层。我们报告了元训练过程中miniImageNet和CIFAR-FS验证集上的5-way 1-shot准确率(%)。该特征提取器有三个阶段,每个阶段包含多个Transformer层。

5.3.3 The backbone and classifier architectures

In Table 4, we re-implement three baseline methods with the same Visformer backbone as ours, and compare the results with different backbones under the miniImageNet 1-shot setting. It can be seen that simply replacing ResNet-12 with Visformer cannot obtain significant improvement. Instead, using semantic prompts can improve 1-shot performance over these baselines when equipped with the same Visformer backbone.

在表4中,我们使用与我们相同的Visformer主干重新实现了三种基线方法,并在miniImageNet 1-shot设置下比较了使用不同主干的结果。可见,简单地用Visformer替换ResNet12并不能获得显著的提升。相反,当配备相同的Visformer主干时,使用语义提示可以在这些基线上提高1-shot性能。


Table 4. Comparison with different backbones.

表4。与不同Backbone的比较。

In Table 5, we compare the LR and NN classifiers over all datasets. The simple NN classifier performs as well as the LR classifier for 1-shot, while the LR benefits from more training examples and outperforms the NN by 0.53% for 5-shot.

在表5中,我们在所有数据集上比较了LR和NN分类器。对于1-shot,简单NN分类器表现与LR分类器相当,而对于5-shot,LR从更多的训练样本中获益,性能比NN提高了0.53%。

Table 5. Comparison of classifiers. NN: cosine-distance nearest prototype classifier. LR: linear logistic regression classifier.

表5。分类器的比较。NN:余弦距离最近原型分类器。LR:线性逻辑回归分类器。

5.3.4 Projector structure and pooling strategy

As shown in Table 6, the projector design has little effect on performance: both linear and MLP projectors work well and the MLP has slight advantage. In contrast, the pooling strategy has much more effect on performance. When adopting the ‘Head’ strategy, both 1-shot and 5-shot learning accuracies are very poor. This indicates that the output at the position of the prompt vector is easy to overfit on semantic features and neglect rich visual features in image patches. Adopting average on all output features can address this problem and achieve better results.

如表6所示,投影函数的设计对性能影响不大:线性投影函数和MLP投影函数都能正常工作,MLP投影函数略占优势。相比之下,池化策略对性能的影响要大得多。当采用“Head”策略时,1-shot和5-shot的学习精度都很差。这表明提示向量位置处的输出容易对语义特征过度拟合,忽略图像块中丰富的视觉特征。在所有的输出特征上采用平均值可以解决这个问题,并取得更好的效果。


Table 6. Choice of the projector, and the pooling strategy for the output sequence. ‘Head’ means selecting the output at the position of the prompt vector; ‘Patches’ means averaging the output features of all patches; ‘All’ means averaging all feature vectors in the output sequence.

表6。投影函数的选择,以及输出序列的池化策略。“Head”表示在提示向量的位置选择输出;“Patches”是指将所有patch的输出特征进行平均;“all”表示对输出序列中的所有特征向量求平均。

5.3.5 Image size and stem design

In Table 7, we experiment with a smaller input size, $84\times84$, to validate the influence of image size. It can be seen that directly changing the input size to $84\times84$ leads to an evident performance drop on all datasets. We suppose that this is because the kernel size and the stride of the stem are too large to capture the detailed visual features when the input image gets small. To address this problem, we reduce the kernel size and the stride of the stem accordingly. After this change, the 1-shot learning performance under $84\times84$ improves significantly, and gets comparable results with the $224\times224$ resolution on all datasets.

在表7中,我们使用较小的输入尺寸 $84\times84$ 进行实验,以验证图像尺寸的影响。可以看到,直接将输入尺寸改为 $84\times84$ 会导致所有数据集上的性能明显下降。我们认为这是由于当输入图像较小时,stem的卷积核尺寸和步长过大,无法捕捉到细致的视觉特征。为了解决这个问题,我们相应地减小了stem的卷积核尺寸和步长。在此更改之后,$84\times84$ 下的1-shot学习性能显著提高,并且在所有数据集上获得与 $224\times224$ 分辨率相当的结果。


Table 7. The effect of input size and stem design. ‘Ks’ means the kernel size of the first convolution layer (stem), and ‘Stride’ means its stride. 5-way 1-shot accuracy is reported on four datasets with 95% confidence intervals.

表7。输入尺寸和卷积主干的设计的影响。“Ks”表示第一个卷积层(stem)的核大小,“Stride”表示步长。在四个数据集上报告了5-way 1-shot精度,置信区间为95%。

5.3.6 Visualization

In Figure 4, we visualize the attention maps by computing the dot product between the output feature and the feature vector at each location. It can be seen that the visual features of the pre-training baseline are cluttered with background information, but our method can focus on semantic-level visual features according to the given text prompt. For example, given the text prompt of harvestman, the model will attend to the features of the harvestman rather than the spider web or background clutters.

在图4中,我们通过计算输出特征与每个位置的特征向量之间的点积来可视化注意力图。可以看出,预训练baseline的视觉特征与背景信息混杂,而我们的方法可以根据给定的文本提示关注语义级别的视觉特征。例如,给定"harvestman"(盲蛛)这一文本提示,模型会关注盲蛛本身的特征,而不是蜘蛛网或杂乱的背景。


Figure 4. Visualization of attention maps when prompting with different class labels.

图 4。使用不同类别标签进行提示时的注意力图可视化。

6. Conclusion

In this paper, we propose a novel Semantic Prompt (SP) approach for FSL, which adaptively tunes the feature extraction with the semantic features derived from class names. The proposed approach is evaluated on four benchmark datasets, and achieves significant improvements against previous methods. More in-depth analysis demonstrates that SP encourages the model to extract more classspecific features and is robust to different text encoders and model designs.

在本文中,我们提出了一种新的面向FSL的语义提示(Semantic Prompt,SP )方法,该方法利用类名的语义特征自适应地调整特征提取。所提出的方法在四个基准小样本图像分类数据集上进行了评估,并与以前的方法相比取得了显著的改进。更深入的分析表明,SP鼓励模型提取更多的类别特异性特征,并且对不同的文本编码器和模型设计具有鲁棒性。
