Few-Shot Object Detection with Sparse Context Transformers



Few-shot detection is a major task in pattern recognition which seeks to localize objects using models trained with few labeled data. One of the mainstream few-shot methods is transfer learning, which consists in pretraining a detection model in a source domain prior to its fine-tuning in a target domain. However, it is challenging for fine-tuned models to effectively identify new classes in the target domain, particularly when the underlying labeled training data are scarce. In this paper, we devise a novel sparse context transformer (SCT) that effectively leverages object knowledge in the source domain, and automatically learns a sparse context from only a few training images in the target domain. As a result, it combines different relevant clues in order to enhance the discrimination power of the learned detectors and reduce class confusion. We evaluate the proposed method on two challenging few-shot object detection benchmarks, and empirical results show that the proposed method obtains competitive performance compared to the related state of the art.



Although deep learning (DL) models have achieved remarkable performance, these outstanding models are usually label-hungry and their training is time and memory demanding. In some scenarios, particularly object detection, labeled data are scarce, and this makes DL-based detection a major challenge [1], [3]. Among existing solutions that mitigate scarcity of labeled data, transfer learning is particularly effective and consists in pretraining detection models in the source domains — using abundant labeled data — prior to their fine-tuning in the target domains. However, as labeled data are scarce in the target domains, few-shot object detection — based on transfer learning [5], [6], [8] — is not yet sufficiently effective in identifying new object classes.


The aforementioned issue is accentuated in few-shot object detection as this task involves both localization and classification. Localization focuses on spatial information, which is reasonably well captured by models pretrained in the source domains. Therefore, bounding box regressors (BBOX) trained in the source domain are already reliable for initialization and fine-tuning in the target domain. Hence, detectors fine-tuned with few training samples may effectively locate new object classes. In contrast, classification often requires contextual knowledge of specific categories. In other words, source domain knowledge is insufficient to learn new category distributions in the target domain. Therefore, the underlying models should be completely retrained for new categories. However, scarcity and limited diversity of data in the target domain lower the accuracy of the newly learned category classifiers, and thereby lead to class confusion.


In order to address these issues, we propose in this paper a novel sparse context transformer (SCT) that leverages source domain knowledge together with few training images in the target domain. This transformer learns sparse affinity matrices between BBOXs and classification outputs by exploring the most relevant contexts for new object categories in the target domain. The proposed transformer consists of two simple yet effective submodules: (i) sparse relationship discovery, and (ii) aggregation. In submodule (i), contextual fields are initially designed based on default prior boxes (also called anchor boxes) [9], [10] and multi-scale feature maps extracted from a visual encoder. Then, relationships between each prior box and contextual fields are modeled through a novel sparse attention. In submodule (ii), aggregation further leverages the learned relationships and integrates contextual fields into the relevant prior boxes. As shown through experiments, our proposed transformer enhances prior box representations, and mitigates confusion in few-shot object detection and classification.


Considering all the aforementioned issues, our main contributions include:

• A novel sparse context transformer that effectively explores useful contextual fields from a small number of labeled images. This transformer is embedded into an SSD (plug-and-play style) detector suitable for few-shot object detection.

• A novel attention layer that assists object detection in learning task-relevant knowledge from images by enhancing the underlying task-related feature representations.

• A comprehensive evaluation of our proposed method on the challenging configurations for few-shot detection that shows high performance.




Recent years have witnessed significant progress in few-shot object detection using transfer learning. Among existing methods, Chen et al. [11] introduced a low-shot transfer detector that focuses on foreground objects in the target domain during fine-tuning in order to learn more knowledge on the targeted categories. Khandelwal et al. [12] conjectured that simple fine-tuning may lead to a decrease in the transferability of the models. Hence, they proposed a unified semi-supervised framework that combines weighted multi-modal similarity measures between base and novel classes. With this method, they achieved effective knowledge transfer and adaptation. Unlike these methods, Wang et al. [14] proposed a context-transformer that tackles object confusion in few-shot detection. This transformer relies on a set of contextual fields from different spatial scales and aspect ratios of prior boxes, in order to explore their relationships through dot products. Based on these relationships, the contextual fields are integrated into each prior box, and this improves their representation.


Our work extends context transformers. In their original version, the contextual fields are relatively monotonous, and their relationships with prior boxes cannot effectively suppress task-independent contextual fields, which in turn affects the model's ability to recognize novel classes. To address this, we consider information from different sources, and we model sparse relationships between contextual fields and each prior box to help the model select the most effective fields. This also mitigates confusion in few-shot object detection.



In this section, we introduce our novel sparse context transformer. As shown in Fig. 1, our framework relies on an SSD-style detector [10] used as a flexible plug-and-play backbone that delivers rich multi-scale contextual information. The SSD detector consists of K (spatial-scale) heads that include bounding box regressors (BBOX) and object+background classifiers (OBJ+BG). To generalize few-shot learning in the target domain, we first pretrain the SSD detector with a large-scale dataset in the source domain. Then, we combine the proposed transformer module with the SSD detector for fine-tuning in the target domain (see again Fig. 1). As shown subsequently, our proposed transformer includes two submodules: one for sparse relationship discovery, and another one for aggregation. These submodules are respectively used to model context/classifier relationships and to fuse context.



3.1 Sparse Relationship Discovery

Given an image $\mathcal{I}$ fed into an SSD detector, we extract for each prior box in $\mathcal{I}$ a vector of scores $\mathbf{P}_{k,m,h,w}\in\mathbb{R}^{C_s}$, where $C_s$ is the number of (source) object categories, $k\in\{1,\dots,K\}$ is a spatial scale, $m$ is an aspect ratio, and $(h,w)$ are the prior box coordinates at the $k$-th scale. In what follows, we reshape the tensor $(\mathbf{P}_{k,m,h,w})_{k,m,h,w}$ as a matrix $\mathbf{P}\in\mathbb{R}^{D_p\times C_s}$, where $D_p$ is the total number of prior boxes in $\mathcal{I}$ across all possible scales, aspect ratios and locations. Scores in $\mathbf{P}$ provide a rich semantic representation of object categories [15]; nonetheless, this representation lacks contextual relationships between prior boxes. Since SSD usually involves on the order of ten thousand prior boxes per image, modeling and training all the relationships between these boxes is intractable and overparameterized, and is thereby prone to overfitting, particularly in the few-shot scenario. To prevent these issues, spatial pooling is first applied to obtain a more compact matrix $\mathbf{Q}\in\mathbb{R}^{D_q\times C_s}$ instead of $\mathbf{P}$, where $D_q$ ($\ll D_p$) is the reduced number of prior boxes after spatial pooling.
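The pooling step can be illustrated with a minimal NumPy sketch. It is not the reference implementation: the grid sizes, the number of aspect ratios, the value of $C_s$ and the per-scale kernel assignment are toy assumptions, and the helper names `max_pool_scores` and `flatten_boxes` are hypothetical. Only the use of max-pooling with kernel sizes and strides in {2, 3} follows the implementation details reported later.

```python
import numpy as np

def max_pool_scores(scores, kernel):
    """Max-pool a (M, H, W, C) score tensor over H and W, with stride == kernel."""
    m, h, w, c = scores.shape
    ho, wo = h // kernel, w // kernel              # ragged borders are dropped
    cropped = scores[:, :ho * kernel, :wo * kernel, :]
    blocks = cropped.reshape(m, ho, kernel, wo, kernel, c)
    return blocks.max(axis=(2, 4))                 # (M, ho, wo, C)

def flatten_boxes(per_scale):
    """Stack per-scale (M, H, W, C) tensors into one (num_boxes, C) matrix."""
    return np.concatenate([s.reshape(-1, s.shape[-1]) for s in per_scale], axis=0)

rng = np.random.default_rng(0)
C_s = 20                    # toy number of source categories
grid_sizes = [12, 6, 3]     # toy spatial grids, one per scale
kernels = [3, 2, 3]         # assumed kernel (= stride) per scale, taken in {2, 3}
scores = [rng.standard_normal((4, g, g, C_s)) for g in grid_sizes]  # 4 aspect ratios

P = flatten_boxes(scores)                                            # (D_p, C_s)
Q = flatten_boxes([max_pool_scores(s, k) for s, k in zip(scores, kernels)])  # (D_q, C_s)
print(P.shape, Q.shape)
```

In this toy setting the number of rows drops from $D_p = 756$ to $D_q = 104$, which is the kind of reduction that keeps the relationship matrix between prior boxes and contextual fields small.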

Considering that prior boxes capture only object parts (and not their overall extents), multi-scale feature maps extracted by the SSD encoder (denoted as $\{\mathbf{F}_k\}_k$) are also aggregated as $\mathbf{M}=\mathsf{Concat}(\{\mathbf{F}_k\}_k)\in\mathbb{R}^{D_q\times D_f}$, where $\mathsf{Concat}$ is the concatenation operator and $D_f$ is the resulting dimension after the application of this operator. These aggregated features $\mathbf{M}$ provide complementary visual cues at different scales, and thus a more comprehensive contextual information [16]. In the rest of this paper, the pairs of pooled prior boxes $\mathbf{Q}$ together with the underlying aggregated features $\mathbf{M}$ are referred to as contextual fields.

Attention Focus. In order to learn task-related context from few training data, we design an attention focus layer that enhances the representation of contextual fields and attenuates category confusion. We first define the attention weight matrix $\mathbf{A_M}$ as




where $\psi_\alpha(\cdot)$ and $\psi_\beta(\cdot)$ are trainable fully-connected (FC) layers that increase the expressivity of the attention matrix $\mathbf{A_M}\in\mathbb{R}^{D_f\times C_s}$, and allow obtaining an enhanced representation $\mathbf{M}^*\in\mathbb{R}^{D_q\times C_s}$ as

$\mathbf{M}^* = \mathbf{M}\,\mathbf{A_M},$ (2)

which also ensures dimension consistency of the learned representation in Eq. (3). By combining the pooled prior boxes in $\mathbf{Q}$ and the enhanced multi-scale features $\mathbf{M}^*$, we obtain our contextual field representations that capture both intrinsic (feature) and extrinsic (object-class) information, resulting in

$\mathbf{C} = \lambda\,\mathbf{M}^* + \mathbf{Q},$ (3)



where $\lambda \geq 0$ controls the impact of attention in $\mathbf{M}^*$. Using Eq. (3), we design our sparse attention mechanism in order to explore the affinity relationships between each prior box and contextual fields, and to remove spurious ones (i.e., those farther away from the prior boxes) according to the sparse relationship. More precisely, we evaluate the relationship matrix between the contextual fields in $\mathbf{C}$ and the prior boxes in $\mathbf{P}$, and we reset weak relationships to zero using soft-thresholding as

$\mathbf{R}=\mathrm{sign}(\mathbf{A}),$ (4)


Here softmax is applied row-wise, $\psi_\gamma(\cdot)$ and $\psi_\rho(\cdot)$ are again trainable FC layers, and $\mathrm{sign}(\cdot)$ is an operator that sparsifies a given relationship matrix using a soft threshold. Each row of $\mathbf{R}\in\mathbb{R}^{D_p\times D_q}$ measures the importance of all contextual fields w.r.t. its underlying prior box. Hence, sparse relationship discovery allows a prior box to identify its important contextual fields and discard those that are not sufficiently important according to various aspect ratios, locations and spatial scales.
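As an illustration, attention focus and sparse relationship discovery can be sketched in a few lines of NumPy. This is a sketch under stated assumptions, not the reference implementation. Each FC layer is replaced by a single random linear map. The attention matrix is assumed to be $\mathrm{softmax}(\psi_\alpha(\mathbf{M})^\top\psi_\beta(\mathbf{Q}))$, chosen only because it matches the stated dimension $D_f\times C_s$. The relationship matrix is assumed to be a scaled dot product between $\psi_\gamma(\mathbf{P})$ and $\psi_\rho(\mathbf{C})$ followed by a row-wise softmax. The threshold `tau` and all sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def soft_threshold(a, tau):
    """Shrink entries toward zero by tau; entries with |a| <= tau become 0."""
    return np.sign(a) * np.maximum(np.abs(a) - tau, 0.0)

def contextual_fields(M, Q, W_alpha, W_beta, lam):
    """Eqs. (2)-(3): C = lam * (M @ A_M) + Q, with an ASSUMED form for A_M."""
    A_M = softmax((M @ W_alpha).T @ (Q @ W_beta), axis=1)   # (D_f, C_s)
    return lam * (M @ A_M) + Q                               # (D_q, C_s)

def sparse_relationships(P, C, W_gamma, W_rho, tau):
    """ASSUMED affinities: row-wise softmax of scaled dot products, then sparsified."""
    logits = (P @ W_gamma) @ (C @ W_rho).T / np.sqrt(P.shape[1])  # (D_p, D_q)
    return soft_threshold(softmax(logits, axis=1), tau)

rng = np.random.default_rng(0)
D_p, D_q, D_f, C_s = 200, 16, 32, 20   # toy sizes, not the paper's
lam = 0.6                              # value reported in the implementation details
tau = 1.0 / D_q                        # hypothetical threshold: uniform-attention level

P = rng.standard_normal((D_p, C_s))
Q = rng.standard_normal((D_q, C_s))
M = rng.standard_normal((D_q, D_f))
W_alpha = 0.1 * rng.standard_normal((D_f, D_f))
W_beta, W_gamma, W_rho = (0.1 * rng.standard_normal((C_s, C_s)) for _ in range(3))

C = contextual_fields(M, Q, W_alpha, W_beta, lam)
R = sparse_relationships(P, C, W_gamma, W_rho, tau)
print(C.shape, R.shape, float((R == 0).mean()))
```

With the threshold set at the uniform-attention level, a contextual field survives for a given prior box only if it receives more than an equal share of that box's attention. Other threshold choices would change the sparsity level but not the structure of the computation.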

3.2 Aggregation

We consider the sparse relationship matrix $\mathbf{R}$, between prior boxes and contextual fields, as a relational attention in order to derive the representation of each prior box. We also consider a softmax operator on each row $i$ of $\mathbf{R}$ as a gating mechanism that measures how important each contextual field is w.r.t. the $i$-th prior box. By considering the cross correlations between rows of $\mathbf{R}$ and columns of $\mathbf{C}$, we derive our sparse attention-based representation of prior boxes as

$\mathbf{W}=\mathrm{softmax}(\mathbf{R})\psi_\eta(\mathbf{C}),$ (6)



where $\mathbf{W}\in\mathbb{R}^{D_p\times C_s}$ and $\psi_\eta$ again denotes trainable FC layers. We then combine $\mathbf{W}$ with the original matrix of prior boxes $\mathbf{P}$ in order to derive our final context-aware representation $\hat{\mathbf{P}}\in\mathbb{R}^{D_p\times C_s}$


Here $\psi_\xi$ corresponds to other (last) trainable FC layers. Since $\hat{\mathbf{P}}$ is context-aware, it enhances the discrimination power of prior boxes by attenuating confusion between object classes. By plugging the final representation $\hat{\mathbf{P}}$ into a softmax classifier, we obtain our scoring function, on the $C_t$ target classes, as




Note that the representations in $\hat{\mathbf{P}}$ and the underlying parameters $\Theta\in\mathbb{R}^{C_s\times C_t}$ are shared across different aspect ratios and spatial scales, so there is no requirement to design separate classifiers at different scales. This not only reduces computational complexity but also helps prevent overfitting.
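The aggregation and scoring steps can be sketched in the same spirit. Eq. (6) is implemented as written. The combination of $\mathbf{W}$ with $\mathbf{P}$ is assumed here to be residual, i.e. $\hat{\mathbf{P}}=\mathbf{P}+\psi_\xi(\mathbf{W})$, and the classifier is assumed to be a row-wise softmax of $\hat{\mathbf{P}}\,\Theta$. Both are plausible readings of the text rather than confirmed definitions. The FC layers are again single random linear maps, and all sizes and inputs are toy values.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def aggregate(P, C, R, W_eta, W_xi):
    """Context-aware boxes: Eq. (6), then an ASSUMED residual combination with P."""
    W = softmax(R, axis=1) @ (C @ W_eta)    # (D_p, C_s), Eq. (6)
    return P + W @ W_xi                     # assumption: P_hat = P + psi_xi(W)

def classify(P_hat, Theta):
    """ASSUMED scoring: softmax over the C_t target classes, shared across scales."""
    return softmax(P_hat @ Theta, axis=1)   # (D_p, C_t)

rng = np.random.default_rng(0)
D_p, D_q, C_s, C_t = 200, 16, 20, 5         # toy sizes, not the paper's
P = rng.standard_normal((D_p, C_s))
C = rng.standard_normal((D_q, C_s))
R = np.maximum(rng.random((D_p, D_q)) - 0.5, 0.0)   # toy sparse, non-negative relationships
W_eta, W_xi = (0.1 * rng.standard_normal((C_s, C_s)) for _ in range(2))
Theta = 0.1 * rng.standard_normal((C_s, C_t))

P_hat = aggregate(P, C, R, W_eta, W_xi)
Y = classify(P_hat, Theta)
print(P_hat.shape, Y.shape)
```

Since `Theta` has shape $C_s\times C_t$ regardless of $D_p$, the same classifier applies to prior boxes from every scale and aspect ratio, which is the parameter sharing described above.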

Implementation Details. 

We choose a recent SSD detector [26] as a basic architecture built upon 6 heads corresponding to different spatial rescaling factors (taken in {1, 3, 5, 10, 19, 38}). The contextual fields we designed consist of two parts. In the first one, prior boxes corresponding to multiple scales and aspect ratios are max-pooled with different kernel sizes and strides taken in {2, 3}. In the second one, contextual fields composed of multi-scale features are fused across four spatial scales. In these experiments, we set the hyperparameter $\lambda$ in Eq. (3) to 0.6, and the embedding functions in the sparse context transformer correspond to residual FC layers whose input and output have the same number of channels.

We implement our experiments using PyTorch [27] on two Nvidia 3090 GPUs. We pretrain the SSD detectors on the source domain following exactly the original SSD settings in [26], and we fine-tune these SSDs on the target domain using stochastic gradient descent with the following settings: a batch size of 64, a momentum of 0.9, an initial learning rate equal to $4\times10^{-3}$ (divided by 10 after 3k and 3.5k iterations), a weight decay of $5\times10^{-4}$, and a total number of training iterations equal to 4k.
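For reference, the fine-tuning schedule above can be written as a small step function in plain Python. The values come from the settings listed above. The names `FINETUNE_CONFIG` and `learning_rate` are hypothetical, and the convention that each drop takes effect exactly at iterations 3000 and 3500 is an assumption.

```python
# Fine-tuning settings as reported in the text (key names are illustrative).
FINETUNE_CONFIG = {
    "batch_size": 64,
    "momentum": 0.9,
    "base_lr": 4e-3,
    "weight_decay": 5e-4,
    "milestones": (3000, 3500),   # iterations after which the rate is divided by 10
    "total_iterations": 4000,
}

def learning_rate(iteration, base_lr=4e-3, milestones=(3000, 3500), factor=0.1):
    """Step schedule: multiply base_lr by `factor` once per milestone reached."""
    drops = sum(iteration >= m for m in milestones)
    return base_lr * factor ** drops

for it in (0, 2999, 3000, 3500, 3999):
    print(it, learning_rate(it))
```

Under this convention the rate stays at $4\times10^{-3}$ for the first 3k iterations, drops to $4\times10^{-4}$ for the next 500, and ends at $4\times10^{-5}$ for the last 500.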





