Few-Shot Object Detection with Sparse Context Transformers

No code released, CVPR, SSD network

Abstract

Few-shot detection is a major task in pattern recognition which seeks to localize objects using models trained with few labeled data. One of the mainstream few-shot methods is transfer learning, which consists in pretraining a detection model in a source domain prior to its fine-tuning in a target domain. However, it is challenging for fine-tuned models to effectively identify new classes in the target domain, particularly when the underlying labeled training data are scarce. In this paper, we devise a novel sparse context transformer (SCT) that effectively leverages object knowledge in the source domain, and automatically learns a sparse context from only few training images in the target domain. As a result, it combines different relevant clues in order to enhance the discrimination power of the learned detectors and reduce class confusion. We evaluate the proposed method on two challenging few-shot object detection benchmarks, and empirical results show that the proposed method obtains competitive performance compared to the related state-of-the-art.


1. Introduction

Although deep learning (DL) models have achieved remarkable performance, they are usually label-hungry and their training is demanding in both time and memory. In some scenarios, particularly object detection, labeled data are scarce, and this makes DL-based detection a major challenge [1], [3]. Among existing solutions that mitigate the scarcity of labeled data, transfer learning is particularly effective; it consists in pretraining detection models in the source domains (using abundant labeled data) prior to their fine-tuning in the target domains. However, as labeled data are scarce in the target domains, few-shot object detection based on transfer learning [5], [6], [8] is not yet sufficiently effective in identifying new object classes.


The aforementioned issue is accentuated in few-shot object detection, as this task involves both localization and classification. Localization focuses on spatial information, which is decently captured by models pretrained in the source domains. Therefore, bounding box regressors (BBOX) trained in the source domain are already reliable for initialization and fine-tuning in the target domain, and detectors fine-tuned with few training samples may effectively locate new object classes. In contrast, classification often requires contextual knowledge of specific categories. In other words, source domain knowledge is insufficient to learn new category distributions in the target domain. Therefore, the underlying models should be completely retrained for new categories. However, the scarcity and limited diversity of data in the target domain lower the accuracy of the newly learned category classifiers, and thereby lead to class confusion.


In order to address these issues, we propose in this paper a novel sparse context transformer (SCT) that leverages source domain knowledge together with few training images in the target domain. This transformer learns sparse affinity matrices between BBOXs and classification outputs by exploring the most relevant contexts for new object categories in the target domain. The proposed transformer consists of two simple yet effective submodules: (i) sparse relationship discovery, and (ii) aggregation. In submodule (i), contextual fields are initially designed based on default prior boxes (also called anchor boxes) [9], [10] and multi-scale feature maps extracted from a visual encoder. Then, relationships between each prior box and the contextual fields are modeled through a novel sparse attention. In submodule (ii), aggregation further leverages the learned relationships and integrates contextual fields into the relevant prior boxes. As shown through experiments, our proposed transformer enhances prior box representations and mitigates confusion in few-shot object detection and classification.


Considering all the aforementioned issues, our main contributions include:

• A novel sparse context transformer that effectively explores useful contextual fields from a small number of labeled images. This transformer is embedded into an SSD (plug-and-play style) detector suitable for few-shot object detection.

• A novel attention layer that assists object detection in learning task-relevant knowledge from images by enhancing the underlying task-related feature representations.

• A comprehensive evaluation of our proposed method on challenging few-shot detection configurations, showing high performance.


2. Related Work

Recent years have witnessed significant progress in few-shot object detection using transfer learning. Among existing methods, Chen et al. [11] introduced a low-shot transfer detector that focuses on foreground objects in the target domain during fine-tuning, in order to learn more about the targeted categories. Khandelwal et al. [12] conjectured that simple fine-tuning may decrease the transferability of the models. Hence, they proposed a unified semi-supervised framework that combines weighted multi-modal similarity measures between base and novel classes, and thereby achieved effective knowledge transfer and adaptation. Unlike these methods, Wang et al. [14] proposed a context transformer that tackles object confusion in few-shot detection. This transformer relies on a set of contextual fields built from different spatial scales and aspect ratios of prior boxes, and explores their relationships through dot products. Based on these relationships, the contextual fields are integrated into each prior box, which improves its representation.


Our work extends context transformers by addressing the relatively monotonous contextual fields constructed in their original version, as well as their relationships with prior boxes, which cannot effectively suppress task-independent contextual fields and thereby limit the model's ability to recognize novel classes. In this regard, we consider information from different sources, and we model sparse relationships between the contextual fields and each prior box to help the model select the most effective fields. This also mitigates confusion in few-shot object detection.


3. Proposed Method

In this section, we introduce our novel sparse context transformer. As shown in Fig. 1, our framework relies on an SSD-style detector [10] used as a flexible plug-and-play backbone that delivers rich multi-scale contextual information. The SSD detector consists of K (spatial-scale) heads, including bounding box regressors (BBOX) and object+background classifiers (OBJ+BG). To generalize few-shot learning to the target domain, we first pretrain the SSD detector on a large-scale dataset in the source domain. Then, we combine the proposed transformer module with the SSD detector for fine-tuning in the target domain (see again Fig. 1). As shown subsequently, our proposed transformer includes two submodules: one for sparse relationship discovery, and another for aggregation. These submodules are respectively used to model context/classifier relationships and to fuse context.


Fig. 1. The sparse context transformer (SCT) for few-shot detection. It consists of two parts, sparse relationship discovery and context aggregation, which effectively exploit the contextual fields of the few-shot task, enhance the context awareness of each prior box, and address object confusion in few-shot detection. The attention focus module effectively helps learn task-related contextual fields.

3.1 Sparse Relationship Discovery

Given an image $\mathcal{I}$ fed into an SSD detector, we extract for each prior box in $\mathcal{I}$ a vector of scores $\mathbf{P}_{k,m,h,w}\in\mathbb{R}^{C_s}$; being $C_s$ the number of (source) object categories, $k\in\{1,\dots,K\}$ a spatial scale, $m$ an aspect ratio, and $(h,w)$ the prior box coordinates at the $k$-th scale. In what follows, we reshape the tensor $(\mathbf{P}_{k,m,h,w})_{k,m,h,w}$ into a matrix $\mathbf{P}\in\mathbb{R}^{D_p\times C_s}$; being $D_p$ the total number of prior boxes in $\mathcal{I}$ across all possible scales, aspect ratios and locations. Scores in $\mathbf{P}$ provide us with a rich semantic representation of object categories [15]; nonetheless, this representation is deprived of contextual relationships between prior boxes. Since SSD usually involves about ten thousand prior boxes per image, modeling and training all the relationships between these boxes is clearly intractable, overparameterized and thereby subject to overfitting, particularly in the few-shot scenario. In order to prevent these issues, spatial pooling is first applied, so one may obtain a more compact matrix $\mathbf{Q}\in\mathbb{R}^{D_q\times C_s}$ instead of $\mathbf{P}$, being $D_q\ (\ll D_p)$ the reduced number of prior boxes after spatial pooling.


Considering that prior boxes capture only object parts (and not their overall extents), multi-scale feature maps extracted by the SSD encoder (denoted as $\{\mathbf{F}_k\}_k$) are also aggregated as $\mathbf{M}=\mathrm{Concat}(\{\mathbf{F}_k\}_k)\in\mathbb{R}^{D_q\times D_f}$; being $\mathrm{Concat}$ the concatenation operator and $D_f$ the resulting dimension after the application of this operator. These aggregated features $\mathbf{M}$ provide complementary visual cues at different scales, together with more comprehensive contextual information [16]. In the rest of this paper, the pooled prior boxes $\mathbf{Q}$ together with the underlying aggregated features $\mathbf{M}$ are referred to as contextual fields.


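To make this construction concrete, here is a minimal PyTorch sketch of how $\mathbf{Q}$ and $\mathbf{M}$ could be built. It assumes adaptive pooling onto a fixed 4×4 grid per scale for brevity (the paper instead max-pools with kernel sizes and strides in {2, 3}) and a per-scale linear projection so all rows of $\mathbf{M}$ share the same width $D_f$; the class name, grid size and projection are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextualFields(nn.Module):
    """Builds Q (pooled prior-box scores) and M (aggregated features)."""

    def __init__(self, channels_per_scale, d_f=256, grid=4):
        super().__init__()
        self.grid = grid
        # hypothetical per-scale projections so all rows of M share width Df
        self.proj = nn.ModuleList(nn.Linear(c, d_f) for c in channels_per_scale)

    def forward(self, score_maps, feature_maps):
        # score_maps[k]: (Cs, Hk, Wk) classification scores at scale k
        # feature_maps[k]: (Ck, Hk, Wk) encoder features at the same scale
        q_rows, m_rows = [], []
        for k, (s, f) in enumerate(zip(score_maps, feature_maps)):
            q = F.adaptive_max_pool2d(s, self.grid).flatten(1).t()  # (grid^2, Cs)
            m = F.adaptive_avg_pool2d(f, self.grid).flatten(1).t()  # (grid^2, Ck)
            q_rows.append(q)
            m_rows.append(self.proj[k](m))                          # (grid^2, Df)
        # rows of Q and M stay aligned scale by scale
        return torch.cat(q_rows), torch.cat(m_rows)  # (Dq, Cs), (Dq, Df)

# Toy usage with two scales and Cs = 21 source classes: Dq = 2 * 16 = 32.
cf = ContextualFields(channels_per_scale=[512, 1024])
Q, M = cf([torch.randn(21, 38, 38), torch.randn(21, 19, 19)],
          [torch.randn(512, 38, 38), torch.randn(1024, 19, 19)])
print(Q.shape, M.shape)  # torch.Size([32, 21]) torch.Size([32, 256])
```
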
Attention Focus. In order to learn task-related context from few training data, we design an attention focus layer that enhances the representation of contextual fields and attenuates category confusion. We first define the attention weight matrix $\mathbf{A}_{\mathbf{M}}$ as

$\mathbf{A_M}=\psi_\alpha(\mathbf{M})^\top\psi_\beta(\mathbf{M}),$ (1)


being $\psi_\alpha(.)$ and $\psi_\beta(.)$ trained fully-connected (FC) layers that increase the expressivity of the attention matrix $\mathbf{A_M}\in\mathbb{R}^{D_f\times C_s}$, and allow obtaining an enhanced representation $\mathbf{M}^*\in\mathbb{R}^{D_q\times C_s}$ as

$\mathbf{M}^*=\mathbf{M}\,\mathbf{A}_{\mathbf{M}},$ (2)


which also ensures dimension consistency of the learned representation in Eq. (3). By combining the pooled prior boxes in $\mathbf{Q}$ and the underlying multi-scale feature maps $\mathbf{M}^*$, we obtain our contextual field representations, which capture both intrinsic (feature) and extrinsic (object-class) information, resulting in

$\mathbf{C}=\lambda\,\mathbf{M}^{*}+\mathbf{Q},$ (3)

where $\lambda \geq 0$ controls the impact of attention in $\mathbf{M}^*$.


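As a minimal sketch of Eqs. (1)-(3), assuming $\psi_\alpha$ and $\psi_\beta$ are single FC layers (the paper does not state their depth), the attention focus layer can be written as follows; $\lambda = 0.6$ comes from the implementation details below.

```python
import torch
import torch.nn as nn

class AttentionFocus(nn.Module):
    """Attention focus layer of Eqs. (1)-(3); layer sizes are illustrative."""

    def __init__(self, d_f, num_src_classes, lam=0.6):
        super().__init__()
        self.psi_alpha = nn.Linear(d_f, d_f)              # psi_alpha, Eq. (1)
        self.psi_beta = nn.Linear(d_f, num_src_classes)   # psi_beta, Eq. (1)
        self.lam = lam                                    # lambda, Eq. (3)

    def forward(self, M, Q):
        # M: (Dq, Df) aggregated features, Q: (Dq, Cs) pooled scores
        A_M = self.psi_alpha(M).t() @ self.psi_beta(M)    # (Df, Cs), Eq. (1)
        M_star = M @ A_M                                  # (Dq, Cs), Eq. (2)
        return self.lam * M_star + Q                      # C: (Dq, Cs), Eq. (3)

C = AttentionFocus(d_f=256, num_src_classes=21)(torch.randn(32, 256),
                                                torch.randn(32, 21))
print(C.shape)  # torch.Size([32, 21])
```
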
Using Eq. (3), we design our sparse attention mechanism in order to explore the affinity relationships between each prior box and the contextual fields, and to remove spurious ones (i.e., those farther away from the prior boxes) according to the sparse relationship. More precisely, we evaluate the relationship matrix between the contextual fields in $\mathbf{C}$ and the prior boxes in $\mathbf{P}$, and we reset weak relationships to zero using soft-thresholding as

$\mathbf{R}=\mathrm{sign}(\mathbf{A}),$ (4)

with

$\mathbf{A}=\mathrm{softmax}\bigg(\frac{\psi_\gamma(\mathbf{P})\,\psi_\rho(\mathbf{C})^\top}{\sqrt{C_s}}\bigg).$ (5)


Here softmax is applied row-wise, $\psi_\gamma(.)$ and $\psi_\rho(.)$ are again trainable FC layers, and $\mathrm{sign}(.)$ is an operator that sparsifies a given relationship matrix using a soft threshold. Each row of $\mathbf{R}\in\mathbb{R}^{D_p\times D_q}$ measures the importance of all contextual fields w.r.t. its underlying prior box. Hence, sparse relationship discovery allows a prior box to identify its important contextual fields and to discard those that are not sufficiently important, across various aspect ratios, locations and spatial scales.

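The following sketch implements Eqs. (4)-(5). Since the paper gives no closed form for its soft-thresholding $\mathrm{sign}(.)$ operator, we stand in PyTorch's soft-shrinkage with a small threshold $\tau$, and we take $\psi_\gamma$ and $\psi_\rho$ as single FC layers; these are assumptions, not the authors' exact choices.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseRelationship(nn.Module):
    """Sparse relationship discovery of Eqs. (4)-(5)."""

    def __init__(self, num_src_classes, tau=0.01):
        super().__init__()
        self.psi_gamma = nn.Linear(num_src_classes, num_src_classes)
        self.psi_rho = nn.Linear(num_src_classes, num_src_classes)
        self.tau = tau                                   # assumed threshold
        self.scale = math.sqrt(num_src_classes)          # sqrt(Cs), Eq. (5)

    def forward(self, P, C):
        # P: (Dp, Cs) prior-box scores, C: (Dq, Cs) contextual fields
        logits = self.psi_gamma(P) @ self.psi_rho(C).t() / self.scale
        A = F.softmax(logits, dim=-1)                    # (Dp, Dq), Eq. (5)
        return F.softshrink(A, self.tau)                 # sparse R, Eq. (4)

rel = SparseRelationship(num_src_classes=21)
R = rel(torch.randn(8732, 21), torch.randn(32, 21))  # 8732 SSD prior boxes
print((R == 0).float().mean())  # fraction of suppressed contextual fields
```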

3.2 Aggregation

We consider the sparse relationship matrix $\mathbf{R}$ (between prior boxes and contextual fields) as a relational attention in order to derive the representation of each prior box. We also consider a softmax operator on each row $i$ of $\mathbf{R}$ as a gating mechanism that measures how important each contextual field is w.r.t. the $i$-th prior box. By considering the cross-correlations between rows of $\mathbf{R}$ and columns of $\mathbf{C}$, we derive our sparse attention-based representation of prior boxes as

$\mathbf{W}=\mathrm{softmax}(\mathbf{R})\psi_\eta(\mathbf{C}),$ (6)


being $\mathbf{W}\in\mathbb{R}^{D_p\times C_s}$, and $\psi_\eta(.)$ again a trainable FC layer. Now, we combine $\mathbf{W}$ with the original matrix of prior boxes $\mathbf{P}$ in order to derive our final context-aware representation $\hat{\mathbf{P}}\in\mathbb{R}^{D_p\times C_s}$ as

$\hat{\mathbf{P}}=\mathbf{P}+\psi_\xi(\mathbf{W}).$ (7)


Here $\psi_\xi(.)$ corresponds to the last trainable FC layer. Since $\hat{\mathbf{P}}$ is context-aware, it enhances the discrimination power of prior boxes by attenuating confusion between object classes. By plugging the final representation $\hat{\mathbf{P}}$ into a softmax classifier, we obtain our scoring function, on the $C_t$ target classes, as

$\hat{\mathbf{Y}}=\mathrm{softmax}(\hat{\mathbf{P}}\,\Theta).$ (8)


Note that the representations in $\hat{\mathbf{P}}$ and the underlying parameters $\Theta\in\mathbb{R}^{C_s\times C_t}$ are shared across different aspect ratios and spatial scales, so there is no need to design separate classifiers at different scales. This not only reduces computational complexity but also prevents overfitting.

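A minimal sketch of the aggregation submodule, Eqs. (6)-(8), under the same assumptions as above ($\psi_\eta$ and $\psi_\xi$ as single FC layers); the bias-free linear layer matches the shape of $\Theta\in\mathbb{R}^{C_s\times C_t}$ in Eq. (8).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Aggregation(nn.Module):
    """Context aggregation and target-domain scoring, Eqs. (6)-(8)."""

    def __init__(self, num_src_classes, num_tgt_classes):
        super().__init__()
        self.psi_eta = nn.Linear(num_src_classes, num_src_classes)
        self.psi_xi = nn.Linear(num_src_classes, num_src_classes)
        # Theta of Eq. (8), shared across scales and aspect ratios
        self.theta = nn.Linear(num_src_classes, num_tgt_classes, bias=False)

    def forward(self, P, R, C):
        # P: (Dp, Cs), R: (Dp, Dq) sparse relationships, C: (Dq, Cs)
        W = F.softmax(R, dim=-1) @ self.psi_eta(C)     # (Dp, Cs), Eq. (6)
        P_hat = P + self.psi_xi(W)                     # (Dp, Cs), Eq. (7)
        return F.softmax(self.theta(P_hat), dim=-1)    # (Dp, Ct), Eq. (8)

agg = Aggregation(num_src_classes=21, num_tgt_classes=6)
Y = agg(torch.randn(8732, 21), torch.randn(8732, 32), torch.randn(32, 21))
print(Y.shape)  # torch.Size([8732, 6])
```

Chaining these sketches (ContextualFields, AttentionFocus, SparseRelationship, Aggregation) reproduces the full SCT forward pass outlined in Fig. 1.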

Implementation Details. 

We choose a recent SSD detector [26] as the basic architecture, built upon 6 heads corresponding to different spatial rescaling factors (taken in {1, 3, 5, 10, 19, 38}). The contextual fields we design consist of two parts: in the first one, prior boxes (corresponding to multiple scales and aspect ratios) are max-pooled with different instances of kernel sizes and strides taken in {2, 3}. In the second one, contextual fields composed of multi-scale features are fused across four spatial scales. In these experiments, we set the hyperparameter $\lambda$ in Eq. (3) to 0.6, and the embedding functions in the sparse context transformer correspond to residual FC layers whose input and output have the same number of channels.


We implement our experiments using PyTorch [27] on two Nvidia 3090 GPUs. We pretrain the SSD detectors on the source domain following exactly the original SSD settings in [26], and we fine-tune them on the target domain using stochastic gradient descent with the following settings: a batch size of 64, a momentum of 0.9, an initial learning rate of $4\times10^{-3}$ (divided by 10 after 3k and 3.5k iterations), a weight decay of $5\times10^{-4}$, and a total of 4k training iterations.

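This fine-tuning recipe maps directly onto a standard PyTorch optimizer configuration; the `nn.Linear` below is only a placeholder for the actual SSD + SCT detector.

```python
import torch
import torch.nn as nn

detector = nn.Linear(10, 10)  # placeholder for the fine-tuned SSD + SCT model
optimizer = torch.optim.SGD(detector.parameters(), lr=4e-3,
                            momentum=0.9, weight_decay=5e-4)
# learning rate divided by 10 after 3k and 3.5k iterations; 4k iterations
# in total, with a batch size of 64
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[3000, 3500], gamma=0.1)
```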
