Paper translation: "Sylph: A Hypernetwork Framework for Incremental Few-shot Object Detection"

Paper: https://arxiv.org/abs/2203.13903
Code: https://github.com/facebookresearch/sylph-few-shot-detection

Abstract

We study the challenging incremental few-shot object detection (iFSD) setting. Recently, hypernetwork-based approaches have been studied in the context of continuous and finetune-free iFSD with limited success. We take a closer look at important design choices of such methods, leading to several key improvements and resulting in a more accurate and flexible framework, which we call Sylph. In particular, we demonstrate the effectiveness of decoupling object classification from localization by leveraging a base detector that is pretrained for class-agnostic localization on a large-scale dataset. Contrary to what previous results have suggested, we show that with a carefully designed class-conditional hypernetwork, finetune-free iFSD can be highly effective, especially when a large number of base categories with abundant data are available for meta-training, almost approaching alternatives that undergo test-time training. This result is even more significant considering its many practical advantages: (1) incrementally learning new classes in sequence without additional training, (2) detecting both novel and seen classes in a single pass, and (3) no forgetting of previously seen classes. We benchmark our model on both COCO and LVIS, reporting as high as 17% AP on the long-tail rare classes on LVIS, indicating the promise of hypernetwork-based iFSD.

1. Introduction

While advances in deep learning have led to significant progress in computer vision [18, 23, 24, 31], much of this success has relied upon large-scale data collection and annotation [7, 20, 22, 36], a process that is both labor-intensive and time-consuming, and does not scale well with the number of categories. This is especially true for object detection [18, 23, 37], particularly for the long tail of object categories, where data may be scarcer [22]. As a result, few-shot learning of object detectors (FSD) [28, 65, 70, 72] has become a recent topic of interest.

While learning a novel class from only a few samples is a challenging problem, the task can be made simpler by leveraging known classes with abundant data (commonly referred to as base classes), whose structure can be used as a prior for knowledge transfer. The few previous FSD works have approached this primarily in two ways. The first is fine-tuning [47, 65], where a model is first pretrained on the base classes and then fine-tuned on a small balanced set of data from both the base and novel classes, a form of test-time training [59]. Although simple, it has difficulty scaling to many real-world applications due to its computational and memory requirements. An alternate strategy is taking a meta-learning approach [72]. Meta-learning approaches frame the problem as “learning to learn” [4, 10, 32, 44, 61, 69, 72], training the model episodically to induce fast adaptation to novel classes.

However, many FSD methods focus on the limited set-up where only novel categories are to be detected. These methods often fail to preserve the original detector performance on base categories [4, 10, 32, 72] or forget about the ones it was initially trained on [65]. Given the ever-changing nature of the real world, a desirable property of machine learning systems is the ability to incrementally learn new concepts without revisiting previous ones and without forgetting them [40, 42]. Humans are able to achieve such a feat, learning novel concepts not only without forgetting but also reusing such knowledge [45]. Conventional supervised learning struggles with incrementally presented data, tending to suffer catastrophic forgetting [39, 51]. An alternative is studying all the available data every time new concepts arrive, commonly referred to as “joint training” [22], but such a paradigm imposes a slow development cycle, requiring significant data collection efforts for the new concepts and expensive large-scale training (and re-training).

Instead, we seek an object detection model capable of learning new classes from a few shots in a fast, scalable manner without forgetting previously seen classes, a setting commonly referred to as incremental few-shot detection (iFSD). ONCE [44], a meta-learning approach to FSD, is of particular interest due to its hypernetwork-based class-conditional design. ONCE is able to enroll novel categories without affecting its ability to remember base classes. We use a base detector and hypernetwork architecture similar to ONCE, but with a few key design differences: (1) ONCE, along with several other recent works [28, 65, 72], attempts to directly produce (via training or hypernet) the parameters of a localization regression model that transforms the query sample feature maps into the output bounding boxes, all from the few available training samples. We find this to be unnecessary and potentially harmful, as the task can be significantly simplified by decoupling localization from classification. To achieve this goal, we leverage a base detector with class-agnostic localization capability pretrained on abundant base class data. (2) We study the class-conditional hypernetwork’s behavior, making some key changes to the structure and adding normalization to the predicted parameters, resulting in much higher accuracy.

With an architecture that can swiftly adapt to the long tail of classes from few shots, we name our framework Sylph, after the nimble long-tailed hummingbird (Figure 1). We present extensive evaluations that empirically demonstrate the benefits of our design, showing that Sylph is more effective than ONCE [44] (our main baseline) across all the reported datasets and evaluation regimes. On the challenging LVIS few-shot learning benchmark in particular, we show that Sylph is superior by a margin of 8 percentage points.

2. Related Work

Object Detection. Object detection is the task of simultaneously localizing and classifying objects within a scene. Most modern object detectors consist of a convolutional feature extractor [24, 31, 57] followed by various mechanisms or networks to predict classes and some form of bounding box coordinates [25]. Detectors that first generate region proposals during inference are often referred to as two-stage detectors [6, 17, 18, 23, 54], while ones that directly predict class and localization from the convolutional feature maps are considered single-stage detectors [8, 35, 38, 53, 62]. Single-stage detectors have the advantage of having simpler implementations and faster inference speeds, and recent advances have increased their accuracy to be competitive with two-stage models [37], which had previously been the primary advantage of such models. Throughout this work, we choose to primarily use FCOS [62] as our base detector due to its strong performance and class-agnostic localization based on “centerness” and intersection-over-union (IoU) losses; this allows for better generalization and high recall on novel unseen classes [29], especially when trained on large-scale datasets.

Few-shot Learning. While many supervised learning approaches assume a large number of samples from the data distribution, such methods risk overfitting when the model has only a few samples to learn from. Given the costs of collecting, annotating, and training models with large amounts of data, few-shot learning has become an active research direction, with image classification as the most common task. Many recent approaches take a meta-learning strategy [63]. Optimization-based approaches produce models that can quickly learn from few samples [13, 41, 52]. Metric-learning methods learn an embedding function that induces a space where samples can be compared with nearest neighbors or other such simple algorithms [33, 58, 60, 64]. Hypernetworks have also been used to predict model parameters for new classes from limited samples [3, 14, 16, 48, 50, 66]. We use a hypernetwork in our model to predict convolutional kernels for novel object classification. Such a strategy requires zero training during inference time and can easily scale to an arbitrary number of classes.

Few-shot Object Detection and Beyond. Most neural networks are trained with stochastic gradient descent, which often assumes the training data are drawn independently and identically distributed (i.i.d.). However, this i.i.d. assumption is violated in the practical scenario where few-shot categories are seen only after the model has been trained for a set of base categories. In such situations, catastrophic forgetting [12, 19] can occur: the model suffers severe degradation in performance on the original classes. In image classification, some works have proposed a generalized setting for few-shot learning to tackle this exact situation [16, 46]. Similarly for object detection, recent works have focused on incorporating few-shot categories into a model that has been pretrained with large-scale datasets [11, 44, 65]. This goes beyond the simpler, more traditional few-shot object detection set-up [28, 47, 48]. More generally, continual object detection [1, 44] works attempt to learn to detect new classes through several learning instances without forgetting any of the seen categories.

Of the prior work with the goal of both few-shot and continual learning for object detection, some are continual only in that they do not degrade base class accuracy during a single few-shot adaptation to new classes [11, 65]. In contrast, ONCE [44] considers a setting in which novel classes arrive sequentially and incrementally, leading to multiple learning events during which forgetting must be avoided. We adopt a model architecture that is able to provide such capabilities, as it is more flexible and a better fit for interactively learning new classes from the world. Methodologically, however, we approach the problem differently, as we (1) simplify learning by utilizing a detector with class-agnostic localization rather than trying to learn per-class localization from only a few samples; (2) leverage a per-class binary classifier to allow incrementally and independently added novel classes to co-exist with previously learned base classes, detecting seen and novel classes in a single pass; (3) generate both weights and biases for newly added classes, proposing an effective weight normalization to the output of a hypernetwork weight generator that enables stable training and more effective synthesis of class-specific class codes.


3. Method

We seek a model that can operate in the incremental few-shot detection (iFSD) [44] setting: a detector that can flexibly adapt to new classes introduced in sequence from only a few examples, without forgetting any previously seen classes. We differentiate this continuous iFSD from batch iFSD, where novel classes are added in a batch. Concretely, after being pretrained on a base set of classes $C^b$, the objective is to achieve good performance on a novel class $c^n_t \in C^n$ from a support set of only $K$ shots while maintaining strong performance on $C^b$ and the preceding novel classes $c^n_{t'}\ \forall t' < t$, without re-training on data from these previous classes. As the goal is to learn to adapt to new classes, we assume $C^b \cap C^n = \emptyset$.

3.1. Sylph

To achieve the stated objective of iFSD, we introduce Sylph, a framework that can quickly add detection capabilities of new classes, without any additional optimization of model parameters. Sylph is composed of two primary components (Figure 1): (1) a base object detector with class-agnostic localization to surface salient objects in an image with high recall and (2) a few-shot hypernetwork to generate class-specific parameters for a per-class binary classifier. We discuss each of these in detail below.

Figure 1. The Sylph Framework. Sylph is composed of a base object detector and a few-shot hypernetwork, whose Code Generator consists of a Code Predictor Head and Code Process Module (detailed in Section 3.1.2). The dashed arrow indicates weight sharing.

3.1.1 Object Detector

Modern object detection models [25] are often composed of a convolutional backbone $F_\theta$ followed by a detector head $D_\phi$. Given an image $I$, the former produces high-level feature maps $h = F_\theta(I)$, which can then be used by the detector head to predict both class $c$ and location, as specified by a bounding box $b = (x, y, h, w)$. Many detection models perform both these tasks in parallel [18, 38, 54], predicting the class category and bounding box coordinates from the same features: $o = D_\phi(h)$, where $o = [o_1, \ldots, o_n]$ are predicted objects in $I$, with each object $o_i = [c_i, b_i]$ containing the predicted class label and bounding box. We denote the final regression and classification layers as $B_\beta$ and $C_\gamma$, which can be a fully connected layer in region-based detection [54] or a convolutional layer in dense prediction [62]. For an $N$-way classification problem, the parameters $\gamma$ for the classifier normally produce $N + 1$ logits for a softmax, corresponding to the $N$ classes and the background. Meanwhile, the bounding box regressor’s parameters $\beta$ contain $N$ stacked weights $\beta_c$, with one for each class $c$; the class with the highest prediction score determines which regressor’s prediction is selected. In order for our object detector to support the challenging iFSD setting, we make several key design choices affecting the two primary outputs of a detector: classification and localization.

Incremental Classification Without Forgetting. A major contributor to catastrophic forgetting is highly non-i.i.d. sequential training with a shared classification head [12]; optimizing the softmax can result in destructive gradients overwriting previous knowledge. We thus replace the single softmax-based classifier $C_\gamma$ with many binary sigmoid-based classifiers $C_{\gamma_c}$, with each class individually handled by its own set of parameters. When trained with the focal loss [35], sigmoid classifiers have been shown to be just as effective as a single softmax classifier. Thus, when adding novel classes, we can train or generate a new set of classifier parameters $\gamma^n_c$. When combined with previous parameters to predict all available classes, there is zero interference between each class’s prediction score.
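
The zero-interference property can be illustrated with a small sketch. It is not the Sylph implementation: the class names, the 3-channel feature vector, and the numeric codes below are hypothetical, and the softmax here omits the background logit for brevity. It only shows that enrolling a new per-class sigmoid classifier leaves existing scores untouched, whereas a shared softmax renormalizes them.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def class_logit(feature, code):
    """Logit of one binary classifier: w . feature + b (a 1x1 conv at one location)."""
    weights, bias = code
    return sum(w * f for w, f in zip(weights, feature)) + bias

# Hypothetical class codes (weights, bias) for a 3-channel feature vector.
codes = {
    "cat": ([0.9, -0.2, 0.1], -1.0),
    "dog": ([-0.3, 0.8, 0.2], -1.0),
}
feature = [1.5, 0.5, -0.4]

sigmoid_before = {n: sigmoid(class_logit(feature, c)) for n, c in codes.items()}
softmax_before = softmax([class_logit(feature, c) for c in codes.values()])

# Enroll a novel class: only a new (weights, bias) pair is added.
codes["novel"] = ([0.7, 0.7, -0.5], -1.0)

sigmoid_after = {n: sigmoid(class_logit(feature, c)) for n, c in codes.items()}
softmax_after = softmax([class_logit(feature, c) for c in codes.values()])

# Per-class sigmoid scores of old classes are unchanged; softmax scores shrink.
print(sigmoid_before["cat"] == sigmoid_after["cat"])
print(softmax_before[0], softmax_after[0])
```

Because each sigmoid score depends only on its own code, the order in which classes are enrolled does not matter in this sketch.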

Class-agnostic Bounding Box Regressor. Previous few-shot object detection methods [28, 44, 65] have tended to learn a per-class box regressor $B_{\beta_c}$ in tandem with the classifier. However, when only a few examples are available for learning, the model has very little opportunity to learn a custom location regressor for each novel class. Instead, we propose pretraining the base object detector with a single class-agnostic box regressor $B_\beta$ for all classes. When adapting the model to novel classes $C^n$, we simply reuse $B_\beta$ for localization. Such an approach has been shown to work well for zero-shot object detection if pretrained on a large-scale dataset [21] and can leverage progress in the open-world detection literature [29]. By alleviating the need to learn localization in a few-shot or continual manner, we can treat the problem as a few-shot classification task and focus just on generating additional classifier parameters $\gamma^n_c$. We validate the effectiveness of this setup in Section 5.

We can satisfy both the aforementioned objectives with FCOS [62], a simple one-stage and anchor-free object detector. With these design choices, we decouple the few-shot novel class detection problem into serial tasks of localization and few-shot classification, dramatically simplifying it.


3.1.2 Few-shot Hypernetwork

With localization handled by the class-agnostic object detector, the problem reduces to few-shot classification. Sylph uses a hypernetwork $H_\psi$ to generate parameters $\gamma^*_c = \{w_c, b_c\}$ for each binary classifier $C_{\gamma^*_c}$. $H_\psi$ takes as input an $N$-way $K$-shot episode of support set samples, consisting of $K$ instances of $N$ classes randomly sampled from the meta-training set. We denote this support set $S^{N \times K} = (I^{N \times K}, b^{N \times K})$, with $I^{N \times K} \in \mathbb{R}^{(N \cdot K) \times C \times H \times W}$. The hypernetwork is modularized into three components: support set feature extraction, code prediction, and code aggregation and normalization, which we detail below.

Support Set Feature Extraction. The first stage consists of extracting features from the episode’s support set. We share the same convolutional backbone $F_\theta$ from the base detector to obtain features for each of the support set images, as it can be pretrained with the base detector in normal batch training. ROIAlignV2 [23] is then used to pull the features corresponding to the location of each instance of each class. We choose to crop at the feature level rather than at the image level, as features have a larger receptive field, potentially allowing for increased global context. ROIAlignV2 produces a fixed-size feature $z_{c,i} \in \mathbb{R}^{d_f \times d_h \times d_w}$ for each object instance, with $d_f$ being the channel dimension of the final layer of the backbone, and typically $d_h = d_w = 7$.

Code Predictor Head (CPH). Given each support sample’s extracted features $z_{c,i}$, hypernetwork $H_\psi$ predicts weights $w_{c,i} \in \mathbb{R}^{1 \times C \times k \times k}$ and bias $b_{c,i} \in \mathbb{R}$, where $C$ is the preceding channel dimension and $k$ is the convolutional kernel size. The code predictor head (Figure 2) consists of a shared subnetwork of $3 \times 3$ convolutional layers interleaved with group normalization [67] and ReLU activation functions, followed by a layer for predicting a weight and bias. Global average pooling after the weight and bias predictor layers is used to reduce the predicted weights to the final dimensions. While the hypernetwork is capable of predicting weights $w_{c,i}$ of arbitrary size, we choose a kernel size of $k = 1$ so that the generated weights can be used as either convolutional or linear layer weights, allowing compatibility with both region-based and dense detection.
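
The pooling step for $k = 1$ can be sketched as follows. This is a minimal illustration, not the paper's code: the nested-list layout `[C][H][W]`, the function name, and the toy 2-channel input are assumptions. It shows how a per-shot prediction map of size $C \times d_h \times d_w$ collapses to $C$ values, which can then serve as a $1 \times C \times 1 \times 1$ convolution kernel or as a linear-layer weight.

```python
def global_average_pool(feature_map):
    """Reduce a [C][H][W] nested list to a length-C list by averaging over H and W."""
    pooled = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled

# Toy 2-channel, 7x7 prediction map (d_h = d_w = 7 as in the text).
size = 7
channel_ones = [[1.0 for _ in range(size)] for _ in range(size)]
channel_rows = [[float(r) for _ in range(size)] for r in range(size)]
pooled = global_average_pool([channel_ones, channel_rows])
print(pooled)
```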


Figure 2. Hypernetwork architecture of the code predictor head.

Code Process Module (CPM). In the CPM, we aggregate the predicted parameters for all samples of a class from CPH into a single set of weights $w_c$ and bias $b_c$. We found a simple average of the weights and the bias across the $K$ shots to be effective: $w_c = \frac{1}{K}\sum_{i=0}^{K-1} w_{c,i}$ and $b_c = \frac{1}{K}\sum_{i=0}^{K-1} b_{c,i}$. However, directly using the class code $w_c$ in this form can cause exploding gradients, especially when stacking multiple convolutional layers between the input features and the final predictor head [16]. As shown in Fig. 4, gradient clipping [43] can help, but occasionally the model still does not converge well, leading to high variance in model accuracy.

Our weights $w_c$, as generated from the input support set features at this stage, more closely resemble feature maps than classifier weights. To this end, we want to avoid directly passing $w_c$ to the conditional classifier. Inspired by the success of $L^2$-normalized feature embeddings in parameterizing classifiers for zero-shot object detection [2, 21], we explore incorporating $L^2$-normalization of the weights, $\frac{w_c}{\|w_c\|}$. We normalize along the channel axis (in contrast to batch normalization [26]) to ensure weights for different classes do not interact. With normalization, learning is simplified and training is stabilized by mapping the support set features onto a unit sphere.

To ensure compatibility of the normalized weights $\frac{w_c}{\|w_c\|}$ with a non-cosine classifier, we follow [56] and add a learnable scalar parameter $g$, rescaling the normalized weights as $w_c^* = \frac{g}{\|w_c\|} w_c$. This allows us to avoid needing to adapt the classifier in the base detector. By replacing the per-class norm with a universal $g$, we end up with less variance across all class weights. We found that predicting the bias counteracts this negative effect. For the bias, we further add a prior bias $b_p = -\log((1 - \pi)/\pi)$ with $\pi = 0.01$, following [35], and a scalar $g_b$, resulting in a final bias of $b_c^* = g_b \cdot b_c + b_p$.
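
The CPM steps described above (averaging over shots, $L^2$-normalization, rescaling by $g$, and adding the prior bias) can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the released implementation: weights are flattened to plain lists (the $k = 1$ case), `g` and `g_b` are passed as fixed numbers rather than learned, and the small `eps` guard against a zero norm is an addition of this sketch.

```python
import math

def aggregate_class_code(shot_weights, shot_biases, g=1.0, g_b=1.0, pi=0.01):
    """Average per-shot codes, L2-normalize the weight, rescale, and add the prior bias."""
    num_shots = len(shot_weights)
    dim = len(shot_weights[0])
    # w_c = (1/K) * sum_i w_{c,i},  b_c = (1/K) * sum_i b_{c,i}
    w_c = [sum(w[j] for w in shot_weights) / num_shots for j in range(dim)]
    b_c = sum(shot_biases) / num_shots
    # w_c^* = g * w_c / ||w_c||  (eps is an assumption of this sketch)
    eps = 1e-12
    norm = math.sqrt(sum(v * v for v in w_c))
    w_star = [g * v / max(norm, eps) for v in w_c]
    # b_c^* = g_b * b_c + b_p, with b_p = -log((1 - pi) / pi)
    b_p = -math.log((1.0 - pi) / pi)
    b_star = g_b * b_c + b_p
    return w_star, b_star

# Two hypothetical shots with 2-channel codes.
w_star, b_star = aggregate_class_code([[3.0, 0.0], [0.0, 4.0]], [0.2, -0.2], g=2.0)
prior_probability = 1.0 / (1.0 + math.exp(-b_star))
print(w_star, b_star, prior_probability)
```

In this toy case the averaged bias is zero, so the final bias equals the prior $b_p = -\log 99 \approx -4.595$, and the sigmoid of that bias is $\pi = 0.01$: before any evidence, every location is scored as very unlikely to contain the class, which is the initialization used with the focal loss [35].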

3.2. Training and Evaluation Details

To train the base detector and the hypernetwork, the Sylph framework requires two sequential training stages: pretraining the base detector and learning the hypernetwork.

Base Object Detector Pretraining. We first pretrain the base detector $D_\phi$ with batch stochastic gradient descent on base classes $C^b$, optimizing for classification and bounding box regression losses. We choose FCOS as our base detector; we refer the reader to [62] for further training details. The pretraining process produces trained parameters $\theta$ and $\phi$ as well as class-agnostic box regression parameters $\beta$ and class codes for the base classes $\gamma_b = \{w_{c_b}, b_{c_b}\}\ \forall c_b \in C^b$. Thus, at the conclusion of this stage, we have a detector $D_\phi$ capable of producing bounding boxes in an image for the base classes and potentially novel classes as well.

Meta-training During meta-training, we create few-shot episodes of N N N categories by sampling a set of N × ( K + 1 ) N\times(K+1) N×(K+1) image and bounding box tuples ( I , b ) (I, b) (I,b) from C b C^b Cb, a support set of N × K N\times K N×K samples, and a query set with N × 1 N\times1 N×1 samples. The query set is used as input to the base detector. Only the focal loss [35] from the classification branch is computed at this stage. The primary goal at this stage is to train the few-shot hypernetwork H ψ H_\psi Hψ so that it is able to map S N × K S^{N\times K} SN×K to a new set of synthesized class codes γ c b ∗ = ( w c b ∗ , b c b ∗ ) \gamma_{c_b}^*=(w_{c_b}^*,b_{c_b}^*) γcb=(wcb,bcb) for classification. We freeze the whole base object detector except the four convolutional layers in the FCOS classification subnetwork and replace its initial classifier with our conditional classifier capable of taking the synthesized class codes to make predictions on the query image features. We found that finetuning these extra convolutional layers in the base detector results in better overall performance than not finetuning them (Section 5). In preliminary experiments we found that the more components/layers we initialize from pretraining the better our final AP for base classes.

Meta-testing To evaluate the model’s performance across all classes, we take $K$-shot per-class samples from the whole set and make feed-forward passes through the hypernetwork one class at a time to synthesize class codes $\gamma_{c}^{*}=\{w_{c}^{*},b_{c}^{*}\}\ \forall c\in C^{b}\cup C^{n}$. With the generated codes, the base detector is able to perform inference with the same speed and behavior as a normal detector. This setup of our model is denoted as Sylph.
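
To make the role of the class codes concrete, the sketch below scores a single location feature against a growing dictionary of codes. It rests on stated assumptions: `synthesize_code` is a placeholder that simply averages support features and is not the Sylph hypernetwork, and the per-class sigmoid scoring is assumed from the focal-loss classification branch. The point it illustrates is that adding a class only appends a new `(w, b)` pair, so scores of previously seen classes cannot change.

```python
import math


def synthesize_code(support_features):
    """Placeholder for the hypernetwork: average support features into a weight.

    NOT the Sylph hypernetwork; it only yields a (weight, bias) pair so the
    example runs end to end.
    """
    dim = len(support_features[0])
    n = len(support_features)
    w = [sum(f[d] for f in support_features) / n for d in range(dim)]
    return w, 0.0


def score(feature, class_codes):
    """Per-class sigmoid score of one location feature under every class code."""
    out = {}
    for name, (w, b) in class_codes.items():
        logit = sum(x * y for x, y in zip(feature, w)) + b
        out[name] = 1.0 / (1.0 + math.exp(-logit))
    return out


codes = {"cat": synthesize_code([[1.0, 0.0], [0.8, 0.2]])}
before = score([0.9, 0.1], codes)

# Adding a class only appends a new code; existing codes are untouched.
codes["dog"] = synthesize_code([[0.0, 1.0], [0.1, 0.9]])
after = score([0.9, 0.1], codes)
```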

4. Experiments

Datasets and Metrics We benchmark and ablate Sylph on two datasets: COCO [36] and LVIS [22]. For COCO, we follow the split commonly used for few-shot object detection [28, 44, 65]: the 60 categories that are disjoint from PASCAL VOC [9] are used as base classes, while the remaining 20 classes are designated novel. We report experimental results for K = {1, 5, 10, 20, 30} shots on the COCO minival set. For LVIS-v1, we follow the organically long-tail distribution of the dataset as proposed in [65] to produce a base-novel split. LVIS contains 405 frequent classes appearing in more than 100 images, 461 common classes with 10-100 images, and 337 rare classes with fewer than 10 images, for a total of 1203 object categories. In our experiments, we use the 337 rare classes as novel classes and the 866 frequent and common classes as base classes.
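
The LVIS base/novel split follows directly from the per-class image counts quoted above. The snippet below is a small sketch of that bookkeeping; the function name `lvis_bucket` is hypothetical, and the handling of the exact boundaries (10 and 100 images) follows the wording of this paragraph and is an assumption.

```python
def lvis_bucket(num_images):
    """Assign a class to a frequency bucket from its number of training images."""
    if num_images > 100:
        return "frequent"
    if num_images >= 10:
        return "common"
    return "rare"


# Class counts as stated in the text.
counts = {"frequent": 405, "common": 461, "rare": 337}
num_base = counts["frequent"] + counts["common"]  # base classes
num_novel = counts["rare"]                        # novel classes
total = num_base + num_novel
```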

For evaluation metrics, we report mean average precision (mAP) computed on a per-split basis; we run inference for both the base and novel classes in a single pass, but we report mAP separately as different models tend to have different performances across splits. For COCO, we denote the mAPs for base and novel categories as $AP_b$ and $AP_n$, respectively. Similarly, for LVIS, $AP_r$, $AP_c$, and $AP_f$ are the average precisions aggregated across rare, common, and frequent classes, respectively. For all experiments, we report the mean and standard deviation of the mAP across five meta-testing runs. We run experiments with several pretraining strategies: (1) Default: the model is pretrained on ImageNet-1k [55]; (2) Aug: large-scale jittering (LSJ) [15] and RandAugment [5] are also applied; and (3) All: in addition to the aforementioned augmentations, IG-50M pretrained backbone weights from PreDet [49] are used.
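
The reporting protocol (per-split mAP from a single inference pass, then mean and standard deviation over runs) can be sketched as follows. The helper names and the AP values are placeholders invented for the illustration and are not results from the paper; using the population standard deviation (`pstdev`) is also an assumption, since the text does not say which estimator is used.

```python
from statistics import mean, pstdev


def split_map(per_class_ap, splits):
    """Average per-class AP within each split (for example base vs. novel)."""
    return {name: mean(per_class_ap[c] for c in classes)
            for name, classes in splits.items()}


def summarize(runs):
    """Mean and standard deviation of each split's mAP across runs."""
    return {k: (mean(r[k] for r in runs), pstdev([r[k] for r in runs]))
            for k in runs[0]}


# Placeholder class names and AP values, for illustration only.
splits = {"base": ["a", "b"], "novel": ["c"]}
runs = [
    split_map({"a": 0.4, "b": 0.2, "c": 0.1}, splits),
    split_map({"a": 0.5, "b": 0.3, "c": 0.3}, splits),
]
stats = summarize(runs)
```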

Implementation Details For all our experiments, we use a ResNet-50 backbone [24] with a feature pyramid network (FPN) [34]. We use SGD with momentum (0.9) and weight decay (1e−4) for all training stages. During pretraining, we set the learning rate to 1e−2 with a batch size of 16; we increase the batch size to 128 when data augmentation is on. During meta-training, we set the learning rate (lr) to 5e−4. We uniformly sample 3-way 5-shot tasks from the base classes, with a single query image per class.

We pretrain for 90k steps (∼11hrs), with an extra 30k steps for meta-learning (∼13hrs). The lr is decreased tenfold at steps 60k and 80k in the pretraining, and at 20k and 26k during the meta-training. Finally, we limit the number of detections per image to 100 for COCO and 300 for LVIS. We build our framework on top of Detectron2 [68]; we plan to publicly release our code upon publication.
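
The step schedule described here amounts to a piecewise-constant learning rate. The following sketch encodes it with the base rates and milestones from the text; the function name `step_lr` is hypothetical, and whether the drop takes effect exactly at the milestone step or one step later is an assumption.

```python
def step_lr(step, base_lr, milestones, gamma=0.1):
    """Learning rate after a tenfold decrease at each milestone already reached."""
    drops = sum(1 for m in milestones if step >= m)
    return base_lr * gamma ** drops


PRETRAIN = dict(base_lr=1e-2, milestones=(60_000, 80_000))    # 90k steps in total
META_TRAIN = dict(base_lr=5e-4, milestones=(20_000, 26_000))  # 30k steps in total

lr_at_70k = step_lr(70_000, **PRETRAIN)        # one drop applied
lr_meta_final = step_lr(29_999, **META_TRAIN)  # two drops applied
```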

4.1. Incremental Few-shot Object Detection

As the only other method designed for iFSD, we primarily compare against ONCE on both COCO and LVIS. Focusing on the finetuning-free iFSD evaluation protocol [44], we demonstrate the effectiveness of Sylph with several pretraining strategies. In addition, we report results for a few training-intensive FSD methods as an upper bound of our finetune-free approach, including joint-training, which is normally used for long-tailed datasets [22], and a finetuning-based method known as TFA [65].

Finetuning-free iFSD benchmarking We primarily compare with ONCE [44], as the most relevant method in this setting. On COCO, we compare with the reported number from [44] in Table 2. For LVIS, we re-implement ONCE with a baseline version of our code generator which has no bias prediction, no weight norm, no scaling factor $g$ in the CPM, and no convolutional layers in the shared portion of the CPH, which effectively leaves the basic components of the hypernetwork as close to the originally-proposed ONCE as possible. We denote this version ONCE∗ in Table 1.

We demonstrate that the key design choices of Sylph allow it to significantly outperform ONCE on both datasets, across all data splits. On the large-scale dataset LVIS, Sylph surpasses ONCE by 8% averaged across different pretraining strategies in a fair head-to-head (no additional data augmentation or pretraining data). For the heavy data augmentation setup, Aug, ONCE∗ struggles to converge during training, resulting in much worse performance than Sylph. In particular, we show that our method is truly able to learn novel categories from few shots without forgetting base classes. For example, with early stopping during pretraining (Sylph-es in Table 2) and K = 10 shots on COCO, we achieve an APn twice as good as ONCE, while still surpassing it by 4 points for the base classes.

Joint-training and finetune-based iFSD as upper bounds For the Joint-train method, we follow [22] to ensure its effectiveness on the novel split in the low-data regime. In particular, we perform repeat factor sampling with the factor set to 0.001 in order to balance the sampling frequency across different classes during training. We select TFA [65] to represent finetuning-based iFSD methods. For this, we adapt the TFA [65] methodology to our FCOS-based framework, following the finetuning protocol as closely as possible. Specifically, this involves two training steps: (1) pretraining of the base detector on base classes; (2) sampling K = 10 shots across both base and novel classes while freezing all layers other than the box regressor and the classifier. We finetune the regressor and train a new classifier for all classes, with base classifier parameters initialized from pretraining. We denote the adapted TFA method as TFA∗. Additionally, we make several modifications to the standard TFA to bring it closer to our setup, adjusting to an incremental batch setup [44] where the novel classes are added in a single round. In particular, in the finetuning stage, (1) only $C^n$ is used, (2) the box regressor is kept frozen, and (3) the classifier is not initialized with any pretrained base class parameters, as we do not finetune on the base classes. In this setup, a deployed model can be directly extended to novel categories without any backbone modification. We label this version of TFA as TFA-ours, as it is made possible by our framework. As our model relies on a large-scale dataset in the meta-training stage, we create a variant, Sylph-LVIS, which uses the LVIS dataset excluding the COCO novel classes while keeping all parts of the base detector frozen so that it is able to preserve its pretrained COCO base class codes.
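
Repeat factor sampling, as used for the Joint-train baseline, oversamples images that contain infrequent categories. The sketch below follows the formulation commonly associated with LVIS [22]: a category with image frequency $f(c)$ gets the repeat factor $\max(1, \sqrt{t/f(c)})$ with threshold $t = 0.001$, and an image takes the maximum over its categories. The function name and the dictionary input format are assumptions for the illustration, and the stochastic rounding that real samplers apply to fractional factors is omitted.

```python
import math


def repeat_factors(image_categories, threshold=0.001):
    """Per-image repeat factors; `image_categories` maps image id -> set of category ids."""
    num_images = len(image_categories)
    freq = {}
    for cats in image_categories.values():
        for c in cats:
            freq[c] = freq.get(c, 0) + 1

    # Category-level factor: classes rarer than the threshold get a factor above 1.
    cat_rf = {c: max(1.0, math.sqrt(threshold / (n / num_images)))
              for c, n in freq.items()}

    # Image-level factor: the largest factor among the categories it contains.
    return {img: max((cat_rf[c] for c in cats), default=1.0)
            for img, cats in image_categories.items()}


# Toy dataset: every image has a frequent class, one image also has a rare class.
images = {i: {"frequent"} for i in range(10_000)}
images[0] = {"frequent", "rare"}
rf = repeat_factors(images)
```

In this toy dataset the rare class appears in 1 of 10,000 images, so the only image containing it is repeated roughly $\sqrt{10} \approx 3.16$ times per epoch, while all other images keep a factor of 1.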

We report these benchmarking results on both datasets in Tables 1 and 2. An interesting observation here is that, for all the FSD methods we benchmarked, their precision on the novel split increases significantly with an increased number of base classes. As we go from the 60 base classes in COCO to 866 base classes in LVIS, Sylph achieves around a 9% gain in APr and APn in Tables 1 and 2, even surpassing the gains achieved by all the variants where training and finetuning are allowed. In terms of the overall precision across all classes, Sylph’s performance is not far from finetuning-intensive methods: just 3 and 4 points lower compared to the best performing upper bound method on LVIS. On COCO, for example, with class augmentation in the meta-learning stage, our Sylph-LVIS achieves 3.8 AP, only 0.2 AP short of the joint-training approach.

Table 1. Benchmarking on the eval split of LVIS-v1. We use K = 10 shots to infer base class codes and all available data for the rare classes (≤ 10). Both ONCE∗ and Sylph predict all classes in a single pass. The base and novel data checkmarks indicate whether the data is used to update model weights during an incremental learning step.

Table 2. Benchmarking on the COCO dataset, evaluated on the minival set. We benchmark Sylph against ONCE for K = 1, 5, 10 shots, with additional K = 20, 30 shots for Sylph. We also include 10-shot TFA, which finetunes on novel data. To mimic the training protocol of ONCE, we apply early-stop (at 30k steps) to Sylph pretraining (denoted Sylph-es).

More FSD model behavior analysis On LVIS, we benchmarked all FSD methods across three pretraining strategies, showing that all methods benefit from the use of additional augmentations and large-scale weak supervision pretraining across both novel and base classes, including Sylph. This behavior is highly desirable for Sylph, as it benefits from any improvement to the base detector.

However, for a smaller scale dataset (e.g. COCO), the boost in novel class performance for Sylph is less than the gain on the base classes, as shown in Table 2 for rows corresponding to Sylph (All). This is related to episodic learning, where more tasks lead to improved learning compared to more per-task data. We further validate this with Sylph-LVIS, which has a similar amount of training data, but with more tasks; impressively, we find that Sylph-LVIS achieves comparable accuracy to the joint-training approach. Still, we see a large precision gap between Sylph-LVIS on the COCO novel split and Sylph on LVIS rare classes, indicating that large-scale pretraining is essential, as it results in (1) a more accurate bounding box locator, and (2) a feature extractor that can generalize better to novel classes.

Also, surprisingly, the adapted simple approach TFA-ours is able to achieve better precision on LVIS than its standard counterpart TFA with our selected base detector, with the advantage of not revisiting the base data at all.

5. Ablations and Further Discussion

We run all experiments in this section with the Default pretraining strategy unless otherwise stated.

How does the number of base classes impact the novel class precision? For this experiment, we randomly choose 50 classes from the frequent classes of LVIS as a novel set. We report base and novel mAP in Figure 3 on this fixed novel set while gradually increasing the number of base classes, starting from the frequent, moving to common, and then the rare classes. For all the plotted points, we complete the training of Sylph on the base split, and use K = 10 for class code inference. We confirm the effect of a larger base set on novel class detection precision in Fig. 3. Indeed, novel class mAP rapidly increases in the frequent classes region, slows down when adding common classes, and starts to fully stabilize at around 800 base classes in the rare classes region. Not surprisingly, novel class score increases more rapidly when the backbone is pretrained with classes that have more samples. These results indicate for the first time that challenging incremental few-shot detection is feasible when there is a large enough base dataset.

Figure 3. The effect of the number of base classes in Sylph meta-training. The blue, orange, and green backgrounds denote frequent, common, and rare classes, respectively.

Model ablations We study the effect of several key elements of our model, including the normalization scheme (GroupNorm (GN) vs $L^2$-Norm), the weight scaling factor $g$, predicted bias, and the number of convolutional layers stacked in CPH for the classifier. These results are shown in Table 3, from which we can see that overall, using either $L^2$-Norm or GN is very beneficial for Sylph, improving the overall AP across all subsets of classes by around 6 absolute points w.r.t. the baseline. However, when both GN and $L^2$-Norm are applied, there is no obvious extra gain. Comparing the last three rows in Table 3, we see that the use of $g$ and bias results in a small accuracy improvement. We plotted the loss of different configurations in Fig. 4. From the leftmost figure, we can see that $L^2$-Norm has a larger impact in curbing the loss than any other configuration. Additionally, from the rightmost figure, we can see that both $L^2$-Norm and GN converge better than models without normalization. Overall, we can conclude that the elements that form Sylph are effective at both learning base classes and generalizing to novel ones.
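
This section does not spell out how the $L^2$-Norm and the scaling factor $g$ are combined. One plausible reading, assumed here purely for illustration, is a weight-normalization-style code $w = g \cdot v / \lVert v \rVert_2$, where $v$ is the raw predicted weight. The function name `normalize_code` and the `eps` guard are hypothetical.

```python
import math


def normalize_code(v, g=1.0, eps=1e-12):
    """Rescale a raw predicted weight vector to L2 norm g (assumed formulation)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [g * x / (norm + eps) for x in v]


w = normalize_code([3.0, 4.0], g=2.0)
w_norm = math.sqrt(sum(x * x for x in w))
```

Under this reading, every synthesized class code has the same magnitude regardless of the support samples, which is consistent with the observation that normalization curbs the training loss.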

Table 3. Ablation study: Modeling choices of Sylph on the LVIS.

Figure 4. The loss comparison for different model setups. On the right side, we plot the losses starting from training step 100.

Training recipe ablations We also explored different training recipes: (1) FA: strictly freezing the whole base detector during meta-training, preserving the pretrained base class codes; (2) Joint: pretraining and meta-training on all available classes with the default setup. We report the result in Table 4. As we can see, Sylph’s training recipe beats that of FA by 2 points on the All setup. This means that allowing the classification convolutional subnetwork in the base detector to adapt during meta-training is important in our proposed framework. However, Joint performs comparably to FA, falling behind Sylph even though it has seen the novel classes during training. We think this might be explained by two reasons: (1) As the number of base classes increases, Joint struggles to recover the base class AP in meta-training. (2) As we use uniform sampling on the class level, when mixed with rare classes, the more frequent classes get sampled less, thus leading to an AP drop on those splits.

Table 4. Effect of the training recipe on the Sylph Framework. We report average precision across five runs.

Does freezing the base detector in the meta-test stage limit few-shot continual learning capabilities? We set up a simple two-step continual learning task and solve it with finetuning [42, 65]. In particular, given a pretrained FCOS model on base classes, we freeze most parts of the detector and finetune the remaining parts on all available novel data. On COCO, we keep the same base and novel splits. On LVIS, we use 100 randomly selected frequent classes as the novel split and the remaining 1103 classes as base classes. We follow TFA∗-st where the box regressor and classifier are finetuned and TFA-ours where only the classifier is trained. The results, along with a normal training on the novel set from scratch, are shown in Table 5. We see that, surprisingly, there is no obvious performance drop for the finetune approach, and even with the strict setup TFA-ours, the APn only decreases by 3 points. As TFA-ours closely resembles the training scheme of Sylph, we can conclude that the formulation we propose here does not entail a large sacrifice to the novel class learning potential.

Table 5. Novel set accuracy comparison across the finetuning approaches. Full training of the detector is denoted Scratch.

6. Conclusion

We introduce Sylph, an object detection framework capable of extending to new classes from only a few examples in a continual manner without any training. We empirically validate that our design choices lead to effective training and improved accuracy, showing for the first time that an iFSD without test-time training can achieve performance close to finetune-based methods on large scale datasets like LVIS. While we view Sylph as an improvement over existing methods, there are limitations. Though we have demonstrated that pretraining a class-agnostic detector can surface novel objects with high recall, it is not infallible and still dependent on large-scale datasets. Unlabeled objects due to annotator error or a class not being in the label set can result in false negatives in the dataset, which may lead to the model failing to surface such objects [27, 30, 71]. Additionally, more sophisticated aggregation methods to fuse support set features may also lead to further improvements.
