【翻译】DiffusionDet: Diffusion Model for Object Detection

异想天开的长颈鹿

分类专栏：翻译文章标签：目标检测计算机视觉深度学习

原文链接：https://arxiv.org/pdf/2211.09788.pdf

DiffusionDet: Diffusion Model for Object Detection

Shoufa Chen, Peize Sun, Yibing Song, Ping Luo

论文：https://arxiv.org/pdf/2211.09788.pdf
项目：https://github.com/ShoufaChen/DiffusionDet

Abstract（摘要）

We propose DiffusionDet, a new framework that formulates object detection as a denoising diffusion process from noisy boxes to object boxes. During training stage, object boxes diffuse from ground-truth boxes to random distribution, and the model learns to reverse this noising process. In inference, the model refines a set of randomly generated boxes to the output results in a progressive way. The extensive evaluations on the standard benchmarks, including MS-COCO and LVIS, show that DiffusionDet achieves favorable performance compared to previous well-established detectors. Our work brings two important findings in object detection. First, random boxes, although drastically different from pre-defined anchors or learned queries, are also effective object candidates. Second, object detection, one of the representative perception tasks, can be solved by a generative way. Our code is available at https://github.com/ShoufaChen/DiffusionDet.
　　我们提出了DiffusionDet，这是一个新的框架，它将目标检测表述为从噪声框到目标框的去噪扩散过程。在训练阶段，目标框从真实框扩散到随机分布，模型学会逆转这个噪声过程。在推理中，模型以渐进的方式将一组随机生成的框细化为输出结果。对包括MS-COCO和LVIS在内的标准基准的广泛评估表明，与之前成熟的检测器相比，DiffusionDet实现了良好的性能。我们的工作在目标检测方面带来了两个重要发现。首先，随机框虽然与预定义锚点或学习queries有很大不同，但也是有效的候选目标。其次，目标检测是代表性的感知任务之一，可以通过生成的方式解决。我们的代码可在https://github.com/ShoufaChen/DiffusionDet获取。

1. Introduction（介绍）

Object detection aims to predict a set of bounding boxes and associated category labels for targeted objects in one image. As a fundamental visual recognition task, it has become the cornerstone of many related recognition scenarios, such as instance segmentation [33, 48], pose estimation [9, 20], action recognition [29, 73], object tracking [41, 58], and visual relationship detection [40, 56].
　　目标检测旨在为一幅图像中的目标对象预测一组边界框和相关类别标签。作为一项基本的视觉识别任务，它已成为许多相关识别场景的基石，例如实例分割[33, 48]、姿态估计[9, 20]、动作识别[29, 73]、目标跟踪[41, 58]，以及视觉关系检测[40, 56]。
　　Modern object detection approaches have been evolving with the development of object candidates, i.e., from empirical object priors [24, 53, 64, 66] to learnable object queries [10,81,102]. Specifically, the majority of detectors solve detection tasks by defining surrogate regression and classification on empirically designed object candidates, such as sliding windows [25, 71], region proposals [24, 66], anchor boxes [50, 64] and reference points [17, 97, 101]. Recently, DETR [10] proposes learnable object queries to eliminate the hand-designed components and set up an end-to-end detection pipeline, attracting great attention on query-based detection paradigm [21, 46, 81, 102].
　　现代目标检测方法随着候选目标的发展而不断发展，即从经验目标先验[24、53、64、66] 到可学习目标queries[10,81,102]。具体来说，大多数检测器通过对经验设计的候选目标定义代理回归和分类来解决检测任务，例如滑动窗口[25、71]、区域proposals[24、66]、锚框[50、64]和参考点[ 17, 97, 101]。最近，DETR[10]提出可学习的目标queries来消除手工设计的组件并建立端到端的检测管道，引起了人们对基于queries的检测范式的极大关注[21、46、81、102]。
　　While these works achieve a simple and effective design, they still have a dependency on a fixed set of learnable queries. A natural question is: is there a simpler approach that does not even need the surrogate of learnable queries?
　　虽然这些作品实现了简单有效的设计，但它们仍然依赖于一组固定的可学习queries。一个自然的问题是：是否有一种更简单的方法甚至不需要可学习queries的替代？
　　We answer this question by designing a novel framework that directly detects objects from a set of random boxes. Starting from purely random boxes, which do not contain learnable parameters that need to be optimized in training, we expect to gradually refine the positions and sizes of these boxes until they perfectly cover the targeted objects. This noise-to-box approach does not require heuristic object priors nor learnable queries, further simplifying the object candidates and pushing the development of the detection pipeline forward.
　　我们通过设计一个新颖的框架来回答这个问题，该框架直接从一组随机框中检测目标。从不包含需要在训练中优化的可学习参数的纯随机框开始，我们期望逐渐细化这些框的位置和大小，直到它们完美地覆盖目标对象。这种噪声到框的方法不需要启发式目标先验，也不需要可学习的queries，进一步简化了目标候选并推动了检测管道的发展。
　　Our motivation is illustrated in Figure 1. We think the philosophy of the noise-to-box paradigm is analogous to the noise-to-image process in denoising diffusion models [15, 35, 79], a class of likelihood-based models that generate an image by gradually removing noise via a learned denoising model. Diffusion models have achieved great success in many generation tasks [3, 4, 37, 63, 85] and have started to be explored in perception tasks such as image segmentation [1, 5, 6, 12, 28, 42, 89]. However, to the best of our knowledge, no prior art has successfully adapted them to object detection.
　　我们的动机如图1所示。我们认为噪声到框范式的哲学类似于去噪扩散模型[15、35、79]中的噪声到图像过程，这是一类基于似然的模型，通过学习去噪模型逐渐去除图像中的噪声来生成图像。扩散模型在许多生成任务[3, 4, 37, 63, 85]中取得了巨大成功，并开始在图像分割等感知任务中进行探索[1, 5, 6, 12, 28, 42, 89]。然而，据我们所知，还没有成功地将其应用于目标检测的现有技术。
[Figure 1: image not available]
　　In this work, we propose DiffusionDet, which tackles the object detection task with a diffusion model by casting detection as a generative task over the space of the positions (center coordinates) and sizes (widths and heights) of bounding boxes in the image. At training stage, Gaussian noise controlled by a variance schedule [35] is added to ground truth boxes to obtain noisy boxes. Then these noisy boxes are used to crop [33, 66] features of Region of Interest (RoI) from the output feature map of the backbone encoder, e.g., ResNet [34], Swin Transformer [54]. Finally, these RoI features are sent to the detection decoder, which is trained to predict the ground-truth boxes without noise. With this training objective, DiffusionDet is able to predict the ground truth boxes from random boxes. At inference stage, DiffusionDet generates bounding boxes by reversing the learned diffusion process, which adjusts a noisy prior distribution to the learned distribution over bounding boxes.
　　在这项工作中，我们提出了 DiffusionDet，它将检测视为图像中边界框的位置（中心坐标）和大小（宽度和高度）空间上的生成任务，从而用扩散模型处理目标检测任务。在训练阶段，将由方差时间表[35]控制的高斯噪声添加到真实框以获得噪声框。然后，这些噪声框用于从骨干编码器（例如ResNet[34]、Swin Transformer[54]）的输出特征图中裁剪[33、66]感兴趣区域 (RoI) 的特征。最后，这些RoI特征被发送到检测解码器，该解码器被训练来预测没有噪声的真实框。有了这个训练目标，DiffusionDet就能够从随机框中预测出真实框。在推理阶段，DiffusionDet通过反转学习的扩散过程生成边界框，该过程将带噪声的先验分布调整为边界框上的学习分布。
　　The noise-to-box pipeline of DiffusionDet has the appealing advantage of Once-for-All: we can train the network once and use the same network parameters under diverse settings in inference. (1) Dynamic boxes: Leveraging random boxes as object candidates, DiffusionDet decouples the training and evaluation. DiffusionDet can be trained with $N_{train}$ random boxes while being evaluated with $N_{eval}$ random boxes, where the $N_{eval}$ is arbitrary and does not need to be equal to $N_{train}$ . (2) Progressive refinement: The diffusion model benefits DiffusionDet by iterative refinement. We can adjust the number of denoising sampling steps to improve the detection accuracy or accelerate the inference speed. This flexibility enables DiffusionDet to suit different detection scenarios where accuracy and speed are required differently.
　　DiffusionDet的noise-to-box管道具有Once-for-All的吸引人优势：我们可以训练一次网络，并在不同的设置下使用相同的网络参数进行推理。(1)动态框：利用随机框作为候选目标，DiffusionDet将训练和评估解耦。DiffusionDet可以用 $N_{train}$ 个随机框进行训练，同时用 $N_{eval}$ 个随机框进行评估，其中 $N_{eval}$ 是任意的，不需要等于 $N_{train}$ 。(2)渐进细化：扩散模型通过迭代细化使DiffusionDet受益。我们可以调整去噪采样步数来提高检测精度或加快推理速度。这种灵活性使DiffusionDet能够适应对精度和速度有不同要求的不同检测场景。
　　We evaluate DiffusionDet on MS-COCO [51] dataset. With ResNet-50 [34] backbone, DiffusionDet achieves 45.5 AP using a single sampling step, which significantly outperforms Faster R-CNN [66] (40.2 AP), DETR [10] (42.0 AP) and on par with Sparse R-CNN [81] (45.0 AP). Besides, we can further improve DiffusionDet up to 46.2 AP by increasing the number of sampling steps. On the contrary, existing approaches [10, 81, 102] do not have this refinement property and would have a remarkable performance drop when evaluated in an iterative way. Moreover, we further conduct experiments on challenging LVIS [31] dataset, and DiffusionDet also performs well on this long-tailed dataset, achieving 42.1 AP with Swin-Base [54] backbone.
　　我们在MS-COCO[51]数据集上评估DiffusionDet。使用ResNet-50[34]主干，DiffusionDet使用单个采样步骤实现45.5 AP，这显着优于Faster R-CNN[66] (40.2 AP)、DETR[10] (42.0 AP)，与Sparse R-CNN相当[81]（45.0 AP）。此外，我们可以通过增加采样步数进一步将DiffusionDet提高到46.2 AP。相反，现有方法[10, 81, 102]不具有这种细化属性，并且在以迭代方式评估时会出现显着的性能下降。此外，我们进一步对具有挑战性的LVIS[31]数据集进行了实验，DiffusionDet在这个长尾数据集上也表现良好，使用Swin-Base[54]主干实现了42.1 AP。
　　Our contributions are summarized as follows:
　　• We formulate object detection as a generative denoising process, which is the first study to apply the diffusion model to object detection to the best of our knowledge.
　　• Our noise-to-box detection paradigm has several appealing properties, such as decoupling training and evaluation stage for dynamic boxes and progressive refinement.
　　• We conduct extensive experiments on MS-COCO and LVIS benchmarks. DiffusionDet achieves favorable performance against previous well-established detectors.
　　我们的贡献总结如下：
　　• 我们将目标检测制定为生成去噪过程，据我们所知，这是第一项将扩散模型应用于目标检测的研究。
　　• 我们的噪声到框检测范式具有几个吸引人的特性，例如动态框的解耦训练和评估阶段以及渐进式细化。
　　• 我们对MS-COCO和LVIS基准进行了大量实验。DiffusionDet相对于以前行之有效的检测器实现了良好的性能。

2. Related Work（相关工作）

Object detection. Most modern object detection approaches perform box regression and category classification on empirical object priors, such as proposals [24, 66], anchors [50, 64, 65], points [84, 87, 101]. Recently, Carion et al. proposed DETR [10] to detect objects using a fixed set of learnable queries. Since then, the query-based detection paradigm has attracted great attention and inspired a series of following works [46, 52, 57, 80, 81, 102]. In this work, we push forward the development of the object detection pipeline further with DiffusionDet, as compared in Figure 2.
　　目标检测。大多数现代目标检测方法对经验目标先验执行框回归和类别分类，例如proposals[24、66]、锚点[50、64、65]、点[84、87、101]。最近，Carion等人提出了DETR[10]来使用一组固定的可学习queries来检测目标。从那时起，基于queries的检测范式引起了极大的关注，并激发了一系列后续工作[46、52、57、80、81、102]。在这项工作中，我们使用 DiffusionDet 进一步推进了目标检测管道的开发，如图2所示。
[Figure 2: image not available]
　　Diffusion model. As a class of deep generative models, diffusion models [35, 77, 79] start from the sample in random distribution and recover the data sample via a gradual denoising process. Diffusion models have recently demonstrated remarkable results in fields including computer vision [4, 19, 30, 32, 36, 60, 63, 68, 69, 74, 96, 99], natural language processing [3, 27, 47], audio processing [38, 43, 45, 62, 82, 92, 95], interdisciplinary applications [2, 37, 39, 70, 85, 91, 94], etc. More applications of diffusion models can be found in recent surveys [8, 96].
　　扩散模型。作为一类深度生成模型，扩散模型[35,77,79]从随机分布的样本开始，通过渐进的去噪过程恢复数据样本。扩散模型最近在计算机视觉 [4、19、30、32、36、60、63、68、69、74、96、99]、自然语言处理[3、27、47]、音频处理[38、43、45、62、82、92、95]、跨学科应用[2、37、39、70、85、91、94]等领域取得了显著成果。扩散模型的更多应用可以在最近的调查中找到[8 , 96]。
　　Diffusion model for perception tasks. While Diffusion models have achieved great success in image generation [15,35,79], their potential for discriminative tasks has yet to be fully explored. Some pioneer works tried to adopt the diffusion model for image segmentation tasks [1, 5, 6, 12, 28, 42, 89], for example, Chen et al. [12] adopted Bit Diffusion model [13] for panoptic segmentation [44] of images and videos. However, despite significant interest in this idea, there are no previous solutions that successfully adapt generative diffusion models for object detection, the progress of which remarkably lags behind that of segmentation. We argue that this may be because segmentation tasks are processed in an image-to-image style, which is more conceptually similar to the image generation tasks, while object detection is a set prediction problem [10] which requires assigning object candidates [10, 49, 66] to ground truth objects. To the best of our knowledge, this is the first work that adopts a diffusion model for object detection.
　　感知任务的扩散模型。虽然扩散模型在图像生成方面取得了巨大成功 [15,35,79]，但它们在判别任务中的潜力尚未得到充分探索。一些先驱作品尝试采用扩散模型进行图像分割任务[1,5,6,12,28,42,89]，例如，Chen等[12]采用位扩散模型[13]进行图像和视频的全景分割[44]。然而，尽管人们对这个想法很感兴趣，但以前没有成功地将生成扩散模型用于目标检测的解决方案，其进展明显落后于分割。我们认为这可能是因为分割任务是以图像到图像的方式处理的，这在概念上更类似于图像生成任务，而目标检测是一个集合预测问题 [10]，需要分配候选目标 [10, 49, 66] 到地面实况目标。据我们所知，这是第一个采用扩散模型进行目标检测的工作。

3. Approach（方法）

3.1. Preliminaries（前期准备工作）

Object detection. The learning objective of object detection is input-target pairs ( $\boldsymbol{x}, \boldsymbol{b}, \boldsymbol{c}$ ), where $\boldsymbol{x}$ is the input image, and $\boldsymbol{b}$ and $\boldsymbol{c}$ are the set of bounding boxes and the category labels for objects in the image $\boldsymbol{x}$ , respectively. More specifically, we formulate the $i$ -th box in the set as $\boldsymbol{b}_i = (c_x^i, c_y^i, w^i, h^i)$ , where $(c_x^i, c_y^i)$ are the center coordinates of the bounding box, and $(w^i, h^i)$ are the width and height of that bounding box, respectively.
　　目标检测。目标检测的学习目标是输入-目标对（ $\boldsymbol{x}, \boldsymbol{b}, \boldsymbol{c}$ ），其中 $\boldsymbol{x}$ 是输入图像， $\boldsymbol{b}$ 和 $\boldsymbol{c}$ 分别是图像 $\boldsymbol{x}$ 中目标的一组边界框和类别标签。更具体地说，我们将集合中的第 $i$ 个框表示为 $\boldsymbol{b}_i = (c_x^i, c_y^i, w^i, h^i)$ ，其中 $(c_x^i, c_y^i)$ 是边界框的中心坐标， $(w^i, h^i)$ 分别是该边界框的宽度和高度。
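The center-size box format defined above is often converted to corner format before cropping features or measuring overlap. The following is a minimal sketch of that conversion; the function names are illustrative and do not come from the paper or its code base.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]


def cxcywh_to_xyxy(box: Box) -> Box:
    """Convert (center_x, center_y, width, height) to (x_min, y_min, x_max, y_max)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def xyxy_to_cxcywh(box: Box) -> Box:
    """Convert (x_min, y_min, x_max, y_max) back to (center_x, center_y, width, height)."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2, x1 - x0, y1 - y0)


# A box centered at (50, 40) that is 20 wide and 10 high.
corners = cxcywh_to_xyxy((50.0, 40.0, 20.0, 10.0))
```

Both directions are lossless, so a detector can diffuse boxes in center-size form and still crop RoI features with corner coordinates.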
　　Diffusion model. Diffusion models [35, 75–77] are a class of likelihood-based models inspired by nonequilibrium thermodynamics [77, 78]. These models define a Markovian chain of diffusion forward process by gradually adding noise to sample data. The forward noise process is defined as
　　扩散模型。扩散模型 [35, 75–77] 是一类受非平衡热力学启发的基于似然的模型[77, 78]。这些模型通过逐渐向样本数据中添加噪声来定义扩散前向过程的马尔可夫链。前向噪声过程定义为
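$$q(z_t \mid z_0) = \mathcal{N}\left(z_t \mid \sqrt{\bar{\alpha}_t}\, z_0,\ (1 - \bar{\alpha}_t)\boldsymbol{I}\right) \tag{1}$$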
which transforms data sample $z_0$ to a latent noisy sample $z_t$ for $t ∈ \{0, 1, ..., T\}$ by adding noise to $z_0$ . $\bar{\alpha}_t:= ∏_{s=0}^t α_s = ∏_{s=0}^t(1 − β_s)$ and $β_s$ represents the noise variance schedule [35]. During training, a neural network $f_θ(z_t, t)$ is trained to predict $z_0$ from $z_t$ by minimizing the training objective with $\ell_2$ loss [35]:
通过向 $z_0$ 添加噪声，将数据样本 $z_0$ 转换为潜在噪声样本 $z_t$ ，其中 $t ∈ \{0, 1, ..., T\}$ 。 $\bar{\alpha}_t:= ∏_{s=0}^t α_s = ∏_{s=0}^t(1 − β_s)$ 和 $β_s$ 表示噪声方差表[35]。在训练期间，神经网络 $f_θ(z_t, t)$ 被训练为通过使用 $\ell_2$ 损失最小化训练目标从 $z_t$ 预测 $z_0$ [35]：
$$\mathcal{L}_{train} = \frac{1}{2}\left\| f_θ(z_t, t) - z_0 \right\|^2 \tag{2}$$
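As a rough illustration of the forward process and the training objective, the sketch below assumes the standard DDPM form: $z_t = \sqrt{\bar{\alpha}_t} z_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon$ with $\epsilon$ drawn from a standard Gaussian, and a loss of half the squared $\ell_2$ distance between the predicted and the true $z_0$. The function names are hypothetical, and a plain list of floats stands in for a tensor.

```python
import math
import random


def q_sample(z0, alpha_bar_t, rng):
    """Draw z_t with mean sqrt(alpha_bar_t) * z0 and variance (1 - alpha_bar_t) per element."""
    signal = math.sqrt(alpha_bar_t)
    sigma = math.sqrt(1.0 - alpha_bar_t)
    return [signal * v + sigma * rng.gauss(0.0, 1.0) for v in z0]


def l2_loss(pred_z0, z0):
    """Half the squared Euclidean distance between the prediction and the clean sample."""
    return 0.5 * sum((p - v) ** 2 for p, v in zip(pred_z0, z0))


rng = random.Random(0)
z0 = [0.5, 0.5, 0.2, 0.3]            # one box as (cx, cy, w, h)
z_half = q_sample(z0, 0.5, rng)      # partly noised
z_clean = q_sample(z0, 1.0, rng)     # alpha_bar = 1 leaves the sample untouched
```

With $\bar{\alpha}_t$ close to 1 the sample stays near the data, and with $\bar{\alpha}_t$ close to 0 it is almost pure noise, which is what lets inference start from random boxes.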
　　At inference stage, data sample $z_0$ is reconstructed from noise $z_T$ with the model $f_θ$ and an updating rule [35, 76] in an iterative way, i.e., $z_T → z_{T −∆} → ... → z_0$ . More detailed formulation of diffusion models can be found in Appendix A.
　　在推理阶段，数据样本 $z_0$ 通过模型 $f_θ$ 和更新规则[35, 76]以迭代方式从噪声 $z_T$ 重构，即 $z_T → z_{T −∆} → ... → z_0$ 。可以在附录A中找到更详细的扩散模型公式。
　　In this work, we aim to solve the object detection task via the diffusion model. In our setting, data samples are a set of bounding boxes $z_0 = \boldsymbol{b}$ , where $\boldsymbol{b} ∈ \mathbb{R}^{N×4}$ is a set of $N$ boxes. A neural network $f_θ(z_t, t, \boldsymbol{x})$ is trained to predict $z_0$ from noisy boxes $z_t$ , conditioned on the corresponding image $\boldsymbol{x}$ . The corresponding category label $\boldsymbol{c}$ is produced accordingly.

3.2. Architecture（架构）

Since the diffusion model generates data samples iteratively, it needs to run the model $f_θ$ multiple times at the inference stage. However, it would be computationally intractable to directly apply $f_θ$ on the raw image at every iterative step. Therefore, we propose to separate the whole model into two parts, image encoder and detection decoder, where the former runs only once to extract a deep feature representation from the raw input image $\boldsymbol{x}$ , and the latter takes this deep feature as condition, instead of the raw image, to progressively refine the box predictions from noisy boxes $z_t$ .
　　由于扩散模型迭代生成数据样本，因此在推理阶段需要多次运行模型 $f_θ$ 。然而，在每个迭代步骤中直接将 $f_θ$ 应用于原始图像在计算上是难以处理的。因此，我们建议将整个模型分成两部分，图像编码器和检测解码器，其中前者只运行一次以从原始输入图像 $\boldsymbol{x}$ 中提取深度特征表示，而后者将这个深度特征作为条件，而不是原始图像，逐步细化来自噪声框 $z_t$ 的框预测。
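The encoder/decoder split can be pictured as a loop in which only the decoder is repeated. The sketch below uses toy stand-ins (`toy_encoder`, `toy_decoder`) that merely count their calls; they are hypothetical placeholders and not the networks used in the paper.

```python
def run_detector(image, init_boxes, encoder, decoder, timesteps):
    """Encode the image once, then refine the boxes once per timestep."""
    features = encoder(image)                 # expensive part, runs a single time
    boxes = init_boxes
    for t in timesteps:
        boxes = decoder(features, boxes, t)   # lighter part, runs at every step
    return boxes


# Hypothetical stand-ins that only record how often they are called.
calls = {"encoder": 0, "decoder": 0}


def toy_encoder(image):
    calls["encoder"] += 1
    return sum(image) / len(image)            # one number plays the role of the feature map


def toy_decoder(features, boxes, t):
    calls["decoder"] += 1
    # Pull every coordinate halfway toward the "feature" value.
    return [[0.5 * (v + features) for v in box] for box in boxes]


refined = run_detector(
    image=[0.2, 0.4, 0.6],
    init_boxes=[[0.0, 0.0, 1.0, 1.0]],
    encoder=toy_encoder,
    decoder=toy_decoder,
    timesteps=[3, 2, 1, 0],
)
```

The cost of extra sampling steps therefore scales with the decoder alone, which is what makes multi-step refinement affordable.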
　　Image encoder. Image encoder takes as input the raw image and extracts its high-level features for the following detection decoder. We implement DiffusionDet with both Convolutional Neural Networks such as ResNet [34] and Transformer-based models like Swin [54]. Feature Pyramid Network [49] is used to generate multi-scale feature maps for both ResNet and Swin backbones following [49, 54, 81].
　　图像编码器。图像编码器将原始图像作为输入，并为接下来的检测解码器提取其高级特征。我们使用ResNet[34]等卷积神经网络和Swin[54]等基于Transformer的模型来实现DiffusionDet。遵循[49, 54, 81]，我们使用特征金字塔网络[49]为ResNet和Swin主干生成多尺度特征图。
　　Detection decoder. Borrowed from Sparse R-CNN [81], the detection decoder takes as input a set of proposal boxes to crop RoI-feature [33, 66] from feature map generated by image encoder, and sends these RoI-features to detection head to obtain box regression and classification results. Following [10,81,102], our detection decoder is composed of 6 cascading stages. The differences between our decoder and the one in Sparse R-CNN are that (1) DiffusionDet begins from random boxes while Sparse R-CNN uses a fixed set of learned boxes in inference; (2) Sparse R-CNN takes as input pairs of the proposal boxes and its corresponding proposal feature, while DiffusionDet needs the proposal boxes only; (3) DiffusionDet re-uses the detector head in iterative sampling steps and the parameters are shared across different steps, each of which is specified to the diffusion process by timestep embedding [35, 86], while Sparse R-CNN uses the detection decoder only once in the forward pass.
　　检测解码器。借鉴Sparse R-CNN[81]，检测解码器将一组proposal框作为输入，从图像编码器生成的特征图中裁剪RoI特征[33、66]，并将这些RoI特征发送到检测头以获得框回归和分类结果。遵循[10,81,102]，我们的检测解码器由6个级联阶段组成。我们的解码器与Sparse R-CNN解码器的区别在于：(1) DiffusionDet从随机框开始，而Sparse R-CNN在推理中使用一组固定的学习框；(2) Sparse R-CNN以proposal框及其对应的proposal特征对作为输入，而DiffusionDet只需要proposal框；(3) DiffusionDet在迭代采样步骤中重复使用检测头，参数在不同步骤之间共享，每个步骤通过时间步嵌入[35、86]对应到扩散过程，而Sparse R-CNN在前向传播中只使用一次检测解码器。

3.3. Training（训练）

During training, we first construct the diffusion process from ground-truth boxes to noisy boxes and then train the model to reverse this process. Algorithm 1 provides the pseudo-code of DiffusionDet training procedure.
　　在训练过程中，我们首先构建从ground-truth boxes到noisy boxes的扩散过程，然后训练模型来反转这个过程。算法1提供了DiffusionDet训练过程的伪代码。
[Algorithm 1: pseudo-code image not available]
　　Ground truth boxes padding. For modern object detection benchmarks [18, 31, 51, 72], the number of instances of interest typically varies across images. Therefore, we first pad some extra boxes to original ground truth boxes such that all boxes are summed up to a fixed number $N_{train}$ . We explore several padding strategies, for example, repeating existing ground truth boxes, concatenating random boxes or image-size boxes. Comparisons of these strategies are in Section 4.4, and concatenating random boxes works best.
　　地面实况框填充。对于现代目标检测基准 [18、31、51、72]，感兴趣实例的数量通常因图像而异。因此，我们首先将一些额外的框填充到原始地面真值框，以便所有框加起来为固定数 $N_{train}$ 。我们探索了几种填充策略，例如，重复现有的地面真值框、连接随机框或图像大小框。这些策略的比较在4.4节中，连接随机框的效果最好。
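The "concatenating random boxes" strategy can be sketched as follows. The sampling distribution of the extra boxes (uniform centers and sizes in normalized coordinates) and the subsampling used when an image has more than $N_{train}$ objects are assumptions of this sketch; the text above does not specify either.

```python
import random


def pad_boxes(gt_boxes, n_train, rng):
    """Return exactly n_train normalized (cx, cy, w, h) boxes.

    Ground-truth boxes come first; the remainder is filled with random boxes.
    If there are more ground-truth boxes than n_train, a random subset is kept
    (an assumption made for this sketch).
    """
    if len(gt_boxes) >= n_train:
        return rng.sample(gt_boxes, n_train)
    padded = list(gt_boxes)
    while len(padded) < n_train:
        cx, cy = rng.random(), rng.random()
        w, h = rng.uniform(0.01, 1.0), rng.uniform(0.01, 1.0)
        padded.append((cx, cy, w, h))
    return padded


rng = random.Random(0)
gt = [(0.5, 0.5, 0.2, 0.3), (0.3, 0.7, 0.1, 0.1)]
padded = pad_boxes(gt, 8, rng)
```

Padding to a fixed count keeps every training batch the same shape regardless of how many objects each image contains.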
　　Box corruption. We add Gaussian noise to the padded ground truth boxes. The noise scale is controlled by $α_t$ (in Eq. (1)), which adopts the monotonically decreasing cosine schedule for $α_t$ at different time steps $t$ , as proposed in [59]. Notably, the ground truth box coordinates need to be scaled as well since the signal-to-noise ratio has a significant effect on the performance of diffusion model [12]. We observe that object detection favors a relatively higher signal scaling value than image generation task [13,15,35]. More discussions are in Section 4.4.
　　Box corruption。我们将高斯噪声添加到填充的地面实况框。噪声尺度由 $α_t$ （在等式 (1) 中）控制，如[59]中所提出的，它在不同时间步长 $t$ 中采用 $α_t$ 的单调递减余弦调度。值得注意的是，由于信噪比对扩散模型的性能有重大影响[12]，因此还需要缩放地面实况框坐标。我们观察到目标检测比图像生成任务更倾向于使用相对更高的信号缩放值[13,15,35]。更多讨论在第4.4节。
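Box corruption combines a cosine noise schedule with signal scaling. The sketch below uses the cosine schedule in the form commonly attributed to [59], $\bar{\alpha}_t = f(t)/f(0)$ with $f(t) = \cos^2\left(\frac{t/T + s}{1 + s}\cdot\frac{\pi}{2}\right)$ and $s = 0.008$. Mapping normalized coordinates from $[0, 1]$ to $[-scale, scale]$, clamping after adding noise, and the value `scale=2.0` are assumptions chosen for illustration.

```python
import math
import random


def cosine_alpha_bar(t, T, s=0.008):
    """Cumulative signal level alpha_bar_t of the cosine schedule (1 at t=0, ~0 at t=T)."""
    def f(u):
        return math.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)


def corrupt_box(box, t, T, scale, rng):
    """Add schedule-controlled Gaussian noise to one normalized (cx, cy, w, h) box."""
    ab = cosine_alpha_bar(t, T)
    noisy = []
    for v in box:
        x0 = (2.0 * v - 1.0) * scale                       # [0, 1] -> [-scale, scale]
        xt = math.sqrt(ab) * x0 + math.sqrt(1.0 - ab) * rng.gauss(0.0, 1.0)
        xt = max(-scale, min(scale, xt))                   # keep the value in range
        noisy.append((xt / scale + 1.0) / 2.0)             # back to [0, 1]
    return noisy


rng = random.Random(0)
T = 1000
noisy_box = corrupt_box([0.5, 0.5, 0.2, 0.3], t=500, T=T, scale=2.0, rng=rng)
```

A larger `scale` raises the signal-to-noise ratio at every time step, because the noise term keeps unit variance while the signal term grows.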
　　Training losses. The detection decoder takes as input $N_{train}$ corrupted boxes and outputs $N_{train}$ predictions of category classification and box coordinates. We apply the set prediction loss [10, 81, 102] on the set of $N_{train}$ predictions. We assign multiple predictions to each ground truth by selecting the top $k$ predictions with the least cost by an optimal transport assignment method [16, 22, 23, 90].
　　训练损失。检测解码器将 $N_{train}$ 个被加噪的框作为输入，并输出 $N_{train}$ 个类别分类和框坐标的预测。我们在这 $N_{train}$ 个预测构成的集合上应用集合预测损失[10, 81, 102]。我们通过最优传输分配方法[16、22、23、90]选择成本最低的前 $k$ 个预测，为每个真实框分配多个预测。
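The top-$k$ selection step can be illustrated on a precomputed cost matrix. This is a deliberately simplified sketch: a full optimal transport assignment would also resolve predictions claimed by several ground truths and may pick $k$ per object, both of which are omitted here. The cost values are made up.

```python
def topk_assign(cost, k):
    """For each ground truth, return the indices of its k lowest-cost predictions.

    cost[g][p] is the matching cost between ground truth g and prediction p.
    """
    assignments = []
    for row in cost:
        order = sorted(range(len(row)), key=lambda p: row[p])
        assignments.append(order[:k])
    return assignments


# Two ground-truth boxes, four predictions (illustrative costs).
cost = [
    [0.9, 0.1, 0.5, 0.3],
    [0.2, 0.8, 0.4, 0.6],
]
matches = topk_assign(cost, k=2)
```

Assigning several predictions to each object gives more positive samples per image than a strict one-to-one matching.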

3.4. Inference（推理）

The inference procedure of DiffusionDet is a denoising sampling process from noise to object boxes. Starting from boxes sampled in Gaussian distribution, the model progressively refines its predictions, as shown in Algorithm 2.
　　DiffusionDet的推理过程是从噪声到目标框的去噪采样过程。从以高斯分布采样的框开始，模型逐渐改进其预测，如算法2所示。
[Algorithm 2: pseudo-code image not available]
　　Sampling step. In each sampling step, the random boxes or the estimated boxes from the last sampling step are sent into the detection decoder to predict the category classification and box coordinates. After obtaining the boxes of the current step, DDIM [76] is adopted to estimate the boxes for the next step. We note that sending the predicted boxes without DDIM to the next step is also an optional progressive refinement strategy. However, it brings significant deterioration, as discussed in Section 4.4.
　　采样步骤。在每个采样步骤中，随机框或来自上一个采样步骤的估计框被发送到检测解码器以预测类别分类和框坐标。在获得当前步骤的框后，采用DDIM[76]来估计下一步的框。我们注意到，将没有DDIM的预测框发送到下一步也是一种可选的渐进细化策略。然而，如第4.4节所述，它会带来显着的恶化。
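One way to picture the DDIM update is the deterministic variant (no extra noise injected): the noise implied by the current boxes and the predicted clean boxes is recovered and then re-applied at the next, lower noise level. The sketch below assumes that variant and operates on a flat list of coordinates; the function name is hypothetical.

```python
import math


def ddim_step(z_t, z0_pred, alpha_bar_t, alpha_bar_next):
    """Deterministic DDIM update from noise level alpha_bar_t to alpha_bar_next.

    Requires alpha_bar_t < 1 so that the implied noise is well defined.
    """
    out = []
    for zt, z0 in zip(z_t, z0_pred):
        # Noise that would explain z_t given the predicted clean value.
        eps = (zt - math.sqrt(alpha_bar_t) * z0) / math.sqrt(1.0 - alpha_bar_t)
        # Re-noise the prediction to the next level.
        out.append(math.sqrt(alpha_bar_next) * z0 + math.sqrt(1.0 - alpha_bar_next) * eps)
    return out


z_t = [0.3, -1.2, 0.8, 0.1]
z0_pred = [0.5, -0.5, 0.2, 0.4]
z_next = ddim_step(z_t, z0_pred, alpha_bar_t=0.4, alpha_bar_next=0.7)
```

When the next noise level is zero ($\bar{\alpha} = 1$), the update returns the predicted boxes unchanged, which corresponds to the final sampling step.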
　　Box renewal. After each sampling step, the predicted boxes can be coarsely categorized into two types, desired and undesired predictions. The desired predictions contain boxes that are properly located at corresponding objects, while the undesired ones are distributed arbitrarily. Directly sending these undesired boxes to the next sampling iteration would not bring a benefit since their distribution is not constructed by box corruption in training. To make inference better align with training, we propose the strategy of box renewal to revive these undesired boxes by replacing them with random boxes. Specifically, we first filter out undesired boxes with scores lower than a particular threshold. Then, we concatenate the remaining boxes with new random boxes sampled from a Gaussian distribution.
　　框更新。在每个采样步骤之后，预测框可以粗略地分为两类，期望预测和非期望预测。期望的预测包含正确位于相应目标的框，而不期望的预测是任意分布的。直接将这些不需要的框发送到下一个采样迭代不会带来好处，因为它们的分布不是由训练中的框损坏构建的。为了使推理更好地与训练保持一致，我们提出了框更新策略，通过用随机框替换它们来恢复这些不需要的框。具体来说，我们首先过滤掉分数低于特定阈值的不需要的框。然后，我们将剩余的框与从高斯分布中采样的新随机框连接起来。
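Box renewal amounts to a filter followed by a refill. In the sketch below, keeping the total number of boxes constant and drawing the replacements from a standard Gaussian in the noise space are assumptions; the threshold value is illustrative.

```python
import random


def renew_boxes(boxes, scores, threshold, rng):
    """Drop boxes scoring below the threshold and refill with Gaussian random boxes."""
    kept = [b for b, s in zip(boxes, scores) if s >= threshold]
    n_new = len(boxes) - len(kept)
    fresh = [tuple(rng.gauss(0.0, 1.0) for _ in range(4)) for _ in range(n_new)]
    return kept + fresh


rng = random.Random(0)
boxes = [(0.1, 0.1, 0.2, 0.2), (0.5, 0.5, 0.3, 0.3), (0.9, 0.9, 0.1, 0.1)]
scores = [0.9, 0.2, 0.6]
renewed = renew_boxes(boxes, scores, threshold=0.5, rng=rng)
```

Replacing low-score boxes with fresh noise keeps the decoder input close to the distribution it saw during training, where every non-object box was a corrupted random box.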
　　Once-for-all. Thanks to the random boxes design, we can evaluate DiffusionDet with an arbitrary number of random boxes and the number of sampling steps, which do not need to be equal to the training stage. As a comparison, previous approaches [10, 81, 102] rely on the same number of processed boxes during training and evaluation, and their detection decoders are used only once in the forward pass.
　　一次就好。得益于随机框设计，我们可以使用任意数量的随机框和采样步骤数来评估DiffusionDet，而无需等于训练阶段。作为比较，以前的方法[10、81、102]在训练和评估期间依赖于相同数量的处理框，并且它们的检测解码器在前向传递中仅使用一次。

4. Experiments（实验）

We first show the once-for-all property of DiffusionDet. Then we compare DiffusionDet with previous well-established detectors on MS-COCO [51] and LVIS [31] dataset. Finally, we provide ablation studies on the components of DiffusionDet.
　　我们首先展示DiffusionDet的一次性属性。然后我们将DiffusionDet与之前在MS-COCO[51]和 LVIS [31]数据集上表现良好的检测器进行比较。最后，我们提供了DiffusionDet组件的消融研究。
　　MS-COCO [51] dataset contains about 118K training images in the train2017 set and 5K validation images in the val2017 set. There are 80 object categories in total. We report box average precision over multiple IoU thresholds (AP), threshold 0.5 ( $AP_{50}$ ) and 0.75 ( $AP_{75}$ ).
　　MS-COCO[51]数据集包含train2017训练集中约11.8万张训练图像和val2017验证集中5千张验证图像。总共有80个对象类别。我们报告了多个IoU阈值(AP)、阈值0.5( $AP_{50}$ )和 0.75( $AP_{75}$ )的框平均精度。
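All three metrics rest on the intersection-over-union (IoU) between a predicted box and a ground-truth box. A minimal sketch for corner-format boxes follows; the threshold list reflects the COCO convention of averaging AP over IoU values from 0.50 to 0.95 in steps of 0.05.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# IoU thresholds averaged by the COCO-style AP metric.
COCO_IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

overlap = box_iou((0, 0, 2, 2), (1, 1, 3, 3))   # intersection 1, union 7
```

A prediction counts as correct at a given threshold only if its IoU with a ground-truth box of the same class reaches that threshold, so $AP_{75}$ rewards tighter localization than $AP_{50}$.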
　　LVIS v1.0 [31] dataset is a large-vocabulary object detection and instance segmentation dataset which has 100K training images and 20K validation images. LVIS shares the same source images as MS-COCO, while its annotations capture the long-tailed distribution in 1203 categories. We adopt MS-COCO style box metric AP, $AP_{50}$ and $AP_{75}$ in LVIS evaluation.
　　LVIS v1.0[31]数据集是一个包含10万训练图像和2万验证图像的大词汇目标检测和实例分割数据集。LVIS与MS-COCO共享相同的源图像，而其注释捕获了1203个类别中的长尾分布。我们在LVIS评估中采用MS-COCO风格的框度量AP、 $AP_{50}$ 和 $AP_{75}$ 。

4.1. Implementation Details（实施细节）

Training schedules. The ResNet and Swin backbone are initialized with pre-trained weights on ImageNet-1K and ImageNet-21K [14], respectively. The newly added detection decoder is initialized with Xavier init [26]. We train DiffusionDet using AdamW [55] optimizer with the initial learning rate as $2.5 × 10^{−5}$ and the weight decay as $10^{−4}$ . All models are trained with a mini-batch size 16 on 8 GPUs. For MS-COCO, the default training schedule is 450K iterations, with the learning rate divided by 10 at 350K and 420K iterations. For LVIS, the training schedule is 210K, 250K, 270K. Data augmentation strategies contain random horizontal flip, scale jitter of resizing the input images such that the shortest side is at least 480 and at most 800 pixels while the longest is at most 1333 [93], and random crop augmentations. We do not use the EMA and some strong data augmentation like MixUp [98] or Mosaic [23].
　　训练计划表。ResNet和Swin主干分别在ImageNet-1K和ImageNet-21K[14]上使用预训练的权重进行初始化。新添加的检测解码器使用Xavier init[26]进行初始化。我们使用AdamW[55]优化器训练DiffusionDet，初始学习率为 $2.5 × 10^{−5}$ ，权重衰减为 $10^{−4}$ 。所有模型都在8个GPU上使用大小为16的小批量进行训练。对于MS-COCO，默认的训练计划是450K次迭代，学习率在350K和420K次迭代时除以10。对于LVIS，训练计划是210K、250K、270K。数据增强策略包括随机水平翻转、调整输入图像大小的缩放抖动，使得最短边至少为480且最多为800像素，而最长边最多为1333[93]，以及随机裁剪增强。我们不使用EMA和一些强大的数据增强，如MixUp[98] 或Mosaic[23]。
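The MS-COCO learning-rate schedule described above is a plain step decay. The sketch below assumes that each drop takes effect from the milestone iteration onward and that there is no warm-up phase, since the text mentions none.

```python
import math


def learning_rate(iteration, base_lr=2.5e-5, milestones=(350_000, 420_000), gamma=0.1):
    """Step schedule: multiply the base rate by gamma at each milestone already reached."""
    drops = sum(1 for m in milestones if iteration >= m)
    return base_lr * gamma ** drops


lr_start = learning_rate(0)          # 2.5e-5
lr_mid = learning_rate(350_000)      # after the first division by 10
lr_end = learning_rate(449_999)      # last iteration of the 450K schedule
```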
　　Inference details. At the inference stage, the detection decoder iteratively refines the predictions from Gaussian random boxes. We select top-100 and top-300 scoring predictions for MS-COCO and LVIS, respectively. The predictions at each sampling step are ensembled together by NMS to get the final predictions.
　　推理细节。在推理阶段，检测解码器迭代地改进高斯随机框的预测。我们分别为MS-COCO和LVIS选择前100名和前300名的评分预测。NMS将每个采样步骤的预测组合在一起以获得最终预测。
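Ensembling the per-step predictions by NMS can be sketched as pooling all boxes and suppressing overlapping duplicates. The version below is class-agnostic and uses an IoU threshold of 0.5; both choices are assumptions made for brevity.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns indices of kept boxes, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(box_iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Predictions pooled from two sampling steps: the first two describe the same object.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

Because the same object is usually detected again at later steps, pooling without suppression would produce many duplicates.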

4.2. Main Properties（主要属性）

The main property of DiffusionDet is that it is trained once for all inference cases. Once the model is trained, the number of boxes and the number of sampling steps can be changed at inference, as shown in Figure 4. DiffusionDet can achieve better accuracy by using more boxes and/or more refining steps, at the cost of higher latency. Therefore, we can deploy a single DiffusionDet to multiple scenarios and obtain a desired speed-accuracy trade-off without retraining the network.
　　DiffusionDet的主要属性在于对所有推理案例进行一次训练。模型训练完成后，可以通过改变框的数量和样本步骤的数量来进行推理，如图4所示。DiffusionDet可以通过使用更多的框或(或者和)更多的精炼步骤来获得更好的准确性，但延迟代价更高。因此，我们可以将单个DiffusionDet部署到多个场景，并在不重新训练网络的情况下获得所需的速度-精度权衡。
　　Dynamic boxes. We compare DiffusionDet with DETR [10] to show the advantage of dynamic boxes. Comparisons with other detectors are in Appendix B. We reproduce DETR [10] with 300 object queries using the official code and default settings for 300 epochs training. We train DiffusionDet with 300 random boxes such that the number of candidates is consistent with DETR for a fair comparison. The evaluation is on {50, 100, 300, 500, 1000, 2000, 4000} queries or boxes.
　　动态框。我们将DiffusionDet与DETR[10]进行比较，以显示动态框的优势。与其他检测器的比较在附录B中。我们使用官方代码和300轮训练的默认设置，通过300个目标queries重现DETR[10]。我们用300个随机框训练DiffusionDet，使得候选的数量与DETR一致，以进行公平比较。评估针对{50, 100, 300, 500, 1000, 2000, 4000}个queries或框。
　　Since the learnable queries are fixed after training in the original setting of DETR, we propose a simple workaround to enable DETR to work with a different number of queries: when $N_{eval} < N_{train}$ , we directly choose $N_{eval}$ queries from the $N_{train}$ queries; when $N_{eval} > N_{train}$ , we concatenate $N_{eval} − N_{train}$ extra randomly initialized queries (a.k.a. concat random). We present results in Figure 4a. The performance of DiffusionDet increases steadily with the number of boxes used for evaluation. On the contrary, DETR has a clear performance drop when $N_{eval}$ is different from $N_{train}$ , i.e., 300. Besides, this performance drop becomes larger as the difference between $N_{eval}$ and $N_{train}$ increases. For example, when the number of boxes increases to 4000, DETR only has 26.4 AP with the concat random strategy, which is 12.4 lower than the peak value (i.e., 38.8 AP with 300 queries). As a comparison, DiffusionDet can achieve a 1.1 AP gain with 4000 boxes.
　　由于在DETR的原始设置中训练后可学习的queries是固定的，我们提出了一个简单的解决方法，使DETR能够处理不同数量的queries：当 $N_{eval} < N_{train}$ 时，我们直接从 $N_{train}$ 个queries中选择 $N_{eval}$ 个queries；当 $N_{eval} > N_{train}$ 时，我们连接额外的 $N_{eval} − N_{train}$ 个随机初始化的queries（又名 concat random）。我们在图4a中展示了结果。DiffusionDet的性能随着用于评估的框数的增加而稳步提高。相反，当 $N_{eval}$ 与 $N_{train}$ （即300）不同时，DETR有明显的性能下降。此外，当 $N_{eval}$ 与 $N_{train}$ 之间的差异增加时，这种性能下降变得更大。例如，当框数增加到4000时，采用concat random策略的DETR只有26.4 AP，比峰值（即300个queries时的38.8 AP）低12.4。作为对比，DiffusionDet在4000个框上可以实现1.1的AP增益。
[Figure 4: image not available]
　　We also implement another method for DETR when $N_{eval} > N_{train}$ , cloning existing $N_{train}$ queries up to $N_{eval}$ (a.k.a. clone). We observe that concat random strategy consistently performs better than the clone. It is reasonable because the cloned queries will produce similar detection results as the original queries. In contrast, random queries introduce more diversity to the detection results.
　　当 $N_{eval} > N_{train}$ 时，我们还为DETR实现了另一种方法，将现有的 $N_{train}$ 个queries克隆到 $N_{eval}$ （也称为克隆）。我们观察到concat随机策略始终比克隆策略表现更好。这是合理的，因为克隆queries将产生与原始queries相似的检测结果。相比之下，随机queries为检测结果引入了更多的多样性。
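The query-resizing workarounds for DETR (subset selection, concat random, clone) can be written as one small helper. The sketch assumes that the first $N_{eval}$ queries are kept when shrinking and that cloning cycles through the trained queries in order; the text does not say which queries are chosen, so both are illustrative choices.

```python
import random


def resize_queries(queries, n_eval, rng, strategy="concat_random"):
    """Return n_eval query vectors derived from a trained set of len(queries) vectors."""
    n_train = len(queries)
    if n_eval <= n_train:
        return queries[:n_eval]                       # keep a subset
    extra = n_eval - n_train
    if strategy == "concat_random":
        dim = len(queries[0])
        new = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(extra)]
    elif strategy == "clone":
        new = [list(queries[i % n_train]) for i in range(extra)]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return queries + new


rng = random.Random(0)
trained = [[float(i), float(-i)] for i in range(3)]   # three toy 2-d queries
grown = resize_queries(trained, 5, rng, strategy="concat_random")
cloned = resize_queries(trained, 7, rng, strategy="clone")
```

Cloned queries are exact copies of trained ones, which matches the observation above that they yield near-duplicate detections, whereas random queries add diversity.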
　　Progressive refinement. The performance of DiffusionDet can be improved not only by increasing the number of random boxes but by iterating more steps. We evaluate DiffusionDet with 100, 300, and 500 random boxes by increasing their iterative steps from 1 to 9. The results are presented in Figure 4b. We see that DiffusionDet with these three settings all have steady performance gains with more refining steps. Besides, DiffusionDet with fewer random boxes tends to have a larger gain with refinement. For example, the AP of DiffusionDet instance with 100 random boxes increases from 42.4 (1 step) to 45.9 (9 steps), an absolute 3.5 AP improvement. This accuracy performance is comparable with 45.0 (1 step) of 300 random boxes and 45.5 (1 step) of 500, showing that high accuracy in DiffusionDet could be achieved by either increasing the number of proposal boxes or the iterative steps.
　　渐进式细化。DiffusionDet的性能不仅可以通过增加随机框的数量来提高，还可以通过迭代更多的步骤来提高。我们通过将迭代步骤从1增加到 9，使用100、300和500个随机框评估DiffusionDet。结果如图4b所示。我们看到具有这三种设置的DiffusionDet都具有稳定的性能提升和更多的优化步骤。此外，具有较少随机框的DiffusionDet往往会随着细化而获得更大的增益。例如，具有100个随机框的DiffusionDet实例的AP从42.4（1 步）增加到45.9（9 步），完全提高了3.5AP。这种精度性能与300个随机框的 45.0（1 步）和500个随机框的 45.5（1 步）相当，表明DiffusionDet的高精度可以通过增加proposal框的数量或迭代步骤来实现。
　　In comparison, we find that previous approaches [10, 81, 102] do not have this refinement property. They can use the detection decoder only once. Using two or more iteration steps will drop the performance. More detailed comparison can be found in Appendix C.
　　相比之下，我们发现以前的方法[10、81、102]没有这种细化特性。他们只能使用一次检测解码器。使用两个或更多迭代步骤会降低性能。更详细的比较可以在附录 C 中找到。

4.3. Benchmarking on Detection Datasets（检测数据集的基准测试）

We compare DiffusionDet with previous detectors [7,10, 50, 66, 81, 102] on MS-COCO and LVIS dataset. We adopt 500 boxes for both training and inference in this subsection. More detailed experimental settings are in Appendix D.
　　我们在MS-COCO和LVIS数据集上将DiffusionDet与之前的检测器[7,10, 50, 66, 81, 102]进行比较。在本小节中，我们采用500个框进行训练和推理。更详细的实验设置在附录D中。
　　MS-COCO. In Table 1 we compare the object detection performance of DiffusionDet with previous detectors on MS-COCO. DiffusionDet without refinement (i.e., step 1) achieves 45.5 AP with ResNet-50 backbone, outperforming well-established methods such as Faster R-CNN, RetinaNet, DETR and Sparse R-CNN by a non-trivial margin. Besides, DiffusionDet can make its superiority more remarkable when using more iterative refinements. For example, when using ResNet-50 as the backbone, DiffusionDet outperforms Sparse R-CNN by 0.5 AP (45.5 vs. 45.0) with a single step and by 1.2 AP (46.2 vs. 45.0) with 8 steps.
　　MS-COCO。在表1中，我们比较了DiffusionDet与之前的检测器在MS-COCO上的目标检测性能。没有细化的DiffusionDet（即1步）使用ResNet-50主干实现45.5 AP，以不小的优势优于Faster R-CNN、RetinaNet、DETR和Sparse R-CNN等成熟方法。此外，DiffusionDet在使用更多的迭代细化时可以使其优势更加显著。例如，当使用ResNet-50作为主干时，DiffusionDet在单步时比Sparse R-CNN高0.5 AP（45.5对45.0），在8步时高1.2 AP（46.2对45.0）。
　　DiffusionDet shows steady improvement when the backbone size scales up. DiffusionDet with ResNet-101 achieves 46.6 AP (1 step) and 47.1 AP (8 steps). When using ImageNet-21k pre-trained Swin-Base [54] as the backbone, DiffusionDet obtains 52.3 AP for a single step and 52.8 AP with 8 steps, outperforming strong baselines such as Cascade R-CNN and Sparse R-CNN.
　　当骨干尺寸扩大时，DiffusionDet表现出稳定的改进。骨干网络为ResNet-101的DiffusionDet达到46.6 AP（1 步）和47.1 AP（8 步）。当使用ImageNet-21k预训练的Swin-Base[54]作为主干时，DiffusionDet单步获得52.3 AP，8 步获得52.8 AP，优于Cascade R-CNN和Sparse R-CNN等强基线。
　　LVIS v1.0. We compare the results on the more challenging LVIS dataset in Table 2. We reproduce Faster R-CNN and Cascade R-CNN based on detectron2 [93], and Sparse R-CNN with its original code. We first reproduce Faster R-CNN and Cascade R-CNN using the default settings of detectron2, achieving 22.5/24.8 and 26.3/28.8 AP (marked with † in Table 2) with ResNet-50/101 backbones, respectively. Further, we boost their performance using the federated loss in [100]. Since images in LVIS are annotated in a federated way [31], the negative categories are sparsely annotated, which deteriorates the training gradients, especially for rare classes [83]. Federated loss is proposed to mitigate this issue by sampling, for each training image, a subset S of classes that includes all positive annotations and a random subset of negative ones. Following [100], we choose |S| = 50 in all experiments. Faster R-CNN and Cascade R-CNN gain about 3 AP with federated loss. All following comparisons are based on this loss.
　　LVIS v1.0。我们在表2中比较了在更具挑战性的LVIS数据集上的结果。我们基于detectron2[93]重现了Faster R-CNN和Cascade R-CNN，并使用其原始代码重现了Sparse R-CNN。我们首先使用detectron2的默认设置重现Faster R-CNN和Cascade R-CNN，使用ResNet-50/101骨干分别实现22.5/24.8和26.3/28.8的AP（表2中带有 †）。此外，我们使用[100]中的联邦损失来提高它们的性能。由于LVIS中的图像是以联邦方式标注的[31]，负类别标注稀疏，这会恶化训练梯度，特别是对于稀有类别[83]。联邦损失通过为每个训练图像采样一个类别子集 S 来缓解这个问题，该子集包括所有正标注和负标注的一个随机子集。遵循[100]，我们在所有实验中选择 |S| = 50。Faster R-CNN和Cascade R-CNN通过联邦损失获得大约3 AP的提升。以下所有比较均基于此损失。
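　　The class-subset sampling described above can be sketched as follows. This is a minimal illustration, not the implementation from [100]: the function name `sample_federated_classes` is hypothetical, and negatives are drawn uniformly at random, which is an assumption (the original loss may weight negatives differently). LVIS v1.0 has 1203 categories.

```python
import random


def sample_federated_classes(positive_classes, num_classes, subset_size=50, rng=None):
    """Return the class subset S over which the loss is computed for one image.

    S always contains every positive class of the image. The remaining
    slots are filled with negative classes drawn uniformly at random
    (uniform sampling is an assumption of this sketch).
    """
    rng = rng or random.Random()
    positives = sorted(set(positive_classes))
    if len(positives) >= subset_size:
        return positives
    positive_set = set(positives)
    negatives = [c for c in range(num_classes) if c not in positive_set]
    num_negatives = min(subset_size - len(positives), len(negatives))
    return sorted(positives + rng.sample(negatives, num_negatives))


# One image annotated with classes 3, 17 and 900 out of 1203 LVIS categories.
subset = sample_federated_classes([3, 17, 17, 900], num_classes=1203, rng=random.Random(0))
print(len(subset))
```

　　Only the classification logits of the classes in the returned subset would contribute to the loss for that image, so unannotated categories outside S are not penalized as negatives.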
　　We see that DiffusionDet attains remarkable gains using more refinement steps, with both small and large backbones. Moreover, we note that refinement brings more gains on LVIS than on MS-COCO. For example, its performance increases from 45.5 to 46.2 (+0.7 AP) on MS-COCO but from 30.4 to 31.9 (+1.5 AP) on LVIS, which demonstrates that our refining strategy becomes more helpful on a more challenging benchmark.
　　我们看到DiffusionDet使用更多的细化步骤获得了显着的收益，同时具有小型和大型主干。此外，我们注意到与MS-COCO相比，细化在LVIS上带来了更多收益。例如，它在MS-COCO上的性能从45.5增加到 46.2（+ 0.7 AP），而在LVIS上从30.4增加到31.9（+1.5 AP），这表明我们的精炼策略将对更具挑战性的基准更有帮助。

4.4. Ablation Study（消融研究）

We conduct ablation experiments on MS-COCO to study DiffusionDet in detail. All experiments use ResNet-50 with FPN as the backbone and 300 boxes for training and inference without further specification.
　　我们在MS-COCO上进行消融实验以详细研究DiffusionDet。所有实验都使用带有FPN的ResNet-50作为主干和300个框用于训练和推理，没有进一步说明。
　　Signal scaling. The signal scaling factor controls the signal-to-noise ratio (SNR) of the diffusion process. We study the influence of scaling factors in Table 3a. Results demonstrate that the scaling factor of 2.0 achieves optimal AP performance, outperforming the standard value of 1.0 in image generation task [13, 35] and 0.1 used for panoptic segmentation [12]. We attribute this to the fact that a box has only four representation parameters, i.e., center coordinates $(c_x, c_y)$ and box size $(w, h)$, which is coarsely analogous to an image with only four pixels in image generation. The box representation is more fragile than a dense representation, e.g., the 512 × 512 mask presentation in panoptic segmentation [13]. Therefore, DiffusionDet prefers an easier training objective with an increased signal-to-noise ratio compared to image generation and panoptic segmentation.
　　信号缩放。信号比例因子控制扩散过程的信噪比 (SNR)。我们在表3a中研究了比例因子的影响。结果表明，2.0的缩放因子实现了最佳AP性能，优于图像生成任务中的标准值1.0[13、35]和用于全景分割的标准值0.1[12]。我们认为这是因为一个框只有四个表示参数，即中心坐标 $(c_x, c_y)$ 和框大小 $(w, h)$ ，这粗略地类似于图像生成中只有四个像素的图像。框表示比密集表示更脆弱，例如全景分割中的512×512掩码表示[13]。因此，与图像生成和全景分割相比，DiffusionDet更喜欢具有更高信噪比的更简单的训练目标。
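　　To make the role of the scaling factor concrete, here is a small sketch of how a normalized box could be noised at timestep t. It is an illustration under stated assumptions: the cosine noise schedule, the 1000 timesteps, and the helper names `cosine_alpha_bar`, `diffuse_box` and `snr_gain` are not taken from the text above. Box coordinates in [0, 1] are mapped to [-scale, scale] before Gaussian noise is added, so at a fixed t the SNR grows with the square of the scale.

```python
import math
import random


def cosine_alpha_bar(t, num_steps=1000, s=0.008):
    """Cumulative signal coefficient for a cosine schedule (assumed schedule)."""
    def f(u):
        return math.cos((u / num_steps + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)


def diffuse_box(box, t, scale=2.0, rng=None):
    """Noise one normalized (cx, cy, w, h) box at timestep t."""
    rng = rng or random.Random()
    a = cosine_alpha_bar(t)
    noisy = []
    for v in box:
        signal = (2.0 * v - 1.0) * scale  # map [0, 1] to [-scale, scale]
        noise = rng.gauss(0.0, 1.0)
        noisy.append(math.sqrt(a) * signal + math.sqrt(1.0 - a) * noise)
    return noisy


def snr_gain(scale, reference=1.0):
    """SNR at a fixed t relative to the SNR obtained with `reference` scale."""
    return (scale / reference) ** 2


print(diffuse_box([0.5, 0.5, 0.2, 0.2], t=500, rng=random.Random(0)))
print(snr_gain(2.0))
```

　　Under these assumptions, a scale of 2.0 gives four times the SNR of the image-generation default 1.0 at every timestep, which is one way to read "an easier training objective".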
　　GT boxes padding strategy. As introduced in Section 3.3, we need to pad additional boxes to the original ground truth boxes such that each image has the same number of boxes. We study different padding strategies in Table 3b, including (1) repeating the original ground truth boxes evenly until the total number reaches the pre-defined value $N_{train}$; (2) padding random boxes that follow a Gaussian distribution; (3) padding random boxes that follow a uniform distribution; (4) padding boxes that have the same size as the whole image, which is the default initialization of learnable boxes in [81]. Concatenating Gaussian random boxes works best for DiffusionDet. We use this padding strategy as default.
　　GT框填充策略。如第3.3节所述，我们需要将额外的框填充到原始地面真值框，以便每个图像具有相同数量的框。我们在表3b中研究了不同的填充策略，包括 (1) 均匀地重复原始ground truth框，直到总数达到预定义值 $N_{train}$ ； (2) 填充服从高斯分布的随机框； (3) 填充服从均匀分布的随机框； (4) 填充与整个图像大小相同的框，这是[81]中可学习框的默认初始化。串联高斯随机框最适合DiffusionDet。我们默认使用这种填充策略。
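　　The default strategy (concatenating Gaussian random boxes) might look like the sketch below. The function name `pad_gt_boxes`, the mean 0.5 and standard deviation 1/6 of the Gaussian, the clamping to the image, and the subsampling branch for images with more than $N_{train}$ ground truth boxes are all assumptions made for illustration.

```python
import random


def pad_gt_boxes(gt_boxes, n_train, rng=None, mean=0.5, std=1.0 / 6.0):
    """Pad (or subsample) normalized (cx, cy, w, h) boxes to exactly n_train."""
    rng = rng or random.Random()
    if len(gt_boxes) >= n_train:
        # Assumed behaviour: randomly keep n_train of the ground truth boxes.
        return [list(b) for b in rng.sample(list(gt_boxes), n_train)]

    def clamp(value, low):
        return min(max(value, low), 1.0)

    padded = [list(b) for b in gt_boxes]
    while len(padded) < n_train:
        cx, cy, w, h = (rng.gauss(mean, std) for _ in range(4))
        # Keep centers inside the image and sizes strictly positive.
        padded.append([clamp(cx, 0.0), clamp(cy, 0.0), clamp(w, 1e-4), clamp(h, 1e-4)])
    return padded


gt = [[0.30, 0.40, 0.20, 0.10], [0.70, 0.60, 0.25, 0.35]]
boxes = pad_gt_boxes(gt, n_train=300, rng=random.Random(0))
print(len(boxes))
```

　　The original boxes stay at the front of the list, so the padded set can then be noised as a fixed-size batch.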
　　Sampling strategy. We compare different sampling strategies in Table 3c. When evaluating DiffusionDet without DDIM, we directly take the output prediction of the current step as the input of the next step. We find that the AP of DiffusionDet degrades with more evaluation steps when neither DDIM nor box renewal is adopted. Besides, using only DDIM or only box renewal brings slight benefits at 4 steps and no further improvement with more steps. In contrast, DiffusionDet attains remarkable gains when equipped with both DDIM and renewal. These experiments together verify the necessity of both DDIM and box renewal in the sampling step.
　　抽样策略。我们在表3c中比较了不同的采样策略。在评估不使用DDIM的DiffusionDet时，我们直接将当前步骤的输出预测作为下一步的输入。我们发现，当既不采用DDIM也不采用框更新时，DiffusionDet的AP会随着评估步骤的增加而降低。此外，仅使用DDIM或box renewal 在4步时会带来轻微的好处，而在使用更多步时不会带来进一步的改进。此外，我们的DiffusionDet在配备DDIM和更新时获得了显着的收益。这些实验共同验证了采样步骤中DDIM和框更新的必要性。
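　　For readers unfamiliar with DDIM, the update applied to the boxes between two refinement steps can be sketched as below. This shows the generic deterministic DDIM rule (eta = 0) applied element-wise to one box; the function name `ddim_step` and the choice of eta = 0 are assumptions, and the detection decoder that produces `x0_pred` is not shown.

```python
import math


def ddim_step(x_t, x0_pred, alpha_bar_t, alpha_bar_prev):
    """Deterministic DDIM update (eta = 0) for one box.

    x_t:            current noisy box (four scaled coordinates)
    x0_pred:        clean box predicted by the detector at this step
    alpha_bar_t:    cumulative signal coefficient at the current step (< 1)
    alpha_bar_prev: cumulative signal coefficient at the next, less noisy step
    """
    x_prev = []
    for xt, x0 in zip(x_t, x0_pred):
        # Noise implied by the current sample and the predicted clean box.
        eps = (xt - math.sqrt(alpha_bar_t) * x0) / math.sqrt(1.0 - alpha_bar_t)
        # Re-noise the prediction to the lower noise level of the next step.
        x_prev.append(math.sqrt(alpha_bar_prev) * x0 + math.sqrt(1.0 - alpha_bar_prev) * eps)
    return x_prev


print(ddim_step([0.9, -0.4, 1.1, 0.2], [0.4, -0.2, 1.0, 0.6], alpha_bar_t=0.2, alpha_bar_prev=0.7))
```

　　Without this step, the prediction of one iteration is fed directly into the next, which corresponds to the "no DDIM" rows of the ablation.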
　　Box renewal threshold. As discussed in Section 3.4, the box renewal strategy is proposed to reactivate the predictions whose scores are lower than a specific threshold. Table 3d shows the effect of the score threshold for box renewal. A threshold of 0.0 means that no box renewal is used. The results demonstrate that a threshold of 0.5 performs slightly better than the other values.
　　框更新阈值。如第3.4节所述，提出了框更新策略以重新激活分数低于特定阈值的预测。表3d显示了框更新的分数阈值的影响。阈值0.0表示不使用框更新。结果表明0.5的阈值比其他阈值表现稍好。
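　　A minimal sketch of box renewal, assuming each box has a single confidence score and that replacements are drawn from a standard Gaussian in the scaled coordinate space. The function name `renew_boxes` and these sampling details are assumptions for illustration.

```python
import random


def renew_boxes(boxes, scores, threshold=0.5, rng=None):
    """Keep boxes scoring at least `threshold` and refill the rest with noise.

    Returns the renewed list (same length as `boxes`) and the number of
    boxes that were kept.
    """
    rng = rng or random.Random()
    kept = [list(b) for b, s in zip(boxes, scores) if s >= threshold]
    num_new = len(boxes) - len(kept)
    # Assumed: fresh boxes are sampled from a standard Gaussian.
    fresh = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(num_new)]
    return kept + fresh, len(kept)


boxes = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.5, 0.1, 0.1], [0.8, 0.3, 0.2, 0.6]]
scores = [0.9, 0.1, 0.6]
renewed, num_kept = renew_boxes(boxes, scores, threshold=0.5, rng=random.Random(0))
print(num_kept, len(renewed))
```

　　With a threshold of 0.0 every non-negative score passes the filter, so no box is replaced, matching the "no box renewal" setting in Table 3d.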
　　Matching between $N_{train}$ and $N_{eval}$. As discussed in Sec. 4.2, DiffusionDet has the appealing property that it can be evaluated with an arbitrary number of random boxes. To study how the number of training boxes affects inference performance, we train DiffusionDet with $N_{train}$ ∈ {100, 300, 500} random boxes separately and then evaluate each of these models with $N_{eval}$ ∈ {100, 300, 500, 1000}. The results are summarized in Table 3e. First, no matter how many random boxes DiffusionDet uses for training, the accuracy increases steadily with $N_{eval}$ until the saturation point at around 2000 random boxes. Second, DiffusionDet tends to perform better when $N_{train}$ and $N_{eval}$ match each other. For example, DiffusionDet trained with $N_{train}$ = 100 boxes behaves better than $N_{train}$ = 300 and 500 when $N_{eval}$ = 100.
　　 $N_{train}$ 和 $N_{eval}$ 之间的匹配。如第4.2节所述，DiffusionDet具有使用任意数量的随机框进行评估的吸引人的特性。为了研究训练框的数量如何影响推理性能，我们分别用 $N_{train} ∈ \{100, 300, 500\}$ 个随机框训练DiffusionDet，然后用 $N_{eval} ∈ \{100, 300, 500, 1000\}$ 评估这些模型中的每一个。结果总结在表3e中。首先，无论DiffusionDet使用多少个随机框进行训练，准确度都会随着 $N_{eval}$ 的增加而稳定增加，直到在大约2000个随机框处达到饱和点。其次，当 $N_{train}$ 和 $N_{eval}$ 相互匹配时，DiffusionDet往往表现更好。例如，当 $N_{eval} = 100$ 时，使用 $N_{train} = 100$ 个框训练的DiffusionDet表现优于 $N_{train}$ = 300 和 500。
　　Accuracy vs. speed. We test the inference speed of DiffusionDet in Table 3f. The run time is evaluated on a single NVIDIA A100 GPU with a mini-batch size of 1. We experiment with multiple choices of $N_{train}$ ∈ {100, 300} and keep $N_{eval}$ the same as the corresponding $N_{train}$. We see that increasing $N_{train}$ from 100 to 300 brings a 2.5 AP gain at a negligible latency cost (31.6 FPS vs. 31.3 FPS). We also test the inference speed of 4 steps when $N_{train}$ = 300. We observe that more refinement steps take more inference time and result in lower FPS. Increasing the number of refinement steps from 1 to 4 provides a 0.8 AP gain but makes the detector slower. For reference, we compare DiffusionDet against Sparse R-CNN with 300 proposals, and DiffusionDet with 300 boxes has an FPS very close to that of Sparse R-CNN.
　　准确性与速度。我们在表3f中测试了DiffusionDet的推理速度。运行时间是在单个NVIDIA A100 GPU上评估的，小批量大小为1。我们用 $N_{train} ∈ \{100, 300\}$ 的多种选择进行实验，并保持 $N_{eval}$ 与相应的 $N_{train}$ 相同。我们看到将 $N_{train}$ 从100增加到300会带来2.5 AP增益，而延迟成本可以忽略不计（31.6 FPS 对 31.3 FPS）。我们还测试了当 $N_{train}$ = 300时4步的推理速度。我们观察到，更多的细化步骤会花费更多的推理时间，并导致FPS降低。将细化步骤从1增加到4可以提供0.8 AP增益，但会使检测器的速度变慢。作为参考，我们将DiffusionDet与具有300个proposal的Sparse R-CNN进行了比较，具有300个框的DiffusionDet与Sparse R-CNN有非常接近的FPS。
　　Random seed. Since DiffusionDet is given random boxes as input at the start of inference, one may ask whether there is a large performance variance across different random seeds. We evaluate the stability of DiffusionDet by training five models independently with strictly the same configurations except for random seed. Then, we evaluate each model instance with ten different random seeds to measure the distribution of performance, inspired by [61, 88]. As shown in Figure 5, most evaluation results are distributed closely to 45.0 AP. Besides, the mean values are all above 45.0 AP, and the performance differences among different model instances are marginal, which demonstrates that DiffusionDet is robust to the random boxes and produces reliable results.
　　随机种子。由于DiffusionDet在推理开始时以随机框作为输入，人们可能会问，在不同的随机种子之间是否存在很大的性能差异。我们通过独立训练5个除随机种子外配置完全相同的模型来评估DiffusionDet的稳定性。然后，受[61,88]的启发，我们用10个不同的随机种子来评估每个模型实例，以衡量性能的分布。如图5所示，大多数评价结果都分布在45.0 AP附近。此外，平均值均在45.0 AP以上，不同模型实例之间的性能差异很小，说明DiffusionDet对随机框具有鲁棒性，并产生可靠的结果。

5. Conclusion and Future Work（结论和未来工作）

In this work, we propose a novel detection paradigm, DiffusionDet, by viewing object detection as a denoising diffusion process from noisy boxes to object boxes. Our noise-to-box pipeline has several appealing properties, including dynamic box and progressive refinement, enabling us to use the same network parameters to obtain the desired speed-accuracy trade-off without re-training the model. Experiments on standard detection benchmarks show that DiffusionDet achieves favorable performance compared to well-established detectors.
　　在这项工作中，我们提出了一种新的检测范式，DiffusionDet，通过将目标检测视为一个从噪声框到目标框的去噪扩散过程。我们的噪声到框管道有几个吸引人的特性，包括动态框和逐步细化，使我们能够使用相同的网络参数来获得所需的速度-精度的权衡，而不需要重新训练模型。在标准检测基准上的实验表明，与成熟的探测器相比，DiffusionDet取得了良好的性能。
　　To further explore the potential of diffusion models for object-level recognition tasks, several directions for future work would be beneficial. One is to apply DiffusionDet to video-level tasks, for example, object tracking and action recognition. Another is to extend DiffusionDet from closed-world to open-world or open-vocabulary object detection.
　　为了进一步探索扩散模型在解决目标级识别任务中的潜力，未来的一些工作是有益的。尝试将DiffusionDet应用于视频任务，例如目标跟踪和动作识别。另一种方法是将DiffusionDet从封闭世界扩展到开放世界或开放词汇表目标检测。