[Translation] QueryDet: Cascaded Sparse Query for Accelerating High-Resolution Small Object Detection

Paper: https://arxiv.org/pdf/2103.09136.pdf
Project: https://github.com/ChenhongyiYang/QueryDet-PyTorch

Abstract

While general object detection with deep learning has achieved great success in the past few years, the performance and efficiency of detecting small objects are far from satisfactory. The most common and effective way to promote small object detection is to use high-resolution images or feature maps. However, both approaches induce costly computation, since the computational cost grows quadratically as the size of images and features increases. To get the best of both worlds, we propose QueryDet, which uses a novel query mechanism to accelerate the inference speed of feature-pyramid-based object detectors. The pipeline comprises two steps: it first predicts the coarse locations of small objects on low-resolution features, and then computes accurate detection results on high-resolution features, sparsely guided by those coarse positions. In this way, we not only harvest the benefit of high-resolution feature maps but also avoid useless computation on background areas. On the popular COCO dataset, the proposed method improves the detection mAP by 1.0 and mAP-small by 2.0, and the high-resolution inference speed is improved to 3.0× on average. On the VisDrone dataset, which contains more small objects, we set a new state of the art while gaining a 2.3× high-resolution acceleration on average. Code is available at https://github.com/ChenhongyiYang/QueryDet-PyTorch.

1. Introduction

With the recent advances of deep learning [15, 53], visual object detection has achieved massive improvements in both performance and speed [3, 12, 26, 27, 29, 37, 39, 49], and it has become the foundation for widespread applications such as autonomous driving and remote sensing. However, detecting small objects is still a challenging problem, with a large performance gap between small and normal-scale objects. Taking RetinaNet [27], one of the state-of-the-art object detectors, as an example: it achieves 44.1 and 51.2 mAP on medium- and large-sized objects, but only 24.1 mAP on small objects on the COCO [28] test-dev set. Such degradation is mainly caused by three factors: 1) the features that highlight small objects are extinguished by the down-sampling operations in the backbone of convolutional neural networks (CNNs), so the features of small objects are often contaminated by background noise; 2) the receptive field on low-resolution features may not match the size of small objects, as pointed out in [25]; 3) localizing small objects is harder than localizing large objects, because a small perturbation of the bounding box causes a significant disturbance in the Intersection over Union (IoU) metric.
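To make the third factor concrete, here is a small illustrative computation (ours, not from the paper) showing that the same 3-pixel shift is far more damaging to the IoU of a small box than to that of a large one:

```python
def iou(a, b):
    # boxes as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

shift = lambda r, d: (r[0] + d, r[1] + d, r[2] + d, r[3] + d)
small = (0, 0, 10, 10)      # a 10x10 small object
large = (0, 0, 100, 100)    # a 100x100 large object

print(iou(small, shift(small, 3)))  # ~0.32: below the usual 0.5 threshold
print(iou(large, shift(large, 3)))  # ~0.89: still a comfortable match
```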
Small object detection can be improved by scaling up the input image or by reducing the down-sampling rate of the CNN to maintain high-resolution features, as both increase the effective resolution of the resulting feature maps. However, merely increasing the resolution of feature maps incurs considerable computation cost. Several works [1, 26, 29] proposed to build a feature pyramid that reuses the multi-scale feature maps from different layers of a CNN to address this issue. Objects of various scales are handled at different levels: large objects tend to be detected on high-level features, while small objects are usually detected on low-level ones. The feature pyramid paradigm saves the cost of maintaining high-resolution feature maps from shallow to deep layers in the backbone. Nevertheless, the computation of the detection heads on low-level features is still enormous. For example, adding an extra pyramid level $P_2$ to RetinaNet brings about 300% more computation (FLOPs) and memory cost in the detection head, severely lowering the inference speed from 13.6 FPS to 4.85 FPS on an NVIDIA 2080Ti GPU.
In this paper, we propose a simple and effective method, QueryDet, to save the detection head's computation while promoting the performance on small objects. The motivation comes from two key observations: 1) the computation on low-level features is highly redundant. In most cases, the spatial distribution of small objects is very sparse: they occupy only a small portion of the high-resolution feature maps, so a large amount of computation is wasted. 2) The feature pyramids are highly structured. Although we cannot accurately detect small objects on low-resolution feature maps, we can still infer their existence and rough locations with high confidence.
A natural way to exploit these two observations is to apply the detection head only to the spatial locations of small objects. This strategy requires locating the rough positions of small objects at low cost and computing sparsely on the desired feature map. In this work, we present QueryDet, which is based on a novel query mechanism, Cascade Sparse Query (CSQ), as illustrated in Fig. 1. We recursively predict the rough locations of small objects (queries) on lower-resolution feature maps and use them to guide the computation on higher-resolution feature maps. With the help of sparse convolution [13, 55], we significantly reduce the computation cost of the detection heads on low-level features while keeping the detection accuracy for small objects. Note that our approach saves computation spatially, so it is compatible with other accelerating methods such as light-weight backbones [44], model pruning [16], model quantization [51], and knowledge distillation [5].
[Figure 1: Illustration of the Cascade Sparse Query (CSQ) mechanism]
We evaluate our QueryDet on the COCO detection benchmark [28] and on VisDrone [59], a challenging dataset that contains a large number of small objects. We show that our method significantly accelerates inference while improving detection performance. In summary, we make two main contributions:
• We propose QueryDet, in which a simple and effective Cascade Sparse Query (CSQ) mechanism is designed. It reduces the computation cost of all feature-pyramid-based object detectors. Our method improves the detection performance on small objects by effectively utilizing high-resolution features while keeping a fast inference speed.
• On COCO, QueryDet improves the RetinaNet baseline by 1.1 AP and 2.0 $\text{AP}_S$ by utilizing high-resolution features, and the high-resolution detection speed is improved by 3.0× on average when CSQ is adopted. On VisDrone, we advance the state-of-the-art detection mAP and improve the high-resolution detection speed by 2.3× on average.

2. Related Work

Object Detection. Deep-learning-based object detection can mainly be divided into two streams: two-stage detectors [2, 11, 12, 26, 39] and one-stage detectors [17, 29, 35–37, 58] pioneered by YOLO. Generally speaking, two-stage methods tend to be more accurate than one-stage methods because they use the RoIAlign operation [14] to explicitly align an object's features. However, the performance gap between the two streams has narrowed recently. RetinaNet [27] is the first one-stage anchor-based detector that matches the performance of two-stage detectors. It uses a feature pyramid network (FPN) [26] for multi-scale detection and proposes FocalLoss to handle the foreground-background imbalance in dense training. Recently, one-stage anchor-free detectors [7, 21, 23, 45, 56] have attracted academic attention because of their simplicity. In this paper, we implement QueryDet on top of RetinaNet and FCOS [45] to show its effectiveness and generalization ability.
Small Object Recognition. Small object recognition tasks, such as detection and segmentation, are challenging computer vision problems because of low-resolution features. A large number of works have been proposed to tackle this problem; they can mainly be categorized into four types: 1) increasing the resolution of input features [1, 10, 22, 24, 26, 29, 41, 48]; 2) oversampling and strong data augmentation [20, 29, 60]; 3) incorporating context information [4, 6, 57]; and 4) scale-aware training [25, 26, 42, 43].
Spatial Redundancy. Several methods have used sparse computation to exploit the spatial redundancy of CNNs in different ways to save computation. Perforated-CNN [9] generates masks with different deterministic sampling methods. Dynamic Convolution [47] uses a small gating network to predict pixel masks, and [54] proposes a stochastic sampling and interpolation network; both adopt Gumbel-Softmax [18] and a sparsity loss to train the sparse masks. The Spatially Adaptive Computation Time (SACT) approach [8] predicts a halting score for each spatial position, supervised by a proposed ponder cost and the task-specific loss. SBNet [38] adopts an offline road map or mask to filter out ignored regions. Unlike these methods, our QueryDet focuses on objects' scale variation and simply adopts the provided ground-truth bounding boxes for supervision. Another stream of work adopts a two-stage glance-and-focus framework for adaptive inference: [50] selects small regions from the input image by reinforcement learning and processes them with a dynamic decision process, and [46] adopts a similar idea for object detection. A work similar to our QueryDet is AutoFocus [33], which first predicts and crops regions of interest at coarse scales and then scales them to a larger resolution for final prediction. Compared with AutoFocus, QueryDet is more efficient since the "focus" operation is conducted on feature pyramids rather than image pyramids, which removes redundant computation in the backbone.

3. Method

In this section, we describe QueryDet for accurate and fast small object detection. We illustrate our approach based on RetinaNet [27], a popular anchor-based dense detector. Note that our approach is not limited to RetinaNet: it can be applied to any one-stage detector and to the region proposal network (RPN) of two-stage detectors with FPN. We first revisit RetinaNet and analyze the computational cost distribution of its components. We then introduce how the proposed Cascade Sparse Query saves computation during inference. Finally, we present the training details.

3.1. Revisiting RetinaNet

RetinaNet has two parts: a backbone network with FPN that outputs multi-scale feature maps, and two detection heads for classification and regression. When the input image size is $H \times W$, the FPN features are $\mathcal{P}=\{P_l \in \mathbb{R}^{H' \times W' \times C}\}$, where $l$ denotes the pyramid level and $(H', W')$ usually equals $(\lfloor H/2^l \rfloor, \lfloor W/2^l \rfloor)$ in a typical FPN implementation. Each detection head consists of four 3 × 3 convolution layers, followed by an extra 3 × 3 convolution layer for the final prediction. For parameter efficiency, all feature levels share the same detection heads (parameters). However, the computation cost is highly imbalanced across levels: the FLOPs of the detection heads grow quadratically from $P_7$ to $P_3$ with the feature resolution. As shown in Figure 2, the $P_3$ head occupies nearly half of the FLOPs, while the low-resolution features $P_4$ to $P_7$ account for only 15%. Thus, extending the FPN to $P_2$ for better small object performance is unaffordable: the high-resolution $P_2$ and $P_3$ would occupy 75% of the overall cost. A back-of-the-envelope sketch of this cost distribution is given after Figure 2. In the following, we describe how QueryDet reduces the computation on high-resolution features and improves the inference speed of RetinaNet, even with an extra high-resolution $P_2$.
[Figure 2: FLOPs distribution of the detection heads across pyramid levels]
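The sketch below (our own estimate, not the paper's accounting; it simplifies all five head convolutions to 256-to-256 channels) counts the multiply-adds of the 3 × 3 convolutions per level to show the quadratic growth:

```python
def head_flops(h, w, level, channels=256, convs=5):
    """Approximate multiply-adds of a RetinaNet-style head (four 3x3 convs
    plus one 3x3 prediction conv) applied on pyramid level `level`."""
    fh, fw = h // 2 ** level, w // 2 ** level         # feature map size
    return convs * fh * fw * 9 * channels * channels  # 3x3 kernel = 9 taps

h, w = 800, 1333  # a typical COCO input size
total = sum(head_flops(h, w, l) for l in range(2, 8))
for l in range(2, 8):
    print(f"P{l}: {head_flops(h, w, l) / total:.1%} of head FLOPs")
# Quartering the area at each level makes P2 and P3 dominate the head cost.
```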

3.2. Accelerating Inference with Sparse Queries

In modern FPN-based detectors, small objects tend to be detected on high-resolution, low-level feature maps. However, since small objects are usually sparsely distributed in space, the dense computation paradigm on high-resolution feature maps is highly inefficient. Inspired by this observation, we propose a coarse-to-fine approach to reduce the computation cost of the low-level pyramids: the rough locations of small objects are first predicted on coarse feature maps, and the corresponding locations on fine feature maps are then computed intensively. This can be viewed as a query process: the rough locations are the query keys, and the high-resolution features used to detect small objects are the query values; hence we call our approach QueryDet. The whole pipeline is presented in Figure 3.
[Figure 3: The overall pipeline of QueryDet]
To predict the coarse locations of small objects, we add a query head parallel to the classification and regression heads. The query head receives the feature map $P_l$ with stride $2^l$ as input and outputs a heatmap $V_l \in \mathbb{R}^{H' \times W'}$, where $V_l^{i,j}$ indicates the probability that grid $(i, j)$ contains a small object. During training, we define small objects on each level as objects whose scale is smaller than a predefined threshold $s_l$. For simplicity we set $s_l$ to the minimum anchor scale on $P_l$; for anchor-free detectors it is set to the minimum regression range on $P_l$. For a small object $o$, we encode the target map for the query head by computing the distance between its center $(x_o, y_o)$ and every location on the feature map, setting locations whose distance is smaller than $s_l$ to 1 and all others to 0. The query head is then trained with FocalLoss [27]. During inference, we choose the positions whose predicted scores exceed a threshold σ as queries. Each query $q_l^o = (x_l^o, y_l^o)$ is then mapped to its four nearest neighbors on $P_{l-1}$ as key positions $\{k_{l-1}^o\}$:
$k_{l-1}^o = \{(2x_l^o + i,\ 2y_l^o + j),\ \forall i, j \in \{0, 1\}\}$
All $\{k_{l-1}^o\}$ on $P_{l-1}$ are collected to form the key position set $\{k_{l-1}\}$. The three heads then process only these positions to detect objects and to compute the next level's queries. Specifically, we extract features from $P_{l-1}$ using $\{k_{l-1}\}$ as indices to construct a sparse tensor $P_{l-1}^v$, which we call the value features. Sparse convolution (spconv) [13] kernels are then built from the weights of the four-conv dense heads to compute the results on level $l-1$.
To maximize inference speed, we apply the queries in a cascade manner: the queries for $P_{l-2}$ are generated only from $\{k_{l-1}\}$. We name this paradigm Cascade Sparse Query (CSQ), as illustrated in Figure 1. The benefit of CSQ is that we avoid generating the queries $\{q_l\}$ from a single $P_l$, which would lead to an exponentially growing set of key positions $k_l$ during query mapping as $l$ decreases.
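Below is a minimal dense-tensor sketch of CSQ (our illustration, not the released implementation): it thresholds the query heatmap at the starting level, maps surviving positions to their four children one level down, and repeats. For clarity it evaluates the query head densely and merely gathers at the key positions; the actual method replaces this with sparse convolution so that untouched positions cost nothing.

```python
import torch

@torch.no_grad()
def cascade_sparse_query(feats, query_head, sigma=0.15, start=4, low=2):
    """feats: dict {level l: tensor (C, H_l, W_l)}; query_head maps a feature
    map to an (H_l, W_l) small-object score map. Returns, for each level
    below `start`, an (N, 2) LongTensor of (y, x) key positions."""
    keys = {}
    scores = query_head(feats[start])                  # dense query at P_start
    qy, qx = torch.nonzero(scores > sigma, as_tuple=True)
    for l in range(start - 1, low - 1, -1):
        # map each query (x, y) to {(2x+i, 2y+j) | i, j in {0, 1}} on P_l
        ky = torch.cat([2 * qy, 2 * qy, 2 * qy + 1, 2 * qy + 1])
        kx = torch.cat([2 * qx, 2 * qx + 1, 2 * qx, 2 * qx + 1])
        _, h, w = feats[l].shape
        keep = (ky < h) & (kx < w)
        ky, kx = ky[keep], kx[keep]
        keys[l] = torch.stack([ky, kx], dim=1)
        # cascade: the next level's queries come only from these key positions
        sel = query_head(feats[l])[ky, kx] > sigma
        qy, qx = ky[sel], kx[sel]
    return keys
```

The classification and regression heads are then evaluated only at `keys[l]` on each high-resolution level, e.g. with sparse-convolution kernels copied from the dense heads.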

3.3. Training

We keep the training of the classification and regression heads the same as in the original RetinaNet [27]. The query head is trained with FocalLoss [27] on a generated binary target map. Let the ground-truth bounding box of a small object $o$ on $P_l$ be $b_l^o = (x_l^o, y_l^o, w_l^o, h_l^o)$. We first compute the minimum distance map $D_l$ between each feature position $(x, y)$ on $P_l$ and all small ground-truth centers $\{(x_l^o, y_l^o)\}$:
$D_l^{x,y} = \min_o \sqrt{(x - x_l^o)^2 + (y - y_l^o)^2}$
Then the ground-truth query map $V_l^*$ is defined as:
$V_l^{*\,x,y} = \begin{cases} 1, & D_l^{x,y} < s_l \\ 0, & \text{otherwise} \end{cases}$
For each level $P_l$, the loss function is defined as follows:
$\mathcal{L}_l = \mathcal{L}_{FL}(U_l, U_l^*) + \mathcal{L}_r(R_l, R_l^*) + \mathcal{L}_{FL}(V_l, V_l^*)$
where $U_l$, $R_l$, and $V_l$ are the classification, regression, and query score outputs, and $U_l^*$, $R_l^*$, and $V_l^*$ are their corresponding ground-truth maps; $\mathcal{L}_{FL}$ is the focal loss and $\mathcal{L}_r$ is the bounding box regression loss, which is the smooth $l_1$ loss [11] in the original RetinaNet. The overall loss is:
$\mathcal{L} = \sum_l \beta_l \mathcal{L}_l$

Here we re-balance the loss of each layer with $\beta_l$. The reason is that adding higher-resolution features such as $P_2$ significantly changes the distribution of training samples: the number of training samples on $P_2$ alone is larger than the total across $P_3$ to $P_7$. Without down-weighting, training would be dominated by small objects. We therefore re-balance the losses of the different layers so that the model learns from all layers simultaneously.
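A minimal sketch of the query target generation defined above (our illustration; `centers` holds the small-object centers $(x_l^o, y_l^o)$ of one level in feature-map coordinates):

```python
import torch

def query_target(h, w, centers, s_l):
    """Build the binary ground-truth query map V*_l for one pyramid level.
    centers: (N, 2) tensor of (x, y) small-object centers; s_l: the scale
    threshold of this level. Returns an (h, w) map of 0/1 targets."""
    if centers.numel() == 0:
        return torch.zeros(h, w)
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).reshape(-1, 2).float()  # (h*w, 2)
    # D_l^{x,y} = min_o || (x, y) - (x_l^o, y_l^o) ||_2
    d = torch.cdist(grid, centers.float()).min(dim=1).values
    return (d.reshape(h, w) < s_l).float()                       # V*_l
```

This map is supervised with the focal loss, and the per-level losses are summed with the weights $\beta_l$ as in the overall loss above.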

3.4. Relationship to Related Work

Note that although our method bears some similarities to two-stage object detectors with RPN, it differs in the following aspects: 1) we compute only classification results in the coarse prediction, while RPN computes both classification and regression; 2) RPN is computed on full feature maps at all levels, while the computation of QueryDet is sparse and selective; 3) two-stage methods rely on operations such as RoIAlign [14] or RoIPooling [11] to align features with the first-stage proposals, which we do not use since our coarse prediction produces no box output. It is worth noting that our method is compatible with the FPN-based RPN, so QueryDet can be incorporated into two-stage detectors to accelerate proposal generation.
Another closely related work is PointRend [19], which computes high-resolution segmentation maps from very few adaptively selected points. The main differences between QueryDet and PointRend are 1) how the queries are generated and 2) how sparse computation is applied. For the first, PointRend selects the most uncertain regions based on the predicted score at each location, while we directly add an auxiliary loss as supervision; our experiments show that this simple method generates high-recall predictions and improves the final performance. For the second, PointRend uses a multi-layer perceptron for per-pixel classification, which only requires features from single locations in the high-resolution feature maps and thus can easily be batched for high efficiency. Since object detection requires more context for accurate prediction, we instead use sparse convolution with 3 × 3 kernels.

4. Experiments

We conduct quantitative experiments on two object detection datasets: COCO [28] and VisDrone [59]. COCO is the most widely used dataset for general object detection; VisDrone is a dataset specialized for drone-shot images, in which small objects dominate the scale distribution.

4.1. Implementation Details

We implement our approach with PyTorch [34] and the Detectron2 toolkit [52]. All models are trained on 8 NVIDIA 2080Ti GPUs. For COCO, we follow common practice: the standard 1× schedule and the default data augmentation in Detectron2, with batch size 16 and an initial learning rate of 0.01. The weights $\beta_l$ used to re-balance the loss across layers grow linearly from 1 to 3 across $P_2$ to $P_7$. For VisDrone, following [30], we split each image into four non-overlapping patches and process them independently during training. We train for 50k iterations with an initial learning rate of 0.01, decayed by a factor of 10 at 30k and 40k iterations; the re-balance weights $\beta_l$ grow linearly from 1 to 2.6. For both datasets, we freeze all batch normalization (BN) layers in the backbone during training and add no BN layers in the detection heads. Mixed-precision training [32] is used in all experiments to save GPU memory. The query threshold σ is set to 0.15, and we start the query from $P_4$. Unless otherwise specified, our method is built on RetinaNet with a ResNet-50 backbone.
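For instance, the re-balance weights on COCO can be generated as follows (a trivial sketch; ordering from $P_2$ to $P_7$ assumed):

```python
import torch
betas = torch.linspace(1.0, 3.0, steps=6)  # beta_l for P2..P7
# tensor([1.0000, 1.4000, 1.8000, 2.2000, 2.6000, 3.0000])
```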

4.2. Effectiveness of Our Method

In Table 1, we compare the mean average precision (mAP) and average frames per second (FPS) of our method and the RetinaNet baseline on COCO. The baseline runs at 13.6 FPS and achieves 37.46 overall AP and 22.64 $\text{AP}_S$ on small objects, slightly higher than the results in the original paper [27]. With the help of high-resolution features, our approach achieves 38.53 AP and 24.64 $\text{AP}_S$, improving AP and $\text{AP}_S$ by 1.1 and 2.0. These results confirm the importance of high-resolution features for detecting small objects. However, incorporating such a high-resolution feature map drops the inference speed to 4.85 FPS. With our Cascade Sparse Query (CSQ), the inference speed rises to 14.88 FPS, even faster than the baseline RetinaNet that does not use the higher-resolution $P_2$, with negligible performance loss. Figure 2 also shows how CSQ saves computation: compared with the RetinaNet with the high-resolution $P_2$, in which $P_3$ and $P_2$ account for 74% of the total FLOPs, CSQ reduces those costs to around 1%. The reason is that in QueryDet, all computations on the high-resolution $P_3$ and $P_2$ are carried out at locations around the sparsely distributed small objects. These results sufficiently demonstrate the effectiveness of our method. We also report the 3× training schedule in Table 1: the stronger baseline does not weaken our improvement but brings an even larger acceleration, which we attribute to the stronger query head, whose small-object estimates become more accurate.
[Table 1: Comparison with the RetinaNet baseline on COCO]
On VisDrone, as shown in Table 2, the findings are similar but even more pronounced. On this small-object-oriented dataset, we improve the overall AP by 2.1 and AP50 by 3.2, and the inference speed improves by 2.3×, from 1.16 FPS to 2.75 FPS.
[Table 2: Results on VisDrone]

4.3. Ablation Studies

We conduct ablation studies on the COCO mini-val set to analyze how each component affects detection accuracy and speed; the results are shown in Table 3. Our retrained RetinaNet achieves 37.46 AP. When we add the high-resolution $P_2$, the AP drops dramatically by 1.34. As discussed in Section 3.3, this is caused by the distribution shift of the training samples after adding $P_2$. Re-balancing the layer losses improves the result to 38.11, largely resolving the problem. Interestingly, the re-balancing strategy gives only a minor AP gain (0.2) when applied to the original baseline, suggesting that loss re-balancing is more critical in the high-resolution scenario. Adding our query head yields a further gain of 0.42 AP and 1.58 $\text{AP}_S$, pushing the total AP and $\text{AP}_S$ to 38.53 and 24.64 and verifying the effectiveness of the extra objectness supervision. Finally, with CSQ, the detection speed improves from 4.85 FPS to 14.88 FPS, while the 0.17 AP loss is negligible.
[Table 3: Ablation study on COCO mini-val]

4.4. Discussion

Influence of the Query Threshold. Here we investigate the accuracy-speed trade-off of our Cascade Sparse Query. We measure detection accuracy (AP) and speed (FPS) under different query thresholds σ, which determine whether a grid (a low-resolution feature location) is considered to contain small objects. Intuitively, increasing this threshold decreases the recall of small objects but accelerates inference, since fewer locations are processed. The accuracy-speed trade-offs for different input sizes are presented in Figure 4: σ increases by 0.05 between adjacent markers on each curve, and the leftmost marker denotes the performance without CSQ. We observe that even a very low threshold (0.05) brings a massive speed improvement, validating the effectiveness of our approach. Another observation concerns the gap between the AP upper and lower bounds across input resolutions: the gap is small for large images but large for small images, indicating that for higher-resolution inputs our CSQ guarantees a good AP lower bound even when the query threshold is set high.
Which layer should the query start from? In Cascade Sparse Query, we need to decide the starting layer, above which we run conventional convolutions to detect large objects. We do not start CSQ from the lowest-resolution layer for two reasons: 1) normal convolution is already fast on low-resolution features, so the time saved by CSQ cannot compensate for the time needed to construct the sparse feature map; 2) it is hard to distinguish small objects on feature maps with very low resolution. The results are presented in Table 4. The layer with the highest inference speed is $P_4$, which confirms that querying from very high-level layers such as $P_5$ and $P_6$ causes a loss of speed. We also observe that the AP loss gradually increases as the starting layer moves higher, reflecting the difficulty of finding small objects in very low-resolution layers.
[Table 4: Effect of the query starting layer]
What is the best way to use queries? We demonstrate the efficiency of our Cascade Sparse Query by comparing it with two alternative query operations. The first is Crop Query (CQ), in which the regions indicated by the queries are cropped from the high-resolution features for subsequent computation; this type of query is similar to AutoFocus [33]. The second is Complete Convolution Query (CCQ), in which regular convolutions compute the full feature map of each layer, but results are extracted only at the queried positions for post-processing. For CQ, we crop an 11 × 11 patch from the feature map, chosen to fit the receptive field of the five consecutive 3 × 3 convolutions in the detection heads. The results are presented in Table 5. Generally speaking, all three methods successfully accelerate inference with negligible AP loss; among them, our CSQ achieves the fastest inference speed.
[Table 5: Comparison of different query operations]
How much context do we need? To apply CSQ, we construct a sparse feature map in which only the positions of small objects are activated, and we also activate the context area around them to avoid losing accuracy. In practice, however, too much context does not improve the detection AP and only slows detection down, while too little context severely decreases the AP. In this section, we explore how much context is needed to balance the speed-accuracy trade-off. Here, context is defined as a patch of a given size around each queried position whose features the sparse detection head also processes. The results are reported in Table 6: a 5 × 5 patch provides enough context to detect a small object. Although more context brings a small AP improvement, it degrades the acceleration of CSQ, while less context cannot guarantee a high detection AP. A simple implementation sketch of this context activation is given after Table 6.
[Table 6: Effect of the context patch size]
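The context activation can be realized as a simple dilation of the binary query mask (our sketch; the released implementation may differ):

```python
import torch.nn.functional as F

def add_context(mask, patch=5):
    """Dilate a (1, 1, H, W) binary query mask so that a patch x patch
    neighborhood around every queried position is also activated."""
    return F.max_pool2d(mask, kernel_size=patch, stride=1, padding=patch // 2)
```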
Results on Light-weight Backbones. As claimed in Section 1, our method can be combined with light-weight backbones for further speedup. Moreover, since CSQ accelerates the computation in the detection head, the overall acceleration is more pronounced with such backbones, whose own inference time is small. We report the results in Table 7: the high-resolution detection speed improves by 4.1× on average with MobileNet V2 [40] and by 3.8× with ShuffleNet V2 [31], which shows that our approach is ready to deploy on edge devices for real-time applications, such as autonomous vehicles, that require effective small object detection.
[Table 7: Results with light-weight backbones]
Results on Anchor-Free Detectors. QueryDet can be applied to any FPN-based detector to accelerate high-resolution detection. We therefore apply QueryDet to FCOS, a state-of-the-art anchor-free detector, and report the COCO results in Table 8. QueryDet improves the APs with the help of high-resolution features, and when Cascade Sparse Query (CSQ) is adopted, the high-resolution speed improves by 1.8× on average, validating the generality of the proposed approach.
[Table 8: Results with the anchor-free FCOS detector]
Effectiveness on Two-stage Detectors. Our CSQ can also be applied to FPN-based two-stage detectors to reduce the computation cost of the high-resolution layers in the RPN. To verify this claim, we apply CSQ to the Faster R-CNN detector [39]. In our implementation, the inputs to the RPN are $P_2$ to $P_6$, and we start the query from $P_4$. We modify the RPN to have 3 conv layers instead of the single layer in the standard implementation, followed by 3 branches for objectness classification, bounding box regression, and query key computation; a sketch of this head is given after Table 9. The former two branches are trained following common practice [39], and the query branch is trained with Focal Loss with γ = 1.2 and α = 0.25. During inference, we set the query threshold to 0.15. As shown in Table 9, our Faster R-CNN achieves 38.47 overall AP and 22.98 $\text{AP}_S$ at 17.57 FPS. With CSQ, the inference speed improves to 19.03 FPS with a minor loss in AP. These results verify the effectiveness of our approach in accelerating two-stage detectors. Note that in two-stage detectors, CSQ not only saves time on the dense computation in the RPN but also reduces the number of RoIs fed into the second stage.
[Table 9: Results with Faster R-CNN]
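A sketch of the modified RPN head described above (our reading of the text; the 3-conv trunk and the three branches follow the description, while channel and anchor counts are assumptions):

```python
import torch.nn as nn

class QueryRPNHead(nn.Module):
    """RPN head with a 3-conv trunk and three sibling branches: objectness
    classification, box regression, and a query-key score map (the latter
    trained with focal loss, gamma=1.2, alpha=0.25, per the text)."""
    def __init__(self, in_ch=256, num_anchors=3):
        super().__init__()
        layers = []
        for _ in range(3):
            layers += [nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU()]
        self.trunk = nn.Sequential(*layers)
        self.objectness = nn.Conv2d(in_ch, num_anchors, 1)
        self.box_delta = nn.Conv2d(in_ch, num_anchors * 4, 1)
        self.query = nn.Conv2d(in_ch, 1, 1)   # small-object heatmap

    def forward(self, x):  # x: one FPN level, (B, in_ch, H, W)
        t = self.trunk(x)
        return self.objectness(t), self.box_delta(t), self.query(t)
```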

4.5. Visualization and Failure Cases

[Figure 5: Detection results and query heatmaps for small objects on COCO and VisDrone]
In Figure 5, we visualize the detection results and the query heatmaps for small objects on COCO and VisDrone. The heatmaps show that our query head successfully finds the coarse positions of small objects, enabling CSQ to detect them effectively. Moreover, by incorporating high-resolution features, our method detects small objects very accurately.
We also show two typical failure cases of our approach: 1) even when the coarse positions of small objects are correctly extracted by the query head, the detection head may fail to localize them (the second VisDrone image); 2) positions of large objects are falsely activated, causing the detection head to process useless positions and hence slowing down inference (the first COCO image).

5. Conclusion

We propose QueryDet, which uses a novel query mechanism, Cascade Sparse Query (CSQ), to accelerate the inference of feature-pyramid-based dense object detectors. QueryDet gives object detectors the ability to detect small objects at low cost and is easy to deploy, making it practical for real-time applications such as autonomous driving. For future work, we plan to extend QueryDet to the more challenging 3D object detection task with LiDAR point clouds as input, where 3D space is generally sparser than 2D images and the costly 3D convolution operations put more pressure on computational resources.
