CVPR 2023 Paper Quick Review: Object (Part 2)

Paper1 3D-Aware Object Goal Navigation via Simultaneous Exploration and Identification

摘要原文: Object goal navigation (ObjectNav) in unseen environments is a fundamental task for Embodied AI. Agents in existing works learn ObjectNav policies based on 2D maps, scene graphs, or image sequences. Considering this task happens in 3D space, a 3D-aware agent can advance its ObjectNav capability via learning from fine-grained spatial information. However, leveraging 3D scene representation can be prohibitively impractical for policy learning in this floor-level task, due to low sample efficiency and expensive computational cost. In this work, we propose a framework for the challenging 3D-aware ObjectNav based on two straightforward sub-policies. The two sub-policies, namely corner-guided exploration policy and category-aware identification policy, perform simultaneously by utilizing online fused 3D points as observation. Through extensive experiments, we show that this framework can dramatically improve the performance in ObjectNav through learning from 3D scene representation. Our framework achieves the best performance among all modular-based methods on the Matterport3D and Gibson datasets while requiring (up to 30x) less computational cost for training. The code will be released to benefit the community.

Summary: Object goal navigation (ObjectNav) in unseen environments is a fundamental task for Embodied AI. Existing agents learn ObjectNav policies from 2D maps, scene graphs, or image sequences; since the task takes place in 3D space, a 3D-aware agent can improve its ObjectNav ability by learning from fine-grained spatial information, but exploiting 3D scene representations for policy learning in this floor-level task is often impractical due to low sample efficiency and high computational cost. This work proposes a 3D-aware ObjectNav framework built on two straightforward sub-policies, a corner-guided exploration policy and a category-aware identification policy, which run simultaneously on online-fused 3D points as observations. Extensive experiments show that learning from the 3D scene representation dramatically improves ObjectNav: the framework achieves the best performance among modular methods on Matterport3D and Gibson while requiring up to 30x less training compute. The code will be released to benefit the community.
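The abstract does not spell out how the "online fused 3D points" are produced, but the standard recipe is to unproject each depth frame with known intrinsics and pose and accumulate the points in a voxel-downsampled buffer. The sketch below illustrates that generic step only (not the authors' code); intrinsics, pose, and voxel size are toy assumptions.

```python
# Minimal sketch: unproject a depth map into world-frame 3D points and append
# them to an online point buffer, the kind of fused observation a 3D-aware
# ObjectNav policy could consume. All camera parameters are assumptions.
import numpy as np

def unproject_depth(depth, fx, fy, cx, cy, cam_to_world):
    """depth: (H, W) metres; cam_to_world: (4, 4). Returns (N, 3) world points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.reshape(-1)
    valid = z > 0                                # drop missing depth readings
    x = (u.reshape(-1) - cx) * z / fx
    y = (v.reshape(-1) - cy) * z / fy
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)[valid]   # homogeneous
    return (cam_to_world @ pts_cam.T).T[:, :3]

class OnlinePointMap:
    """Accumulates points over time and voxel-downsamples to keep memory bounded."""
    def __init__(self, voxel=0.05):
        self.voxel = voxel
        self.points = np.empty((0, 3))

    def integrate(self, new_points):
        pts = np.concatenate([self.points, new_points], axis=0)
        keys = np.floor(pts / self.voxel).astype(np.int64)
        _, idx = np.unique(keys, axis=0, return_index=True)         # one point per voxel
        self.points = pts[idx]

if __name__ == "__main__":
    depth = np.full((120, 160), 2.0)             # toy flat depth image
    fused = OnlinePointMap()
    fused.integrate(unproject_depth(depth, fx=100, fy=100, cx=80, cy=60, cam_to_world=np.eye(4)))
    print(fused.points.shape)
```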

Paper2 Multiclass Confidence and Localization Calibration for Object Detection

摘要原文: Albeit achieving high predictive accuracy across many challenging computer vision problems, recent studies suggest that deep neural networks (DNNs) tend to make overconfident predictions, rendering them poorly calibrated. Most of the existing attempts for improving DNN calibration are limited to classification tasks and restricted to calibrating in-domain predictions. Surprisingly, very little to no attempts have been made in studying the calibration of object detection methods, which occupy a pivotal space in vision-based security-sensitive, and safety-critical applications. In this paper, we propose a new train-time technique for calibrating modern object detection methods. It is capable of jointly calibrating multiclass confidence and box localization by leveraging their predictive uncertainties. We perform extensive experiments on several in-domain and out-of-domain detection benchmarks. Results demonstrate that our proposed train-time calibration method consistently outperforms several baselines in reducing calibration error for both in-domain and out-of-domain predictions. Our code and models are available at https://github.com/bimsarapathiraja/MCCL

Summary: Although deep neural networks (DNNs) achieve high predictive accuracy on many challenging computer vision problems, recent studies show they tend to make overconfident, poorly calibrated predictions. Most existing attempts to improve DNN calibration are limited to classification tasks and to calibrating in-domain predictions, and surprisingly little work has studied the calibration of object detectors, which occupy a pivotal place in security-sensitive and safety-critical vision applications. This paper proposes a new train-time technique for calibrating modern object detectors that jointly calibrates multiclass confidence and box localization by exploiting their predictive uncertainties. Extensive experiments on several in-domain and out-of-domain detection benchmarks show that the proposed train-time calibration consistently outperforms several baselines in reducing calibration error for both in-domain and out-of-domain predictions. Code and models are available at https://github.com/bimsarapathiraja/MCCL.
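To make "calibrating a detector" concrete, the snippet below computes a simplified detection calibration error in the spirit of D-ECE: predictions are binned by confidence and per-bin confidence is compared with per-bin precision (correctness defined by an IoU match). This is only an illustrative metric under those assumptions, not the train-time loss proposed in the paper.

```python
# Hedged sketch: a simplified detection expected-calibration-error measurement.
import numpy as np

def detection_ece(confidences, correct, n_bins=10):
    """confidences: (N,) in [0, 1]; correct: (N,) bool (IoU-matched). Returns scalar ECE."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, n = 0.0, len(confidences)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.sum() == 0:
            continue
        avg_conf = confidences[mask].mean()       # what the detector claims
        precision = correct[mask].mean()          # what actually happened
        ece += (mask.sum() / n) * abs(avg_conf - precision)
    return ece

if __name__ == "__main__":
    conf = np.array([0.95, 0.90, 0.85, 0.40, 0.30])
    hit = np.array([True, False, True, True, False])   # e.g. IoU >= 0.5 with a matched GT box
    print(f"detection ECE: {detection_ece(conf, hit):.3f}")
```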

Paper3 Aligning Bag of Regions for Open-Vocabulary Object Detection

摘要原文: Pre-trained vision-language models (VLMs) learn to align vision and language representations on large-scale datasets, where each image-text pair usually contains a bag of semantic concepts. However, existing open-vocabulary object detectors only align region embeddings individually with the corresponding features extracted from the VLMs. Such a design leaves the compositional structure of semantic concepts in a scene under-exploited, although the structure may be implicitly learned by the VLMs. In this work, we propose to align the embedding of bag of regions beyond individual regions. The proposed method groups contextually interrelated regions as a bag. The embeddings of regions in a bag are treated as embeddings of words in a sentence, and they are sent to the text encoder of a VLM to obtain the bag-of-regions embedding, which is learned to be aligned to the corresponding features extracted by a frozen VLM. Applied to the commonly used Faster R-CNN, our approach surpasses the previous best results by 4.6 box AP 50 and 2.8 mask AP on novel categories of open-vocabulary COCO and LVIS benchmarks, respectively. Code and models are available at https://github.com/wusize/ovdet.

Summary: Pre-trained vision-language models (VLMs) learn to align vision and language representations on large-scale data, where each image-text pair usually covers a bag of semantic concepts. Existing open-vocabulary detectors, however, only align individual region embeddings with the corresponding features extracted from the VLM, leaving the compositional structure of semantic concepts in a scene under-exploited even though the VLM may have learned it implicitly. This work proposes aligning the embedding of a bag of regions rather than individual regions: contextually interrelated regions are grouped into a bag, their embeddings are treated like word embeddings in a sentence and sent through the VLM's text encoder to obtain a bag-of-regions embedding, which is trained to align with the corresponding features of a frozen VLM. Applied to the commonly used Faster R-CNN, the approach surpasses the previous best results by 4.6 box AP50 and 2.8 mask AP on the novel categories of the open-vocabulary COCO and LVIS benchmarks, respectively. Code and models are available at https://github.com/wusize/ovdet.
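A minimal sketch of the bag-of-regions idea under stated assumptions: region embeddings are treated as pseudo-word tokens, pooled by a stand-in "text encoder" (a plain TransformerEncoder here, not the real VLM), and aligned to a frozen teacher embedding of the same bag with a contrastive loss. Dimensions and the loss form are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BagOfRegionsAligner(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.pseudo_text_encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, region_embeds):                        # (B, num_regions, dim)
        tokens = self.pseudo_text_encoder(region_embeds)
        return F.normalize(tokens.mean(dim=1), dim=-1)       # one embedding per bag

def contrastive_alignment(student_bags, teacher_bags, tau=0.07):
    """InfoNCE between student bag embeddings and frozen-teacher bag embeddings."""
    logits = student_bags @ teacher_bags.t() / tau           # (B, B) similarity
    targets = torch.arange(student_bags.size(0))
    return F.cross_entropy(logits, targets)

if __name__ == "__main__":
    B, R, D = 4, 5, 256
    regions = torch.randn(B, R, D)                           # sampled region embeddings
    teacher = F.normalize(torch.randn(B, D), dim=-1)         # stand-in for frozen VLM features
    loss = contrastive_alignment(BagOfRegionsAligner(D)(regions), teacher)
    loss.backward()
    print(float(loss))
```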

Paper4 Annealing-Based Label-Transfer Learning for Open World Object Detection

摘要原文: Open world object detection (OWOD) has attracted extensive attention due to its practicability in the real world. Previous OWOD works manually designed unknown-discover strategies to select unknown proposals from the background, suffering from uncertainties without appropriate priors. In this paper, we claim the learning of object detection could be seen as an object-level feature-entanglement process, where unknown traits are propagated to the known proposals through convolutional operations and could be distilled to benefit unknown recognition without manual selection. Therefore, we propose a simple yet effective Annealing-based Label-Transfer framework, which sufficiently explores the known proposals to alleviate the uncertainties. Specifically, a Label-Transfer Learning paradigm is introduced to decouple the known and unknown features, while a Sawtooth Annealing Scheduling strategy is further employed to rebuild the decision boundaries of the known and unknown classes, thus promoting both known and unknown recognition. Moreover, previous OWOD works neglected the trade-off of known and unknown performance, and we thus introduce a metric called Equilibrium Index to comprehensively evaluate the effectiveness of the OWOD models. To the best of our knowledge, this is the first OWOD work without manual unknown selection. Extensive experiments conducted on the common-used benchmark validate that our model achieves superior detection performance (200% unknown mAP improvement with the even higher known detection performance) compared to other state-of-the-art methods. Our code is available at https://github.com/DIG-Beihang/ALLOW.git.

Summary: Open world object detection (OWOD) has attracted wide attention for its real-world practicality, but previous OWOD methods manually design unknown-discovery strategies to select unknown proposals from the background and suffer from uncertainty without appropriate priors. The authors view detection learning as an object-level feature-entanglement process in which unknown traits propagate into known proposals through convolutions and can be distilled to benefit unknown recognition without manual selection. They therefore propose a simple yet effective Annealing-based Label-Transfer framework that fully explores the known proposals to alleviate this uncertainty: a Label-Transfer Learning paradigm decouples known and unknown features, and a Sawtooth Annealing Scheduling strategy rebuilds the decision boundaries between known and unknown classes, promoting both kinds of recognition. Since previous OWOD work neglected the trade-off between known and unknown performance, they also introduce an Equilibrium Index metric for comprehensive evaluation. To their knowledge this is the first OWOD method without manual unknown selection; extensive experiments on the commonly used benchmark show superior detection performance (a 200% unknown mAP improvement with even higher known detection performance) compared with other state-of-the-art methods. Code: https://github.com/DIG-Beihang/ALLOW.git.

Paper5 Meta-Tuning Loss Functions and Data Augmentation for Few-Shot Object Detection

摘要原文: Few-shot object detection, the problem of modelling novel object detection categories with few training instances, is an emerging topic in the area of few-shot learning and object detection. Contemporary techniques can be divided into two groups: fine-tuning based and meta-learning based approaches. While meta-learning approaches aim to learn dedicated meta-models for mapping samples to novel class models, fine-tuning approaches tackle few-shot detection in a simpler manner, by adapting the detection model to novel classes through gradient based optimization. Despite their simplicity, fine-tuning based approaches typically yield competitive detection results. Based on this observation, we focus on the role of loss functions and augmentations as the force driving the fine-tuning process, and propose to tune their dynamics through meta-learning principles. The proposed training scheme, therefore, allows learning inductive biases that can boost few-shot detection, while keeping the advantages of fine-tuning based approaches. In addition, the proposed approach yields interpretable loss functions, as opposed to highly parametric and complex few-shot meta-models. The experimental results highlight the merits of the proposed scheme, with significant improvements over the strong fine-tuning based few-shot detection baselines on benchmark Pascal VOC and MS-COCO datasets, in terms of both standard and generalized few-shot performance metrics.

Summary: Few-shot object detection, i.e., modelling novel detection categories from only a few training instances, is an emerging topic at the intersection of few-shot learning and object detection. Contemporary techniques fall into two groups: meta-learning approaches learn dedicated meta-models that map samples to novel-class models, whereas fine-tuning approaches simply adapt the detector to novel classes through gradient-based optimization, yet despite their simplicity typically remain competitive. Based on this observation, the authors focus on loss functions and data augmentation as the force driving the fine-tuning process and propose to tune their dynamics with meta-learning principles. The resulting training scheme learns inductive biases that boost few-shot detection while keeping the advantages of fine-tuning, and yields interpretable loss functions rather than highly parametric, complex few-shot meta-models. Experiments show significant improvements over strong fine-tuning baselines on Pascal VOC and MS-COCO in both standard and generalized few-shot metrics.

Paper6 itKD: Interchange Transfer-Based Knowledge Distillation for 3D Object Detection

摘要原文: Point-cloud based 3D object detectors recently have achieved remarkable progress. However, most studies are limited to the development of network architectures for improving only their accuracy without consideration of the computational efficiency. In this paper, we first propose an autoencoder-style framework comprising channel-wise compression and decompression via interchange transfer-based knowledge distillation. To learn the map-view feature of a teacher network, the features from teacher and student networks are independently passed through the shared autoencoder; here, we use a compressed representation loss that binds the channel-wise compression knowledge from both student and teacher networks as a kind of regularization. The decompressed features are transferred in opposite directions to reduce the gap in the interchange reconstructions. Lastly, we present a head attention loss to match the 3D object detection information drawn by the multi-head self-attention mechanism. Through extensive experiments, we verify that our method can train a lightweight model that is well-aligned with the 3D point cloud detection task and we demonstrate its superiority using the well-known public datasets; e.g., Waymo and nuScenes.

Summary: Point-cloud-based 3D object detectors have progressed remarkably, but most studies improve accuracy by developing network architectures without considering computational efficiency. This paper proposes an autoencoder-style framework that performs channel-wise compression and decompression via interchange-transfer-based knowledge distillation. To learn the teacher network's map-view features, teacher and student features are passed independently through a shared autoencoder, and a compressed-representation loss binds the channel-wise compression knowledge of both networks as a form of regularization; the decompressed features are then transferred in opposite directions to reduce the gap between the interchange reconstructions. Finally, a head-attention loss matches the 3D detection information captured by the multi-head self-attention mechanism. Extensive experiments show the method trains lightweight models well aligned with the 3D point-cloud detection task, with superior results on well-known public datasets such as Waymo and nuScenes.
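The sketch below illustrates the interchange-transfer idea under simplifying assumptions: a shared 1x1-conv autoencoder compresses teacher and student map-view features, a compressed-representation loss ties the two codes together, and the decompressed features are swapped so each reconstruction is supervised by the other network's feature map. Shapes and loss weights are illustrative, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedChannelAE(nn.Module):
    """Shared channel-wise autoencoder applied to both teacher and student features."""
    def __init__(self, channels=64, code=16):
        super().__init__()
        self.enc = nn.Conv2d(channels, code, kernel_size=1)
        self.dec = nn.Conv2d(code, channels, kernel_size=1)

    def forward(self, feat):
        code = self.enc(feat)
        return code, self.dec(code)

def interchange_transfer_loss(ae, f_teacher, f_student):
    code_t, rec_t = ae(f_teacher)
    code_s, rec_s = ae(f_student)
    compressed = F.mse_loss(code_s, code_t.detach())      # bind the compressed codes
    # interchange: the student's reconstruction targets the teacher feature,
    # while the teacher's reconstruction targets the student feature
    interchange = F.mse_loss(rec_s, f_teacher.detach()) + F.mse_loss(rec_t, f_student)
    return compressed + interchange

if __name__ == "__main__":
    ae = SharedChannelAE()
    f_t = torch.randn(2, 64, 32, 32)                      # teacher map-view feature (frozen)
    f_s = torch.randn(2, 64, 32, 32, requires_grad=True)  # student map-view feature
    loss = interchange_transfer_loss(ae, f_t, f_s)
    loss.backward()
    print(float(loss))
```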

Paper7 DexArt: Benchmarking Generalizable Dexterous Manipulation With Articulated Objects

摘要原文: To enable general-purpose robots, we will require the robot to operate daily articulated objects as humans do. Current robot manipulation has heavily relied on using a parallel gripper, which restricts the robot to a limited set of objects. On the other hand, operating with a multi-finger robot hand will allow better approximation to human behavior and enable the robot to operate on diverse articulated objects. To this end, we propose a new benchmark called DexArt, which involves Dexterous manipulation with Articulated objects in a physical simulator. In our benchmark, we define multiple complex manipulation tasks, and the robot hand will need to manipulate diverse articulated objects within each task. Our main focus is to evaluate the generalizability of the learned policy on unseen articulated objects. This is very challenging given the high degrees of freedom of both hands and objects. We use Reinforcement Learning with 3D representation learning to achieve generalization. Through extensive studies, we provide new insights into how 3D representation learning affects decision making in RL with 3D point cloud inputs. More details can be found at https://www.chenbao.tech/dexart/.

Summary: To enable general-purpose robots, robots must operate everyday articulated objects the way humans do. Current robot manipulation relies heavily on parallel grippers, which restricts robots to a limited set of objects, whereas a multi-finger robot hand better approximates human behavior and enables manipulation of diverse articulated objects. The authors therefore propose DexArt, a new benchmark for dexterous manipulation with articulated objects in a physics simulator. The benchmark defines multiple complex manipulation tasks in which the robot hand must manipulate diverse articulated objects; the main focus is evaluating how well learned policies generalize to unseen articulated objects, which is very challenging given the high degrees of freedom of both the hand and the objects. Reinforcement learning combined with 3D representation learning is used to achieve generalization, and extensive studies provide new insights into how 3D representation learning affects RL decision making with 3D point-cloud inputs. More details: https://www.chenbao.tech/dexart/.

Paper8 PROB: Probabilistic Objectness for Open World Object Detection

摘要原文: Open World Object Detection (OWOD) is a new and challenging computer vision task that bridges the gap between classic object detection (OD) benchmarks and object detection in the real world. In addition to detecting and classifying seen/labeled objects, OWOD algorithms are expected to detect novel/unknown objects - which can be classified and incrementally learned. In standard OD, object proposals not overlapping with a labeled object are automatically classified as background. Therefore, simply applying OD methods to OWOD fails as unknown objects would be predicted as background. The challenge of detecting unknown objects stems from the lack of supervision in distinguishing unknown objects and background object proposals. Previous OWOD methods have attempted to overcome this issue by generating supervision using pseudo-labeling - however, unknown object detection has remained low. Probabilistic/generative models may provide a solution for this challenge. Herein, we introduce a novel probabilistic framework for objectness estimation, where we alternate between probability distribution estimation and objectness likelihood maximization of known objects in the embedded feature space - ultimately allowing us to estimate the objectness probability of different proposals. The resulting Probabilistic Objectness transformer-based open-world detector, PROB, integrates our framework into traditional object detection models, adapting them for the open-world setting. Comprehensive experiments on OWOD benchmarks show that PROB outperforms all existing OWOD methods in both unknown object detection ( 2x unknown recall) and known object detection ( mAP). Our code is available at https://github.com/orrzohar/PROB.

Summary: Open World Object Detection (OWOD) is a new and challenging computer vision task that bridges classic object detection benchmarks and detection in the real world: besides detecting and classifying seen/labeled objects, OWOD algorithms must detect novel/unknown objects, which can then be classified and learned incrementally. In standard detection, proposals that do not overlap any labeled object are automatically treated as background, so directly applying standard detectors to OWOD fails because unknown objects are predicted as background; the difficulty stems from the lack of supervision for distinguishing unknown objects from background proposals. Previous OWOD methods generate supervision via pseudo-labeling, but unknown detection has remained weak, and probabilistic/generative models may offer a solution. This work introduces a probabilistic framework for objectness estimation that alternates between probability-distribution estimation and objectness-likelihood maximization of known objects in the embedded feature space, ultimately estimating the objectness probability of different proposals. The resulting Probabilistic Objectness transformer-based open-world detector, PROB, integrates the framework into traditional detection models and outperforms all existing OWOD methods in both unknown detection (about 2x unknown recall) and known detection (mAP). Code: https://github.com/orrzohar/PROB.
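A simplified sketch of the probabilistic-objectness intuition: fit a single Gaussian to the embeddings of known (matched) proposals and score any proposal by its likelihood under that distribution, so object-like queries score high and background-like queries score low. The real PROB alternates this estimation with likelihood maximization inside a transformer detector; the code below shows only the scoring idea, with all dimensions as assumptions.

```python
import torch

class GaussianObjectness:
    def __init__(self, dim, eps=1e-3):
        self.mean = torch.zeros(dim)
        self.cov_inv = torch.eye(dim)
        self.eps = eps

    def fit(self, known_embeds):                      # (N, dim) embeddings of known objects
        self.mean = known_embeds.mean(dim=0)
        centered = known_embeds - self.mean
        cov = centered.t() @ centered / max(len(known_embeds) - 1, 1)
        cov += self.eps * torch.eye(cov.size(0))      # regularize for invertibility
        self.cov_inv = torch.linalg.inv(cov)

    def objectness(self, embeds):                     # higher = more object-like
        d = embeds - self.mean
        maha_sq = (d @ self.cov_inv * d).sum(dim=-1)  # squared Mahalanobis distance
        return torch.exp(-0.5 * maha_sq)

if __name__ == "__main__":
    known = torch.randn(200, 16) + 2.0                # embeddings of labelled objects
    proposals = torch.cat([torch.randn(3, 16) + 2.0,  # object-like queries
                           torch.randn(3, 16) - 4.0]) # background-like queries
    scorer = GaussianObjectness(16)
    scorer.fit(known)
    print(scorer.objectness(proposals))
```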

Paper9 Ambiguity-Resistant Semi-Supervised Learning for Dense Object Detection

摘要原文: With basic Semi-Supervised Object Detection (SSOD) techniques, one-stage detectors generally obtain limited promotions compared with their two-stage counterparts. We experimentally find that the root lies in two kinds of ambiguities: (1) Selection ambiguity that selected pseudo labels are less accurate, since classification scores cannot properly represent the localization quality. (2) Assignment ambiguity that samples are matched with improper labels in pseudo-label assignment, as the strategy is misguided by missed objects and inaccurate pseudo boxes. To tackle these problems, we propose an Ambiguity-Resistant Semi-supervised Learning (ARSL) method for one-stage detectors. Specifically, to alleviate the selection ambiguity, Joint-Confidence Estimation (JCE) is proposed to jointly quantify the classification and localization quality of pseudo labels. As for the assignment ambiguity, Task-Separation Assignment (TSA) is introduced to assign labels based on pixel-level predictions rather than unreliable pseudo boxes. It employs a ‘divide-and-conquer’ strategy and separately exploits positives for the classification and localization task, which is more robust to the assignment ambiguity. Comprehensive experiments demonstrate that ARSL effectively mitigates the ambiguities and achieves state-of-the-art SSOD performance on MS COCO and PASCAL VOC. Codes can be found at https://github.com/PaddlePaddle/PaddleDetection.

Summary: With basic semi-supervised object detection (SSOD) techniques, one-stage detectors usually gain much less than their two-stage counterparts. The authors trace this to two kinds of ambiguity: selection ambiguity, where the selected pseudo labels are inaccurate because classification scores cannot properly represent localization quality, and assignment ambiguity, where samples are matched to improper labels because the assignment strategy is misled by missed objects and inaccurate pseudo boxes. They propose Ambiguity-Resistant Semi-supervised Learning (ARSL) for one-stage detectors: Joint-Confidence Estimation (JCE) jointly quantifies the classification and localization quality of pseudo labels to ease selection ambiguity, while Task-Separation Assignment (TSA) assigns labels based on pixel-level predictions rather than unreliable pseudo boxes, adopting a "divide-and-conquer" strategy that exploits positives separately for the classification and localization tasks and is more robust to assignment ambiguity. Comprehensive experiments show that ARSL effectively mitigates both ambiguities and achieves state-of-the-art SSOD performance on MS COCO and PASCAL VOC. Code: https://github.com/PaddlePaddle/PaddleDetection.
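A hedged sketch of the joint-confidence idea: instead of ranking pseudo labels by the classification score alone, combine it with a localization-quality estimate (an IoU prediction here) and keep only boxes whose joint confidence passes a threshold. The combination rule, threshold, and IoU head are assumptions for illustration, not the paper's exact JCE formulation.

```python
import torch

def select_pseudo_labels(cls_scores, iou_scores, boxes, thr=0.45):
    """cls_scores, iou_scores: (N,); boxes: (N, 4). Returns kept boxes and joint scores."""
    joint = (cls_scores * iou_scores).sqrt()          # geometric mean of the two qualities
    keep = joint > thr
    return boxes[keep], joint[keep]

if __name__ == "__main__":
    cls_scores = torch.tensor([0.9, 0.8, 0.6, 0.3])
    iou_scores = torch.tensor([0.2, 0.9, 0.7, 0.9])   # high class score but poor box -> dropped
    boxes = torch.rand(4, 4)
    kept, scores = select_pseudo_labels(cls_scores, iou_scores, boxes)
    print(kept.shape, scores)
```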

Paper10 Semi-Supervised Stereo-Based 3D Object Detection via Cross-View Consensus

摘要原文: Stereo-based 3D object detection, which aims at detecting 3D objects with stereo cameras, shows great potential in low-cost deployment compared to LiDAR-based methods and excellent performance compared to monocular-based algorithms. However, the impressive performance of stereo-based 3D object detection is at the huge cost of high-quality manual annotations, which are hardly attainable for any given scene. Semi-supervised learning, in which limited annotated data and numerous unannotated data are required to achieve a satisfactory model, is a promising method to address the problem of data deficiency. In this work, we propose to achieve semi-supervised learning for stereo-based 3D object detection through pseudo annotation generation from a temporal-aggregated teacher model, which temporally accumulates knowledge from a student model. To facilitate a more stable and accurate depth estimation, we introduce Temporal-Aggregation-Guided (TAG) disparity consistency, a cross-view disparity consistency constraint between the teacher model and the student model for robust and improved depth estimation. To mitigate noise in pseudo annotation generation, we propose a cross-view agreement strategy, in which pseudo annotations should attain high degree of agreements between 3D and 2D views, as well as between binocular views. We perform extensive experiments on the KITTI 3D dataset to demonstrate our proposed method’s capability in leveraging a huge amount of unannotated stereo images to attain significantly improved detection results.

Summary: Stereo-based 3D object detection, which detects 3D objects with stereo cameras, promises low-cost deployment compared with LiDAR-based methods and excellent performance compared with monocular algorithms, but its impressive performance comes at the huge cost of high-quality manual annotations that are hardly attainable for any given scene. Semi-supervised learning, which combines limited labeled data with abundant unlabeled data, is a promising way to address this data deficiency. This work achieves semi-supervised stereo 3D detection through pseudo-annotation generation from a temporally aggregated teacher model that accumulates knowledge from the student model. A Temporal-Aggregation-Guided (TAG) disparity consistency constraint, a cross-view disparity consistency between teacher and student, enables more stable and accurate depth estimation, and a cross-view agreement strategy mitigates noise in pseudo-annotation generation by requiring high agreement between the 3D and 2D views as well as between the binocular views. Extensive experiments on the KITTI 3D dataset demonstrate the method's ability to leverage large amounts of unannotated stereo images for significantly improved detection.

Paper11 NS3D: Neuro-Symbolic Grounding of 3D Objects and Relations

摘要原文: Grounding object properties and relations in 3D scenes is a prerequisite for a wide range of artificial intelligence tasks, such as visually grounded dialogues and embodied manipulation. However, the variability of the 3D domain induces two fundamental challenges: 1) the expense of labeling and 2) the complexity of 3D grounded language. Hence, essential desiderata for models are to be data-efficient, generalize to different data distributions and tasks with unseen semantic forms, as well as ground complex language semantics (e.g., view-point anchoring and multi-object reference). To address these challenges, we propose NS3D, a neuro-symbolic framework for 3D grounding. NS3D translates language into programs with hierarchical structures by leveraging large language-to-code models. Different functional modules in the programs are implemented as neural networks. Notably, NS3D extends prior neuro-symbolic visual reasoning methods by introducing functional modules that effectively reason about high-arity relations (i.e., relations among more than two objects), key in disambiguating objects in complex 3D scenes. Modular and compositional architecture enables NS3D to achieve state-of-the-art results on the ReferIt3D view-dependence task, a 3D referring expression comprehension benchmark. Importantly, NS3D shows significantly improved performance on settings of data-efficiency and generalization, and demonstrate zero-shot transfer to an unseen 3D question-answering task.

Summary: Grounding object properties and relations in 3D scenes is a prerequisite for a wide range of AI tasks such as visually grounded dialogue and embodied manipulation, but the variability of the 3D domain induces two fundamental challenges: the expense of labeling and the complexity of 3D grounded language. Models therefore need to be data-efficient, generalize to different data distributions and to tasks with unseen semantic forms, and ground complex language semantics such as viewpoint anchoring and multi-object reference. The authors propose NS3D, a neuro-symbolic framework for 3D grounding that leverages large language-to-code models to translate language into hierarchically structured programs whose functional modules are implemented as neural networks. Notably, NS3D extends prior neuro-symbolic visual reasoning by introducing modules that effectively reason about high-arity relations (relations among more than two objects), which is key to disambiguating objects in complex 3D scenes. Its modular, compositional architecture achieves state-of-the-art results on the ReferIt3D view-dependence benchmark, a 3D referring expression comprehension task, shows significantly improved data efficiency and generalization, and demonstrates zero-shot transfer to an unseen 3D question-answering task.

Paper12 PACO: Parts and Attributes of Common Objects

摘要原文: Object models are gradually progressing from predicting just category labels to providing detailed descriptions of object instances. This motivates the need for large datasets which go beyond traditional object masks and provide richer annotations such as part masks and attributes. Hence, we introduce PACO: Parts and Attributes of Common Objects. It spans 75 object categories, 456 object-part categories and 55 attributes across image (LVIS) and video (Ego4D) datasets. We provide 641K part masks annotated across 260K object boxes, with roughly half of them exhaustively annotated with attributes as well. We design evaluation metrics and provide benchmark results for three tasks on the dataset: part mask segmentation, object and part attribute prediction and zero-shot instance detection. Dataset, models, and code are open-sourced at https://github.com/facebookresearch/paco.

Summary: Object models are gradually progressing from predicting only category labels to providing detailed descriptions of object instances, which motivates large datasets that go beyond traditional object masks and offer richer annotations such as part masks and attributes. The authors therefore introduce PACO: Parts and Attributes of Common Objects, spanning 75 object categories, 456 object-part categories and 55 attributes across image (LVIS) and video (Ego4D) datasets. It provides 641K part masks annotated over 260K object boxes, roughly half of which are also exhaustively annotated with attributes. Evaluation metrics and benchmark results are provided for three tasks on the dataset: part mask segmentation, object and part attribute prediction, and zero-shot instance detection. The dataset, models and code are open-sourced at https://github.com/facebookresearch/paco.

Paper13 Learning Transformations To Reduce the Geometric Shift in Object Detection

摘要原文: The performance of modern object detectors drops when the test distribution differs from the training one. Most of the methods that address this focus on object appearance changes caused by, e.g., different illumination conditions, or gaps between synthetic and real images. Here, by contrast, we tackle geometric shifts emerging from variations in the image capture process, or due to the constraints of the environment causing differences in the apparent geometry of the content itself. We introduce a self-training approach that learns a set of geometric transformations to minimize these shifts without leveraging any labeled data in the new domain, nor any information about the cameras. We evaluate our method on two different shifts, i.e., a camera’s field of view (FoV) change and a viewpoint change. Our results evidence that learning geometric transformations helps detectors to perform better in the target domains.

Summary: Modern object detectors degrade when the test distribution differs from the training one. Most methods addressing this focus on object appearance changes caused by, e.g., different illumination conditions or the gap between synthetic and real images; this work instead tackles geometric shifts arising from variations in the image capture process or from environmental constraints that change the apparent geometry of the content itself. The authors introduce a self-training approach that learns a set of geometric transformations to minimize these shifts without using any labeled data in the new domain or any information about the cameras. Evaluated on two different shifts, a change of the camera's field of view (FoV) and a viewpoint change, the results show that learning geometric transformations helps detectors perform better in the target domains.

Paper14 Co-Salient Object Detection With Uncertainty-Aware Group Exchange-Masking

摘要原文: The traditional definition of the co-salient object detection (CoSOD) task is to segment the common salient objects in a group of relevant images. Existing CoSOD models by default adopt the group consensus assumption. This brings about a model robustness defect under the condition of irrelevant images in the testing image group, which hinders the use of CoSOD models in real-world applications. To address this issue, this paper presents a group exchange-masking (GEM) strategy for robust CoSOD model learning. With two groups of images containing different types of salient objects as input, GEM first selects a set of images from each group by the proposed learning-based strategy, then these images are exchanged. The proposed feature extraction module considers both the uncertainty caused by the irrelevant images and the group consensus in the remaining relevant images. We design a latent variable generator branch made of a conditional variational autoencoder to generate uncertainty-based global stochastic features. A CoSOD transformer branch is devised to capture the correlation-based local features that contain the group consistency information. At last, the outputs of the two branches are concatenated and fed into a transformer-based decoder, producing robust co-saliency prediction. Extensive evaluations on co-saliency detection with and without irrelevant images demonstrate the superiority of our method over a variety of state-of-the-art methods.

Summary: The traditional co-salient object detection (CoSOD) task is to segment the common salient objects in a group of relevant images, and existing CoSOD models by default adopt the group consensus assumption. This causes a robustness defect when irrelevant images appear in the test group, hindering the use of CoSOD models in real-world applications. The paper presents a group exchange-masking (GEM) strategy for robust CoSOD learning: given two groups of images containing different types of salient objects, a learning-based strategy selects a set of images from each group and exchanges them. The feature extraction module accounts for both the uncertainty caused by irrelevant images and the group consensus among the remaining relevant images: a latent-variable generator branch built on a conditional variational autoencoder produces uncertainty-based global stochastic features, while a CoSOD transformer branch captures correlation-based local features carrying group-consistency information. The outputs of the two branches are concatenated and fed to a transformer-based decoder, producing robust co-saliency predictions. Extensive evaluations on co-saliency detection with and without irrelevant images demonstrate the superiority of the method over a variety of state-of-the-art approaches.

Paper15 Multi-Object Manipulation via Object-Centric Neural Scattering Functions

摘要原文: Learned visual dynamics models have proven effective for robotic manipulation tasks. Yet, it remains unclear how best to represent scenes involving multi-object interactions. Current methods decompose a scene into discrete objects, yet they struggle with precise modeling and manipulation amid challenging lighting conditions since they only encode appearance tied with specific illuminations. In this work, we propose using object-centric neural scattering functions (OSFs) as object representations in a model-predictive control framework. OSFs model per-object light transport, enabling compositional scene re-rendering under object rearrangement and varying lighting conditions. By combining this approach with inverse parameter estimation and graph-based neural dynamics models, we demonstrate improved model-predictive control performance and generalization in compositional multi-object environments, even in previously unseen scenarios and harsh lighting conditions.

Summary: Learned visual dynamics models have proven effective for robotic manipulation, but it remains unclear how best to represent scenes involving multi-object interactions. Current methods decompose a scene into discrete objects yet struggle with precise modeling and manipulation under challenging lighting, since they only encode appearance tied to specific illumination. This work proposes using object-centric neural scattering functions (OSFs) as object representations in a model-predictive control framework: OSFs model per-object light transport, enabling compositional scene re-rendering under object rearrangement and varying lighting conditions. Combined with inverse parameter estimation and graph-based neural dynamics models, the approach improves model-predictive control performance and generalization in compositional multi-object environments, even in previously unseen scenarios and harsh lighting conditions.

Paper16 Unbalanced Optimal Transport: A Unified Framework for Object Detection

摘要原文: During training, supervised object detection tries to correctly match the predicted bounding boxes and associated classification scores to the ground truth. This is essential to determine which predictions are to be pushed towards which solutions, or to be discarded. Popular matching strategies include matching to the closest ground truth box (mostly used in combination with anchors), or matching via the Hungarian algorithm (mostly used in anchor-free methods). Each of these strategies comes with its own properties, underlying losses, and heuristics. We show how Unbalanced Optimal Transport unifies these different approaches and opens a whole continuum of methods in between. This allows for a finer selection of the desired properties. Experimentally, we show that training an object detection model with Unbalanced Optimal Transport is able to reach the state-of-the-art both in terms of Average Precision and Average Recall as well as to provide a faster initial convergence. The approach is well suited for GPU implementation, which proves to be an advantage for large-scale models.

Summary: During training, supervised object detection must correctly match the predicted bounding boxes and associated classification scores to the ground truth; this determines which predictions are pushed toward which solutions or discarded. Popular matching strategies include matching to the closest ground-truth box (mostly used with anchors) or matching via the Hungarian algorithm (mostly used in anchor-free methods), each with its own properties, underlying losses, and heuristics. The authors show that Unbalanced Optimal Transport unifies these different approaches and opens up a whole continuum of methods in between, allowing a finer selection of the desired properties. Experiments show that training an object detector with Unbalanced Optimal Transport reaches the state of the art in both Average Precision and Average Recall while converging faster initially, and the approach is well suited to GPU implementation, an advantage for large-scale models.
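The illustrative sketch below (not the paper's training code) runs entropic unbalanced Sinkhorn between a few predictions and ground-truth boxes with a simple L1 box cost. Relaxing the marginal constraints (the rho parameter) lets mass be created or destroyed, which is what interpolates between "closest-box" style matching and strict one-to-one, Hungarian-like matching; the cost, epsilon and rho values are toy assumptions.

```python
import numpy as np

def unbalanced_sinkhorn(cost, a, b, eps=0.1, rho=1.0, iters=200):
    """cost: (N, M); a: (N,), b: (M,) marginal masses. Returns a soft transport plan (N, M)."""
    K = np.exp(-cost / eps)
    u, v = np.ones_like(a), np.ones_like(b)
    fi = rho / (rho + eps)                      # exponent from the KL marginal relaxation
    for _ in range(iters):
        u = (a / (K @ v + 1e-30)) ** fi
        v = (b / (K.T @ u + 1e-30)) ** fi
    return u[:, None] * K * v[None, :]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pred_boxes, gt_boxes = rng.random((6, 4)), rng.random((2, 4))
    cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)  # L1 box cost
    plan = unbalanced_sinkhorn(cost, a=np.full(6, 1 / 6), b=np.full(2, 1 / 2))
    print(np.round(plan, 3))                    # soft assignment of predictions to GTs
```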

Paper17 Target-Referenced Reactive Grasping for Dynamic Objects

摘要原文: Reactive grasping, which enables the robot to successfully grasp dynamic moving objects, is of great interest in robotics. Current methods mainly focus on the temporal smoothness of the predicted grasp poses but few consider their semantic consistency. Consequently, the predicted grasps are not guaranteed to fall on the same part of the same object, especially in cluttered scenes. In this paper, we propose to solve reactive grasping in a target-referenced setting by tracking through generated grasp spaces. Given a targeted grasp pose on an object and detected grasp poses in a new observation, our method is composed of two stages: 1) discovering grasp pose correspondences through an attentional graph neural network and selecting the one with the highest similarity with respect to the target pose; 2) refining the selected grasp poses based on target and historical information. We evaluate our method on a large-scale benchmark GraspNet-1Billion. We also collect 30 scenes of dynamic objects for testing. The results suggest that our method outperforms other representative methods. Furthermore, our real robot experiments achieve an average success rate of over 80 percent.

Summary: Reactive grasping, which enables a robot to grasp dynamically moving objects, is of great interest in robotics. Current methods mainly focus on the temporal smoothness of predicted grasp poses but rarely consider their semantic consistency, so the predicted grasps are not guaranteed to fall on the same part of the same object, especially in cluttered scenes. This paper solves reactive grasping in a target-referenced setting by tracking through generated grasp spaces: given a targeted grasp pose on an object and the grasp poses detected in a new observation, the method first discovers grasp-pose correspondences with an attentional graph neural network and selects the one most similar to the target pose, then refines the selected grasp pose based on target and historical information. Evaluations on the large-scale GraspNet-1Billion benchmark and 30 newly collected dynamic-object scenes show it outperforms other representative methods, and real-robot experiments achieve an average success rate of over 80 percent.

Paper18 LoGoNet: Towards Accurate 3D Object Detection With Local-to-Global Cross-Modal Fusion

摘要原文: LiDAR-camera fusion methods have shown impressive performance in 3D object detection. Recent advanced multi-modal methods mainly perform global fusion, where image features and point cloud features are fused across the whole scene. Such practice lacks fine-grained region-level information, yielding suboptimal fusion performance. In this paper, we present the novel Local-to-Global fusion network (LoGoNet), which performs LiDAR-camera fusion at both local and global levels. Concretely, the Global Fusion (GoF) of LoGoNet is built upon previous literature, while we exclusively use point centroids to more precisely represent the position of voxel features, thus achieving better cross-modal alignment. As to the Local Fusion (LoF), we first divide each proposal into uniform grids and then project these grid centers to the images. The image features around the projected grid points are sampled to be fused with position-decorated point cloud features, maximally utilizing the rich contextual information around the proposals. The Feature Dynamic Aggregation (FDA) module is further proposed to achieve information interaction between these locally and globally fused features, thus producing more informative multi-modal features. Extensive experiments on both Waymo Open Dataset (WOD) and KITTI datasets show that LoGoNet outperforms all state-of-the-art 3D detection methods. Notably, LoGoNet ranks 1st on Waymo 3D object detection leaderboard and obtains 81.02 mAPH (L2) detection performance. It is noteworthy that, for the first time, the detection performance on three classes surpasses 80 APH (L2) simultaneously. Code will be available at https://github.com/sankin97/LoGoNet.

Summary: LiDAR-camera fusion methods perform impressively in 3D object detection, but recent advanced multi-modal methods mainly perform global fusion, combining image and point-cloud features over the whole scene, which lacks fine-grained region-level information and yields suboptimal fusion. This paper presents LoGoNet, a novel Local-to-Global fusion network that fuses LiDAR and camera features at both levels. Its Global Fusion (GoF) builds on prior work but uses point centroids to represent voxel-feature positions more precisely, achieving better cross-modal alignment. For Local Fusion (LoF), each proposal is divided into uniform grids whose centers are projected onto the image; image features around the projected grid points are sampled and fused with position-decorated point-cloud features, maximally exploiting the rich context around proposals. A Feature Dynamic Aggregation (FDA) module then enables interaction between the locally and globally fused features, producing more informative multi-modal features. Extensive experiments on the Waymo Open Dataset and KITTI show LoGoNet outperforms all state-of-the-art 3D detectors; notably it ranks 1st on the Waymo 3D detection leaderboard with 81.02 mAPH (L2), and for the first time three classes simultaneously exceed 80 APH (L2). Code: https://github.com/sankin97/LoGoNet.
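A hedged sketch of the local-fusion step: project the grid centres of a 3D proposal into the image with a known 3x4 projection matrix and bilinearly sample image features at the projected pixels. The calibration matrix, camera-frame points and feature-map shapes are toy assumptions; the real LoF additionally decorates points with position information and fuses across modalities.

```python
import torch
import torch.nn.functional as F

def sample_image_features(grid_centers, proj, feat_map, img_h, img_w):
    """grid_centers: (N, 3) camera-frame points; proj: (3, 4); feat_map: (1, C, Hf, Wf)."""
    homo = torch.cat([grid_centers, torch.ones(len(grid_centers), 1)], dim=1)   # (N, 4)
    uvw = homo @ proj.t()                                   # (N, 3)
    uv = uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-6)           # pixel coordinates
    # normalize to [-1, 1] as required by grid_sample
    norm = torch.stack([uv[:, 0] / (img_w - 1), uv[:, 1] / (img_h - 1)], dim=1) * 2 - 1
    grid = norm.view(1, 1, -1, 2)                           # (1, 1, N, 2)
    sampled = F.grid_sample(feat_map, grid, align_corners=True)   # (1, C, 1, N)
    return sampled.squeeze(0).squeeze(1).t()                # (N, C)

if __name__ == "__main__":
    centers = torch.tensor([[0.5, 0.2, 10.0], [1.0, -0.1, 12.0]])   # toy grid centres
    proj = torch.tensor([[700.0, 0.0, 600.0, 0.0],
                         [0.0, 700.0, 200.0, 0.0],
                         [0.0, 0.0, 1.0, 0.0]])             # toy camera projection matrix
    feats = torch.randn(1, 32, 48, 160)                     # image feature map
    print(sample_image_features(centers, proj, feats, img_h=384, img_w=1280).shape)
```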

Paper19 ScaleKD: Distilling Scale-Aware Knowledge in Small Object Detector

摘要原文: Despite the prominent success of general object detection, the performance and efficiency of Small Object Detection (SOD) are still unsatisfactory. Unlike existing works that struggle to balance the trade-off between inference speed and SOD performance, in this paper, we propose a novel Scale-aware Knowledge Distillation (ScaleKD), which transfers knowledge of a complex teacher model to a compact student model. We design two novel modules to boost the quality of knowledge transfer in distillation for SOD: 1) a scale-decoupled feature distillation module that disentangles the teacher's feature representation into multi-scale embeddings, enabling explicit feature mimicking of the student model on small objects; 2) a cross-scale assistant to refine the noisy and uninformative bounding box predictions of the student model, which can mislead the student model and impair the efficacy of knowledge distillation. A multi-scale cross-attention layer is established to capture the multi-scale semantic information to improve the student model. We conduct experiments on the COCO and VisDrone datasets with diverse types of models, i.e., two-stage and one-stage detectors, to evaluate our proposed method. Our ScaleKD achieves superior performance on general detection performance and obtains spectacular improvement regarding the SOD performance.

Summary: Despite the prominent success of general object detection, the performance and efficiency of small object detection (SOD) remain unsatisfactory. Rather than struggling to balance inference speed against SOD performance like existing works, this paper proposes Scale-aware Knowledge Distillation (ScaleKD), which transfers knowledge from a complex teacher model to a compact student model. Two novel modules boost the quality of knowledge transfer for SOD: a scale-decoupled feature distillation module that disentangles the teacher's feature representation into multi-scale embeddings so the student can explicitly mimic features on small objects, and a cross-scale assistant that refines the student's noisy, uninformative bounding box predictions, which would otherwise mislead the student and impair distillation; a multi-scale cross-attention layer further captures multi-scale semantic information to improve the student. Experiments on COCO and VisDrone with both two-stage and one-stage detectors show that ScaleKD achieves superior general detection performance and spectacular improvements on SOD.

Paper20 Autonomous Manipulation Learning for Similar Deformable Objects via Only One Demonstration

摘要原文: In comparison with most methods focusing on 3D rigid object recognition and manipulation, deformable objects are more common in our real life but attract less attention. Generally, most existing methods for deformable object manipulation suffer two issues, 1) Massive demonstration: repeating thousands of robot-object demonstrations for model training of one specific instance; 2) Poor generalization: inevitably re-training for transferring the learned skill to a similar/new instance from the same category. Therefore, we propose a category-level deformable 3D object manipulation framework, which could manipulate deformable 3D objects with only one demonstration and generalize the learned skills to new similar instances without re-training. Specifically, our proposed framework consists of two modules. The Nocs State Transform (NST) module transfers the observed point clouds of the target to a pre-defined unified pose state (i.e., Nocs state), which is the foundation for the category-level manipulation learning; the Neural Spatial Encoding (NSE) module generalizes the learned skill to novel instances by encoding the category-level spatial information to pursue the expected grasping point without re-training. The relative motion path is then planned to achieve autonomous manipulation. Both the simulated results via our Cap40 dataset and real robotic experiments justify the effectiveness of our framework.

Summary: Compared with most methods focusing on 3D rigid object recognition and manipulation, deformable objects are more common in real life but attract less attention. Existing deformable-object manipulation methods generally suffer from two issues: massive demonstration, repeating thousands of robot-object demonstrations to train a model for one specific instance, and poor generalization, inevitably requiring re-training to transfer the learned skill to a similar or new instance of the same category. The authors therefore propose a category-level deformable 3D object manipulation framework that can manipulate deformable 3D objects from only one demonstration and generalize the learned skills to new similar instances without re-training. The framework has two modules: the NOCS State Transform (NST) module transfers the observed target point cloud to a predefined unified pose state (the NOCS state), the foundation for category-level manipulation learning, and the Neural Spatial Encoding (NSE) module generalizes the learned skill to novel instances by encoding category-level spatial information to find the expected grasping point without re-training; a relative motion path is then planned for autonomous manipulation. Simulated results on their Cap40 dataset and real robotic experiments justify the framework's effectiveness.

Paper21 Learning To Detect and Segment for Open Vocabulary Object Detection

摘要原文: Open vocabulary object detection has been greatly advanced by the recent development of vision-language pre-trained models, which help recognize novel objects with only semantic categories. The prior works mainly focus on transferring knowledge to the object proposal classification and employ class-agnostic box and mask prediction. In this work, we propose CondHead, a principled dynamic network design to better generalize the box regression and mask segmentation for the open vocabulary setting. The core idea is to conditionally parametrize the network heads on semantic embedding and thus the model is guided with class-specific knowledge to better detect novel categories. Specifically, CondHead is composed of two streams of network heads, the dynamically aggregated heads and the dynamically generated heads. The former is instantiated with a set of static heads that are conditionally aggregated; these heads are optimized as experts and are expected to learn sophisticated prediction. The latter is instantiated with dynamically generated parameters and encodes general class-specific information. With such conditional design, the detection model is bridged by the semantic embedding to offer strongly generalizable class-wise box and mask prediction. Our method brings significant improvement to the prior state-of-the-art open vocabulary object detection methods with very minor overhead, e.g., it surpasses a RegionClip model by 3.0 detection AP on novel categories, with only 1.1% more computation.

Summary: Open-vocabulary object detection has been greatly advanced by vision-language pre-trained models, which help recognize novel objects from semantic categories alone. Prior work mainly transfers knowledge to object-proposal classification and uses class-agnostic box and mask prediction. This work proposes CondHead, a principled dynamic network design that better generalizes box regression and mask segmentation to the open-vocabulary setting. The core idea is to conditionally parametrize the network heads on the semantic embedding, so the model is guided by class-specific knowledge when detecting novel categories. CondHead consists of two streams of heads: dynamically aggregated heads, a set of static heads that are conditionally aggregated, optimized as experts and expected to learn sophisticated predictions, and dynamically generated heads, whose parameters are generated on the fly and encode general class-specific information. With this conditional design, the detector is bridged by the semantic embedding to offer strongly generalizable class-wise box and mask prediction, significantly improving prior state-of-the-art open-vocabulary detectors at very small overhead; for example, it surpasses a RegionClip model by 3.0 detection AP on novel categories with only 1.1% more computation.
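A minimal sketch of a conditionally parameterized head: a small generator maps a class semantic embedding to the weights and bias of a box-regression layer, so the regression applied to RoI features becomes class-specific. The layer sizes and the single dynamically generated head are illustrative assumptions (CondHead additionally aggregates a set of static expert heads).

```python
import torch
import torch.nn as nn

class DynamicBoxHead(nn.Module):
    def __init__(self, feat_dim=256, embed_dim=512):
        super().__init__()
        # generates a (feat_dim -> 4) linear layer from the semantic embedding
        self.weight_gen = nn.Linear(embed_dim, feat_dim * 4)
        self.bias_gen = nn.Linear(embed_dim, 4)

    def forward(self, roi_feats, class_embed):
        """roi_feats: (N, feat_dim); class_embed: (embed_dim,) -> (N, 4) box deltas."""
        w = self.weight_gen(class_embed).view(4, -1)        # (4, feat_dim)
        b = self.bias_gen(class_embed)                      # (4,)
        return roi_feats @ w.t() + b

if __name__ == "__main__":
    head = DynamicBoxHead()
    rois = torch.randn(8, 256)                              # RoI features
    embed = torch.randn(512)                                # e.g. a text embedding of the class name
    print(head(rois, embed).shape)                          # torch.Size([8, 4])
```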

Paper22 Divide and Conquer: Answering Questions With Object Factorization and Compositional Reasoning

摘要原文: Humans have the innate capability to answer diverse questions, which is rooted in the natural ability to correlate different concepts based on their semantic relationships and decompose difficult problems into sub-tasks. On the contrary, existing visual reasoning methods assume training samples that capture every possible object and reasoning problem, and rely on black-boxed models that commonly exploit statistical priors. They have yet to develop the capability to address novel objects or spurious biases in real-world scenarios, and also fall short of interpreting the rationales behind their decisions. Inspired by humans’ reasoning of the visual world, we tackle the aforementioned challenges from a compositional perspective, and propose an integral framework consisting of a principled object factorization method and a novel neural module network. Our factorization method decomposes objects based on their key characteristics, and automatically derives prototypes that represent a wide range of objects. With these prototypes encoding important semantics, the proposed network then correlates objects by measuring their similarity on a common semantic space and makes decisions with a compositional reasoning process. It is capable of answering questions with diverse objects regardless of their availability during training, and overcoming the issues of biased question-answer distributions. In addition to the enhanced generalizability, our framework also provides an interpretable interface for understanding the decision-making process of models. Our code is available at https://github.com/szzexpoi/POEM.

Summary: Humans have the innate capability to answer diverse questions, rooted in the natural ability to correlate different concepts through their semantic relationships and to decompose difficult problems into sub-tasks. In contrast, existing visual reasoning methods assume training samples that capture every possible object and reasoning problem and rely on black-box models that commonly exploit statistical priors; they cannot yet handle novel objects or spurious biases in real-world scenarios and fall short of interpreting the rationales behind their decisions. Inspired by human reasoning about the visual world, the authors tackle these challenges from a compositional perspective with an integral framework consisting of a principled object factorization method and a novel neural module network. The factorization decomposes objects by their key characteristics and automatically derives prototypes representing a wide range of objects; with these prototypes encoding important semantics, the network correlates objects by measuring their similarity in a common semantic space and makes decisions through a compositional reasoning process. The method can answer questions involving diverse objects regardless of their availability during training and overcomes biased question-answer distributions, while also providing an interpretable interface for understanding the model's decision making. Code: https://github.com/szzexpoi/POEM.

Paper23 PIRLNav: Pretraining With Imitation and RL Finetuning for ObjectNav

摘要原文: We study ObjectGoal Navigation – where a virtual robot situated in a new environment is asked to navigate to an object. Prior work has shown that imitation learning (IL) using behavior cloning (BC) on a dataset of human demonstrations achieves promising results. However, this has limitations – 1) BC policies generalize poorly to new states, since the training mimics actions not their consequences, and 2) collecting demonstrations is expensive. On the other hand, reinforcement learning (RL) is trivially scalable, but requires careful reward engineering to achieve desirable behavior. We present PIRLNav, a two-stage learning scheme for BC pretraining on human demonstrations followed by RL-finetuning. This leads to a policy that achieves a success rate of 65.0% on ObjectNav (+5.0% absolute over previous state-of-the-art). Using this BC->RL training recipe, we present a rigorous empirical analysis of design choices. First, we investigate whether human demonstrations can be replaced with ‘free’ (automatically generated) sources of demonstrations, e.g. shortest paths (SP) or task-agnostic frontier exploration (FE) trajectories. We find that BC->RL on human demonstrations outperforms BC->RL on SP and FE trajectories, even when controlled for the same BC-pretraining success on train, and even on a subset of val episodes where BC-pretraining success favors the SP or FE policies. Next, we study how RL-finetuning performance scales with the size of the BC pretraining dataset. We find that as we increase the size of the BC-pretraining dataset and get to high BC accuracies, the improvements from RL-finetuning are smaller, and that 90% of the performance of our best BC->RL policy can be achieved with less than half the number of BC demonstrations. Finally, we analyze failure modes of our ObjectNav policies, and present guidelines for further improving them.

Summary: This work studies ObjectGoal Navigation, where a virtual robot placed in a new environment must navigate to an object. Prior work shows that imitation learning (IL) with behavior cloning (BC) on human demonstrations is promising but has limitations: BC policies generalize poorly to new states because training mimics actions rather than their consequences, and collecting demonstrations is expensive; reinforcement learning (RL) scales trivially but requires careful reward engineering. The authors present PIRLNav, a two-stage scheme of BC pretraining on human demonstrations followed by RL finetuning, which yields a policy with a 65.0% success rate on ObjectNav (+5.0% absolute over the previous state of the art). Using this BC->RL recipe they rigorously analyze design choices. First, BC->RL on human demonstrations outperforms BC->RL on "free" (automatically generated) demonstrations such as shortest paths (SP) or task-agnostic frontier exploration (FE) trajectories, even when controlled for the same BC-pretraining success and even on validation episodes where BC pretraining favors the SP or FE policies. Second, as the BC pretraining dataset grows and BC accuracy becomes high, the gains from RL finetuning shrink, and 90% of the best BC->RL policy's performance can be reached with less than half the BC demonstrations. Finally, they analyze failure modes of their ObjectNav policies and present guidelines for further improvement.

Paper24 Objaverse: A Universe of Annotated 3D Objects

摘要原文: Massive data corpora like WebText, Wikipedia, Conceptual Captions, WebImageText, and LAION have propelled recent dramatic progress in AI. Large neural models trained on such datasets produce impressive results and top many of today’s benchmarks. A notable omission within this family of large-scale datasets is 3D data. Despite considerable interest and potential applications in 3D vision, datasets of high-fidelity 3D models continue to be mid-sized with limited diversity of object categories. Addressing this gap, we present Objaverse 1.0, a large dataset of objects with 800K+ (and growing) 3D models with descriptive captions, tags, and animations. Objaverse improves upon present day 3D repositories in terms of scale, number of categories, and in the visual diversity of instances within a category. We demonstrate the large potential of Objaverse via four diverse applications: training generative 3D models, improving tail category segmentation on the LVIS benchmark, training open-vocabulary object-navigation models for Embodied AI, and creating a new benchmark for robustness analysis of vision models. Objaverse can open new directions for research and enable new applications across the field of AI.

Summary: Massive data corpora such as WebText, Wikipedia, Conceptual Captions, WebImageText and LAION have propelled recent dramatic progress in AI, and large neural models trained on them produce impressive results and top many benchmarks, yet 3D data is a notable omission from this family of large-scale datasets: despite considerable interest and potential applications in 3D vision, datasets of high-fidelity 3D models remain mid-sized with limited category diversity. To fill this gap, the authors present Objaverse 1.0, a large dataset of more than 800K (and growing) 3D models with descriptive captions, tags and animations, improving on today's 3D repositories in scale, number of categories, and visual diversity of instances within a category. Its large potential is demonstrated through four applications: training generative 3D models, improving tail-category segmentation on the LVIS benchmark, training open-vocabulary object-navigation models for Embodied AI, and creating a new benchmark for robustness analysis of vision models. Objaverse can open new research directions and enable new applications across AI.

Paper25 OVTrack: Open-Vocabulary Multiple Object Tracking

摘要原文: The ability to recognize, localize and track dynamic objects in a scene is fundamental to many real-world applications, such as self-driving and robotic systems. Yet, traditional multiple object tracking (MOT) benchmarks rely only on a few object categories that hardly represent the multitude of possible objects that are encountered in the real world. This leaves contemporary MOT methods limited to a small set of pre-defined object categories. In this paper, we address this limitation by tackling a novel task, open-vocabulary MOT, that aims to evaluate tracking beyond pre-defined training categories. We further develop OVTrack, an open-vocabulary tracker that is capable of tracking arbitrary object classes. Its design is based on two key ingredients: First, leveraging vision-language models for both classification and association via knowledge distillation; second, a data hallucination strategy for robust appearance feature learning from denoising diffusion probabilistic models. The result is an extremely data-efficient open-vocabulary tracker that sets a new state-of-the-art on the large-scale, large-vocabulary TAO benchmark, while being trained solely on static images. The project page is at https://www.vis.xyz/pub/ovtrack/.

Summary: The ability to recognize, localize and track dynamic objects in a scene is fundamental to many real-world applications such as self-driving and robotic systems, yet traditional multiple object tracking (MOT) benchmarks rely on only a few object categories that hardly represent the multitude of objects encountered in the real world, leaving contemporary MOT methods limited to a small set of predefined categories. This paper addresses the limitation with a novel task, open-vocabulary MOT, which evaluates tracking beyond the predefined training categories, and develops OVTrack, an open-vocabulary tracker capable of tracking arbitrary object classes. Its design rests on two key ingredients: leveraging vision-language models for both classification and association via knowledge distillation, and a data-hallucination strategy for robust appearance-feature learning from denoising diffusion probabilistic models. The result is an extremely data-efficient open-vocabulary tracker that sets a new state of the art on the large-scale, large-vocabulary TAO benchmark while being trained solely on static images. Project page: https://www.vis.xyz/pub/ovtrack/.

Paper26 Object-Aware Distillation Pyramid for Open-Vocabulary Object Detection

摘要原文: Open-vocabulary object detection aims to provide object detectors trained on a fixed set of object categories with the generalizability to detect objects described by arbitrary text queries. Previous methods adopt knowledge distillation to extract knowledge from Pretrained Vision-and-Language Models (PVLMs) and transfer it to detectors. However, due to the non-adaptive proposal cropping and single-level feature mimicking processes, they suffer from information destruction during knowledge extraction and inefficient knowledge transfer. To remedy these limitations, we propose an Object-Aware Distillation Pyramid (OADP) framework, including an Object-Aware Knowledge Extraction (OAKE) module and a Distillation Pyramid (DP) mechanism. When extracting object knowledge from PVLMs, the former adaptively transforms object proposals and adopts object-aware mask attention to obtain precise and complete knowledge of objects. The latter introduces global and block distillation for more comprehensive knowledge transfer to compensate for the missing relation information in object distillation. Extensive experiments show that our method achieves significant improvement compared to current methods. Especially on the MS-COCO dataset, our OADP framework reaches 35.6 mAP^N_50, surpassing the current state-of-the-art method by 3.3 mAP^N_50. Code is anonymously provided in the supplementary materials.

Summary: Open-vocabulary object detection aims to give detectors trained on a fixed set of categories the generalizability to detect objects described by arbitrary text queries. Previous methods use knowledge distillation to extract knowledge from pretrained vision-and-language models (PVLMs) and transfer it to detectors, but because of non-adaptive proposal cropping and single-level feature mimicking, they suffer from information destruction during knowledge extraction and inefficient knowledge transfer. To remedy these limitations, the authors propose an Object-Aware Distillation Pyramid (OADP) framework with an Object-Aware Knowledge Extraction (OAKE) module and a Distillation Pyramid (DP) mechanism: when extracting object knowledge from PVLMs, OAKE adaptively transforms object proposals and uses object-aware mask attention to obtain precise and complete object knowledge, while DP introduces global and block distillation for more comprehensive transfer, compensating for the relation information missing from object distillation. Extensive experiments show significant improvement over current methods; on MS-COCO, the OADP framework reaches 35.6 mAP^N_50, surpassing the previous state of the art by 3.3 mAP^N_50. Code is provided anonymously in the supplementary materials.

Paper27 AeDet: Azimuth-Invariant Multi-View 3D Object Detection

摘要原文: Recent LSS-based multi-view 3D object detection has made tremendous progress by processing the features in Bird-Eye-View (BEV) via the convolutional detector. However, the typical convolution ignores the radial symmetry of the BEV features and increases the difficulty of the detector optimization. To preserve the inherent property of the BEV features and ease the optimization, we propose an azimuth-equivariant convolution (AeConv) and an azimuth-equivariant anchor. The sampling grid of AeConv is always in the radial direction, thus it can learn azimuth-invariant BEV features. The proposed anchor enables the detection head to learn predicting azimuth-irrelevant targets. In addition, we introduce a camera-decoupled virtual depth to unify the depth prediction for the images with different camera intrinsic parameters. The resultant detector is dubbed Azimuth-equivariant Detector (AeDet). Extensive experiments are conducted on nuScenes, and AeDet achieves a 62.0% NDS, surpassing the recent multi-view 3D object detectors such as PETRv2 and BEVDepth by a large margin.

Summary: Recent LSS-based multi-view 3D object detection has made tremendous progress by processing bird's-eye-view (BEV) features with convolutional detectors, but typical convolution ignores the radial symmetry of BEV features and makes detector optimization harder. To preserve this inherent property and ease optimization, the authors propose an azimuth-equivariant convolution (AeConv) and an azimuth-equivariant anchor: AeConv's sampling grid is always in the radial direction, so it can learn azimuth-invariant BEV features, while the proposed anchor enables the detection head to predict azimuth-irrelevant targets. In addition, a camera-decoupled virtual depth unifies depth prediction across images with different camera intrinsics. The resulting detector, AeDet, achieves 62.0% NDS on nuScenes, surpassing recent multi-view 3D detectors such as PETRv2 and BEVDepth by a large margin.
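An illustrative sketch (not AeConv itself): resample a Cartesian BEV feature map onto a polar (azimuth x radius) grid with grid_sample. On such a grid a rotation of the scene becomes a shift along the azimuth axis, which is the intuition behind sampling convolution kernels along the radial direction; grid resolutions are assumptions.

```python
import torch
import torch.nn.functional as F

def bev_to_polar(bev, num_azimuth=64, num_radius=32):
    """bev: (B, C, H, W) centred on the ego vehicle -> (B, C, num_azimuth, num_radius)."""
    theta = torch.linspace(0, 2 * torch.pi, num_azimuth)
    radius = torch.linspace(0, 1, num_radius)               # normalized radius in [0, 1]
    r, t = torch.meshgrid(radius, theta, indexing="xy")     # (num_azimuth, num_radius)
    x = r * torch.cos(t)                                    # grid_sample expects x, y in [-1, 1]
    y = r * torch.sin(t)
    grid = torch.stack([x, y], dim=-1).unsqueeze(0).expand(bev.size(0), -1, -1, -1)
    return F.grid_sample(bev, grid, align_corners=True)

if __name__ == "__main__":
    bev = torch.randn(1, 16, 128, 128)
    print(bev_to_polar(bev).shape)                          # torch.Size([1, 16, 64, 32])
```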

Paper28 Zero-Shot Object Counting

摘要原文: Class-agnostic object counting aims to count object instances of an arbitrary class at test time. It is challenging but also enables many potential applications. Current methods require human-annotated exemplars as inputs which are often unavailable for novel categories, especially for autonomous systems. Thus, we propose zero-shot object counting (ZSC), a new setting where only the class name is available during test time. Such a counting system does not require human annotators in the loop and can operate automatically. Starting from a class name, we propose a method that can accurately identify the optimal patches which can then be used as counting exemplars. Specifically, we first construct a class prototype to select the patches that are likely to contain the objects of interest, namely class-relevant patches. Furthermore, we introduce a model that can quantitatively measure how suitable an arbitrary patch is as a counting exemplar. By applying this model to all the candidate patches, we can select the most suitable patches as exemplars for counting. Experimental results on a recent class-agnostic counting dataset, FSC-147, validate the effectiveness of our method.

Summary: Class-agnostic object counting aims to count instances of an arbitrary class at test time; it is challenging but enables many potential applications. Current methods require human-annotated exemplars as input, which are often unavailable for novel categories, especially for autonomous systems. The authors therefore propose zero-shot object counting (ZSC), a new setting in which only the class name is available at test time, so the counting system needs no human annotator in the loop and can operate automatically. Starting from a class name, their method accurately identifies the optimal patches to use as counting exemplars: a class prototype is first constructed to select patches likely to contain the objects of interest (class-relevant patches), and a model is introduced to quantitatively measure how suitable an arbitrary patch is as a counting exemplar, so that applying it to all candidates yields the most suitable exemplars for counting. Experiments on the recent class-agnostic counting dataset FSC-147 validate the effectiveness of the method.

Paper29 Phase-Shifting Coder: Predicting Accurate Orientation in Oriented Object Detection

摘要原文: With the vigorous development of computer vision, oriented object detection has gradually been featured. In this paper, a novel differentiable angle coder named phase-shifting coder (PSC) is proposed to accurately predict the orientation of objects, along with a dual-frequency version (PSCD). By mapping the rotational periodicity of different cycles into the phase of different frequencies, we provide a unified framework for various periodic fuzzy problems caused by rotational symmetry in oriented object detection. Upon such a framework, common problems in oriented object detection such as boundary discontinuity and square-like problems are elegantly solved in a unified form. Visual analysis and experiments on three datasets prove the effectiveness and the potentiality of our approach. When facing scenarios requiring high-quality bounding boxes, the proposed methods are expected to give a competitive performance. The codes are publicly available at https://github.com/open-mmlab/mmrotate.

Summary: With the vigorous development of computer vision, oriented object detection has gradually come to the fore. This paper proposes a novel differentiable angle coder, the phase-shifting coder (PSC), together with a dual-frequency version (PSCD), to accurately predict object orientation. By mapping the rotational periodicity of different cycles into the phase of different frequencies, it provides a unified framework for the various periodic fuzzy problems caused by rotational symmetry in oriented detection; within this framework, common issues such as boundary discontinuity and the square-like problem are elegantly solved in a unified form. Visual analysis and experiments on three datasets demonstrate the effectiveness and potential of the approach, which is expected to be competitive in scenarios requiring high-quality bounding boxes. Code: https://github.com/open-mmlab/mmrotate.
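A sketch of the phase-shifting idea for periodic angle regression: an angle with period T is mapped to N phase-shifted cosine channels, which a network can regress without a boundary discontinuity, and is decoded back with atan2. This follows the general single-frequency formulation; the step count N and the period used in the demo are assumptions.

```python
import numpy as np

def psc_encode(theta, period=np.pi, n_steps=3):
    """theta (rad) -> (n_steps,) phase-shifted cosine code."""
    phase = 2 * np.pi * theta / period
    offsets = 2 * np.pi * np.arange(n_steps) / n_steps
    return np.cos(phase + offsets)

def psc_decode(code, period=np.pi):
    n_steps = len(code)
    offsets = 2 * np.pi * np.arange(n_steps) / n_steps
    # sum_k code_k * cos(offset_k) = (N/2) cos(phase); sum_k code_k * sin(offset_k) = -(N/2) sin(phase)
    phase = np.arctan2(-(code * np.sin(offsets)).sum(), (code * np.cos(offsets)).sum())
    return (phase % (2 * np.pi)) * period / (2 * np.pi)     # back to [0, period)

if __name__ == "__main__":
    for theta in [0.01, np.pi / 4, np.pi - 0.01]:           # near-boundary angles round-trip smoothly
        print(theta, "->", psc_decode(psc_encode(theta)))
```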

Paper30 Neural Part Priors: Learning To Optimize Part-Based Object Completion in RGB-D Scans

摘要原文: 3D scene understanding has seen significant advances in recent years, but has largely focused on object understanding in 3D scenes with independent per-object predictions. We thus propose to learn Neural Part Priors (NPPs), parametric spaces of objects and their parts, that enable optimizing to fit to a new input 3D scan geometry with global scene consistency constraints. The rich structure of our NPPs enables accurate, holistic scene reconstruction across similar objects in the scene. Both objects and their part geometries are characterized by coordinate field MLPs, facilitating optimization at test time to fit to input geometric observations as well as similar objects in the input scan. This enables more accurate reconstructions than independent per-object predictions as a single forward pass, while establishing global consistency within a scene. Experiments on the ScanNet dataset demonstrate that NPPs significantly outperforms the state-of-the-art in part decomposition and object completion in real-world scenes.

Summary: 3D scene understanding has advanced significantly in recent years but has largely focused on object understanding with independent per-object predictions. This work proposes learning Neural Part Priors (NPPs), parametric spaces of objects and their parts, which can be optimized to fit a new input 3D scan geometry under global scene-consistency constraints. The rich structure of NPPs enables accurate, holistic scene reconstruction across similar objects in the scene: both objects and their part geometries are characterized by coordinate-field MLPs, facilitating test-time optimization to fit the input geometric observations as well as similar objects in the scan. This yields more accurate reconstructions than independent single-forward-pass per-object predictions while establishing global consistency within the scene. Experiments on ScanNet show that NPPs significantly outperform the state of the art in part decomposition and object completion in real-world scenes.

Paper31 Curricular Object Manipulation in LiDAR-Based Object Detection

摘要原文: This paper explores the potential of curriculum learning in LiDAR-based 3D object detection by proposing a curricular object manipulation (COM) framework. The framework embeds the curricular training strategy into both the loss design and the augmentation process. For the loss design, we propose the COMLoss to dynamically predict object-level difficulties and emphasize objects of different difficulties based on training stages. On top of the widely-used augmentation technique called GT-Aug in LiDAR detection tasks, we propose a novel COMAug strategy which first clusters objects in ground-truth database based on well-designed heuristics. Group-level difficulties rather than individual ones are then predicted and updated during training for stable results. Model performance and generalization capabilities can be improved by sampling and augmenting progressively more difficult objects into the training points. Extensive experiments and ablation studies reveal the superior and generality of the proposed framework. The code is available at https://github.com/ZZY816/COM.

Summary: This paper explores the potential of curriculum learning for LiDAR-based 3D object detection with a Curricular Object Manipulation (COM) framework that embeds the curricular training strategy into both the loss design and the augmentation process. For the loss, the proposed COMLoss dynamically predicts object-level difficulty and emphasizes objects of different difficulties according to the training stage. On top of the widely used GT-Aug augmentation for LiDAR detection, a novel COMAug strategy first clusters objects in the ground-truth database using well-designed heuristics, then predicts and updates group-level rather than individual difficulties during training for stable results; sampling and augmenting progressively harder objects into the training point clouds improves model performance and generalization. Extensive experiments and ablation studies reveal the superiority and generality of the framework. Code: https://github.com/ZZY816/COM.
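A hedged sketch of curriculum-style loss weighting: per-object losses are re-weighted by a factor depending on a difficulty estimate and the training progress, so easy objects dominate early and harder ones are emphasized later. The sigmoid weighting function and its temperature are illustrative assumptions, not the exact COMLoss.

```python
import torch

def curriculum_weights(difficulty, progress, temperature=5.0):
    """difficulty in [0, 1] per object; progress in [0, 1] over training."""
    # early (progress ~ 0): down-weight hard objects; late (progress ~ 1): emphasize them
    return torch.sigmoid(temperature * (progress - difficulty))

if __name__ == "__main__":
    per_object_loss = torch.tensor([0.5, 1.2, 2.0])
    difficulty = torch.tensor([0.1, 0.5, 0.9])
    for progress in (0.1, 0.9):
        w = curriculum_weights(difficulty, progress)
        print(progress, (w * per_object_loss).sum().item())
```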

Paper32 Detecting Everything in the Open World: Towards Universal Object Detection

摘要原文: In this paper, we formally address universal object detection, which aims to detect every scene and predict every category. The dependence on human annotations, the limited visual information, and the novel categories in the open world severely restrict the universality of traditional detectors. We propose UniDetector, a universal object detector that has the ability to recognize enormous categories in the open world. The critical points for the universality of UniDetector are: 1) it leverages images of multiple sources and heterogeneous label spaces for training through the alignment of image and text spaces, which guarantees sufficient information for universal representations. 2) it generalizes to the open world easily while keeping the balance between seen and unseen classes, thanks to abundant information from both vision and language modalities. 3) it further promotes the generalization ability to novel categories through our proposed decoupling training manner and probability calibration. These contributions allow UniDetector to detect over 7k categories, the largest measurable category size so far, with only about 500 classes participating in training. Our UniDetector behaves the strong zero-shot generalization ability on large-vocabulary datasets like LVIS, ImageNetBoxes, and VisualGenome - it surpasses the traditional supervised baselines by more than 4% on average without seeing any corresponding images. On 13 public detection datasets with various scenes, UniDetector also achieves state-of-the-art performance with only a 3% amount of training data.

中文总结: 本文正式讨论了通用目标检测,旨在检测每个场景并预测每个类别。传统检测器的通用性受到人类注释的依赖、有限的视觉信息以及开放世界中的新类别严重限制。我们提出了UniDetector,一种通用目标检测器,具有识别开放世界中大量类别的能力。UniDetector通用性的关键点是:1)通过对齐图像和文本空间,利用多个来源和异构标签空间的图像进行训练,从而保证通用表示的充分信息。2)它易于泛化到开放世界,同时保持见过和未见过类别之间的平衡,得益于视觉和语言模态的丰富信息。3)通过我们提出的解耦训练方式和概率校准,进一步提升对新类别的泛化能力。这些贡献使UniDetector能够检测超过7k个类别,迄今为止可测量的最大类别数量,仅有大约500个类别参与训练。我们的UniDetector在大词汇数据集(如LVIS、ImageNetBoxes和VisualGenome)上表现出强大的零样本泛化能力,平均超过传统监督基线4%以上,而不需要看到任何相应图像。在包含各种场景的13个公共检测数据集上,UniDetector也仅使用3%的训练数据就实现了最先进的性能。

Paper33 Dynamic Coarse-To-Fine Learning for Oriented Tiny Object Detection

摘要原文: Detecting arbitrarily oriented tiny objects poses intense challenges to existing detectors, especially for label assignment. Despite the exploration of adaptive label assignment in recent oriented object detectors, the extreme geometry shape and limited feature of oriented tiny objects still induce severe mismatch and imbalance issues. Specifically, the position prior, positive sample feature, and instance are mismatched, and the learning of extreme-shaped objects is biased and unbalanced due to little proper feature supervision. To tackle these issues, we propose a dynamic prior along with the coarse-to-fine assigner, dubbed DCFL. For one thing, we model the prior, label assignment, and object representation all in a dynamic manner to alleviate the mismatch issue. For another, we leverage the coarse prior matching and finer posterior constraint to dynamically assign labels, providing appropriate and relatively balanced supervision for diverse instances. Extensive experiments on six datasets show substantial improvements to the baseline. Notably, we obtain the state-of-the-art performance for one-stage detectors on the DOTA-v1.5, DOTA-v2.0, and DIOR-R datasets under single-scale training and testing. Codes are available at https://github.com/Chasel-Tsui/mmrotate-dcfl.

中文总结: 这段话主要讨论了检测任意方向微小物体对现有检测器提出了巨大挑战,特别是在标签分配方面。尽管最近的定向物体检测器中探索了自适应标签分配,但是定向微小物体的极端几何形状和有限特征仍然引起严重的不匹配和不平衡问题。具体来说,位置先验、正样本特征和实例不匹配,对极端形状物体的学习由于缺乏适当的特征监督而存在偏差和不平衡。为了解决这些问题,他们提出了一种动态先验以及粗到细的分配器,称为DCFL。一方面,他们以动态方式建模先验、标签分配和对象表示,以减轻不匹配问题。另一方面,他们利用粗先验匹配和更细的后验约束动态分配标签,为各种实例提供适当且相对平衡的监督。在六个数据集上进行了大量实验,对基线模型进行了实质性改进。值得注意的是,在单尺度训练和测试下,他们在DOTA-v1.5、DOTA-v2.0和DIOR-R数据集上获得了一阶段检测器的最先进性能。代码可在https://github.com/Chasel-Tsui/mmrotate-dcfl上找到。

Paper34 VoxelNeXt: Fully Sparse VoxelNet for 3D Object Detection and Tracking

摘要原文: 3D object detectors usually rely on hand-crafted proxies, e.g., anchors or centers, and translate well-studied 2D frameworks to 3D. Thus, sparse voxel features need to be densified and processed by dense prediction heads, which inevitably costs extra computation. In this paper, we instead propose VoxelNext for fully sparse 3D object detection. Our core insight is to predict objects directly based on sparse voxel features, without relying on hand-crafted proxies. Our strong sparse convolutional network VoxelNeXt detects and tracks 3D objects through voxel features entirely. It is an elegant and efficient framework, with no need for sparse-to-dense conversion or NMS post-processing. Our method achieves a better speed-accuracy trade-off than other mainframe detectors on the nuScenes dataset. For the first time, we show that a fully sparse voxel-based representation works decently for LIDAR 3D object detection and tracking. Extensive experiments on nuScenes, Waymo, and Argoverse2 benchmarks validate the effectiveness of our approach. Without bells and whistles, our model outperforms all existing LIDAR methods on the nuScenes tracking test benchmark.

中文总结: 这段话主要内容是介绍了一种名为VoxelNext的全稀疏3D物体检测方法。传统的3D物体检测器通常依赖于手工设计的代理,如锚点或中心,并将经过充分研究的2D框架转化为3D。因此,稀疏体素特征需要被稠密化并通过密集预测头进行处理,这不可避免地会增加额外的计算成本。作者提出了VoxelNext,用于全稀疏3D物体检测,核心洞察是直接基于稀疏体素特征预测物体,而不依赖于手工设计的代理。他们的强大稀疏卷积网络VoxelNeXt通过体素特征完全检测和跟踪3D物体。这是一个优雅而高效的框架,无需稀疏到密集的转换或NMS后处理。作者的方法在nuScenes数据集上取得了比其他主流检测器更好的速度-精度折衷。他们首次展示了全稀疏基于体素的表示对于激光雷达3D物体检测和跟踪的良好效果。在nuScenes、Waymo和Argoverse2基准测试上的大量实验证实了他们方法的有效性。作者的模型在nuScenes跟踪测试基准上胜过了所有现有的激光雷达方法,而没有任何花哨的设计。

Paper35 Distilling Focal Knowledge From Imperfect Expert for 3D Object Detection

摘要原文: Multi-camera 3D object detection blossoms in recent years and most of state-of-the-art methods are built up on the bird’s-eye-view (BEV) representations. Albeit remarkable performance, these works suffer from low efficiency. Typically, knowledge distillation can be used for model compression. However, due to unclear 3D geometry reasoning, expert features usually contain some noisy and confusing areas. In this work, we investigate on how to distill the knowledge from an imperfect expert. We propose FD3D, a Focal Distiller for 3D object detection. Specifically, a set of queries are leveraged to locate the instance-level areas for masked feature generation, to intensify feature representation ability in these areas. Moreover, these queries search out the representative fine-grained positions for refined distillation. We verify the effectiveness of our method by applying it to two popular detection models, BEVFormer and DETR3D. The results demonstrate that our method achieves improvements of 4.07 and 3.17 points respectively in terms of NDS metric on nuScenes benchmark. Code is hosted at https://github.com/OpenPerceptionX/BEVPerception-Survey-Recipe.

中文总结: 这段话主要讨论了近年来多摄像头3D物体检测技术的发展以及目前大多数最先进的方法都是基于鸟瞰图(BEV)表示构建的。尽管这些方法表现出色,但它们存在效率低的问题。通常,知识蒸馏可以用于模型压缩。然而,由于对3D几何推理不清晰,专家特征通常包含一些嘈杂和混乱的区域。在这项工作中,研究人员探讨了如何从一个不完美的专家那里蒸馏知识。他们提出了FD3D,一种用于3D物体检测的焦点蒸馏器。具体来说,利用一组查询来定位实例级区域,以生成掩膜特征,以增强这些区域的特征表示能力。此外,这些查询还搜索出用于细化蒸馏的代表性细粒位置。他们通过将该方法应用于两种流行的检测模型,BEVFormer和DETR3D,验证了该方法的有效性。结果表明,我们的方法在nuScenes基准测试中的NDS指标上分别取得了4.07和3.17点的改进。源代码托管在https://github.com/OpenPerceptionX/BEVPerception-Survey-Recipe。

Paper36 Learning Multi-Modal Class-Specific Tokens for Weakly Supervised Dense Object Localization

摘要原文: Weakly supervised dense object localization (WSDOL) relies generally on Class Activation Mapping (CAM), which exploits the correlation between the class weights of the image classifier and the pixel-level features. Due to the limited ability to address intra-class variations, the image classifier cannot properly associate the pixel features, leading to inaccurate dense localization maps. In this paper, we propose to explicitly construct multi-modal class representations by leveraging the Contrastive Language-Image Pre-training (CLIP), to guide dense localization. More specifically, we propose a unified transformer framework to learn two-modalities of class-specific tokens, i.e., class-specific visual and textual tokens. The former captures semantics from the target visual data while the latter exploits the class-related language priors from CLIP, providing complementary information to better perceive the intra-class diversities. In addition, we propose to enrich the multi-modal class-specific tokens with sample-specific contexts comprising visual context and image-language context. This enables more adaptive class representation learning, which further facilitates dense localization. Extensive experiments show the superiority of the proposed method for WSDOL on two multi-label datasets, i.e., PASCAL VOC and MS COCO, and one single-label dataset, i.e., OpenImages. Our dense localization maps also lead to the state-of-the-art weakly supervised semantic segmentation (WSSS) results on PASCAL VOC and MS COCO.

中文总结: 这段话主要讲述了弱监督密集目标定位(WSDOL)通常依赖于类激活映射(CAM),CAM利用图像分类器的类权重与像素级特征之间的相关性。由于图像分类器有限的处理类内变化的能力,无法正确关联像素特征,导致密集定位图不准确。本文提出了通过利用对比语言-图像预训练(CLIP)来显式构建多模态类表示,以指导密集定位。具体来说,我们提出了一个统一的Transformer框架来学习两种类特定的令牌,即类特定的视觉和文本令牌。前者从目标视觉数据中捕获语义,而后者利用CLIP中的类相关语言先验,提供了更好地感知类内多样性的互补信息。此外,我们提出用包含视觉上下文和图像语言上下文的样本特定上下文来丰富多模态类特定令牌。这使得更具适应性的类表示学习,进一步促进了密集定位。大量实验证明了所提方法在两个多标签数据集(PASCAL VOC和MS COCO)和一个单标签数据集(OpenImages)上对WSDOL的优越性。我们的密集定位图还在PASCAL VOC和MS COCO上实现了最先进的弱监督语义分割(WSSS)结果。

Paper37 Universal Instance Perception As Object Discovery and Retrieval

摘要原文: All instance perception tasks aim at finding certain objects specified by some queries such as category names, language expressions, and target annotations, but this complete field has been split into multiple independent subtasks. In this work, we present a universal instance perception model of the next generation, termed UNINEXT. UNINEXT reformulates diverse instance perception tasks into a unified object discovery and retrieval paradigm and can flexibly perceive different types of objects by simply changing the input prompts. This unified formulation brings the following benefits: (1) enormous data from different tasks and label vocabularies can be exploited for jointly training general instance-level representations, which is especially beneficial for tasks lacking in training data. (2) the unified model is parameter-efficient and can save redundant computation when handling multiple tasks simultaneously. UNINEXT shows superior performance on 20 challenging benchmarks from 10 instance-level tasks including classical image-level tasks (object detection and instance segmentation), vision-and-language tasks (referring expression comprehension and segmentation), and six video-level object tracking tasks. Code is available at https://github.com/MasterBin-IIAU/UNINEXT.

中文总结: 这段话主要讲述了实例感知任务的目标是找到由一些查询指定的特定对象,如类别名称、语言表达和目标注释,但整个领域已经分为多个独立的子任务。在这项工作中,我们提出了下一代通用实例感知模型UNINEXT。UNINEXT将不同的实例感知任务重新构建为统一的对象发现和检索范式,可以通过简单更改输入提示来灵活地感知不同类型的对象。这种统一的表述带来了以下好处:(1)可以利用来自不同任务和标签词汇的大量数据共同训练通用的实例级表示,这对于缺乏训练数据的任务尤其有益。 (2)统一模型参数效率高,可以在同时处理多个任务时节省冗余计算。UNINEXT在包括经典图像级任务(目标检测和实例分割)、视觉与语言任务(指代表达理解和分割)以及六个视频级对象跟踪任务在内的10个实例级任务的20个具有挑战性的基准测试中表现出优异的性能。代码可在https://github.com/MasterBin-IIAU/UNINEXT 上找到。

Paper38 YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors

摘要原文: Real-time object detection is one of the most important research topics in computer vision. As new approaches regarding architecture optimization and training optimization are continually being developed, we have found two research topics that have spawned when dealing with these latest state-of-the-art methods. To address the topics, we propose a trainable bag-of-freebies oriented solution. We combine the flexible and efficient training tools with the proposed architecture and the compound scaling method. YOLOv7 surpasses all known object detectors in both speed and accuracy in the range from 5 FPS to 120 FPS and has the highest accuracy 56.8% AP among all known realtime object detectors with 30 FPS or higher on GPU V100. Source code is released in https://github.com/WongKinYiu/yolov7.

中文总结: 实时目标检测是计算机视觉中最重要的研究课题之一。随着架构优化和训练优化方面的新方法不断出现,作者发现在使用这些最新方法时衍生出了两个研究课题。为了解决它们,他们提出了一种面向可训练"免费技巧包"(trainable bag-of-freebies)的方案,将灵活高效的训练工具与所提出的架构及复合缩放方法相结合。YOLOv7在5 FPS到120 FPS范围内的速度和精度均超越所有已知目标检测器,并在GPU V100上以30 FPS及以上运行的所有已知实时目标检测器中取得最高的56.8% AP。源代码已发布在https://github.com/WongKinYiu/yolov7。

Paper39 CAPE: Camera View Position Embedding for Multi-View 3D Object Detection

摘要原文: In this paper, we address the problem of detecting 3D objects from multi-view images. Current query-based methods rely on global 3D position embeddings (PE) to learn the geometric correspondence between images and 3D space. We claim that directly interacting 2D image features with global 3D PE could increase the difficulty of learning view transformation due to the variation of camera extrinsics. Thus we propose a novel method based on CAmera view Position Embedding, called CAPE. We form the 3D position embeddings under the local camera-view coordinate system instead of the global coordinate system, such that 3D position embedding is free of encoding camera extrinsic parameters. Furthermore, we extend our CAPE to temporal modeling by exploiting the object queries of previous frames and encoding the ego motion for boosting 3D object detection. CAPE achieves the state-of-the-art performance (61.0% NDS and 52.5% mAP) among all LiDAR-free methods on standard nuScenes dataset. Codes and models are available.

中文总结: 本文解决了从多视图图像中检测3D物体的问题。目前基于查询的方法依赖全局3D位置嵌入(PE)来学习图像与3D空间之间的几何对应关系。作者认为,由于相机外参的变化,直接让2D图像特征与全局3D PE交互会增加学习视角变换的难度。因此,他们提出了一种基于相机视角位置嵌入的新方法,称为CAPE:在局部相机视角坐标系而非全局坐标系下构造3D位置嵌入,使其无需编码相机外参。此外,他们通过利用前几帧的对象查询并编码自车运动,将CAPE扩展到时序建模,以进一步提升3D物体检测。CAPE在标准nuScenes数据集上取得了所有不依赖LiDAR方法中的最先进性能(61.0% NDS和52.5% mAP)。代码和模型均已公开。
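
下面的小例子示意CAPE的核心想法之一:先把3D参考点从自车(全局)坐标系变换到各相机的局部坐标系,再在局部坐标上构造位置编码,从而避免把相机外参编码进PE。函数名与输入约定均为本文假设,仅作几何变换的示意。

```python
import torch

def to_camera_frame(points_ego, cam_to_ego):
    """把自车(或全局)坐标系下的3D点变换到某个相机的局部坐标系。
    points_ego: (N, 3); cam_to_ego: (4, 4) 该相机的外参矩阵(相机->自车)。"""
    ego_to_cam = torch.linalg.inv(cam_to_ego)
    pts_h = torch.cat([points_ego, torch.ones(points_ego.shape[0], 1)], dim=1)  # 齐次坐标
    pts_cam = (ego_to_cam @ pts_h.T).T[:, :3]
    return pts_cam  # 之后可再经一个小MLP映射为与外参无关的位置编码

pts_cam = to_camera_frame(torch.randn(100, 3), torch.eye(4))
```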

Paper40 Bi-LRFusion: Bi-Directional LiDAR-Radar Fusion for 3D Dynamic Object Detection

摘要原文: LiDAR and Radar are two complementary sensing approaches in that LiDAR specializes in capturing an object’s 3D shape while Radar provides longer detection ranges as well as velocity hints. Though seemingly natural, how to efficiently combine them for improved feature representation is still unclear. The main challenge arises from that Radar data are extremely sparse and lack height information. Therefore, directly integrating Radar features into LiDAR-centric detection networks is not optimal. In this work, we introduce a bi-directional LiDAR-Radar fusion framework, termed Bi-LRFusion, to tackle the challenges and improve 3D detection for dynamic objects. Technically, Bi-LRFusion involves two steps: first, it enriches Radar’s local features by learning important details from the LiDAR branch to alleviate the problems caused by the absence of height information and extreme sparsity; second, it combines LiDAR features with the enhanced Radar features in a unified bird’s-eye-view representation. We conduct extensive experiments on nuScenes and ORR datasets, and show that our Bi-LRFusion achieves state-of-the-art performance for detecting dynamic objects. Notably, Radar data in these two datasets have different formats, which demonstrates the generalizability of our method. Codes will be published.

中文总结: 这段话主要介绍了LiDAR和雷达作为两种互补的感知方法,LiDAR专门用于捕捉物体的三维形状,而雷达则提供更长的检测范围以及速度信息。然而,如何有效地将它们结合起来以改善特征表示仍然不清楚。主要挑战在于雷达数据非常稀疏且缺乏高度信息。因此,直接将雷达特征整合到以LiDAR为中心的检测网络中并不是最佳选择。在这项工作中,我们引入了一个双向LiDAR-雷达融合框架,称为Bi-LRFusion,以解决挑战并改善动态物体的三维检测。技术上,Bi-LRFusion包括两个步骤:首先,通过从LiDAR分支学习重要细节来丰富雷达的局部特征,以减轻由于缺乏高度信息和极度稀疏性而引起的问题;其次,它将LiDAR特征与增强的雷达特征结合在一个统一的鸟瞰图表示中。我们在nuScenes和ORR数据集上进行了大量实验,并展示了我们的Bi-LRFusion在检测动态物体方面实现了最先进的性能。值得注意的是,这两个数据集中的雷达数据具有不同的格式,这证明了我们方法的泛化能力。代码将会发布。

Paper41 LOCATE: Localize and Transfer Object Parts for Weakly Supervised Affordance Grounding

摘要原文: Humans excel at acquiring knowledge through observation. For example, we can learn to use new tools by watching demonstrations. This skill is fundamental for intelligent systems to interact with the world. A key step to acquire this skill is to identify what part of the object affords each action, which is called affordance grounding. In this paper, we address this problem and propose a framework called LOCATE that can identify matching object parts across images, to transfer knowledge from images where an object is being used (exocentric images used for learning), to images where the object is inactive (egocentric ones used to test). To this end, we first find interaction areas and extract their feature embeddings. Then we learn to aggregate the embeddings into compact prototypes (human, object part, and background), and select the one representing the object part. Finally, we use the selected prototype to guide affordance grounding. We do this in a weakly supervised manner, learning only from image-level affordance and object labels. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods by a large margin on both seen and unseen objects.

中文总结: 这段话主要讨论了人类擅长通过观察获取知识,例如看示范就能学会使用新工具,而这种能力是智能系统与世界交互的基础。其中的关键一步是识别物体的哪个部分支持哪种动作,即"affordance grounding"。作者提出了一个名为LOCATE的框架,能够在图像之间识别相互匹配的物体部件,把知识从展示物体被使用的图像(用于学习的外中心图像)迁移到物体处于非活动状态的图像(用于测试的自中心图像)。具体做法是先找到交互区域并提取其特征嵌入,再学习把这些嵌入聚合为紧凑的原型(人、物体部件和背景),并选出代表物体部件的原型,最后用所选原型来指导affordance grounding。整个方法以弱监督方式进行,仅从图像级的affordance标签和物体标签中学习;大量实验表明,该方法在已知和未知物体上都大幅优于最先进的方法。

Paper42 DiGeo: Discriminative Geometry-Aware Learning for Generalized Few-Shot Object Detection

摘要原文: Generalized few-shot object detection aims to achieve precise detection on both base classes with abundant annotations and novel classes with limited training data. Existing approaches enhance few-shot generalization with the sacrifice of base-class performance, or maintain high precision in base-class detection with limited improvement in novel-class adaptation. In this paper, we point out the reason is insufficient Discriminative feature learning for all of the classes. As such, we propose a new training framework, DiGeo, to learn Geometry-aware features of inter-class separation and intra-class compactness. To guide the separation of feature clusters, we derive an offline simplex equiangular tight frame (ETF) classifier whose weights serve as class centers and are maximally and equally separated. To tighten the cluster for each class, we include adaptive class-specific margins into the classification loss and encourage the features close to the class centers. Experimental studies on two few-shot benchmark datasets (PASCAL VOC, MSCOCO) and one long-tail dataset (LVIS) demonstrate that, with a single model, our method can effectively improve generalization on novel classes without hurting the detection of base classes.

中文总结: 这段话主要讨论了广义少样本目标检测的目标:在拥有丰富标注的基础类别和仅有少量训练数据的新类别上同时实现精确检测。现有方法要么以牺牲基础类别性能为代价提升少样本泛化能力,要么在保持基础类别高精度的同时对新类别的适应改进有限。作者指出,其原因在于对所有类别的判别性特征学习不足。因此,他们提出了新的训练框架DiGeo,用于学习类间分离、类内紧致的几何感知特征。为了引导特征簇的分离,他们推导了一个离线的单纯形等角紧框架(ETF)分类器,其权重作为类中心,彼此之间最大且等角地分离。为了收紧每个类的特征簇,他们在分类损失中加入自适应的类别特定间隔(margin),鼓励特征靠近各自的类中心。在两个少样本基准数据集(PASCAL VOC、MS COCO)和一个长尾数据集(LVIS)上的实验研究表明,使用单一模型,该方法就能有效提升对新类别的泛化能力,而不损害基础类别的检测。
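
下面给出单纯形等角紧框架(ETF)权重的一种常见构造方式的示意实现:任意两类权重向量夹角相等且彼此最大分离,可作为固定的类中心。函数名与维度设置为本文假设,是否固定不训练等细节以论文为准。

```python
import torch

def simplex_etf_weights(num_classes, feat_dim):
    """生成单纯形等角紧框架(ETF)权重: 各类向量两两夹角相等且最大分离。"""
    assert feat_dim >= num_classes
    u, _ = torch.linalg.qr(torch.randn(feat_dim, num_classes))     # 随机部分正交基
    center = torch.eye(num_classes) - torch.ones(num_classes, num_classes) / num_classes
    w = (num_classes / (num_classes - 1)) ** 0.5 * (u @ center)
    return w.T                      # (num_classes, feat_dim), 可作为固定的分类器权重

w = simplex_etf_weights(num_classes=21, feat_dim=256)
print((w @ w.T)[:3, :3])            # 对角约为1, 非对角约为 -1/(C-1)
```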

Paper43 vMAP: Vectorised Object Mapping for Neural Field SLAM

摘要原文: We present vMAP, an object-level dense SLAM system using neural field representations. Each object is represented by a small MLP, enabling efficient, watertight object modelling without the need for 3D priors. As an RGB-D camera browses a scene with no prior information, vMAP detects object instances on-the-fly, and dynamically adds them to its map. Specifically, thanks to the power of vectorised training, vMAP can optimise as many as 50 individual objects in a single scene, with an extremely efficient training speed of 5Hz map update. We experimentally demonstrate significantly improved scene-level and object-level reconstruction quality compared to prior neural field SLAM systems. Project page: https://kxhit.github.io/vMAP.

中文总结: 这段话主要介绍了一种名为vMAP的目标级别稠密SLAM系统,使用神经场表示。每个对象由一个小型MLP表示,实现了高效的、无需3D先验知识的完整对象建模。当RGB-D相机浏览一个没有先验信息的场景时,vMAP可以实时检测对象实例,并动态地将它们添加到地图中。由于向量化训练的强大性能,vMAP可以在单个场景中优化多达50个独立对象,训练速度极高,每秒更新地图5次。实验结果表明,与先前的神经场SLAM系统相比,vMAP在场景级别和对象级别的重建质量显著提高。项目页面:https://kxhit.github.io/vMAP。
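
下面用批量矩阵乘(bmm)示意"同时前向K个结构相同的小MLP"的向量化写法,对应vMAP中每个物体一个小MLP并向量化训练的思路;网络结构、输出含义(这里假设为占据/SDF类标量)均为示意性假设。

```python
import torch

class VectorisedMLPs(torch.nn.Module):
    """用批量矩阵乘同时前向K个小MLP(每个物体一个), 避免K次Python循环(示意)。"""
    def __init__(self, num_objects, in_dim=3, hidden=32, out_dim=1):
        super().__init__()
        self.w1 = torch.nn.Parameter(torch.randn(num_objects, in_dim, hidden) * 0.1)
        self.b1 = torch.nn.Parameter(torch.zeros(num_objects, 1, hidden))
        self.w2 = torch.nn.Parameter(torch.randn(num_objects, hidden, out_dim) * 0.1)
        self.b2 = torch.nn.Parameter(torch.zeros(num_objects, 1, out_dim))

    def forward(self, x):                          # x: (K, P, 3), 每个物体P个查询点
        h = torch.relu(torch.bmm(x, self.w1) + self.b1)
        return torch.bmm(h, self.w2) + self.b2     # (K, P, 1), 例如占据值

out = VectorisedMLPs(num_objects=50)(torch.randn(50, 256, 3))
```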

Paper44 DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-Training via Word-Region Alignment

摘要原文: This paper presents DetCLIPv2, an efficient and scalable training framework that incorporates large-scale image-text pairs to achieve open-vocabulary object detection (OVD). Unlike previous OVD frameworks that typically rely on a pre-trained vision-language model (e.g., CLIP) or exploit image-text pairs via a pseudo labeling process, DetCLIPv2 directly learns the fine-grained word-region alignment from massive image-text pairs in an end-to-end manner. To accomplish this, we employ a maximum word-region similarity between region proposals and textual words to guide the contrastive objective. To enable the model to gain localization capability while learning broad concepts, DetCLIPv2 is trained with a hybrid supervision from detection, grounding and image-text pair data under a unified data formulation. By jointly training with an alternating scheme and adopting low-resolution input for image-text pairs, DetCLIPv2 exploits image-text pair data efficiently and effectively: DetCLIPv2 utilizes 13x more image-text pairs than DetCLIP with a similar training time and improves performance. With 13M image-text pairs for pre-training, DetCLIPv2 demonstrates superior open-vocabulary detection performance, e.g., DetCLIPv2 with Swin-T backbone achieves 40.4% zero-shot AP on the LVIS benchmark, which outperforms previous works GLIP/GLIPv2/DetCLIP by 14.4/11.4/4.5% AP, respectively, and even beats its fully-supervised counterpart by a large margin.

中文总结: 这篇论文介绍了DetCLIPv2,这是一个高效且可扩展的训练框架,结合了大规模的图像-文本对,实现了开放词汇物体检测(OVD)。与先前通常依赖预训练视觉-语言模型(例如CLIP)或通过伪标记过程利用图像-文本对的OVD框架不同,DetCLIPv2直接以端到端的方式从海量图像-文本对中学习细粒度的单词-区域对齐。为了实现这一目标,作者利用区域提议和文本单词之间的最大单词-区域相似性来引导对比目标。为了使模型在学习广泛概念的同时获得定位能力,DetCLIPv2通过统一的数据表述,在检测、指代定位(grounding)和图像-文本对数据上进行混合监督训练。通过交替方案联合训练,并对图像-文本对采用低分辨率输入,DetCLIPv2高效且有效地利用图像-文本对数据:它利用了13倍于DetCLIP的图像-文本对,在相近的训练时间内提高了性能。使用1300万个图像-文本对进行预训练后,DetCLIPv2展示了卓越的开放词汇检测性能,例如采用Swin-T骨干网络的DetCLIPv2在LVIS基准上取得了40.4%的零样本AP,比以前的工作GLIP/GLIPv2/DetCLIP分别高出14.4/11.4/4.5个AP,甚至大幅超过其全监督的对应模型。
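
下面的代码示意"最大单词-区域相似度"的聚合方式:每个单词取与其最相似的候选区域,再汇总成图像-文本对的匹配分数,供后续对比学习使用。用均值聚合、特征先做L2归一化等细节是本文的简化假设。

```python
import torch
import torch.nn.functional as F

def image_text_score(region_feats, word_feats):
    """region_feats: (R, d) 候选区域特征; word_feats: (W, d) 文本单词特征。"""
    r = F.normalize(region_feats, dim=-1)
    w = F.normalize(word_feats, dim=-1)
    sim = r @ w.T                        # (R, W) 区域-单词余弦相似度
    per_word = sim.max(dim=0).values     # 每个单词取与其最匹配的区域
    return per_word.mean()               # 聚合为整对分数, 可再接对比损失

score = image_text_score(torch.randn(100, 256), torch.randn(12, 256))
```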

Paper45 ConQueR: Query Contrast Voxel-DETR for 3D Object Detection

摘要原文: Although DETR-based 3D detectors simplify the detection pipeline and achieve direct sparse predictions, their performance still lags behind dense detectors with post-processing for 3D object detection from point clouds. DETRs usually adopt a larger number of queries than GTs (e.g., 300 queries v.s. 40 objects in Waymo) in a scene, which inevitably incur many false positives during inference. In this paper, we propose a simple yet effective sparse 3D detector, named Query Contrast Voxel-DETR (ConQueR), to eliminate the challenging false positives, and achieve more accurate and sparser predictions. We observe that most false positives are highly overlapping in local regions, caused by the lack of explicit supervision to discriminate locally similar queries. We thus propose a Query Contrast mechanism to explicitly enhance queries towards their best-matched GTs over all unmatched query predictions. This is achieved by the construction of positive and negative GT-query pairs for each GT, and a contrastive loss to enhance positive GT-query pairs against negative ones based on feature similarities. ConQueR closes the gap of sparse and dense 3D detectors, and reduces 60% false positives. Our single-frame ConQueR achieves 71.6 mAPH/L2 on the challenging Waymo Open Dataset validation set, outperforming previous sota methods by over 2.0 mAPH/L2. Code: https://github.com/poodarchu/EFG.

中文总结: 尽管基于DETR的3D检测器简化了检测流程并实现了直接的稀疏预测,但在点云3D物体检测上,其性能仍落后于带后处理的稠密检测器。DETR通常在一个场景中使用远多于GT数量的查询(例如Waymo中的300个查询对约40个物体),这在推理时不可避免地产生大量误报。本文提出了一种简单而有效的稀疏3D检测器Query Contrast Voxel-DETR(ConQueR),以消除这些棘手的误报,得到更准确、更稀疏的预测。作者观察到,大多数误报在局部区域高度重叠,原因是缺乏显式监督来区分局部相似的查询。因此,他们提出了查询对比(Query Contrast)机制:相对于所有未匹配的查询预测,显式地把每个查询向其最佳匹配的GT拉近。具体做法是为每个GT构建正、负GT-查询对,并基于特征相似度用对比损失增强正样本对、抑制负样本对。ConQueR缩小了稀疏与稠密3D检测器之间的差距,并减少了60%的误报。其单帧模型在具有挑战性的Waymo Open Dataset验证集上达到71.6 mAPH/L2,比先前的SOTA方法高出2.0 mAPH/L2以上。源代码:https://github.com/poodarchu/EFG。
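
下面给出查询对比思想的一个InfoNCE式示意实现:把与某个GT匹配的查询作为正样本、未匹配查询作为负样本,基于特征相似度做对比学习。温度系数等超参数以及与论文中损失的确切形式均为本文假设。

```python
import torch
import torch.nn.functional as F

def query_contrast_loss(gt_embed, pos_query, neg_queries, tau=0.1):
    """gt_embed: (d,) GT特征; pos_query: (d,) 匹配查询; neg_queries: (K, d) 未匹配查询。"""
    gt = F.normalize(gt_embed, dim=-1)
    cand = F.normalize(torch.cat([pos_query[None], neg_queries], dim=0), dim=-1)
    logits = (cand @ gt) / tau                      # (1+K,) 相似度做logits
    target = torch.zeros(1, dtype=torch.long)       # 正样本位于第0位
    return F.cross_entropy(logits[None], target)

loss = query_contrast_loss(torch.randn(128), torch.randn(128), torch.randn(30, 128))
```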

Paper46 Large-Scale Training Data Search for Object Re-Identification

摘要原文: We consider a scenario where we have access to the target domain, but cannot afford on-the-fly training data annotation, and instead would like to construct an alternative training set from a large-scale data pool such that a competitive model can be obtained. We propose a search and pruning (SnP) solution to this training data search problem, tailored to object re-identification (re-ID), an application aiming to match the same object captured by different cameras. Specifically, the search stage identifies and merges clusters of source identities which exhibit similar distributions with the target domain. The second stage, subject to a budget, then selects identities and their images from the Stage I output, to control the size of the resulting training set for efficient training. The two steps provide us with training sets 80% smaller than the source pool while achieving a similar or even higher re-ID accuracy. These training sets are also shown to be superior to a few existing search methods such as random sampling and greedy sampling under the same budget on training data size. If we release the budget, training sets resulting from the first stage alone allow even higher re-ID accuracy. We provide interesting discussions on the specificity of our method to the re-ID problem and particularly its role in bridging the re-ID domain gap. The code is available at https://github.com/yorkeyao/SnP.

中文总结: 这段话主要讨论了在无法实时进行训练数据标注的情况下,通过从大规模数据池中构建替代训练集来获得竞争性模型的方案。作者提出了一种针对对象再识别(re-ID)应用的搜索和修剪(SnP)解决方案,旨在匹配不同摄像头捕获的同一对象。具体而言,搜索阶段识别和合并源身份的簇,这些身份展现出与目标域相似的分布。第二阶段在预算的限制下,从第一阶段的输出中选择身份和其图像,以控制最终训练集的大小,以便进行高效训练。这两个步骤为我们提供了比源数据池小80%的训练集,同时实现了类似或甚至更高的re-ID准确性。这些训练集还被证明优于一些现有的搜索方法,如随机抽样和贪婪抽样,在相同的训练数据大小预算下。如果释放预算,仅从第一阶段得到的训练集甚至可以实现更高的re-ID准确性。作者还对他们的方法在re-ID问题中的特异性以及在弥合re-ID领域差距方面的作用进行了有趣的讨论。

Paper47 T-SEA: Transfer-Based Self-Ensemble Attack on Object Detection

摘要原文: Compared to query-based black-box attacks, transfer-based black-box attacks do not require any information of the attacked models, which ensures their secrecy. However, most existing transfer-based approaches rely on ensembling multiple models to boost the attack transferability, which is time- and resource-intensive, not to mention the difficulty of obtaining diverse models on the same task. To address this limitation, in this work, we focus on the single-model transfer-based black-box attack on object detection, utilizing only one model to achieve a high-transferability adversarial attack on multiple black-box detectors. Specifically, we first make observations on the patch optimization process of the existing method and propose an enhanced attack framework by slightly adjusting its training strategies. Then, we analogize patch optimization with regular model optimization, proposing a series of self-ensemble approaches on the input data, the attacked model, and the adversarial patch to efficiently make use of the limited information and prevent the patch from overfitting. The experimental results show that the proposed framework can be applied with multiple classical base attack methods (e.g., PGD and MIM) to greatly improve the black-box transferability of the well-optimized patch on multiple mainstream detectors, meanwhile boosting white-box performance.

中文总结: 相较于基于查询的黑盒攻击,基于迁移的黑盒攻击不需要被攻击模型的任何信息,这保证了攻击的隐蔽性。然而,大多数现有的基于迁移的方法依赖集成多个模型来提升攻击迁移性,既耗时又耗资源,更不用说在同一任务上获取多样化模型的困难。为了解决这一限制,本工作专注于基于单一模型的迁移式黑盒攻击目标检测,仅用一个模型就实现对多个黑盒检测器的高迁移性对抗攻击。具体来说,作者首先分析了现有方法的补丁优化过程,并通过略微调整其训练策略提出了一个增强的攻击框架;然后把补丁优化类比为常规的模型优化,在输入数据、被攻击模型和对抗补丁三个层面提出了一系列自集成方法,以充分利用有限信息并防止补丁过拟合。实验结果表明,所提框架可以与多种经典基础攻击方法(例如PGD和MIM)结合,显著提高优化好的补丁在多个主流检测器上的黑盒迁移性,同时提升白盒性能。

Paper48 Bridging Precision and Confidence: A Train-Time Loss for Calibrating Object Detection

摘要原文: Deep neural networks (DNNs) have enabled astounding progress in several vision-based problems. Despite showing high predictive accuracy, recently, several works have revealed that they tend to provide overconfident predictions and thus are poorly calibrated. The majority of the works addressing the miscalibration of DNNs fall under the scope of classification and consider only in-domain predictions. However, there is little to no progress in studying the calibration of DNN-based object detection models, which are central to many vision-based safety-critical applications. In this paper, inspired by the train-time calibration methods, we propose a novel auxiliary loss formulation that explicitly aims to align the class confidence of bounding boxes with the accurateness of predictions (i.e. precision). Since the original formulation of our loss depends on the counts of true positives and false positives in a minibatch, we develop a differentiable proxy of our loss that can be used during training with other application-specific loss functions. We perform extensive experiments on challenging in-domain and out-domain scenarios with six benchmark datasets including MS-COCO, Cityscapes, Sim10k, and BDD100k. Our results reveal that our train-time loss surpasses strong calibration baselines in reducing calibration error for both in and out-domain scenarios. Our source code and pre-trained models are available at https://github.com/akhtarvision/bpc_calibration

中文总结: 这段话主要讨论了深度神经网络(DNNs)在视觉问题中取得了惊人的进展,但近期一些研究表明它们往往提供过于自信的预测,导致校准不佳。大多数研究致力于解决DNN的校准问题,主要集中在分类领域,仅考虑领域内的预测。然而,在研究基于DNN的目标检测模型的校准方面几乎没有进展,而这对许多基于视觉的安全关键应用至关重要。该论文提出了一种新颖的辅助损失公式,旨在明确地将边界框的类别置信度与预测的准确性(即精度)对齐。由于原始损失公式依赖于小批量中的真正例和假正例的计数,因此开发了一个可微的损失代理,可在训练中与其他特定于应用的损失函数一起使用。在包括MS-COCO、Cityscapes、Sim10k和BDD100k在内的六个基准数据集上进行了大量实验,结果显示我们的训练损失在减少领域内外的校准误差方面优于强校准基线。我们的源代码和预训练模型可在https://github.com/akhtarvision/bpc_calibration上获得。
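
下面用几行代码直观示意"让置信度与精度对齐"的训练期校准目标:计算一个minibatch内平均置信度与TP比例之差。注意论文使用的是可微代理形式,这里的绝对差写法只是便于理解的简化假设。

```python
import torch

def confidence_precision_gap(scores, is_true_positive):
    """scores: (N,) 检测置信度; is_true_positive: (N,) 0/1, 由与GT匹配得到(简化示意)。"""
    precision = is_true_positive.float().mean()     # 该batch的精度(TP比例)
    return (scores.mean() - precision).abs()        # 平均置信度与精度之差作为辅助损失

aux = confidence_precision_gap(torch.rand(50), (torch.rand(50) > 0.4).float())
```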

Paper49 Test Time Adaptation With Regularized Loss for Weakly Supervised Salient Object Detection

摘要原文: It is well known that CNNs tend to overfit to the training data. Test-time adaptation is an extreme approach to deal with overfitting: given a test image, the aim is to adapt the trained model to that image. Indeed nothing can be closer to the test data than the test image itself. The main difficulty of test-time adaptation is that the ground truth is not available. Thus test-time adaptation, while intriguing, applies to only a few scenarios where one can design an effective loss function that does not require ground truth. We propose the first approach for test-time Salient Object Detection (SOD) in the context of weak supervision. Our approach is based on a so called regularized loss function, which can be used for training CNN when pixel precise ground truth is unavailable. Regularized loss tends to have lower values for the more likely object segments, and thus it can be used to fine-tune an already trained CNN to a given test image, adapting to images unseen during training. We develop a regularized loss function particularly suitable for test-time adaptation and show that our approach significantly outperforms prior work for weakly supervised SOD.

中文总结: 这段话主要讨论了卷积神经网络(CNNs)往往会对训练数据过拟合的问题以及针对过拟合采取的一种极端方法——测试时适应(test-time adaptation)。测试时适应的目标是针对测试图像调整训练好的模型,因为没有比测试图像本身更接近测试数据的了。然而,测试时适应的主要困难在于缺乏真实标签。因此,测试时适应只适用于少数情况,其中可以设计一个不需要真实标签的有效损失函数。作者提出了一种在弱监督情况下进行测试时显著对象检测(SOD)的方法,基于一种称为正则化损失函数的方法,可以在没有像素精确真实标签的情况下用于训练CNN。正则化损失倾向于对更可能的对象段具有较低的值,因此可以用于微调已经训练好的CNN以适应给定的测试图像,适应训练中未见过的图像。作者开发了一种特别适用于测试时适应的正则化损失函数,并表明我们的方法在弱监督SOD方面明显优于先前的工作。

Paper50 DynamicDet: A Unified Dynamic Architecture for Object Detection

摘要原文: Dynamic neural network is an emerging research topic in deep learning. With adaptive inference, dynamic models can achieve remarkable accuracy and computational efficiency. However, it is challenging to design a powerful dynamic detector, because of no suitable dynamic architecture and exiting criterion for object detection. To tackle these difficulties, we propose a dynamic framework for object detection, named DynamicDet. Firstly, we carefully design a dynamic architecture based on the nature of the object detection task. Then, we propose an adaptive router to analyze the multi-scale information and to decide the inference route automatically. We also present a novel optimization strategy with an exiting criterion based on the detection losses for our dynamic detectors. Last, we present a variable-speed inference strategy, which helps to realize a wide range of accuracy-speed trade-offs with only one dynamic detector. Extensive experiments conducted on the COCO benchmark demonstrate that the proposed DynamicDet achieves new state-of-the-art accuracy-speed trade-offs. For instance, with comparable accuracy, the inference speed of our dynamic detector Dy-YOLOv7-W6 surpasses YOLOv7-E6 by 12%, YOLOv7-D6 by 17%, and YOLOv7-E6E by 39%. The code is available at https://github.com/VDIGPKU/DynamicDet.

中文总结: 动态神经网络是深度学习中的一个新兴研究课题。通过自适应推断,动态模型可以实现卓越的准确性和计算效率。然而,设计强大的动态检测器是具有挑战性的,因为缺乏适用于目标检测的动态架构和退出标准。为了解决这些困难,我们提出了一个名为DynamicDet的目标检测动态框架。首先,我们根据目标检测任务的特性精心设计了一个动态架构。然后,我们提出了一个自适应路由器,用于分析多尺度信息并自动决定推断路线。我们还提出了一种基于检测损失的退出标准的新型优化策略,用于我们的动态检测器。最后,我们提出了一种可变速推断策略,可帮助实现广泛的准确性和速度的折衷,仅使用一个动态检测器。在COCO基准测试上进行的大量实验表明,所提出的DynamicDet实现了新的准确性和速度的折衷的最新水平。例如,与相似的准确性相比,我们的动态检测器Dy-YOLOv7-W6的推断速度超过YOLOv7-E6的12%,超过YOLOv7-D6的17%,超过YOLOv7-E6E的39%。代码可在https://github.com/VDIGPKU/DynamicDet获得。

Paper51 The ObjectFolder Benchmark: Multisensory Learning With Neural and Real Objects

摘要原文: We introduce the ObjectFolder Benchmark, a benchmark suite of 10 tasks for multisensory object-centric learning, centered around object recognition, reconstruction, and manipulation with sight, sound, and touch. We also introduce the ObjectFolder Real dataset, including the multisensory measurements for 100 real-world household objects, building upon a newly designed pipeline for collecting the 3D meshes, videos, impact sounds, and tactile readings of real-world objects. For each task in the ObjectFolder Benchmark, we conduct systematic benchmarking on both the 1,000 multisensory neural objects from ObjectFolder, and the real multisensory data from ObjectFolder Real. Our results demonstrate the importance of multisensory perception and reveal the respective roles of vision, audio, and touch for different object-centric learning tasks. By publicly releasing our dataset and benchmark suite, we hope to catalyze and enable new research in multisensory object-centric learning in computer vision, robotics, and beyond. Project page: https://objectfolder.stanford.edu

中文总结: 这段话主要介绍了ObjectFolder Benchmark,这是一个包含10个任务的基准套件,用于多感官对象为中心的学习,围绕对象识别、重建和视觉、听觉、触觉的操作。同时介绍了ObjectFolder Real数据集,其中包括100个真实世界家用物品的多感官测量数据,基于新设计的数据采集流程,收集了这些物品的3D网格、视频、冲击声音和触觉读数。针对ObjectFolder Benchmark中的每个任务,我们对来自ObjectFolder的1,000个多感官神经对象和来自ObjectFolder Real的真实多感官数据进行了系统基准测试。我们的结果展示了多感官感知的重要性,并揭示了视觉、音频和触觉在不同对象为中心学习任务中的各自作用。通过公开发布我们的数据集和基准套件,我们希望在计算机视觉、机器人学等领域推动和促进多感官对象为中心学习的新研究。项目页面:https://objectfolder.stanford.edu。

Paper52 X3KD: Knowledge Distillation Across Modalities, Tasks and Stages for Multi-Camera 3D Object Detection

摘要原文: Recent advances in 3D object detection (3DOD) have obtained remarkably strong results for LiDAR-based models. In contrast, surround-view 3DOD models based on multiple camera images underperform due to the necessary view transformation of features from perspective view (PV) to a 3D world representation which is ambiguous due to missing depth information. This paper introduces X3KD, a comprehensive knowledge distillation framework across different modalities, tasks, and stages for multi-camera 3DOD. Specifically, we propose cross-task distillation from an instance segmentation teacher (X-IS) in the PV feature extraction stage providing supervision without ambiguous error backpropagation through the view transformation. After the transformation, we apply cross-modal feature distillation (X-FD) and adversarial training (X-AT) to improve the 3D world representation of multi-camera features through the information contained in a LiDAR-based 3DOD teacher. Finally, we also employ this teacher for cross-modal output distillation (X-OD), providing dense supervision at the prediction stage. We perform extensive ablations of knowledge distillation at different stages of multi-camera 3DOD. Our final X3KD model outperforms previous state-of-the-art approaches on the nuScenes and Waymo datasets and generalizes to RADAR-based 3DOD. Qualitative results video at https://youtu.be/1do9DPFmr38.

中文总结: 这段话主要讨论了最近在3D物体检测(3DOD)方面取得的重要进展,特别是基于激光雷达的模型取得了显著强大的结果。与此相反,基于多摄像头图像的环视3DOD模型表现不佳,因为需要将特征从透视视图(PV)转换为3D世界表示,由于缺少深度信息而导致模糊。该论文介绍了X3KD,这是一个跨不同模态、任务和阶段的全面知识蒸馏框架,用于多摄像头3DOD。具体而言,他们提出了从PV特征提取阶段的实例分割教师(X-IS)进行跨任务蒸馏,提供监督,避免了通过视图转换进行模糊错误反向传播。在转换后,他们应用了跨模态特征蒸馏(X-FD)和对抗训练(X-AT)来通过基于激光雷达的3DOD教师中包含的信息来改进多摄像头特征的3D世界表示。最后,他们还利用这个教师进行跨模态输出蒸馏(X-OD),在预测阶段提供密集监督。他们在多摄像头3DOD的不同阶段进行了广泛的知识蒸馏消融实验。最终的X3KD模型在nuScenes和Waymo数据集上优于先前的最先进方法,并且推广到基于雷达的3DOD。可以在 https://youtu.be/1do9DPFmr38 上观看定性结果视频。

Paper53 Gaussian Label Distribution Learning for Spherical Image Object Detection

摘要原文: Spherical image object detection emerges in many applications from virtual reality to robotics and automatic driving, while many existing detectors use ln-norms loss for regression of spherical bounding boxes. There are two intrinsic flaws for ln-norms loss, i.e., independent optimization of parameters and inconsistency between metric (dominated by IoU) and loss. These problems are common in planar image detection but more significant in spherical image detection. Solution for these problems has been extensively discussed in planar image detection by using IoU loss and related variants. However, these solutions cannot be migrated to spherical image object detection due to the undifferentiable of the Spherical IoU (SphIoU). In this paper, we design a simple but effective regression loss based on Gaussian Label Distribution Learning (GLDL) for spherical image object detection. Besides, we observe that the scale of the object in a spherical image varies greatly. The huge differences among objects from different categories make the sample selection strategy based on SphIoU challenging. Therefore, we propose GLDL-ATSS as a better training sample selection strategy for objects of the spherical image, which can alleviate the drawback of IoU threshold-based strategy of scale-sample imbalance. Extensive results on various two datasets with different baseline detectors show the effectiveness of our approach.

中文总结: 这段话主要讨论了在虚拟现实、机器人技术和自动驾驶等应用中,球形图像物体检测的重要性。现有的检测器通常使用ln-norms损失来进行球形边界框的回归,但ln-norms损失存在独立参数优化和度量与损失不一致等两个固有缺陷。这些问题在平面图像检测中很常见,但在球形图像检测中更为显著。解决这些问题的方法在平面图像检测中已被广泛讨论,主要是通过使用IoU损失及其变种。然而,这些解决方案无法迁移到球形图像物体检测中,因为球形IoU(SphIoU)不可微。因此,本文设计了一种基于高斯标签分布学习(GLDL)的简单而有效的回归损失,用于球形图像物体检测。此外,作者观察到球形图像中物体的尺度差异很大,不同类别的物体之间存在巨大差异,这使得基于SphIoU的样本选择策略具有挑战性。因此,作者提出了GLDL-ATSS作为更好的球形图像物体的训练样本选择策略,可以缓解IoU阈值策略导致的尺度-样本不平衡问题。在不同基准检测器的两个数据集上进行的广泛实验结果显示了我们方法的有效性。
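
把框建模成高斯分布后,常用两个高斯之间的KL散度作为回归损失的基本构件;下面给出多元高斯KL散度的通用实现作为示意,至于如何由球面框参数构造均值与协方差、以及最终损失形式,请以论文为准。

```python
import torch

def gaussian_kl(mu_p, cov_p, mu_q, cov_q):
    """KL( N(mu_p, cov_p) || N(mu_q, cov_q) ), 标准闭式解。"""
    k = mu_p.shape[0]
    cov_q_inv = torch.linalg.inv(cov_q)
    diff = (mu_q - mu_p).unsqueeze(-1)                              # (k, 1)
    term_trace = torch.trace(cov_q_inv @ cov_p)
    term_maha = (diff.T @ cov_q_inv @ diff).squeeze()               # 马氏距离项
    term_logdet = torch.logdet(cov_q) - torch.logdet(cov_p)
    return 0.5 * (term_trace + term_maha - k + term_logdet)

kl = gaussian_kl(torch.zeros(2), torch.eye(2), torch.ones(2), 2 * torch.eye(2))
```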

Paper54 MixTeacher: Mining Promising Labels With Mixed Scale Teacher for Semi-Supervised Object Detection

摘要原文: Scale variation across object instances is one of the key challenges in object detection. Although modern detection models have achieved remarkable progress in dealing with the scale variation, it still brings trouble in the semi-supervised case. Most existing semi-supervised object detection methods rely on strict conditions to filter out high-quality pseudo labels from the network predictions. However, we observe that objects with extreme scale tend to have low confidence, which makes the positive supervision missing for these objects. In this paper, we delve into the scale variation problem, and propose a novel framework by introducing a mixed scale teacher to improve the pseudo labels generation and scale invariant learning. In addition, benefiting from the better predictions from mixed scale features, we propose to mine pseudo labels with the score promotion of predictions across scales. Extensive experiments on MS COCO and PASCAL VOC benchmarks under various semi-supervised settings demonstrate that our method achieves new state-of-the-art performance. The code and models will be made publicly available.

中文总结: 这段话主要讨论了目标检测中对象实例之间的尺度变化是一个关键挑战,尽管现代检测模型在处理尺度变化方面取得了显著进展,但在半监督情况下仍然存在困难。大多数现有的半监督目标检测方法依赖于严格的条件来过滤网络预测中的高质量伪标签。然而,作者观察到尺度极端的对象往往具有较低的置信度,这使得这些对象缺乏正面监督。因此,他们提出了一种新颖的框架,通过引入混合尺度教师来改善伪标签生成和尺度不变学习。此外,通过利用混合尺度特征的更好预测,他们提出了通过跨尺度预测得分提升来挖掘伪标签。在各种半监督设置下对MS COCO和PASCAL VOC基准数据集进行的大量实验表明,他们的方法实现了新的最先进性能。代码和模型将公开提供。

Paper55 Layout-Based Causal Inference for Object Navigation

摘要原文: Previous works for ObjectNav task attempt to learn the association (e.g. relation graph) between the visual inputs and the goal during training. Such association contains the prior knowledge of navigating in training environments, which is denoted as the experience. The experience performs a positive effect on helping the agent infer the likely location of the goal when the layout gap between the unseen environments of the test and the prior knowledge obtained in training is minor. However, when the layout gap is significant, the experience exerts a negative effect on navigation. Motivated by keeping the positive effect and removing the negative effect of the experience, we propose the layout-based soft Total Direct Effect (L-sTDE) framework based on the causal inference to adjust the prediction of the navigation policy. In particular, we propose to calculate the layout gap which is defined as the KL divergence between the posterior and the prior distribution of the object layout. Then the sTDE is proposed to appropriately control the effect of the experience based on the layout gap. Experimental results on AI2THOR, RoboTHOR, and Habitat demonstrate the effectiveness of our method.

中文总结: 这段话主要内容是关于ObjectNav任务的先前研究,试图在训练过程中学习视觉输入和目标之间的关联(例如关系图)。这种关联包含了在训练环境中导航的先验知识,即经验。经验在帮助代理推断目标可能位置时产生积极影响,当测试中未见环境与训练中获得的先验知识之间的布局差距较小时。然而,当布局差距显著时,经验对导航产生负面影响。受保持经验的积极影响和消除负面影响的启发,我们提出了基于因果推断的基于布局的软总直接效应(L-sTDE)框架来调整导航策略的预测。具体来说,我们提出计算布局差距,定义为对象布局的后验分布与先验分布之间的KL散度。然后,根据布局差距,提出了sTDE来适当控制经验的影响。在AI2THOR、RoboTHOR和Habitat上的实验结果证明了我们方法的有效性。
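
布局差距在论文中被定义为物体布局后验分布与先验分布之间的KL散度;下面用离散分布的KL散度给出一个最小示意,至于两个分布如何由场景观测和先验知识估计得到,这里不作展开。

```python
import torch
import torch.nn.functional as F

def layout_gap(prior_probs, posterior_probs, eps=1e-8):
    """KL(posterior || prior), 两个输入均为已归一化的离散布局分布。"""
    p = posterior_probs.clamp_min(eps)
    q = prior_probs.clamp_min(eps)
    return (p * (p.log() - q.log())).sum()

gap = layout_gap(F.softmax(torch.randn(10), 0), F.softmax(torch.randn(10), 0))
```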

Paper56 Towards Building Self-Aware Object Detectors via Reliable Uncertainty Quantification and Calibration

摘要原文: The current approach for testing the robustness of object detectors suffers from serious deficiencies such as improper methods of performing out-of-distribution detection and using calibration metrics which do not consider both localisation and classification quality. In this work, we address these issues, and introduce the Self Aware Object Detection (SAOD) task, a unified testing framework which respects and adheres to the challenges that object detectors face in safety-critical environments such as autonomous driving. Specifically, the SAOD task requires an object detector to be: robust to domain shift; obtain reliable uncertainty estimates for the entire scene; and provide calibrated confidence scores for the detections. We extensively use our framework, which introduces novel metrics and large scale test datasets, to test numerous object detectors in two different use-cases, allowing us to highlight critical insights into their robustness performance. Finally, we introduce a simple baseline for the SAOD task, enabling researchers to benchmark future proposed methods and move towards robust object detectors which are fit for purpose. Code is available at: https://github.com/fiveai/saod

中文总结: 这段话主要介绍了当前用于测试目标检测器鲁棒性的方法存在严重缺陷,例如在执行超出分布检测方面使用不当的方法以及使用不考虑定位和分类质量的校准度量。在这项工作中,我们解决了这些问题,并引入了自感知目标检测(SAOD)任务,这是一个统一的测试框架,尊重并遵守目标检测器在自动驾驶等安全关键环境中面临的挑战。具体而言,SAOD任务要求目标检测器具备以下特点:对领域转移具有鲁棒性;为整个场景获取可靠的不确定性估计;并为检测提供校准的置信度得分。我们广泛使用我们的框架,在两种不同的用例中测试了许多目标检测器,从而使我们能够突出显示它们的鲁棒性性能的关键见解。最后,我们引入了SAOD任务的一个简单基准线,使研究人员能够对未来提出的方法进行基准测试,并朝着适用于目的的鲁棒目标检测器迈进。代码可在以下网址找到:https://github.com/fiveai/saod。

Paper57 LEGO-Net: Learning Regular Rearrangements of Objects in Rooms

摘要原文: Humans universally dislike the task of cleaning up a messy room. If machines were to help us with this task, they must understand human criteria for regular arrangements, such as several types of symmetry, co-linearity or co-circularity, spacing uniformity in linear or circular patterns, and further inter-object relationships that relate to style and functionality. Previous approaches for this task relied on human input to explicitly specify goal state, or synthesized scenes from scratch–but such methods do not address the rearrangement of existing messy scenes without providing a goal state. In this paper, we present LEGO-Net, a data-driven transformer-based iterative method for LEarning reGular rearrangement of Objects in messy rooms. LEGO-Net is partly inspired by diffusion models–it starts with an initial messy state and iteratively “de-noises” the position and orientation of objects to a regular state while reducing distance traveled. Given randomly perturbed object positions and orientations in an existing dataset of professionally-arranged scenes, our method is trained to recover a regular re-arrangement. Results demonstrate that our method is able to reliably rearrange room scenes and outperform other methods. We additionally propose a metric for evaluating regularity in room arrangements using number-theoretic machinery.

中文总结: 这段话主要讨论了人类普遍不喜欢清理杂乱房间的任务。如果机器要帮助我们完成这项任务,就必须理解人类对规整布局的标准,如几种对称性、共线性或共圆性、线性或圆形模式中的间距均匀性,以及与风格和功能相关的物体间关系等。先前的方法要么依赖人类输入来明确指定目标状态,要么从头合成场景,都没有解决"在不提供目标状态的前提下重新排列现有杂乱场景"的问题。本文提出的LEGO-Net是一种数据驱动、基于Transformer的迭代方法,用于学习在杂乱房间中对物体进行规整的重新排列。LEGO-Net在一定程度上受扩散模型启发:它从初始的杂乱状态出发,迭代地"去噪"物体的位置和朝向,使其趋于规整状态,同时尽量减少移动距离。该方法在专业布置场景的数据集上训练,输入为随机扰动后的物体位置和朝向,学习恢复出规整的排列。结果表明,该方法能够可靠地重新排列房间场景并优于其他方法。此外,作者还基于数论工具提出了一种评估房间布置规整性的指标。

Paper58 Angelic Patches for Improving Third-Party Object Detector Performance

摘要原文: Deep learning models have shown extreme vulnerability to simple perturbations and spatial transformations. In this work, we explore whether we can adopt the characteristics of adversarial attack methods to help improve perturbation robustness for object detection. We study a class of realistic object detection settings wherein the target objects have control over their appearance. To this end, we propose a reversed Fast Gradient Sign Method (FGSM) to obtain these angelic patches that significantly increase the detection probability, even without pre-knowledge of the perturbations. In detail, we apply the patch to each object instance simultaneously, strengthen not only classification but also bounding box accuracy. Experiments demonstrate the efficacy of the partial-covering patch in solving the complex bounding box problem. More importantly, the performance is also transferable to different detection models even under severe affine transformations and deformable shapes. To our knowledge, we are the first (object detection) patch that achieves both cross-model and multiple-patch efficacy. We observed average accuracy improvements of 30% in the real-world experiments, which brings large social value. Our code is available at: https://github.com/averysi224/angelic_patches.

中文总结: 这段话主要讨论了深度学习模型对简单扰动和空间变换的极端脆弱性。研究人员探讨了是否可以借鉴对抗性攻击方法的特征来帮助改善目标检测的扰动鲁棒性。他们研究了一类现实的目标检测设置,其中目标对象可以控制其外观。为此,他们提出了一种反向的快速梯度符号方法(FGSM),以获得这些“天使贴片”,显著增加检测概率,即使没有关于扰动的预先知识。具体来说,他们将贴片同时应用于每个对象实例,不仅增强了分类准确性,还增强了边界框准确性。实验证明了部分覆盖贴片在解决复杂边界框问题方面的有效性。更重要的是,这种性能还可以转移到不同的检测模型,即使在严重的仿射变换和可变形形状下也能保持。据我们所知,这是第一个实现跨模型和多贴片有效性的(目标检测)贴片。在真实世界实验中,我们观察到平均准确率提高了30%,这带来了巨大的社会价值。我们的代码可在以下链接找到:https://github.com/averysi224/angelic_patches。
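
下面示意"反向FGSM"的一步更新:与对抗攻击相反,沿降低检测损失的方向修改贴片,使物体更容易被检测到。步长、裁剪范围以及detection_loss的具体构成均为本文的示意性假设。

```python
import torch

def reversed_fgsm_step(patch, detection_loss, epsilon=2 / 255):
    """patch需requires_grad=True, detection_loss需由含贴片的图像计算得到且对patch可导。"""
    grad, = torch.autograd.grad(detection_loss, patch)
    new_patch = patch - epsilon * grad.sign()       # 注意是减号: 让检测损失下降
    return new_patch.clamp(0.0, 1.0).detach()

# 用法示意: patch = reversed_fgsm_step(patch, loss_fn(model, paste(image, patch)))
```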

Paper59 Discriminating Known From Unknown Objects via Structure-Enhanced Recurrent Variational AutoEncoder

摘要原文: Discriminating known from unknown objects is an important essential ability for human beings. To simulate this ability, a task of unsupervised out-of-distribution object detection (OOD-OD) is proposed to detect the objects that are never-seen-before during model training, which is beneficial for promoting the safe deployment of object detectors. Due to lacking unknown data for supervision, for this task, the main challenge lies in how to leverage the known in-distribution (ID) data to improve the detector’s discrimination ability. In this paper, we first propose a method of Structure-Enhanced Recurrent Variational AutoEncoder (SR-VAE), which mainly consists of two dedicated recurrent VAE branches. Specifically, to boost the performance of object localization, we explore utilizing the classical Laplacian of Gaussian (LoG) operator to enhance the structure information in the extracted low-level features. Meanwhile, we design a VAE branch that recurrently generates the augmentation of the classification features to strengthen the discrimination ability of the object classifier. Finally, to alleviate the impact of lacking unknown data, another cycle-consistent conditional VAE branch is proposed to synthesize virtual OOD features that deviate from the distribution of ID features, which improves the capability of distinguishing OOD objects. In the experiments, our method is evaluated on OOD-OD, open-vocabulary detection, and incremental object detection. The significant performance gains over baselines show the superiorities of our method. The code will be released at https://github.com/AmingWu/SR-VAE.

中文总结: 这段话主要讨论了区分已知对象和未知对象对于人类的重要性,提出了一种无监督的超出分布对象检测(OOD-OD)任务,旨在检测在模型训练期间从未见过的对象,这有助于促进对象检测器的安全部署。由于缺乏未知数据进行监督,因此该任务的主要挑战在于如何利用已知的分布(ID)数据来提高检测器的区分能力。文中首先提出了一种结构增强的循环变分自编码器(SR-VAE)方法,主要由两个专用的循环VAE分支组成。具体而言,为了提高对象定位的性能,探索利用经典的高斯拉普拉斯(LoG)算子来增强提取的低级特征中的结构信息。同时,设计了一个VAE分支,循环生成分类特征的增强,以加强对象分类器的区分能力。最后,为了缓解缺乏未知数据的影响,提出了另一个循环一致的条件VAE分支,用于合成偏离ID特征分布的虚拟OOD特征,从而提高区分OOD对象的能力。在实验中,我们的方法在OOD-OD、开放词汇检测和增量对象检测上进行了评估。与基线相比,显著的性能提升显示了我们方法的优越性。代码将发布在https://github.com/AmingWu/SR-VAE。
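
下面示意如何用固定的LoG(高斯拉普拉斯)卷积核对低级特征做逐通道滤波,以增强结构/边缘信息;所用的5x5近似核以及"滤波结果与原特征相加"的融合方式都是常见做法层面的假设,并非论文的确切实现。

```python
import torch
import torch.nn.functional as F

# 常见的 5x5 LoG 近似核
LOG_KERNEL = torch.tensor([
    [ 0,  0, -1,  0,  0],
    [ 0, -1, -2, -1,  0],
    [-1, -2, 16, -2, -1],
    [ 0, -1, -2, -1,  0],
    [ 0,  0, -1,  0,  0],
], dtype=torch.float32)

def log_enhance(feat):
    """feat: (B, C, H, W); 逐通道做固定LoG滤波(depthwise卷积)并与原特征相加。"""
    c = feat.shape[1]
    kernel = LOG_KERNEL.to(feat).view(1, 1, 5, 5).repeat(c, 1, 1, 1)
    edges = F.conv2d(feat, kernel, padding=2, groups=c)
    return feat + edges

out = log_enhance(torch.randn(2, 64, 32, 32))
```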

Paper60 Simple Cues Lead to a Strong Multi-Object Tracker

摘要原文: For a long time, the most common paradigm in MultiObject Tracking was tracking-by-detection (TbD), where objects are first detected and then associated over video frames. For association, most models resourced to motion and appearance cues, e.g., re-identification networks. Recent approaches based on attention propose to learn the cues in a data-driven manner, showing impressive results. In this paper, we ask ourselves whether simple good old TbD methods are also capable of achieving the performance of end-to-end models. To this end, we propose two key ingredients that allow a standard re-identification network to excel at appearance-based tracking. We extensively analyse its failure cases, and show that a combination of our appearance features with a simple motion model leads to strong tracking results. Our tracker generalizes to four public datasets, namely MOT17, MOT20, BDD100k, and DanceTrack, achieving state-ofthe-art performance. https://github.com/dvl-tum/GHOST

中文总结: 这段话主要讨论了多目标跟踪中的两类范式:一类是长期主流的基于检测的跟踪(tracking-by-detection),即先检测物体,再在视频帧之间做关联,关联时大多依赖运动与外观线索(例如重识别网络);另一类是近期基于注意力的方法,以数据驱动的方式学习这些线索,效果令人瞩目。作者探讨了简单的传统跟踪方法是否也能达到端到端模型的性能,并提出了两个关键设计,使标准的重识别网络在基于外观的跟踪中表现出色。他们深入分析了失败案例,并展示了把这些外观特征与一个简单的运动模型结合即可获得强大的跟踪效果。最终,该跟踪器能泛化到四个公共数据集(MOT17、MOT20、BDD100k和DanceTrack),并取得最先进的性能。
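
下面给出"外观+运动线索做一对一关联"的最小示意:外观余弦距离与中心点距离加权构成代价矩阵,再用匈牙利算法求解匹配。权重取值、距离归一化方式均为假设,与GHOST的具体设计不必一致。

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_feats, det_feats, track_boxes, det_boxes, w_app=0.7):
    """*_feats: (N,d)/(M,d) 外观特征; *_boxes: (N,4)/(M,4) [x1,y1,x2,y2]。"""
    tf = track_feats / (np.linalg.norm(track_feats, axis=1, keepdims=True) + 1e-6)
    df = det_feats / (np.linalg.norm(det_feats, axis=1, keepdims=True) + 1e-6)
    app_cost = 1.0 - tf @ df.T                                   # 外观余弦距离
    t_ctr = (track_boxes[:, :2] + track_boxes[:, 2:]) / 2
    d_ctr = (det_boxes[:, :2] + det_boxes[:, 2:]) / 2
    mot_cost = np.linalg.norm(t_ctr[:, None] - d_ctr[None], axis=-1)
    mot_cost = mot_cost / (mot_cost.max() + 1e-6)                # 归一化到[0,1]
    cost = w_app * app_cost + (1 - w_app) * mot_cost
    rows, cols = linear_sum_assignment(cost)                     # 匈牙利算法
    return list(zip(rows.tolist(), cols.tolist()))

matches = associate(np.random.rand(3, 128), np.random.rand(4, 128),
                    np.random.rand(3, 4) * 100, np.random.rand(4, 4) * 100)
```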

Paper61 SOOD: Towards Semi-Supervised Oriented Object Detection

摘要原文: Semi-Supervised Object Detection (SSOD), aiming to explore unlabeled data for boosting object detectors, has become an active task in recent years. However, existing SSOD approaches mainly focus on horizontal objects, leaving multi-oriented objects that are common in aerial images unexplored. This paper proposes a novel Semi-supervised Oriented Object Detection model, termed SOOD, built upon the mainstream pseudo-labeling framework. Towards oriented objects in aerial scenes, we design two loss functions to provide better supervision. Focusing on the orientations of objects, the first loss regularizes the consistency between each pseudo-label-prediction pair (includes a prediction and its corresponding pseudo label) with adaptive weights based on their orientation gap. Focusing on the layout of an image, the second loss regularizes the similarity and explicitly builds the many-to-many relation between the sets of pseudo-labels and predictions. Such a global consistency constraint can further boost semi-supervised learning. Our experiments show that when trained with the two proposed losses, SOOD surpasses the state-of-the-art SSOD methods under various settings on the DOTA-v1.5 benchmark. The code will be available at https://github.com/HamPerdredes/SOOD.

中文总结: 这段话主要讨论了半监督目标检测(SSOD)在近年来变得活跃,旨在探索未标记数据以提升目标检测器的效果。然而,现有的SSOD方法主要集中在水平对象上,而忽视了在航拍图像中常见的多方向对象。本文提出了一种新颖的半监督定向目标检测模型,称为SOOD,建立在主流的伪标记框架之上。针对航拍场景中的定向对象,我们设计了两个损失函数以提供更好的监督。第一个损失函数侧重于对象的方向,通过自适应权重基于它们的方向差异规范每个伪标签-预测对之间的一致性。第二个损失函数则侧重于图像的布局,规范了伪标签和预测集合之间的相似性,并明确建立了多对多的关系。这种全局一致性约束可以进一步提升半监督学习效果。我们的实验证明,当使用两个提出的损失进行训练时,SOOD在DOTA-v1.5基准测试中各种设置下均超过了最先进的SSOD方法。代码将在https://github.com/HamPerdredes/SOOD 上提供。
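
下面示意第一个损失的思想:按伪标签与预测之间的方向差,为每一对样本自适应加权的一致性损失,方向差越大权重越高。权重函数的具体形状(此处取1~2倍的线性加权)是本文为说明而做的假设。

```python
import math
import torch

def orientation_weighted_consistency(pred_angle, pseudo_angle, base_loss):
    """角度为弧度; base_loss: (N,) 每对伪标签-预测的基础一致性损失。"""
    diff = torch.remainder(pred_angle - pseudo_angle, math.pi)
    gap = torch.minimum(diff, math.pi - diff)        # 最小方向差, 范围[0, pi/2]
    weight = 1.0 + gap / (math.pi / 2)               # 方向差越大, 惩罚越重(1~2倍)
    return (weight.detach() * base_loss).mean()

loss = orientation_weighted_consistency(torch.rand(16) * 3.1, torch.rand(16) * 3.1, torch.rand(16))
```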

Paper62 Dense Distinct Query for End-to-End Object Detection

摘要原文: One-to-one label assignment in object detection has successfully obviated the need of non-maximum suppression (NMS) as a postprocessing and makes the pipeline end-to-end. However, it triggers a new dilemma as the widely used sparse queries cannot guarantee a high recall, while dense queries inevitably bring more similar queries and encounters optimization difficulty. As both sparse and dense queries are problematic, then what are the expected queries in end-to-end object detection? This paper shows that the solution should be Dense Distinct Queries (DDQ). Concretely, we first lay dense queries like traditional detectors and then select distinct ones for one-to-one assignments. DDQ blends the advantages of traditional and recent end-to-end detectors and significantly improves the performance of various detectors including FCN, R-CNN, and DETRs. Most impressively, DDQ-DETR achieves 52.1 AP on MS-COCO dataset within 12 epochs using a ResNet-50 backbone, outperforming all existing detectors in the same setting. DDQ also shares the benefit of end-to-end detectors in crowded scenes and achieves 93.8 AP on CrowdHuman. We hope DDQ can inspire researchers to consider the complementarity between traditional methods and end-to-end detectors. The source code can be found at https://github.com/jshilong/DDQ.

中文总结: 这段话主要讨论了在目标检测中一对一标签分配的方法成功地消除了非极大值抑制(NMS)作为后处理的需要,使得整个流程成为端到端。然而,这也引发了一个新的困境,即广泛使用的稀疏查询无法保证高召回率,而密集查询则会带来更多相似的查询并遇到优化困难。因为稀疏和密集查询都存在问题,所以在端到端目标检测中期望的查询是什么?这篇论文表明解决方案应该是密集独特查询(DDQ)。具体来说,首先像传统检测器一样放置密集查询,然后为一对一分配选择不同的查询。DDQ融合了传统方法和最近的端到端检测器的优点,并显著提高了各种检测器的性能,包括FCN、R-CNN和DETRs。最令人印象深刻的是,DDQ-DETR在MS-COCO数据集上使用ResNet-50骨干网络在12个时期内实现了52.1的AP,优于同一设置下的所有现有检测器。DDQ在拥挤场景中也享有端到端检测器的好处,并在CrowdHuman数据集上实现了93.8的AP。我们希望DDQ能够激发研究人员考虑传统方法和端到端检测器之间的互补性。源代码可以在https://github.com/jshilong/DDQ找到。
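
下面示意"先密集铺查询、再选出彼此不同的查询"的一种做法:按各查询的预测框做类无关NMS去重,保留下来的查询再参与一对一标签分配。IoU阈值与topk数量均为示意性设置。

```python
import torch
from torchvision.ops import nms

def select_distinct_queries(query_feats, boxes, scores, iou_thr=0.7, topk=300):
    """query_feats: (N,d); boxes: (N,4) [x1,y1,x2,y2]; scores: (N,)。"""
    keep = nms(boxes, scores, iou_thr)[:topk]        # 类无关NMS挑出互不相同的查询
    return query_feats[keep], boxes[keep], scores[keep]

xy = torch.rand(900, 2) * 100
boxes = torch.cat([xy, xy + torch.rand(900, 2) * 20 + 1], dim=1)
feats, boxes, scores = select_distinct_queries(torch.randn(900, 256), boxes, torch.rand(900))
```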

Paper63 Virtual Sparse Convolution for Multimodal 3D Object Detection

摘要原文: Recently, virtual/pseudo-point-based 3D object detection that seamlessly fuses RGB images and LiDAR data by depth completion has gained great attention. However, virtual points generated from an image are very dense, introducing a huge amount of redundant computation during detection. Meanwhile, noises brought by inaccurate depth completion significantly degrade detection precision. This paper proposes a fast yet effective backbone, termed VirConvNet, based on a new operator VirConv (Virtual Sparse Convolution), for virtual-point-based 3D object detection. The VirConv consists of two key designs: (1) StVD (Stochastic Voxel Discard) and (2) NRConv (Noise-Resistant Submanifold Convolution). The StVD alleviates the computation problem by discarding large amounts of nearby redundant voxels. The NRConv tackles the noise problem by encoding voxel features in both 2D image and 3D LiDAR space. By integrating our VirConv, we first develop an efficient pipeline VirConv-L based on an early fusion design. Then, we build a high-precision pipeline VirConv-T based on a transformed refinement scheme. Finally, we develop a semi-supervised pipeline VirConv-S based on a pseudo-label framework. On the KITTI car 3D detection test leaderboard, our VirConv-L achieves 85% AP with a fast running speed of 56ms. Our VirConv-T and VirConv-S attains a high-precision of 86.3% and 87.2% AP, and currently rank 2nd and 1st, respectively. The code is available at https://github.com/hailanyi/VirConv.

中文总结: 最近,虚拟/伪点基于RGB图像和LiDAR数据融合的3D物体检测引起了广泛关注。然而,从图像生成的虚拟点非常密集,在检测过程中引入了大量冗余计算。同时,由于不准确的深度完成引入的噪声显著降低了检测精度。本文提出了一种快速而有效的骨干网络,称为VirConvNet,基于一种新的操作符VirConv(虚拟稀疏卷积),用于基于虚拟点的3D物体检测。VirConv包括两个关键设计:(1)StVD(随机体素丢弃)和(2)NRConv(抗噪子流形卷积)。StVD通过丢弃大量相邻冗余体素来缓解计算问题。NRConv通过在2D图像和3D LiDAR空间中编码体素特征来解决噪声问题。通过集成我们的VirConv,我们首先开发了一种基于早期融合设计的高效流水线VirConv-L。然后,我们基于转换细化方案构建了一个高精度流水线VirConv-T。最后,我们基于伪标签框架开发了一种半监督流水线VirConv-S。在KITTI汽车3D检测测试排行榜上,我们的VirConv-L以56毫秒的快速运行速度达到了85%的AP。我们的VirConv-T和VirConv-S分别达到了86.3%和87.2%的高精度AP,并目前排名第2和第1。代码可在https://github.com/hailanyi/VirConv上找到。
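
下面示意StVD(随机体素丢弃)的思路:近处的虚拟点/体素高度冗余,训练时按比例随机丢弃,远处体素全部保留,以降低计算量。距离阈值与保留比例为示意性假设。

```python
import torch

def stochastic_voxel_discard(voxel_coords, voxel_feats, near_range=20.0, keep_ratio=0.3):
    """voxel_coords: (N,3) 体素中心xyz坐标(米); voxel_feats: (N,C)。"""
    dist = voxel_coords[:, :2].norm(dim=1)                       # 到传感器的水平距离
    near = dist < near_range
    keep = ~near | (torch.rand(dist.shape[0]) < keep_ratio)      # 近处随机保留, 远处全留
    return voxel_coords[keep], voxel_feats[keep]

coords, feats = stochastic_voxel_discard(torch.randn(1000, 3) * 30, torch.randn(1000, 32))
```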

Paper64 Weak-Shot Object Detection Through Mutual Knowledge Transfer

摘要原文: Weak-shot Object Detection methods exploit a fully-annotated source dataset to facilitate the detection performance on the target dataset which only contains image-level labels for novel categories. To bridge the gap between these two datasets, we aim to transfer the object knowledge between the source (S) and target (T) datasets in a bi-directional manner. We propose a novel Knowledge Transfer (KT) loss which simultaneously distills the knowledge of objectness and class entropy from a proposal generator trained on the S dataset to optimize a multiple instance learning module on the T dataset. By jointly optimizing the classification loss and the proposed KT loss, the multiple instance learning module effectively learns to classify object proposals into novel categories in the T dataset with the transferred knowledge from base categories in the S dataset. Noticing the predicted boxes on the T dataset can be regarded as an extension for the original annotations on the S dataset to refine the proposal generator in return, we further propose a novel Consistency Filtering (CF) method to reliably remove inaccurate pseudo labels by evaluating the stability of the multiple instance learning module upon noise injections. Via mutually transferring knowledge between the S and T datasets in an iterative manner, the detection performance on the target dataset is significantly improved. Extensive experiments on public benchmarks validate that the proposed method performs favourably against the state-of-the-art methods without increasing the model parameters or inference computational complexity.

中文总结: 弱标注目标检测方法利用完全标注的源数据集来提高只包含新类别图像级标签的目标数据集上的检测性能。为了弥合这两个数据集之间的差距,我们旨在以双向方式在源(S)和目标(T)数据集之间传递目标知识。我们提出了一种新颖的知识传递(KT)损失,同时从在S数据集上训练的提议生成器中提炼目标性和类别熵的知识,以优化T数据集上的多实例学习模块。通过联合优化分类损失和提出的KT损失,多实例学习模块有效地学习将目标提议分类为T数据集中的新类别,并从S数据集中的基础类别传递知识。注意到T数据集上的预测框可以被视为对S数据集上原始标注的扩展,以改进提议生成器,我们进一步提出了一种新颖的一致性过滤(CF)方法,通过评估多实例学习模块对噪声注入的稳定性来可靠地去除不准确的伪标签。通过在S和T数据集之间相互迭代地传递知识,目标数据集上的检测性能得到了显著提升。在公共基准上进行的大量实验验证了所提出的方法在不增加模型参数或推理计算复杂性的情况下表现优于现有方法。

Paper65 Enhanced Training of Query-Based Object Detection via Selective Query Recollection

摘要原文: This paper investigates a phenomenon where query-based object detectors mispredict at the last decoding stage while predicting correctly at an intermediate stage. We review the training process and attribute the overlooked phenomenon to two limitations: lack of training emphasis and cascading errors from decoding sequence. We design and present Selective Query Recollection (SQR), a simple and effective training strategy for query-based object detectors. It cumulatively collects intermediate queries as decoding stages go deeper and selectively forwards the queries to the downstream stages aside from the sequential structure. Such-wise, SQR places training emphasis on later stages and allows later stages to work with intermediate queries from earlier stages directly. SQR can be easily plugged into various query-based object detectors and significantly enhances their performance while leaving the inference pipeline unchanged. As a result, we apply SQR on Adamixer, DAB-DETR, and Deformable-DETR across various settings (backbone, number of queries, schedule) and consistently brings 1.4 2.8 AP improvement.

Summary: This paper studies a phenomenon in which query-based object detectors predict correctly at an intermediate decoding stage but mispredict at the final stage. Reviewing the training process, the authors attribute this overlooked behaviour to two limitations: a lack of training emphasis on later stages and cascading errors along the decoding sequence. They propose Selective Query Recollection (SQR), a simple and effective training strategy for query-based detectors: as decoding goes deeper, SQR cumulatively collects intermediate queries and selectively forwards them to downstream stages in addition to the usual sequential path. This places training emphasis on later stages and lets them work directly with intermediate queries from earlier stages. SQR can easily be plugged into various query-based detectors and significantly improves their performance while leaving the inference pipeline unchanged. Applied to Adamixer, DAB-DETR, and Deformable-DETR across different backbones, query numbers, and schedules, it consistently brings a 1.4-2.8 AP improvement.
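A toy rendering of the recollection idea: besides the usual stage-to-stage flow, each decoder stage also receives the queries emitted a fixed number of stages earlier, so later stages see more query groups and carry more training emphasis. The decoder module, shapes, and the "also take stage s-2" rule are simplifications for illustration, not SQR's exact selection scheme.

```python
import torch
import torch.nn as nn

class ToyDecoderStage(nn.Module):
    """Stand-in for one transformer decoder stage refining object queries."""
    def __init__(self, dim):
        super().__init__()
        self.refine = nn.Linear(dim, dim)

    def forward(self, queries):                      # (G, N, dim) -> (G, N, dim)
        return queries + torch.relu(self.refine(queries))

def decode_with_sqr(stages, init_queries):
    """Stage s consumes the outputs of stage s-1 and stage s-2 (when available),
    so the number of query groups grows as decoding deepens."""
    collected = [init_queries.unsqueeze(0)]          # list of (G_i, N, dim)
    for s, stage in enumerate(stages):
        inp = collected[-1] if s < 2 else torch.cat([collected[-1], collected[-2]], dim=0)
        collected.append(stage(inp))
    return collected[1:]                             # per-stage outputs for the loss

stages = nn.ModuleList(ToyDecoderStage(256) for _ in range(6))
per_stage_outputs = decode_with_sqr(stages, torch.randn(300, 256))
```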

Paper66 The Differentiable Lens: Compound Lens Search Over Glass Surfaces and Materials for Object Detection

摘要原文: Most camera lens systems are designed in isolation, separately from downstream computer vision methods. Recently, joint optimization approaches that design lenses alongside other components of the image acquisition and processing pipeline–notably, downstream neural networks–have achieved improved imaging quality or better performance on vision tasks. However, these existing methods optimize only a subset of lens parameters and cannot optimize glass materials given their categorical nature. In this work, we develop a differentiable spherical lens simulation model that accurately captures geometrical aberrations. We propose an optimization strategy to address the challenges of lens design–notorious for non-convex loss function landscapes and many manufacturing constraints–that are exacerbated in joint optimization tasks. Specifically, we introduce quantized continuous glass variables to facilitate the optimization and selection of glass materials in an end-to-end design context, and couple this with carefully designed constraints to support manufacturability. In automotive object detection, we report improved detection performance over existing designs even when simplifying designs to two- or three-element lenses, despite significantly degrading the image quality.

Summary: This work addresses the design of camera lens systems. Most lenses are designed in isolation from downstream computer vision methods; recent joint optimization approaches that design lenses together with other components of the acquisition and processing pipeline, notably downstream neural networks, achieve better imaging quality or task performance. Existing methods, however, optimize only a subset of lens parameters and cannot optimize glass materials because of their categorical nature. The authors develop a differentiable spherical lens simulation model that accurately captures geometrical aberrations and propose an optimization strategy for the challenges of lens design, which is notorious for non-convex loss landscapes and numerous manufacturing constraints that are exacerbated in joint optimization. Specifically, they introduce quantized continuous glass variables to enable end-to-end optimization and selection of glass materials, coupled with carefully designed constraints to support manufacturability. For automotive object detection, they report improved detection performance over existing designs even when simplifying the design to two- or three-element lenses, despite significantly degraded image quality.
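The "quantized continuous glass variables" idea can be mimicked with a straight-through projection: the optimizer moves a continuous (refractive index, Abbe number) pair, but the forward pass snaps it to the nearest entry of a discrete glass catalog. The catalog values and the distance normalization below are made up for illustration and are not taken from the paper.

```python
import torch

# Hypothetical glass catalog: (refractive index n_d, Abbe number V_d) per glass.
CATALOG = torch.tensor([[1.5168, 64.17],
                        [1.6200, 36.37],
                        [1.7552, 27.51]])

def quantized_glass(continuous_var):
    """Straight-through snap of a continuous glass variable to the catalog.

    continuous_var: (2,) learnable tensor (n_d, V_d). Forward uses the nearest
    real glass; backward passes gradients to the continuous variable unchanged.
    """
    scale = CATALOG.std(dim=0)                       # keep both axes comparable
    d = ((CATALOG - continuous_var) / scale).pow(2).sum(dim=1)
    nearest = CATALOG[d.argmin()]
    return continuous_var + (nearest - continuous_var).detach()

g = torch.nn.Parameter(torch.tensor([1.60, 40.0]))
snapped = quantized_glass(g)     # differentiable w.r.t. g via the straight-through trick
```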

Paper67 Pixels, Regions, and Objects: Multiple Enhancement for Salient Object Detection

摘要原文: Salient object detection (SOD) aims to mimic the human visual system (HVS) and cognition mechanisms to identify and segment salient objects. However, due to the complexity of these mechanisms, current methods are not perfect. Accuracy and robustness need to be further improved, particularly in complex scenes with multiple objects and background clutter. To address this issue, we propose a novel approach called Multiple Enhancement Network (MENet) that adopts the boundary sensibility, content integrity, iterative refinement, and frequency decomposition mechanisms of HVS. A multi-level hybrid loss is firstly designed to guide the network to learn pixel-level, region-level, and object-level features. A flexible multiscale feature enhancement module (ME-Module) is then designed to gradually aggregate and refine global or detailed features by changing the size order of the input feature sequence. An iterative training strategy is used to enhance boundary features and adaptive features in the dual-branch decoder of MENet. Comprehensive evaluations on six challenging benchmark datasets show that MENet achieves state-of-the-art results. Both the codes and results are publicly available at https://github.com/yiwangtz/MENet.

Summary: Salient object detection (SOD) aims to mimic the human visual system (HVS) and its cognition mechanisms to identify and segment salient objects. Because these mechanisms are complex, current methods remain imperfect: accuracy and robustness still need improvement, especially in complex scenes with multiple objects and background clutter. To address this, the authors propose the Multiple Enhancement Network (MENet), which adopts the boundary sensitivity, content integrity, iterative refinement, and frequency decomposition mechanisms of the HVS. A multi-level hybrid loss is first designed to guide the network to learn pixel-level, region-level, and object-level features. A flexible multiscale feature enhancement module (ME-Module) then progressively aggregates and refines global or detailed features by changing the size order of the input feature sequence. An iterative training strategy further enhances the boundary and adaptive features in MENet's dual-branch decoder. Comprehensive evaluations on six challenging benchmark datasets show that MENet achieves state-of-the-art results. Code and results are publicly available at https://github.com/yiwangtz/MENet.
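The abstract only says the hybrid loss supervises pixel-, region-, and object-level features. A common way to realize those three levels in salient object detection is BCE (pixel), a patch-averaged consistency term (region), and a soft IoU term (object); the sketch below uses that combination as an assumed stand-in, not MENet's exact formulation.

```python
import torch
import torch.nn.functional as F

def hybrid_sod_loss(pred_logits, gt, patch=16):
    """pred_logits, gt: (B, 1, H, W); gt is a binary saliency mask."""
    pred = torch.sigmoid(pred_logits)

    # Pixel level: standard binary cross-entropy.
    l_pix = F.binary_cross_entropy_with_logits(pred_logits, gt)

    # Region level: agreement of average saliency inside non-overlapping patches.
    l_region = F.l1_loss(F.avg_pool2d(pred, patch), F.avg_pool2d(gt, patch))

    # Object level: soft IoU over the whole map.
    inter = (pred * gt).sum(dim=(1, 2, 3))
    union = (pred + gt - pred * gt).sum(dim=(1, 2, 3))
    l_obj = (1 - (inter + 1e-6) / (union + 1e-6)).mean()

    return l_pix + l_region + l_obj
```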

Paper68 In-Hand 3D Object Scanning From an RGB Sequence

摘要原文: We propose a method for in-hand 3D scanning of an unknown object with a monocular camera. Our method relies on a neural implicit surface representation that captures both the geometry and the appearance of the object, however, by contrast with most NeRF-based methods, we do not assume that the camera-object relative poses are known. Instead, we simultaneously optimize both the object shape and the pose trajectory. As direct optimization over all shape and pose parameters is prone to fail without coarse-level initialization, we propose an incremental approach that starts by splitting the sequence into carefully selected overlapping segments within which the optimization is likely to succeed. We reconstruct the object shape and track its poses independently within each segment, then merge all the segments before performing a global optimization. We show that our method is able to reconstruct the shape and color of both textured and challenging texture-less objects, outperforms classical methods that rely only on appearance features, and that its performance is close to recent methods that assume known camera poses.

Summary: This paper proposes a method for in-hand 3D scanning of an unknown object with a monocular camera. The method relies on a neural implicit surface representation that captures both the geometry and the appearance of the object, but unlike most NeRF-based methods it does not assume that the camera-object relative poses are known; instead, the object shape and the pose trajectory are optimized simultaneously. Because direct optimization over all shape and pose parameters tends to fail without coarse initialization, an incremental approach is proposed: the sequence is split into carefully selected overlapping segments within which optimization is likely to succeed, the object shape is reconstructed and its poses tracked independently within each segment, and all segments are then merged before a global optimization. The method reconstructs the shape and color of both textured and challenging texture-less objects, outperforms classical methods that rely only on appearance features, and approaches the performance of recent methods that assume known camera poses.

Paper69 UniDistill: A Universal Cross-Modality Knowledge Distillation Framework for 3D Object Detection in Bird’s-Eye View

摘要原文: In the field of 3D object detection for autonomous driving, the sensor portfolio including multi-modality and single-modality is diverse and complex. Since the multi-modal methods have system complexity while the accuracy of single-modal ones is relatively low, how to make a tradeoff between them is difficult. In this work, we propose a universal cross-modality knowledge distillation framework (UniDistill) to improve the performance of single-modality detectors. Specifically, during training, UniDistill projects the features of both the teacher and the student detector into Bird’s-Eye-View (BEV), which is a friendly representation for different modalities. Then, three distillation losses are calculated to sparsely align the foreground features, helping the student learn from the teacher without introducing additional cost during inference. Taking advantage of the similar detection paradigm of different detectors in BEV, UniDistill easily supports LiDAR-to-camera, camera-to-LiDAR, fusion-to-LiDAR and fusion-to-camera distillation paths. Furthermore, the three distillation losses can filter the effect of misaligned background information and balance between objects of different sizes, improving the distillation effectiveness. Extensive experiments on nuScenes demonstrate that UniDistill effectively improves the mAP and NDS of student detectors by 2.0% 3.2%.

Summary: In 3D object detection for autonomous driving, the sensor portfolio spans diverse and complex multi-modality and single-modality setups. Multi-modal methods bring system complexity while single-modal methods have relatively low accuracy, so trading them off is difficult. This work proposes UniDistill, a universal cross-modality knowledge distillation framework that improves single-modality detectors. During training, UniDistill projects the features of both the teacher and the student detector into Bird's-Eye View (BEV), a representation friendly to different modalities, and computes three distillation losses that sparsely align foreground features, helping the student learn from the teacher without introducing extra cost at inference. Since different detectors share a similar detection paradigm in BEV, UniDistill easily supports LiDAR-to-camera, camera-to-LiDAR, fusion-to-LiDAR, and fusion-to-camera distillation paths. The three losses also filter out the effect of misaligned background information and balance objects of different sizes, improving distillation effectiveness. Extensive experiments on nuScenes show that UniDistill improves the mAP and NDS of student detectors by 2.0%-3.2%.
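The three distillation losses are not spelled out in the abstract beyond being sparse, foreground-aligned BEV terms. The sketch below shows one such term under those assumptions: an MSE between student and teacher BEV features restricted to cells covered by ground-truth boxes. The BEV resolution, channel alignment, and mask construction are assumptions, not UniDistill's actual losses.

```python
import torch
import torch.nn.functional as F

def foreground_bev_distill(student_bev, teacher_bev, fg_mask):
    """One sparse foreground-alignment term, sketched.

    student_bev, teacher_bev: (B, C, H, W) BEV features (assumed already projected
                              to the same channel width, e.g. via a 1x1 adapter).
    fg_mask:                  (B, 1, H, W) binary mask of BEV cells covered by
                              ground-truth boxes (the "foreground").
    """
    diff = F.mse_loss(student_bev, teacher_bev, reduction="none").mean(dim=1, keepdim=True)
    # Average only over foreground cells, so misaligned background is ignored.
    return (diff * fg_mask).sum() / fg_mask.sum().clamp_min(1.0)
```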

Paper70 Unknown Sniffer for Object Detection: Don’t Turn a Blind Eye to Unknown Objects

摘要原文: The recently proposed open-world object and open-set detection have achieved a breakthrough in finding never-seen-before objects and distinguishing them from known ones. However, their studies on knowledge transfer from known classes to unknown ones are not deep enough, resulting in the scanty capability for detecting unknowns hidden in the background. In this paper, we propose the unknown sniffer (UnSniffer) to find both unknown and known objects. Firstly, the generalized object confidence (GOC) score is introduced, which only uses known samples for supervision and avoids improper suppression of unknowns in the background. Significantly, such confidence score learned from known objects can be generalized to unknown ones. Additionally, we propose a negative energy suppression loss to further suppress the non-object samples in the background. Next, the best box of each unknown is hard to obtain during inference due to lacking their semantic information in training. To solve this issue, we introduce a graph-based determination scheme to replace hand-designed non-maximum suppression (NMS) post-processing. Finally, we present the Unknown Object Detection Benchmark, the first publicly benchmark that encompasses precision evaluation for unknown detection to our knowledge. Experiments show that our method is far better than the existing state-of-the-art methods. Code is available at: https://github.com/Went-Liang/UnSniffer.

Summary: Recently proposed open-world and open-set detection have made breakthroughs in finding never-seen-before objects and distinguishing them from known ones, but their study of knowledge transfer from known to unknown classes is not deep enough, leaving limited ability to detect unknowns hidden in the background. This paper proposes the unknown sniffer (UnSniffer) to find both unknown and known objects. First, a generalized object confidence (GOC) score is introduced that uses only known samples for supervision and avoids improperly suppressing unknowns in the background; such a confidence score learned from known objects generalizes to unknown ones. A negative energy suppression loss further suppresses non-object samples in the background. Because the semantic information of unknowns is unavailable during training, the best box for each unknown is hard to obtain at inference, so a graph-based box determination scheme replaces hand-designed non-maximum suppression (NMS) post-processing. Finally, the authors present the Unknown Object Detection Benchmark, to their knowledge the first public benchmark with precision evaluation for unknown detection. Experiments show the method far outperforms existing state-of-the-art approaches. Code is available at https://github.com/Went-Liang/UnSniffer.
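The negative-energy idea can be grounded in the standard energy score: the negative free energy of a proposal is the logsumexp of its class logits, which tends to be high for object-like boxes. A suppression loss then pushes that quantity down for background proposals. The margin value and how background proposals are selected are assumptions for this sketch.

```python
import torch

def negative_energy(logits):
    """Negative free energy logsumexp(logits); high for object-like proposals."""
    return torch.logsumexp(logits, dim=1)

def energy_suppression_loss(logits, is_background, margin=0.0):
    """Suppress the negative energy of background proposals below `margin`,
    so non-objects in the background score low while unknowns keep scoring high.

    logits:        (R, K) class logits of R proposals.
    is_background: (R,)   boolean mask of proposals judged to be background.
    """
    ne = negative_energy(logits)
    bg = ne[is_background]
    if bg.numel() == 0:
        return logits.sum() * 0.0                    # keep the graph, zero loss
    return torch.relu(bg - margin).mean()
```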

Paper71 CoWs on Pasture: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation

摘要原文: For robots to be generally useful, they must be able to find arbitrary objects described by people (i.e., be language-driven) even without expensive navigation training on in-domain data (i.e., perform zero-shot inference). We explore these capabilities in a unified setting: language-driven zero-shot object navigation (L-ZSON). Inspired by the recent success of open-vocabulary models for image classification, we investigate a straightforward framework, CLIP on Wheels (CoW), to adapt open-vocabulary models to this task without fine-tuning. To better evaluate L-ZSON, we introduce the Pasture benchmark, which considers finding uncommon objects, objects described by spatial and appearance attributes, and hidden objects described relative to visible objects. We conduct an in-depth empirical study by directly deploying 22 CoW baselines across Habitat, RoboTHOR, and Pasture. In total we evaluate over 90k navigation episodes and find that (1) CoW baselines often struggle to leverage language descriptions, but are surprisingly proficient at finding uncommon objects. (2) A simple CoW, with CLIP-based object localization and classical exploration—and no additional training—matches the navigation efficiency of a state-of-the-art ZSON method trained for 500M steps on Habitat MP3D data. This same CoW provides a 15.6 percentage point improvement in success over a state-of-the-art RoboTHOR ZSON model.

Summary: For robots to be generally useful, they must be able to find arbitrary objects described by people (i.e., be language-driven) even without expensive in-domain navigation training (i.e., perform zero-shot inference). The authors explore these capabilities in a unified setting: language-driven zero-shot object navigation (L-ZSON). Inspired by the recent success of open-vocabulary models for image classification, they investigate a straightforward framework, CLIP on Wheels (CoW), that adapts open-vocabulary models to this task without fine-tuning. To better evaluate L-ZSON, they introduce the Pasture benchmark, which covers finding uncommon objects, objects described by spatial and appearance attributes, and hidden objects described relative to visible ones. They directly deploy 22 CoW baselines across Habitat, RoboTHOR, and Pasture, evaluating over 90k navigation episodes, and find that (1) CoW baselines often struggle to exploit language descriptions but are surprisingly good at finding uncommon objects, and (2) a simple CoW with CLIP-based object localization and classical exploration, with no additional training, matches the navigation efficiency of a state-of-the-art ZSON method trained for 500M steps on Habitat MP3D data and improves success by 15.6 percentage points over a state-of-the-art RoboTHOR ZSON model.
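The "CLIP-based object localization" half of a CoW can be approximated by scoring each egocentric frame against a text prompt for the goal category and declaring the object found once the similarity passes a threshold; the exploration half is a classical policy (e.g. frontier exploration) and is not shown. The prompt template and threshold below are assumptions; the `clip` package is the public OpenAI release.

```python
import torch
import clip                      # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def goal_score(frame: Image.Image, goal: str) -> float:
    """Cosine similarity between the current egocentric frame and the goal prompt."""
    image = preprocess(frame).unsqueeze(0).to(device)
    text = clip.tokenize([f"a photo of a {goal}"]).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(image)
        txt_feat = model.encode_text(text)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat @ txt_feat.T).item()

# e.g. if goal_score(frame, "toy giraffe") > 0.3: stop exploring and declare success
```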

Paper72 Referring Multi-Object Tracking

摘要原文: Existing referring understanding tasks tend to involve the detection of a single text-referred object. In this paper, we propose a new and general referring understanding task, termed referring multi-object tracking (RMOT). Its core idea is to employ a language expression as a semantic cue to guide the prediction of multi-object tracking. To the best of our knowledge, it is the first work to achieve an arbitrary number of referent object predictions in videos. To push forward RMOT, we construct one benchmark with scalable expressions based on KITTI, named Refer-KITTI. Specifically, it provides 18 videos with 818 expressions, and each expression in a video is annotated with an average of 10.7 objects. Further, we develop a transformer-based architecture TransRMOT to tackle the new task in an online manner, which achieves impressive detection performance and outperforms other counterparts. The Refer-KITTI dataset and the code are released at https://referringmot.github.io.

Summary: This paper proposes a new and general referring-understanding task, referring multi-object tracking (RMOT), whose core idea is to use a language expression as a semantic cue to guide multi-object tracking prediction. To the authors' knowledge, it is the first work to predict an arbitrary number of referent objects in videos. To push RMOT forward, they construct a benchmark with scalable expressions based on KITTI, named Refer-KITTI, which provides 18 videos with 818 expressions; each expression in a video is annotated with 10.7 objects on average. They further develop a transformer-based architecture, TransRMOT, that tackles the new task in an online manner, achieves impressive detection performance, and outperforms other counterparts. The Refer-KITTI dataset and the code are released at https://referringmot.github.io.
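At the heart of any referring tracker is some fusion of the language expression with the detection/track queries. The sketch below shows one generic way to do that with cross-attention; it is an assumed illustration of the idea, not TransRMOT's actual fusion design, and the text encoder is left abstract.

```python
import torch
import torch.nn as nn

class QueryTextFusion(nn.Module):
    """Let object/track queries attend to token features of the referring expression."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, queries, text_tokens):
        # queries:     (B, N, dim) detection/track queries
        # text_tokens: (B, L, dim) encoded expression (e.g. from a frozen text encoder)
        attended, _ = self.cross_attn(queries, text_tokens, text_tokens)
        return self.norm(queries + attended)

fuse = QueryTextFusion()
fused_queries = fuse(torch.randn(2, 300, 256), torch.randn(2, 20, 256))
```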

Paper73 NeRF-RPN: A General Framework for Object Detection in NeRFs

摘要原文: This paper presents the first significant object detection framework, NeRF-RPN, which directly operates on NeRF. Given a pre-trained NeRF model, NeRF-RPN aims to detect all bounding boxes of objects in a scene. By exploiting a novel voxel representation that incorporates multi-scale 3D neural volumetric features, we demonstrate it is possible to regress the 3D bounding boxes of objects in NeRF directly without rendering the NeRF at any viewpoint. NeRF-RPN is a general framework and can be applied to detect objects without class labels. We experimented NeRF-RPN with various backbone architectures, RPN head designs, and loss functions. All of them can be trained in an end-to-end manner to estimate high quality 3D bounding boxes. To facilitate future research in object detection for NeRF, we built a new benchmark dataset which consists of both synthetic and real-world data with careful labeling and clean up. Code and dataset are available at https://github.com/lyclyc52/NeRF_RPN.

Summary: This paper presents NeRF-RPN, the first significant object detection framework that operates directly on NeRFs. Given a pre-trained NeRF model, NeRF-RPN aims to detect the bounding boxes of all objects in a scene. By exploiting a novel voxel representation that incorporates multi-scale 3D neural volumetric features, the authors show that the 3D bounding boxes of objects can be regressed directly from the NeRF without rendering it from any viewpoint. NeRF-RPN is a general framework and can detect objects without class labels. It was experimented with various backbone architectures, RPN head designs, and loss functions, all trainable end-to-end to estimate high-quality 3D bounding boxes. To facilitate future research on object detection in NeRFs, the authors also built a new benchmark dataset consisting of carefully labeled and cleaned synthetic and real-world data. Code and dataset are available at https://github.com/lyclyc52/NeRF_RPN.
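Given the multi-scale voxel grid of NeRF features the paper describes, a class-agnostic RPN head reduces to 3D convolutions that emit an objectness score and box deltas per anchor at each voxel. The anchor count, channel sizes, and delta parameterization below are placeholders, not the paper's head design.

```python
import torch
import torch.nn as nn

class VoxelRPNHead(nn.Module):
    """Class-agnostic 3D RPN head over a voxel feature grid (placeholder sizes)."""
    def __init__(self, in_ch=128, num_anchors=3):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv3d(in_ch, 128, 3, padding=1), nn.ReLU())
        self.objectness = nn.Conv3d(128, num_anchors, 1)        # score per anchor
        self.box_deltas = nn.Conv3d(128, num_anchors * 6, 1)    # (dx, dy, dz, dw, dh, dl)

    def forward(self, vox):                   # vox: (B, C, D, H, W) NeRF feature grid
        x = self.stem(vox)
        return self.objectness(x), self.box_deltas(x)

head = VoxelRPNHead()
obj_scores, deltas = head(torch.randn(1, 128, 32, 32, 32))
```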

Paper74 NeuralLift-360: Lifting an In-the-Wild 2D Photo to a 3D Object With 360deg Views

摘要原文: Virtual reality and augmented reality (XR) bring increasing demand for 3D content generation. However, creating high-quality 3D content requires tedious work from a human expert. In this work, we study the challenging task of lifting a single image to a 3D object and, for the first time, demonstrate the ability to generate a plausible 3D object with 360deg views that corresponds well with the given reference image. By conditioning on the reference image, our model can fulfill the everlasting curiosity for synthesizing novel views of objects from images. Our technique sheds light on a promising direction of easing the workflows for 3D artists and XR designers. We propose a novel framework, dubbed NeuralLift-360, that utilizes a depth-aware neural radiance representation (NeRF) and learns to craft the scene guided by denoising diffusion models. By introducing a ranking loss, our NeuralLift-360 can be guided with rough depth estimation in the wild. We also adopt a CLIP-guided sampling strategy for the diffusion prior to provide coherent guidance. Extensive experiments demonstrate that our NeuralLift-360 significantly outperforms existing state-of-the-art baselines. Project page: https://vita-group.github.io/NeuralLift-360/

Summary: Virtual and augmented reality (XR) bring increasing demand for 3D content generation, yet creating high-quality 3D content requires tedious expert work. The authors study the challenging task of lifting a single image to a 3D object and, for the first time, demonstrate the ability to generate a plausible 3D object with 360-degree views that corresponds well to the given reference image. By conditioning on the reference image, the model addresses the long-standing goal of synthesizing novel views of objects from images, pointing to a promising direction for easing the workflows of 3D artists and XR designers. The proposed framework, NeuralLift-360, uses a depth-aware neural radiance field (NeRF) representation and learns to craft the scene guided by denoising diffusion models. A ranking loss allows NeuralLift-360 to be guided by rough in-the-wild depth estimates, and a CLIP-guided sampling strategy for the diffusion prior provides coherent guidance. Extensive experiments show that NeuralLift-360 significantly outperforms existing state-of-the-art baselines. Project page: https://vita-group.github.io/NeuralLift-360/
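A ranking loss that lets a rough monocular depth estimate guide optimization can be written as a pairwise margin term: for randomly sampled pixel pairs, the rendered depth should preserve the ordering given by the estimated depth, without trusting its absolute values. The pair sampling and margin below are assumptions, not NeuralLift-360's exact loss.

```python
import torch

def depth_ranking_loss(rendered_depth, est_depth, num_pairs=4096, margin=1e-4):
    """rendered_depth, est_depth: (N,) per-ray depths for the same pixels.
    Penalize pairs whose rendered ordering disagrees with the (noisy) estimate."""
    n = rendered_depth.numel()
    i = torch.randint(0, n, (num_pairs,), device=rendered_depth.device)
    j = torch.randint(0, n, (num_pairs,), device=rendered_depth.device)
    desired = torch.sign(est_depth[i] - est_depth[j])        # ordering from the estimate
    diff = rendered_depth[i] - rendered_depth[j]
    return torch.relu(margin - desired * diff).mean()
```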

Paper75 Command-Driven Articulated Object Understanding and Manipulation

摘要原文: We present Cart, a new approach towards articulated-object manipulations by human commands. Beyond the existing work that focuses on inferring articulation structures, we further support manipulating articulated shapes to align them subject to simple command templates. The key of Cart is to utilize the prediction of object structures to connect visual observations with user commands for effective manipulations. It is achieved by encoding command messages for motion prediction and a test-time adaptation to adjust the amount of movement from only command supervision. For a rich variety of object categories, Cart can accurately manipulate object shapes and outperform the state-of-the-art approaches in understanding the inherent articulation structures. Also, it can well generalize to unseen object categories and real-world objects. We hope Cart could open new directions for instructing machines to operate articulated objects.

Summary: This work presents Cart, a new approach to manipulating articulated objects from human commands. Beyond existing work that focuses on inferring articulation structure, Cart further supports manipulating articulated shapes to align them according to simple command templates. Its key idea is to use predicted object structure to connect visual observations with user commands for effective manipulation, achieved by encoding command messages for motion prediction and by a test-time adaptation that adjusts the amount of movement from command supervision alone. Across a rich variety of object categories, Cart manipulates object shapes accurately and outperforms state-of-the-art approaches in understanding inherent articulation structures; it also generalizes well to unseen object categories and real-world objects. The authors hope Cart can open new directions for instructing machines to operate articulated objects.

Paper76 Consistent-Teacher: Towards Reducing Inconsistent Pseudo-Targets in Semi-Supervised Object Detection

摘要原文: In this study, we dive deep into the inconsistency of pseudo targets in semi-supervised object detection (SSOD). Our core observation is that the oscillating pseudo-targets undermine the training of an accurate detector. It injects noise into the student’s training, leading to severe overfitting problems. Therefore, we propose a systematic solution, termed NAME, to reduce the inconsistency. First, adaptive anchor assignment (ASA) substitutes the static IoU-based strategy, which enables the student network to be resistant to noisy pseudo-bounding boxes. Then we calibrate the subtask predictions by designing a 3D feature alignment module (FAM-3D). It allows each classification feature to adaptively query the optimal feature vector for the regression task at arbitrary scales and locations. Lastly, a Gaussian Mixture Model (GMM) dynamically revises the score threshold of pseudo-bboxes, which stabilizes the number of ground truths at an early stage and remedies the unreliable supervision signal during training. NAME provides strong results on a large range of SSOD evaluations. It achieves 40.0 mAP with ResNet-50 backbone given only 10% of annotated MS-COCO data, which surpasses previous baselines using pseudo labels by around 3 mAP. When trained on fully annotated MS-COCO with additional unlabeled data, the performance further increases to 47.7 mAP. Our code is available at https://github.com/Adamdad/ConsistentTeacher.

Summary: This study looks closely at the inconsistency of pseudo-targets in semi-supervised object detection (SSOD). The core observation is that oscillating pseudo-targets undermine the training of an accurate detector: they inject noise into the student's training and lead to severe overfitting. The authors propose a systematic solution, Consistent-Teacher, to reduce this inconsistency. First, adaptive anchor assignment (ASA) replaces the static IoU-based strategy, making the student network robust to noisy pseudo bounding boxes. Second, a 3D feature alignment module (FAM-3D) calibrates the subtask predictions, letting each classification feature adaptively query the optimal feature vector for the regression task at arbitrary scales and locations. Finally, a Gaussian Mixture Model (GMM) dynamically revises the score threshold of pseudo-boxes, which stabilizes the number of ground truths at an early stage and remedies the unreliable supervision signal during training. The method delivers strong results across a wide range of SSOD evaluations: 40.0 mAP with a ResNet-50 backbone using only 10% of annotated MS-COCO data, surpassing previous pseudo-label baselines by about 3 mAP, and 47.7 mAP when trained on fully annotated MS-COCO with additional unlabeled data. Code is available at https://github.com/Adamdad/ConsistentTeacher.
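The abstract only says that a GMM dynamically revises the score threshold for pseudo-boxes. A straightforward reading, sketched below with scikit-learn, is to fit a two-component GMM to the per-class confidence scores at each iteration and take the point where the high-score component starts to dominate as the threshold; the exact criterion used in the paper may differ.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def dynamic_score_threshold(scores: np.ndarray) -> float:
    """Fit a 2-component GMM to pseudo-box scores and return an adaptive threshold."""
    scores = scores.reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(scores)
    hi = int(np.argmax(gmm.means_))                  # index of the high-score component
    posterior = gmm.predict_proba(scores)[:, hi]
    # Smallest score that is already assigned to the high-score ("reliable") component.
    reliable = scores[posterior > 0.5]
    return float(reliable.min()) if reliable.size else float(scores.max())

# thr = dynamic_score_threshold(np.array([0.1, 0.15, 0.2, 0.8, 0.85, 0.9]))
```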

Paper77 OCTET: Object-Aware Counterfactual Explanations

摘要原文: Nowadays, deep vision models are being widely deployed in safety-critical applications, e.g., autonomous driving, and explainability of such models is becoming a pressing concern. Among explanation methods, counterfactual explanations aim to find minimal and interpretable changes to the input image that would also change the output of the model to be explained. Such explanations point end-users at the main factors that impact the decision of the model. However, previous methods struggle to explain decision models trained on images with many objects, e.g., urban scenes, which are more difficult to work with but also arguably more critical to explain. In this work, we propose to tackle this issue with an object-centric framework for counterfactual explanation generation. Our method, inspired by recent generative modeling works, encodes the query image into a latent space that is structured in a way to ease object-level manipulations. Doing so, it provides the end-user with control over which search directions (e.g., spatial displacement of objects, style modification, etc.) are to be explored during the counterfactual generation. We conduct a set of experiments on counterfactual explanation benchmarks for driving scenes, and we show that our method can be adapted beyond classification, e.g., to explain semantic segmentation models. To complete our analysis, we design and run a user study that measures the usefulness of counterfactual explanations in understanding a decision model. Code is available at https://github.com/valeoai/OCTET.

Summary: Deep vision models are now widely deployed in safety-critical applications such as autonomous driving, and their explainability is becoming a pressing concern. Among explanation methods, counterfactual explanations seek minimal, interpretable changes to the input image that would also change the output of the model being explained, pointing end-users to the main factors behind the model's decision. Previous methods, however, struggle to explain decision models trained on images with many objects, such as urban scenes, which are harder to work with but arguably more critical to explain. This work proposes an object-centric framework for counterfactual explanation generation. Inspired by recent generative modeling work, the method encodes the query image into a latent space structured to ease object-level manipulations, giving the end-user control over which search directions (spatial displacement of objects, style modification, etc.) are explored during counterfactual generation. Experiments on counterfactual explanation benchmarks for driving scenes show that the method can be adapted beyond classification, e.g., to explain semantic segmentation models. To complete the analysis, a user study measures how useful counterfactual explanations are for understanding a decision model. Code is available at https://github.com/valeoai/OCTET.
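Generically, counterfactual explanations of this kind optimize the latent code of a generative model so that the frozen decision model flips its output while the code stays close to the original; OCTET's contribution is doing this in an object-structured latent space. The sketch below shows only the generic optimization loop, with `generator` and `decision_model` as placeholder callables and made-up hyperparameters.

```python
import torch
import torch.nn.functional as F

def counterfactual_latent(z0, generator, decision_model, target, steps=200, lam=0.1, lr=0.05):
    """Optimize a copy of the latent z0 so the decision flips to `target` (0 or 1)
    while staying close to z0. `generator` and `decision_model` are placeholders."""
    z = z0.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        image = generator(z)                           # decode the latent to an image
        logit = decision_model(image)                  # scalar decision logit
        flip = F.binary_cross_entropy_with_logits(logit, torch.full_like(logit, float(target)))
        proximity = (z - z0).abs().mean()              # keep the edit minimal / sparse
        loss = flip + lam * proximity
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()
```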

Paper78 MOTRv2: Bootstrapping End-to-End Multi-Object Tracking by Pretrained Object Detectors

摘要原文: In this paper, we propose MOTRv2, a simple yet effective pipeline to bootstrap end-to-end multi-object tracking with a pretrained object detector. Existing end-to-end methods, e.g. MOTR and TrackFormer are inferior to their tracking-by-detection counterparts mainly due to their poor detection performance. We aim to improve MOTR by elegantly incorporating an extra object detector. We first adopt the anchor formulation of queries and then use an extra object detector to generate proposals as anchors, providing detection prior to MOTR. The simple modification greatly eases the conflict between joint learning detection and association tasks in MOTR. MOTRv2 keeps the end-to-end feature and scales well on large-scale benchmarks. MOTRv2 achieves the top performance (73.4% HOTA) among all existing methods on the DanceTrack dataset. Moreover, MOTRv2 reaches state-of-the-art performance on the BDD100K dataset. We hope this simple and effective pipeline can provide some new insights to the end-to-end MOT community. The code will be released in the near future.

Summary: This paper proposes MOTRv2, a simple yet effective pipeline that bootstraps end-to-end multi-object tracking with a pretrained object detector. Existing end-to-end methods such as MOTR and TrackFormer are inferior to their tracking-by-detection counterparts mainly because of their poor detection performance. MOTRv2 improves MOTR by elegantly incorporating an extra object detector: it first adopts the anchor formulation of queries and then uses the extra detector to generate proposals as anchors, providing a detection prior to MOTR. This simple modification greatly eases the conflict between the joint learning of detection and association in MOTR. MOTRv2 keeps the end-to-end property and scales well on large-scale benchmarks, achieving the top performance (73.4% HOTA) among all existing methods on DanceTrack and state-of-the-art performance on BDD100K. The authors hope this simple and effective pipeline provides new insights to the end-to-end MOT community; the code will be released in the near future.
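Reading the abstract, the key modification is to turn the boxes of the pretrained detector into anchor queries instead of learning free-form queries. A common way to do this in DETR-style models is a sinusoidal embedding of the normalized box plus a shared learnable content vector; the sketch below assumes that formulation and is not MOTRv2's exact query construction.

```python
import math
import torch
import torch.nn as nn

def sine_box_embedding(boxes, dim=256, temperature=10000):
    """boxes: (N, 4) normalized (cx, cy, w, h) proposals from the extra detector.
    Returns (N, dim) positional (anchor) embeddings; dim must be divisible by 8."""
    scale = 2 * math.pi
    d = dim // 4
    idx = (torch.arange(d, device=boxes.device) // 2).float()
    freqs = temperature ** (2 * idx / d)
    pos = boxes.unsqueeze(-1) * scale / freqs                       # (N, 4, d)
    pos = torch.stack((pos[..., 0::2].sin(), pos[..., 1::2].cos()), dim=-1).flatten(2)
    return pos.flatten(1)                                           # (N, dim)

class ProposalQueries(nn.Module):
    """Turn detector proposals into (content, position) queries for the tracker."""
    def __init__(self, dim=256):
        super().__init__()
        self.content = nn.Embedding(1, dim)          # shared learnable content vector

    def forward(self, boxes):
        pos = sine_box_embedding(boxes, self.content.embedding_dim)
        content = self.content.weight.expand(boxes.size(0), -1)
        return content, pos

queries = ProposalQueries()
content, pos = queries(torch.rand(50, 4))
```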

CVPR 2023 is a top venue for computer vision and pattern recognition, and UAVs (unmanned aerial vehicles) are a popular research area at the conference. UAV technology has developed rapidly over the past few years and is widely applied in agriculture, surveying and mapping, monitoring, and rescue; CVPR 2023 is therefore an ideal platform for researchers to exchange, present, and share UAV-related work.

First, CVPR 2023 offers a dedicated UAV research track to discuss the latest progress and innovations in the field. Researchers can submit and present work on UAV-based computer vision and pattern recognition, covering topics such as UAV navigation, target recognition, and image processing for real-world problems.

Second, CVPR 2023 also covers applications of UAVs within computer vision and pattern recognition. UAVs provide unique viewpoints and data-collection capabilities for tasks such as object detection and scene segmentation; researchers can present comparisons between UAV-based methods and traditional ones and discuss the advantages and limitations of UAVs in these areas.

In addition, CVPR 2023 will include discussions of emerging technologies and trends related to UAVs, for example the combination of UAVs with deep learning and augmented reality, which is expected to drive further breakthroughs in computer vision and pattern recognition research and applications. Researchers can share innovations in these cross-disciplinary areas and engage in deeper discussion and collaboration with other scholars.

In short, CVPR 2023 provides an important platform for UAV research in computer vision and pattern recognition. It will promote collaboration and exchange between academia and industry and offer new ideas and directions for the future development of UAV technology.