Paper Reading Notes (28): facebookresearch/Detectron

At FAIR, Detectron has enabled numerous research projects, including:

Feature Pyramid Networks for Object Detection

Feature pyramids are a basic component in recognition systems for detecting objects at different scales. But recent deep learning object detectors have avoided pyramid representations, in part because they are compute and memory intensive. In this paper, we exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost. A top-down architecture with lateral connections is developed for building high-level semantic feature maps at all scales. This architecture, called a Feature Pyramid Network (FPN), shows significant improvement as a generic feature extractor in several applications. Using FPN in a basic Faster R-CNN system, our method achieves state-of-the-art single-model results on the COCO detection benchmark without bells and whistles, surpassing all existing single-model entries including those from the COCO 2016 challenge winners. In addition, our method can run at 6 FPS on a GPU and thus is a practical and accurate solution to multi-scale object detection.
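
The core of FPN is the top-down pathway with lateral connections: each bottom-up feature map is reduced by a 1x1 convolution and added to the upsampled, coarser map from the level above, and a 3x3 convolution smooths the merged result. Below is a minimal PyTorch-style sketch of that merge; the class name, channel counts, and layer arrangement are illustrative, not Detectron's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNTopDown(nn.Module):
    """Minimal sketch of FPN's top-down pathway with lateral connections.

    `in_channels` lists the channel counts of the bottom-up feature maps
    (e.g. ResNet stages C2..C5), finest to coarsest; `out_channels` is the
    shared pyramid dimension (256 in the paper).
    """
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 convs reduce each bottom-up map to the common dimension.
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels)
        # 3x3 convs smooth the merged maps to reduce upsampling aliasing.
        self.output = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
            for _ in in_channels)

    def forward(self, feats):          # feats: [C2, C3, C4, C5], finest first
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        # Walk from the coarsest level down, adding the upsampled coarser map.
        for i in range(len(laterals) - 2, -1, -1):
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest")
        return [conv(x) for conv, x in zip(self.output, laterals)]  # [P2..P5]
```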

Mask R-CNN

We present a conceptually simple, flexible, and general framework for object instance segmentation. Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. We show top results in all three tracks of the COCO suite of challenges, including instance segmentation, bounding-box object detection, and person keypoint detection. Without bells and whistles, Mask R-CNN outperforms all existing, single-model entries on every task, including the COCO 2016 challenge winners. We hope our simple and effective approach will serve as a solid baseline and help ease future research in instance-level recognition.
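
The mask branch is a small fully convolutional head applied to each RoI in parallel with box classification and regression; it predicts one m x m mask per class and is trained with a per-pixel sigmoid loss on the ground-truth class only, which decouples mask prediction from classification. A rough PyTorch-style sketch follows; the sizes are illustrative rather than Detectron's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskHead(nn.Module):
    """Sketch of the per-RoI mask branch added alongside the box head.

    Input: RoIAlign features of shape (N, 256, 14, 14); output: one 28 x 28
    mask per class (K classes). Sizes loosely follow the paper's FPN variant
    but are illustrative, not Detectron's exact configuration.
    """
    def __init__(self, num_classes, in_channels=256, dim=256):
        super().__init__()
        self.convs = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(in_channels if i == 0 else dim, dim, 3, padding=1),
                          nn.ReLU(inplace=True))
            for i in range(4)])
        self.deconv = nn.ConvTranspose2d(dim, dim, kernel_size=2, stride=2)  # 14 -> 28
        self.predict = nn.Conv2d(dim, num_classes, kernel_size=1)            # K masks

    def forward(self, roi_feats):
        x = F.relu(self.deconv(self.convs(roi_feats)))
        return self.predict(x)  # (N, K, 28, 28) mask logits

def mask_loss(mask_logits, gt_masks, gt_labels):
    """Per-pixel sigmoid loss on the channel of each RoI's ground-truth class,
    which decouples mask prediction from classification."""
    idx = torch.arange(mask_logits.shape[0], device=mask_logits.device)
    logits = mask_logits[idx, gt_labels]            # (N, 28, 28)
    return F.binary_cross_entropy_with_logits(logits, gt_masks)
```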

Detecting and Recognizing Human-Object Interactions

To understand the visual world, a machine must not only recognize individual object instances but also how they interact. Humans are often at the center of such interactions and detecting human-object interactions is an important practical and scientific problem. In this paper, we address the task of detecting ⟨human, verb, object⟩ triplets in challenging everyday photos. We propose a novel model that is driven by a human-centric approach. Our hypothesis is that the appearance of a person – their pose, clothing, action – is a powerful cue for localizing the objects they are interacting with. To exploit this cue, our model learns to predict an action-specific density over target object locations based on the appearance of a detected person. Our model also jointly learns to detect people and objects, and by fusing these predictions it efficiently infers interaction triplets in a clean, jointly trained end-to-end system we call InteractNet. We validate our approach on the recently introduced Verbs in COCO (V-COCO) and HICO-DET datasets, where we show quantitatively compelling results.
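
A simplified reading of the human-centric branch: from a detected person's RoI features, one head scores actions and another predicts, for each action, where the target object should be (a 4-d box offset relative to the person); candidate objects are then scored by how well they match that predicted location. The sketch below follows that reading only loosely; the layer names, sizes, and the Gaussian scoring term are illustrative assumptions, not InteractNet's exact layers.

```python
import torch
import torch.nn as nn

class HumanCentricHead(nn.Module):
    """Very simplified sketch of a human-centric branch: from a person's RoI
    features, predict per-action scores and a per-action 4-d target-box offset
    (encoded relative to the person box). Names and sizes are illustrative."""
    def __init__(self, num_actions, feat_dim=1024):
        super().__init__()
        self.action_score = nn.Linear(feat_dim, num_actions)
        self.target_mu = nn.Linear(feat_dim, num_actions * 4)

    def forward(self, person_feat):                 # (N, feat_dim)
        s_action = self.action_score(person_feat)   # (N, A)
        mu = self.target_mu(person_feat).view(-1, s_action.shape[1], 4)
        return s_action, mu                         # per-action target location

def target_density(mu, object_offsets, sigma=0.3):
    """Score candidate objects by a Gaussian in the predicted offset space.

    `mu` is one person's (A, 4) prediction (e.g. mu[i] from the head above),
    `object_offsets` is (M, 4) for M candidate objects encoded relative to
    that person; sigma is a made-up illustrative value. Returns (M, A)."""
    d2 = ((object_offsets.unsqueeze(1) - mu.unsqueeze(0)) ** 2).sum(-1)
    return torch.exp(-d2 / (2 * sigma ** 2))
```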

Focal Loss for Dense Object Detection

The highest accuracy object detectors to date are based on a two-stage approach popularized by R-CNN, where a classifier is applied to a sparse set of candidate object locations. In contrast, one-stage detectors that are applied over a regular, dense sampling of possible object locations have the potential to be faster and simpler, but have trailed the accuracy of two-stage detectors thus far. In this paper, we investigate why this is the case. We discover that the extreme foreground-background class imbalance encountered during training of dense detectors is the central cause. We propose to address this class imbalance by reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples. Our novel Focal Loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training. To evaluate the effectiveness of our loss, we design and train a simple dense detector we call RetinaNet. Our results show that when trained with the focal loss, RetinaNet is able to match the speed of previous one-stage detectors while surpassing the accuracy of all existing state-of-the-art two-stage detectors. Code is at: https://github.com/facebookresearch/Detectron.
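
Concretely, the standard cross entropy -log(p_t) is reshaped into FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), so examples that are already well classified (p_t close to 1) contribute almost nothing to the loss. Below is a minimal sketch of the binary form used by RetinaNet, with the paper's default gamma = 2 and alpha = 0.25; normalizing by the number of positive anchors follows the paper, and the rest of the training setup is omitted.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss, FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).

    `logits` and `targets` have the same shape; targets are 0/1 labels for
    each anchor/class slot. gamma=2 and alpha=0.25 are the paper's defaults."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)           # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    loss = alpha_t * (1 - p_t) ** gamma * ce              # down-weight easy examples
    return loss.sum() / max(targets.sum().item(), 1.0)    # normalize by #positives
```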

Non-local Neural Networks

Both convolutional and recurrent operations are building blocks that process one local neighborhood at a time. In this paper, we present non-local operations as a generic family of building blocks for capturing long-range dependencies. Inspired by the classical non-local means method [4] in computer vision, our non-local operation computes the response at a position as a weighted sum of the features at all positions. This building block can be plugged into many computer vision architectures. On the task of video classification, even without any bells and whistles, our non-local models can compete or outperform current competition winners on both Kinetics and Charades datasets. In static image recognition, our non-local models improve object detection/segmentation and pose estimation on the COCO suite of tasks. Code will be made available.
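
In the embedded-Gaussian instantiation, the response at position i is y_i = sum_j softmax_j(theta(x_i) . phi(x_j)) g(x_j), computed over all positions j and wrapped in a residual connection. Below is a minimal 2-D sketch (the paper mostly applies the block to spacetime features for video); the C/2 bottleneck follows the paper, the rest is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalBlock2D(nn.Module):
    """Sketch of a non-local block (embedded Gaussian version): the response at
    each position is a weighted sum of the features at all positions, wrapped
    in a residual connection so it can be inserted into existing networks."""
    def __init__(self, channels, inner=None):
        super().__init__()
        inner = inner or channels // 2
        self.theta = nn.Conv2d(channels, inner, 1)
        self.phi = nn.Conv2d(channels, inner, 1)
        self.g = nn.Conv2d(channels, inner, 1)
        self.out = nn.Conv2d(inner, channels, 1)

    def forward(self, x):                                   # x: (N, C, H, W)
        n, c, h, w = x.shape
        theta = self.theta(x).flatten(2).transpose(1, 2)    # (N, HW, C')
        phi = self.phi(x).flatten(2)                        # (N, C', HW)
        g = self.g(x).flatten(2).transpose(1, 2)            # (N, HW, C')
        attn = F.softmax(theta @ phi, dim=-1)               # all-pairs weights
        y = (attn @ g).transpose(1, 2).reshape(n, -1, h, w)
        return x + self.out(y)                              # residual connection
```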

Learning to Segment Every Thing

Existing methods for object instance segmentation require all training instances to be labeled with segmentation masks. This requirement makes it expensive to annotate new categories and has restricted instance segmentation models to ∼100 well-annotated classes. The goal of this paper is to propose a new partially supervised training paradigm, together with a novel weight transfer function, that enables training instance segmentation models over a large set of categories for which all have box annotations, but only a small fraction have mask annotations. These contributions allow us to train Mask R-CNN to detect and segment 3000 visual concepts using box annotations from the Visual Genome dataset and mask annotations from the 80 classes in the COCO dataset. We carefully evaluate our proposed approach in a controlled study on the COCO dataset. This work is a first step towards instance segmentation models that have broad comprehension of the visual world.
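
The weight transfer function predicts a category's mask-prediction weights from its box-head weights, w_seg_c = T(w_det_c; theta). T is trained only on the classes that have mask annotations and is then applied to every class that has box annotations. A small sketch under that reading is below; the two-layer MLP and its sizes are illustrative choices, not necessarily the variant used in the paper's best model.

```python
import torch
import torch.nn as nn

class WeightTransfer(nn.Module):
    """Sketch of a weight transfer function: predict a class's mask-prediction
    weights from its box-head weights, w_seg_c = T(w_det_c). Trained on the
    classes with mask labels, then applied to every class that only has box
    labels. The 2-layer MLP here is an illustrative choice."""
    def __init__(self, det_dim, seg_dim, hidden=None):
        super().__init__()
        hidden = hidden or det_dim
        self.transfer = nn.Sequential(
            nn.Linear(det_dim, hidden),
            nn.LeakyReLU(0.1),
            nn.Linear(hidden, seg_dim))

    def forward(self, w_det):          # (num_classes, det_dim) box-head weights
        return self.transfer(w_det)    # (num_classes, seg_dim) mask weights

# The predicted per-class mask weights act as class-specific 1x1 filters over
# the mask-head features, e.g.:
#   mask_logits = torch.einsum("nchw,kc->nkhw", mask_feats, w_seg)
```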

Data Distillation: Towards Omni-Supervised Learning

We investigate omni-supervised learning, a special regime of semi-supervised learning in which the learner exploits all available labeled data plus internet-scale sources of unlabeled data. Omni-supervised learning is lower-bounded by performance on existing labeled datasets, offering the potential to surpass state-of-the-art fully supervised methods. To exploit the omni-supervised setting, we propose data distillation, a method that ensembles predictions from multiple transformations of unlabeled data, using a single model, to automatically generate new training annotations. We argue that visual recognition models have recently become accurate enough that it is now possible to apply classic ideas about self-training to challenging real-world data. Our experimental results show that in the cases of human keypoint detection and general object detection, state-of-the-art models trained with data distillation surpass the performance of using labeled data from the COCO dataset alone.
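
The procedure: run a single trained model on several transformed copies of each unlabeled image (e.g. multiple scales and horizontal flips), map the predictions back to the original image, aggregate them, and keep the confident ones as new annotations for retraining. A schematic sketch is below; `merge_fn`, the (apply, invert) transform pairs, and the 0.9 score threshold are illustrative placeholders rather than the paper's exact recipe.

```python
def data_distillation(model, unlabeled_images, transforms, merge_fn,
                      score_thresh=0.9):
    """Sketch of the data-distillation step described above.

    `transforms` is a list of (apply, invert) callables (e.g. rescaling and
    horizontal flips plus the inverse mapping back to original coordinates);
    `merge_fn` aggregates per-transform detections (e.g. box merging / NMS).
    Both, and the 0.9 threshold, are illustrative placeholders.
    """
    pseudo_labeled = []
    for img in unlabeled_images:
        preds = []
        for apply_t, invert_t in transforms:
            preds.append(invert_t(model(apply_t(img))))   # back to image coords
        detections = merge_fn(preds)                      # single-model ensemble
        confident = [d for d in detections if d["score"] >= score_thresh]
        if confident:
            pseudo_labeled.append((img, confident))       # minted annotations
    return pseudo_labeled   # retrain on the union of real labels + pseudo labels
```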

DensePose: Dense Human Pose Estimation In The Wild

In this work, we establish dense correspondences between an RGB image and a surface-based representation of the human body, a task we refer to as dense human pose estimation. We first gather dense correspondences for 50K persons appearing in the COCO dataset by introducing an efficient annotation pipeline. We then use our dataset to train CNN-based systems that deliver dense correspondence ‘in the wild’, namely in the presence of background, occlusions and scale variations. We improve our training set’s effectiveness by training an ‘inpainting’ network that can fill in missing ground truth values, and report clear improvements with respect to the best results that would be achievable in the past. We experiment with fully-convolutional networks and region-based models and observe a superiority of the latter; we further improve accuracy through cascading, obtaining a system that delivers highly-accurate results in real time. Supplementary materials and videos are provided on the project page http://densepose.org.
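
In the region-based model, a head on top of RoI features makes a per-pixel prediction: which of the 24 body parts (or background) the pixel belongs to, plus its (U, V) coordinates in that part's surface chart. Below is a rough sketch under that reading; the convolutional stack and sizes are illustrative, not the paper's exact head.

```python
import torch
import torch.nn as nn

class DensePoseHead(nn.Module):
    """Sketch of a region-based dense-correspondence head: for every pixel of
    an RoI, classify its body part and regress its (U, V) coordinates in that
    part's surface chart. The conv stack and widths are illustrative."""
    def __init__(self, in_channels=256, num_parts=24, dim=512):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, dim, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(inplace=True))
        self.part = nn.Conv2d(dim, num_parts + 1, 1)   # part index + background
        self.u = nn.Conv2d(dim, num_parts, 1)          # U coordinate per part
        self.v = nn.Conv2d(dim, num_parts, 1)          # V coordinate per part

    def forward(self, roi_feats):                      # (N, 256, H, W) RoI features
        x = self.convs(roi_feats)
        return self.part(x), self.u(x), self.v(x)
```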
