Paper Reading Notes (33): Relation Network for Object Detection

Although it has long been believed that modeling relations between objects would help object recognition, there has been no evidence that the idea works in the deep learning era. All state-of-the-art object detection systems still rely on recognizing object instances individually, without exploiting their relations during learning.

This work proposes an object relation module. It processes a set of objects simultaneously through interaction between their appearance features and geometry, thus allowing their relations to be modeled. It is lightweight and in-place. It does not require additional supervision and is easy to embed in existing networks. It is shown to be effective at improving the object recognition and duplicate removal steps in the modern object detection pipeline. It verifies the efficacy of modeling object relations in CNN-based detection. It gives rise to the first fully end-to-end object detector.

Recent years have witnessed significant progress in object detection using deep convolutional neural networks (CNNs) [27]. The state-of-the-art object detection methods [24, 18, 38, 9, 32, 10, 23] mostly follow the region-based paradigm established in the seminal work R-CNN [19]. Given a sparse set of region proposals, object classification and bounding box regression are performed on each proposal individually. A heuristic, hand-crafted post-processing step, non-maximum suppression (NMS), is then applied to remove duplicate detections.

It has been well recognized in the vision community for years that contextual information, or the relation between objects, helps object recognition [12, 17, 46, 47, 39, 36, 16, 6]. Most such works predate the prevalence of deep learning. In the deep learning era, there has been no significant progress in exploiting object relations for detection learning. Most methods still focus on recognizing objects separately.

One reason is that object-object relations are hard to model. The objects are at arbitrary image locations, of different scales, within different categories, and their number may vary across images. Modern CNN-based methods mostly have a simple, regular network structure [25, 23]. It is unclear how to accommodate the above irregularities in existing methods.

Our approach is motivated by the success of attention modules in the natural language processing field [5, 49]. An attention module can affect an individual element (e.g., a word in the target sentence in machine translation) by aggregating information (or features) from a set of elements (e.g., all words in the source sentence). The aggregation weights are learnt automatically, driven by the task goal. An attention module can model dependencies between the elements without making excessive assumptions about their locations and feature distributions. Recently, attention modules have been successfully applied to vision problems such as image captioning [50].
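
To make the aggregation step concrete, here is a minimal sketch of a basic attention module in plain Python. It is an illustration under assumptions, not the paper's implementation: the weights are a softmax over scaled dot products, the learned projection matrices that a real module trains are omitted, and the names `attend`, `keys`, and `values` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Aggregate `values` using weights derived from `query` and `keys`.

    Assumption: weights come from scaled dot products. A trained module
    would first apply learned projections to the query, keys, and values.
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights

# One target element (the query) attends over three source elements.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0], [20.0], [30.0]]
output, weights = attend([1.0, 0.0], keys, values)
```

The weights always sum to one, so the output is a convex combination of the source values. Nothing in the computation depends on how many source elements there are or on their order.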

In this work, we propose for the first time an adapted attention module for object detection. It is built upon a basic attention module. An apparent distinction is that the primitive elements are objects instead of words. The objects have a 2D spatial arrangement and vary in scale and aspect ratio. Their locations, or geometric features in a general sense, play a more complex and important role than the word location in a 1D sentence. Accordingly, the proposed module extends the original attention weight into two components: the original weight and a new geometric weight. The latter models the spatial relationships between objects and only considers the relative geometry between them, making the module translation invariant, a desirable property for object recognition. The new geometric weight proves important in our experiments.
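
The excerpt does not give the formula for the geometric weight, so the following is only a sketch of why relative geometry yields translation invariance. It assumes boxes are `(cx, cy, w, h)`, a 4-d log-space relative feature, and one plausible way of combining a non-negative geometric weight with an appearance score. The function names, the `eps` clamp, and the combination rule are all assumptions made for illustration.

```python
import math

def relative_geometry(box_m, box_n, eps=1e-3):
    """4-d relative geometry of box_n with respect to box_m.

    Boxes are (cx, cy, w, h). Only center differences and size ratios
    are used, so shifting both boxes by the same offset changes nothing.
    `eps` is an assumed clamp that avoids log(0) for aligned centers.
    """
    cx_m, cy_m, w_m, h_m = box_m
    cx_n, cy_n, w_n, h_n = box_n
    return (
        math.log(max(abs(cx_m - cx_n), eps) / w_m),
        math.log(max(abs(cy_m - cy_n), eps) / h_m),
        math.log(w_n / w_m),
        math.log(h_n / h_m),
    )

def combine_weights(appearance_scores, geometric_weights):
    """Normalize geometric_weight * exp(appearance_score) over all objects.

    Assumes at least one geometric weight is positive.
    """
    top = max(appearance_scores)
    unnorm = [g * math.exp(a - top) for a, g in zip(appearance_scores, geometric_weights)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def shift(box, dx, dy):
    """Translate a box without changing its size."""
    return (box[0] + dx, box[1] + dy, box[2], box[3])

box_a = (10.0, 20.0, 4.0, 8.0)
box_b = (30.0, 25.0, 8.0, 4.0)
original = relative_geometry(box_a, box_b)
shifted = relative_geometry(shift(box_a, 100.0, -50.0), shift(box_b, 100.0, -50.0))
weights = combine_weights([0.5, 1.5, 0.2], [1.0, 0.0, 2.0])
```

Here `original` and `shifted` agree, which is the translation invariance described above. In `combine_weights`, an object whose geometric weight is zero contributes nothing, however strong its appearance score is.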

The module is called the object relation module. It shares the advantages of an attention module. It takes a variable number of inputs, runs in parallel (as opposed to sequential relation modeling [29, 44, 6]), is fully differentiable, and is in-place (no dimension change between input and output). As a result, it serves as a basic building block that can be used flexibly in any architecture.
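
The "variable number of inputs" and "in-place" properties can be illustrated with a small sketch. It assumes a heavily simplified module: scaled dot-product weights with no learned projections and no geometric term, with the aggregated context added back to each input feature. The residual addition and the name `relation_module` are assumptions chosen for illustration.

```python
import math

def relation_module(features):
    """Return one updated feature per input object (same count, same dimension).

    Hypothetical simplification: scores are plain scaled dot products,
    and the aggregated context is added back to each input feature.
    """
    dim = len(features[0])
    outputs = []
    for f_m in features:
        scores = [sum(a * b for a, b in zip(f_m, f_n)) / math.sqrt(dim) for f_n in features]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of all object features, one value per dimension.
        context = [sum(w * f_n[i] for w, f_n in zip(weights, features)) for i in range(dim)]
        outputs.append([x + c for x, c in zip(f_m, context)])
    return outputs

# The same function handles sets of different sizes.
three_objects = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
five_objects = three_objects + [[0.5, 0.5], [2.0, -1.0]]
out3 = relation_module(three_objects)
out5 = relation_module(five_objects)
```

Because the output has the same shape as the input, such a block could be inserted between two existing layers without changing anything downstream. Each object's update is independent of the others, so the outer loop could run in parallel.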

Specifically, it is applied to several state-of-the-art object detection architectures [38, 10, 32] and shows consistent improvement. As illustrated in Figure 1, it is applied to improve the instance recognition step and to learn the duplicate removal step (see Section 4.1 for details). For instance recognition, the relation module enables joint reasoning over all objects and improves recognition accuracy (Section 4.2). For duplicate removal, the traditional NMS method is replaced and improved upon by a lightweight relation network (Section 4.3), resulting in the first end-to-end object detector (Section 4.4), to the best of our knowledge.

In principle, our approach is fundamentally different from, and would complement, most (if not all) CNN-based object detection methods. It exploits a new dimension: a set of objects is processed and reasoned about jointly, with the objects affecting each other simultaneously, instead of being recognized individually.

The object relation module is general and not limited to object detection. We do not see any reason preventing it from finding broader applications in vision tasks, such as instance segmentation [30], action recognition [41], object relationship detection [28], image captioning [50], VQA [1], etc.

Object relation in post-processing. Most early works use object relations as a post-processing step [12, 17, 46, 47, 36]. The detected objects are re-scored by considering object relationships. For example, co-occurrence, which indicates how likely two object classes are to appear in the same image, is used by DPM [15] to refine object scores. Subsequent approaches [7, 36] try more complex relation models by taking additional position and size [3] into account. We refer readers to [16] for a more detailed survey. These methods achieved moderate success in the pre-deep-learning era but have not proved effective with deep ConvNets. A possible reason is that deep ConvNets have implicitly incorporated contextual information through their large receptive fields.

Sequential relation modeling. Several recent works perform sequential reasoning (LSTM [29, 44] and spatial memory network (SMN) [6]) to model object relations. During detection, objects detected earlier are used to help find subsequent objects. Training in such methods is usually sophisticated. More importantly, they do not show evidence of improving the state-of-the-art object detection approaches, which are simple feed-forward networks.

In contrast, our approach is parallel for multiple objects. It naturally fits into and improves modern object detectors.

Human-centered scenarios. Quite a few works focus on human-object relations [51, 22, 20, 21]. They usually require additional annotations of relations, such as human actions. In contrast, our approach is general for object-object relations and does not need additional supervision.

Duplicate removal. In spite of the significant progress in object detection using deep learning, the most effective method for this task is still the greedy, hand-crafted non-maximum suppression (NMS) and its soft version [4]. This task naturally needs relation modeling. For example, NMS uses simple relations between bounding boxes and scores.
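
For reference, the greedy NMS baseline mentioned above can be sketched in a few lines. This is the commonly used textbook form, not code from the paper. Boxes are assumed to be `(x1, y1, x2, y2)` with positive area, and the 0.5 IoU threshold is an illustrative default.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first.

    A box is dropped if it overlaps an already kept, higher-scoring box
    by more than `iou_threshold`.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(0.0, 0.0, 10.0, 10.0), (1.0, 1.0, 11.0, 11.0), (20.0, 20.0, 30.0, 30.0)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)
```

The only "relations" NMS uses are the pairwise IoU and the score ordering, and the threshold is set by hand. This is the step the relation network is meant to learn instead.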

Recently, GossipNet [26] attempted to learn duplicate removal by processing a set of objects as a whole, therefore sharing a similar spirit with our work. However, its network is specifically designed for the task and is very complex (depth > 80). Its accuracy is comparable to NMS, but its computation cost is high. Although it allows end-to-end learning in principle, no experimental evidence is shown.

In contrast, our relation module is simple and general, and duplicate removal is only one of its applications. Our network for duplicate removal is much simpler, has small computational overhead, and surpasses SoftNMS [4]. More importantly, we show for the first time that end-to-end object detection learning is feasible and effective.

The comprehensive ablation experiments suggest that the relation modules have learnt information between objects that is missing when learning is performed on individual objects. Nevertheless, it is not clear what is learnt in the relation module, especially when multiple ones are stacked.

To better understand this, we investigate the (only) relation module in the {r1, r2} = {1, 0} head in Table 1(c). Figure 4 shows some representative examples with high relation weights. The left example suggests that several objects overlapping on the same ground truth (bicycle) contribute to the center object. The right example suggests that the person contributes to the glove. While these examples are intuitive, our understanding of how the relation module works is preliminary, and a deeper analysis is left as future work.

Figure 1. Current state-of-the-art object detectors are based on a four-step pipeline. Our object relation module (illustrated as red dashed boxes) can be conveniently adopted to improve both instance recognition and duplicate removal steps, resulting in an end-to-end object detector.
