Abstract
Convolutional neural networks (CNNs) are inherently limited in their ability to model geometric transformations because of the fixed geometric structures in their building modules.
In this work, the authors introduce two new modules to enhance the transformation modeling capability of CNNs: deformable convolution and deformable RoI pooling. Both are based on the idea of augmenting the spatial sampling locations in the modules with additional offsets and learning the offsets from the target tasks, without additional supervision.
The new modules can readily replace their plain counterparts in existing CNNs and can be easily trained end-to-end by standard back-propagation, giving rise to deformable convolutional networks. Extensive experiments validate the performance of the approach.
Introduction
A key challenge in visual recognition is how to accommodate geometric variations or model geometric transformations in object scale, pose, viewpoint, and part deformation.
In general, there are two ways.
•The first is to build training datasets with sufficient desired variations.
•The second is to use transformation-invariant features and algorithms, such as SIFT.
Both approaches have two drawbacks.
•First, the geometric transformations are assumed to be fixed and known.
•Second, handcrafted design of invariant features and algorithms can be difficult or infeasible for overly complex transformations, even when they are known.
The paper introduces two new modules. The first is deformable convolution. It adds 2D offsets to the regular grid sampling locations of the standard convolution, which enables free-form deformation of the sampling grid. The offsets are learned from the preceding feature maps via additional convolutional layers. Thus, the deformation is conditioned on the input features in a local, dense, and adaptive manner.
The second is deformable RoI pooling. It adds an offset to each bin position in the regular bin partition of the previous RoI pooling [15, 7]. Similarly, the offsets are learned from the preceding feature maps and the RoIs, enabling adaptive part localization for objects with different shapes.
Deformable Convolutional Networks
Deformable Convolution
The 2D convolution consists of two steps:
1) sampling using a regular grid R over the input feature map x;
2) summation of sampled values weighted by w.
The grid R defines the receptive field size and dilation. For example,
R = {(−1, −1),(−1, 0), . . . ,(0, 1),(1, 1)}
defines a 3 × 3 kernel with dilation 1.
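The grid R can be enumerated in a few lines. The sketch below is only an illustration: the function name `regular_grid` and its parameters are hypothetical, and it assumes an odd, square kernel.

```python
def regular_grid(kernel_size=3, dilation=1):
    """Return the sampling offsets R of a square kernel as 2D tuples.

    Assumes an odd kernel_size so that the grid is centred on (0, 0).
    """
    half = kernel_size // 2
    steps = [d * dilation for d in range(-half, half + 1)]
    return [(a, b) for a in steps for b in steps]


R = regular_grid(3, 1)
print(R[0], R[-1], len(R))  # (-1, -1) (1, 1) 9
```

With `dilation=2` the same call yields offsets from (−2, −2) to (2, 2) in steps of 2, still with nine entries: a larger receptive field with the same number of samples.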
For each location p0 on the output feature map y, we have
y(p0) = Σ_{pn∈R} w(pn) · x(p0 + pn)    (1)
where pn enumerates the locations in R.
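As a rough illustration of the two steps (sample on the grid, then take the weighted sum), the sketch below evaluates a single output location, assuming the usual form y(p0) = Σ w(pn) · x(p0 + pn). The helper name `conv_at`, the dictionary of weights, and the zero padding at the border are assumptions of this example, not details from the paper.

```python
def conv_at(x, w, p0, grid):
    """Weighted sum of x sampled at p0 + pn for every pn in the grid.

    x: 2D list (feature map) indexed as x[row][col]
    w: dict mapping each grid offset pn to its weight
    p0: (row, col) output location
    Samples that fall outside x count as zero (assumed zero padding).
    """
    height, width = len(x), len(x[0])
    total = 0.0
    for pn in grid:
        r, c = p0[0] + pn[0], p0[1] + pn[1]
        if 0 <= r < height and 0 <= c < width:
            total += w[pn] * x[r][c]
    return total


grid = [(a, b) for a in (-1, 0, 1) for b in (-1, 0, 1)]
x = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
w = {pn: 1.0 / 9 for pn in grid}        # box filter as a stand-in kernel
y_center = conv_at(x, w, (1, 1), grid)  # mean of all nine values, about 5.0
```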
In deformable convolution, the regular grid R is augmented with offsets {∆pn | n = 1, ..., N}, where N = |R|. Eq. (1) becomes
y(p0) = Σ_{pn∈R} w(pn) · x(p0 + pn + ∆pn)    (2)
Now, the sampling is on the irregular and offset locations pn + ∆pn. As the offset ∆pn is typically fractional, Eq. (2) is implemented via bilinear interpolation as
x(p) = Σ_q G(q, p) · x(q)    (3)
where p denotes an arbitrary (fractional) location (p = p0 + pn + ∆pn for Eq. (2)), q enumerates all integral spatial locations in the feature map x, and G(·, ·) is the bilinear interpolation kernel. Note that G is two-dimensional. It is separated into two one-dimensional kernels as
G(q, p) = g(qx, px) · g(qy, py)    (4)
where g(a, b) = max(0, 1 − |a − b|).
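The interpolation and the deformable sum can be sketched together. The code assumes the separable kernel G(q, p) = g(qx, px) · g(qy, py) with g(a, b) = max(0, 1 − |a − b|), and the deformable sum y(p0) = Σ w(pn) · x(p0 + pn + ∆pn). The function names are hypothetical, the offsets are passed in by hand rather than learned, and locations outside the feature map are treated as zero.

```python
import math


def g(a, b):
    """One-dimensional bilinear kernel: max(0, 1 - |a - b|)."""
    return max(0.0, 1.0 - abs(a - b))


def bilinear_sample(x, p):
    """Value of x at a fractional location p = (row, col).

    Only the (up to) four integer neighbours of p have non-zero weight,
    so the sum over all integer locations q is restricted to them.
    """
    height, width = len(x), len(x[0])
    r0, c0 = math.floor(p[0]), math.floor(p[1])
    value = 0.0
    for qr in (r0, r0 + 1):
        for qc in (c0, c0 + 1):
            if 0 <= qr < height and 0 <= qc < width:
                value += g(qr, p[0]) * g(qc, p[1]) * x[qr][qc]
    return value


def deform_conv_at(x, w, p0, grid, offsets):
    """Sum of w(pn) * x(p0 + pn + dpn); offsets[n] pairs with grid[n]."""
    total = 0.0
    for pn, dpn in zip(grid, offsets):
        p = (p0[0] + pn[0] + dpn[0], p0[1] + pn[1] + dpn[1])
        total += w[pn] * bilinear_sample(x, p)
    return total


x = [[0, 1],
     [2, 3]]
print(bilinear_sample(x, (0.5, 0.5)))  # 1.5
```

When every offset is (0, 0), `deform_conv_at` reduces to the ordinary convolution sum, which is a convenient sanity check.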
Deformable RoI Pooling
RoI Pooling. Given the input feature map x and an RoI of size w × h with top-left corner p0, RoI pooling divides the RoI into k × k bins (k is a free parameter) and outputs a k × k feature map y. For the (i, j)-th bin (0 ≤ i, j < k), we have
y(i, j) = Σ_{p∈bin(i,j)} x(p0 + p) / nij    (5)
where nij is the number of pixels in the bin. The (i, j)-th bin spans ⌊i·w/k⌋ ≤ px < ⌈(i+1)·w/k⌉ and ⌊j·h/k⌋ ≤ py < ⌈(j+1)·h/k⌉.
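A small sketch of average RoI pooling may help make the bin layout concrete. It assumes the (i, j)-th bin covers ⌊i·w/k⌋ ≤ px < ⌈(i+1)·w/k⌉ and ⌊j·h/k⌋ ≤ py < ⌈(j+1)·h/k⌉, with i running along the width and j along the height. The function name `roi_pool` and the `x[row][col]` indexing are choices made for this example only.

```python
import math


def roi_pool(x, p0, w, h, k):
    """Average RoI pooling into a k x k output.

    x: 2D list indexed as x[row][col]
    p0: (col, row) of the RoI's top-left corner
    w, h: RoI width and height in pixels
    Returns y with y[i][j] the mean over bin (i, j).
    """
    y = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            xs = range(math.floor(i * w / k), math.ceil((i + 1) * w / k))
            ys = range(math.floor(j * h / k), math.ceil((j + 1) * h / k))
            values = [x[p0[1] + py][p0[0] + px] for px in xs for py in ys]
            y[i][j] = sum(values) / len(values)  # len(values) plays the role of nij
    return y


x = [[4 * r + c for c in range(4)] for r in range(4)]  # values 0..15
pooled = roi_pool(x, (0, 0), 4, 4, 2)
print(pooled)  # [[2.5, 10.5], [4.5, 12.5]]
```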
Similarly to Eq. (2), in deformable RoI pooling, offsets {∆pij | 0 ≤ i, j < k} are added to the spatial binning positions. Eq. (5) becomes
y(i, j) = Σ_{p∈bin(i,j)} x(p0 + p + ∆pij) / nij    (6)
Typically, ∆pij is fractional, so Eq. (6) is implemented by bilinear interpolation via Eq. (3) and (4).
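The deformable variant can be sketched by shifting every sample of a bin by that bin's offset and reading the feature map with bilinear interpolation. As before, this is an illustrative sketch: the function names are hypothetical, the offsets are supplied by hand as (dx, dy) pairs in pixels, and samples outside the feature map contribute zero.

```python
import math


def bilinear(x, px, py):
    """Interpolate x[row][col] at column px and row py (zero outside the map)."""
    c0, r0 = math.floor(px), math.floor(py)
    value = 0.0
    for r in (r0, r0 + 1):
        for c in (c0, c0 + 1):
            if 0 <= r < len(x) and 0 <= c < len(x[0]):
                weight = max(0.0, 1 - abs(r - py)) * max(0.0, 1 - abs(c - px))
                value += weight * x[r][c]
    return value


def deform_roi_pool(x, p0, w, h, k, offsets):
    """Average pooling where offsets[i][j] = (dx, dy) shifts all of bin (i, j)."""
    y = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            dx, dy = offsets[i][j]
            xs = range(math.floor(i * w / k), math.ceil((i + 1) * w / k))
            ys = range(math.floor(j * h / k), math.ceil((j + 1) * h / k))
            values = [bilinear(x, p0[0] + px + dx, p0[1] + py + dy)
                      for px in xs for py in ys]
            y[i][j] = sum(values) / len(values)
    return y


x = [[4 * r + c for c in range(4)] for r in range(4)]  # values 0..15
zero = [[(0.0, 0.0)] * 2 for _ in range(2)]
plain = deform_roi_pool(x, (0, 0), 4, 4, 2, zero)      # same as ordinary RoI pooling
shifted = deform_roi_pool(x, (0, 0), 4, 4, 2,
                          [[(0.5, 0.0), (0.0, 0.0)],
                           [(0.0, 0.0), (0.0, 0.0)]])  # only bin (0, 0) moves right
```

Here `shifted[0][0]` is about 3.0 instead of 2.5, because the bin now averages values half a pixel to the right, while the other three bins are unchanged.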
First, RoI pooling (Eq. (5)) generates the pooled feature maps. From these maps, an fc layer generates the normalized offsets ∆p̂ij, which are then transformed to the offsets ∆pij in Eq. (6) by element-wise product with the RoI’s width and height, as ∆pij = γ · ∆p̂ij ◦ (w, h). Here γ is a pre-defined scalar that modulates the magnitude of the offsets. It is empirically set to γ = 0.1. The offset normalization is necessary to make the offset learning invariant to RoI size.
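The scaling step can be illustrated in isolation, assuming the transformation ∆pij = γ · ∆p̂ij ◦ (w, h) with γ = 0.1. The function name `scale_offsets` is hypothetical, and the normalized offsets are made-up numbers standing in for the output of the fc layer.

```python
def scale_offsets(normalized, w, h, gamma=0.1):
    """Convert normalized offsets (dx_hat, dy_hat) into pixel offsets.

    Each offset is multiplied element-wise by the RoI size (w, h)
    and by the scalar gamma.
    """
    return [[(gamma * dx_hat * w, gamma * dy_hat * h) for dx_hat, dy_hat in row]
            for row in normalized]


normalized = [[(1.0, -0.5)]]                 # made-up output for a 1 x 1 bin grid
small = scale_offsets(normalized, 40, 20)    # roughly (4.0, -1.0) pixels
large = scale_offsets(normalized, 80, 40)    # roughly (8.0, -2.0) pixels
```

The same normalized offset produces a pixel shift proportional to the RoI size, which is what keeps the learned quantity independent of how large the RoI happens to be.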
Position-Sensitive (PS) RoI Pooling