Reading notes: Deformable Convolutional Networks
These notes record the parts of the original paper that I found most helpful for understanding it.
In this work, we introduce two new modules to enhance the transformation modeling capability of CNNs, namely, deformable convolution and deformable RoI pooling. Both are based on the idea of augmenting the spatial sampling locations in the modules with additional offsets and learning the offsets from the target tasks, without additional supervision.
The new modules can be easily trained end-to-end by standard back-propagation, giving rise to deformable convolutional networks.
Deformable Convolution
Deformable convolution adds 2D offsets to the regular grid sampling locations of standard convolution, which enables free-form deformation of the sampling grid, as illustrated in Figure 1. The offsets are learned from the preceding feature maps via additional convolutional layers.
The 2D convolution consists of two steps: 1) sampling using a regular grid $R$ over the input feature map $x$; 2) summation of sampled values weighted by $w$. The grid $R$ defines the receptive field size and dilation. For example,
$$R = \{(-1,-1), (-1,0), \ldots, (0,1), (1,1)\}$$
defines a 3 x 3 kernel with dilation 1.
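To make the role of the grid concrete, here is a minimal sketch that enumerates R for a given kernel size and dilation. The function name `sampling_grid` is my own, and the sketch assumes an odd, square kernel centered on the output location.

```python
def sampling_grid(kernel_size=3, dilation=1):
    """Enumerate the regular grid R as a list of 2D integer offsets.

    Assumes an odd, square kernel centered at (0, 0).
    """
    half = kernel_size // 2
    steps = [dilation * s for s in range(-half, half + 1)]
    return [(a, b) for a in steps for b in steps]


R = sampling_grid(kernel_size=3, dilation=1)
R_dilated = sampling_grid(kernel_size=3, dilation=2)
```

With dilation 1 the grid runs from (-1, -1) to (1, 1). With dilation 2 the same nine taps are spread from (-2, -2) to (2, 2), so the receptive field grows while N = |R| stays at 9.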
For each location $p_0$ on the output feature map $y$, we have
$$y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n) \tag{1}$$
where $p_n$ enumerates the locations in $R$.
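As a sanity check on Eq. (1), read as a weighted sum of x(p0 + pn) over the grid locations pn, the sketch below evaluates it at a single output location. The helper `conv_at`, the (row, col) indexing convention, and the flat weight vector are assumptions made for illustration. Padding is not handled, so p0 must lie far enough from the border.

```python
import numpy as np


def conv_at(x, w, p0, grid):
    """Evaluate Eq. (1) at one output location p0 = (row, col).

    x    : 2D input feature map
    w    : one weight per grid location, in the same order as `grid`
    grid : the regular grid R as a list of (d_row, d_col) offsets
    """
    total = 0.0
    for w_n, (d_row, d_col) in zip(w, grid):
        total += w_n * x[p0[0] + d_row, p0[1] + d_col]
    return total


grid = [(a, b) for a in (-1, 0, 1) for b in (-1, 0, 1)]  # 3 x 3, dilation 1
x = np.arange(25, dtype=float).reshape(5, 5)             # x[r, c] = 5r + c
w = np.full(9, 1.0 / 9.0)                                # averaging kernel
center_value = conv_at(x, w, (2, 2), grid)
```

With the averaging kernel, `center_value` is the mean of the central 3 x 3 patch, which is 12 up to floating-point rounding.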
In deformable convolution, the regular grid $R$ is augmented with offsets $\{\Delta p_n \mid n = 1, \ldots, N\}$, where $N = |R|$. Eq. (1) becomes
$$y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_n) \tag{2}$$
Now, the sampling is on the irregular and offset locations $p_n + \Delta p_n$. As the offset $\Delta p_n$ is typically fractional, Eq. (2) is implemented via bilinear interpolation as
$$x(p) = \sum_q G(q, p) \cdot x(q) \tag{3}$$
where $p$ denotes an arbitrary (fractional) location ($p = p_0 + p_n + \Delta p_n$ for Eq. (2)), $q$ enumerates all integral spatial locations in the feature map $x$, and $G(\cdot, \cdot)$ is the bilinear interpolation kernel. Note that $G$ is two-dimensional. It is separated into two one-dimensional kernels as
$$G(q, p) = g(q_x, p_x) \cdot g(q_y, p_y) \tag{4}$$
where $g(a, b) = \max(0, 1 - |a - b|)$. Eq. (3) is fast to compute as $G(q, p)$ is non-zero only for a few $q$s.
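The interpolation and the deformed sampling can be sketched together in a few lines of NumPy. This is an illustrative sketch, not the paper's implementation. The names `bilinear_sample` and `deformable_conv_at` are hypothetical, locations are (row, col) pairs, and treating neighbours outside the feature map as zero is my own assumption. The offsets are passed in directly, so how they are learned is outside the sketch.

```python
import numpy as np


def g(a, b):
    """One-dimensional kernel g(a, b) = max(0, 1 - |a - b|)."""
    return max(0.0, 1.0 - abs(a - b))


def bilinear_sample(x, p):
    """Eq. (3) for a fractional location p = (row, col).

    G(q, p) can be non-zero only for the four integer neighbours of p,
    so the sum visits just those. Neighbours outside the map count as
    zero (assumption).
    """
    rows, cols = x.shape
    r0, c0 = int(np.floor(p[0])), int(np.floor(p[1]))
    value = 0.0
    for qr in (r0, r0 + 1):
        for qc in (c0, c0 + 1):
            if 0 <= qr < rows and 0 <= qc < cols:
                value += g(qr, p[0]) * g(qc, p[1]) * x[qr, qc]  # Eq. (4)
    return value


def deformable_conv_at(x, w, p0, grid, offsets):
    """Eq. (2) at one output location: each tap p_n is shifted by offsets[n]."""
    total = 0.0
    for w_n, p_n, d_n in zip(w, grid, offsets):
        p = (p0[0] + p_n[0] + d_n[0], p0[1] + p_n[1] + d_n[1])
        total += w_n * bilinear_sample(x, p)
    return total


grid = [(a, b) for a in (-1, 0, 1) for b in (-1, 0, 1)]
x = np.arange(25, dtype=float).reshape(5, 5)   # x[r, c] = 5r + c
w = np.full(9, 1.0 / 9.0)                      # averaging kernel
no_shift = deformable_conv_at(x, w, (2, 2), grid, [(0.0, 0.0)] * 9)
half_shift = deformable_conv_at(x, w, (2, 2), grid, [(0.5, 0.5)] * 9)
```

With all offsets at zero, the result matches standard convolution. Because the test map is linear in row and column, shifting every tap by (0.5, 0.5) raises the average by 5 * 0.5 + 0.5 = 3.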
Deformable RoI pooling
Deformable RoI pooling adds an offset to each bin position in the regular bin partition of standard RoI pooling. Similarly, the offsets are learned from the preceding feature maps and RoIs, enabling adaptive part localization for objects with different shapes.
RoI pooling is used in all region-proposal-based object detection methods. It converts an input rectangular region of arbitrary size into fixed-size features.
RoI Pooling. Given the input feature map $x$ and an RoI of size $w \times h$ with top-left corner $p_0$, RoI pooling divides the RoI into $k \times k$ ($k$ is a free parameter) bins and outputs a $k \times k$ feature map $y$. For the $(i, j)$-th bin ($0 \le i, j < k$), we have
$$y(i, j) = \sum_{p \in bin(i, j)} x(p_0 + p) / n_{ij} \tag{5}$$
where $n_{ij}$ is the number of pixels in the bin. The $(i, j)$-th bin spans $\lfloor i \frac{w}{k} \rfloor \le p_x < \lceil (i+1) \frac{w}{k} \rceil$ and $\lfloor j \frac{h}{k} \rfloor \le p_y < \lceil (j+1) \frac{h}{k} \rceil$.
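The bin arithmetic is easier to follow in code. Below is a minimal sketch of the average RoI pooling described above, assuming the bin spans use floor and ceiling of i·w/k and (i+1)·w/k (and likewise for j and h). The name `roi_pool` is hypothetical, p0 is given as (x0, y0), and the feature map is indexed as `x[row, col]`, that is `x[py, px]`. The RoI is assumed to lie fully inside the map.

```python
import math

import numpy as np


def roi_pool(x, p0, w, h, k):
    """Average RoI pooling into a k x k output.

    Index i runs along the width (p_x) and j along the height (p_y),
    so out[i, j] is the mean over the (i, j)-th bin.
    """
    x0, y0 = p0
    out = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            px_lo, px_hi = math.floor(i * w / k), math.ceil((i + 1) * w / k)
            py_lo, py_hi = math.floor(j * h / k), math.ceil((j + 1) * h / k)
            bin_values = x[y0 + py_lo:y0 + py_hi, x0 + px_lo:x0 + px_hi]
            out[i, j] = bin_values.mean()   # sum over the bin divided by n_ij
    return out


x = np.arange(36, dtype=float).reshape(6, 6)   # x[row, col] = 6 * row + col
pooled = roi_pool(x, p0=(1, 1), w=4, h=4, k=2)
```

When w/k or h/k is not an integer, the floor and ceiling make neighbouring bins overlap by one pixel. For example, with w = 3 and k = 2 the two bins span p_x in [0, 2) and [1, 3).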
Similarly to Eq. (2), in deformable RoI pooling, offsets $\{\Delta p_{ij} \mid 0 \le i, j < k\}$ are added to the spatial binning positions. Eq. (5) becomes
$$y(i, j) = \sum_{p \in bin(i, j)} x(p_0 + p + \Delta p_{ij}) / n_{ij} \tag{6}$$
Typically, $\Delta p_{ij}$ is fractional. Eq. (6) is implemented by bilinear interpolation via Eq. (3) and Eq. (4).
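Putting the pieces together, a rough sketch of deformable RoI pooling shifts every pixel position of bin (i, j) by that bin's offset and reads the feature map with bilinear interpolation. The function names are hypothetical. Offsets are given directly in pixels as (dx, dy) per bin, which is a simplifying assumption, and how they are predicted from the features is not covered here. Samples outside the map are treated as zero, also an assumption.

```python
import math

import numpy as np


def bilinear_sample(x, px, py):
    """Bilinear interpolation of x at fractional (px, py), with x indexed x[py, px]."""
    rows, cols = x.shape
    c0, r0 = math.floor(px), math.floor(py)
    value = 0.0
    for qr in (r0, r0 + 1):
        for qc in (c0, c0 + 1):
            if 0 <= qr < rows and 0 <= qc < cols:
                weight = max(0.0, 1 - abs(qc - px)) * max(0.0, 1 - abs(qr - py))
                value += weight * x[qr, qc]
    return value


def deformable_roi_pool(x, p0, w, h, k, offsets):
    """Average pooling where bin (i, j) is shifted by offsets[i][j] = (dx, dy)."""
    x0, y0 = p0
    out = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            dx, dy = offsets[i][j]
            xs = range(math.floor(i * w / k), math.ceil((i + 1) * w / k))
            ys = range(math.floor(j * h / k), math.ceil((j + 1) * h / k))
            samples = [bilinear_sample(x, x0 + px + dx, y0 + py + dy)
                       for py in ys for px in xs]
            out[i, j] = sum(samples) / len(samples)   # divide by n_ij
    return out


x = np.arange(36, dtype=float).reshape(6, 6)   # x[row, col] = 6 * row + col
zero = [[(0.0, 0.0)] * 2 for _ in range(2)]
half = [[(0.5, 0.5)] * 2 for _ in range(2)]
plain = deformable_roi_pool(x, (1, 1), 4, 4, 2, zero)
shifted = deformable_roi_pool(x, (1, 1), 4, 4, 2, half)
```

With zero offsets this reduces to plain average RoI pooling. On this linear test map, a (0.5, 0.5) shift raises every bin by 6 * 0.5 + 0.5 = 3.5, which is a quick way to check that the interpolation is wired correctly.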