Bounding-box regression
Basics(RCNN)
This section mainly follows Appendix C of the RCNN paper.
Bbox regression is a machine-learning regression problem defined on bounding boxes.
input
$\{(P^i, G^i)\}_{i=1,\dots,N}$
where $P^i=(P^i_x,P^i_y,P^i_w,P^i_h)$ specifies the pixel coordinates of the center of proposal $P^i$'s bounding box together with $P^i$'s width and height in pixels.
Henceforth, we drop the superscript $i$ unless it is needed.
Each ground-truth bounding box $G$ is specified in the same way: $G=(G_x,G_y,G_w,G_h)$.
$P$ is the proposal bounding box. Because RCNN performs the regression on top of region proposals, $P$ is the natural initial value for the regression. It can also be seen as an extension of the traditional sliding-window approach to detection.
goal
Our goal is to learn a transformation that maps a proposed box $P$ to a ground-truth box $G$.
Let the unknown true mapping be $f: P \to G$. We look for a $g: P \to \hat G$ such that $g \approx f$.
model
We parameterize the transformation in terms of four functions $d_x(P), d_y(P), d_w(P), d_h(P)$. The first two specify a scale-invariant translation of the center of $P$'s bounding box, while the second two specify log-space translations of the width and height of $P$'s bounding box.
After learning these functions, we can transform an input proposal $P$ into a predicted ground-truth box $\hat G$ by applying the transformation
$$
\begin{aligned}
\hat{G}_x &= P_w d_x(P) + P_x \\
\hat{G}_y &= P_h d_y(P) + P_y \\
\hat{G}_w &= P_w \exp(d_w(P)) \\
\hat{G}_h &= P_h \exp(d_h(P))
\end{aligned}
$$
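The transformation is short enough to write out directly. The following is a minimal sketch, assuming boxes are `(x, y, w, h)` tuples in center format and taking the center terms as $P_x$ and $P_y$, as in Appendix C of the RCNN paper. The function name `apply_deltas` and the sample numbers are made up for illustration.

```python
import math

def apply_deltas(proposal, deltas):
    """Map a proposal (x, y, w, h) to a predicted box using (dx, dy, dw, dh)."""
    px, py, pw, ph = proposal
    dx, dy, dw, dh = deltas
    gx = pw * dx + px          # shift the center by a fraction of the width
    gy = ph * dy + py          # shift the center by a fraction of the height
    gw = pw * math.exp(dw)     # rescale the width in log space
    gh = ph * math.exp(dh)     # rescale the height in log space
    return (gx, gy, gw, gh)

# Hypothetical proposal and offsets, chosen only to make the arithmetic easy.
proposal = (100.0, 80.0, 40.0, 20.0)
predicted = apply_deltas(proposal, (0.5, -0.25, math.log(2.0), 0.0))
print(predicted)  # approximately (120.0, 75.0, 80.0, 20.0)
```

All-zero offsets return the proposal unchanged, which is why $P$ acts as the initial value of the regression.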
where $d_*(P)=d_*(P, \varPhi(P))=w_*^T\varPhi(P)$ for $* \in \{x,y,w,h\}$, $\varPhi(P)$ is the feature vector determined by $P$, $w_*$ is the weight vector to be learned, and $\exp(x)=e^x$.
Note that, unlike classification, which uses all the features on the feature map, bbox regression only uses the local features determined by $P$.
It is easy to get
$$
\begin{aligned}
d_x &= (\hat{G}_x-P_x)/P_w \\
d_y &= (\hat{G}_y-P_y)/P_h \\
d_w &= \log(\hat{G}_w/P_w) \\
d_h &= \log(\hat{G}_h/P_h)
\end{aligned}
$$
scale-invariant translation
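Feature extraction should be scale-invariant: the same object at different scales should yield the same output $d(P)$. In RCNN the scale of $P$ changes with the scale of the object, so the scale-invariant $d_x(P), d_y(P)$, multiplied by $P_w$ and $P_h$, still produce an accurate $\hat G$.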
log-space (width/height) translation
My guess is that working in log space keeps $\delta_w,\delta_h$ numerically close to $\delta_x,\delta_y$, so that the four terms contribute comparably to the loss.
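Two properties of this parameterization can be checked numerically. The sketch below is an illustration, not a proof of the guess above: it assumes center-format `(x, y, w, h)` boxes, and the helper names `offsets` and `scale` and the box values are invented for the example. It shows that the offsets do not change when the whole scene is enlarged, and that doubling and halving a side give log-space offsets of equal magnitude and opposite sign.

```python
import math

def offsets(p, g):
    """Offsets that move box p onto box g, both given as (x, y, w, h)."""
    px, py, pw, ph = p
    gx, gy, gw, gh = g
    return ((gx - px) / pw, (gy - py) / ph,
            math.log(gw / pw), math.log(gh / ph))

def scale(box, s):
    """Scale every coordinate of a box, as if the image were resized by s."""
    return tuple(s * v for v in box)

p = (100.0, 80.0, 40.0, 20.0)
g = (110.0, 75.0, 80.0, 10.0)   # twice as wide, half as tall as p

small = offsets(p, g)
large = offsets(scale(p, 4.0), scale(g, 4.0))   # same scene, enlarged 4x

print(small)
print(large)
```

The two printed tuples agree, and the width and height entries are `log(2)` and `-log(2)`, so growing and shrinking by the same factor are penalized symmetrically.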
optimize objective
$$
\begin{aligned}
w_* &= \operatorname*{argmin}_{\hat{w}_*} \sum_{i=1}^N L(\delta_*^i) + \lambda R(\hat{w}_*) \\
&= \operatorname*{argmin}_{\hat{w}_*} \sum_{i=1}^N L[t_*^i - d_*(P^i)] + \lambda R(\hat{w}_*) \\
&= \operatorname*{argmin}_{\hat{w}_*} \sum_{i=1}^N L[t_*^i - \hat{w}_*^T\varPhi(P^i)] + \lambda R(\hat{w}_*)
\end{aligned}
$$
where $L$ is the loss function and $R$ is the regularization function. In RCNN, $L$ is the squared error and $R$ is the squared L2 norm, i.e. ridge regression.
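With the squared error as $L$ and the squared L2 norm as $R$ (the ridge-regression choice of the RCNN paper), the objective has the closed-form solution $(\varPhi^T\varPhi + \lambda I)\,w_* = \varPhi^T t_*$. The sketch below solves that system in plain Python for one coordinate. The function name `ridge_fit`, the toy 2-dimensional features, and the weights are assumptions made up for the example; real features are high-dimensional CNN activations.

```python
def ridge_fit(features, targets, lam):
    """Solve (Phi^T Phi + lam * I) w = Phi^T t by Gauss-Jordan elimination."""
    d = len(features[0])
    # Augmented matrix [A | b] with A = Phi^T Phi + lam * I and b = Phi^T t.
    aug = []
    for r in range(d):
        row = [sum(f[r] * f[c] for f in features) + (lam if r == c else 0.0)
               for c in range(d)]
        row.append(sum(f[r] * t for f, t in zip(features, targets)))
        aug.append(row)
    for col in range(d):
        # Partial pivoting: bring the largest remaining entry onto the diagonal.
        pivot = max(range(col, d), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(d):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[r][d] / aug[r][r] for r in range(d)]

# Toy data: targets generated exactly from hypothetical weights (0.3, -0.1).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
true_w = [0.3, -0.1]
targets = [sum(w * f for w, f in zip(true_w, feat)) for feat in features]

w_plain = ridge_fit(features, targets, 0.0)      # no regularization
w_ridge = ridge_fit(features, targets, 1000.0)   # strong regularization
print(w_plain, w_ridge)
```

Without regularization the generating weights are recovered. With a large $\lambda$ the weights shrink toward zero, which keeps the predicted offsets small. One such weight vector is fitted separately for each of $x, y, w, h$.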
The regression targets $t_*$ for the training pair $(P,G)$ are defined as
$$
\begin{aligned}
t_x &= (G_x-P_x)/P_w \\
t_y &= (G_y-P_y)/P_h \\
t_w &= \log(G_w/P_w) \\
t_h &= \log(G_h/P_h)
\end{aligned}
$$
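The targets are exactly the offsets a perfect regressor would output: decoding them against the same proposal gives back the ground-truth box. A minimal round-trip sketch, assuming center-format `(x, y, w, h)` tuples; `encode`, `decode`, and the box values are illustrative names and numbers, not taken from any library.

```python
import math

def encode(p, g):
    """Regression targets (tx, ty, tw, th) for the training pair (p, g)."""
    px, py, pw, ph = p
    gx, gy, gw, gh = g
    return ((gx - px) / pw, (gy - py) / ph,
            math.log(gw / pw), math.log(gh / ph))

def decode(p, t):
    """Apply offsets t to proposal p; the inverse of encode for a fixed p."""
    px, py, pw, ph = p
    tx, ty, tw, th = t
    return (pw * tx + px, ph * ty + py, pw * math.exp(tw), ph * math.exp(th))

p = (100.0, 80.0, 40.0, 20.0)
g = (112.0, 70.0, 60.0, 30.0)

t = encode(p, g)
recovered = decode(p, t)
print(t)
print(recovered)  # matches g up to floating-point rounding
```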
It is easy to get
$$
\begin{aligned}
\delta_x &= (G_x-\hat{G}_x)/P_w \\
\delta_y &= (G_y-\hat{G}_y)/P_h \\
\delta_w &= \log(G_w/\hat{G}_w) \\
\delta_h &= \log(G_h/\hat{G}_h)
\end{aligned}
$$
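The step behind these residuals is a subtraction, $\delta_* = t_* - d_*(P)$, as used in the objective. Written out for $x$ and $w$ (the cases $y$ and $h$ are analogous):

$$
\begin{aligned}
\delta_x &= t_x - d_x = \frac{G_x - P_x}{P_w} - \frac{\hat{G}_x - P_x}{P_w} = \frac{G_x - \hat{G}_x}{P_w} \\
\delta_w &= t_w - d_w = \log\frac{G_w}{P_w} - \log\frac{\hat{G}_w}{P_w} = \log\frac{G_w}{\hat{G}_w}
\end{aligned}
$$

The proposal center cancels in the first line and the proposal width cancels inside the logarithm in the second, so the residual compares $G$ and $\hat G$ directly.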
…care must be taken when selecting which training pairs $(P,G)$ to use. Intuitively, if $P$ is far from all ground-truth boxes, then the task of transforming $P$ to a ground-truth box $G$ does not make sense.
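One way to make "far from all ground-truth boxes" concrete is an IoU threshold. Appendix C of the RCNN paper assigns a proposal to the ground-truth box with which it has maximum IoU, and only if that IoU exceeds 0.6. The sketch below follows that rule for center-format `(x, y, w, h)` boxes. The function names and the sample boxes are made up for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x, y, w, h), center format."""
    ax1, ay1, ax2, ay2 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1, bx2, by2 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def select_pairs(proposals, ground_truths, threshold=0.6):
    """Pair each proposal with its best-overlapping ground truth, if close enough."""
    pairs = []
    for p in proposals:
        best = max(ground_truths, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) > threshold:
            pairs.append((p, best))
    return pairs

# Hypothetical scene: one object, one nearby proposal, one distant proposal.
gts = [(50.0, 50.0, 40.0, 40.0)]
props = [(52.0, 50.0, 40.0, 40.0), (150.0, 150.0, 40.0, 40.0)]
pairs = select_pairs(props, gts)
print(pairs)  # only the nearby proposal is kept
```

Proposals that fail the threshold are simply dropped from the regression training set.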
Faster RCNN
Bbox regression in the RPN is a variant of the basic bbox regression described above.
Region Proposal Network (RPN)
This architecture is naturally implemented with an n×n convolutional layer followed by two sibling 1 × 1 convolutional layers (for reg and cls, respectively).
Translation-Invariant Anchors
Multi-Scale Anchors as Regression References
$P$ is equivalent to the anchor box here, so the anchors serve as the proposals/references.
$\varPhi(P)$ is the 1 x 1 x C' feature at the anchor position on the intermediate layer, and $w_*$ is a 1 x 1 x C' convolution kernel.
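The two ideas can be sketched together: the $k$ reference anchors at one position, and a 1 x 1 convolution evaluated there, which is one dot product per output channel. The default scales and aspect ratios below are the ones reported in the Faster RCNN paper (box areas of 128², 256², 512² pixels and ratios 1:2, 1:1, 2:1, giving k = 9). The function names, the toy C' = 3 feature, the kernel values, and the grouping of the 4k outputs by anchor are all assumptions for illustration.

```python
import math

def anchors_at(cx, cy, scales=(128.0, 256.0, 512.0), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) anchors (x, y, w, h) centered at (cx, cy)."""
    boxes = []
    for s in scales:
        for r in ratios:               # r is height / width
            w = s / math.sqrt(r)       # keeps the area at s * s
            h = s * math.sqrt(r)
            boxes.append((cx, cy, w, h))
    return boxes

def conv1x1(feature, kernels):
    """A 1x1 convolution at a single position: one dot product per output channel."""
    return [sum(w * f for w, f in zip(kernel, feature)) for kernel in kernels]

anchors = anchors_at(300.0, 200.0)
k = len(anchors)

feature = [0.5, -1.0, 2.0]                                     # toy feature, C' = 3
kernels = [[0.01 * (i + 1), 0.0, 0.0] for i in range(4 * k)]   # 4k made-up kernels
outputs = conv1x1(feature, kernels)

# Assumed layout: consecutive groups of four outputs belong to one anchor.
deltas = [tuple(outputs[4 * i: 4 * i + 4]) for i in range(k)]
print(k, deltas[0])
```

Each group of four outputs plays the role of $(d_x, d_y, d_w, d_h)$ for its anchor, so every anchor has its own regressor even though all of them read the same feature vector.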
SSD
Bbox regression in SSD is a simplified version of Faster RCNN bbox regression.
The main differences between them are:
- SSD removes the intermediate layer and uses a 3x3 convolution in the cls/reg layers. MobileNet-SSD uses a 1x1 convolution in the cls/reg layers.
- SSD only predicts k scores for foreground in the cls layer; background is also predicted but not used.
$P$ is called the prior box or default box, but it is actually equivalent to the anchor box.