Bounding-box regression in RCNN/Faster-RCNN/SSD

Latest recommended article published on 2023-05-13 13:18:14

TLP1993

Category: Object Detection. Tags: Object detection, bbox regression

Article link: https://blog.csdn.net/TLP1993/article/details/95658692

Bounding-box regression

Basics (RCNN)

This section mainly follows Appendix C of the RCNN paper.

Bbox regression is a machine-learning regression problem defined on bounding boxes.

input

$\{P^i, G^i\}_{i=1,2,\dots,N}$

where $P^i=(P^i_x,P^i_y,P^i_w,P^i_h)$ specifies the pixel coordinates of the center of proposal $P^i$'s bounding box together with $P^i$'s width and height in pixels.

Henceforth, we drop the superscript $i$ unless it is needed.
Each ground-truth bounding box $G$ is specified in the same way: $G=(G_x,G_y,G_w,G_h)$.

$P$ is the proposal bounding box. Because RCNN regresses on top of region proposals, $P$ naturally serves as the initial value of the regression. This can also be seen as an extension of the traditional sliding-window detection idea.

goal

Our goal is to learn a transformation that maps a proposed box $P$ to a ground-truth box $G$.

In other words, the unknown true model is $f: P \to G$; we seek $g: P \to \hat G$ such that $g \thickapprox f$.

model

We parameterize the transformation in terms of four functions $d_x(P), d_y(P), d_w(P), d_h(P)$. The first two specify a scale-invariant translation of the center of $P$'s bounding box, while the second two specify log-space translations of the width and height of $P$'s bounding box.

After learning these functions, we can transform an input proposal $P$ into a predicted ground-truth box $\hat G$ by applying the transformation

$\begin{aligned} \hat{G}_x &= P_w d_x(P) + P_x \\ \hat{G}_y &= P_h d_y(P) + P_y \\ \hat{G}_w &= P_w \exp(d_w(P)) \\ \hat{G}_h &= P_h \exp(d_h(P)) \\ \end{aligned}$

where $d_*(P)=d_*(P, \varPhi(P))=w_*^T\varPhi(P)$, $\varPhi(P)$ is the feature determined by $P$, $w_*$ is the weight vector to be learned, $* \in \{x,y,w,h\}$, and $\exp(x)=e^x$.

Note that, unlike classification, which uses all the features on the feature map, bbox regression uses only the local features determined by $P$.
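The transformation can be sketched in a few lines of Python. This is a minimal illustration, not code from the RCNN paper: the function name `apply_deltas` and the sample numbers are hypothetical, and boxes are assumed to be `(x_center, y_center, width, height)` tuples. The center offsets are scaled by the proposal size and added to the proposal center $P_x, P_y$; the size offsets are applied in log space.

```python
import math

def apply_deltas(P, d):
    """Map a proposal P = (x, y, w, h) to a predicted box using d = (dx, dy, dw, dh)."""
    Px, Py, Pw, Ph = P
    dx, dy, dw, dh = d
    Gx = Pw * dx + Px           # center shift, measured in units of proposal width
    Gy = Ph * dy + Py           # center shift, measured in units of proposal height
    Gw = Pw * math.exp(dw)      # log-space width change
    Gh = Ph * math.exp(dh)      # log-space height change
    return (Gx, Gy, Gw, Gh)

# Hypothetical proposal and deltas: shift right by a quarter width,
# up by half a height, keep the width, double the height.
pred = apply_deltas((100.0, 80.0, 40.0, 20.0), (0.25, -0.5, 0.0, math.log(2.0)))
print(pred)  # approximately (110, 70, 40, 40)
```

Because the width and height go through `exp`, the predicted size is always positive regardless of the sign of `dw` and `dh`.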

Solving the transformation for $d_*$ gives

$\begin{aligned} d_x &=(\hat{G}_x-P_x)/P_w \\ d_y &=(\hat{G}_y-P_y)/P_h \\ d_w &=\log(\hat{G}_w/P_w) \\ d_h &=\log(\hat{G}_h/P_h) \\ \end{aligned}$

scale-invariant translation

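Feature extraction should be scale invariant: the same object at different scales should yield the same $d_*(P)$. Since the scale of $P$ changes with the scale of the object (in RCNN), the scale-invariant $d_x(P), d_y(P)$ still yield an accurate $\hat G$.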
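As a quick check of this invariance (a worked example added for illustration, assuming the proposal and the predicted box are both scaled by the same factor $s>0$ about the image origin), the factor cancels in the deltas:

$\begin{aligned} d_x' &= (s\hat{G}_x - sP_x)/(sP_w) = (\hat{G}_x-P_x)/P_w = d_x \\ d_w' &= \log\big(s\hat{G}_w/(sP_w)\big) = \log(\hat{G}_w/P_w) = d_w \\ \end{aligned}$

The same cancellation holds for $d_y$ and $d_h$. Without the division by $P_w$ and $P_h$, the center offsets would grow linearly with $s$.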

log-space (width/height) translation

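A guess: the log-space parameterization keeps $\delta_w,\delta_h$ numerically close to $\delta_x,\delta_y$, so that they contribute comparably to the loss. The exponential also guarantees that the predicted width and height are positive.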

optimization objective

$\begin{aligned} w_* &= \arg\min_{\hat{w}_*} \sum_{i=1}^N L(\delta_*^i) + \lambda R(\hat {w}_*)\\ &= \arg\min_{\hat{w}_*} \sum_{i=1}^N L[t_*^i - d_*(P^i)] + \lambda R(\hat {w}_*) \\ &= \arg\min_{\hat{w}_*} \sum_{i=1}^N L[t_*^i - \hat {w}_*^T\varPhi(P^i)] + \lambda R(\hat {w}_*) \\ \end{aligned}$

where $L$ is the loss function and $R$ is the regularization function. In RCNN, $L$ is the squared error and $R(\hat{w}_*)=\|\hat{w}_*\|^2$, i.e. ridge regression.
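The objective can be made concrete with a small pure-Python sketch. It assumes the squared-error loss and squared L2 regularizer (the ridge-regression choice), and it minimizes with plain gradient descent purely for illustration; the toy features, targets, `lam`, step size, and function names are all hypothetical and not taken from any paper.

```python
def ridge_objective(w, features, targets, lam):
    """Sum over i of (t_i - w^T phi_i)^2, plus lam * ||w||^2."""
    data_term = 0.0
    for phi, t in zip(features, targets):
        pred = sum(wj * pj for wj, pj in zip(w, phi))
        data_term += (t - pred) ** 2
    return data_term + lam * sum(wj * wj for wj in w)

def ridge_gradient(w, features, targets, lam):
    """Gradient of ridge_objective with respect to w."""
    grad = [2.0 * lam * wj for wj in w]
    for phi, t in zip(features, targets):
        residual = t - sum(wj * pj for wj, pj in zip(w, phi))
        for j, pj in enumerate(phi):
            grad[j] -= 2.0 * residual * pj
    return grad

# Toy problem: three proposals with 2-d features and one target coordinate each.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
targets = [0.2, -0.1, 0.1]
lam = 0.1
w = [0.0, 0.0]

start = ridge_objective(w, features, targets, lam)
for _ in range(200):
    g = ridge_gradient(w, features, targets, lam)
    w = [wj - 0.05 * gj for wj, gj in zip(w, g)]
end = ridge_objective(w, features, targets, lam)
final_grad = ridge_gradient(w, features, targets, lam)
print(start, end)  # the objective decreases from its starting value
```

One such problem is solved independently for each of the four coordinates $* \in \{x,y,w,h\}$. Since the objective is quadratic, a closed-form solve would work equally well.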

The regression targets $t_*$ for the training pair $(P, G)$ are defined as

$\begin{aligned} t_x &=(G_x-P_x)/P_w \\ t_y &=(G_y-P_y)/P_h \\ t_w &=\log(G_w/P_w) \\ t_h &=\log(G_h/P_h) \\ \end{aligned}$

Since $\delta_* = t_* - d_*(P)$, it follows that

$\begin{aligned} \delta_x &=(G_x-\hat{G}_x)/P_w \\ \delta_y &=(G_y-\hat{G}_y)/P_h \\ \delta_w &=\log(G_w/\hat{G}_w) \\ \delta_h &=\log(G_h/\hat{G}_h) \\ \end{aligned}$
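A short numerical check of these formulas, as a sketch with hypothetical boxes in `(x_center, y_center, width, height)` form. The helper `encode` is an illustrative name: it computes $t_*$ from $(P, G)$ and, applied to $(P, \hat G)$, the deltas $d_*$ that would produce $\hat G$. The residual $t_* - d_*$ should then match the direct expression for $\delta_*$.

```python
import math

def encode(P, B):
    """Offsets that map proposal P onto box B, both given as (x, y, w, h)."""
    Px, Py, Pw, Ph = P
    Bx, By, Bw, Bh = B
    return ((Bx - Px) / Pw, (By - Py) / Ph, math.log(Bw / Pw), math.log(Bh / Ph))

P = (100.0, 80.0, 40.0, 20.0)        # proposal
G = (110.0, 70.0, 50.0, 30.0)        # ground truth
G_hat = (108.0, 72.0, 44.0, 25.0)    # hypothetical prediction

t = encode(P, G)        # regression targets
d = encode(P, G_hat)    # deltas corresponding to the prediction
residual = tuple(ti - di for ti, di in zip(t, d))

# Direct form of delta, which no longer involves the proposal center.
direct = (
    (G[0] - G_hat[0]) / P[2],
    (G[1] - G_hat[1]) / P[3],
    math.log(G[2] / G_hat[2]),
    math.log(G[3] / G_hat[3]),
)
agree = all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(residual, direct))
print(agree)
```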

…care must be taken when selecting which training pairs $(P, G)$ to use. Intuitively, if $P$ is far from all ground-truth boxes, then the task of transforming $P$ to a ground-truth box $G$ does not make sense.
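One way to implement this "nearness" criterion is to pair each proposal with its maximum-IoU ground-truth box and keep the pair only if the overlap exceeds a threshold. The sketch below assumes boxes in `(x_center, y_center, width, height)` form; the function names and sample boxes are hypothetical, and the default threshold of 0.6 is the value the RCNN paper reports choosing on a validation set.

```python
def iou(a, b):
    """Intersection over union of two (x_center, y_center, w, h) boxes."""
    ax1, ay1, ax2, ay2 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1, bx2, by2 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def select_pairs(proposals, ground_truths, threshold=0.6):
    """Keep (P, G) only when G is P's best match and IoU(P, G) > threshold."""
    pairs = []
    for P in proposals:
        best = max(ground_truths, key=lambda G: iou(P, G), default=None)
        if best is not None and iou(P, best) > threshold:
            pairs.append((P, best))
    return pairs

ground_truths = [(50.0, 50.0, 40.0, 40.0)]
proposals = [
    (52.0, 50.0, 40.0, 40.0),   # heavily overlaps the ground truth
    (90.0, 90.0, 40.0, 40.0),   # does not overlap at all
]
pairs = select_pairs(proposals, ground_truths)
print(len(pairs))  # only the first proposal is kept
```

Proposals that fail the test are simply not used to train the regressor.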

Faster RCNN

Bbox regression in the RPN is a variant of the basic bbox regression described above.

Region Proposal Network (RPN)

This architecture is naturally implemented with an n×n convolutional layer followed by two sibling 1 × 1 convolutional layers (for reg and cls, respectively).

Translation-Invariant Anchors

Multi-Scale Anchors as Regression References

Here $P$ is the anchor box, so the anchors play the role of proposals, i.e. regression references.

$\varPhi (P)$ is the 1 x 1 x C' feature at the anchor position on the intermediate layer, and $w_*$ is a 1 x 1 x C' convolution kernel.
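To make the role of anchors as references $P$ concrete, the sketch below enumerates anchors in `(x_center, y_center, width, height)` form. The three scales (128, 256, 512) and three aspect ratios (1:2, 1:1, 2:1) follow the defaults described in the Faster RCNN paper; the stride of 16, the cell-center convention, and the function names are assumptions made for illustration only.

```python
import math

def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """k = len(scales) * len(ratios) anchors centered at (cx, cy)."""
    boxes = []
    for s in scales:
        for r in ratios:                  # r = height / width
            w = s / math.sqrt(r)          # keeps the area at s * s
            h = s * math.sqrt(r)
            boxes.append((cx, cy, w, h))
    return boxes

def anchor_grid(feat_h, feat_w, stride=16):
    """Anchors for every position of a feat_h x feat_w feature map."""
    all_anchors = []
    for i in range(feat_h):
        for j in range(feat_w):
            # Assumed convention: anchors sit at the center of each stride cell.
            cx = (j + 0.5) * stride
            cy = (i + 0.5) * stride
            all_anchors.extend(anchors_at(cx, cy))
    return all_anchors

anchors = anchor_grid(2, 3)
print(len(anchors))  # 2 * 3 positions times 9 anchors each
```

With k anchors per position, the reg layer outputs 4k values per position: one $(d_x, d_y, d_w, d_h)$ for each anchor, each computed from the same feature $\varPhi(P)$ with its own $w_*$.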

SSD

Bbox regression in SSD is a simplified version of Faster RCNN bbox regression.

The main differences between them are:

SSD removes the intermediate layer and uses a 3x3 convolution in the cls/reg layers. MobileNet-SSD uses a 1x1 convolution in the cls/reg layers.
SSD only predicts k foreground scores in the cls layer; the background is also predicted but not used.

$P$ is called a prior box or default box, but it is actually equivalent to an anchor box.
