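Overview
In object detection, object scales within an image vary widely, and earlier detectors tend to be extremely sensitive to object scale, so their accuracy on small objects is poor. There are two main causes: (1) classic RoI pooling destroys the structure of small objects; (2) the large intra-class distance caused by the large variance in object scale exceeds the representation capability of a single network.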
Main Ideas
Context-aware RoI Pooling
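The purpose of RoI pooling is to give the downstream network an input of fixed size. Because RoIs come in many different sizes, the actual processing involves a number of details. For example, to convert the $4\times6$ RoI in the figure below into a fixed $3\times3$ output, we discard the last row of the RoI and, along the column direction, keep the maximum of every two elements.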
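The $4\times6 \to 3\times3$ example can be sketched in a few lines of C++. This is only an illustration of the integer-quantized variant described in the text (leftover rows are dropped); the function name `roi_max_pool_quantized` and the `Matrix` alias are hypothetical, and the sketch handles a single channel.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

using Matrix = std::vector<std::vector<float>>;

// Max-pool an RoI to out_h x out_w using integer bin sizes.
// Rows/columns left over after the integer division are dropped,
// which matches the variant described in the text (an assumption,
// not the only way RoI pooling is implemented).
Matrix roi_max_pool_quantized(const Matrix& roi, int out_h, int out_w) {
    assert(out_h > 0 && out_w > 0 && !roi.empty() && !roi[0].empty());
    const int in_h = static_cast<int>(roi.size());
    const int in_w = static_cast<int>(roi[0].size());
    assert(in_h >= out_h && in_w >= out_w);  // only the "large RoI" case
    const int bin_h = in_h / out_h;  // 4 / 3 = 1, so the last row is discarded
    const int bin_w = in_w / out_w;  // 6 / 3 = 2
    Matrix out(out_h, std::vector<float>(out_w));
    for (int ph = 0; ph < out_h; ++ph) {
        for (int pw = 0; pw < out_w; ++pw) {
            float best = roi[ph * bin_h][pw * bin_w];
            for (int h = ph * bin_h; h < (ph + 1) * bin_h; ++h) {
                for (int w = pw * bin_w; w < (pw + 1) * bin_w; ++w) {
                    best = std::max(best, roi[h][w]);
                }
            }
            out[ph][pw] = best;
        }
    }
    return out;
}
```

Calling `roi_max_pool_quantized(roi, 3, 3)` on a 4-row, 6-column matrix returns a $3\times3$ matrix in which each entry is the maximum of two horizontally adjacent inputs and the fourth row never contributes.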
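When both the width and the height of the RoI are smaller than the target size, classic RoI pooling falls back to the simple replication shown in (c) of the figure below to fill the output. This clearly destroys the spatial structure of small objects. Context-aware RoI Pooling (CARoI pooling) instead splits the strategy into three cases:
- When both the width and the height of the RoI are larger than the target size, it keeps the same strategy as classic RoI pooling.
- When both the width and the height of the RoI are smaller than the target size, it uses bilinear interpolation, where the kernel size equals the ratio of the target size to the RoI size.
- When the width of the RoI is larger than the target size but the height is smaller, bilinear interpolation is applied to the height only.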
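The three cases can be folded into one per-axis rule: enlarge an axis by interpolation only if it is shorter than the target, then max-pool. The following is a minimal single-channel sketch under stated assumptions, not the authors' implementation: `resize_bilinear` and `caroi_pool` are hypothetical names, and the half-pixel coordinate mapping is one common bilinear convention that is not guaranteed to match `cv::resize` exactly.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

using Matrix = std::vector<std::vector<float>>;

// Bilinear resize with half-pixel centre alignment and edge clamping
// (assumed convention for this sketch).
Matrix resize_bilinear(const Matrix& src, int out_h, int out_w) {
    const int in_h = static_cast<int>(src.size());
    const int in_w = static_cast<int>(src[0].size());
    Matrix dst(out_h, std::vector<float>(out_w));
    for (int y = 0; y < out_h; ++y) {
        float fy = (y + 0.5f) * in_h / out_h - 0.5f;
        fy = std::min(std::max(fy, 0.0f), static_cast<float>(in_h - 1));
        const int y0 = static_cast<int>(fy);
        const int y1 = std::min(y0 + 1, in_h - 1);
        const float wy = fy - y0;
        for (int x = 0; x < out_w; ++x) {
            float fx = (x + 0.5f) * in_w / out_w - 0.5f;
            fx = std::min(std::max(fx, 0.0f), static_cast<float>(in_w - 1));
            const int x0 = static_cast<int>(fx);
            const int x1 = std::min(x0 + 1, in_w - 1);
            const float wx = fx - x0;
            const float top = src[y0][x0] * (1 - wx) + src[y0][x1] * wx;
            const float bot = src[y1][x0] * (1 - wx) + src[y1][x1] * wx;
            dst[y][x] = top * (1 - wy) + bot * wy;
        }
    }
    return dst;
}

// CARoI-style pooling: interpolate only the axes that are too short,
// then max-pool with floor/ceil bin boundaries.
Matrix caroi_pool(const Matrix& roi, int out_h, int out_w) {
    assert(!roi.empty() && !roi[0].empty() && out_h > 0 && out_w > 0);
    const int in_h = static_cast<int>(roi.size());
    const int in_w = static_cast<int>(roi[0].size());
    const int work_h = std::max(in_h, out_h);  // enlarged only if in_h < out_h
    const int work_w = std::max(in_w, out_w);  // enlarged only if in_w < out_w
    const Matrix feat = (work_h == in_h && work_w == in_w)
                            ? roi
                            : resize_bilinear(roi, work_h, work_w);
    Matrix out(out_h, std::vector<float>(out_w));
    for (int ph = 0; ph < out_h; ++ph) {
        const int h0 = (ph * work_h) / out_h;                    // floor
        const int h1 = ((ph + 1) * work_h + out_h - 1) / out_h;  // ceil
        for (int pw = 0; pw < out_w; ++pw) {
            const int w0 = (pw * work_w) / out_w;
            const int w1 = ((pw + 1) * work_w + out_w - 1) / out_w;
            float best = feat[h0][w0];
            for (int h = h0; h < h1; ++h)
                for (int w = w0; w < w1; ++w)
                    best = std::max(best, feat[h][w]);
            out[ph][pw] = best;
        }
    }
    return out;
}
```

With this sketch, a 1-row RoI holding the values 0 and 4 pooled to width 4 yields a smooth ramp rather than the duplicated values that replication would produce, which is the structural difference the text describes.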
Expressed mathematically:
$$y_k^j = x_{i^*}, \qquad i^* = \arg\max_{i \in R(k,j)} x_i, \qquad x_i \in (X_k \otimes \sigma_k)$$
Here $y_k^j$ is the $j$-th output of the CARoI pooling layer for the $k$-th proposal, $R(k,j)$ is the index set of the sub-window over which $y_k^j$ takes the maximum, $x_i$ is the $i$-th value on the feature map, $X_k$ is the $k$-th proposal, $\otimes$ is the deconvolution operation, and $\sigma_k$ is the deconvolution kernel. When the RoI is larger than the target size, $\sigma_k=1$, which means the deconvolution has no effect.
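Part of the code is shown below; see the source repository [4] for the full implementation. The idea is simple: the code compares the RoI's width and height with the target size in turn to decide whether bilinear interpolation is needed. For example, when the RoI's height is larger than the target size but its width is smaller, the code sets the size pair for the upcoming resize so that the height equals the RoI height and the width equals the target size, then calls cv::resize(ori_roi_feature,enlarge_roi_feature,cv_enlarge_size,0,0,cv::INTER_LINEAR), where cv::INTER_LINEAR selects bilinear interpolation. In effect only the width is interpolated; the height is then handled in the same way as classic RoI pooling.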
template <typename Dtype>
void CAROIPoolingLayer<Dtype>::Forward_cpu(const vector<Blob<Dtype>*>& bottom,
    const vector<Blob<Dtype>*>& top) {
  const Dtype* bottom_data = bottom[0]->cpu_data();
  const Dtype* bottom_rois = bottom[1]->cpu_data();
  // Number of small ROIs (the size on the feature map is smaller than the bin)
  int num_small_rois = 0;
  // Number of ROIs
  int num_rois = bottom[1]->num();
  int batch_size = bottom[0]->num();
  int top_count = top[0]->count();
  Dtype* top_data = top[0]->mutable_cpu_data();
  caffe_set(top_count, Dtype(-FLT_MAX), top_data);
  int* argmax_data = max_idx_.mutable_cpu_data();
  caffe_set(top_count, -1, argmax_data);

  // For each ROI R = [batch_index x1 y1 x2 y2]: max pool over R
  for (int n = 0; n < num_rois; ++n) {
    int roi_batch_ind = bottom_rois[0];
    CHECK_GE(roi_batch_ind, 0);
    CHECK_LT(roi_batch_ind, batch_size);

    // Padding: grow the ROI by pad_ratio_ of its size on each side
    Dtype pad_w, pad_h;
    pad_w = (bottom_rois[3] - bottom_rois[1] + 1) * pad_ratio_;
    pad_h = (bottom_rois[4] - bottom_rois[2] + 1) * pad_ratio_;
    int roi_start_w = round((bottom_rois[1] - pad_w) * spatial_scale_);
    int roi_start_h = round((bottom_rois[2] - pad_h) * spatial_scale_);
    int roi_end_w = round((bottom_rois[3] + pad_w) * spatial_scale_);
    int roi_end_h = round((bottom_rois[4] + pad_h) * spatial_scale_);

    // Clipping
    roi_start_w = max(roi_start_w, 0);
    roi_start_h = max(roi_start_h, 0);
    int img_width = round(width_ / spatial_scale_);
    int img_height = round(height_ / spatial_scale_);
    roi_end_w = min(img_width - 1, roi_end_w);
    roi_end_h = min(img_height - 1, roi_end_h);

    int roi_height = max(roi_end_h - roi_start_h + 1, 1);
    int roi_width = max(roi_end_w - roi_start_w + 1, 1);

    // Get the batch data at the correct image (first dim)
    const Dtype* batch_data = bottom_data + bottom[0]->offset(roi_batch_ind);

    const float bin_size_h_float = (float)roi_height / (float)pooled_height_;
    const float bin_size_w_float = (float)roi_width / (float)pooled_width_;

    // CUHK HU XIAOWEI (2016.11): deal with small ROIs
    if (bin_size_h_float < 1 || bin_size_w_float < 1)
    {
      if (bin_size_h_float > 1)  // height is large, so only the width is enlarged
      {
        num_small_rois++;
        int enlarge_pad_h = 0;  // enlarge_pad = (multiplier - 1)
        int enlarge_pad_w = 0;
        cv::Mat ori_roi_feature(roi_height, roi_width, CV_32F);
        cv::Size cv_enlarge_size;
        cv_enlarge_size.height = roi_height;  // height is kept (no padding)
        cv_enlarge_size.width = pooled_width_ + enlarge_pad_w * 2;
        cv::Mat enlarge_roi_feature(cv_enlarge_size, CV_32F);  // enlarged feature map

        for (int c = 0; c < channels_; ++c)  // index of the current channel
        {
          // Copy the ROI region of this channel into a single-channel cv::Mat
          for (int i = 0, ori_roi_h = roi_start_h; i < roi_height; i++, ori_roi_h++)
          {
            // Clamp to the border (nearest neighbour if the ROI exceeds it)
            int h = min(max(ori_roi_h, 0), height_ - 1);
            for (int j = 0, ori_roi_w = roi_start_w; j < roi_width; j++, ori_roi_w++)
            {
              // Clamp to the border (nearest neighbour if the ROI exceeds it)
              int w = min(max(ori_roi_w, 0), width_ - 1);
              const int index = h * width_ + w;
              ori_roi_feature.at<float>(i, j) = static_cast<float>(batch_data[index]);
            }
          }

          // Bilinear interpolation to the enlarged size
          cv::resize(ori_roi_feature, enlarge_roi_feature, cv_enlarge_size, 0, 0, cv::INTER_LINEAR);

          const Dtype bin_size_h = static_cast<Dtype>(bin_size_h_float);
          const Dtype bin_size_w = static_cast<Dtype>(1.0);  // bin size for the enlarged width is 1

          for (int ph = 0; ph < pooled_height_; ++ph) {
            for (int pw = 0; pw < pooled_width_; ++pw) {
              int hstart = static_cast<int>(floor(static_cast<Dtype>(ph) * bin_size_h));
              int wstart = static_cast<int>(floor(static_cast<Dtype>(pw) * bin_size_w));
              int hend = static_cast<int>(ceil(static_cast<Dtype>(ph + 1) * bin_size_h));
              int wend = static_cast<int>(ceil(static_cast<Dtype>(pw + 1) * bin_size_w));

              hstart = min(max(hstart + enlarge_pad_h, 0), height_);
              hend = min(max(hend + enlarge_pad_h, 0), height_);
              wstart = min(max(wstart + enlarge_pad_w, 0), width_);
              wend = min(max(wend + enlarge_pad_w, 0), width_);

              bool is_empty = (hend <= hstart) || (wend <= wstart);

              const int pool_index = ph * pooled_width_ + pw;
              if (is_empty) {
                top_data[pool_index] = 0;
                argmax_data[pool_index] = -1;
              }

              // Max over the bin, read from the enlarged feature map
              for (int h = hstart; h < hend; ++h) {
                for (int w = wstart; w < wend; ++w) {
                  const int index = h * cv_enlarge_size.width + w;
                  if (static_cast<Dtype>(enlarge_roi_feature.at<float>(h, w)) > top_data[pool_index]) {
                    top_data[pool_index] = static_cast<Dtype>(enlarge_roi_feature.at<float>(h, w));
                    argmax_data[pool_index] = index;  // index on the enlarged feature map
                  }
                }
              }
            }
          }
          // Increment all data pointers by one channel
          batch_data += bottom[0]->offset(0, 1);
          top_data += top[0]->offset(0, 1);
          argmax_data += max_idx_.offset(0, 1);
        }
      }
      else if (bin_size_w_float > 1)  // width is large
      //................
      //................
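The floor/ceil bin boundaries used in the pooling loop of the listing can be isolated into a small helper to see which input rows each output bin covers. This is a simplified sketch for one axis; `pooling_bins` is a hypothetical name, and the clamp here uses the RoI length, whereas the listing clamps against the feature-map size.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// Half-open [start, end) bin ranges along one axis, following the
// floor/ceil rule of the pooling loop. Simplified: clamps to roi_len.
std::vector<std::pair<int, int>> pooling_bins(int roi_len, int pooled_len) {
    assert(roi_len > 0 && pooled_len > 0);
    const float bin_size =
        static_cast<float>(roi_len) / static_cast<float>(pooled_len);
    std::vector<std::pair<int, int>> bins;
    for (int p = 0; p < pooled_len; ++p) {
        int start = static_cast<int>(std::floor(p * bin_size));
        int end = static_cast<int>(std::ceil((p + 1) * bin_size));
        start = std::min(std::max(start, 0), roi_len);
        end = std::min(std::max(end, 0), roi_len);
        bins.emplace_back(start, end);
    }
    return bins;
}
```

For example, `pooling_bins(4, 3)` gives the ranges [0, 2), [1, 3) and [2, 4): neighbouring bins overlap, and every input row is covered. When `roi_len` is smaller than `pooled_len`, several bins map to the same input element, which is the replication effect that CARoI pooling avoids by interpolating first.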
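In addition, SINet borrows the idea of FPN: the RoI regions from feature layers at different depths are concatenated and used as the input to the downstream network, as shown in the left half of the figure below.
Multi-branch Decision Network
The multi-branch decision network addresses the limited representation capability of a single network when it faces objects of different scales. The idea is simple: use separate networks to predict objects of different scales. The number of branches is chosen empirically from the scale distribution of the dataset and the available compute. To build a two-branch decision network, the median object scale in the dataset serves as the reference threshold that decides whether a proposal goes to the large-object branch or the small-object branch. During training, data augmentation is needed so that both branches see some samples near the median. In addition, the reference threshold changes dynamically at every training iteration; this variation is modeled with a Gaussian whose mean is the median of all object scales. As a result, over the whole training process, proposals whose scale is close to the median have a chance of being assigned to either the large-object branch or the small-object branch. At test time, proposals are simply split by the median.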
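The threshold scheme can be sketched as follows. This is an illustration under assumptions rather than the authors' code: the names `median_scale`, `sample_threshold`, `assign_branch` and `Branch` are hypothetical, the scale measure (for instance box height) is not specified in the text, and the standard deviation `sigma` is a free parameter the text does not give.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

enum class Branch { Small, Large };

// Median of the object scales in the dataset (sorts a local copy).
double median_scale(std::vector<double> scales) {
    assert(!scales.empty());
    std::sort(scales.begin(), scales.end());
    const std::size_t n = scales.size();
    return n % 2 ? scales[n / 2]
                 : 0.5 * (scales[n / 2 - 1] + scales[n / 2]);
}

// Training: draw this iteration's threshold from N(median, sigma^2).
// sigma is a hypothetical hyperparameter, not taken from the source.
double sample_threshold(double median, double sigma, std::mt19937& rng) {
    assert(sigma > 0.0);
    std::normal_distribution<double> dist(median, sigma);
    return dist(rng);
}

// Route a proposal to a branch by comparing its scale to the threshold.
// At test time the threshold is simply the median.
Branch assign_branch(double proposal_scale, double threshold) {
    return proposal_scale < threshold ? Branch::Small : Branch::Large;
}
```

During training one would call `sample_threshold` once per iteration and pass the result to `assign_branch`, so a proposal whose scale is near the median lands in the small branch on some iterations and the large branch on others. At test time `assign_branch(scale, median)` is used directly.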
References
[1] Hu X, Xu X, Xiao Y, et al. SINet: A scale-insensitive convolutional neural network for fast vehicle detection[J]. IEEE Transactions on Intelligent Transportation Systems, 2018, 20(3): 1010-1019.
[2] ROI Pooling的理解 (Understanding ROI Pooling)
https://zhuanlan.zhihu.com/p/62050478
[3] Understanding Region of Interest (RoI Pooling)
https://towardsdatascience.com/understanding-region-of-interest-part-1-roi-pooling-e4f5dd65bb44
[4] Source code
https://github.com/xw-hu/SINet