RoI: Region of Interest Projection and Pooling

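RoI pooling is a technique introduced in Fast R-CNN that converts region proposals of different shapes to a fixed size so they can be passed through fully connected layers. It processes its input with max pooling, but the original version suffers from information loss. RoI Align improves the sampling process with bilinear interpolation and reduces the loss of precision. The technique plays a key role in object detection: it combines the CNN feature map with the region proposals, which then go through fully connected layers for classification and bounding-box regression.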

RoI projection and RoI pooling are techniques (layers) introduced in the Fast R-CNN paper: https://arxiv.org/abs/1504.08083

For more on RCNN and F-RCNN, see:

RCNN and Variants (EverNoob's blog on CSDN)

"RoI" by itself refers to a region proposal, i.e. a candidate bounding box (for object detection).

Here is an easy-to-read intro:

Understanding Region of Interest (RoI Pooling) - Blog by Kemal Erdem

==> in short, RoI projection shrinks each RoI to match the feature map produced by the CNN (feature extraction);

==> RoI pooling then standardizes the size of the shrunken RoIs so they can be passed through the FC layers;

===> the base version of RoI pooling suffers heavy information loss due to quantization in both steps, which motivated RoI_Align and RoI_Warp:

Understanding Region of Interest - Part 2 (RoI Align) - Blog by Kemal Erdem

==> in short, the two enhanced RoI methods combine projection and pooling into one process and apply bilinear interpolation so that more data points contribute during sampling

===> we see that overall, RoI_Align provides a much better average precision in results
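To make the bilinear interpolation step concrete, here is a minimal sketch of sampling a feature map at a fractional location. It is not taken from any of the linked articles: the function name bilinear_sample, the list-of-rows feature map and the clamping at the border are assumptions made for illustration.

```python
import math

def bilinear_sample(fmap, y, x):
    """Sample a 2-D feature map (list of rows) at a fractional (y, x) location."""
    h, w = len(fmap), len(fmap[0])
    # Clamp the sampling point into the map (an assumption of this sketch).
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    # The four integer grid points surrounding (y, x).
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    # Interpolate along x on the two rows, then along y between them.
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

fmap = [[0.0, 1.0],
        [2.0, 3.0]]
center = bilinear_sample(fmap, 0.5, 0.5)  # all four cells contribute equally
print(center)
```

Because the sampling point never has to be snapped to an integer cell, no coordinate is rounded away, which is the property the RoI_Align articles emphasize.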

Another brief article with better visualizations and more technical detail:

https://towardsdatascience.com/region-of-interest-pooling-f7c637f409af

The major hurdle in going from image classification to object detection is the network's fixed-size input requirement, which comes from its fully connected layers. In object detection, each proposal has a different shape, so all proposals must be converted to the fixed shape the fully connected layers require. ROI pooling does exactly this.

The entire image is fed to a CNN, and the RoIs are located on the resulting feature maps. Each region is extracted by a RoI pooling layer and passed to fully connected layers. The resulting feature vector is used by a softmax classifier to classify the object and by a linear regressor to refine the coordinates of the bounding box. Source: J. Xu’s Blog

ROI pooling produces fixed-size feature maps from non-uniform inputs by max-pooling them. The number of output channels equals the number of input channels for this layer. The ROI pooling layer takes two inputs:

  • A feature map obtained from a Convolutional Neural Network after multiple convolutions and pooling layers.
  • ‘N’ proposals, or Regions of Interest, from the region proposal network. Each proposal has five values: the first is an index (typically the index of the image within the batch) and the remaining four are the proposal coordinates, generally the top-left and bottom-right corners of the proposal.

ROI pooling takes every ROI from the input, extracts the section of the input feature map that corresponds to that ROI, and converts that section into a fixed-dimension map. The output dimension for every ROI depends neither on the input feature map nor on the proposal size; it depends solely on the layer parameters.

Layer Parameters: pooled_width, pooled_height, spatial scale.

Pooled_width and pooled_height are hyperparameters that can be chosen based on the problem at hand. They give the number of grid cells into which the feature-map section corresponding to the proposal is divided, and this is also the output dimension of the layer. Let W, H be the width and height of the proposal and P_w, P_h the pooled width and height. Then the ROI is divided into P_w*P_h blocks, each of dimensions (W/P_w, H/P_h).

==> we then pool each block by taking its max and record the value in the corresponding cell of the RoI pooling layer's output;

==> the essence is sampling and then choosing the max, which, when the block boundaries are forced onto integer cells (quantization), inevitably leads to information and precision loss.

bin_size_h = roi_height/pooled_height;
bin_size_w = roi_width/pooled_width;
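When the bin size is fractional, the block boundaries still have to land on integer cells. The sketch below shows one way to do that, rounding each block's start down and its end up; that rounding rule, the helper name bin_edges and the 7x5 example region are assumptions for illustration, and other implementations round both edges down instead.

```python
import math

def bin_edges(roi_len, pooled_len):
    """Split roi_len cells into pooled_len bins; returns (start, end) with end exclusive."""
    bin_size = roi_len / pooled_len  # fractional, like bin_size_h / bin_size_w
    edges = []
    for p in range(pooled_len):
        start = int(math.floor(p * bin_size))     # round the start down
        end = int(math.ceil((p + 1) * bin_size))  # round the end up
        edges.append((start, min(end, roi_len)))
    return edges

rows = bin_edges(5, 2)  # roi_height = 5, pooled_height = 2 -> bin_size_h = 2.5
cols = bin_edges(7, 2)  # roi_width  = 7, pooled_width  = 2 -> bin_size_w = 3.5
print(rows, cols)
```

With this rule the two row bins are (0, 3) and (2, 5), so row 2 is shared by both blocks. Rounding both edges down would instead give non-overlapping blocks of unequal size. Either way the blocks no longer match the ideal 2.5 x 3.5 size, which is the quantization error discussed here.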

Spatial scale is a scaling parameter for resizing the proposal to the feature map dimensions. Say the image size in our network is 1056x640 and, after many convolution and pooling operations, the feature map used by ROI pooling has been reduced to 66x40. The proposals are generated in input-image coordinates, so we need to rescale them to the feature map size ==> i.e. RoI projection. In this case we can divide all dimensions of the proposal by 16 (1056/66=16 and 640/40=16), so the spatial scale is 1/16 in our example.

int roi_start_w = round(bottom_rois[1] * spatial_scale_);    
int roi_start_h = round(bottom_rois[2] * spatial_scale_);    
int roi_end_w = round(bottom_rois[3] * spatial_scale_);    
int roi_end_h = round(bottom_rois[4] * spatial_scale_);
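The same projection can be sketched in Python for the 1056x640 example. The proposal coordinates below are made up, the batch index (the first of the five values) is left out, and round_half_up is a hypothetical helper: Python's built-in round() rounds halves to the nearest even number, whereas the C round() used in the snippet rounds them away from zero.

```python
import math

def round_half_up(v):
    """Round like C's round() for non-negative values."""
    return int(math.floor(v + 0.5))

def project_roi(roi, spatial_scale):
    """Map (x1, y1, x2, y2) in image pixels onto integer feature-map coordinates."""
    return tuple(round_half_up(c * spatial_scale) for c in roi)

spatial_scale = 66 / 1056          # = 1/16, from the 1056x640 -> 66x40 example
roi_image = (200, 100, 520, 340)   # hypothetical proposal in image pixels
roi_fmap = project_roi(roi_image, spatial_scale)
print(roi_fmap)
```

The exact values would be (12.5, 6.25, 32.5, 21.25). The rounding to whole feature-map cells is the first of the two quantization steps.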

Now that we have a clear understanding of each parameter, let us see how ROI pooling works. For every proposal in the input, we take the corresponding feature map section and divide it into P_w*P_h blocks, as defined by the layer parameters. We then take the maximum element of each block and copy it to the output. The output size is therefore P_w*P_h (per channel) for every ROI proposal and N*P_w*P_h for all N proposals, a fixed-dimension feature map regardless of the sizes of the input proposals.

Scaled_Proposals = Proposals * spatial_scale
for every ROI in Scaled_Proposals:
    fmap_subset = feature_map[ROI] (Feature_map for that ROI)
    Divide fmap_subset into P_w x P_h blocks (ex: 6*6 blocks)
    Take the maximum element of each block and copy to output block

The diagram below illustrates the forward pass of the ROI pooling layer.

credits: Region of interest pooling explained

==> the example above is a variant: the true block size is likely to be 2 * 3, and the RoI implementation shown absorbs the trailing values into the neighboring block (base implementations discard those values, resulting in information and precision loss);

===> the information is retained this way, but the representation is skewed, and we can safely assume some precision loss due to this sampling distortion;

===> RoI_Align should still give a better result by keeping the sampling block size consistent, which could be crucial: all RoI pooling dimensions are rather small, so a consistent block size matters.

The main advantage of ROI pooling is that we can use the same feature map for all the proposals which enables us to pass the entire image to the CNN instead of passing all proposals individually.

Hope this helps! Thanks, everyone!

References:

  • Girshick, Ross. “Fast R-CNN.” Proceedings of the IEEE International Conference on Computer Vision. 2015.
  • Girshick, Ross, et al. “Rich feature hierarchies for accurate object detection and semantic segmentation.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.