RCNN and Variants

Intro video

https://www.youtube.com/watch?v=vr5rs_cTKCs

RCNN

(short summary) https://towardsdatascience.com/object-detection-explained-r-cnn-a6c813937a76

Region-based Convolutional Neural Network

Object detection consists of two separate tasks: classification and localization. R-CNN stands for Region-based Convolutional Neural Network. The key concept behind the R-CNN series is region proposals, which are used to localize objects within an image. In the following blogs, I decided to write about different approaches and architectures used in object detection. Therefore, I am happy to start this journey with R-CNN based object detectors.

Working Details

RCNN: Working Details. Source: https://arxiv.org/pdf/1311.2524.pdf.

As can be seen in the image above, before passing an image through a network, we need to extract region proposals or regions of interest using an algorithm such as selective search. Then, we need to resize (warp) all the extracted crops and pass them through the network.

Finally, the network assigns one of C + 1 categories (C object classes plus a ‘background’ label) to a given crop. Additionally, it predicts deltas for the crop’s centre coordinates and dimensions to refine its bounding box.

Extract region proposals

Selective Search is a region proposal algorithm used for object localization. It hierarchically groups similar pixels into regions based on their intensities, merging the most similar regions step by step. In the original paper, the authors extract about 2,000 proposals per image.

==> it's enough to know selective search as a traditional similarity-grouping algorithm for segmentation, but for more on selective search see:

Selective Search for Object Detection | R-CNN_EverNoob的博客-CSDN博客
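For a concrete feel, OpenCV's contrib package ships a selective-search implementation; a minimal sketch, assuming opencv-contrib-python is installed and "input.jpg" is a placeholder path:

```python
import cv2

# Load an image (the path is a placeholder).
img = cv2.imread("input.jpg")

# Selective search lives in OpenCV's ximgproc contrib module.
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()  # trades quality for speed; switchToSelectiveSearchQuality() also exists

rects = ss.process()  # (x, y, w, h) proposals, typically a few thousand
print(f"{len(rects)} raw proposals")
proposals = rects[:2000]  # keep about 2,000, as in the R-CNN paper
```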

Positive vs. negative examples

After we extract our region proposals, we also have to label them for training. The authors label every proposal that has an IoU of at least 0.5 with any of the ground-truth bounding boxes with the corresponding class. All region proposals with an IoU of less than 0.3 are labelled as background (negative). The rest, falling in between, are simply ignored.
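A minimal sketch of this labelling rule (the 0.5 and 0.3 thresholds come from the paper; the (x1, y1, x2, y2) box format is an assumption for illustration):

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def label_proposal(proposal, gt_boxes, gt_classes):
    """Positive if IoU >= 0.5 with some ground truth, background if < 0.3, else ignored."""
    best = max(range(len(gt_boxes)), key=lambda i: iou(proposal, gt_boxes[i]))
    best_iou = iou(proposal, gt_boxes[best])
    if best_iou >= 0.5:
        return gt_classes[best]   # positive: class of the best-matching ground truth
    if best_iou < 0.3:
        return 0                  # negative: background class
    return None                   # in-between: ignored during training
```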

Bounding-box regression

Bounding-box regression. Source: https://arxiv.org/pdf/1311.2524.pdf.

The image above shows the deltas that are to be predicted by the CNN. Here, x and y are centre coordinates, whereas w and h are width and height respectively. Finally, G and P stand for the ground-truth bounding box and the region proposal respectively. It is important to note that the bounding-box loss is only calculated for positive samples.
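Concretely, the paper parameterizes the deltas as scale-invariant shifts of the centre and log-space scalings of the width and height; a sketch with boxes in (x, y, w, h) centre format:

```python
import math

def regression_targets(P, G):
    """Deltas from proposal P to ground truth G, both (x, y, w, h) with centre coords."""
    Px, Py, Pw, Ph = P
    Gx, Gy, Gw, Gh = G
    tx = (Gx - Px) / Pw          # centre shift, normalized by proposal size
    ty = (Gy - Py) / Ph
    tw = math.log(Gw / Pw)       # log-space scaling
    th = math.log(Gh / Ph)
    return tx, ty, tw, th

def apply_deltas(P, t):
    """Invert the parameterization: refine proposal P with predicted deltas t."""
    Px, Py, Pw, Ph = P
    tx, ty, tw, th = t
    return (Px + tx * Pw, Py + ty * Ph, Pw * math.exp(tw), Ph * math.exp(th))
```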

Loss

The total loss is calculated as the sum of the classification and regression losses, with a coefficient lambda on the latter, which is 1,000 in the original paper. Note that the regression loss is ignored for negative examples.

==> the huge coefficient on the BB-regression loss enforces a simple rule: no classification counts unless the bounding box stands up to scrutiny.
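A minimal PyTorch-style sketch of the combined objective as described above (the λ-weighted sum follows the text; in the paper itself the box regressor is actually trained separately as a ridge regression with λ = 1000, and all tensor shapes here are illustrative):

```python
import torch
import torch.nn.functional as F

def detection_loss(class_logits, box_deltas, labels, target_deltas, lam=1000.0):
    """labels: (N,) integers with 0 = background; box loss counts only for positives."""
    cls_loss = F.cross_entropy(class_logits, labels)
    positive = labels > 0
    if positive.any():
        reg_loss = F.mse_loss(box_deltas[positive], target_deltas[positive])
    else:
        reg_loss = box_deltas.sum() * 0.0  # keep the autograd graph intact when no positives
    return cls_loss + lam * reg_loss
```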

Architecture

Typically, we pass the resized crops through a backbone such as VGG-16 or ResNet-50 to get features. These are subsequently passed through fully connected layers that output the predictions.

==> again, the key to R-CNN is the R, not the CNN

===> a two-stage pipeline:

====> RoIs for the localization problem (solved, at this stage, with traditional algorithms)

====> a very traditional CNN for classification
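A hedged sketch of the classification stage of this pipeline: each warped crop goes through a backbone (ResNet-50 here), then fully connected heads output C + 1 class scores and 4 box deltas. The class count and crop size are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torchvision

class RCNNHead(nn.Module):
    """Per-crop classifier/regressor: backbone features -> FC heads. Illustrative only;
    num_classes includes the background class (C + 1)."""
    def __init__(self, num_classes=21):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # -> (N, 2048, 1, 1)
        self.cls = nn.Linear(2048, num_classes)   # C + 1 class scores
        self.reg = nn.Linear(2048, 4)             # (tx, ty, tw, th) deltas

    def forward(self, crops):                     # crops: (N, 3, 224, 224) resized proposals
        f = self.features(crops).flatten(1)
        return self.cls(f), self.reg(f)

scores, deltas = RCNNHead()(torch.randn(8, 3, 224, 224))
```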

Variants

base, fast, faster, You Only Look Once (YOLO):

https://towardsdatascience.com/r-cnn-fast-r-cnn-faster-r-cnn-yolo-object-detection-algorithms-36d53571365e

Introduction

Computer vision is an interdisciplinary field that has been gaining huge amounts of traction in recent years (since CNNs took off), and self-driving cars have taken centre stage. Another integral part of computer vision is object detection, which aids in pose estimation, vehicle detection, surveillance, etc. The difference between object detection algorithms and classification algorithms is that in detection we try to draw a bounding box around each object of interest to locate it within the image. Also, you might not necessarily draw just one bounding box: there could be many bounding boxes representing different objects of interest within the image, and you would not know how many beforehand.

The major reason why you cannot solve this problem by building a standard convolutional network followed by a fully connected layer is that the length of the output layer is variable, not constant, because the number of occurrences of the objects of interest is not fixed. A naive approach would be to take different regions of interest from the image and use a CNN to classify the presence of an object within each region. The problem with this approach is that the objects of interest can have different spatial locations within the image and different aspect ratios. Hence, you would have to select a huge number of regions, and this could computationally blow up. Therefore, algorithms like R-CNN and YOLO have been developed to find these occurrences, and to find them fast.

R-CNN

To bypass the problem of selecting a huge number of regions, Ross Girshick et al. proposed a method that uses selective search to extract just 2,000 regions from the image, which they called region proposals. Therefore, instead of trying to classify a huge number of regions, you can just work with 2,000.

To know more about the selective search algorithm, follow this link or the link above. These 2,000 candidate region proposals are warped into a square and fed into a convolutional neural network that produces a 4096-dimensional feature vector as output. The CNN acts as a feature extractor, and the extracted features are fed into an SVM to classify the presence of the object within that candidate region proposal. In addition to predicting the presence of an object within the region proposals, the algorithm also predicts four offset values to increase the precision of the bounding box. For example, given a region proposal, the algorithm might have predicted the presence of a person, but the face of that person within the proposal could have been cut in half; the offset values help adjust the bounding box of the region proposal.
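A hedged sketch of the feature-extractor-plus-SVM stage just described, with a truncated VGG-16 supplying the 4096-dimensional vectors and scikit-learn's LinearSVC standing in for the per-class SVMs (weights, labels, and shapes are placeholders):

```python
import numpy as np
import torch
import torchvision
from sklearn.svm import LinearSVC

# Truncate VGG-16 after the penultimate FC layer to get 4096-d descriptors per warped crop.
vgg = torchvision.models.vgg16(weights=None).eval()
feature_extractor = torch.nn.Sequential(
    vgg.features, vgg.avgpool, torch.nn.Flatten(),
    *list(vgg.classifier.children())[:-1],  # drop the final 1000-way classification layer
)

with torch.no_grad():
    crops = torch.randn(16, 3, 224, 224)        # stand-in for warped proposals
    feats = feature_extractor(crops).numpy()    # (16, 4096) feature vectors

labels = np.arange(len(feats)) % 2              # dummy labels just to make the sketch run
svm = LinearSVC().fit(feats, labels)            # in R-CNN: one binary SVM per class
```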

R-CNN

Problems with R-CNN

  • It still takes a huge amount of time to train the network, as you have to classify 2,000 region proposals per image.
  • It cannot be implemented in real time, as it takes around 47 seconds per test image.
  • The selective search algorithm is a fixed algorithm, so no learning happens at that stage. This can lead to the generation of bad candidate region proposals.

Fast R-CNN

Fast R-CNN

The same author as the previous paper (R-CNN) solved some of its drawbacks to build a faster object detection algorithm, called Fast R-CNN. The approach is similar to the R-CNN algorithm, but instead of feeding the region proposals to the CNN, we feed the input image to the CNN to generate a convolutional feature map. From the convolutional feature map, we identify the region proposals and warp them into squares [==> not precise: we use RoI projection to fit the RoIs, which are given as inputs along with the input image, onto the feature map; for more detail, see RoI: Region of Interest Projection and Pooling_EverNoob的博客-CSDN博客], and by using an RoI pooling layer we reshape them into a fixed size so that they can be fed into a fully connected layer. From the RoI feature vector (==> the reshaped RoI projection), we use a softmax layer to predict the class of the proposed region and also the offset values for the bounding box.

The reason “Fast R-CNN” is faster than R-CNN is that you don’t have to feed 2,000 region proposals to the convolutional neural network every time. Instead, the convolution operation is done only once per image, and a feature map is generated from it.
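torchvision exposes an RoI pooling op, so the "convolve once, pool many regions" idea can be sketched directly (channel count, image size, and the stride implied by spatial_scale are illustrative assumptions):

```python
import torch
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 256, 50, 50)   # one backbone pass for the whole image

# RoIs in (batch_index, x1, y1, x2, y2) format, in input-image coordinates.
rois = torch.tensor([[0, 10.0, 10.0, 200.0, 150.0],
                     [0, 40.0, 60.0, 300.0, 320.0]])

# spatial_scale maps image coordinates onto the feature map (the RoI projection step);
# 50/800 assumes an 800-pixel input, purely for illustration.
pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=50 / 800)
print(pooled.shape)  # torch.Size([2, 256, 7, 7]) -> fixed size, ready for the FC layers
```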

Comparison of object detection algorithms

From the above graphs, you can infer that Fast R-CNN is significantly faster than R-CNN in both training and testing. However, when you look at Fast R-CNN's performance at test time, including region proposals slows the algorithm down significantly compared to not using them. Therefore, region proposals become the bottleneck of the Fast R-CNN algorithm, affecting its performance.

==> we see that running Selective Search (SS), the algorithm generating the region proposals, on the input image prevents Fast R-CNN from being deployed reliably in real-time cases.

==> SS is also a traditional algorithm with fixed performance, not benefiting from learning

===> hence there was a strong incentive to replace SS, as seen below:

Faster R-CNN

Faster R-CNN

Both of the above algorithms (R-CNN and Fast R-CNN) use selective search to find the region proposals. Selective search is a slow and time-consuming process that affects the performance of the network. Therefore, Shaoqing Ren et al. came up with an object detection algorithm that eliminates the selective search algorithm and lets the network learn the region proposals.

Similar to Fast R-CNN, the image is provided as input to a convolutional network which produces a convolutional feature map. Instead of using a selective search algorithm on the feature map [==> misleading: the Fast R-CNN paper's caption of Figure 1 reads "Fast R-CNN architecture. An input image and multiple regions of interest (RoIs) are input into a fully convolutional network", which clearly shows that (for Fast R-CNN) SS is deployed on the input image, not on the CNN-generated feature map] to identify the region proposals, a separate network is used to predict the region proposals [==> since there is no RoI projection step and no RoIs as inputs, Faster R-CNN indeed generates RoIs from the CNN-generated feature map, which effectively reuses the CNN results]. The predicted region proposals are then reshaped using an RoI pooling layer, which is then used to classify the image within the proposed region and predict the offset values for the bounding boxes.

Comparison of test-time speed of object detection algorithms

From the above graph, you can see that Faster R-CNN is much faster than its predecessors ==> since the inference time of the region proposal network is negligible compared to SS. Therefore, it can even be used for real-time object detection.

YOLO — You Only Look Once

All of the previous object detection algorithms use regions to localize objects within the image. The network does not look at the complete image, but instead at parts of the image that have high probabilities of containing an object. YOLO, or You Only Look Once, is an object detection algorithm quite different from the region-based algorithms seen above. In YOLO, a single convolutional network predicts both the bounding boxes and the class probabilities for those boxes.

YOLO

How YOLO works is that we take an image and split it into an S×S grid; within each grid cell we take m bounding boxes. For each of the bounding boxes, the network outputs class probabilities and offset values for the box. The bounding boxes with a class probability above a threshold value are selected and used to locate the objects within the image.

YOLO is orders of magnitude faster (45 frames per second) than the other object detection algorithms. Its limitation is that it struggles with small objects in the image; for example, it might have difficulties detecting a flock of birds. This is due to the spatial constraints of the algorithm.
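A minimal decode sketch under the grid layout just described; S, m, the class count, the output tensor layout, and the threshold are all assumptions for illustration:

```python
import numpy as np

S, m, C = 7, 2, 20                       # grid size, boxes per cell, classes (YOLOv1-style)
pred = np.random.rand(S, S, m * 5 + C)   # stand-in for the network output

threshold = 0.5
for row in range(S):
    for col in range(S):
        class_probs = pred[row, col, m * 5:]        # class probabilities, shared per cell
        for b in range(m):
            x, y, w, h, conf = pred[row, col, b * 5:b * 5 + 5]
            score = conf * class_probs.max()        # class-specific confidence
            if score > threshold:
                # (x, y) are offsets within the cell; (w, h) are relative to the image
                print(row, col, b, score, class_probs.argmax())
```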

base, fast, faster, mask:

the linked material below is intended to be an open-source textbook, and is more technical:

13.8. Region-based CNNs (R-CNNs) — Dive into Deep Learning 0.17.5 documentation

Fast vs. Faster RCNN

obviously we can expect a longer training time for "Faster", but the payoff is the reduced inference time from cutting out SS.

Faster R-CNN in More Details

To be more accurate in object detection, the fast R-CNN model usually has to generate a lot of region proposals in selective search. To reduce region proposals without loss of accuracy, the faster R-CNN proposes to replace selective search with a region proposal network [Ren et al., 2015].

Fig. 13.8.4 shows the faster R-CNN model. Compared with the fast R-CNN, the faster R-CNN only changes the region proposal method from selective search to a region proposal network. The rest of the model remains unchanged. The region proposal network works in the following steps (a sketch follows the list):

  1. Use a 3×3 convolutional layer with padding of 1 to transform the CNN output to a new output with c channels. In this way, each unit along the spatial dimensions of the CNN-extracted feature maps gets a new feature vector of length c.

  2. Centered on each pixel of the feature maps, generate multiple anchor boxes of different scales and aspect ratios and label them.

  3. Using the length-c feature vector at the center of each anchor box, predict the binary class (background or objects) and bounding box for this anchor box.

  4. Consider those predicted bounding boxes whose predicted classes are objects. Remove overlapped results using non-maximum suppression. The remaining predicted bounding boxes for objects are the region proposals required by the region of interest pooling layer.
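Below is a hedged sketch of steps 1 and 3 plus the NMS of step 4 (anchor generation and delta decoding are elided; channel sizes and anchor counts are illustrative, and torchvision.ops.nms stands in for the suppression step):

```python
import torch
import torch.nn as nn
from torchvision.ops import nms

class RPNHead(nn.Module):
    """Step 1: 3x3 conv with padding 1 -> a length-c feature per spatial unit.
    Step 3: per-anchor objectness score and 4 box deltas, via 1x1 convs."""
    def __init__(self, in_channels=512, c=512, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, c, kernel_size=3, padding=1)
        self.objectness = nn.Conv2d(c, num_anchors, kernel_size=1)
        self.deltas = nn.Conv2d(c, num_anchors * 4, kernel_size=1)

    def forward(self, feature_map):
        h = torch.relu(self.conv(feature_map))
        return self.objectness(h), self.deltas(h)

scores, deltas = RPNHead()(torch.randn(1, 512, 38, 50))

# Step 4 (sketch): decode deltas into boxes (omitted), then suppress overlaps.
boxes = torch.rand(100, 4) * 600          # stand-in for decoded boxes (x1, y1, x2, y2)
boxes[:, 2:] += boxes[:, :2]              # ensure x2 > x1 and y2 > y1
keep = nms(boxes, torch.rand(100), iou_threshold=0.7)
proposals = boxes[keep]                   # region proposals fed to RoI pooling
```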

more on NMS (non-maximum suppression).

It is worth noting that, as part of the faster R-CNN model, the region proposal network is jointly trained with the rest of the model. In other words, the objective function of the faster R-CNN includes not only the class and bounding box prediction in object detection, but also the binary class and bounding box prediction of anchor boxes in the region proposal network. As a result of the end-to-end training, the region proposal network learns how to generate high-quality region proposals, so as to stay accurate in object detection with a reduced number of region proposals that are learned from data.

Mask RCNN

In the training dataset, if pixel-level positions of objects are also labeled on images, the mask R-CNN can effectively leverage such detailed labels to further improve the accuracy of object detection [He et al., 2017a].

As shown in Fig. 13.8.5, the mask R-CNN is modified based on the faster R-CNN. Specifically, the mask R-CNN replaces the region of interest pooling layer with the region of interest (RoI) alignment layer. This region of interest alignment layer uses bilinear interpolation to preserve the spatial information on the feature maps (more on RoI Align see Understanding Region of Interest - Part 2 (RoI Align) - Blog by Kemal Erdem), which is more suitable for pixel-level prediction. The output of this layer contains feature maps of the same shape for all the regions of interest. They are used to predict not only the class and bounding box for each region of interest, but also the pixel-level position of the object through an additional fully convolutional network. More details on using a fully convolutional network to predict pixel-level semantics of an image will be provided in subsequent sections of this chapter.
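torchvision exposes both ops, so the swap described above amounts to one line; a sketch contrasting the two (coordinates, scale, and sampling ratio are illustrative):

```python
import torch
from torchvision.ops import roi_align, roi_pool

feature_map = torch.randn(1, 256, 50, 50)
rois = torch.tensor([[0, 13.7, 27.2, 201.3, 188.9]])  # fractional coordinates matter here

# roi_pool quantizes box coordinates to integer bins (spatial information is lost);
# roi_align instead samples at fractional positions with bilinear interpolation.
pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1 / 16)
aligned = roi_align(feature_map, rois, output_size=(7, 7), spatial_scale=1 / 16,
                    sampling_ratio=2)  # 2x2 sample points per output bin
print(pooled.shape, aligned.shape)     # both (1, 256, 7, 7); the values differ
```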
