Class 4, Week 3: Object Detection

Object Localization

(Figure: image classification vs. classification with localization vs. detection of multiple objects.)
With object localization, the network needs to identify where the object is, putting a bounding box around it. This is what is called “classification with localization”. Later on, we’ll see the “detection” problem, which takes care of detecting and localizing multiple objects within the image.

For an object localization problem, we start off with the same network we saw for image classification. So, we have an image as input, which goes through a ConvNet that results in a vector of features fed to a softmax to classify the object (for example with 4 classes: pedestrian/car/bike/background). Now, if we want to localize those objects in the image as well, we change the neural network to have a few more output units that encode a bounding box. In particular, we add four more numbers: the x and y coordinates of the box’s midpoint, and the height and width of the box (bx, by, bh, bw).
The neural network will now output these four numbers, plus pc (the probability that an object is present) and the class probabilities c1, c2, c3. Therefore, the target label will be:
$$y = \begin{bmatrix} p_c \\ b_x \\ b_y \\ b_h \\ b_w \\ c_1 \\ c_2 \\ c_3 \end{bmatrix}$$

where pc is the confidence that an object is in the image; it answers the question “is there an object?”. c1, c2, c3, in case there is an object, tell whether the object is of class 1, 2, or 3, i.e., which object it is. Finally, bx, by, bh, bw are the coordinates of the bounding box around the detected object.
For example, if an image has a car, the target label will be:
$$y = \begin{bmatrix} 1 \\ b_x \\ b_y \\ b_h \\ b_w \\ 0 \\ 1 \\ 0 \end{bmatrix}$$
In case there is no object in the image, the target is simply:
$$y = \begin{bmatrix} 0 \\ ? \\ ? \\ ? \\ ? \\ ? \\ ? \\ ? \end{bmatrix}$$
where the question marks fill the remaining positions, which don’t carry any meaning in this case: when pc = 0, the loss function ignores these components, so the network is free to output arbitrary values there.
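
To make the encoding concrete, here is a minimal sketch in NumPy of how such a target vector could be built for the 3-class example above (the `make_target` helper is hypothetical, not part of any library):

```python
import numpy as np

def make_target(has_object, box=None, class_id=None):
    """Build the 8-dim target [pc, bx, by, bh, bw, c1, c2, c3].

    `box` is (bx, by, bh, bw) with the midpoint convention; `class_id`
    is 0, 1, or 2 for pedestrian/car/bike. Illustrative only.
    """
    y = np.zeros(8)
    if has_object:
        y[0] = 1.0             # pc: an object is present
        y[1:5] = box           # bx, by, bh, bw
        y[5 + class_id] = 1.0  # one-hot class indicator
    # when has_object is False, the other components are "don't cares":
    # the loss only looks at pc, so zeros are fine here
    return y

# an image with a car (class index 1) centered at (0.5, 0.7)
y_car = make_target(True, box=(0.5, 0.7, 0.3, 0.4), class_id=1)
y_none = make_target(False)
```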


Landmark Detection

In this case, the output will be even bigger since we ask the network to output the x and y coordinates of important points within an image. For example, think about an application for detecting key landmarks of a face. In this situation, we could identify points along the face that denote, for example, the corners of the eyes, the mouth, etc.
(Figure: key landmarks, such as eye corners and mouth points, marked on a face.)
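
As a rough sketch (the choice of 64 landmarks here is just an assumption for illustration), the target for a face could be laid out as one “is there a face?” flag followed by 64 (x, y) pairs:

```python
import numpy as np

def landmark_target(face_present, landmarks=None):
    """Flatten 64 (x, y) landmark pairs into a 1 + 128 target vector."""
    y = np.zeros(1 + 2 * 64)
    if face_present:
        y[0] = 1.0
        # l1x, l1y, ..., l64x, l64y
        y[1:] = np.asarray(landmarks).reshape(-1)
    return y

y = landmark_target(True, np.random.rand(64, 2))
```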


Object Detection

Sliding Window Detection

(Figure: windows of different sizes slid across the image, with a prediction at each position.)
Object detection can be performed using a technique called “sliding window detection”. We train a ConvNet to detect objects within an image and use windows of different sizes that we slide on top of it. For each window, we perform a prediction.
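
As a rough sketch of the naive procedure (`classify_window` stands in for a trained ConvNet and is hypothetical):

```python
def sliding_window_detect(image, classify_window, window=14, stride=2):
    """Run a classifier at every window position: one full forward pass each."""
    h, w = image.shape[:2]
    detections = []
    for top in range(0, h - window + 1, stride):
        for left in range(0, w - window + 1, stride):
            crop = image[top:top + window, left:left + window]
            label, confidence = classify_window(crop)  # one ConvNet pass
            detections.append((top, left, label, confidence))
    return detections
```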

The big downside is the computational cost, which is very high since we can have a lot of windows, each requiring a full forward pass. The solution to that is the sliding window detection computed convolutionally.
Instead of sliding a small squeegee across the window to clean it, we now have a big one that covers the entire window and cleans it all at once, without any movement.
Let’s check this out!

Convolutional Implementation of Sliding Windows

The first step to build up towards the convolutional implementation of sliding windows is to turn the fully connected layers of a neural network into convolutional layers. See the example below:
(Figure: the fully connected layers of the classification network replaced by equivalent convolutional layers.)
Great, now to simplify the representation, let’s re-sketch the final network in 2D:
(Figure: the same network re-sketched in 2D.)
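
As a sketch in PyTorch of this conversion (the sizes follow the lecture’s example of a 5x5x16 volume feeding two 400-unit FC layers and a 4-way output; treat the exact architecture as an assumption): an FC layer over a 5x5x16 volume becomes a 5x5 convolution with 400 filters, and the later FC layers become 1x1 convolutions. The two heads are architecturally equivalent, and the convolutional one works on any spatial size:

```python
import torch
import torch.nn as nn

# Fully connected head: flatten 5x5x16 -> 400 -> 400 -> 4
fc_head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(5 * 5 * 16, 400), nn.ReLU(),
    nn.Linear(400, 400), nn.ReLU(),
    nn.Linear(400, 4),
)

# Equivalent convolutional head
conv_head = nn.Sequential(
    nn.Conv2d(16, 400, kernel_size=5), nn.ReLU(),   # FC(400) -> 5x5 conv
    nn.Conv2d(400, 400, kernel_size=1), nn.ReLU(),  # FC(400) -> 1x1 conv
    nn.Conv2d(400, 4, kernel_size=1),               # FC(4)   -> 1x1 conv
)

x = torch.randn(1, 16, 5, 5)
print(conv_head(x).shape)  # torch.Size([1, 4, 1, 1]): one "FC" output per position
```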

If our test image has dimension 16x16x3 and we had to perform the “regular” sliding window, we would have to crop 4 different windows of size 14x14x3 out of the original test image and run each one through the ConvNet.
(Figure: the four 14x14x3 windows cropped from the 16x16x3 test image.)

This is computationally expensive, and a lot of this computation is duplicated. We would like, instead, to have these four passes share computation.

So, with the convolutional implementation of sliding windows, we run the ConvNet, with the same parameters and same filters, on the whole test image, and this is what we get:
(Figure: the ConvNet applied to the full 16x16x3 image, producing a 2x2x4 output volume.)

Each of the 4 subsets of the output volume is essentially the result of running the ConvNet on one of the four 14x14x3 regions of the initial 16x16x3 image.
You might be wondering if this works on other examples too, and it does.

Think about an input image of 28x28x3. Going through the network, we arrive at a final output of 8x8x4. Each of the 8x8 positions corresponds to running the ConvNet on a 14x14x3 window, with a stride of 2, in the original image.
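
This matches the usual sliding-window count: with a 14-wide window and a stride of 2 on a 28-wide image,

$$\frac{28 - 14}{2} + 1 = 8$$

window positions fit along each dimension, hence the 8x8 grid of outputs.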

One of the weaknesses of this implementation is that the position of the bounding box we get around the detected object is not overly accurate.

We will soon see that the YOLO algorithm is the solution to that.


YOLO (You Only Look Once)

We start with placing a grid on top of the input image. Then, for each of the grid cells, we run the classification and localization algorithm we saw at the beginning of the blog. The labels for training, for each grid cell, will be similar to what we saw earlier, with an 8-dimensional output vector:
$$y = \begin{bmatrix} p_c \\ b_x \\ b_y \\ b_h \\ b_w \\ c_1 \\ c_2 \\ c_3 \end{bmatrix}$$
For each cell, we will get a result indicating whether there is an object or not. For example:
(Figure: example grid with a label vector per cell; cells with no object have pc = 0.)
An object is “assigned” to the specific grid cell in which its midpoint falls.
(Figure: each object assigned to the grid cell containing its midpoint.)
If we have a 3x3 grid, then the target output volume will have dimension 3x3x8 (where 8 is the number of components in y). So, in this case, we will run the input image through a ConvNet that maps to a 3x3x8 output volume.
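
A minimal sketch of building that 3x3x8 target volume from a list of labeled boxes (midpoints given in image-relative coordinates; the helper name and the exact coordinate conventions are assumptions, as conventions vary across implementations):

```python
import numpy as np

def yolo_targets(objects, grid=3, num_classes=3):
    """objects: list of (bx, by, bh, bw, class_id), coords in [0, 1]."""
    y = np.zeros((grid, grid, 5 + num_classes))
    for bx, by, bh, bw, cls in objects:
        col, row = int(bx * grid), int(by * grid)  # cell holding the midpoint
        y[row, col, 0] = 1.0                       # pc
        # midpoint relative to the cell; size relative to the cell size
        y[row, col, 1] = bx * grid - col
        y[row, col, 2] = by * grid - row
        y[row, col, 3] = bh * grid
        y[row, col, 4] = bw * grid
        y[row, col, 5 + cls] = 1.0
    return y

targets = yolo_targets([(0.5, 0.7, 0.3, 0.4, 1)])  # one car
```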

So we have a single convolutional implementation for all the grid cells at once (not 9 individual runs), as we saw earlier. We, therefore, combine the classification-with-localization algorithm with the convolutional implementation of sliding windows.

The advantage of this algorithm is that it outputs precise bounding box positions, as the values bx, by, bh, bw are computed relative to the cell (bx and by are between 0 and 1, while bh and bw can be greater than 1 if the object spans multiple cells). So, the finer the grid, the more precision we can obtain, and the lower the chance of having multiple objects within a cell.

Intersection over Union (IoU)

This is a way of measuring whether the object detection algorithm is working well. It computes the intersection over the union of the detected bounding box and the ground-truth one. Therefore:
$$\text{IoU} = \frac{\text{size of the intersection area}}{\text{size of the union area}}$$
We identify a threshold and consider the detection accurate if the IoU is above that specific value (e.g., IoU >= 0.5). Clearly, the higher the IoU value, the more accurate the detection.
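
A straightforward implementation, assuming boxes are given as (x1, y1, x2, y2) corner coordinates:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```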

Non-max Suppression

This technique is used to make the YOLO algorithm perform better. In fact, YOLO could detect an object multiple times, since it’s possible that many grid cells detect the same object. To avoid that, we take the following steps:

Remember that each prediction comes with a value $p_c$, which identifies the prediction confidence. We first discard, for example, all the boxes with $p_c \le 0.6$.

Then, among the remaining boxes, we repeatedly take the box with the largest $p_c$, look at the boxes that overlap it the most, and remove the ones with a high IoU. The boxes that survive are the correct detections.

More precisely, while there are any remaining boxes, we:

  • pick the box with the largest $p_c$ and output it as a prediction;
  • discard any remaining box with IoU $\ge 0.5$ (0.5 is just an empirical value) with respect to the box output in the previous step.

If we have multiple classes (objects), then we implement non-max suppression independently for each one.
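
Putting the steps above together, a minimal single-class sketch (reusing the `iou` helper from the previous section; for multiple classes, call it once per class):

```python
def non_max_suppression(boxes, scores, pc_threshold=0.6, iou_threshold=0.5):
    """boxes: list of (x1, y1, x2, y2); scores: matching pc values."""
    # 1. discard low-confidence boxes
    candidates = [(s, b) for s, b in zip(scores, boxes) if s > pc_threshold]
    # 2. repeatedly keep the highest-pc box, drop boxes overlapping it
    candidates.sort(key=lambda sb: sb[0], reverse=True)
    kept = []
    while candidates:
        score, best = candidates.pop(0)
        kept.append((score, best))
        candidates = [(s, b) for s, b in candidates
                      if iou(best, b) < iou_threshold]
    return kept
```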

Anchor Boxes

One of the problems with object detection as we have seen it so far is that each grid cell can only detect one object. If we instead have multiple objects in the same cell, the techniques we have used so far won’t help us discern them. Anchor boxes will help us overcome this issue.
(Figure: two objects whose midpoints fall in the same grid cell, and two predefined anchor box shapes, one tall and one wide.)
The idea here is to predefine different shapes (called anchor boxes) and associate each prediction with one of them. Our output label will now contain 8 dimensions for each of the anchor boxes we predefined.
If we choose two anchor boxes, then the target label will be:
$$y = \begin{bmatrix} p_c \\ b_x \\ b_y \\ b_h \\ b_w \\ c_1 \\ c_2 \\ c_3 \\ p_c \\ b_x \\ b_y \\ b_h \\ b_w \\ c_1 \\ c_2 \\ c_3 \end{bmatrix}$$

(the first 8 components refer to anchor box 1, the last 8 to anchor box 2)
Previously, each object in the training image was assigned to the grid cell that contained that object’s midpoint (for a 3x3 grid, the output was 3x3x8). Now, each object in the training image is assigned to the grid cell that contains its midpoint and to the anchor box for that grid cell with the highest IoU.

(for a 3x3 grid and 2 anchor boxes, the output is 3x3x16).

The one case this cannot handle well is when two objects in the same cell match the same anchor box (or when there are more objects in a cell than anchor boxes). Additionally, we get to choose and refine the shapes of the anchor boxes ourselves.
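
A sketch of the anchor-matching step. Anchors here are described only by their (height, width) shape, and IoU is computed as if box and anchor were centered at the same point, a common simplification (this helper is illustrative, not from any library):

```python
def best_anchor(bh, bw, anchors):
    """Pick the anchor whose (height, width) shape best matches the box.

    Shape similarity is the IoU of the two boxes placed concentrically,
    which reduces to overlap of the height/width extents.
    """
    best_idx, best_iou = 0, 0.0
    for i, (ah, aw) in enumerate(anchors):
        inter = min(bh, ah) * min(bw, aw)
        union = bh * bw + ah * aw - inter
        shape_iou = inter / union
        if shape_iou > best_iou:
            best_idx, best_iou = i, shape_iou
    return best_idx

# e.g. one tall anchor and one wide anchor
anchors = [(1.8, 0.7), (0.7, 1.8)]
print(best_anchor(1.5, 0.6, anchors))  # 0 -> the tall anchor
```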

Putting it All Together for YOLO

Quick tips when implementing the YOLO algorithm:

  • Decide the grid size and the number of anchor boxes (as these two variables drive the dimension of the output volume y).
  • Train the ConvNet on the training images.
  • Run non-max suppression.
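
For example (sizes chosen only for illustration), with a 19x19 grid, 5 anchor boxes, and 3 classes, each anchor needs $1 + 4 + 3 = 8$ numbers, so the output volume is $19 \times 19 \times (5 \times 8) = 19 \times 19 \times 40$.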

Region Proposals

This algorithm tries to pick a few regions of the image on which it makes sense to run the classifier, since for regions that contain no objects it makes no sense to run the ConvNet classifier.

So first we need a way to find out where the objects might be. We can do so by running a segmentation algorithm, which identifies blobs around objects. Then, we place a bounding box around each blob and run the classifier on each of these bounding boxes. It is a pretty slow algorithm, as it proposes some regions and classifies them one at a time.

To speed this up, the “Fast R-CNN” algorithm has been proposed. It keeps the first step, which proposes the regions, but then uses the convolutional implementation of sliding windows to classify all the proposed regions.

Well, the first step is still a bit annoyingly slow, right?
Why not a “faster R-CNN”? Yes, it exists.

This one replaces the first step: it uses a convolutional network (a region proposal network) to propose the regions.
