Course 4 week 3 - object detection

1 - object localization

In order to build up to object detection, we first learn about object localization.



  • image classification: the algorithm looks at the picture and is responsible for saying this is a car.
  • image classification with localization: the algorithm is responsible not only for saying this is a car, but also for putting a bounding box around the position of the car in the image. (one object)
  • detection: there might be multiple objects in the picture; you have to detect them all and localize them all. (maybe more than one object)

For image classification, we input a picture into a ConvNet with multiple layers, and finally a softmax unit outputs the predicted class. That is the standard classification pipeline. What if we want to localize the car in the image as well? To do that we can change the neural network to have a few more output units that output a bounding box specified by four more numbers $b_x, b_y, b_h, b_w$, where $b_x, b_y$ specify the midpoint, and $b_h$ and $b_w$ are the height and width of the bounding box.



So if your training set contains not just the object class label but also the four additional numbers giving the bounding box, then we can use supervised learning to make the algorithm output not just the class label but also the four parameters that tell us where the bounding box of the detected object is.

how to define the target label $y$:

$y$ needs to output $b_x, b_y, b_h, b_w$ and the class label (1-4), and the first component $p_c$ of the vector says whether there is any object: if the object is class 1, 2, or 3, $p_c$ will equal 1, and if it's the background class, then $p_c = 0$. So $p_c$ stands for the probability that one of the classes we are trying to detect is there.



how to define the loss function:

$$
L(y, \hat{y}) =
\begin{cases}
(\hat{y}_1 - y_1)^2 + (\hat{y}_2 - y_2)^2 + \cdots + (\hat{y}_8 - y_8)^2 & \text{if } y_1 = 1 \\
(\hat{y}_1 - y_1)^2 & \text{if } y_1 = 0 \text{ (in this case all the rest of the components are don't-cares)}
\end{cases}
$$
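This loss can be sketched in code. A minimal illustration (not the course's implementation), assuming $y$ is the 8-dimensional vector $[p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3]$ and squared error is used for every component:

```python
import numpy as np

def localization_loss(y, y_hat):
    """Squared-error loss for classification with localization.

    y, y_hat: length-8 vectors [pc, bx, by, bh, bw, c1, c2, c3].
    If pc (y[0]) is 1, all 8 components contribute; if pc is 0,
    only the first component matters (the rest are "don't cares").
    """
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    if y[0] == 1:
        return float(np.sum((y_hat - y) ** 2))
    return float((y_hat[0] - y[0]) ** 2)
```

In the full course treatment, different components can use different losses (e.g. log-likelihood for the classes), but squared error everywhere also works.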

2 - landmark detection

We can have a neural network just output the x and y coordinates of important points in an image, called landmarks, that we want the neural network to recognize.



training set: $(x = \text{image},\ y = (\text{is there a face},\ l_{x1}, l_{y1}, l_{x2}, l_{y2}, \cdots, l_{x64}, l_{y64}))$

We train the neural network on the labeled training set, which has a set of images as well as labels $y$ in which someone has laboriously annotated all of these landmarks; it tells us whether there is a face as well as where the key landmarks on the face are. This is a basic building block for recognizing emotions from faces, or for drawing a crown on the face and other special effects.

These ideas might seem quite simple: just add a bunch of output units to output the x, y coordinates of the different landmarks you want to recognize. To be clear, the labels/landmarks have to be consistent across different images. Next let's take these building blocks and use them to start building object detection.

3 - object detection

We have learned about object localization and landmark detection. Next let's build up an object detection algorithm. We will learn how to use a ConvNet to perform object detection, using the sliding windows detection algorithm.



We want to build a car detection algorithm. We can first create a labeled training set, where x consists of closely cropped examples of cars.



Given this labeled training set, we can then train a ConvNet that inputs the closely cropped images, where the job of the ConvNet is to output the corresponding y. Once we have trained this ConvNet, we can use it in the sliding windows algorithm.

What we do is start by picking a certain window size, slide the window across every position in the image, and feed each square region into the ConvNet to classify it as 0 or 1. Then we repeat this with a larger window, taking a slightly larger region throughout the entire image and feeding each of these into the ConvNet. We might then do it a third time with even larger windows, and so on. This algorithm is called sliding windows detection.
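The sliding procedure for a single window size can be sketched as follows; `sliding_windows` is a hypothetical helper that only enumerates window positions (the trained ConvNet would then classify each cropped region):

```python
def sliding_windows(image_h, image_w, window, stride):
    """Enumerate the (top, left) corners of every window position.

    The trained car/no-car ConvNet would be run on each
    image[top:top+window, left:left+window] crop.
    """
    positions = []
    for top in range(0, image_h - window + 1, stride):
        for left in range(0, image_w - window + 1, stride):
            positions.append((top, left))
    return positions
```

For a 16 by 16 image, a 14 by 14 window, and a stride of 2, this yields exactly 4 positions, which is the setting revisited in the convolutional implementation section.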



There is a huge disadvantage of sliding windows detection, which is the computational cost: a coarse granularity (a large stride) may hurt performance, because you end up unable to localize the objects accurately, whereas a very fine granularity or a small stride means a high computational cost.

4 - convolutional implementation of sliding windows

We have learned about the sliding windows object detection algorithm using a ConvNet, but we saw that it was too slow. Now we learn how to implement this algorithm convolutionally.

Turning the fully connected layers in the neural network into convolutional layers:



What we do next is turn the FC layers into convolutional layers. One way to implement the first fully connected layer is to use 400 filters of size 5 by 5 by 16, so the output will be 1 by 1 by 400; this is mathematically the same as the FC layer. The diagram below shows how we can replace these FC layers and implement them using convolutional layers.
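The equivalence can be checked numerically: a fully connected layer from a flattened 5 by 5 by 16 volume to 400 units computes exactly the same numbers as 400 filters of size 5 by 5 by 16 applied at the single valid position. A small NumPy sketch with random weights, purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 5, 16))       # activation volume feeding the first FC layer
W = rng.standard_normal((400, 5, 5, 16))  # 400 filters, each 5 by 5 by 16

# Fully connected view: flatten x and multiply by a 400 x 2000 weight matrix.
fc_out = W.reshape(400, -1) @ x.reshape(-1)

# Convolutional view: each 5x5x16 filter covers the whole input, so the
# "convolution" is a single dot product per filter -> a 1 by 1 by 400 output.
conv_out = np.array([np.sum(W[k] * x) for k in range(400)])

assert np.allclose(fc_out, conv_out)
```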




Let's see how we can have a convolutional implementation of sliding windows object detection.



Let's say the ConvNet inputs 14 by 14 by 3 images, and the test set image is 16 by 16 by 3. In the original sliding windows algorithm we would slide a window over the 16 by 16 image and run the ConvNet shown above four times with a stride of 2 in order to get four labels. But it turns out that a lot of the computation done during those 4 runs is highly duplicated. What the convolutional implementation of sliding windows does is allow the 4 forward passes of the ConvNet to share a lot of computation.



It turns out that the blue 1 by 1 by 4 subset of the network's output gives you exactly the result of running the ConvNet on the upper left-hand corner 14 by 14 region, and so on. (The max pooling of 2 corresponds to running your neural network with a stride of 2 on the original image.)



To recap: to implement sliding windows, previously we would crop out a region, run it through the ConvNet, then do the same for the next region, and so on, until some region recognizes the car in it. But with the convolutional implementation you can convolve over the entire image and make all the predictions at the same time, which makes the whole thing much more efficient. This algorithm still has one weakness: the positions of the bounding boxes are not going to be very accurate.

5 - bounding box predictions

With sliding windows we take a discrete set of positions and run the classifier on them, and maybe none of the boxes matches up perfectly with the position of the car; even the best-matching box may not be very good.



A good way to output more accurate bounding boxes is the YOLO algorithm; YOLO stands for You Only Look Once.

The basic idea is that we place a grid on the image and apply the image classification and localization algorithm to each of the nine grid cells of this image.



Take the midpoint of each of the two objects and assign each object to the grid cell containing its midpoint.

labels for training:

for each cell: $y = \begin{bmatrix} p_c & b_x & b_y & b_h & b_w & c_1 & c_2 & c_3 \end{bmatrix}^T$, for example $\begin{bmatrix} 0 & ? & ? & ? & ? & ? & ? & ? \end{bmatrix}^T$ for the upper left-hand corner cell, or $\begin{bmatrix} 1 & b_x & b_y & b_h & b_w & 0 & 1 & 0 \end{bmatrix}^T$ for the right middle cell

In this way the total volume of the output is going to be 3 by 3 by 8, because we have 3 by 3 grid cells and each of them has an eight-dimensional vector $y$. Let's say the input image is 100 by 100 by 3. What we do now is build a neural network that inputs a 100 by 100 by 3 image, choosing the conv layers, max pooling layers, and so on, so that it eventually maps to a 3 by 3 by 8 target output volume, and then use backpropagation to train the neural network to map from any input x to this type of output y.
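Constructing the 3 by 3 by 8 target volume for one training image can be sketched as below. The conventions here (object midpoints given as fractions of the whole image, then re-expressed relative to the cell) are assumptions for this sketch, not the course's exact data format:

```python
import numpy as np

def make_target(objects, grid=3, classes=3):
    """Build a grid x grid x (5 + classes) YOLO-style target volume.

    objects: list of (bx, by, bh, bw, class_id), where bx, by, bh, bw
    are fractions of the whole image and class_id is 0-based.
    """
    y = np.zeros((grid, grid, 5 + classes))
    for bx, by, bh, bw, cls in objects:
        col = min(int(bx * grid), grid - 1)  # cell holding the midpoint
        row = min(int(by * grid), grid - 1)
        y[row, col, 0] = 1.0                 # pc: an object is present
        # midpoint re-expressed relative to the cell (in [0, 1]);
        # height/width in cell units, so they may exceed 1
        y[row, col, 1:5] = [bx * grid - col, by * grid - row,
                            bh * grid, bw * grid]
        y[row, col, 5 + cls] = 1.0           # one-hot class label
    return y
```

For example, a class-2 object with midpoint at (0.5, 0.85) lands in the lower middle cell, and all other cells keep $p_c = 0$ (their remaining entries are don't-cares).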

The advantage of this algorithm is that the neural network outputs precise bounding boxes. At test time you feed in an image x and run forward prop until you get the output y, and then for each of the nine positions you can read off 1 or 0, and if there is an object, what object it is and where the bounding box for the object in that cell is. As long as you don't have more than one object in each cell, this algorithm should work fine.

$b_x, b_y, b_h, b_w$ are specified relative to the grid cell, so $b_x, b_y$ have to be between 0 and 1, but $b_h, b_w$ could be greater than 1.



6 - intersection over union

How can you tell whether your object detection algorithm is working well? We will use intersection over union (IoU) both for evaluating the object detection algorithm and as a component added to the object detection algorithm to make it work even better.



By convention, 0.5 is very often used as a threshold to judge whether the predicted bounding box is correct or not. This is one way to **map localization to accuracy**, where you just count up the number of times the algorithm correctly detects and localizes an object.

The motivation for defining IoU:

  • evaluate whether or not the object localization algorithm is accurate
  • a measure of the overlap between two bounding boxes, or how similar two boxes are to each other
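IoU is straightforward to compute for axis-aligned boxes. A minimal sketch, assuming boxes are given as (x1, y1, x2, y2) corner coordinates:

```python
def iou(box1, box2):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    # corners of the intersection rectangle
    xi1, yi1 = max(box1[0], box2[0]), max(box1[1], box2[1])
    xi2, yi2 = min(box1[2], box2[2]), min(box1[3], box2[3])
    inter = max(0, xi2 - xi1) * max(0, yi2 - yi1)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - inter
    return inter / union if union > 0 else 0.0
```

For example, `iou((0, 0, 2, 2), (1, 1, 3, 3))` overlaps in a 1 by 1 square, giving 1/7; identical boxes give 1.0 and disjoint boxes give 0.0.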

7 - non-max suppression

The algorithm may find multiple detections of the same object rather than detecting an object just once, and non-max suppression is a way for you to make sure that your algorithm detects each object only once.



Technically only the two grid cells that contain the cars' midpoints should predict that there is a car, one for each car; but in practice, other cells may also think they have found a car. What non-max suppression does is clean up these detections. Non-max suppression means that you output your maximal-probability classifications but suppress the close-by ones that are non-maximal.

non-max suppression algorithm:

Let's say you are only doing car detection; then for each of the 19 by 19 positions you will get an output prediction of the following form:

each output prediction is: $\begin{bmatrix} p_c & b_x & b_y & b_h & b_w \end{bmatrix}^T$

Discard all boxes with $p_c \le 0.6$; this discards all the low-probability output boxes.

While there are any remaining boxes:

  • pick the box with the largest pc p c and output that as a prediction
  • discard any remaining box with IoU $\ge 0.5$ with the box output in the previous step

Repeat until you have taken each of the boxes and either output it as a prediction or discarded it for having too high an IoU with one of the boxes you have already output as a predicted position for one of the detected objects.
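The loop above can be sketched for a single class. Boxes are assumed here to be (p_c, x1, y1, x2, y2) tuples, and the thresholds match the 0.6 and 0.5 values from the text:

```python
def iou(b1, b2):
    """IoU of two (x1, y1, x2, y2) boxes."""
    xi1, yi1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    xi2, yi2 = min(b1[2], b2[2]), min(b1[3], b2[3])
    inter = max(0, xi2 - xi1) * max(0, yi2 - yi1)
    union = ((b1[2] - b1[0]) * (b1[3] - b1[1])
             + (b2[2] - b2[0]) * (b2[3] - b2[1]) - inter)
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, pc_threshold=0.6, iou_threshold=0.5):
    """boxes: list of (pc, x1, y1, x2, y2) tuples for a single class."""
    # step 1: discard all boxes with pc <= pc_threshold
    boxes = sorted((b for b in boxes if b[0] > pc_threshold),
                   key=lambda b: b[0], reverse=True)
    kept = []
    while boxes:
        best = boxes.pop(0)  # pick the box with the largest remaining pc
        kept.append(best)
        # discard remaining boxes that overlap the pick too much
        boxes = [b for b in boxes if iou(best[1:], b[1:]) < iou_threshold]
    return kept
```

For three object classes you would call this three times, once per class, as described below.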

If you are actually detecting three classes of objects, it turns out the right thing to do is to independently carry out non-max suppression three times, once for each of the classes.

8 - anchor boxes

One of the problems with object detection so far is that each grid cell can detect only one object. What if a grid cell wants to detect multiple objects? This is what anchor boxes do.



Notice that both the midpoint of the pedestrian and the midpoint of the car fall into the same grid cell. We predefine two anchor box shapes as follows:



  • previously:
    • each object in the training image is assigned to the grid cell that contains that object's midpoint. So the dimension of the output y is 3 by 3 by 8
  • with two anchor boxes:
    • each object in the training image is assigned to the grid cell that contains the object's midpoint and to the anchor box with the highest IoU with the object's shape, so the dimension of the output y is 3 by 3 by 16
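Picking the anchor with the highest IoU can be sketched by comparing shapes only: both boxes are imagined centered at the same point, so the intersection is just the overlap of the widths and heights. This is an illustrative helper, not course code:

```python
def assign_anchor(object_wh, anchors):
    """Return the index of the anchor whose (w, h) shape has the
    highest IoU with the object's (w, h), both centered together."""
    ow, oh = object_wh
    best_k, best_iou = 0, -1.0
    for k, (aw, ah) in enumerate(anchors):
        inter = min(ow, aw) * min(oh, ah)   # centered boxes overlap this much
        union = ow * oh + aw * ah - inter
        box_iou = inter / union
        if box_iou > best_iou:
            best_k, best_iou = k, box_iou
    return best_k
```

With anchors [(1, 2), (2, 1)] (tall vs. wide), a 0.4 by 1.5 pedestrian picks the tall anchor and a 1.8 by 0.8 car picks the wide one.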



If one of these grid cells has only a car and no pedestrian, then only the entries for the anchor box that best matches the car's shape are filled in, and $p_c$ for the other anchor box is 0.



Additional details:

  • if you have two anchor boxes but 3 objects in the same grid cell, that's one case this algorithm doesn't handle well
  • if two objects are associated with the same grid cell and both have the same anchor box shape, that's another case this algorithm doesn't handle well

Anchor boxes are a way to deal with two objects appearing in the same grid cell; in practice that happens quite rarely, especially if you use a 19 by 19 rather than a 3 by 3 grid. Maybe an even better motivation for anchor boxes is that they allow your learning algorithm to specialize better; in particular, if your dataset has some tall, skinny objects, this allows some of the outputs to specialize in detecting tall, skinny objects.

9 - putting it together: YOLO algorithm

We have already seen most of the components of object detection; now let's put all the components together to form the YOLO object detection algorithm.

How to construct training set:

Suppose that we are trying to train an algorithm to detect 3 classes of objects: pedestrians, cars, and motorcycles. If we are using two anchor boxes, then the output y will be 3 by 3 by 16, and to construct the training set you need to go through each of the 9 grid cells and form the target vector y. The target y corresponding to most of the grid cells would be:

$y = \begin{bmatrix} 0 & ? & ? & ? & ? & ? & ? & ? & 0 & ? & ? & ? & ? & ? & ? & ? \end{bmatrix}^T$

where $p_c$ for the first anchor box is 0, because there is nothing associated with the first anchor box, and it is also 0 for the second anchor box. But for the lower middle grid cell the target vector y would be:

$y = \begin{bmatrix} 0 & ? & ? & ? & ? & ? & ? & ? & 1 & b_x & b_y & b_h & b_w & 0 & 1 & 0 \end{bmatrix}^T$

So that is why the final output volume is going to be 3 by 3 by 16. In practice it may be 19 by 19 by 5 by 8, where 5 stands for the 5 anchor boxes. So we train a ConvNet that inputs an image, maybe 100 by 100 by 3, and the ConvNet then finally outputs a volume of shape 3 by 3 by 16.
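The lower-middle-cell target can be written out as a concrete 3 by 3 by 16 volume. The slice layout (first 8 entries for anchor 1, last 8 for anchor 2) follows the vector above, while the specific box numbers are made-up placeholders:

```python
import numpy as np

grid, n_anchors, n_classes = 3, 2, 3
depth = n_anchors * (5 + n_classes)      # 2 * 8 = 16
y = np.zeros((grid, grid, depth))

# lower middle cell: the car is matched to the second anchor box
row, col = 2, 1
bx, by, bh, bw = 0.5, 0.4, 0.9, 1.1      # placeholder box, relative to the cell
offset = 8                               # second anchor's slice starts at index 8
y[row, col, offset] = 1.0                # pc = 1 for anchor 2
y[row, col, offset + 1: offset + 5] = [bx, by, bh, bw]
y[row, col, offset + 6] = 1.0            # class c2 (car)
# anchor 1's pc stays 0 in this cell; its other entries are don't-cares
```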



How to make prediction:



After getting the output volume from the ConvNet, we run non-max suppression on it. If we are using two anchor boxes, then for each of the nine grid cells you get two predicted bounding boxes, some of which will have a very low probability $p_c$.



Next we get rid of the low-probability predictions, the ones where even the neural network says the object probably isn't there.



Next, for each of the three classes, we independently run non-max suppression for objects of that class; so we run non-max suppression 3 times to generate the final predictions.



YOLO Algorithm:

  • for each grid cell, get 2 predicted bounding boxes
  • get rid of the low-probability predictions
  • for each class (pedestrian, car, motorcycle), use non-max suppression to generate the final predictions

10 - region proposals

The sliding windows object detection algorithm takes a trained classifier and runs it across all the regions of the image to see whether there is a car, a motorcycle, or a pedestrian in each one. You could also run the algorithm convolutionally to cut the computational cost, but one downside of the algorithm is that it classifies a lot of regions where there is clearly no object. What R-CNN (Regions with CNNs) does is try to pick a few regions that make sense to run the ConvNet classifier on. The way it performs region proposals is to run a segmentation algorithm, place bounding boxes around the resulting blobs, and run the classifier on just those blobs. This can give a much smaller number of positions on which to run the classifier. And notably, you run the ConvNet not just on square regions; you run it at multiple scales and on multiple shapes.


