Paper Reading: YOLOv1&v2&v3

surtol

Category: paper reading

Article link: https://blog.csdn.net/surtol/article/details/98338359


YOLOv1: You Only Look Once: Unified, Real-Time Object Detection
YOLOv2: YOLO9000: Better, Faster, Stronger
YOLOv3: YOLOv3: An Incremental Improvement

YOLO was among the state-of-the-art networks for real-time object detection when this post was written. Here is how it works and how it evolved from v1 to v3.

1 YOLO v1

The name YOLO is an abbreviation of “You Only Look Once”, which indicates that the network looks at the input image only once.

1.1 Previous works

Previous networks first propose potential bounding boxes and then run a classifier on each of them. After that, they post-process the results to refine the boxes, eliminate duplicate detections, and rescore the boxes.

1.2 YOLO’s method

The authors reframe object detection as a regression task, where bounding box coordinates and class probabilities come straight from the image pixels. Here is how it is designed:

Divide the input image into an $S \times S$ grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.
Each grid cell predicts $B$ bounding boxes and confidence scores for those boxes. Each bounding box consists of 5 predictions: $x, y, w, h$, and confidence.
Each grid cell predicts $C$ conditional class probabilities, $\Pr(\text{Class}_i \mid \text{Object})$. These probabilities are conditioned on the grid cell containing an object, and only one set is predicted per cell, regardless of the number of boxes $B$.
Multiply the conditional class probabilities by the individual box confidence predictions to get a class-specific confidence score for each box.
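The two ideas in the steps above, cell responsibility and class-specific scores, can be sketched in a few lines of plain Python. This is only an illustration: the function names, the pixel coordinates, and the probability values are made up for the example and do not come from the paper's code.

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return (row, col) of the grid cell that contains the object center (cx, cy), in pixels."""
    col = min(int(cx / img_w * S), S - 1)  # clamp so a center on the right edge stays in the grid
    row = min(int(cy / img_h * S), S - 1)
    return row, col


def class_specific_scores(class_probs, box_confidences):
    """For every box in a cell, multiply its confidence by each conditional class probability."""
    return [[conf * p for p in class_probs] for conf in box_confidences]


# Hypothetical example: a 448x448 image with an object centered at (300, 100).
cell = responsible_cell(300, 100, 448, 448)

# Hypothetical cell output: 3 classes, 2 boxes (the paper uses C = 20 and B = 2).
scores = class_specific_scores([0.7, 0.2, 0.1], [0.9, 0.3])
print(cell, scores)
```

Each row of `scores` belongs to one box, so a cell with $B$ boxes and $C$ classes yields $B \times C$ scores.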

Figure: methods for YOLO v1

1.3 YOLO’s network architecture and training

1.3.1 Architecture

The authors select $S = 7, B = 2, C = 20$ (PASCAL VOC has 20 labelled classes), as shown above. Below is the full architecture of their network, with 24 convolutional layers followed by 2 fully connected layers. Notice that the output of the last layer is $7 \times 7 \times 30$, that is $S \times S \times (B \times 5 + C)$ with $S = 7$ and $B \times 5 + C = 30$.
Figure: architecture of YOLO v1
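The output size follows directly from the grid design. A tiny sketch of the arithmetic, where the function name is just an illustrative choice:

```python
def yolo_v1_output_shape(S=7, B=2, C=20):
    """Each cell predicts B boxes with 5 values (x, y, w, h, confidence) plus C class probabilities."""
    return (S, S, B * 5 + C)


shape = yolo_v1_output_shape()
total_values = shape[0] * shape[1] * shape[2]
print(shape, total_values)
```

With the paper's settings this gives a $7 \times 7 \times 30$ tensor, or 1470 predicted numbers per image.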

1.3.2 Training

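Pretrain the first 20 convolutional layers, followed by an average-pooling layer and a fully connected layer, on ImageNet.
Add the last 4 convolutional layers and 2 fully connected layers to form the final network, and then increase the input resolution from $224 \times 224$ to $448 \times 448$.
Train with a weighted sum-squared error loss: the weight of the box coordinate loss is increased ($\lambda_{coord} = 5$) and the weight of the confidence loss for boxes that contain no object is decreased ($\lambda_{noobj} = 0.5$). For the remaining hyperparameters, you may refer to Section 2.2 in the paper.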
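For reference, the loss from Section 2.2 of the paper can be written as below. Here $\mathbb{1}_{ij}^{obj}$ is 1 when box $j$ in cell $i$ is responsible for an object, $\mathbb{1}_{ij}^{noobj}$ is 1 when it is not, $\mathbb{1}_{i}^{obj}$ is 1 when an object appears in cell $i$, and hatted and unhatted symbols are the two sides of each comparison (prediction and ground truth). The square roots on $w$ and $h$ make an error of a given size count more for small boxes than for large ones.

$$
\begin{aligned}
\mathcal{L} ={}& \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left( C_i - \hat{C}_i \right)^2 \\
&+ \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
$$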

1.3.3 Final process

To deal with duplicate detections of the same object, they apply non-maximal suppression (NMS).
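A minimal sketch of greedy non-maximal suppression in plain Python is shown below. It is the common textbook variant, not necessarily what the authors' implementation does: the box format `(x1, y1, x2, y2)`, the 0.5 threshold, and the sample boxes are all assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop boxes that overlap it too much, and repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


# Hypothetical detections: the first two boxes cover the same object.
boxes = [(10, 10, 110, 110), (12, 12, 112, 112), (200, 200, 300, 300)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

In practice this is usually run once per class, using the class-specific confidence scores as `scores`.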

1.4 Result

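YOLOv1 is extremely fast: the paper reports 45 frames per second for the base network and 155 frames per second for Fast YOLO. Its accuracy is strong for a real-time detector (63.4 mAP on PASCAL VOC 2007 for the base network), although the paper notes that it still trails the most accurate, slower detectors of the time.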

2 YOLO v2

The paper for this version of YOLO is titled YOLO9000, which indicates that it can detect over 9000 object categories.

2.1 Methods to achieve better performance

Batch Normalization.
High Resolution Classifier.
Convolutional With Anchor Boxes.
Dimension Clusters.
Direct location prediction.
Fine-Grained Features.
Multi-Scale Training.
These methods turn out to be useful for increasing accuracy; the paper gives the detailed gain of each one.
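As one concrete example from the list, direct location prediction can be sketched as follows. In the paper, the network outputs $t_x, t_y, t_w, t_h$ for each anchor box, and the final box is $b_x = \sigma(t_x) + c_x$, $b_y = \sigma(t_y) + c_y$, $b_w = p_w e^{t_w}$, $b_h = p_h e^{t_h}$, where $(c_x, c_y)$ is the offset of the grid cell and $(p_w, p_h)$ is the anchor (prior) size. The function names and the sample numbers below are illustrative assumptions.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h):
    """Turn raw network outputs into a box, in grid-cell units."""
    b_x = c_x + sigmoid(t_x)   # the sigmoid keeps the center inside the cell
    b_y = c_y + sigmoid(t_y)
    b_w = p_w * math.exp(t_w)  # width and height scale the anchor prior
    b_h = p_h * math.exp(t_h)
    return b_x, b_y, b_w, b_h


# Hypothetical values: all-zero outputs in cell (3, 4) with a 2.0 x 1.5 anchor.
box = decode_box(0.0, 0.0, 0.0, 0.0, c_x=3, c_y=4, p_w=2.0, p_h=1.5)
print(box)
```

Because the sigmoid bounds the center offset to the range 0 to 1, the predicted center cannot leave its grid cell, which the paper credits with making training more stable.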

2.2 Methods to be faster

They propose a new network called Darknet-19, which contains 19 convolutional layers and 5 max-pooling layers, as the base of the YOLOv2 network. Below is the architecture of Darknet-19:
Figure: architecture of Darknet-19
The training process is similar to YOLOv1: pretrain for classification, and then train for detection.

2.3 Hierarchical structure for YOLO9000

ImageNet labels come from WordNet, which is structured as a directed graph rather than a tree. For example, ‘dog’ is both a type of ‘canine’ and a type of ‘domestic animal’, so a concept can have more than one parent. Working with the full graph is complicated, so the authors simplify the problem by building a hierarchical tree, called WordTree, from the concepts in ImageNet. The tree looks like this:
Figure: WordTree from YOLO v2
The tree contains over 9000 object categories, which gives YOLO9000 its name. The advantage of such a structure is that when the network knows an object is a ‘dog’ but is not sure exactly what kind of dog, the category ‘dog’ can still get a high score.
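The reason ‘dog’ can score well even when the breed is uncertain is that the network predicts a conditional probability at every node, and the absolute probability of a node is the product of the conditionals along the path to the root. A minimal sketch, assuming a made-up five-node tree and made-up probabilities:

```python
# Hypothetical toy tree: child -> parent. The root has parent None.
PARENT = {
    "physical object": None,
    "animal": "physical object",
    "dog": "animal",
    "terrier": "dog",
    "Norfolk terrier": "terrier",
}

# Hypothetical conditional probabilities Pr(node | parent) predicted by the network.
COND_PROB = {
    "physical object": 1.0,
    "animal": 0.9,
    "dog": 0.9,
    "terrier": 0.5,
    "Norfolk terrier": 0.4,
}


def absolute_probability(node):
    """Multiply the conditional probabilities on the path from the node up to the root."""
    prob = 1.0
    while node is not None:
        prob *= COND_PROB[node]
        node = PARENT[node]
    return prob


p_dog = absolute_probability("dog")
p_norfolk = absolute_probability("Norfolk terrier")
print(p_dog, p_norfolk)
```

In this toy example ‘dog’ keeps a probability of about 0.81 while ‘Norfolk terrier’ drops to about 0.16, so a detector that stops at the deepest node above some threshold would report ‘dog’.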

2.4 Conclusion for YOLOv2 and YOLO9000

In fact, the authors propose 2 networks in this paper. One is YOLOv2, which is faster and more accurate than YOLOv1. The other is YOLO9000, which detects more than 9000 object categories using the WordTree structure.

3 YOLO v3

The YOLOv3 paper is written in a relaxed and casual tone, but it does contain a lot of good techniques. To understand YOLO v3, I recommend this article.
From that article as well as the applications of YOLO v3, we can conclude that YOLO v3 is a big success. Since I am still working out why it is so fast, I will not go into depth here. Still, its backbone, Darknet-53, is well worth analyzing.