YOLOv1: You Only Look Once: Unified, Real-Time Object Detection
YOLOv2: YOLO9000: Better, Faster, Stronger
YOLOv3: YOLOv3: An Incremental Improvement
At the time of writing, YOLO is among the state-of-the-art networks for object detection. This post explains how it works and how it has evolved.
1 YOLO v1
The name YOLO is an abbreviation of “You Only Look Once”, which indicates that the network looks at the input image only once.
1.1 Previous works
Previous networks propose potential bounding boxes and run a classifier on each of them. After that, they post-process the results to refine the boxes, eliminate duplicate detections, rescore the boxes, and so on.
1.2 YOLO’s method
The authors reframe object detection as a regression task, in which bounding-box coordinates and class probabilities are predicted straight from image pixels. Here is how it is designed:
- Divides the input image into an $S \times S$ grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.
- Each grid cell predicts $B$ bounding boxes and confidence scores for those boxes. Each bounding box consists of 5 predictions: $x, y, w, h$, and confidence.
- Each grid cell predicts $C$ conditional class probabilities, $\Pr(\text{Class}_i \mid \text{Object})$. These probabilities are conditioned on the grid cell containing an object, regardless of the number of boxes $B$.
- At test time, multiply the conditional class probabilities by the individual box confidence predictions to get a class-specific confidence score for each box.
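The multiplication in the last step can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function name `class_specific_scores` and the example numbers are made up.

```python
def class_specific_scores(box_confidences, class_probs):
    """Return one score per (box, class) pair for a single grid cell.

    box_confidences: B confidence values, one per predicted box.
    class_probs: C conditional class probabilities Pr(Class_i | Object),
                 shared by all boxes of the cell.
    """
    return [[conf * p for p in class_probs] for conf in box_confidences]


# Hypothetical cell with B = 2 boxes and C = 3 classes (made-up numbers).
scores = class_specific_scores([0.5, 0.25], [0.5, 0.25, 0.25])
print(scores)  # [[0.25, 0.125, 0.125], [0.125, 0.0625, 0.0625]]
```

Because the class probabilities are shared by the whole cell, the boxes of one cell differ only through their confidence values.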
1.3 YOLO’s network architecture and training
1.3.1 Architecture
The authors select $S=7, B=2, C=20$ (PASCAL VOC has 20 labelled classes), just as shown above. And below is the full architecture of their network, with 24 convolutional layers followed by 2 fully connected layers. Notice that the last layer is $7 \times 7 \times 30$, where $S=7$ and $C + 5 \times B = 30$.
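The output size can be checked with a little arithmetic. The sketch below also splits one cell's 30 values into boxes and class probabilities. The layout it assumes (all boxes first, then the classes) and the helper name `split_cell_prediction` are assumptions made for illustration; real implementations may order the values differently.

```python
def split_cell_prediction(cell, num_boxes=2, num_classes=20):
    """Split one grid cell's prediction vector into boxes and class probabilities.

    Assumed layout: [box_1, ..., box_B, class_1, ..., class_C], with each
    box stored as (x, y, w, h, confidence).
    """
    expected = 5 * num_boxes + num_classes
    if len(cell) != expected:
        raise ValueError(f"expected {expected} values, got {len(cell)}")
    boxes = [tuple(cell[5 * b:5 * b + 5]) for b in range(num_boxes)]
    class_probs = list(cell[5 * num_boxes:])
    return boxes, class_probs


S, B, C = 7, 2, 20
depth = 5 * B + C      # values predicted per grid cell
total = S * S * depth  # values in the whole output tensor
print(depth, total)    # 30 1470

# Dummy cell filled with 0..29, just to show where each value ends up.
boxes, class_probs = split_cell_prediction(list(range(depth)), B, C)
print(boxes[1])        # (5, 6, 7, 8, 9)
```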
1.3.2 Training
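- Pretrain the first 20 convolutional layers, followed by an average-pooling layer and a fully connected layer, on ImageNet.
- Add 4 convolutional layers and 2 fully connected layers to form the final detection network, and increase the input resolution from $224 \times 224$ to $448 \times 448$.
- Reweight the loss terms. The paper uses a sum-squared error loss, increases the weight of the bounding-box coordinate loss ($\lambda_{\text{coord}} = 5$), and decreases the weight of the confidence loss for boxes that contain no object ($\lambda_{\text{noobj}} = 0.5$). For the remaining hyperparameters, refer to Section 2.2 of the paper.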
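For reference, the multi-part loss from Section 2.2 of the paper has the following form. Here $\mathbb{1}_{ij}^{\text{obj}}$ indicates that box $j$ in cell $i$ is responsible for an object, $\mathbb{1}_{i}^{\text{obj}}$ that an object appears in cell $i$, $\lambda_{\text{coord}}$ weights the coordinate terms, and $\lambda_{\text{noobj}}$ weights the confidence term of boxes without an object. Note that $C_i$ in this formula is the box confidence, not the number of classes $C$ used above.

$$
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left(C_i - \hat{C}_i\right)^2 \\
&+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left(C_i - \hat{C}_i\right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left(p_i(c) - \hat{p}_i(c)\right)^2
\end{aligned}
$$

The square roots on $w$ and $h$ make the same absolute error count for more in a small box than in a large one.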
1.3.3 Final process
To deal with duplicate detections of the same object, they apply non-maximal suppression to the predicted boxes.
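As a rough idea of what this step does, here is a minimal greedy non-maximal suppression in plain Python. It is a generic sketch rather than the authors' implementation. The corner format `(x1, y1, x2, y2)`, the threshold of 0.5, and the example boxes are all assumptions, and in practice the procedure is usually run separately for each class.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept one.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two heavily overlapping boxes and one separate box (made-up values).
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
print(kept)  # [0, 2]
```

The second box overlaps the first with an IoU of 81/119, about 0.68, so it is dropped in favour of the higher-scoring one.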
1.4 Result
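YOLOv1 performs very well, above all in speed: the paper reports 63.4% mAP on PASCAL VOC 2007 at 45 frames per second. That is far faster than the two-stage detectors of the time, at the cost of somewhat lower accuracy than the best of them.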
2 YOLO v2
This version of YOLO is called YOLO9000, which indicates that it can detect over 9000 object categories.
2.1 Methods to achieve better performance
- Batch Normalization.
- High Resolution Classifier.
- Convolutional With Anchor Boxes.
- Dimension Clusters.
- Direct location prediction.
- Fine-Grained Features.
- Multi-Scale Training.
These methods all turn out to be useful for increasing accuracy. Details are shown below:
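To make one of these items concrete, direct location prediction can be sketched as follows. The formulas in the docstring are the ones given in the YOLOv2 paper. The function name `decode_box` and the example values are only for illustration, and all quantities are assumed to be in grid-cell units.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h):
    """Decode raw network outputs into a box, in grid-cell units.

    (c_x, c_y) is the top-left corner of the grid cell and (p_w, p_h) is
    the anchor (prior) size:
        b_x = sigmoid(t_x) + c_x,  b_y = sigmoid(t_y) + c_y
        b_w = p_w * exp(t_w),      b_h = p_h * exp(t_h)
    """
    b_x = sigmoid(t_x) + c_x
    b_y = sigmoid(t_y) + c_y
    b_w = p_w * math.exp(t_w)
    b_h = p_h * math.exp(t_h)
    return b_x, b_y, b_w, b_h


# All-zero outputs give the cell center and the unchanged anchor size.
print(decode_box(0.0, 0.0, 0.0, 0.0, c_x=3, c_y=4, p_w=2.0, p_h=1.5))
# (3.5, 4.5, 2.0, 1.5)
```

The sigmoid keeps the predicted center inside its own grid cell, which is what makes the prediction "direct" compared with unconstrained anchor offsets.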
2.2 Methods to be faster
They propose a new network called Darknet-19, which contains 19 convolutional layers and 5 max-pooling layers, as the base of the YOLOv2 network. Below is the architecture of Darknet-19:
The training process is similar to YOLOv1: pretrain for classification, and then train for detection.
2.3 Hierarchical structure for YOLO9000
ImageNet labels come from WordNet, which is structured as a directed graph rather than a tree: for example, ‘dog’ is both a type of ‘canine’ and a type of ‘domestic animal’. Working with a tree is more practical than working with the full graph, so the authors build their own hierarchical tree, called WordTree, from the concepts in ImageNet. The tree looks like this:
The tree contains over 9000 object categories, which gives YOLO9000 its name. The advantage of such a structure is that, when the network knows something is a ‘dog’ but is not quite sure which kind of dog, the category ‘dog’ can still get a high score.
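This behaviour follows from how scores are combined in a tree: the probability of a node is the product of the conditional probabilities along the path from that node up to the root. The sketch below uses a tiny hypothetical tree with made-up probabilities. The function name `absolute_probability` and both dictionaries are invented for this example, and the root is assumed to have probability 1.

```python
def absolute_probability(label, parent, conditional):
    """Multiply conditional probabilities along the path from label to root.

    parent: maps each label to its parent (the root maps to None).
    conditional: maps each non-root label to Pr(label | parent).
    """
    prob = 1.0  # the root is assumed to have probability 1
    while parent[label] is not None:
        prob *= conditional[label]
        label = parent[label]
    return prob


# Hypothetical slice of a tree with made-up conditional probabilities.
parent = {"animal": None, "dog": "animal", "terrier": "dog", "husky": "dog"}
conditional = {"dog": 0.75, "terrier": 0.5, "husky": 0.5}

print(absolute_probability("dog", parent, conditional))      # 0.75
print(absolute_probability("terrier", parent, conditional))  # 0.375
```

Here the network is undecided between the two breeds, so each breed gets a low score, while ‘dog’ itself keeps a high one.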
2.4 Conclusion for YOLOv2 and YOLO9000
In fact, the authors propose 2 networks in this paper. One is YOLOv2, which is faster and more accurate than YOLOv1. The other is YOLO9000, which detects more than 9000 object categories using a WordTree structure.
3 YOLO v3
The paper is written in a relaxed, casual tone, but it contains a lot of good techniques. To understand YOLO v3, I recommend this article.
From the article, as well as from the applications of YOLO v3, we can conclude that YOLO v3 is a big success. Since I am still working out why it is so fast, I will not go into depth here. Still, the backbone, Darknet-53, is well worth analyzing.