Paper Reading: YOLOv1&v2&v3


YOLOv1: You Only Look Once: Unified, Real-Time Object Detection
YOLOv2: YOLO9000: Better, Faster, Stronger
YOLOv3: YOLOv3: An Incremental Improvement


At the time of writing, YOLO is one of the leading real-time networks for object detection. This note covers how it works and how it evolved over its first three versions.

1 YOLO v1

The name YOLO is an abbreviation of “You Only Look Once”: the network looks at the input image only once, in a single forward pass, to produce its detections.

1.1 Previous works

Previous detectors first propose potential bounding boxes and then run a classifier on each proposal. After that, post-processing refines the boxes, eliminates duplicate detections, and rescores the remaining boxes.

1.2 YOLO’s method

The authors reframe object detection as a regression problem, in which bounding box coordinates and class probabilities are predicted straight from image pixels. Here is how it is designed:

  1. Divide the input image into an $S \times S$ grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.
  2. Each grid cell predicts $B$ bounding boxes and confidence scores for those boxes. Each bounding box consists of 5 predictions: $x, y, w, h$, and confidence.
  3. Each grid cell predicts $C$ conditional class probabilities, $\Pr(\text{Class}_i \mid \text{Object})$. These probabilities are conditioned on the grid cell containing an object, and there is only one set per grid cell regardless of the number of boxes $B$.
  4. At test time, multiply the conditional class probabilities by the individual box confidence predictions to get a class-specific confidence score for each box.

Figure: method of YOLO v1
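
Steps 1 and 4 above can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions: the function names `responsible_cell` and `class_specific_scores` are hypothetical, the object center is given in pixels, and the numbers are made up rather than taken from the paper.

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return (row, col) of the grid cell that contains the object center.

    (cx, cy) is the object center in pixels; the image is split into S x S cells.
    """
    col = min(int(cx / img_w * S), S - 1)  # clamp centers on the right edge
    row = min(int(cy / img_h * S), S - 1)  # clamp centers on the bottom edge
    return row, col


def class_specific_scores(box_confidences, class_probs):
    """Multiply every box confidence by every conditional class probability.

    Returns one list of C class-specific scores per box.
    """
    return [[conf * p for p in class_probs] for conf in box_confidences]


# Made-up values: one grid cell with B = 2 boxes and C = 3 classes.
cell = responsible_cell(cx=300, cy=100, img_w=448, img_h=448)
scores = class_specific_scores([0.5, 0.25], [0.5, 0.25, 0.25])
print(cell, scores)
```

Note that the class probabilities are shared by both boxes of the cell, which is why the second box's scores are just a scaled copy of the first.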

1.3 YOLO’s network architecture and training

1.3.1 Architecture

The authors select $S=7$, $B=2$, $C=20$ (PASCAL VOC has 20 labelled classes), as shown in the figure above. Below is the full architecture of their network, with 24 convolutional layers followed by 2 fully connected layers. Notice that the final output is a $7 \times 7 \times 30$ tensor, where $S=7$ and $C + 5 \times B = 30$.
Figure: architecture of YOLO v1
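
The size of the output tensor follows directly from $S$, $B$ and $C$. A tiny check, where the helper name `yolo_v1_output_shape` is my own and not something from the paper or from Darknet:

```python
def yolo_v1_output_shape(S=7, B=2, C=20):
    """Each cell predicts B boxes of 5 values (x, y, w, h, confidence) plus C class probabilities."""
    depth = B * 5 + C
    return (S, S, depth)


shape = yolo_v1_output_shape()
num_values = shape[0] * shape[1] * shape[2]
print(shape, num_values)
```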

1.3.2 Training

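  1. Pretrain the first 20 convolutional layers, followed by an average-pooling layer and a fully connected layer, on ImageNet.
  2. Add 4 more convolutional layers and the 2 fully connected layers (randomly initialized) to form the final detection network, and then increase the input resolution from $224 \times 224$ to $448 \times 448$.
  3. Train with a sum-squared error loss in which the bounding box coordinate error is weighted up ($\lambda_{\text{coord}} = 5$) and the confidence error for boxes that contain no object is weighted down ($\lambda_{\text{noobj}} = 0.5$). For the remaining hyperparameters you may refer to Section 2.2 in the paper.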
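
For reference, here is the multi-part loss from Section 2.2 as I read it in the paper, with $\lambda_{\text{coord}} = 5$ and $\lambda_{\text{noobj}} = 0.5$. Be careful with the notation: inside this formula $C_i$ is the predicted confidence, not the number of classes. $\mathbb{1}_{ij}^{\text{obj}}$ is 1 when the $j$-th box of cell $i$ is responsible for an object, and $\mathbb{1}_{i}^{\text{obj}}$ is 1 when an object appears in cell $i$. Please check it against the paper before relying on it.

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
  \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
  \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```

The square roots on $w$ and $h$ are there so that the same absolute error matters more for a small box than for a large one.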

1.3.3 Final process

To deal with duplicate detections of the same object, they apply non-maximal suppression to the predicted boxes.
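
A minimal sketch of greedy non-maximal suppression in plain Python is shown below. It is not the authors' implementation: the `(x1, y1, x2, y2)` box format, the function names and the IoU threshold of 0.5 are all my own assumptions for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score, keep a box only if it
    does not overlap an already kept box by more than iou_threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Made-up example: the first two boxes overlap heavily, the third is far away.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

In a multi-class detector this would normally be run once per class on the class-specific scores.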

1.4 Result

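According to the paper, YOLOv1 performs very well in speed: the base model runs in real time at 45 frames per second. Its accuracy is competitive with other real-time detectors of its time, although it makes more localization errors than region-proposal methods.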

2 YOLO v2

This version of YOLO is introduced in the YOLO9000 paper, whose name indicates that the model can detect over 9000 object categories.

2.1 Methods to achieve better performance

  1. Batch Normalization.
  2. High Resolution Classifier.
  3. Convolutional With Anchor Boxes.
  4. Dimension Clusters.
  5. Direct location prediction.
  6. Fine-Grained Features.
  7. Multi-Scale Training.
These methods turn out to be useful for increasing accuracy. Details are shown below:
Figure: results for the methods that improve the accuracy of YOLO v2
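
Of the methods listed above, direct location prediction (item 5) is the easiest to show in code. As I read the YOLO9000 paper, the network outputs $t_x, t_y, t_w, t_h$ for each anchor box, and the final box is $b_x = \sigma(t_x) + c_x$, $b_y = \sigma(t_y) + c_y$, $b_w = p_w e^{t_w}$, $b_h = p_h e^{t_h}$, where $(c_x, c_y)$ is the offset of the grid cell and $(p_w, p_h)$ is the anchor size. The function name `decode_box` and the sample numbers below are hypothetical.

```python
import math


def sigmoid(t):
    """Logistic function, squashes t into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))


def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Turn raw network outputs into a box, in grid-cell units.

    (cx, cy) is the top-left offset of the grid cell,
    (pw, ph) is the width and height of the anchor (prior) box.
    """
    bx = sigmoid(tx) + cx   # center stays inside the cell
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)  # anchor size is rescaled, never negative
    bh = ph * math.exp(th)
    return bx, by, bw, bh


# Made-up example: all-zero outputs give the anchor box centered in cell (3, 4).
box = decode_box(0.0, 0.0, 0.0, 0.0, cx=3, cy=4, pw=2.0, ph=1.5)
print(box)
```

Because of the sigmoid, the predicted center can never leave its grid cell, which is what makes training more stable than predicting unconstrained offsets.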

2.2 Methods to be faster

They propose a new network called Darknet-19, which contains 19 convolutional layers and 5 max-pooling layers, as the base of the YOLO network. Below is the architecture of Darknet-19:
Figure: architecture of Darknet-19
The training process is similar to YOLOv1: pretrain for classification, and then train for detection.

2.3 Hierarchical structure for YOLO9000

ImageNet labels come from WordNet, which is structured as a directed graph rather than a tree. For example, ‘dog’ is both a type of ‘canine’ and a type of ‘domestic animal’, so a single concept can have more than one parent. Since a tree is more practical to work with than the full graph, the authors build a hierarchical tree, called WordTree, from the concepts in ImageNet. The tree looks like this:
Figure: WordTree from YOLO v2
The tree contains over 9000 object categories, which gives YOLO9000 its name. The advantage of such a structure is that when the network knows that something is a ‘dog’ but is not sure exactly what kind of dog, the category ‘dog’ can still get a high score.
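
To see why ‘dog’ can score high while a specific breed scores low, note that each node in WordTree predicts a probability conditioned on its parent, and the absolute probability of a node is the product of the conditional probabilities along the path to the root. The sketch below uses a tiny hypothetical tree and made-up probabilities; the real WordTree is far larger, and the names `PARENT` and `absolute_probability` are mine.

```python
# Hypothetical miniature tree, stored as child -> parent. The root has parent None.
PARENT = {
    "physical object": None,
    "animal": "physical object",
    "dog": "animal",
    "terrier": "dog",
    "Norfolk terrier": "terrier",
}


def absolute_probability(node, cond_prob, parent=PARENT):
    """Multiply Pr(node | parent) along the path from node up to the root.

    The root itself is assumed to have probability 1.
    """
    p = 1.0
    while parent[node] is not None:
        p *= cond_prob[node]
        node = parent[node]
    return p


# Made-up conditional probabilities Pr(node | its parent).
cond_prob = {"animal": 1.0, "dog": 0.875, "terrier": 0.5, "Norfolk terrier": 0.25}

p_dog = absolute_probability("dog", cond_prob)
p_breed = absolute_probability("Norfolk terrier", cond_prob)
print(p_dog, p_breed)
```

Here the network is confident that the object is a dog but unsure about the breed, so a detector that stops descending the tree at a confidence threshold would report ‘dog’.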

2.4 Conclusion for YOLOv2 and YOLO9000

In fact, the authors propose 2 networks in this paper. One is YOLOv2, which is faster and more accurate than YOLOv1. The other is YOLO9000, which detects more than 9000 object categories using the WordTree structure.

3 YOLO v3

The YOLOv3 paper is written in a relaxed, casual tone, but it contains a lot of good techniques. To understand YOLO v3, I recommend this article.
Judging from the paper as well as from the applications of YOLO v3, it is fair to say that YOLO v3 is a big success. Since I am still working out why it is so fast, I will not go into depth here. In any case, its backbone, Darknet-53, is well worth analyzing.
