Source (Zhihu): https://zhuanlan.zhihu.com/p/2491678
You Only Look Once: Unified, Real-Time Object Detection(YOLO)
- Question
- Prior work on object detection repurposes classifiers to perform detection. Can we instead frame object detection as a regression problem?
- Object proposals and classification are separate stages. Can we predict bounding boxes and class probabilities jointly?
- Solution
Frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities.
- Advantage and Contribution
- A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes.
- Simple and fast: no complex pipeline is needed, and the base model runs at 45 frames per second with no batch processing.
- Because YOLO reasons globally about the image when making predictions, it makes fewer than half the number of background errors of Fast R-CNN.
- Following GoogLeNet's inception modules, the network uses 1×1 reduction layers followed by 3×3 convolutional layers, which greatly reduces computation and parameter count.
- Predict the square root of the bounding box width and height instead of the width and height directly, so that a given deviation is penalized more in a small box than in a large one.
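The effect of the square-root parameterization can be checked with a quick numeric sketch (the box widths here are illustrative values, not from the paper):

```python
import math

def sqrt_space_error(w_true, w_pred):
    """Error measured on sqrt(w), as in the YOLO loss."""
    return abs(math.sqrt(w_true) - math.sqrt(w_pred))

# Same absolute error of 5 pixels in a large and a small box:
large = sqrt_space_error(100, 95)   # ~0.25
small = sqrt_space_error(10, 5)     # ~0.93

# The sqrt parameterization penalizes the small box far more,
# even though the raw width error (5 px) is identical.
print(f"large box: {large:.2f}, small box: {small:.2f}")
```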
- Weakness
- Each grid cell predicts only two boxes and can have only one class. This spatial constraint limits the number of nearby objects the model can predict; in particular, the model is less accurate on small objects that appear in groups.
- The model also uses relatively coarse features for predicting bounding boxes since the architecture has multiple downsampling layers from the input image.
- The loss function treats errors the same in small bounding boxes versus large bounding boxes, even though a small error matters far more for a small box, so incorrect localizations are a major source of error.
- Model overview
Define confidence:
C = Pr(Object) * IOU(pred, truth)
Class-specific confidence score for each box (used at test time):
Pr(Class_i | Object) * Pr(Object) * IOU(pred, truth) = Pr(Class_i) * IOU(pred, truth)
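These definitions can be made concrete with a small sketch (box coordinates, the corner format (x1, y1, x2, y2), and the probability values are assumptions for illustration):

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

pred, truth = (0, 0, 2, 2), (1, 1, 3, 3)
pr_object = 1.0                                   # cell contains an object
confidence = pr_object * iou(pred, truth)         # C = Pr(Object) * IOU
pr_class_given_object = 0.8                       # assumed Pr(Class_i | Object)
class_score = pr_class_given_object * confidence  # = Pr(Class_i) * IOU
print(confidence, class_score)  # 1/7 ≈ 0.143, then scaled by 0.8
```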
Detection principle
Network structure
Alternating 1 × 1 convolutional layers reduce the feature space from the preceding layers. The convolutional layers are pretrained on the ImageNet classification task at half the resolution (224 × 224 input image), and the resolution is then doubled to 448 × 448 for detection.
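The savings from the 1×1 reduction layers can be seen by counting convolution weights (the channel sizes 512 and 256 here are assumed for illustration, not taken from the YOLO architecture):

```python
# Direct 3x3 convolution mapping 512 -> 512 channels.
direct = 3 * 3 * 512 * 512                         # 2,359,296 weights

# GoogLeNet-style reduction: 1x1 conv 512 -> 256, then 3x3 conv 256 -> 512.
reduced = 1 * 1 * 512 * 256 + 3 * 3 * 256 * 512    # 1,310,720 weights

print(f"direct: {direct:,}  reduced: {reduced:,}  saving: {1 - reduced / direct:.0%}")
```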
Implementation steps:
1) Resize the input image to 448 × 448 and divide it into an S × S grid.
2) For each grid cell, the network (24 convolutional layers followed by 2 fully connected layers) outputs a vector of 30 dimensions: B × 5 + C, where B = 2 bounding boxes, each with 5 predictions (x, y, w, h, and confidence), and C = 20 labelled classes. The final output of the network is the 7 × 7 × 30 tensor of predictions.
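Decoding one grid cell's 30-dimensional vector can be sketched in plain Python (a random tensor stands in for the network output; the dict layout is an illustrative choice):

```python
import random

S, B, C = 7, 2, 20            # grid size, boxes per cell, classes
random.seed(0)
# Stand-in for the 7 x 7 x 30 network output.
output = [[[random.random() for _ in range(B * 5 + C)]
           for _ in range(S)] for _ in range(S)]

def decode_cell(cell):
    """Split a 30-dim cell vector into B boxes and C class probabilities."""
    boxes = []
    for b in range(B):
        x, y, w, h, conf = cell[b * 5: b * 5 + 5]
        boxes.append({"x": x, "y": y, "w": w, "h": h, "conf": conf})
    class_probs = cell[B * 5:]
    # Class-specific score per box: Pr(Class_i | Object) * confidence.
    scores = [[p * box["conf"] for p in class_probs] for box in boxes]
    return boxes, class_probs, scores

boxes, class_probs, scores = decode_cell(output[0][0])
print(len(boxes), len(class_probs), len(scores[0]))  # 2 20 20
```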
The multi-part loss function at train time
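Written out, the multi-part sum-squared loss from the paper is (with λ_coord = 5 and λ_noobj = 0.5, where 𝟙ᵢⱼ^obj selects the predictor responsible for an object in cell i):

```latex
\lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
  \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right]
+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
  \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2
       + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right]
+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} (C_i - \hat{C}_i)^2
+ \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} (C_i - \hat{C}_i)^2
+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in classes} (p_i(c) - \hat{p}_i(c))^2
```

The square-root terms in the second sum implement the width/height trick noted under Advantage and Contribution.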
- 3) Non-maximal suppression is used to produce the final detections.
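The suppression step can be sketched as a greedy loop (a minimal sketch; the 0.5 IoU threshold and the example scores are typical choices, not values from the paper):

```python
def nms(boxes, iou_threshold=0.5):
    """Greedy non-maximal suppression. boxes: list of (score, (x1, y1, x2, y2))."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    remaining = sorted(boxes, key=lambda t: t[0], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)           # highest-scoring box survives
        kept.append(best)
        # Drop boxes that overlap the survivor too much.
        remaining = [b for b in remaining
                     if iou(best[1], b[1]) < iou_threshold]
    return kept

dets = [(0.9, (0, 0, 10, 10)), (0.8, (1, 1, 10, 10)), (0.7, (20, 20, 30, 30))]
print(nms(dets))  # keeps the 0.9 and 0.7 boxes; the 0.8 box overlaps the 0.9
```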