[深度学习论文笔记][Object Detection] You Only Look Once: Unified, Real-Time Object Detection

Hao_Zhang_Vision

Article link: https://blog.csdn.net/Hao_Zhang_Vision/article/details/53142068

Redmon, Joseph, et al. “You only look once: Unified, real-time object detection.” arXiv preprint arXiv:1506.02640 (2015). (Citations: 76).

1 Motivation

We frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities.

2 Pipeline

See Fig.
1. Resize the input image to 448 × 448 (448 rather than 224, to capture fine-grained visual information).
2. Divide the input image into an S × S grid (S = 7 in our case).
3. Each grid cell predicts B bounding boxes and a confidence value Pr(object) for each box (B = 2 in our case).
4. Each grid cell also predicts K class probabilities conditioned on an object being present, Pr(k|object). We predict only one set of class probabilities per grid cell, regardless of the number of boxes B.
5. Then we combine the class and individual box predictions: Pr(k) = Pr(object) · Pr(k|object).
6. Finally, we apply non-maximum suppression (NMS) and threshold the detections.
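
Steps 5 and 6 can be sketched in NumPy as follows. This is only a minimal illustration, not the authors' implementation: the function names, the value K = 20, the (x1, y1, x2, y2) box format, and the two thresholds are all assumptions made for the example.

```python
import numpy as np

S, B, K = 7, 2, 20  # S and B as in the notes; K = 20 is an assumed class count


def class_scores(conf, cond_probs):
    """Step 5: Pr(k) = Pr(object) * Pr(k | object) for every box.

    conf:       array of shape (S, S, B), Pr(object) per box
    cond_probs: array of shape (S, S, K), Pr(k | object) per grid cell
    returns:    array of shape (S, S, B, K)
    """
    # Broadcasting shares one set of class probabilities across the B boxes.
    return conf[..., :, None] * cond_probs[..., None, :]


def iou(a, b):
    """IOU of two boxes in the assumed (x1, y1, x2, y2) format."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, score_thresh=0.2, iou_thresh=0.5):
    """Step 6: threshold, then greedy NMS for one class.

    Both threshold values are illustrative assumptions.
    Returns the indices of the boxes that are kept.
    """
    order = [int(i) for i in np.argsort(-scores) if scores[i] >= score_thresh]
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a higher-scoring kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep
```

In this sketch `nms` would be called once per class, on the flattened S × S × B boxes and the matching column of the score array.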

3 Training Details
During training, if the center of an object falls into a grid cell, that grid cell is responsible for detecting that object. The loss function penalizes classification error (the conditional class probability) only if an object is present in that grid cell. It likewise penalizes bounding box coordinate error only for the predictor that is “responsible” for the ground-truth box (i.e. the one with the highest IOU among the B predictors in that grid cell). For the confidence values, we increase the confidence score of the “responsible” predictor and decrease the confidence of the other boxes. Consequently, for grid cells that contain no ground-truth object, we only decrease the confidence of their boxes and do not adjust the class probabilities or coordinates.
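
The two notions of responsibility can be sketched as below. This is a rough illustration under stated assumptions: the function names are hypothetical, boxes are assumed to be in (x1, y1, x2, y2) pixel coordinates, and the object center is assumed to be given in pixels.

```python
def iou(a, b):
    """IOU of two boxes in the assumed (x1, y1, x2, y2) format."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def responsible_cell(cx, cy, img_w, img_h, S=7):
    """(row, col) of the grid cell that contains the object center."""
    col = min(int(cx / img_w * S), S - 1)  # clamp centers on the right edge
    row = min(int(cy / img_h * S), S - 1)  # clamp centers on the bottom edge
    return row, col


def responsible_predictor(pred_boxes, gt_box):
    """Index of the predictor (out of B) with the highest IOU to gt_box."""
    ious = [iou(p, gt_box) for p in pred_boxes]
    return max(range(len(ious)), key=ious.__getitem__)


# An object centered in a 448 x 448 image falls into the middle cell.
print(responsible_cell(224, 224, 448, 448))  # (3, 3)
```

Only the predictor returned by `responsible_predictor` in the cell returned by `responsible_cell` would receive coordinate loss for that object; every other box in the grid would only have its confidence pushed down.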

The reason we need two kinds of probabilities is that if we predicted Pr(k) directly for each box in each grid cell, there would be S × S × B × K predicted numbers, many of which are zero. Introducing Pr(object) solves this problem: we update Pr(object) in every grid cell, but update Pr(k|object) only when there is an object in that grid cell.
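
To make the saving concrete, the two output sizes can be compared with a few lines of arithmetic. The helper name is hypothetical and K = 20 is an assumed class count; only the probability outputs are counted here, not the box coordinates.

```python
def prob_output_counts(S=7, B=2, K=20):
    """Count probability outputs for the two designs discussed above.

    K = 20 is an assumption for illustration; S and B follow the notes.
    """
    direct = S * S * B * K      # Pr(k) predicted separately for every box
    factored = S * S * (B + K)  # B values of Pr(object) plus K of Pr(k | object) per cell
    return direct, factored


print(prob_output_counts())  # (1960, 1078)
```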

4 Results
See Tab. YOLO is faster than Faster R-CNN, but less accurate. This is because YOLO imposes strong spatial constraints on bounding box predictions: each grid cell predicts only B boxes and can have only one class. Besides, since our model learns to predict bounding boxes from data, it struggles to generalize to objects with new or unusual aspect ratios or configurations. Finally, our loss function treats errors in small bounding boxes the same as errors in large bounding boxes, even though a small error in a large box is generally benign while a small error in a small box has a much greater effect on IOU. Our main source of error is incorrect localization.
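
The last point about box size can be checked with a small worked example. The sketch below shifts a square box horizontally by a fixed number of pixels and measures the IOU with its original position; the function names and the chosen sizes are assumptions for illustration only.

```python
def iou(a, b):
    """IOU of two boxes in the assumed (x1, y1, x2, y2) format."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def shifted_iou(size, shift):
    """IOU between a size x size box and the same box moved right by shift."""
    box = (0, 0, size, size)
    moved = (shift, 0, size + shift, size)
    return iou(box, moved)


print(round(shifted_iou(100, 2), 3))  # 0.961: a 2-pixel error barely matters
print(round(shifted_iou(10, 2), 3))   # 0.667: the same error is severe
```

The same 2-pixel localization error costs about 4 IOU points on the large box but about 33 on the small one, yet a loss on raw coordinates would penalize both equally.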