[Deep Learning Paper Notes][Object Detection] You Only Look Once: Unified, Real-Time Object Detection

Redmon, Joseph, et al. “You only look once: Unified, real-time object detection.” arXiv preprint arXiv:1506.02640 (2015). (Citations: 76).


1 Motivation

YOLO frames object detection as a single regression problem: one network maps image pixels directly to spatially separated bounding boxes and associated class probabilities in a single evaluation.


2 Pipeline

See the pipeline figure in the paper.
1. Resize the input image to 448 × 448 (448 is used instead of 224 so the network can capture fine-grained visual information).
2. Divide the input image into an S × S grid (S = 7 in our case).
3. Each grid cell predicts B bounding boxes, each encoded as (x, y, w, h) plus a confidence score defined as Pr(object) · IOU between the predicted box and any ground-truth box (B = 2 in our case).
4. Each grid cell also predicts K class probabilities conditioned on an object being present, Pr(k|object). Only one set of class probabilities is predicted per grid cell, regardless of the number of boxes B.
5. At test time, combine the conditional class probabilities with the individual box confidences, Pr(k|object) · Pr(object) · IOU = Pr(k) · IOU, to get a class-specific confidence score for each box.
6. Finally, apply non-maximum suppression (NMS) and threshold the detections (see the decoding sketch after this list).
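
As a concrete illustration of steps 3–6, here is a minimal NumPy sketch that decodes a YOLO-style output tensor of shape S × S × (B·5 + K) into final detections. The tensor layout, helper names, and the two thresholds are my own assumptions for illustration, not the authors' code.

```python
import numpy as np

S, B, K = 7, 2, 20               # grid size, boxes per cell, classes (PASCAL VOC)
CONF_THRESH, NMS_IOU = 0.2, 0.5  # assumed thresholds, chosen for illustration

def iou(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2) in relative image coordinates."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def decode(pred):
    """pred: (S, S, B*5 + K) array -> list of ((x1, y1, x2, y2), class_id, score)."""
    dets = []
    for i in range(S):              # grid row
        for j in range(S):          # grid column
            cell = pred[i, j]
            class_probs = cell[B * 5:]  # Pr(k | object), shared by the B boxes of this cell
            for b in range(B):
                x, y, w, h, conf = cell[b * 5:(b + 1) * 5]  # conf plays the role of Pr(object) * IOU
                scores = conf * class_probs                 # class-specific confidence per box
                k = int(np.argmax(scores))
                if scores[k] < CONF_THRESH:
                    continue
                # (x, y) is the box center relative to the cell; (w, h) relative to the image
                cx, cy = (j + x) / S, (i + y) / S
                dets.append(((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2), k, float(scores[k])))
    return dets

def nms(dets, thresh=NMS_IOU):
    """Greedy per-class non-maximum suppression."""
    keep = []
    for d in sorted(dets, key=lambda d: -d[2]):
        if all(d[1] != kept[1] or iou(d[0], kept[0]) < thresh for kept in keep):
            keep.append(d)
    return keep

# Usage with a random tensor standing in for the network output:
detections = nms(decode(np.random.rand(S, S, B * 5 + K)))
```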



3 Training Details
During training, if the center of an object falls into a grid cell, that grid cell is responsible for detecting that object. The loss function only penalizes classification error (the conditional class probabilities) if an object is present in that grid cell. It also only penalizes bounding-box coordinate error if that predictor is "responsible" for the ground-truth box, i.e. it has the highest IOU among the B predictors in that grid cell. For the confidence scores, we push up the confidence of the "responsible" predictor and push down the confidence of the other boxes. In particular, if a grid cell contains no ground-truth object, we only decrease the confidences of its boxes and do not adjust its class probabilities or coordinates.
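
For reference, the paper encodes exactly these rules with indicator functions in a multi-part squared-error loss; the form below is my transcription and should be checked against the paper (it uses λ_coord = 5 and λ_noobj = 0.5):

$$
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}^{\text{obj}}_{ij}\left[(x_i-\hat{x}_i)^2 + (y_i-\hat{y}_i)^2\right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}^{\text{obj}}_{ij}\left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right] \\
&+ \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}^{\text{obj}}_{ij}\left(C_i-\hat{C}_i\right)^2
+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}^{\text{noobj}}_{ij}\left(C_i-\hat{C}_i\right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}^{\text{obj}}_{i} \sum_{c \in \text{classes}} \left(p_i(c)-\hat{p}_i(c)\right)^2
\end{aligned}
$$

Here 1^obj_ij is 1 only for the predictor "responsible" for an object in cell i, 1^noobj_ij covers all other box predictors, 1^obj_i is 1 if any object appears in cell i, C is the box confidence, and p_i(c) is the conditional class probability.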


The reason we need two kinds of probabilities is that predicting Pr(k) directly for every box would require S × S × B × K numbers, most of which are zero for a typical image and therefore receive a poor training signal. Factoring the prediction into Pr(object) and Pr(k|object) solves this: Pr(object) is updated for every box in every grid cell, while Pr(k|object) is only updated for the cells that actually contain an object (see the quick check below).
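
A quick back-of-the-envelope check of this factorization, using the paper's PASCAL VOC setting (S = 7, B = 2, K = 20); the comparison itself is my own illustration:

```python
S, B, K = 7, 2, 20  # grid size, boxes per cell, number of classes (PASCAL VOC)

# Predicting Pr(k) directly for every box: one class distribution per box.
direct = S * S * B * K            # 1960 numbers, most of them zero on a typical image

# Factored prediction: one Pr(object) per box plus one Pr(k|object) set per cell.
factored = S * S * B + S * S * K  # 98 + 980 = 1078 numbers

# The full output tensor is S * S * (B * 5 + K) = 7 * 7 * 30 = 1470 numbers,
# i.e. the 1078 probabilities above plus the 4 * S * S * B = 392 box coordinates.
print(direct, factored)           # 1960 1078
```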

4 Results
See the results table in the paper. YOLO is much faster than Faster R-CNN, but less accurate. This is because YOLO imposes strong spatial constraints on bounding-box predictions: each grid cell only predicts B boxes and can only have one class. Moreover, since the model learns to predict bounding boxes purely from data, it struggles to generalize to objects with new or unusual aspect ratios or configurations. Finally, the loss function treats errors the same in small and large bounding boxes; a small error in a large box is generally benign, but the same error in a small box has a much greater effect on IOU (see the calculation below). The main source of error is incorrect localization.
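
To see why, a quick calculation: shift an axis-aligned w × w box by d pixels along x and compute the IOU with the original (illustrative numbers of my own, not from the paper):

```python
def iou_after_shift(w: float, d: float) -> float:
    """IOU between a w x w box and the same box shifted by d pixels along x."""
    inter = (w - d) * w
    union = 2 * w * w - inter
    return inter / union

print(iou_after_shift(200, 5))  # ~0.95: a 5-pixel error barely hurts a large box
print(iou_after_shift(20, 5))   # 0.60: the same 5-pixel error is severe on a small box
```

The paper partially mitigates this by regressing the square roots of w and h rather than the raw width and height.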



5 References
[1] CVPR 2016. https://www.youtube.com/watch?v=NM6lrxy0bxs
