1 Motivation
We frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities.
2 Pipeline
See Fig. 1.
1. Resize the input image to 448 × 448 (448 rather than 224, to capture fine-grained visual information).
2. Divide the input image into an S × S grid (S = 7 in our case).
3. Each grid cell predicts B bounding boxes and confidence values Pr(object) for those boxes (B = 2 in our case).
4. Each grid cell also predicts K class probabilities conditioned on an object being present, Pr(k|object). We only predict one set of class probabilities per grid cell, regardless of the number of boxes B.
5. Then we combine the class and individual box predictions: Pr(k) = Pr(object) · Pr(k|object).
6. Finally, we apply non-maximum suppression (NMS) and threshold the detections.
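Steps 5 and 6 can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the corner-format box layout, and both threshold values (0.2 and 0.5) are assumptions chosen for the example, and the NMS shown handles a single class at a time.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two corner-format boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def class_scores(p_object: float, p_class_given_object: List[float]) -> List[float]:
    """Step 5 for one box: Pr(k) = Pr(object) * Pr(k|object)."""
    return [p_object * p for p in p_class_given_object]


def nms(boxes: List[Box], scores: List[float],
        score_threshold: float = 0.2,   # assumed value
        iou_threshold: float = 0.5      # assumed value
        ) -> List[int]:
    """Step 6 for one class: drop low scores, then greedy NMS. Returns kept indices."""
    order = sorted((i for i, s in enumerate(scores) if s >= score_threshold),
                   key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better-scoring kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept
```

For example, with boxes (0, 0, 10, 10), (1, 1, 11, 11) and (50, 50, 60, 60) scored 0.9, 0.8 and 0.1, only the first box survives: the second overlaps it with an IOU of 81/119, and the third falls below the score threshold.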
3 Training Details
During training, if the center of an object falls into a grid cell, that grid cell is responsible for detecting that object. The loss function only penalizes classification error (the conditional class probability) if an object is present in that grid cell. It also only penalizes bounding box coordinate error if that predictor is “responsible” for the ground truth box (i.e. has the highest IOU among the B predictors in that grid cell). For the confidence values, we increase the confidence score of the “responsible” predictor and decrease the confidence of the other boxes. This means that for grid cells that contain no ground-truth object, we only decrease the confidence of their boxes and do not adjust the class probabilities or coordinates.
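The two assignment rules above (which grid cell owns an object, and which of its B predictors is “responsible”) can be sketched as follows. This is an illustrative sketch under stated assumptions: boxes are given as corner coordinates normalized to [0, 1], and the names responsible_cell and responsible_predictor are hypothetical helpers, not part of any library.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed: (x_min, y_min, x_max, y_max) in [0, 1]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two corner-format boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def responsible_cell(truth: Box, S: int = 7) -> Tuple[int, int]:
    """Return (row, col) of the grid cell that contains the center of the box."""
    cx = (truth[0] + truth[2]) / 2
    cy = (truth[1] + truth[3]) / 2
    # Clamp so that a center lying exactly on the far edge stays inside the grid.
    col = min(int(cx * S), S - 1)
    row = min(int(cy * S), S - 1)
    return row, col


def responsible_predictor(predicted: List[Box], truth: Box) -> int:
    """Index of the predictor (out of B) with the highest IOU against the ground truth."""
    return max(range(len(predicted)), key=lambda b: iou(predicted[b], truth))
```

Only the predictor returned by responsible_predictor, in the cell returned by responsible_cell, would receive the coordinate loss and the confidence increase for that object; every other predictor only has its confidence pushed down.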
The reason we need two kinds of probabilities is that if we predicted Pr(k) directly for every box in every grid cell, there would be S × S × B × K predicted values, many of which are zero. Introducing Pr(object) solves this problem: Pr(object) is updated in every grid cell, while Pr(k|object) is updated only when there is an object in that grid cell.
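A quick count makes the saving concrete. The sketch below uses S = 7 and B = 2 from the text; K = 20 is an assumed number of classes picked only for illustration, and box coordinates are left out of both counts because they are the same in either scheme.

```python
def direct_outputs(S: int, B: int, K: int) -> int:
    """Probability outputs if every box predicted its own Pr(k) for each class."""
    return S * S * B * K


def factored_outputs(S: int, B: int, K: int) -> int:
    """One Pr(object) per box, plus one shared set of Pr(k|object) per grid cell."""
    return S * S * B + S * S * K


S, B = 7, 2   # values from the text
K = 20        # assumed number of classes, for illustration only

print(direct_outputs(S, B, K))    # 1960
print(factored_outputs(S, B, K))  # 1078
```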
4 Results
See Tab. YOLO is faster than Faster R-CNN, but less accurate. This is because YOLO imposes strong spatial constraints on bounding box predictions, since each grid cell only predicts B boxes and can only have one class. In addition, since our model learns to predict bounding boxes from data, it struggles to generalize to objects in new or unusual aspect ratios or configurations. Finally, our loss function treats errors in small bounding boxes the same as errors in large bounding boxes. A small error in a large box is generally benign, but the same error in a small box has a much greater effect on IOU. Our main source of error is incorrect localization.
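The effect of box size on IOU can be checked with a small worked example. The sketch below shifts a square box sideways by the same 5 pixels at two different sizes; the sizes, the shift, and the helper name shifted_iou are arbitrary choices made for illustration.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two corner-format boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def shifted_iou(side: float, shift: float) -> float:
    """IOU between a square box and a copy of it shifted horizontally by `shift`."""
    box = (0.0, 0.0, side, side)
    moved = (shift, 0.0, side + shift, side)
    return iou(box, moved)


print(shifted_iou(200, 5))  # large box: IOU stays around 0.95
print(shifted_iou(20, 5))   # small box: IOU drops to 0.6
```

For a square of side s shifted by d, the IOU works out to (s − d) / (s + d), so the same absolute error costs far more overlap when s is small.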
5 References
[1]. CVPR 2016. https://www.youtube.com/watch?v=NM6lrxy0bxs.