Source (Zhihu): https://zhuanlan.zhihu.com/p/2491678
You Only Look Once: Unified, Real-Time Object Detection(YOLO)
- Question
- Prior work on object detection repurposes classifiers to perform detection. Can we instead frame object detection as a regression problem?
- Object proposals and classification are separate stages. Can we predict bounding boxes and class probabilities jointly?
- Solution
Frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities.
- Advantage and Contribution
- A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes.
- Simple and fast: no complex pipeline is needed, and the base model runs at 45 frames per second with no batch processing.
- Because YOLO reasons globally about the image when making predictions, it makes fewer than half the number of background errors of Fast R-CNN.
- Following GoogLeNet's inception modules, the network uses 1×1 reduction layers followed by 3×3 convolutional layers, which greatly reduces computation and parameter count.
- Predict the square root of the bounding box width and height instead of the width and height directly, so that a given deviation is penalized more in a small box than in a large one.
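The effect of the square-root parameterization can be checked with a quick numeric sketch (the box widths here are illustrative values, not from the paper):

```python
import math

def sqrt_space_error(w_true, w_pred):
    """Error measured on sqrt(w), as in the YOLO loss."""
    return abs(math.sqrt(w_true) - math.sqrt(w_pred))

# Same absolute error of 5 pixels in a large and a small box:
large = sqrt_space_error(100, 95)   # ~0.25
small = sqrt_space_error(10, 5)     # ~0.93

# The sqrt parameterization penalizes the small box far more,
# even though the raw width error (5 px) is identical.
print(f"large box: {large:.2f}, small box: {small:.2f}")
```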
- Weakness
- Each grid cell predicts only two boxes and can have only one class. This spatial constraint limits the number of nearby objects the model can predict; in particular, the model is less accurate on small objects that appear in groups.
- The model also uses relatively coarse features for predicting bounding boxes since the architecture has multiple downsampling layers from the input image.
- The loss function treats errors the same in small bounding boxes versus large bounding boxes, even though a small error matters far more for a small box, so incorrect localizations are a major source of error.
- Model overview
Define confidence:
C = Pr(Object) * IOU(pred, truth)
Class-specific confidence score for each box (used at test time):
Pr(Class_i | Object) * Pr(Object) * IOU(pred, truth) = Pr(Class_i) * IOU(pred, truth)
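These definitions can be made concrete with a small sketch (box coordinates, the corner format (x1, y1, x2, y2), and the probability values are assumptions for illustration):

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

pred, truth = (0, 0, 2, 2), (1, 1, 3, 3)
pr_object = 1.0                                   # cell contains an object
confidence = pr_object * iou(pred, truth)         # C = Pr(Object) * IOU
pr_class_given_object = 0.8                       # assumed Pr(Class_i | Object)
class_score = pr_class_given_object * confidence  # = Pr(Class_i) * IOU
print(confidence, class_score)  # 1/7 ≈ 0.143, then scaled by 0.8
```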
Detection principle
Network structure
Alternating 1 × 1 convolutional layers reduce the feature space from the preceding layers. The convolutional layers are pretrained on the ImageNet classification task at half the resolution (224 × 224 input image), and the resolution is then doubled to 448 × 448 for detection.
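The savings from the 1×1 reduction layers can be seen by counting convolution weights (the channel sizes 512 and 256 here are assumed for illustration, not taken from the YOLO architecture):

```python
# Direct 3x3 convolution mapping 512 -> 512 channels.
direct = 3 * 3 * 512 * 512                         # 2,359,296 weights

# GoogLeNet-style reduction: 1x1 conv 512 -> 256, then 3x3 conv 256 -> 512.
reduced = 1 * 1 * 512 * 256 + 3 * 3 * 256 * 512    # 1,310,720 weights

print(f"direct: {direct:,}  reduced: {reduced:,}  saving: {1 - reduced / direct:.0%}")
```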
Implementation steps:
1) Resize the input image to 448 × 448 and divide it into an S × S grid.
2) For each grid cell, the network (24 convolutional layers followed by 2 fully connected layers) outputs a vector of 30 dimensions: B × 5 + C, where B = 2 bounding boxes, each with 5 predictions (x, y, w, h, and confidence), and C = 20 labelled classes. The final output of the network is the 7 × 7 × 30 tensor of predictions.
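Decoding one grid cell's 30-dimensional vector can be sketched in plain Python (a random tensor stands in for the network output; the dict layout is an illustrative choice):

```python
import random

S, B, C = 7, 2, 20            # grid size, boxes per cell, classes
random.seed(0)
# Stand-in for the 7 x 7 x 30 network output.
output = [[[random.random() for _ in range(B * 5 + C)]
           for _ in range(S)] for _ in range(S)]

def decode_cell(cell):
    """Split a 30-dim cell vector into B boxes and C class probabilities."""
    boxes = []
    for b in range(B):
        x, y, w, h, conf = cell[b * 5: b * 5 + 5]
        boxes.append({"x": x, "y": y, "w": w, "h": h, "conf": conf})
    class_probs = cell[B * 5:]
    # Class-specific score per box: Pr(Class_i | Object) * confidence.
    scores = [[p * box["conf"] for p in class_probs] for box in boxes]
    return boxes, class_probs, scores

boxes, class_probs, scores = decode_cell(output[0][0])
print(len(boxes), len(class_probs), len(scores[0]))  # 2 20 20
```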
The multi-part loss function at train time
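Written out, the multi-part sum-squared loss from the paper is (with λ_coord = 5 and λ_noobj = 0.5, where 𝟙ᵢⱼ^obj selects the predictor responsible for an object in cell i):

```latex
\lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
  \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right]
+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
  \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2
       + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right]
+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} (C_i - \hat{C}_i)^2
+ \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} (C_i - \hat{C}_i)^2
+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in classes} (p_i(c) - \hat{p}_i(c))^2
```

The square-root terms in the second sum implement the width/height trick noted under Advantage and Contribution.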
- 3) Non-maximal suppression is used to produce the final detections.
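The suppression step can be sketched as a greedy loop (a minimal sketch; the 0.5 IoU threshold and the example scores are typical choices, not values from the paper):

```python
def nms(boxes, iou_threshold=0.5):
    """Greedy non-maximal suppression. boxes: list of (score, (x1, y1, x2, y2))."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    remaining = sorted(boxes, key=lambda t: t[0], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)           # highest-scoring box survives
        kept.append(best)
        # Drop boxes that overlap the survivor too much.
        remaining = [b for b in remaining
                     if iou(best[1], b[1]) < iou_threshold]
    return kept

dets = [(0.9, (0, 0, 10, 10)), (0.8, (1, 1, 10, 10)), (0.7, (20, 20, 30, 30))]
print(nms(dets))  # keeps the 0.9 and 0.7 boxes; the 0.8 box overlaps the 0.9
```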