YOLO v1 Paper Notes
All excerpts are taken from the paper.
Strengths and Limitations
Strengths:
- First, YOLO is extremely fast
- YOLO sees the entire image during training and test time, so it implicitly encodes contextual information about classes as well as their appearance. YOLO makes less than half the number of background errors compared to Fast R-CNN.
- YOLO learns generalizable representations of objects. Since YOLO is highly generalizable, it is less likely to break down when applied to new domains or unexpected inputs.
Limitations:
- While it can quickly identify objects in images it struggles to precisely localize some objects, especially small ones.
- Each grid cell only predicts two boxes and can only have one class. This spatial constraint limits the number of nearby objects that our model can predict. Our model struggles with small objects that appear in groups, such as flocks of birds.
- Since our model learns to predict bounding boxes from data, it struggles to generalize to objects in new or unusual aspect ratios or configurations.
- Our model also uses relatively coarse features for predicting bounding boxes since our architecture has multiple downsampling layers from the input image
- our loss function treats errors the same in small bounding boxes versus large bounding boxes
Method Overview
- Our system divides the input image into an S × S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.
- Each grid cell predicts B bounding boxes and confidence scores for those boxes
- Each bounding box consists of 5 predictions: x, y, w, h, and confidence
- Each grid cell also predicts C conditional class probabilities, $\Pr(\text{Class}_i \mid \text{Object})$. We only predict one set of class probabilities per grid cell, regardless of the number of boxes B.
- These predictions are encoded as an S × S × (B ∗ 5 + C) tensor.
- we use S = 7, B = 2. PASCAL VOC has 20 labelled classes so C = 20. Our final prediction is a 7 × 7 × 30 tensor (a decoding sketch follows below).
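To make the tensor layout concrete, here is a minimal decoding sketch in NumPy. The layout assumed below ([x, y, w, h, confidence] for each of the B boxes, followed by the C class probabilities per cell) is an assumption for illustration; the paper only specifies the overall S × S × (B ∗ 5 + C) shape.

```python
import numpy as np

S, B, C = 7, 2, 20  # PASCAL VOC settings from the paper

# Stand-in for the network output: one S x S x (B*5 + C) tensor.
pred = np.random.rand(S, S, B * 5 + C)

# Per-box predictions (x, y, w, h, confidence) and per-cell class probabilities.
boxes = pred[..., :B * 5].reshape(S, S, B, 5)
class_probs = pred[..., B * 5:]          # Pr(Class_i | Object), one set per cell

# At test time the paper multiplies the conditional class probabilities
# by the box confidence to get class-specific confidence scores per box.
confidence = boxes[..., 4:5]                       # S x S x B x 1
scores = confidence * class_probs[:, :, None, :]   # S x S x B x C
print(scores.shape)  # (7, 7, 2, 20)
```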
Network Architecture
- Our detection network has 24 convolutional layers followed by 2 fully connected layers
- We pretrain the convolutional layers on the ImageNet classification task at half the resolution (224 × 224 input image) and then double the resolution for detection.
- Detection often requires fine-grained visual information so we increase the input resolution of the network from 224 × 224 to 448 × 448.
- We normalize the bounding box width and height by the image width and height so that they fall between 0 and 1.
- We parametrize the bounding box x and y coordinates to be offsets of a particular grid cell location so they are also bounded between 0 and 1 (see the encoding sketch after this list).
- We use a linear activation function for the final layer and all other layers use the following leaky rectified linear activation:
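$$
\phi(x) =
\begin{cases}
x, & \text{if } x > 0 \\
0.1x, & \text{otherwise}
\end{cases}
$$

The coordinate parametrization above can be made concrete with a small sketch. The helper below is hypothetical (the paper describes the parametrization but not this exact function); it assumes absolute pixel coordinates for the box center and size.

```python
def encode_box(cx, cy, w, h, S=7, img_w=448, img_h=448):
    """Encode an absolute-pixel box (center cx, cy; size w, h) into
    YOLO targets. Hypothetical helper for illustration."""
    # Normalize width/height by the image size -> values in [0, 1].
    w_n, h_n = w / img_w, h / img_h
    # Find the grid cell that the box center falls into.
    col = int(cx / img_w * S)
    row = int(cy / img_h * S)
    # Express x, y as offsets within that cell -> also in [0, 1].
    x = cx / img_w * S - col
    y = cy / img_h * S - row
    return row, col, (x, y, w_n, h_n)

# A box centered at (224, 224) in a 448 x 448 image lands in cell (3, 3)
# with offsets (0.5, 0.5).
print(encode_box(224, 224, 100, 50))  # (3, 3, (0.5, 0.5, 0.223..., 0.111...))
```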
Loss Design
- We use sum-squared error because it is easy to optimize
- It weights localization error equally with classification error which may not be ideal
- in every image many grid cells do not contain any object. This pushes the “confidence” scores of those cells towards zero, often overpowering the gradient from cells that do contain objects. This can lead to model instability, causing training to diverge early on.
- To remedy this, we increase the loss from bounding box coordinate predictions and decrease the loss from confidence predictions for boxes that don’t contain objects.
- We use two parameters, $\lambda_{coord}$ and $\lambda_{noobj}$, to accomplish this. We set $\lambda_{coord} = 5$ and $\lambda_{noobj} = 0.5$.
- Sum-squared error also equally weights errors in large boxes and small boxes. Our error metric should reflect that small deviations in large boxes matter less than in small boxes
- To partially address this we predict the square root of the bounding box width and height instead of the width and height directly (a numeric check follows the loss equation below).
- the loss function only penalizes classification error if an object is present in that grid cell (hence the conditional class probability discussed earlier)
- It also only penalizes bounding box coordinate error if that predictor is “responsible” for the ground truth box (i.e. has the highest IOU of any predictor in that grid cell).
- During training we optimize the following, multi-part loss function:
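$$
\begin{aligned}
& \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
$$

Here $\mathbb{1}_{i}^{\text{obj}}$ denotes whether an object appears in cell $i$, and $\mathbb{1}_{ij}^{\text{obj}}$ denotes that the $j$-th box predictor in cell $i$ is responsible for that prediction. Note the square-root terms: the same 5-unit width error contributes $(\sqrt{105} - \sqrt{100})^2 \approx 0.06$ for a 100-wide box but $(\sqrt{15} - \sqrt{10})^2 \approx 0.51$ for a 10-wide box, so small boxes are penalized more.

A compact NumPy sketch of the five terms follows. The tensor layout and the toy targets are assumptions for illustration (in real training the target confidence is $\Pr(\text{Object}) \cdot \text{IOU}$ and the responsible-predictor mask comes from an IOU comparison, both stubbed out here).

```python
import numpy as np

S, B, C = 7, 2, 20
lambda_coord, lambda_noobj = 5.0, 0.5

# Hypothetical split of predictions and targets per loss term.
pred_xy   = np.random.rand(S, S, B, 2)   # predicted (x, y) offsets
pred_wh   = np.random.rand(S, S, B, 2)   # predicted (w, h), normalized
pred_conf = np.random.rand(S, S, B)      # predicted confidence
pred_cls  = np.random.rand(S, S, C)      # predicted class probabilities

true_xy   = np.zeros((S, S, B, 2))
true_wh   = np.ones((S, S, B, 2))
true_conf = np.zeros((S, S, B))          # should be the IOU for responsible boxes
true_cls  = np.zeros((S, S, C))

obj = np.zeros((S, S, B))                # 1_ij^obj: responsible predictors
obj[3, 3, 0] = 1                         # toy example: one object in cell (3, 3)
noobj = 1.0 - obj                        # 1_ij^noobj
cell_obj = obj.max(axis=-1)              # 1_i^obj: cell contains an object

loss = (
    lambda_coord * np.sum(obj[..., None] * (pred_xy - true_xy) ** 2)
  + lambda_coord * np.sum(obj[..., None] * (np.sqrt(pred_wh) - np.sqrt(true_wh)) ** 2)
  + np.sum(obj * (pred_conf - true_conf) ** 2)
  + lambda_noobj * np.sum(noobj * (pred_conf - true_conf) ** 2)
  + np.sum(cell_obj[..., None] * (pred_cls - true_cls) ** 2)
)
print(loss)
```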