YOLO v1 Paper Notes
All excerpts are taken from the paper.
Strengths and Limitations
Strengths:
- First, YOLO is extremely fast
- YOLO sees the entire image during training and test time, so it implicitly encodes contextual information about classes as well as their appearance. YOLO makes less than half the number of background errors compared to Fast R-CNN.
- YOLO learns generalizable representations of objects. Since YOLO is highly generalizable, it is less likely to break down when applied to new domains or unexpected inputs.
Limitations:
- While it can quickly identify objects in images it struggles to precisely localize some objects, especially small ones.
- Each grid cell only predicts two boxes and can only have one class. This spatial constraint limits the number of nearby objects that our model can predict. Our model struggles with small objects that appear in groups, such as flocks of birds.
- Since our model learns to predict bounding boxes from data, it struggles to generalize to objects in new or unusual aspect ratios or configurations.
- Our model also uses relatively coarse features for predicting bounding boxes since our architecture has multiple downsampling layers from the input image
- our loss function treats errors the same in small bounding boxes versus large bounding boxes
Method Overview
- Our system divides the input image into an S × S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.
- Each grid cell predicts B bounding boxes and confidence scores for those boxes
- Each bounding box consists of 5 predictions: x, y, w, h, and confidence
- Each grid cell also predicts C conditional class probabilities, $\Pr(\text{Class}_i \mid \text{Object})$. We only predict one set of class probabilities per grid cell, regardless of the number of boxes B.
- These predictions are encoded as an S × S × (B ∗ 5 + C) tensor.
- we use S = 7, B = 2. PASCAL VOC has 20 labelled classes so C = 20. Our final prediction is a 7 × 7 × 30 tensor (a decoding sketch follows below).
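To make the tensor layout concrete, here is a minimal decoding sketch in NumPy. The layout assumed below ([x, y, w, h, confidence] for each of the B boxes, followed by the C class probabilities per cell) is an assumption for illustration; the paper only specifies the overall S × S × (B ∗ 5 + C) shape.

```python
import numpy as np

S, B, C = 7, 2, 20  # PASCAL VOC settings from the paper

# Stand-in for the network output: one S x S x (B*5 + C) tensor.
pred = np.random.rand(S, S, B * 5 + C)

# Per-box predictions (x, y, w, h, confidence) and per-cell class probabilities.
boxes = pred[..., :B * 5].reshape(S, S, B, 5)
class_probs = pred[..., B * 5:]          # Pr(Class_i | Object), one set per cell

# At test time the paper multiplies the conditional class probabilities
# by the box confidence to get class-specific confidence scores per box.
confidence = boxes[..., 4:5]                       # S x S x B x 1
scores = confidence * class_probs[:, :, None, :]   # S x S x B x C
print(scores.shape)  # (7, 7, 2, 20)
```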
Network Architecture
- Our detection network has 24 convolutional layers followed by 2 fully connected layers
- We pretrain the convolutional layers on the ImageNet classification task at half the resolution (224 × 224 input image) and then double the resolution for detection.
- Detection often requires fine-grained visual information so we increase the input resolution of the network from 224 × 224 to 448 × 448.
- We normalize the bounding box width and height by the image width and height so that they fall between 0 and 1.
- We parametrize the bounding box x and y coordinates to be offsets of a particular grid cell location so they are also bounded between 0 and 1 (see the encoding sketch after this list).
- We use a linear activation function for the final layer and all other layers use the following leaky rectified linear activation:
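$$
\phi(x) =
\begin{cases}
x, & \text{if } x > 0 \\
0.1x, & \text{otherwise}
\end{cases}
$$

The coordinate parametrization above can be made concrete with a small sketch. The helper below is hypothetical (the paper describes the parametrization but not this exact function); it assumes absolute pixel coordinates for the box center and size.

```python
def encode_box(cx, cy, w, h, S=7, img_w=448, img_h=448):
    """Encode an absolute-pixel box (center cx, cy; size w, h) into
    YOLO targets. Hypothetical helper for illustration."""
    # Normalize width/height by the image size -> values in [0, 1].
    w_n, h_n = w / img_w, h / img_h
    # Find the grid cell that the box center falls into.
    col = int(cx / img_w * S)
    row = int(cy / img_h * S)
    # Express x, y as offsets within that cell -> also in [0, 1].
    x = cx / img_w * S - col
    y = cy / img_h * S - row
    return row, col, (x, y, w_n, h_n)

# A box centered at (224, 224) in a 448 x 448 image lands in cell (3, 3)
# with offsets (0.5, 0.5).
print(encode_box(224, 224, 100, 50))  # (3, 3, (0.5, 0.5, 0.223..., 0.111...))
```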
Loss Design
- We use sum-squared error because it is easy to optimize
- It weights localization error equally with classification error which may not be ideal
- in every image many grid cells do not contain any object. This pushes the “confidence” scores of those cells towards zero, often overpowering the gradient from cells that do contain objects. This can lead to model instability, causing training to diverge early on.
- To remedy this, we increase the loss from bounding box coordinate predictions and decrease the loss from confidence predictions for boxes that don’t contain objects.
- We use two parameters, $\lambda_{coord}$ and $\lambda_{noobj}$, to accomplish this. We set $\lambda_{coord} = 5$ and $\lambda_{noobj} = 0.5$.
- Sum-squared error also equally weights errors in large boxes and small boxes. Our error metric should reflect that small deviations in large boxes matter less than in small boxes
- To partially address this we predict the square root of the bounding box width and height instead of the width and height directly (a numeric check follows the loss equation below).
- the loss function only penalizes classification error if an object is present in that grid cell (hence the conditional class probability discussed earlier)
- It also only penalizes bounding box coordinate error if that predictor is “responsible” for the ground truth box (i.e. has the highest IOU of any predictor in that grid cell).
- During training we optimize the following, multi-part loss function:
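$$
\begin{aligned}
& \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
$$

Here $\mathbb{1}_{i}^{\text{obj}}$ denotes whether an object appears in cell $i$, and $\mathbb{1}_{ij}^{\text{obj}}$ denotes that the $j$-th box predictor in cell $i$ is responsible for that prediction. Note the square-root terms: the same 5-unit width error contributes $(\sqrt{105} - \sqrt{100})^2 \approx 0.06$ for a 100-wide box but $(\sqrt{15} - \sqrt{10})^2 \approx 0.51$ for a 10-wide box, so small boxes are penalized more.

A compact NumPy sketch of the five terms follows. The tensor layout and the toy targets are assumptions for illustration (in real training the target confidence is $\Pr(\text{Object}) \cdot \text{IOU}$ and the responsible-predictor mask comes from an IOU comparison, both stubbed out here).

```python
import numpy as np

S, B, C = 7, 2, 20
lambda_coord, lambda_noobj = 5.0, 0.5

# Hypothetical split of predictions and targets per loss term.
pred_xy   = np.random.rand(S, S, B, 2)   # predicted (x, y) offsets
pred_wh   = np.random.rand(S, S, B, 2)   # predicted (w, h), normalized
pred_conf = np.random.rand(S, S, B)      # predicted confidence
pred_cls  = np.random.rand(S, S, C)      # predicted class probabilities

true_xy   = np.zeros((S, S, B, 2))
true_wh   = np.ones((S, S, B, 2))
true_conf = np.zeros((S, S, B))          # should be the IOU for responsible boxes
true_cls  = np.zeros((S, S, C))

obj = np.zeros((S, S, B))                # 1_ij^obj: responsible predictors
obj[3, 3, 0] = 1                         # toy example: one object in cell (3, 3)
noobj = 1.0 - obj                        # 1_ij^noobj
cell_obj = obj.max(axis=-1)              # 1_i^obj: cell contains an object

loss = (
    lambda_coord * np.sum(obj[..., None] * (pred_xy - true_xy) ** 2)
  + lambda_coord * np.sum(obj[..., None] * (np.sqrt(pred_wh) - np.sqrt(true_wh)) ** 2)
  + np.sum(obj * (pred_conf - true_conf) ** 2)
  + lambda_noobj * np.sum(noobj * (pred_conf - true_conf) ** 2)
  + np.sum(cell_obj[..., None] * (pred_cls - true_cls) ** 2)
)
print(loss)
```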