YOLO V1

Copyright notice: this is the blogger's original article and may not be reproduced without the blogger's permission.

Paper: You Only Look Once: Unified, Real-Time Object Detection

Link: https://arxiv.org/abs/1506.02640

Published at CVPR 2016 (the arXiv preprint dates from 2015).


Walkthrough of the paper:

The authors open by listing the advantages of the YOLO algorithm:

1. First, YOLO is extremely fast. Our base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors.

YOLO is very fast. The base network runs at 45 fps on a Titan X GPU, and the fast version runs at more than 150 fps.

2. Second, YOLO reasons globally about the image when making predictions. Unlike sliding window and region proposal-based techniques, YOLO sees the entire image during training and test time [...] YOLO makes less than half the number of background errors compared to Fast R-CNN.

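YOLO predicts from global information about the whole image, unlike detectors based on sliding windows or region proposals. Compared with Fast R-CNN, YOLO makes fewer than half as many background errors (background patches mistaken for objects).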

3. Third, YOLO learns generalizable representations of objects.

It generalizes well.

 

Next comes Unified Detection:

Our system divides the input image into an S × S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.

Each grid cell predicts B bounding boxes and confidence scores for those boxes.

Each bounding box consists of 5 predictions: x, y, w, h, and confidence.

We normalize the bounding box width and height by the image width and height so that they fall between 0 and 1. We parametrize the bounding box x and y coordinates to be offsets of a particular grid cell location so they are also bounded between 0 and 1.

Each grid cell also predicts C conditional class probabilities.

 

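The authors divide the input image into an S × S grid, and the grid cell that contains an object's center is responsible for detecting it. Confidence is defined as Pr(Object) × IOU: the target is 0 if the grid cell contains no object; otherwise it equals the IOU between the predicted box and the ground truth (rather than simply 1).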

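Each grid cell predicts B bounding boxes, and each bounding box consists of 5 predicted values: x, y, w, h, and confidence. In addition, each grid cell predicts C conditional class probabilities.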

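(x, y) is the center of the bounding box, expressed as an offset relative to the grid cell, in the range 0 to 1. w and h are normalized by dividing by the image width and height respectively, so they also lie in the range 0 to 1.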

In the paper the authors use S = 7, B = 2, C = 20, so the final prediction is a single tensor of shape 7 × 7 × (2 × 5 + 20) = 7 × 7 × 30.
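To make the parametrization above concrete, here is a minimal sketch of how one ground-truth box might be encoded into grid-cell targets. The function name `encode_box`, the pixel-based input format, and the clamping of edge cases are my own assumptions for illustration; the paper does not prescribe an implementation.

```python
def encode_box(cx, cy, w, h, img_w, img_h, S=7):
    """Encode a box given by its pixel center (cx, cy) and pixel size (w, h).

    Returns (row, col, x, y, w_norm, h_norm), where (row, col) is the
    responsible grid cell, (x, y) is the center offset inside that cell,
    and w_norm, h_norm are normalized by the image size.
    """
    # Center position in grid units, e.g. 3.5 means halfway through column 3.
    gx = cx / img_w * S
    gy = cy / img_h * S
    # Clamp so a center lying exactly on the right/bottom border stays in-grid
    # (an assumption of this sketch).
    col = min(int(gx), S - 1)
    row = min(int(gy), S - 1)
    # Offsets relative to the cell, each in [0, 1].
    x = gx - col
    y = gy - row
    return row, col, x, y, w / img_w, h / img_h


S, B, C = 7, 2, 20                 # values used in the paper
output_shape = (S, S, B * 5 + C)   # shape of the prediction tensor

row, col, x, y, w, h = encode_box(cx=224, cy=100, w=112, h=56,
                                  img_w=448, img_h=448)
print(output_shape)
print(row, col, round(x, 4), round(y, 4), w, h)
```

The box center falls in the cell at row 1, column 3, and only that cell is responsible for the object.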

 

Next comes Network Design:

Our network architecture is inspired by the GoogLeNet model for image classification [34]. Our network has 24 convolutional layers followed by 2 fully connected layers. Instead of the inception modules used by GoogLeNet, we simply use 1 × 1 reduction layers followed by 3 × 3 convolutional layers.
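A rough way to see why a 1 × 1 reduction layer in front of a 3 × 3 convolution is attractive is to count weights. The sketch below does only that arithmetic; the channel counts (512 → 256 → 512) are illustrative assumptions rather than a statement about any specific layer in the paper, and biases are ignored.

```python
def conv_params(in_ch, out_ch, k):
    """Number of weights in a k x k convolution (biases ignored)."""
    return in_ch * out_ch * k * k


# Illustrative channel counts (assumed for this example).
in_ch, mid_ch, out_ch = 512, 256, 512

# A single 3 x 3 convolution applied directly to the input.
direct = conv_params(in_ch, out_ch, 3)

# A 1 x 1 reduction to mid_ch channels, followed by a 3 x 3 convolution.
reduced = conv_params(in_ch, mid_ch, 1) + conv_params(mid_ch, out_ch, 3)

print(direct, reduced, round(reduced / direct, 3))
```

Under these assumptions the reduced pair needs a little over half the weights of the direct 3 × 3 convolution.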

 

 

Next comes Training:

We pretrain our convolutional layers on the ImageNet 1000-class competition dataset [30]. For pretraining we use the first 20 convolutional layers from Figure 3 followed by an average-pooling layer and a fully connected layer.

We add four convolutional layers and two fully connected layers with randomly initialized weights. Detection often requires fine-grained visual information so we increase the input resolution of the network from 224 × 224 to 448 × 448.

We use a linear activation function for the final layer and all other layers use the following leaky rectified linear activation:

$$\phi(x) = \begin{cases} x, & \text{if } x > 0 \\ 0.1x, & \text{otherwise} \end{cases}$$

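The authors first pretrain the network on the ImageNet dataset, using only the first 20 convolutional layers. For detection they then add 4 randomly initialized convolutional layers and 2 fully connected layers, and change the input resolution from 224 × 224 to 448 × 448. The final layer uses a linear activation; all other layers use leaky ReLU with a fixed slope of 0.1 for negative inputs.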
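The leaky rectified linear activation is simple enough to write out directly. This is a scalar sketch for illustration only; the name `leaky_relu` and the `negative_slope` parameter are my own choices, with the default of 0.1 taken from the paper.

```python
def leaky_relu(x, negative_slope=0.1):
    """Leaky ReLU: identity for positive inputs, a small fixed slope otherwise."""
    return x if x > 0 else negative_slope * x


for value in (-3.0, 0.0, 2.0):
    print(value, leaky_relu(value))
```

Unlike a plain ReLU, negative inputs still pass a small gradient, and unlike PReLU, the slope is a constant rather than a learned parameter.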
 

We use sum-squared error because it is easy to optimize, however it does not perfectly align with our goal of maximizing average precision. It weights localization error equally with classification error which may not be ideal. Also, in every image many grid cells do not contain any object. This pushes the “confidence” scores of those cells towards zero, often overpowering the gradient from cells that do contain objects. This can lead to model instability, causing training to diverge early on.

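The authors use sum-squared error, but point out that weighting localization error equally with classification error is not ideal. In addition, because many grid cells contain no object, the confidence loss from those cells can overwhelm the gradient from the cells that do contain objects.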

 

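I won't copy the rest of this part verbatim. To address the problem, the authors increase the weight of the localization (coordinate) error and decrease the weight of the confidence loss for boxes that are not responsible for any object, using λ_coord = 5 and λ_noobj = 0.5.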
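The effect of down-weighting the no-object confidence term can be seen with a toy calculation. The sketch below covers only the confidence part of the loss; the function name `confidence_loss`, the flat list layout, and the example numbers are assumptions made for illustration, while the default weight of 0.5 is the λ_noobj value from the paper.

```python
def confidence_loss(pred_conf, target_conf, responsible, lambda_noobj=0.5):
    """Sum-squared confidence loss over all predicted boxes.

    Boxes that are responsible for an object get weight 1; all other boxes
    get the smaller weight lambda_noobj.
    """
    loss = 0.0
    for p, t, r in zip(pred_conf, target_conf, responsible):
        weight = 1.0 if r else lambda_noobj
        loss += weight * (p - t) ** 2
    return loss


# Toy example: 7 * 7 * 2 = 98 boxes, only 2 of them responsible for an object.
pred = [0.5] * 98
target = [0.8, 0.8] + [0.0] * 96           # target is the IOU for responsible boxes
responsible = [True, True] + [False] * 96

weighted = confidence_loss(pred, target, responsible)
unweighted = confidence_loss(pred, target, responsible, lambda_noobj=1.0)
print(round(weighted, 2), round(unweighted, 2))
```

Even in this toy case the 96 empty boxes dominate the total, which is why halving their weight matters.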

 

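The authors also note that the same absolute error matters more for a small box than for a large one, so the network predicts the square roots of w and h instead of w and h directly.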
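A small numeric check shows what the square root buys. In this sketch the helper `size_loss` and the example widths are hypothetical, and predictions are assumed to be non-negative so that the square root is defined.

```python
import math


def size_loss(pred, true, use_sqrt=True):
    """Squared error on a box dimension, optionally on its square root."""
    if use_sqrt:
        return (math.sqrt(pred) - math.sqrt(true)) ** 2
    return (pred - true) ** 2


# The same absolute deviation of 0.05 on a small box and on a large box.
small = size_loss(0.15, 0.10)
large = size_loss(0.85, 0.80)
plain_small = size_loss(0.15, 0.10, use_sqrt=False)
plain_large = size_loss(0.85, 0.80, use_sqrt=False)

print(small, large)              # with square roots
print(plain_small, plain_large)  # without square roots
```

Without the square root both errors cost about the same; with it, the deviation on the small box is penalized several times more heavily.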

 

We assign one predictor to be “responsible” for predicting an object based on which prediction has the highest current IOU with the ground truth.

The bounding box predictor whose prediction has the highest IOU with the object's ground truth is responsible for predicting that object.
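Selecting the responsible predictor comes down to an IOU computation followed by an argmax. The sketch below assumes boxes are given as (cx, cy, w, h) in the same normalized units; the function names and the example boxes are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (cx, cy, w, h)."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    # Width and height of the overlap region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def responsible_predictor(pred_boxes, gt_box):
    """Return (index, iou) of the predicted box that best overlaps gt_box."""
    ious = [iou(p, gt_box) for p in pred_boxes]
    best = max(range(len(ious)), key=ious.__getitem__)
    return best, ious[best]


gt = (0.5, 0.5, 0.4, 0.4)
preds = [(0.5, 0.5, 0.2, 0.8),    # tall, thin box
         (0.55, 0.5, 0.4, 0.4)]   # right shape, slightly shifted
best_idx, best_iou = responsible_predictor(preds, gt)
print(best_idx, round(best_iou, 3))
```

Because the assignment depends on the current predictions, it can change from one training step to the next.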

 

 

 

Next comes Inference:

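With the trained weights, the network finally yields a score matrix of size 20 × (7 × 7 × 2) = 20 × 98, where each score is confidence × class probability. Scores below a threshold are first set to 0, and then NMS is run separately for each of the 20 classes. Next, take the maximum of the 20 scores of each bounding box: if that score is greater than 0, the bounding box belongs to the class corresponding to the score; if it equals 0, the bounding box contains no object.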
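The inference steps described above can be sketched in plain Python. This is an illustrative version, not the authors' code: the function names, the corner-format boxes (x1, y1, x2, y2), and the threshold values 0.2 and 0.5 are all assumptions.

```python
def iou_xyxy(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def detect(boxes, confidences, class_probs, score_thresh=0.2, iou_thresh=0.5):
    """Return a list of (box_index, class_index, score) for kept detections."""
    n, num_classes = len(boxes), len(class_probs[0])
    # scores[c][i] = confidence of box i * conditional probability of class c,
    # zeroed when it falls below the score threshold.
    scores = [[confidences[i] * class_probs[i][c] for i in range(n)]
              for c in range(num_classes)]
    for c in range(num_classes):
        for i in range(n):
            if scores[c][i] < score_thresh:
                scores[c][i] = 0.0
    # Per-class NMS: a higher-scoring box suppresses overlapping lower ones.
    for c in range(num_classes):
        order = sorted(range(n), key=lambda i: scores[c][i], reverse=True)
        for pos, i in enumerate(order):
            if scores[c][i] == 0.0:
                continue
            for j in order[pos + 1:]:
                if scores[c][j] > 0.0 and iou_xyxy(boxes[i], boxes[j]) > iou_thresh:
                    scores[c][j] = 0.0
    # Each surviving box takes the class with its highest remaining score.
    detections = []
    for i in range(n):
        best_c = max(range(num_classes), key=lambda c: scores[c][i])
        if scores[best_c][i] > 0.0:
            detections.append((i, best_c, scores[best_c][i]))
    return detections


boxes = [(10, 10, 110, 110), (15, 12, 112, 108), (200, 200, 300, 300)]
confidences = [0.9, 0.8, 0.7]
class_probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]   # 2 classes for brevity
detections = detect(boxes, confidences, class_probs)
print(detections)
```

In this toy input the second box overlaps the first heavily and is suppressed for class 0, so only the first and third boxes are reported.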

 

Finally, the paper compares the algorithm with other detection algorithms, which I won't reproduce here.


Room for improvement:

1. Compared to state-of-the-art detection systems, YOLO makes more localization errors. Our model struggles with small objects that appear in groups, such as flocks of birds.

2. Recall is low.

 
