You Only Look Once

Abstract

We frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities.

Introduction

We reframe object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities. Using our system, you only look once (YOLO) at an image to predict what objects are present and where they are.

Unlike sliding window and region proposal-based techniques, YOLO sees the entire image during training and test time, so it implicitly encodes contextual information about classes as well as their appearance.

YOLO still lags behind state-of-the-art detection systems in accuracy. While it can quickly identify objects in images, it struggles to precisely localize some objects, especially small ones. We examine these trade-offs further in our experiments.

Unified Detection

We unify the separate components of object detection into a single neural network. Our network uses features from the entire image to predict each bounding box. It also predicts all bounding boxes across all classes for an image simultaneously. This means our network reasons globally about the full image and all the objects in the image. The YOLO design enables end-to-end training and real-time speeds while maintaining high average precision.

Our system divides the input image into an S × S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.
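The cell assignment can be sketched in a few lines. This is a minimal illustration, not the Darknet implementation: the function name `responsible_cell`, the pixel-coordinate convention, and the 448 × 448 image size in the usage line are assumptions made for the example.

```python
def responsible_cell(x_center, y_center, img_w, img_h, S=7):
    """Return (row, col) of the S x S grid cell that contains the object's center.

    Coordinates are in pixels with the origin at the top-left corner
    (an assumption for this sketch).
    """
    # Scale the center into grid units, then truncate to a cell index.
    col = int(x_center / img_w * S)
    row = int(y_center / img_h * S)
    # A center lying exactly on the right or bottom edge is clamped
    # into the last cell so the index stays within the grid.
    return min(row, S - 1), min(col, S - 1)


# Hypothetical object centered at (224, 100) in a 448 x 448 image.
print(responsible_cell(224, 100, 448, 448))  # (1, 3)
```

Only this one cell is asked to detect the object, even if the object's box spans many cells.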

If no object exists in a cell, its confidence score should be zero. Otherwise, we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth.
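The IOU that serves as the confidence target can be computed as below. This is a minimal sketch that assumes axis-aligned boxes given as corner coordinates `(x_min, y_min, x_max, y_max)`; the function name `iou` and that box format are choices made for the example.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle, if there is one.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Width and height are clipped at zero so disjoint boxes give no overlap.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


# Two 2 x 2 boxes overlapping in a 1 x 1 square: 1 / (4 + 4 - 1).
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

A perfect prediction gives an IOU of 1 and a box that misses the object entirely gives 0, so the target stays in the same range as a probability.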

For evaluating YOLO on PASCAL VOC, we use S = 7, B = 2. PASCAL VOC has 20 labelled classes, so C = 20. Our final prediction is a 7 × 7 × 30 tensor.
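The depth of 30 follows from simple arithmetic. The sketch below assumes each of the B boxes carries five numbers (x, y, w, h and a confidence score), as in the original YOLO paper; that per-box layout is not spelled out in the text above.

```python
S, B, C = 7, 2, 20

# Assumption: each box predicts x, y, w, h and one confidence score.
VALUES_PER_BOX = 5

# Every grid cell predicts B boxes plus one set of C class probabilities.
depth = B * VALUES_PER_BOX + C
shape = (S, S, depth)

print(shape)          # (7, 7, 30)
print(S * S * depth)  # 1470 output values per image
```

The class probabilities are counted once per cell, not once per box, which is why the depth is 2 × 5 + 20 and not 2 × 25.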


Network Structure


Training

We pretrain our convolutional layers on the ImageNet 1000-class competition dataset [30]. For pretraining we use the first 20 convolutional layers from Figure 3 followed by an average-pooling layer and a fully connected layer. We train this network for approximately a week and achieve a single-crop top-5 accuracy of 88% on the ImageNet 2012 validation set, comparable to the GoogLeNet models in Caffe’s Model Zoo [24]. We use the Darknet framework for all training and inference [26].

We optimize for sum-squared error in the output of our model. We use sum-squared error because it is easy to optimize; however, it does not perfectly align with our goal of maximizing average precision. It weights localization error equally with classification error, which may not be ideal. Also, in every image many grid cells do not contain any object. This pushes the “confidence” scores of those cells towards zero, often overpowering the gradient from cells that do contain objects. This can lead to model instability, causing training to diverge early on.
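A small numeric illustration shows why the empty cells can dominate. All numbers here are hypothetical: three cells containing objects, one confidence value per cell, an untrained network that outputs 0.5 everywhere, and a target of 1.0 for object cells (that is, assuming a perfectly overlapping box).

```python
S = 7
num_cells = S * S            # 49 grid cells
cells_with_objects = 3       # hypothetical image with three objects
cells_without_objects = num_cells - cells_with_objects

pred_conf = 0.5              # hypothetical output of an untrained network

# Unweighted sum-squared confidence error, split by cell type.
obj_err = cells_with_objects * (pred_conf - 1.0) ** 2
noobj_err = cells_without_objects * (pred_conf - 0.0) ** 2

print(obj_err)    # 0.75
print(noobj_err)  # 11.5
```

Under these assumptions the empty cells contribute roughly fifteen times more confidence error than the cells that contain objects, so most of the gradient pushes confidence towards zero.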

YOLO predicts multiple bounding boxes per grid cell. At training time we only want one bounding box predictor to be responsible for each object. We assign one predictor to be “responsible” for predicting an object based on which prediction has the highest current IOU with the ground truth. This leads to specialization between the bounding box predictors. Each predictor gets better at predicting certain sizes, aspect ratios, or classes of object, improving overall recall.
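The assignment rule can be sketched as an argmax over IOU. This is an illustrative version, not the Darknet code: the helper names `iou` and `responsible_predictor` and the corner-coordinate box format `(x_min, y_min, x_max, y_max)` are assumptions made for the example.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    ih = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = iw * ih
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def responsible_predictor(pred_boxes, gt_box):
    """Return (index, iou) of the predicted box that best overlaps the ground truth."""
    ious = [iou(p, gt_box) for p in pred_boxes]
    # Ties resolve to the lowest index because max() keeps the first maximum.
    best = max(range(len(ious)), key=ious.__getitem__)
    return best, ious[best]


# Hypothetical cell with B = 2 predictions: one wide box and one tall box.
preds = [(0, 0, 4, 2), (1, 0, 3, 4)]
ground_truth = (1, 0, 3, 4)
print(responsible_predictor(preds, ground_truth))  # (1, 1.0)
```

Because the winner is chosen from the current predictions, a predictor that already leans towards tall boxes keeps being assigned tall objects, which is the specialization the paragraph describes.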



