Abstract
Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities.
Introduction
We re-frame object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities. Using our system, you only look once (YOLO) at an image to predict what objects are present and where they are.
Unlike sliding window and region proposal-based techniques, YOLO sees the entire image during training and test time so it implicitly encodes contextual information about classes as well as their appearance.
YOLO still lags behind state-of-the-art detection systems in accuracy. While it can quickly identify objects in images it struggles to precisely localize some objects, especially small ones. We examine these trade-offs further in our experiments.
Unified Detection
We unify the separate components of object detection into a single neural network. Our network uses features from the entire image to predict each bounding box. It also predicts all bounding boxes across all classes for an image simultaneously. This means our network reasons globally about the full image and all the objects in the image. The YOLO design enables end-to-end training and real-time speeds while maintaining high average precision.
Our system divides the input image into an S × S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.
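This center-to-cell assignment can be sketched in a few lines (a minimal illustration, assuming object centers are given in normalized image coordinates in [0, 1); `responsible_cell` is a hypothetical helper, not code from the paper):

```python
def responsible_cell(x_center, y_center, S=7):
    """Return (row, col) of the grid cell containing the object center.

    x_center, y_center: normalized coordinates in [0, 1).
    S: number of cells along each side of the grid.
    """
    col = int(x_center * S)
    row = int(y_center * S)
    return row, col

# An object centered at (0.52, 0.31) in a 7 x 7 grid lands in cell (2, 3).
```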
If no object exists in that cell, we want the confidence score to be zero. Otherwise we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth.
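The confidence target can be computed with a standard IOU function (a sketch; the corner-coordinate box format and function name are our own choices):

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```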
For evaluating YOLO on PASCAL VOC, we use S = 7, B = 2. PASCAL VOC has 20 labelled classes so C = 20. Our final prediction is a 7 × 7 × 30 tensor.
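The tensor size follows from each cell predicting B boxes of five values each (x, y, w, h, confidence) plus C class probabilities, i.e. S × S × (B·5 + C). A small sanity check (illustrative, not code from the paper):

```python
def output_shape(S=7, B=2, C=20):
    """Shape of the YOLO output tensor: S x S x (B*5 + C).

    Each of the B boxes carries 5 predictions (x, y, w, h, confidence),
    and each cell additionally predicts C class probabilities.
    """
    return (S, S, B * 5 + C)

# For PASCAL VOC (S=7, B=2, C=20) this gives (7, 7, 30).
```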
Network Structure
Training
We pretrain our convolutional layers on the ImageNet 1000-class competition dataset [30]. For pretraining we use the first 20 convolutional layers from Figure 3 followed by an average-pooling layer and a fully connected layer. We train this network for approximately a week and achieve a single crop top-5 accuracy of 88% on the ImageNet 2012 validation set, comparable to the GoogLeNet models in Caffe’s Model Zoo [24]. We use the Darknet framework for all training and inference [26].
We optimize for sum-squared error in the output of our model. We use sum-squared error because it is easy to optimize; however, it does not perfectly align with our goal of maximizing average precision. It weights localization error equally with classification error, which may not be ideal. Also, in every image many grid cells do not contain any object. This pushes the “confidence” scores of those cells towards zero, often overpowering the gradient from cells that do contain objects. This can lead to model instability, causing training to diverge early on.
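The imbalance can be seen in a minimal sketch of an unweighted sum-squared error over the full output tensor (illustrative only; this is not the paper's final loss, and the shapes are the PASCAL VOC configuration above):

```python
import numpy as np

def naive_sse_loss(pred, target):
    """Unweighted sum-squared error over the full S x S x (B*5 + C) output.

    Localization, confidence, and classification terms all contribute with
    equal weight. Because most cells in a typical image contain no object,
    the many zero confidence targets dominate the gradient, which is the
    instability described above.
    """
    return float(np.sum((pred - target) ** 2))

# With a 7 x 7 x 30 output, every element contributes equally to the loss.
```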
YOLO predicts multiple bounding boxes per grid cell. At training time we only want one bounding box predictor to be responsible for each object. We assign one predictor to be “responsible” for predicting an object based on which prediction has the highest current IOU with the ground truth. This leads to specialization between the bounding box predictors. Each predictor gets better at predicting certain sizes, aspect ratios, or classes of object, improving overall recall.
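The responsibility assignment can be sketched as an argmax over the current IOUs of the cell's predictors (a hypothetical helper; boxes are assumed to be corner coordinates):

```python
def assign_responsible_predictor(predicted_boxes, gt_box):
    """Index of the predictor whose box has the highest IOU with the ground truth.

    predicted_boxes: list of B boxes as (x_min, y_min, x_max, y_max).
    gt_box: the ground-truth box in the same format.
    """
    def iou(a, b):
        iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = iw * ih
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0
    return max(range(len(predicted_boxes)),
               key=lambda i: iou(predicted_boxes[i], gt_box))
```

Only the responsible predictor receives the localization and confidence gradient for that object, which is what drives the specialization between predictors described above.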