We propose CornerNet, a new approach to object detection where we detect an object bounding box as a pair of keypoints, the top-left corner and the bottom-right corner, using a single convolutional neural network.
Drawbacks of Anchor Boxes
A very large set of anchor boxes leads to a huge imbalance between positive and negative anchor boxes.
Anchor boxes introduce many hyperparameters and design choices: how many boxes, what sizes, and what aspect ratios.
Overview
We detect an object as a pair of keypoints: the top-left corner and the bottom-right corner of the bounding box. We use a single convolutional network to predict a heatmap for the top-left corners of all instances of the same object category, a heatmap for all bottom-right corners, and an embedding vector for each detected corner. The embeddings serve to group a pair of corners that belong to the same object.
The task therefore splits into keypoint detection and keypoint grouping.
Three main problems:
- How to detect keypoints?
- How to group keypoints?
- A corner of a bounding box is often outside the object. How to improve the performance?
Detecting Corners
Backbone: an hourglass network or another network designed for human pose estimation; this paper uses an hourglass network.
Output: Two sets of heatmaps, one for top-left corners and one for bottom-right corners. Each set of heatmaps has C channels, where C is the number of categories.
Loss: Instead of equally penalizing negative locations, we reduce the penalty given to negative locations within a radius of the positive location. We determine the radius from the size of the object, ensuring that a pair of points within the radius would generate a bounding box with at least 0.7 IoU with the ground-truth annotation.
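A minimal sketch of how the reduced penalty could be encoded in the ground-truth heatmap. It assumes the reduction follows an unnormalized 2D Gaussian centered on the positive location with sigma equal to one third of the radius, and it takes the radius as an input instead of deriving it from the 0.7 IoU constraint. The function name `draw_gaussian` and its arguments are illustrative, not taken from these notes.

```python
import numpy as np

def draw_gaussian(heatmap, center, radius):
    """Write an unnormalized 2D Gaussian around `center` into `heatmap` in place.

    heatmap: 2D float array (H, W) for one category.
    center:  (x, y) integer location of the positive corner, inside the heatmap.
    radius:  non-negative integer radius; sigma is assumed to be radius / 3.
    """
    x0, y0 = center
    height, width = heatmap.shape
    sigma = max(radius / 3.0, 1e-6)  # guard against division by zero

    # Clip the (2 * radius + 1)-wide window to the heatmap bounds.
    left, right = max(0, x0 - radius), min(width, x0 + radius + 1)
    top, bottom = max(0, y0 - radius), min(height, y0 + radius + 1)

    ys, xs = np.mgrid[top:bottom, left:right]
    gaussian = np.exp(-((xs - x0) ** 2 + (ys - y0) ** 2) / (2 * sigma ** 2))

    # Keep the larger value where the windows of several corners overlap.
    window = heatmap[top:bottom, left:right]  # a view into heatmap
    np.maximum(window, gaussian, out=window)
    return heatmap
```

The positive location gets the value 1 and nearby negatives get values between 0 and 1; a loss can then scale down the penalty for a negative location according to how close its ground-truth value is to 1.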
Offset prediction: A location \(\left( x, y \right)\) in the image is mapped to the location \(\left( \left\lfloor \frac{x}{n} \right\rfloor, \left\lfloor \frac{y}{n} \right\rfloor \right)\) in the heatmaps, where \(n\) is the downsampling factor. Rounding down loses precision when locations are remapped to the input image, so we predict location offsets to slightly adjust the corner locations before remapping them to the input resolution.
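The mapping and its inverse can be sketched as follows. This assumes the offset target is the fractional part lost by rounding down; the helper names `corner_to_heatmap` and `heatmap_to_image` are hypothetical.

```python
import math

def corner_to_heatmap(x, y, n):
    """Map an image-space corner to its heatmap cell and the offset lost by flooring."""
    hx, hy = math.floor(x / n), math.floor(y / n)
    offset = (x / n - hx, y / n - hy)
    return (hx, hy), offset

def heatmap_to_image(cell, offset, n):
    """Add the offset back to the cell, then rescale to the input resolution."""
    (hx, hy), (ox, oy) = cell, offset
    return (hx + ox) * n, (hy + oy) * n

# With n = 4, the corner (103, 58) falls into cell (25, 14).
cell, offset = corner_to_heatmap(103, 58, 4)
restored = heatmap_to_image(cell, offset, 4)       # exact corner recovered
no_offset = heatmap_to_image(cell, (0.0, 0.0), 4)  # up to n - 1 pixels off
```

Without the offset, the remapped corner can be off by almost n pixels per axis, which matters most for small bounding boxes.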
Grouping Corners
Multiple objects may appear in an image, and thus multiple top-left and bottom-right corners may be detected. We need to determine if a pair of the top-left corner and bottom-right corner is from the same bounding box.
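The network predicts an embedding vector for each detected corner.
If a top-left corner and a bottom-right corner belong to the same bounding box, the distance between their embeddings should be small; otherwise it should be large.
Training uses a "pull" loss to bring the embeddings of the two corners of the same object together and a "push" loss to separate the embeddings of corners from different objects.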
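One possible form of the two losses, sketched under explicit assumptions: embeddings are one-dimensional, the pull term penalizes the squared distance of each corner embedding to the mean of its pair, and the push term is a hinge on the distance between the means of different objects with a margin of 1. The function name and the `margin` parameter are illustrative.

```python
import numpy as np

def pull_push_loss(tl_embed, br_embed, margin=1.0):
    """Return (pull, push) for N >= 1 objects.

    tl_embed, br_embed: length-N sequences; index k holds the top-left and
    bottom-right embedding of the k-th ground-truth object.
    """
    tl = np.asarray(tl_embed, dtype=float)
    br = np.asarray(br_embed, dtype=float)
    n = tl.shape[0]
    mean = (tl + br) / 2.0  # per-object mean embedding

    # Pull: both corners of an object should sit close to their mean.
    pull = np.mean((tl - mean) ** 2 + (br - mean) ** 2)

    if n < 2:
        return pull, 0.0  # nothing to push apart

    # Push: means of different objects should be at least `margin` apart.
    dist = np.abs(mean[:, None] - mean[None, :])
    hinge = np.maximum(0.0, margin - dist)
    np.fill_diagonal(hinge, 0.0)  # ignore an object paired with itself
    push = hinge.sum() / (n * (n - 1))
    return pull, push
```

Only the distances between embeddings matter here, not their absolute values, so the network is free to choose any values that separate the objects.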
Corner Pooling
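There is often no local visual evidence for the presence of corners, so we propose corner pooling to better localize the corners by encoding explicit prior knowledge. For example, to decide whether a location is a top-left corner, top-left corner pooling looks horizontally to the right for the topmost boundary of the object and vertically downward for the leftmost boundary.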
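A minimal NumPy sketch of top-left corner pooling on single-channel feature maps. It assumes two input maps of the same shape, one max-pooled over everything below each location and the other over everything to its right, with the two results added. The argument names `f_vertical` and `f_horizontal` are placeholders, and a real layer would operate on batched, multi-channel tensors.

```python
import numpy as np

def top_left_corner_pool(f_vertical, f_horizontal):
    """Top-left corner pooling for two 2D feature maps of shape (H, W)."""
    # Running maximum from the bottom row upward: each location receives the
    # maximum of its column at or below that row.
    down_max = np.maximum.accumulate(f_vertical[::-1, :], axis=0)[::-1, :]
    # Running maximum from the right column leftward: each location receives
    # the maximum of its row at or to the right of that column.
    right_max = np.maximum.accumulate(f_horizontal[:, ::-1], axis=1)[:, ::-1]
    return down_max + right_max

f = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 2.0],
              [0.0, 3.0, 0.0]])
pooled = top_left_corner_pool(f, f)
```

Bottom-right corner pooling would follow the same pattern with the scan directions reversed, pooling over everything above and everything to the left.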
Finally CornerNet
Experiments
Effectiveness of corner pooling
Effectiveness of reducing penalty to negative locations