1,(CHARACTERISTIC) A single FCN performs object detection by directly predicting bounding boxes and object class confidences across all locations and scales of an image, without requiring proposal generation.
2,(MOTIVATION) ① R-CNN finds it very hard to detect small objects, since the low resolution and lack of context in each candidate box significantly decrease the classification accuracy on them. ② R-CNN with proposal methods designed for general object detection can yield inferior performance on tasks such as face detection, due to low recall on small-sized faces and faces with complex appearance variations.
3,(MERIT) Can detect objects at different scales and under heavy occlusion, both accurately and efficiently.
4,(SIMILAR) YOLO also predicts bounding boxes and class probabilities directly from full images in one evaluation.
5,(NETWORK)
5.1 The single convolutional network simultaneously outputs multiple predicted bounding boxes and class confidences.
5.2 The system takes an image (m×n) as input and outputs an (m/4 × n/4) feature map with 5 channels.
6,(KERNEL) Define the left-top and right-bottom corners of the target bounding box in output coordinate space as p_t = (x_t, y_t) and p_b = (x_b, y_b) respectively; then each pixel i located at (x_i, y_i) in the output feature map describes a bounding box with t_i = {score_i, x_i − x_t, y_i − y_t, x_i − x_b, y_i − y_b}.
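A minimal numpy sketch of this per-pixel encoding; the function name and the argument convention are illustrative, not from the paper:

```python
import numpy as np

def encode_pixel_target(xi, yi, box, in_positive_region):
    """Encode the 5-channel target t_i for one output-map pixel (xi, yi).

    box = (xt, yt, xb, yb): left-top and right-bottom corners of the
    target bounding box, already scaled to output-map coordinates
    (i.e. divided by the network stride of 4).
    """
    xt, yt, xb, yb = box
    score = 1.0 if in_positive_region else 0.0
    # Distances from this pixel to the left-top and right-bottom corners.
    return np.array([score, xi - xt, yi - yt, xi - xb, yi - yb])

# Example: pixel (10, 12), box with corners (8, 9) and (20, 21).
t = encode_pixel_target(10, 12, (8, 9, 20, 21), True)
```

At test time the same geometry is inverted: a pixel's four channel values plus its own location recover the box corners.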
7,(TRAIN DATA) Crop large patches containing faces and sufficient background information at a single scale for training. Specifically, the patches are cropped and resized to 240×240 with an object in the center that is roughly 50 pixels in height. Each pixel can be treated as one sample, since every 5-channel pixel describes a bounding box.
8,(LABEL) The positively labeled region in the first channel of the ground-truth map is a filled circle with radius r_c, located at the center of a face bounding box. The remaining 4 channels are filled with the distances between the pixel location in the output map and the left-top and right-bottom corners of the nearest bounding box.
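The label generation can be sketched as follows; the channel layout and the single-box simplification (the paper assigns each pixel to its nearest box) are assumptions for illustration:

```python
import numpy as np

def make_gt_map(h, w, boxes, radius):
    """Build a 5-channel ground-truth map for an (h, w) output grid.

    boxes: list of (xt, yt, xb, yb) corners in output-map coordinates.
    Channel 0 is the score map; channels 1-4 hold corner distances.
    """
    gt = np.zeros((5, h, w), dtype=np.float32)
    ys, xs = np.mgrid[0:h, 0:w]
    for (xt, yt, xb, yb) in boxes:
        cx, cy = (xt + xb) / 2.0, (yt + yb) / 2.0
        # Channel 0: filled circle of radius `radius` at the box center.
        pos = (xs - cx) ** 2 + (ys - cy) ** 2 <= radius ** 2
        gt[0][pos] = 1.0
        # Channels 1-4: distances to this box's corners. With several
        # boxes the nearest box should win; one box is assumed here.
        gt[1], gt[2] = xs - xt, ys - yt
        gt[3], gt[4] = xs - xb, ys - yb
    return gt

gt = make_gt_map(16, 16, [(4, 4, 12, 12)], radius=2)
```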
9,(NETWORK)
9.1 (INITIALIZATION) The whole network has 16 convolution layers, with the first 12 initialized from the VGG-19 model.
9.2 (FEATURE FUSION) We concatenate the feature maps from conv3-4 and conv4-4, using a bilinear up-sampling layer to bring them to the same resolution.
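A numpy sketch of the fusion step, assuming VGG-19 channel counts (256 for conv3-4, 512 for conv4-4) and that conv4-4 runs at half the spatial resolution of conv3-4:

```python
import numpy as np

def bilinear_upsample(x, out_h, out_w):
    """Bilinear up-sampling of a (C, H, W) feature map (align-corners style)."""
    c, h, w = x.shape
    ry = np.linspace(0, h - 1, out_h)
    rx = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ry).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(rx).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ry - y0)[None, :, None]   # fractional row weights
    wx = (rx - x0)[None, None, :]   # fractional column weights
    top = x[:, y0][:, :, x0] * (1 - wx) + x[:, y0][:, :, x1] * wx
    bot = x[:, y1][:, :, x0] * (1 - wx) + x[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bot * wy

# conv4-4 is up-sampled to conv3-4's resolution, then the two maps are
# concatenated along the channel axis.
conv3_4 = np.random.rand(256, 60, 60)
conv4_4 = np.random.rand(512, 30, 30)
fused = np.concatenate([conv3_4, bilinear_upsample(conv4_4, 60, 60)], axis=0)
```

In a real network this is a fixed (or learned) deconvolution/interpolation layer; the numpy version only illustrates the geometry.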
10, (LOSS)
10.1 (BALANCE SAMPLE) Two strategies: ignoring the gray zone and hard negative mining. We use a binary mask for each output pixel to indicate whether it is selected in training.
10.2 We normalize the regression target d by dividing by the standard object height.
10.3 Classification Loss: L_cls = ‖y − y*‖²
10.4 BBR Loss: L_loc = ∑_{i∈{tx,ty,bx,by}} ‖d_i − d_i*‖
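A sketch of the two losses combined with the binary mask from 10.1 and the height normalization from 10.2; the function name, the map shapes, and the use of an absolute difference for ‖d_i − d_i*‖ are assumptions:

```python
import numpy as np

def densebox_loss(pred, gt, mask, std_height=50.0):
    """L2 classification loss plus masked bounding-box regression loss.

    pred, gt: (5, H, W) maps; channel 0 is the score y, channels 1-4
    the regression targets d. mask: (H, W) binary map selecting the
    pixels used in training (gray-zone pixels and easy negatives get 0).
    std_height = 50 is the standard object height from the training setup.
    """
    y, y_star = pred[0], gt[0]
    # Classification: squared L2 distance over selected pixels.
    l_cls = np.sum(mask * (y - y_star) ** 2)
    # Regression targets are normalized by the standard object height;
    # only selected positive pixels contribute.
    d = pred[1:] / std_height
    d_star = gt[1:] / std_height
    pos = mask * (y_star > 0)
    l_loc = np.sum(pos[None] * np.abs(d - d_star))
    return l_cls, l_loc

# Tiny demo: a perfect prediction gives zero loss for both terms.
gt = np.zeros((5, 4, 4)); gt[0, 1, 1] = 1.0
l_cls, l_loc = densebox_loss(gt, gt, np.ones((4, 4)))
```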
11,(AUGMENTATIONS) We apply left-right flips, translation shifts (of up to 25 pixels), and scale deformations (in [0.8, 1.25]).
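The three augmentations can be sketched on the image patch alone (in training, the label maps would need the same transforms); the interpolation choice is an assumption:

```python
import numpy as np

def augment(patch, rng):
    """Random left-right flip, translation shift (up to 25 px), and
    scale jitter in [0.8, 1.25] on an (H, W, C) patch — a numpy sketch."""
    h, w, _ = patch.shape
    if rng.random() < 0.5:
        patch = patch[:, ::-1]                        # left-right flip
    dx, dy = rng.integers(-25, 26, size=2)
    patch = np.roll(patch, (dy, dx), axis=(0, 1))     # translation shift
    s = rng.uniform(0.8, 1.25)                        # scale deformation
    # Nearest-neighbour resampling back to the original size (assumed;
    # the notes do not state which interpolation is used).
    ys = np.clip((np.arange(h) / s).astype(int), 0, h - 1)
    xs = np.clip((np.arange(w) / s).astype(int), 0, w - 1)
    return patch[ys][:, xs]

out = augment(np.random.rand(240, 240, 3), np.random.default_rng(0))
```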