1,(CHARACTERISTIC) A single FCN performs object detection by directly predicting bounding boxes and object class confidences across all locations and scales of an image, without requiring proposal generation.
2,(MOTIVATION) ① R-CNN finds it very hard to detect small objects, since the low resolution and lack of context in each candidate box significantly decrease the classification accuracy on them. ② R-CNN with proposal methods designed for general object detection can yield inferior performance on tasks such as face detection, due to low recall on small-sized faces and faces with complex appearance variations.
3,(MERIT) Can detect objects at different scales and under heavy occlusion, both accurately and efficiently.
4,(SIMILAR) YOLO also predicts bounding boxes and class probabilities directly from full images in one evaluation.
5,(NETWORK)
5.1 The single convolutional network simultaneously outputs multiple predicted bounding boxes and class confidences.
5.2 The system takes an image (m×n) as input and outputs an (m/4 × n/4) feature map with 5 channels.
6,(KERNEL) Define the left-top and right-bottom corners of the target bounding box in output coordinate space as p_t = (x_t, y_t) and p_b = (x_b, y_b) respectively; then each pixel i located at (x_i, y_i) in the output feature map describes a bounding box with t_i = {score_i, x_i − x_t, y_i − y_t, x_i − x_b, y_i − y_b}.
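A minimal numpy sketch of this per-pixel encoding; the function name and the argument convention are illustrative, not from the paper:

```python
import numpy as np

def encode_pixel_target(xi, yi, box, in_positive_region):
    """Encode the 5-channel target t_i for one output-map pixel (xi, yi).

    box = (xt, yt, xb, yb): left-top and right-bottom corners of the
    target bounding box, already scaled to output-map coordinates
    (i.e. divided by the network stride of 4).
    """
    xt, yt, xb, yb = box
    score = 1.0 if in_positive_region else 0.0
    # Distances from this pixel to the left-top and right-bottom corners.
    return np.array([score, xi - xt, yi - yt, xi - xb, yi - yb])

# Example: pixel (10, 12), box with corners (8, 9) and (20, 21).
t = encode_pixel_target(10, 12, (8, 9, 20, 21), True)
```

At test time the same geometry is inverted: a pixel's four channel values plus its own location recover the box corners.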
7,(TRAIN DATA) Crop large patches containing faces and sufficient background information at a single scale for training. Specifically, the patches are cropped and resized to 240×240 with an object in the center that is roughly 50 pixels in height. Each pixel can be treated as one sample, since every 5-channel pixel describes a bounding box.
8,(LABEL) The positively labeled region in the first channel of the ground-truth map is a filled circle with radius r_c, located at the center of a face bounding box. The remaining 4 channels are filled with the distances between the pixel location in the output map and the left-top and right-bottom corners of the nearest bounding box.
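The label generation can be sketched as follows; the channel layout and the single-box simplification (the paper assigns each pixel to its nearest box) are assumptions for illustration:

```python
import numpy as np

def make_gt_map(h, w, boxes, radius):
    """Build a 5-channel ground-truth map for an (h, w) output grid.

    boxes: list of (xt, yt, xb, yb) corners in output-map coordinates.
    Channel 0 is the score map; channels 1-4 hold corner distances.
    """
    gt = np.zeros((5, h, w), dtype=np.float32)
    ys, xs = np.mgrid[0:h, 0:w]
    for (xt, yt, xb, yb) in boxes:
        cx, cy = (xt + xb) / 2.0, (yt + yb) / 2.0
        # Channel 0: filled circle of radius `radius` at the box center.
        pos = (xs - cx) ** 2 + (ys - cy) ** 2 <= radius ** 2
        gt[0][pos] = 1.0
        # Channels 1-4: distances to this box's corners. With several
        # boxes the nearest box should win; one box is assumed here.
        gt[1], gt[2] = xs - xt, ys - yt
        gt[3], gt[4] = xs - xb, ys - yb
    return gt

gt = make_gt_map(16, 16, [(4, 4, 12, 12)], radius=2)
```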
9,(NETWORK)
9.1 (INITIALIZATION) The whole network has 16 convolution layers, with the first 12 initialized from the VGG-19 model.
9.2 (FEATURE FUSION) We concatenate the feature maps from conv3-4 and conv4-4, using a bilinear up-sampling layer to bring them to the same resolution.
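A numpy sketch of the fusion step, assuming VGG-19 channel counts (256 for conv3-4, 512 for conv4-4) and that conv4-4 runs at half the spatial resolution of conv3-4:

```python
import numpy as np

def bilinear_upsample(x, out_h, out_w):
    """Bilinear up-sampling of a (C, H, W) feature map (align-corners style)."""
    c, h, w = x.shape
    ry = np.linspace(0, h - 1, out_h)
    rx = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ry).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(rx).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ry - y0)[None, :, None]   # fractional row weights
    wx = (rx - x0)[None, None, :]   # fractional column weights
    top = x[:, y0][:, :, x0] * (1 - wx) + x[:, y0][:, :, x1] * wx
    bot = x[:, y1][:, :, x0] * (1 - wx) + x[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bot * wy

# conv4-4 is up-sampled to conv3-4's resolution, then the two maps are
# concatenated along the channel axis.
conv3_4 = np.random.rand(256, 60, 60)
conv4_4 = np.random.rand(512, 30, 30)
fused = np.concatenate([conv3_4, bilinear_upsample(conv4_4, 60, 60)], axis=0)
```

In a real network this is a fixed (or learned) deconvolution/interpolation layer; the numpy version only illustrates the geometry.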
10, (LOSS)
10.1 (BALANCE SAMPLE) Two strategies: ignoring the gray zone and hard negative mining. We use a binary mask for each output pixel to indicate whether it is selected in training.
10.2 We normalize the regression target d by dividing by the standard object height.
10.3 Classification Loss: L_cls = ‖y − y*‖²
10.4 BBR Loss: L_loc = ∑_{i∈{tx,ty,bx,by}} ‖d_i − d_i*‖
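A sketch of the two losses combined with the binary mask from 10.1 and the height normalization from 10.2; the function name, the map shapes, and the use of an absolute difference for ‖d_i − d_i*‖ are assumptions:

```python
import numpy as np

def densebox_loss(pred, gt, mask, std_height=50.0):
    """L2 classification loss plus masked bounding-box regression loss.

    pred, gt: (5, H, W) maps; channel 0 is the score y, channels 1-4
    the regression targets d. mask: (H, W) binary map selecting the
    pixels used in training (gray-zone pixels and easy negatives get 0).
    std_height = 50 is the standard object height from the training setup.
    """
    y, y_star = pred[0], gt[0]
    # Classification: squared L2 distance over selected pixels.
    l_cls = np.sum(mask * (y - y_star) ** 2)
    # Regression targets are normalized by the standard object height;
    # only selected positive pixels contribute.
    d = pred[1:] / std_height
    d_star = gt[1:] / std_height
    pos = mask * (y_star > 0)
    l_loc = np.sum(pos[None] * np.abs(d - d_star))
    return l_cls, l_loc

# Tiny demo: a perfect prediction gives zero loss for both terms.
gt = np.zeros((5, 4, 4)); gt[0, 1, 1] = 1.0
l_cls, l_loc = densebox_loss(gt, gt, np.ones((4, 4)))
```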
11,(AUGMENTATIONS) We apply left-right flips, translation shifts (of up to 25 pixels), and scale deformations (in [0.8, 1.25]).
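The three augmentations can be sketched on the image patch alone (in training, the label maps would need the same transforms); the interpolation choice is an assumption:

```python
import numpy as np

def augment(patch, rng):
    """Random left-right flip, translation shift (up to 25 px), and
    scale jitter in [0.8, 1.25] on an (H, W, C) patch — a numpy sketch."""
    h, w, _ = patch.shape
    if rng.random() < 0.5:
        patch = patch[:, ::-1]                        # left-right flip
    dx, dy = rng.integers(-25, 26, size=2)
    patch = np.roll(patch, (dy, dx), axis=(0, 1))     # translation shift
    s = rng.uniform(0.8, 1.25)                        # scale deformation
    # Nearest-neighbour resampling back to the original size (assumed;
    # the notes do not state which interpolation is used).
    ys = np.clip((np.arange(h) / s).astype(int), 0, h - 1)
    xs = np.clip((np.arange(w) / s).astype(int), 0, w - 1)
    return patch[ys][:, xs]

out = augment(np.random.rand(240, 240, 3), np.random.default_rng(0))
```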