SSD Source Code and Paper Study Notes

Source Code Study

Dtype

caffe's Dtype setting: Net and Solver are templates; the concrete instantiation is actually done in caffe.cpp, and float is used throughout.

lmdb generation

Mainly checks the data and generates the lmdb files.

User interface: create_data.sh

Call chain: create_data.sh -> create_annoset.py (command-line argument parsing and file path checks) -> convert_annoset.cpp (creates the lmdb or another db) -> io.cpp (reads images and labels)

AnnoDataLayer

BatchSampler

Since the sampler's parameter is the minimum jaccard overlap with object bboxes, sampling depends on the annotation content;

A BatchSampler (or Sampler) is a kind of transformation, essentially a special crop operation.

Threaded data-loading flow

LayerSetUp -> StartInternalThread -> InternalThreadEntry -> load_batch()

Data is augmented online. The core is Expand and sampling, which call TransformAnnotation and, in turn, the bbox operations resize, project, mirror, etc.;

Online vs. offline augmentation

Online augmentation uses prefetching to exploit CPU time that would otherwise be idle during network training. Compared with offline augmentation it saves disk space and augmentation time, allows the augmentation parameters to be changed at any time, and, through probability settings, effectively produces more augmented samples.

The main advantage of offline augmentation is that existing augmentation programs can be reused without adding them to the caffe codebase.

Both caffe and SSD support online augmentation.

BBoxUtil

Understand ProjectBBox; LocateBBox is its inverse (see my paper notes).

[SSD网络解析之bbox_util](https://blog.csdn.net/qq_21368481/article/details/82227989)

Using caffe.proto parameters from the python interface

See the author's portion of gen.py, including how non-Parameter parameters are obtained.

VisualizeBBox

Adding visualization parameters to the det_out layer in the network causes an error.

A magic number 4 is involved; not fixed yet.

My current visualization is a function I wrote myself.

CodeType

PriorBoxParameter_CodeType describes how priorboxes are encoded, but the code type directly determines how the downstream regression model is handled (reflected in Encode() and Decode()).

PriorBoxLayer

Layer comment:

/**
* @brief Generates prior boxes for a feature map with specified parameters.
*
* @param bottom input Blob vector (length 2)
*
* -# @f$ (N,\, C ,\, H^s ,\, W^s) @f$
* the input feature map @f$ F^s@f$
*
* -# @f$ (N,\, C ,\, H_0 ,\, W_0) @f$
 * the data layer @f$ x_0 @f$, used to calculate the step (if not provided).
*
* @param top output Blob vector (length 1)
* -# @f$ (1,\, 2 ,\, M_f \times 4) @f$ where @f$ M_f @f$ is the number of all
* priors in the feature map. @see Reshape().
*
* By default, a box of aspect ratio 1 and min_size and a box of aspect
* ratio 1 and sqrt(min_size * max_size) are created.
*/
Reshape(){
    ...
    // Since all images in a batch have the same height and width, we only need to
    // generate one set of priors which can be shared across all images.
    top_shape[0] = 1;
    // 2 channels. First channel stores the value of each prior coordinate.
    // Second channel stores the variance of each prior coordinate (the
    // variance values from the prior_box_param, used by Encode/Decode).
    top_shape[1] = 2;
    top_shape[2] = feature_width * feature_height * num_priors_ * 4;
    ...
}
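
The essence of the Forward pass can be sketched as a standalone function (hypothetical names and parameters; one min_size/max_size pair, no flip or clip handling, variances omitted), assuming the defaults described in the layer comment above:

```cpp
#include <cmath>
#include <utility>
#include <vector>

// Hypothetical standalone sketch of what PriorBoxLayer::Forward computes for
// one feature map: normalized [xmin, ymin, xmax, ymax] priors.
std::vector<float> GeneratePriors(int feat_w, int feat_h, int img_w, int img_h,
                                  float step, float min_size, float max_size,
                                  const std::vector<float>& aspect_ratios) {
  std::vector<float> priors;
  for (int h = 0; h < feat_h; ++h) {
    for (int w = 0; w < feat_w; ++w) {
      // Box center in input-image pixels; step defaults to image size divided
      // by feature-map size when not provided.
      float cx = (w + 0.5f) * step;
      float cy = (h + 0.5f) * step;
      std::vector<std::pair<float, float>> sizes;  // (box_w, box_h)
      // Aspect ratio 1 with min_size ...
      sizes.push_back({min_size, min_size});
      // ... and aspect ratio 1 with sqrt(min_size * max_size).
      float s = std::sqrt(min_size * max_size);
      sizes.push_back({s, s});
      // One extra box per additional aspect ratio.
      for (float ar : aspect_ratios) {
        sizes.push_back({min_size * std::sqrt(ar), min_size / std::sqrt(ar)});
      }
      for (const auto& wh : sizes) {  // store normalized corners
        priors.push_back((cx - wh.first / 2) / img_w);   // xmin
        priors.push_back((cy - wh.second / 2) / img_h);  // ymin
        priors.push_back((cx + wh.first / 2) / img_w);   // xmax
        priors.push_back((cy + wh.second / 2) / img_h);  // ymax
      }
    }
  }
  return priors;  // the second top channel would carry 4 variances per prior
}
```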

MultiboxLayer

Note: the code obtains the target by computing the encoded_box.

/**
 * @brief Perform MultiBox operations. Including the following:
 *
 *  - decode the predictions.
 *  - perform matching between priors/predictions and ground truth.
 *  - perform negative example mining.
 *  - use matched boxes and confidences to compute loss with internal
 *  localization loss layer and confidence loss layer.
 */

/* Data preparations. */
  // Retrieve all ground truth.
  // Retrieve all prior bboxes. They are the same within a batch since we assume all images in a batch have the same dimensions.
  // Retrieve all predictions.
  // Find matches between source bboxes and ground truth bboxes.
/* Mining. */
	// Sample hard negative (and positive) examples based on mining type.
/* Optionally Visualize data. */
/* Compute location loss. */
/* Compute confidence loss. */
/* Compute multi-box loss */

DetOutLayer

The actual implementation is more complex; the following only sketches the flow of steps.

/**
 * @brief Generate the detection output based on location and confidence
 * predictions by doing non maximum suppression.
 *
 * Intended for use with MultiBox detection method.
 *
 * NOTE: does not implement Backwards operation.
 */

SSD DetectionOutputLayer
Input: conf, loc, prior
Output: detections
Steps:
  numOfKeptDets = 0;
  allClassBoxesBatch[batchSize]
  // For each image of the batch.
    // Decode loc predictions from priors to boxes.
    { // Post process.
      // For each class, start from 1 to ignore background class
        // For each prior.
          // Filter confPredictions by confidenceThreshold to get classBoxes (optionally share location across classes); this step may filter out many predictions.
        // Sort classBoxes by confidence.
        // Apply NMS to classBoxes.
        // Add classBoxes after NMS to allClassBoxes.
      // Sort allClassBoxes by confidence.
      // Keep the top-k results with the highest confidence in allClassBoxes per image.
      // Update numOfKeptDets.
    } // Post process.
    // Add allClassBoxes after NMS to allClassBoxesBatch.

  // Reshape output data to numOfKeptDets x detVectorLen.
  // Write allClassBoxesBatch to output data.
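
A minimal sketch of the overlap test and greedy NMS used in the "Apply NMS" step above (hypothetical standalone helpers, not the actual ApplyNMSFast code in bbox_util):

```cpp
#include <algorithm>
#include <vector>

struct Det { float score; float box[4]; };  // box = [xmin, ymin, xmax, ymax]

// IoU of two corner-format boxes.
float JaccardOverlap(const float* a, const float* b) {
  float iw = std::max(0.f, std::min(a[2], b[2]) - std::max(a[0], b[0]));
  float ih = std::max(0.f, std::min(a[3], b[3]) - std::max(a[1], b[1]));
  float inter = iw * ih;
  float uni = (a[2] - a[0]) * (a[3] - a[1])
            + (b[2] - b[0]) * (b[3] - b[1]) - inter;
  return uni > 0 ? inter / uni : 0.f;
}

// Greedy NMS: keep the highest-scoring box, drop every box that overlaps it
// by more than nms_threshold, and repeat on the remainder.
std::vector<Det> Nms(std::vector<Det> dets, float nms_threshold) {
  std::sort(dets.begin(), dets.end(),
            [](const Det& x, const Det& y) { return x.score > y.score; });
  std::vector<Det> kept;
  for (const Det& d : dets) {
    bool keep = true;
    for (const Det& k : kept)
      if (JaccardOverlap(d.box, k.box) > nms_threshold) { keep = false; break; }
    if (keep) kept.push_back(d);
  }
  return kept;
}
```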

DetEvalLayer

The DetEval layer outputs TP/FP scores; AP is computed in the Solver.

/**
 * @brief Generate the detection evaluation based on DetectionOutputLayer and
 * ground truth bounding box labels.
 *
 * Intended for use with MultiBox detection method.
 *
 * NOTE: does not implement Backwards operation.
 */
	  
	// Retrieve all detection results.
  // Retrieve all ground truth (including difficult ones).
  // Initialize top_data.
  // Insert detection evaluate status.
    // For each image.
      // Get current image detections.
      // If there is no ground truth for the current image, all detections become false_pos.
      // If there is ground truth for the current image:
        // For each label type among the current image detections.
          // Get detections of the current label.
            // If there is no ground truth for this label, all its detections become false_pos.
            // If there is ground truth for this label:
              // Get gt bboxes of the current label.
              // Scale ground truth if needed.
              // Sort detections in descending order of score.
              // For each detection:
                // Compute max overlap with each ground truth bbox and record the matched gt.
                // If max overlap is greater than the threshold and the matched gt is not yet visited, mark this detection tp, else fp.
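
Since DetEval only emits per-detection TP/FP flags with scores, the AP computation in the Solver can be sketched like this (hypothetical helper; VOC2007-style 11-point interpolation is assumed, other ap_version choices exist):

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Hypothetical sketch: reduce per-detection (score, is_tp) pairs for one
// class to average precision, VOC2007-style 11-point interpolation.
float ComputeAP(std::vector<std::pair<float, bool>> dets, int num_gt) {
  if (num_gt == 0 || dets.empty()) return 0.f;
  // Sweep detections in descending score order, accumulating TP/FP counts.
  std::sort(dets.begin(), dets.end(),
            [](const std::pair<float, bool>& a,
               const std::pair<float, bool>& b) { return a.first > b.first; });
  std::vector<float> prec, rec;
  int tp = 0, fp = 0;
  for (const auto& d : dets) {
    d.second ? ++tp : ++fp;
    prec.push_back(float(tp) / float(tp + fp));
    rec.push_back(float(tp) / float(num_gt));
  }
  // Average the best precision achievable at recall >= 0.0, 0.1, ..., 1.0.
  float ap = 0.f;
  for (int t = 0; t <= 10; ++t) {
    float p = 0.f;
    for (size_t i = 0; i < rec.size(); ++i)
      if (rec[i] >= 0.1f * t) p = std::max(p, prec[i]);
    ap += p / 11.f;
  }
  return ap;
}
```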

Theory and Modeling

Paper: the original, plus a blog walkthrough (https://blog.csdn.net/u010167269/article/details/52563573)

Network structure

See netscope.

AnnotatedDataLayer

output:

data

label: $(1,\, 1,\, N_g,\, 8)$

$N_g$ is the number of gt boxes

the 8 entries are

[item_id, group_label, instance_id, xmin, ymin, xmax, ymax, diff]

note it is different from NormalizeBBox

Modeling

Training

input:

image, ground truth (gt)

output:

prediction (pred)

Both pred and gt are rectangular boxes with a class label.

model:

image -> conv net feature map -> cls conv + loc conv -> conf + bbox transform (+ prior box) -> conf + bbox -> post processing -> pred

The bbox transform can also be called the encoded bbox; see encode/decode.

The prior box serves as the initial value of pred.

loss

loss = L(pred, gt)

Inference

input: image

output: predictions

PriorBox generation

single scale prior box generation

input: feature map, data

output: $P^s$, the priors of the feature map at scale s.

multi-scale prior box generation
$$\{P^s_{(1,\, 2,\, M_f \times 4)}\}_{s\in Scales} \xrightarrow[axis=2]{Concat} P_{(1,\, 2,\, M_a \times 4)}$$

MultiBox prediction

input

feature map: $F$

output

localization prediction: $l$

confidence prediction: $c$

single scale localization prediction

the number of priors per feature map cell: $M_c$

number of priors in the feature map: $M_f = H \times W \times M_c$

number of localization predictions in the feature map: $M_f' = M_f \times N_{lc}$

the number of location classes: $N_{lc}$

$N_{lc} = K + 1$ if location is not shared across classes (in other words, class-specific bboxes are used); otherwise $N_{lc} = 1$

so $C_l = M_c \times N_{lc} \times 4$, and
$$F_{(N,\,C_f,\,H,\,W)} \xrightarrow{Conv} l1_{(N,\,C_l,\,H,\,W)} \xrightarrow[order=0,2,3,1]{Permute} l2_{(N,\,H,\,W,\,C_l)} \xrightarrow[axis=1]{Flatten} l3_{(N,\,H\times W\times C_l)} = l3_{(N,\,M_f \times N_{lc} \times 4)} = l3_{(N,\,M_f' \times 4)}$$
single scale confidence prediction

let $C_c = M_c \times (K+1)$, then
$$F_{(N,\,C_f,\,H,\,W)} \xrightarrow{Conv} c1_{(N,\,C_c,\,H,\,W)} \xrightarrow[order=0,2,3,1]{Permute} c2_{(N,\,H,\,W,\,C_c)} \xrightarrow[axis=1]{Flatten} c3_{(N,\,H\times W\times C_c)} = c3_{(N,\,M_f \times (K+1))}$$
multi scale localization prediction

$s$ is the scale indicator.

number of all priors: $M_a = \sum_{s\in Scales} M^s_f = \sum_{s\in Scales} (H^s \times W^s \times M^s_c)$

number of all localization predictions: $M_a' = M_a \times N_{lc}$
$$\{l3^s_{(N,\,H^s,\,W^s,\,C^s_l)}\}_{s\in Scales} \xrightarrow[axis=1]{Concat} l_{(N,\,M_a \times N_{lc} \times 4)} = l_{(N,\,M_a' \times 4)}$$
multi scale confidence prediction
$$\{c3^s_{(N,\,H^s,\,W^s,\,C^s_c)}\}_{s\in Scales} \xrightarrow[axis=1]{Concat} c_{(N,\,M_a \times (K+1))}$$

encode/decode

encode

input: $(P, G)$ or (prior box, decoded_box)

output: $l$ or encoded_box

decode

input: $(P, l)$ or (prior box, encoded_box)

output: $G$ or decoded_box
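
A minimal sketch of the CENTER_SIZE code type (hypothetical BBox struct and helpers; the real EncodeBBox/DecodeBBox also handle the other code types and the variance scaling):

```cpp
#include <cmath>

struct BBox { float xmin, ymin, xmax, ymax; };

// Encode a ground-truth (decoded) box against a prior: center offsets are
// normalized by the prior size, width/height are log-ratios.
BBox Encode(const BBox& prior, const BBox& gt) {
  float pw = prior.xmax - prior.xmin, ph = prior.ymax - prior.ymin;
  float pcx = (prior.xmin + prior.xmax) / 2, pcy = (prior.ymin + prior.ymax) / 2;
  float gw = gt.xmax - gt.xmin, gh = gt.ymax - gt.ymin;
  float gcx = (gt.xmin + gt.xmax) / 2, gcy = (gt.ymin + gt.ymax) / 2;
  // The encoded box reuses the 4 fields to store (tx, ty, tw, th).
  return {(gcx - pcx) / pw, (gcy - pcy) / ph,
          std::log(gw / pw), std::log(gh / ph)};
}

// Exact inverse of Encode: recover the decoded box from prior + offsets.
BBox Decode(const BBox& prior, const BBox& code) {
  float pw = prior.xmax - prior.xmin, ph = prior.ymax - prior.ymin;
  float pcx = (prior.xmin + prior.xmax) / 2, pcy = (prior.ymin + prior.ymax) / 2;
  float cx = pcx + code.xmin * pw, cy = pcy + code.ymin * ph;
  float w = pw * std::exp(code.xmax), h = ph * std::exp(code.ymax);
  return {cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2};
}
```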

Training

Matching strategy

why and what?

Unlike a typical machine-learning problem (plain classification or regression), a single input (image) here has multiple preds and gts, so computing the loss requires a matching step that pairs some pred with some gt; the loss can only be computed between matched pred/gt pairs.

who?

direct method: match between pred and gt

A pred might be close to the gt while its prior is far from the gt; the pred then has little connection to the conv-net features at the prior's location, which confuses training and ultimately hurts prediction quality.

adopted method: match between prior and gt

This effectively adds a constraint between pred and gt during training: the pred's initial value (the prior) is already close to the gt, making an accurate pred easier to obtain. At inference time, the learned model tends to apply only a small transform to the prior to get the pred, so the conv-net features (at the prior's location) are more tightly tied to the pred, and large errors become less likely.

how?

refer to the paper and the source code.

jaccard overlap

This should simply be IoU; abbreviated jo below.
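
For two boxes $A$ and $B$:
$$jo(A,B) = \frac{|A\cap B|}{|A\cup B|} = \frac{|A\cap B|}{|A| + |B| - |A\cap B|}$$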

Bipartite matching

See the definition of a bipartite graph/network; it essentially means a one-to-one correspondence.

Wikipedia describes a bipartite graph as a graph (G, E), where G is the vertex set and E the edge set, such that G can be split into two disjoint sets U and V with every edge in E having one endpoint in U and the other in V. A simple illustration:

(figure: bipartite graph example; image missing)

So bipartite matching means gts and priors are matched one-to-one: each gt is matched to the prior with the largest jo.

per-prediction matching

Matching done on top of bipartite matching; note that the paper and the source code differ slightly.

The paper allows a prior to match any gt with $jo > T$:

(note by me: after bipartite matching, ) we then match default boxes to any ground truth with jaccard overlap higher than a threshold (0.5). This simplifies the learning problem, allowing the network to predict high scores for multiple overlapping default boxes rather than requiring it to pick only the one with maximum overlap.

In the source code, however, on top of bipartite matching, each remaining prior (excluding priors already matched by the bipartite step) only matches the gt with the largest jo among the gts with $jo > T$.
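
A minimal sketch of this two-stage matching as the source code does it (hypothetical helper; the second-stage "largest jo only" rule is the only difference from the paper's variant):

```cpp
#include <vector>

// overlaps[i][j] = jaccard overlap between prior i and gt j.
// Returns, per prior, the matched gt index or -1 (negative).
std::vector<int> MatchPriors(const std::vector<std::vector<float>>& overlaps,
                             int num_gt, float threshold) {
  int num_priors = static_cast<int>(overlaps.size());
  std::vector<int> match(num_priors, -1);
  std::vector<bool> gt_taken(num_gt, false);
  // Bipartite: repeatedly take the globally best unmatched (prior, gt) pair,
  // so every gt gets exactly one prior.
  for (int n = 0; n < num_gt; ++n) {
    int bi = -1, bj = -1;
    float best = -1.f;
    for (int i = 0; i < num_priors; ++i) {
      if (match[i] != -1) continue;
      for (int j = 0; j < num_gt; ++j)
        if (!gt_taken[j] && overlaps[i][j] > best) {
          best = overlaps[i][j]; bi = i; bj = j;
        }
    }
    if (bi < 0) break;
    match[bi] = bj;
    gt_taken[bj] = true;
  }
  // Per-prediction: each still-unmatched prior matches the gt with the
  // largest jo among gts with jo > threshold.
  for (int i = 0; i < num_priors; ++i) {
    if (match[i] != -1) continue;
    float best = threshold;
    int bj = -1;
    for (int j = 0; j < num_gt; ++j)
      if (overlaps[i][j] > best) { best = overlaps[i][j]; bj = j; }
    match[i] = bj;  // stays -1 if no gt clears the threshold
  }
  return match;
}
```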

result?

positive: priors matched to a gt

negative: priors not matched to any gt

Hard negative mining

why?

too many negative priors; significant imbalance between positives and negatives

how?

$c^i_{max}$ is the max prediction confidence over all categories for the i-th prior:
$$c^i_{max} = \max_{k\in\{0,1,\dots,K\}} c^i_k$$
For all $i \in Neg$, sort by $c^i_{max}$, then

pick the top ones so that the ratio between the negatives and positives is at most 3:1.

We found that this leads to faster optimization and a more stable training.
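
A minimal sketch of this mining step (hypothetical helper, sorting negatives by the $c^i_{max}$ defined above):

```cpp
#include <algorithm>
#include <vector>

// match[i] >= 0 marks prior i as positive; c_max[i] is its top class
// confidence. Keeps the most confidently wrong negatives, capped so that
// negatives : positives <= neg_pos_ratio : 1.
std::vector<int> MineHardNegatives(const std::vector<int>& match,
                                   const std::vector<float>& c_max,
                                   int neg_pos_ratio = 3) {
  int num_pos = 0;
  std::vector<int> neg;
  for (int i = 0; i < static_cast<int>(match.size()); ++i) {
    if (match[i] >= 0) ++num_pos;
    else neg.push_back(i);
  }
  // Sort negatives by descending c_max and keep only the hardest ones.
  std::sort(neg.begin(), neg.end(),
            [&](int a, int b) { return c_max[a] > c_max[b]; });
  int num_neg = std::min<int>(neg.size(), neg_pos_ratio * num_pos);
  neg.resize(num_neg);
  return neg;  // priors that enter the confidence loss as negatives
}
```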

Training objective

MultiBoxLoss input & output

localization prediction: $l$, see [MultiBox prediction](#MultiBox prediction)

confidence prediction: $c$

prior: $P$, see [PriorBox generation](#PriorBox generation)

gt: $G$, see AnnotatedDataLayer

localization is in the format of a bbox transformation

output: multi-box loss $L$
$$\begin{aligned} shape(l) &= (N,\,M_a' \times N_r) \\ shape(c) &= (N,\,M_a \times (K+1)) \\ shape(P) &= (1,\,2,\,M_a \times N_r) \\ shape(G) &= (1,\,1,\,N_g,\,N_l) \end{aligned}$$
$N_r$ is the number of variables regressed for a single box.

parameter inference:
$$\begin{aligned} M_a &= c[1]/(K+1) \\ N_r &= P[2]/M_a \\ M_a' &= l[1]/N_r \\ N_{lc} &= M_a'/M_a \end{aligned}$$
Multi-box loss
$$L = \frac{1}{N_{Pos}+N_{Neg}} L_{cls}(I,c) + \alpha \frac{1}{N_{Pos}} L_{loc}(I,t,l)$$
$\alpha$ is the weight term

$Pos$ is the set of positive (matched) default boxes; $Neg$ is the set of negative default boxes

$I_{ij}\in\{1,0\}$ indicates whether the i-th default box is matched to the j-th ground truth box

Localization loss:
$$L_{loc}(I,t,l) = \sum_{i\in Pos}^{N_{Pos}} \sum_j \sum_{*\in\{x,y,w,h\}} I_{ij}\, smooth_{L1}(t^*_{ij} - l^*_i)$$
$\delta = t^*_{ij} - l^*_i$ is the error; the main difference from bbox regression in R-CNN is that here the default boxes must first be matched to the ground truth boxes

$t^*_{ij}$: localization target; $l^*_i$: localization (bbox transformation) prediction.

$t = t(G,P)$
$$\begin{aligned} t^x_{ij} &= (G^j_x - P^i_x)/P^i_w \\ t^y_{ij} &= (G^j_y - P^i_y)/P^i_h \\ t^w_{ij} &= \log(G^j_w/P^i_w) \\ t^h_{ij} &= \log(G^j_h/P^i_h) \end{aligned}$$
$l$ is an input in the format of a bbox transformation; its definition is $l = l(\hat G, P)$, and the final localization (bbox) prediction can be obtained by $\hat G = l^{-1}(l, P)$
$$\begin{aligned} l^x_i &= (\hat G^i_x - P^i_x)/P^i_w \\ l^y_i &= (\hat G^i_y - P^i_y)/P^i_h \\ l^w_i &= \log(\hat G^i_w/P^i_w) \\ l^h_i &= \log(\hat G^i_h/P^i_h) \end{aligned}$$
smooth L1 loss:
$$smooth_{L1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1, \\ |x| - 0.5 & \text{otherwise}. \end{cases}$$
Compared with L2 loss, it is less sensitive to outliers and easier to train.
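
As a tiny sketch of the same function:

```cpp
#include <cmath>

// Quadratic near zero, linear for |x| >= 1, so large localization errors
// contribute bounded gradients.
float SmoothL1(float x) {
  float ax = std::fabs(x);
  return ax < 1.f ? 0.5f * x * x : ax - 0.5f;
}
```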

Score function:

$s^i_k$ is the predicted score of category k for the i-th input (here the input is a default box), $k\in\{0,1,\dots,K\}$, where $k=0$ is background.
$$s^i_k = f(x^i, w_k)$$
Softmax function:

$c^i_k$ is the confidence prediction of category k for the i-th input.
$$c^i_k = P(Y=k \mid X=x^i) = \frac{e^{s^i_k}}{\sum_{k'} e^{s^i_{k'}}}$$
General Softmax loss:
$$L^i = -\ln c^i_{y^i} = -\ln P(Y=y^i \mid X=x^i)$$
Confidence loss:

The first difference from the classic softmax loss is the matching of default boxes and gt boxes.

The second difference is the special background class (category): if a default box does not match any gt box, it is a negative and contributes $c^i_0$ to the loss.
$$\begin{aligned} L_{conf}(I,c) &= -\sum_{i\in Pos}^{N_{Pos}} \sum_j I_{ij}\ln c^i_j - \sum_{i\in Neg}^{N_{Neg}} \ln c^i_0 \\ &= -\sum_{i\in Pos}^{N_{Pos}} \ln c^i_{G^i_c} - \sum_{i\in Neg}^{N_{Neg}} \ln c^i_0 \end{aligned}$$
$G^i_c$: the category of the ground truth box matched to the i-th default box.

$Neg$: the set of default boxes not matched to any gt box
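
Putting the pieces together, a minimal sketch of $L_{conf}$ (hypothetical helper; the actual layer computes this through its internal confidence loss layer, as noted in the MultiBox comment above):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// scores[i] holds the K+1 raw scores s^i_k of prior i; label[i] is the
// category of the matched gt (G^i_c), or -1 if unmatched. Positives add
// -ln(c_label); the mined negatives add -ln(c_0) for background.
float ConfLoss(const std::vector<std::vector<float>>& scores,
               const std::vector<int>& label,
               const std::vector<int>& hard_negatives) {
  auto nll = [&](int i, int k) {  // -ln softmax_k(scores[i]), stabilized
    float mx = *std::max_element(scores[i].begin(), scores[i].end());
    float denom = 0.f;
    for (float s : scores[i]) denom += std::exp(s - mx);
    return -(scores[i][k] - mx - std::log(denom));
  };
  float loss = 0.f;
  for (int i = 0; i < static_cast<int>(label.size()); ++i)
    if (label[i] >= 0) loss += nll(i, label[i]);   // positives: matched class
  for (int i : hard_negatives) loss += nll(i, 0);  // negatives: background
  return loss;  // normalized by N_pos + N_neg in the multi-box loss above
}
```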

Miscellaneous

Detection of small objects can be improved with data augmentation (the code already implements the paper's handling of small objects) or with larger input images; data augmentation helps a lot.
