ssd&yolo&fcos

lyyiangang

Last modified: 2023-10-10 12:12:58


First published: 2022-01-11 18:10:49

Article link: https://blog.csdn.net/lyyiangang/article/details/122417633


ssd

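Initialization
Pre-define the prior boxes.
Training (for each sample)
Run label assign (positive/negative sample selection) on those prior boxes, encode each gt relative to its prior box to get gt', and compute the loss.
The full pipeline is:
prior box generation ---------> assign prior boxes as positive/negative (label assign) ------------> compute loss
Each step is described in detail below.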
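Generating prior boxes
The image is downscaled repeatedly, and priors are generated on each downscaled feature map. Each prior is [x, y, w, h].
x and y are normalized by the feature map width and lie in [0, 1]. w and h are min_sizes/image_size and normally also lie in [0, 1], where min_size is the preset box size and image_size is the width of the image after resizing to 300*300; both image_size and min_size are measured in pixels of the input image. The anchor boxes at any given feature map location are fixed. The boxes predicted by the net are also in unit coordinates and must be multiplied by the image width and height at the end. In the code, the prior boxes are generated when the dataset is defined, as a predefined static table.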
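A minimal sketch of how such a static prior table could be built for one feature map. The function name `generate_priors`, the single square prior per cell, and the example values (a 19*19 feature map with min_size 60) are assumptions for illustration, not the configuration of any particular SSD implementation.

```python
import itertools


def generate_priors(feature_map_size, min_size, image_size=300):
    """Return one square prior [cx, cy, w, h] per feature-map cell, in unit coordinates."""
    # w and h are min_size / image_size, so they land in [0, 1].
    w = h = min_size / image_size
    priors = []
    for row, col in itertools.product(range(feature_map_size), repeat=2):
        # The cell center, normalized by the feature map width.
        cx = (col + 0.5) / feature_map_size
        cy = (row + 0.5) / feature_map_size
        priors.append([cx, cy, w, h])
    return priors


# Hypothetical example: a 19*19 feature map with a 60-pixel preset box.
priors = generate_priors(feature_map_size=19, min_size=60)
print(len(priors), priors[0])
```

Because the table depends only on the configuration, it can be computed once and reused for every training image.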

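The prior boxes generated above do not all play the same role: some are used to predict positive samples and others negative samples. A set of rules forces each prior box into one role or the other, and this process is called label assign. Label assign is done dynamically for every training image. The positions of the preset prior boxes never change; what changes is whether a box is assigned to predict a positive or a negative sample.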

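Matching prior boxes that predict positive samples
anchor box matching rules (one gt can match several prior boxes)

Each gt is first matched against all prior boxes, and the prior box with the largest IoU becomes a positive sample (even if that IoU is small, e.g. 0.3).
The gt is then matched against all remaining prior boxes, and every prior box with IoU > 0.5 becomes a positive sample.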

def assign_priors(gt_boxes, gt_labels, corner_form_priors,
                  iou_threshold):
    """Assign ground truth boxes and targets to priors.

    All inputs are torch tensors.

    Args:
        gt_boxes (num_targets, 4): ground truth boxes.
        gt_labels (num_targets): labels of targets.
        corner_form_priors (num_priors, 4): corner form priors.
        iou_threshold: priors whose best IoU is below this become background.
    Returns:
        boxes (num_priors, 4): real values for priors.
        labels (num_priors): labels for priors.
    """
    # iou_of is a pairwise IoU helper assumed to be defined elsewhere (not shown here).
    # size: num_priors x num_targets
    ious = iou_of(gt_boxes.unsqueeze(0), corner_form_priors.unsqueeze(1))
    # size: num_priors
    best_target_per_prior, best_target_per_prior_index = ious.max(1)
    # size: num_targets
    best_prior_per_target, best_prior_per_target_index = ious.max(0)

    # Rule 1: the best prior of every target is forced to match that target.
    for target_index, prior_index in enumerate(best_prior_per_target_index):
        best_target_per_prior_index[prior_index] = target_index
    # 2.0 is used to make sure every target has a prior assigned
    best_target_per_prior.index_fill_(0, best_prior_per_target_index, 2)
    # size: num_priors
    labels = gt_labels[best_target_per_prior_index]
    # Rule 2: everything below the IoU threshold becomes background.
    labels[best_target_per_prior < iou_threshold] = 0  # the background id
    boxes = gt_boxes[best_target_per_prior_index]
    return boxes, labels

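Matching prior boxes that predict negative samples
The positive matching rules above are strict, so only a few anchors end up positive and a huge number are negative. To keep positives and negatives balanced, SSD uses hard negative mining: the losses of the prior boxes that predict background are sorted in descending order, and the top K with the largest loss are kept as training negatives (a large loss means the prior box mistook background for foreground), so that the positive:negative ratio is 1:3.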

import math


def hard_negative_mining(loss, labels, neg_pos_ratio):
    """
    It is used to suppress the presence of a large number of negative predictions.
    It works on image level, not batch level.
    For any example/image, it keeps all the positive predictions and
     cuts the number of negative predictions to make sure the ratio
     between the negative examples and positive examples is no more
     than the given ratio for an image.

    Args:
        loss (N, num_priors): the loss for each example (modified in place).
        labels (N, num_priors): the labels.
        neg_pos_ratio:  the ratio between the negative examples and positive examples.
    """
    pos_mask = labels > 0
    num_pos = pos_mask.long().sum(dim=1, keepdim=True)
    num_neg = num_pos * neg_pos_ratio

    # Positives get -inf so they are never picked as hard negatives.
    loss[pos_mask] = -math.inf
    # Sorting the sort indexes yields the rank of every prior by descending loss.
    _, indexes = loss.sort(dim=1, descending=True)
    _, orders = indexes.sort(dim=1)
    neg_mask = orders < num_neg
    return pos_mask | neg_mask

For the label assign code above, see the link.
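The ranking step in hard negative mining can be hard to read in tensor form. The sketch below redoes it for a single image with plain Python lists; the function name `hard_negative_mask` and the sample values are made up for illustration.

```python
import math


def hard_negative_mask(loss, labels, neg_pos_ratio=3):
    """Per-image mask: keep all positives plus the highest-loss negatives."""
    num_pos = sum(1 for label in labels if label > 0)
    num_neg = num_pos * neg_pos_ratio
    # Positives get -inf so they never rank among the hardest negatives.
    neg_loss = [-math.inf if label > 0 else value
                for value, label in zip(loss, labels)]
    # Indices sorted by descending loss; rank[i] is the position of prior i.
    order = sorted(range(len(loss)), key=lambda i: neg_loss[i], reverse=True)
    rank = [0] * len(loss)
    for position, index in enumerate(order):
        rank[index] = position
    return [label > 0 or rank[i] < num_neg for i, label in enumerate(labels)]


# Hypothetical image with 7 priors, one of them positive (label 1).
loss = [0.1, 2.0, 0.3, 0.9, 0.05, 1.5, 0.2]
labels = [0, 0, 1, 0, 0, 0, 0]
mask = hard_negative_mask(loss, labels)
print(mask)
```

With one positive and a 1:3 ratio, only the three background priors with the largest loss survive next to the positive one.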

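gt box encoding
Given a prior box $d=(d^{cx}, d^{cy}, d^{w}, d^{h})$ , a gt box $b=(b^{cx}, b^{cy}, b^{w}, b^{h})$ can be encoded into the values the network has to predict with the standard SSD formulas (implementations often additionally divide by variance constants):
$\hat{g}^{cx} = (b^{cx} - d^{cx}) / d^{w}$ , $\hat{g}^{cy} = (b^{cy} - d^{cy}) / d^{h}$ , $\hat{g}^{w} = \log(b^{w} / d^{w})$ , $\hat{g}^{h} = \log(b^{h} / d^{h})$
pred box decoding
As noted above, the network predicts encoded values $l=(l^{cx}, l^{cy}, l^{w}, l^{h})$ , so mapping them back onto the image requires decoding, which inverts the encoding:
$b^{cx} = d^{w} l^{cx} + d^{cx}$ , $b^{cy} = d^{h} l^{cy} + d^{cy}$ , $b^{w} = d^{w} \exp(l^{w})$ , $b^{h} = d^{h} \exp(l^{h})$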
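A minimal sketch of the standard SSD box encoding and its inverse, written without the variance scaling that many implementations add. Boxes are center-form lists [cx, cy, w, h]; the function names `encode` and `decode` and the sample boxes are assumptions for illustration.

```python
import math


def encode(gt, prior):
    """Encode a center-form gt box relative to a center-form prior box."""
    return [
        (gt[0] - prior[0]) / prior[2],   # center x offset, in prior widths
        (gt[1] - prior[1]) / prior[3],   # center y offset, in prior heights
        math.log(gt[2] / prior[2]),      # log width ratio
        math.log(gt[3] / prior[3]),      # log height ratio
    ]


def decode(loc, prior):
    """Invert encode(): map predicted offsets back to a center-form box."""
    return [
        loc[0] * prior[2] + prior[0],
        loc[1] * prior[3] + prior[1],
        math.exp(loc[2]) * prior[2],
        math.exp(loc[3]) * prior[3],
    ]


# Hypothetical prior and gt box in unit coordinates.
prior = [0.5, 0.5, 0.2, 0.2]
gt = [0.55, 0.45, 0.3, 0.1]
loc = encode(gt, prior)
restored = decode(loc, prior)
print(loc, restored)
```

Decoding the encoded gt gives back the original box, which is a quick way to check that the two functions are consistent.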
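Computing the loss
The training loss is the sum of the location loss and the classification loss. The location loss is computed for positive samples only; the classification loss is computed for both positive and negative samples.
To compute the location loss, the gt box must first be encoded (the encoded box is called the location), i.e. expressed in the coordinate system of its prior box. Following the SSD paper, it is a smooth L1 loss between the prediction $l$ and the encoded gt $\hat{g}$ :
$L_{loc} = \sum_{i \in Pos} \sum_{m \in \{cx, cy, w, h\}} \mathrm{smooth}_{L1}(l_i^{m} - \hat{g}_i^{m})$

The classification loss is a softmax cross-entropy (again following the SSD paper), where $\hat{c}_i^{p}$ is the softmax probability of class $p$ for prior $i$ , $p_i$ is the label assigned to prior $i$ , and class 0 is background:
$L_{conf} = -\sum_{i \in Pos} \log(\hat{c}_i^{p_i}) - \sum_{i \in Neg} \log(\hat{c}_i^{0})$

Note that SSD adds background to the set of classes to predict.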
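As a rough sketch of how the two terms combine, the code below computes a smooth L1 location loss over positives and a softmax cross-entropy over the priors kept after hard negative mining, normalized by the number of positives as in the SSD paper. It uses plain Python lists; the function names and the toy inputs are assumptions for illustration.

```python
import math


def smooth_l1(x):
    ax = abs(x)
    return 0.5 * x * x if ax < 1 else ax - 0.5


def softmax_cross_entropy(logits, target):
    # Numerically stable log-sum-exp minus the target logit.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[target]


def ssd_loss(pred_locs, pred_logits, gt_locs, labels, mask):
    """mask marks the priors kept for classification (positives + mined negatives)."""
    num_pos = sum(1 for label in labels if label > 0)
    # Location loss: positives only.
    loc_loss = sum(
        smooth_l1(p - g)
        for pl, gl, label in zip(pred_locs, gt_locs, labels) if label > 0
        for p, g in zip(pl, gl)
    )
    # Classification loss: positives and selected negatives; class 0 is background.
    cls_loss = sum(
        softmax_cross_entropy(logits, label)
        for logits, label, keep in zip(pred_logits, labels, mask) if keep
    )
    return (loc_loss + cls_loss) / max(num_pos, 1)


# Toy example: prior 0 is positive (class 1), prior 1 is a kept negative.
total = ssd_loss(
    pred_locs=[[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]],
    pred_logits=[[0.0, 0.0], [0.0, 0.0]],
    gt_locs=[[0.5, 0.0, 0.0, 2.0], [0.0, 0.0, 0.0, 0.0]],
    labels=[1, 0],
    mask=[True, True],
)
print(total)
```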
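Output head
Before concatenation, each SSD output head produces W*H*A*(4+C) values (W*H*A*4 box values and W*H*A*C class scores), where W and H are the feature map size, A is the number of anchors, C is the number of classes, and 4 is the number of box sides. In other words, every feature point tries to predict with A anchors. nanodet is different: its C classes share a single anchor per feature point, giving on the order of W*H*(4+C) values, and which class that anchor box predicts is decided by the C confidence scores.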
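To make the head sizes concrete, the sketch below counts the outputs over all feature maps. The feature map sizes and anchors per cell are the commonly cited SSD300 configuration and 21 classes corresponds to VOC plus background; both are assumptions here, since the article does not list them. The helper name `ssd_head_outputs` is hypothetical.

```python
def ssd_head_outputs(feature_sizes, anchors_per_cell, num_classes):
    """Return (number of priors, box values, class scores) summed over all heads."""
    num_priors = sum(s * s * a for s, a in zip(feature_sizes, anchors_per_cell))
    return num_priors, num_priors * 4, num_priors * num_classes


# Assumed SSD300 layout: six square feature maps with 4 or 6 anchors per cell.
num_priors, num_box_values, num_cls_values = ssd_head_outputs(
    feature_sizes=[38, 19, 10, 5, 3, 1],
    anchors_per_cell=[4, 6, 6, 6, 4, 4],
    num_classes=21,
)
print(num_priors, num_box_values, num_cls_values)
```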

yolo v3

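Decoding the predicted box
For a raw prediction $(t_x, t_y, t_w, t_h)$ , the decoded box $(b_x, b_y, b_w, b_h)$ is:

$b_x = \sigma(t_x) + c_x$ , $b_y = \sigma(t_y) + c_y$ , $b_w = p_w e^{t_w}$ , $b_h = p_h e^{t_h}$
where $(c_x, c_y)$ is the offset of the grid cell and $(p_w, p_h)$ is the anchor box size.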
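A minimal sketch of the standard YOLOv3 box decoding for a single prediction: the center is a sigmoid offset inside the grid cell, and the size scales the anchor exponentially. The function name `decode_yolov3`, the pixel-unit anchor, and the sample values are assumptions for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_yolov3(t, cell, anchor, stride):
    """Decode raw (tx, ty, tw, th) into a pixel-space (bx, by, bw, bh)."""
    tx, ty, tw, th = t
    cx, cy = cell          # grid cell offset, in grid units
    pw, ph = anchor        # anchor size, in pixels
    bx = (sigmoid(tx) + cx) * stride
    by = (sigmoid(ty) + cy) * stride
    bw = pw * math.exp(tw)
    bh = ph * math.exp(th)
    return bx, by, bw, bh


# Hypothetical prediction in cell (3, 4) of a stride-32 feature map.
box = decode_yolov3(t=(0.0, 0.0, 0.0, 0.0), cell=(3, 4), anchor=(116, 90), stride=32)
print(box)
```

An all-zero prediction lands at the cell center with exactly the anchor size, which is why zero is a natural starting point for the regression outputs.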

loss
The loss consists of a bbox loss, a class loss, and an obj (objectness) loss. Positive samples contribute to all three; negative samples contribute only to the obj loss.

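Differences between ssd and yolo v3

ssd has only positive and negative samples (IoU above 0.5 is positive, below 0.5 is negative); yolo v3 adds ignored samples on top of these.
ssd treats background as a separate class and predicts with softmax; yolo v3 has no background class and predicts with sigmoid.
ssd's loss has only a classification term and a bbox regression term; yolo v3 has a class loss, a location regression loss, and an additional objectness loss.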

nanodet

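Unlike the static label assignment above, nanodet plus uses dynamic assignment.
For each image, static assignment matches the preset anchors against the gt to decide which anchors predict positives and which predict negatives.
Dynamic assignment instead computes a matching cost (classification & regression loss) between the decoded pred boxes and the gt, and picks positives and negatives according to that cost.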
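The idea of cost-based dynamic assignment can be sketched as below. This is a deliberately simplified illustration and not nanodet plus's actual assigner: the cost here is just negative log class probability plus (1 - IoU), and each gt takes its `topk` cheapest predictions. All names and values are assumptions.

```python
import math


def iou(a, b):
    """IoU of two corner-form boxes (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def dynamic_assign(pred_boxes, pred_scores, gt_boxes, gt_labels, topk=2):
    """Return, for each prediction, the matched gt index or -1 (negative)."""
    assigned = [-1] * len(pred_boxes)
    best_cost = [math.inf] * len(pred_boxes)
    for g, (gt, label) in enumerate(zip(gt_boxes, gt_labels)):
        costs = []
        for p, (box, scores) in enumerate(zip(pred_boxes, pred_scores)):
            cls_cost = -math.log(max(scores[label], 1e-9))
            reg_cost = 1.0 - iou(box, gt)
            costs.append((cls_cost + reg_cost, p))
        # The topk cheapest predictions become positives for this gt.
        for cost, p in sorted(costs)[:topk]:
            if cost < best_cost[p]:   # a prediction keeps its cheapest gt
                best_cost[p] = cost
                assigned[p] = g
    return assigned


# Toy example: three decoded predictions, one gt of class 0.
assigned = dynamic_assign(
    pred_boxes=[[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]],
    pred_scores=[[0.9], [0.6], [0.9]],
    gt_boxes=[[0, 0, 10, 10]],
    gt_labels=[0],
)
print(assigned)
```

Because the cost depends on the current predictions, the set of positives changes as training progresses, unlike the fixed IoU rules of static assignment.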

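Faster RCNN/SSD/RetinaNet
  Each grid cell has multiple anchors.
  The IoU between an anchor and the GT boxes decides whether it is positive, negative, or ignore.
  For example, in RetinaNet an IoU below 0.4 is negative, above 0.5 is positive, and anything in between is ignore.
YOLO
  Each grid cell has multiple anchors.
  YOLOv2
    First find the grid cell that contains the GT box center.
    Then compute the IoU between the GT box and the anchors in that grid cell.
    The anchor with the largest IoU is positive.
    Compute the IoU between the GT and the proposals predicted in that grid cell.
    Anchors whose proposal has IoU greater than 0.6 are ignore.
    All other anchors are negative.
  YOLOv3
    For each proposal
      Compute its IoU with all GT boxes; if the largest IoU is greater than 0.7, the anchor behind that proposal is ignore.
    For each GT box
      Compute its IoU with all anchors (x and y set to 0, keeping only w and h) and take the anchor with the largest IoU as positive (the grid cell is found from the GT box center).
    All other anchors are negative.
FCOS
  Grid cells have no anchors.
  For each feature map level, every grid cell is mapped back to the original image. If it falls inside a GT box, the grid cell is positive; if it falls inside several GT boxes, it is matched to the GT box with the smallest area.
  All other grid cells are negative.
  There is no ignore.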
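The YOLOv3 positive rule listed above (IoU on w and h only, with x and y set to 0, and the grid cell taken from the GT center) can be sketched as follows. The anchor list is the commonly used YOLOv3 COCO set, included here as an assumption; the helper names are hypothetical.

```python
def wh_iou(wh_a, wh_b):
    """IoU of two boxes that share the same top-left corner (only w, h matter)."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union


def best_anchor(gt_wh, anchors):
    """Index of the anchor whose shape best matches the GT box."""
    ious = [wh_iou(gt_wh, anchor) for anchor in anchors]
    return max(range(len(anchors)), key=ious.__getitem__)


def grid_cell(gt_cx, gt_cy, stride):
    """Grid cell (column, row) that contains the GT box center."""
    return int(gt_cx // stride), int(gt_cy // stride)


# Assumed anchors (pixels), the widely used YOLOv3 COCO set.
ANCHORS = [(10, 13), (16, 30), (33, 23), (30, 61), (62, 45),
           (59, 119), (116, 90), (156, 198), (373, 326)]

anchor_index = best_anchor((30, 60), ANCHORS)
cell = grid_cell(100, 70, stride=16)
print(anchor_index, cell)
```

Only the anchor at `anchor_index` in cell `cell` becomes positive for this GT box; position plays no role in choosing the anchor shape.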

yolov4 & yolov5

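Data augmentation
Both use crop, rotation, flip, hue, exposure, aspect, image occlusion (Random Erase, Cutout, Hide and Seek, Grid Mask, MixUp), CutMix, and Mosaic.
anchor setup
v4: k-means to find the best anchor boxes
v5: adaptive anchor algorithm. When the default anchor boxes turn out not to fit the current dataset, k-means is used to recompute them.
activation function
v4: mish (better accuracy, also better than swish, but more computation)
v5: leaky relu (early releases; later releases use SiLU)
cost function
v4: CIOU
v5: CIOU
head
v4 and v5 are the same here: each head outputs 255 channels ((80 classes + 1 objectness probability + 4 coordinates) * 3 anchors). In v5, however, x, y, w, h all pass through a sigmoid before being converted to the real box.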

grid = make_grid(nx, ny)
y = x.sigmoid()
y[..., 0:2] = (y[..., 0:2] * 2 - 0.5 + grid) * stride[i]  # depends on stride. stride: [8, 16, 32, 64]
y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * anchor_grid[i]  # depends on the anchor box

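[Figure: predicted box (green) and anchor box (red)]
(bx,by,bw,bh) is the predicted box (green), (pw,ph) is the anchor box size, and cx,cy is the grid cell offset of the anchor box (the grid term in the code above).
From the code, in grid units $b_x \in (c_x - 0.5, c_x + 1.5)$ and $b_w \in (0, 4p_w)$ , so the predicted box center can move at most half a cell beyond its grid cell, and the box can be at most 4 times the anchor size.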
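The bounds implied by the v5 decoding code can be checked numerically. The sketch below applies the same formulas to a single prediction with plain floats; the function name `decode_yolov5` and the sample cell, anchor, and stride are assumptions for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_yolov5(t, cell, anchor, stride):
    """Same formulas as the snippet above, for one prediction."""
    sx, sy, sw, sh = (sigmoid(v) for v in t)
    bx = (sx * 2 - 0.5 + cell[0]) * stride   # center offset in (-0.5, 1.5) cells
    by = (sy * 2 - 0.5 + cell[1]) * stride
    bw = (sw * 2) ** 2 * anchor[0]           # scale factor in (0, 4)
    bh = (sh * 2) ** 2 * anchor[1]
    return bx, by, bw, bh


cell, anchor, stride = (3, 4), (10, 13), 8
neutral = decode_yolov5((0.0, 0.0, 0.0, 0.0), cell, anchor, stride)
extreme = decode_yolov5((-20.0, 20.0, -20.0, 20.0), cell, anchor, stride)
print(neutral, extreme)
```

A zero prediction gives the cell center with the anchor size, while saturated outputs approach but never reach the limits of the ranges.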

fcos

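Points on the feature map are mapped back to the original image to get (xs, ys).
Target matching against the gt boxes on the original image is fairly blunt; a point must satisfy both conditions below:

(xs, ys) lies inside the gtbbox rectangle.
The gtbbox size must fall within the box range predicted at that stride (in FCOS this is measured as the largest distance from (xs, ys) to the four sides of the gtbbox). For example, with regress_ranges=((-1, 32), (32, 64), (64, 120)), points (xs, ys) on the last feature map only match targets in the (64, 120) range.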
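The two conditions can be sketched for a single point as below, using the largest distance to the four box sides as the size measure and the smallest-area rule for ambiguous points. The function name `fcos_assign` and the sample boxes are assumptions for illustration, not code from a specific FCOS implementation.

```python
def fcos_assign(point, gt_boxes, regress_range):
    """Return the index of the gt box assigned to `point`, or -1 if negative."""
    x, y = point
    low, high = regress_range
    best, best_area = -1, float("inf")
    for i, (x1, y1, x2, y2) in enumerate(gt_boxes):
        # Distances from the point to the left, top, right, bottom sides.
        l, t, r, b = x - x1, y - y1, x2 - x, y2 - y
        if min(l, t, r, b) <= 0:
            continue  # condition 1: the point must lie inside the box
        if not (low <= max(l, t, r, b) <= high):
            continue  # condition 2: the box must fit this level's range
        area = (x2 - x1) * (y2 - y1)
        if area < best_area:  # several candidates: keep the smallest box
            best, best_area = i, area
    return best


# Two nested gt boxes; the same point is checked against each level's range.
gt_boxes = [(0, 0, 100, 100), (30, 30, 70, 70)]
regress_ranges = ((-1, 32), (32, 64), (64, 120))
per_level = [fcos_assign((50, 50), gt_boxes, rng) for rng in regress_ranges]
corner_point = fcos_assign((5, 5), gt_boxes, regress_ranges[2])
print(per_level, corner_point)
```

The same image location can therefore be positive for a small box on a fine level and for a large box on a coarser level, or negative on levels whose range fits neither.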

References:
https://zhuanlan.zhihu.com/p/166275032
https://zhuanlan.zhihu.com/p/33544892
https://zhuanlan.zhihu.com/p/76802514
https://zhuanlan.zhihu.com/p/449912627