YOLO anchor boxes and the loss function

dguochuan

Last modified 2023-02-14 17:55:09

Tags: YOLO, Python, deep learning

First published 2023-02-11 19:58:10

Article link: https://blog.csdn.net/qq_27172615/article/details/128985650

Test code

from pathlib import Path

import cv2
import numpy as np
import torch
from torch import nn

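# FocalLoss and bbox_iou come from the YOLOv5 code base (utils/loss.py, utils/metrics.py);
# run this script from the root of that repository
from utils.loss import FocalLoss
from utils.metrics import bbox_iou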


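def smooth_BCE(eps=0.1):
    # Label smoothing: return the positive and negative BCE targets
    return 1.0 - 0.5 * eps, 0.5 * eps
# Build candidate boxes

# Test data: one image and its label file (class x y w h, normalized)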
mt = 16
# targets = torch.randn((mt, 5))
# ims = torch.arange(2).repeat((int(mt / 2)), 1).view(mt, 1)
# targets = torch.cat((ims, targets), 1)
ims = cv2.imread(r'D:\PycharmProjects\swallow\data\data\images\train2017\000000000009.jpg')
path = Path(r'D:\PycharmProjects\swallow\data\data\labels\train2017\000000000009.txt')
with open(path) as f:
    lb = [x.split() for x in f.read().strip().splitlines() if len(x)]
    lb = np.array(lb, dtype=np.float32)

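lab = torch.from_numpy(lb)
i = torch.zeros(len(lab))
# Prepend the image index (all 0, a single image): rows become (image, class, x, y, w, h)
lab = torch.cat((i.view(-1, 1), lab), dim=1)

# YOLOv5 default anchors in pixels for P3/P4/P5, divided by the strides 8/16/32 into grid units
anchors = torch.tensor([[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]])
anchors = anchors.reshape(3, 3, 2) / torch.tensor([8, 16, 32]).view(3, 1, 1)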

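na, nt = 3, lab.shape[0]  # anchors per level, number of targets
targets = lab
# 3 anchors per ground-truth box: (nt, 6) -> (3, nt, 6)
targets = targets.repeat(3, 1, 1)
# Append the anchor index to targets: (3, nt, 6) -> (3, nt, 7)
ai = torch.arange(na).float().view(-1, 1).repeat(1, nt)
targets = torch.cat((targets, ai[..., None]), 2)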

# Outputs
tcls, tbox, indices, anch = [], [], [], []

gain = torch.ones(7)

g = 0.5  # bias
off = torch.tensor(
    [
        [0, 0],
        [1, 0],
        [0, 1],
        [-1, 0],
        [0, -1],  # j,k,l,m
        # [1, 1], [1, -1], [-1, 1], [-1, -1],  # jk,jm,lk,lm
    ]).float() * g  # offsets

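# Three anchor groups (large, medium, small); only level 0 (P3, stride 8) is simulated here
shape = (10, 3, 80, 80, 85)  # simulated P3 output: (batch, anchors, ny, nx, 5 + 80 classes)
for i in range(1):
    # Map the normalized targets onto the feature map: 80 x 80 for a 640 x 640 input at stride 8.
    # This must match `shape`, otherwise targets and anchors (grid units of stride 8) disagree
    gain[2:6] = torch.tensor(shape)[[3, 2, 3, 2]]  # xywh gain
    t = targets * gain  # shape (3, nt, 7)
    # wh ratio between every target and every anchor, both in grid units
    r = t[..., 4:6] / anchors[i].reshape(3, 1, 2)
    # Keep pairs whose w and h ratios are both below 4 (hyp['anchor_t'] in YOLOv5)
    j = torch.max(r, 1 / r).max(2)[0] < 4
    t = t[j]  # filter
    print(t.shape)

    # Offsets: also assign each target to the neighbouring cells nearest its center
    gxy = t[:, 2:4]  # grid xy
    gxi = gain[[2, 3]] - gxy  # inverse
    j, k = ((gxy % 1 < g) & (gxy > 1)).T
    l, m = ((gxi % 1 < g) & (gxi > 1)).T
    j = torch.stack((torch.ones_like(j), j, k, l, m))
    t = t.repeat((5, 1, 1))[j]
    offsets = (torch.zeros_like(gxy)[None] + off[:, None])[j]

print(t.shape)
# Split the 7 columns: (image, class), grid xy, grid wh, anchor index
bc, gxy, gwh, a = t.chunk(4, 1)
gij = (gxy - offsets).long()
gi, gj = gij.T  # grid indices
a, (b, c) = a.long().view(-1), bc.long().T  # anchors, image, class
# Append
indices.append((b, a, gj.clamp_(0, shape[2] - 1), gi.clamp_(0, shape[3] - 1)))  # image, anchor, grid
tbox.append(torch.cat((gxy - gij, gwh), 1))  # box
anch.append(anchors[0][a])  # anchors
tcls.append(c)  # class

# Gray 640 x 640 canvas with the grid lines of the feature map (8-pixel cells)
stride = 640 // shape[3]
ims = np.full((640, 640, 3), 114, dtype=np.uint8)
ims[::stride] = 254
ims[:, ::stride] = 254
# before: ground-truth box centers
for p in lb:
    cv2.circle(ims, (int(p[1] * 640), int(p[2] * 640)), 4, color=(0, 255, 0))
cv2.imshow('before', ims)
cv2.waitKey(0)  # 0 means wait indefinitely
cv2.destroyAllWindows()
# after: top-left corner of every grid cell assigned to a target
xy = (gij * stride).numpy()
for p1 in xy:
    cv2.circle(ims, (int(p1[0]), int(p1[1])), 4, color=(0, 255, 0))
cv2.imshow('after', ims)
cv2.waitKey(0)  # 0 means wait indefinitely
cv2.destroyAllWindows()  # release all windows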

# Simulated model output for P3: (batch, anchors, ny, nx, 5 + 80 classes)
data = torch.randn((10, 3, 80, 80, 85))

b, a, gj, gi = indices[0]  # image, anchor, gridy, gridx
tobj = torch.zeros(data.shape[:4])  # target obj
print(tobj.shape)
n = b.shape[0]  # number of positive samples (candidate boxes)
lcls = torch.zeros(1)  # class loss
lbox = torch.zeros(1)  # box loss
lobj = torch.zeros(1)  # object loss
if n:
    # xywh, objectness and class predictions at the positive cells only
    pxy, pwh, _, pcls = data[b, a, gj, gi].split((2, 2, 1, 80), 1)
    # Predicted offsets: xy relative to the cell, wh as a multiple of the anchor
    pxy = pxy.sigmoid() * 2 - 0.5
    pwh = (pwh.sigmoid() * 2) ** 2 * anch[0]
    pbox = torch.cat((pxy, pwh), 1)  # predicted box
    iou = bbox_iou(pbox, tbox[0], CIoU=True).squeeze()  # iou(prediction, target)
    lbox += (1.0 - iou).mean()  # iou loss

    iou = iou.detach().clamp(0).type(tobj.dtype)
    # Objectness target of the positive cells is their IoU
    tobj[b, a, gj, gi] = iou  # iou ratio
    # Classification loss
    # Positive/negative label values from label smoothing, see
    # https://blog.csdn.net/qq_38253797/article/details/116228065 (smooth_BCE)
    cp, cn = smooth_BCE()
    t = torch.full_like(pcls, cn)  # target

    # Set the positive class
    t[range(n), tcls[0]] = cp
    # Focal loss, see https://zhuanlan.zhihu.com/p/266023273
    BCEcls = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([1.]))
    BCEcls = FocalLoss(BCEcls, 1.5)
    lcls += BCEcls(pcls, t)  # BCE

# Objectness loss covers every cell (positives and background), so it sits outside `if n`
BCEobj = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([1.]))
obji = BCEobj(data[..., 4], tobj)
balance = {3: [4.0, 1.0, 0.4]}.get(3, [4.0, 1.0, 0.25, 0.06, 0.02])  # 3 detection layers
lobj += obji * balance[0]  # obj loss
print((lbox + lobj + lcls) * 10)  # YOLOv5 scales the summed loss by the batch size (10 here)

REF

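Why anchors (priors) are needed:

Object detection | What anchors (prior boxes) are for - Zhihu

That article's explanation is that anchors are chosen empirically.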

smoothBCE

[trick 1] Label Smoothing: a way to handle mislabeled samples in classification problems (blog of 满船清梦压星河HK, CSDN)

FocalLoss

A plain-language explanation of focal loss - Zhihu

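Summary:

Building the targets

tcls: class labels

tbox: box targets (xywh)

indices: index data

image: image index within the batch. For a batch of 3 images the indices are [0, 1, 2]; if 4 labels survive filtering, their image indices might be [0, 0, 1, 1].

anchor: anchor index. For example, if 4 labels survive filtering, their anchor indices might be [1, 0, 2, 1].

grid: grid-cell index. If the feature map is divided into an 8 x 8 grid and 4 labels survive filtering, their cells might be [[3, 4], [3, 5], [4, 6], [5, 5]].

It helps to treat this as a key-value structure, with image as the key and anchor and grid as the value, and to picture it on the feature map. Overall, these arrays describe the filtered labels: each is numbered from 0, and image, anchor and grid correspond one to one, so the k-th entry of each array belongs to the k-th label.

anch: the anchor width and height (in grid units) matched to each filtered label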
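To make the key-value picture concrete, here is a toy sketch (made-up shapes, and the example index values from the summary) showing how the four index arrays pick one prediction vector per filtered label out of a (batch, anchors, ny, nx, outputs) tensor through advanced indexing. It assumes each grid entry is stored as (x, y), as YOLOv5 does with gij.

```python
import torch

# Toy prediction tensor: batch 3, 3 anchors, 8 x 8 grid, 85 outputs (4 box + 1 obj + 80 classes)
pred = torch.randn(3, 3, 8, 8, 85)

# Four filtered labels, using the example values from the summary
b = torch.tensor([0, 0, 1, 1])                         # image index
a = torch.tensor([1, 0, 2, 1])                         # anchor index
gij = torch.tensor([[3, 4], [3, 5], [4, 6], [5, 5]])   # grid cells, assumed to be (x, y)
gi, gj = gij.T                                         # column (x) and row (y)

# The k-th entry of each index array describes the k-th label
selected = pred[b, a, gj, gi]
print(selected.shape)  # torch.Size([4, 85])

# Same vector as indexing the first label by hand
assert torch.equal(selected[0], pred[0, 1, 4, 3])
```

This is exactly the `data[b, a, gj, gi]` lookup in the test code: every loss term afterwards works on these n rows only.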

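Workflow:

1. Each ground-truth label is paired with all three anchors of the level:

# Anchor indices: with 9 ground-truth labels the shape is (3, 9), e.g. [[0, 0, 0, ...], [1, 1, 1, ...], [2, 2, 2, ...]]
ai = torch.arange(na, device=self.device).float().view(na, 1).repeat(1, nt)
# Attach the anchor index to targets.
# If the original targets are (9, 6), pairing each with 3 anchors gives (3, 9, 6);
# appending the anchor index gives (3, 9, 7)
targets = torch.cat((targets.repeat(na, 1, 1), ai[..., None]), 2)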
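A toy run of step 1 with two made-up targets, using the same calls as the excerpt minus the device argument, shows the shape growing from (nt, 6) to (3, nt, 7): every target appears once per anchor, tagged with that anchor's index in the last column.

```python
import torch

na = 3  # anchors per detection layer
# Two made-up targets in the (image, class, x, y, w, h) format, normalized coordinates
targets = torch.tensor([[0, 1, 0.50, 0.50, 0.20, 0.30],
                        [0, 7, 0.25, 0.75, 0.10, 0.10]])
nt = targets.shape[0]

# Anchor index of every (anchor, target) pair: shape (na, nt)
ai = torch.arange(na).float().view(na, 1).repeat(1, nt)
# Repeat the targets once per anchor and append the anchor index: (nt, 6) -> (na, nt, 7)
targets = torch.cat((targets.repeat(na, 1, 1), ai[..., None]), 2)

print(targets.shape)  # torch.Size([3, 2, 7])
print(targets[2, 1])  # the second target paired with anchor 2
```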

2. Map xywh onto the feature map:

# targets are normalized (0-1)
# For an 8 x 8 feature map, xywh becomes (x * 8, y * 8, w * 8, h * 8)
gain[2:6] = torch.tensor(shape)[[3, 2, 3, 2]]  # xywh gain
t = targets * gain  # shape(3,n,7)
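A worked example of step 2 on a made-up 8 x 8 feature map: the normalized center (0.5, 0.375) lands at (4, 3) in grid units and a box 0.25 x 0.125 of the image spans 2 x 1 cells. The class and anchor-index columns are untouched because their gain stays 1.

```python
import torch

# Toy output shape of one detection level: (batch, anchors, ny, nx, outputs)
shape = (1, 3, 8, 8, 85)
gain = torch.ones(7)
gain[2:6] = torch.tensor(shape)[[3, 2, 3, 2]]  # nx, ny, nx, ny

# One target (image, class, x, y, w, h, anchor) with normalized xywh, shape (1, 1, 7)
targets = torch.tensor([[[0.0, 1.0, 0.5, 0.375, 0.25, 0.125, 0.0]]])
t = targets * gain

print(t[0, 0, 2:6])  # centre (4, 3) in grid units, size 2 x 1 cells
```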

3. Compute the width and height ratios, and use the anchor priors to filter the labels:

# wh ratio between each ground-truth box and each anchor
r = t[..., 4:6] / anchors[:, None]  # wh ratio
# Keep the qualifying candidates
j = torch.max(r, 1 / r).max(2)[0] < self.hyp['anchor_t']  # compare
# j = wh_iou(anchors, t[:, 4:6]) > model.hyp['iou_t']  # iou(3,n)=wh_iou(anchors(3,2), gwh(n,2))
t = t[j]  # filter
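A worked example of step 3, using the P3 anchors from the test code and two made-up targets (wh already in grid units): a 2 x 2.5 box is within a factor of 4 of all three anchors, while a very wide 12 x 1 box only fits the widest anchor (33 x 23 pixels), so it keeps a single target-anchor pair. The threshold 4.0 assumes YOLOv5's default `anchor_t`.

```python
import torch

# P3 anchors in grid units (pixels divided by stride 8)
anchors = torch.tensor([[10, 13], [16, 30], [33, 23]]) / 8
# Two made-up targets' wh in grid units, already repeated per anchor: shape (3, 2, 2)
gwh = torch.tensor([[2.0, 2.5], [12.0, 1.0]]).repeat(3, 1, 1)

r = gwh / anchors[:, None]               # wh ratio, shape (3, 2, 2)
worst = torch.max(r, 1 / r).max(2)[0]    # worse of the w and h ratios, shape (3, 2)
j = worst < 4.0                          # anchor_t = 4.0
print(j)  # rows: anchors, columns: targets
```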

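4. Offsets, which create extra positive samples. xy is the label's center in grid units. If the fractional part of x is below 0.5 (the center is in the left half of its cell), the label is also copied to the cell on the left by shifting x by 0.5; likewise for the right, top and bottom neighbours, except at the feature-map border. Each label therefore ends up in its own cell plus up to two neighbouring cells.

gxy = t[:, 2:4]  # grid xy
gxi = gain[[2, 3]] - gxy  # inverse
j, k = ((gxy % 1 < g) & (gxy > 1)).T
l, m = ((gxi % 1 < g) & (gxi > 1)).T
j = torch.stack((torch.ones_like(j), j, k, l, m))
t = t.repeat((5, 1, 1))[j]
offsets = (torch.zeros_like(gxy)[None] + off[:, None])[j]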
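Tracing step 4 for one made-up target on a toy 8 x 8 grid, with the same `off` table as the test code: the center (3.3, 4.7) sits in the left half and the lower half of cell (3, 4), so it is also assigned to the left neighbour (2, 4) and the neighbour below (3, 5), giving three positive cells instead of one.

```python
import torch

g = 0.5  # bias
off = torch.tensor([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1]]).float() * g
gain_xy = torch.tensor([8.0, 8.0])       # toy 8 x 8 feature map (nx, ny)

# One target centre in grid units
gxy = torch.tensor([[3.3, 4.7]])
gxi = gain_xy - gxy                      # distance from the right/bottom edge of the map

j, k = ((gxy % 1 < g) & (gxy > 1)).T     # near the left / top edge of its cell
l, m = ((gxi % 1 < g) & (gxi > 1)).T     # near the right / bottom edge of its cell
mask = torch.stack((torch.ones_like(j), j, k, l, m))  # shape (5, n)

offsets = (torch.zeros_like(gxy)[None] + off[:, None])[mask]
cells = (gxy.repeat(5, 1, 1)[mask] - offsets).long()
print(cells.tolist())  # [[3, 4], [2, 4], [3, 5]]
```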

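5. Box (x, y, w, h) prediction. What is an anchor (prior), and why is it needed? The following is my own understanding.

An anchor is a prior: a rough estimate, made in advance from experience or by clustering the labels, of the widths and heights objects are likely to have. The model then only has to fine-tune the box on top of that prior.

Why are anchors needed? YOLO borrowed them from the R-CNN family (the anchor boxes of Faster R-CNN, adopted in YOLOv2) and found that recall improved considerably. I think it was essentially an experiment that turned out to work.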
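The "clustering" mentioned above can be sketched as plain k-means over label widths and heights. This is a simplified illustration under my own assumptions (Euclidean distance, deterministic initialization, made-up label sizes); `kmeans_anchors` is a hypothetical helper, YOLOv5's own autoanchor routine is more involved, and YOLOv2 used 1 - IoU rather than Euclidean distance.

```python
import numpy as np


def kmeans_anchors(wh, k, iters=20):
    """Plain k-means on (w, h) pairs; returns k anchors sorted by area."""
    # Deterministic init: pick k boxes spread evenly through the area-sorted list
    order = np.argsort(wh.prod(1))
    centers = wh[order[np.linspace(0, len(wh) - 1, k).astype(int)]].astype(float)
    for _ in range(iters):
        # Assign every box to the nearest center (Euclidean distance in wh space)
        d = ((wh[:, None, :] - centers[None]) ** 2).sum(-1)
        assign = d.argmin(1)
        # Move each center to the mean of its boxes (keep it if the cluster is empty)
        for c in range(k):
            if np.any(assign == c):
                centers[c] = wh[assign == c].mean(0)
    return centers[np.argsort(centers.prod(1))]


# Made-up label sizes in pixels: three loose groups (small, medium, large)
wh = np.array([[10, 14], [12, 12], [14, 10],
               [60, 40], [64, 48], [56, 44],
               [200, 300], [220, 280], [180, 320]], dtype=float)
print(kmeans_anchors(wh, 3))
```

Each resulting center is one anchor; the model then only predicts how much to stretch it.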

# pxy is the raw model output.
# Because of the sigmoid, the model only predicts an offset relative to the grid cell, in (-0.5, 1.5)
pxy = pxy.sigmoid() * 2 - 0.5
# wh is multiplied by the anchor: the predicted wh is an adjustment of the anchor box by a factor in (0, 4)
pwh = (pwh.sigmoid() * 2) ** 2 * anchors[i]
pbox = torch.cat((pxy, pwh), 1)  # predicted box
iou = bbox_iou(pbox, tbox[i], CIoU=True).squeeze()  # iou(prediction, target)
lbox += (1.0 - iou).mean()  # iou loss
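Plugging very negative, zero and very positive logits into the two decoding lines of step 5 shows the ranges they allow. xy can range from half a cell before the cell to half a cell past it, which appears to be what lets the neighbour cells added in step 4 still reach the object center; wh can scale the anchor by a factor between 0 and 4, which lines up with the ratio threshold of 4 in step 3. The anchor and logits here are made up.

```python
import torch

anchor = torch.tensor([1.25, 1.625])           # one P3 anchor in grid units (10 x 13 px / stride 8)
raw = torch.tensor([[-20.0, -20.0, -20.0, -20.0],  # very negative logits
                    [0.0, 0.0, 0.0, 0.0],          # zero logits
                    [20.0, 20.0, 20.0, 20.0]])     # very positive logits

pxy = raw[:, :2].sigmoid() * 2 - 0.5           # offset from the cell corner, in (-0.5, 1.5)
pwh = (raw[:, 2:].sigmoid() * 2) ** 2 * anchor  # between 0 and 4 times the anchor

print(pxy)  # roughly -0.5, 0.5 and 1.5 per row
print(pwh)  # roughly 0, 1 and 4 times the anchor per row
```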

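6. Class prediction. Why use smooth BCE and focal loss?

My understanding: the labels are produced by us from experience, so we cannot claim that all of them are correct. The remedy is to soften the one-hot encoding: for [0, 1], penalize the 1 so the labels become [0.05, 0.95], meaning the sample is positive with probability 0.95, which is more reasonable. The catch is that this produces a large number of (soft) negative labels.

Why BCE and focal loss?

BCE loss: -y*log(p) - (1 - y)*log(1 - p). Negative samples are bound to dominate; the remedy is to add a weighting factor (see the articles linked above for details).

CrossEntropyLoss: -y*log(p), summed over the classes. With hard labels such as [0, 1] this is simple, because the terms with y = 0 vanish. Once the 0 becomes 0.05, those terms no longer vanish and are harder to control. With BCE you can add a weighting factor; it is less clear how to do that here.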
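The weighting factor for BCE can be sketched as a focal loss on top of element-wise BCE. This is a minimal standalone version modeled on the formulation I understand YOLOv5's `FocalLoss` wrapper to use (gamma = 1.5 as in the test code, alpha = 0.25); `focal_bce` is a hypothetical name. Easy, well-classified elements get a small (1 - p_t)^gamma factor, so the flood of easy negatives contributes much less.

```python
import torch
import torch.nn.functional as F


def focal_bce(logits, targets, gamma=1.5, alpha=0.25):
    """BCE with a focal modulating factor (hypothetical helper, mean reduction)."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
    p = torch.sigmoid(logits)
    p_t = targets * p + (1 - targets) * (1 - p)              # probability of the correct side
    alpha_t = targets * alpha + (1 - targets) * (1 - alpha)  # class-balance weight
    return (bce * alpha_t * (1 - p_t) ** gamma).mean()


logits = torch.tensor([[-4.0, 3.0],   # easy negative, confident positive
                       [0.0, -1.0]])  # uncertain negative, missed positive
targets = torch.tensor([[0.0, 1.0],
                        [0.0, 1.0]])
plain = F.binary_cross_entropy_with_logits(logits, targets)
focal = focal_bce(logits, targets)
print(plain.item(), focal.item())  # focal is smaller: easy elements are down-weighted
```

With gamma = 0 and alpha = 0.5 the modulating factor disappears and the result is exactly half the plain BCE, which is a quick sanity check on the formula.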

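# Build the class targets. Only the labelled cells of the feature map matter: pcls holds the
# predictions at the specific grid cells selected by the filtered labels.
# cn is the negative label value (see smooth_BCE)
t = torch.full_like(pcls, self.cn, device=self.device)  # targets
# Set the positive labels
t[range(n), tcls[i]] = self.cp
lcls += self.BCEcls(pcls, t)  # BCE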
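A toy version of the class-target construction, with made-up sizes (3 selected cells, 4 classes), shows what smooth_BCE produces: every entry starts at cn = 0.05 and only the true class of each row is raised to cp = 0.95.

```python
import torch


def smooth_BCE(eps=0.1):
    # Same as in the test code: positive and negative BCE targets
    return 1.0 - 0.5 * eps, 0.5 * eps


cp, cn = smooth_BCE()           # 0.95, 0.05
n, nc = 3, 4                    # 3 positive samples, 4 classes (toy sizes)
pcls = torch.randn(n, nc)       # class logits at the selected cells
tcls = torch.tensor([2, 0, 3])  # ground-truth class of each sample

t = torch.full_like(pcls, cn)   # every class starts as a soft negative
t[range(n), tcls] = cp          # one soft positive per row
print(t)
```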

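7. Objectness (confidence) prediction. Why is objectness needed?

The losses so far only look at the ground-truth labels and do nothing about the background. Most of a detection image is background; if it were ignored, the model would have no way to handle it. So objectness is trained to go to 0 for background and toward the IoU for objects.

From the feature-map point of view: the feature map comes from repeated convolutions. The two previous losses only look at the real objects, that is, the grid cells containing ground-truth labels, and stop there. The large amount of background should also be examined and learned as background. That is what the objectness loss does.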

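# Targets for every grid cell: ground-truth cells and background
tobj = torch.zeros(pi.shape[:4], dtype=pi.dtype, device=self.device)  # target obj
# The objectness target is computed from the IoU, and the IoU depends on the predicted wh,
# so objectness can only be accurate when the anchor-based box prediction is accurate
iou = bbox_iou(pbox, tbox[i], CIoU=True).squeeze()
tobj[b, a, gj, gi] = iou.detach().clamp(0).type(tobj.dtype)  # iou ratio, detached so the target carries no gradient
obji = self.BCEobj(pi[..., 4], tobj)
# balance is an empirical value, presumably tuned by the authors.
# i = 0, 1, 2 indexes the large, medium and small feature maps.
# My guess: we only care about the total loss. The large feature map is easier to train, so its loss
# approaches 0 more readily and barely shows up in the total, which is then dominated by the small
# feature map with its larger loss. So the large feature map is multiplied by a larger number.
self.balance = {3: [4.0, 1.0, 0.4]}.get(m.nl, [4.0, 1.0, 0.25, 0.06, 0.02])  # P3-P7
lobj += obji * self.balance[i]  # obj loss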
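A toy illustration of the objectness target and of how balance enters the sum, under made-up assumptions: every logit is -5 (confident background) and each level has one positive cell with IoU 0.8. Because BCEobj averages over every cell, the single object is diluted more on the 80 x 80 map than on the 20 x 20 map, so the unweighted P3 loss comes out smallest. This is consistent with the guess above, but it does not establish that it is the authors' actual reason.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()       # mean over all cells, like BCEobj
balance = [4.0, 1.0, 0.4]          # P3, P4, P5 weights from the excerpt
sizes = [80, 40, 20]               # grid sizes for a 640 x 640 input (strides 8, 16, 32)

lobj = torch.zeros(1)
per_level = []
for i, s in enumerate(sizes):
    # Toy objectness logits: every cell confidently predicts "background"
    pi_obj = torch.full((1, 3, s, s), -5.0)
    # Targets: 0 for background, the (made-up) IoU 0.8 for the one cell holding an object
    tobj = torch.zeros_like(pi_obj)
    tobj[0, 0, s // 2, s // 2] = 0.8
    obji = bce(pi_obj, tobj)
    per_level.append(obji.item())
    lobj += obji * balance[i]
    print(f'P{i + 3}: {tobj.numel()} cells, unweighted {obji.item():.4f}, weighted {obji.item() * balance[i]:.4f}')

print(lobj)
```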

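Overall, deep learning is a matter of describing what it sees in the picture. Unlike a person, it cannot really recognize an object; it can only guess: if this object is a cat, roughly what are its width and height?

The other key idea is grid-based prediction on feature maps. For example, look at the first grid cell: is it an object? If so, estimate its width and height. The cell position (i, j) on the feature map is fixed, so for xy the model only predicts an offset. There are also three feature maps (large, medium, small), so the grids cover objects of different sizes well. In short, the model scans every grid cell: is it background, or is it an object? If it is an object, roughly where is it?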
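The grid-scanning idea can be sketched by decoding one cell's raw outputs into an image-space box. This assumes the same decoding as the training loss in step 5 (xy offset added to the cell index, wh as a multiple of the anchor), then multiplies by the stride to get pixels; `decode_cell` and all the numbers are hypothetical.

```python
import torch


def decode_cell(raw, gi, gj, anchor_wh, stride):
    """raw: (5 + nc,) outputs of one anchor in cell (gi, gj); anchor_wh in grid units."""
    xy = (raw[:2].sigmoid() * 2 - 0.5 + torch.tensor([gi, gj])) * stride  # centre in pixels
    wh = (raw[2:4].sigmoid() * 2) ** 2 * anchor_wh * stride                 # size in pixels
    obj = raw[4].sigmoid()                 # objectness: object or background?
    cls = raw[5:].sigmoid().argmax()       # most likely class
    return xy, wh, obj, cls


raw = torch.zeros(85)  # zero logits: centre offset 0.5, scale 1 x anchor, objectness 0.5
raw[5 + 16] = 3.0      # make class 16 the most likely
xy, wh, obj, cls = decode_cell(raw, gi=10, gj=20, anchor_wh=torch.tensor([1.25, 1.625]), stride=8)
print(xy, wh, obj, cls)  # centre (84, 164) px, size 10 x 13 px, objectness 0.5, class 16
```

Running this for every cell, anchor and feature map is the "scan every grid cell" described above; cells whose objectness is low are treated as background.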