YOLOv5 Explained (Theory + Code)

binveni

Last modified: 2024-01-07 06:50:52

Tags: YOLO, artificial intelligence, computer vision, deep learning, object detection

First published: 2024-01-07 00:58:18

Article link: https://blog.csdn.net/binveni/article/details/135434536

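YOLOv5 was first released in 2020 and is still being updated. This article uses the v5.0 code. Depending on accuracy and speed requirements, YOLOv5 is available as YOLOv5n (introduced in v6.0), YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x (from v6.0 onward there are also -P variants that add a detection head for small objects). These models share essentially the same structure and differ only in depth_multiple (model depth) and width_multiple (model width), the two parameters that control the overall parameter count. The figure below lists the parameter count, inference time, and mAP on the COCO dataset for the main models.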

[Figure: parameter count, inference time, and COCO mAP of the main YOLOv5 models]
The code used in this article is YOLOv5 v5.0. Download: https://github.com/ultralytics/yolov5/archive/refs/tags/v5.0.zip
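
As a rough illustration of how depth_multiple and width_multiple act on a single layer, the sketch below scales a repeat count and a channel count. The helper names `make_divisible` and `scale_layer`, the rounding of channels up to a multiple of 8, and the multiplier values in the usage lines are assumptions made for illustration; they are not taken from this article.

```python
import math


def make_divisible(x, divisor=8):
    """Round x up to the nearest multiple of divisor (assumed rounding rule)."""
    return math.ceil(x / divisor) * divisor


def scale_layer(repeats, channels, depth_multiple, width_multiple):
    """Return (scaled repeat count, scaled output channels) for one layer."""
    # Depth: only layers that repeat more than once are scaled, never below 1.
    n = max(round(repeats * depth_multiple), 1) if repeats > 1 else repeats
    # Width: scale the channel count, then round up to a multiple of 8.
    c = make_divisible(channels * width_multiple, 8)
    return n, c


# Hypothetical base layer with 9 repeats and 512 output channels.
small = scale_layer(9, 512, depth_multiple=0.33, width_multiple=0.50)
large = scale_layer(9, 512, depth_multiple=1.0, width_multiple=1.0)
print(small, large)
```

With these assumed values the same base layer shrinks to a third of the repeats and half the channels, which is how one configuration file can describe models of very different sizes.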

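YOLOv5 Network Structure

[Figure: overall YOLOv5 v5.0 network structure]

As shown in the figure above, the YOLOv5 v5.0 network consists of three parts: the Backbone (New CSP-Darknet53), the Neck (FPN-PAN), and the Head (YOLOv3 Head). The detailed structure is shown in the figure below:

[Figure: detailed YOLOv5 v5.0 module structure]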

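Focus Structure

Focus can be viewed as a special convolution-style operation that compresses and rearranges the input feature map. It samples the input at every other pixel, which yields four sub-maps, and stacks them along the channel dimension. The result is a 2x downsampled feature map with four times the channels; a convolution (kernel size 1 by default in the class below) then adjusts the channel count. The rearrangement itself adds some computation and memory traffic, but because no pixel is discarded, the downsampled feature map retains all of the information in the input. The intent is to reduce computation and memory use while improving accuracy and speed, and its use in YOLOv5 is taken as evidence of its effectiveness. Some embedded inference frameworks do not support Focus well, however, so from YOLOv5 v6.0 onward it is replaced by a Conv with stride=2. The Focus sampling pattern is shown in the figure:
[Figure: Focus sampling pattern]

Code implementation: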

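import torch
import torch.nn as nn


# Conv is the module defined in the Conv section of this article
class Focus(nn.Module):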
    # Focus wh information into c-space
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, act=True):  # ch_in, ch_out, kernel, stride, padding, groups
        super(Focus, self).__init__()
        self.conv = Conv(c1 * 4, c2, k, s, p, g, act)
        # self.contract = Contract(gain=2)

    def forward(self, x):  # x(b,c,w,h) -> y(b,4c,w/2,h/2)
        return self.conv(torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]], 1))
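
To make the four slices in `forward` concrete, here is a minimal pure-Python sketch that applies the same every-other-pixel sampling to a single 4x4 channel stored as a list of rows. The function name `focus_slices` and the toy input are hypothetical and only serve the illustration.

```python
def focus_slices(img):
    """Split an H x W grid (list of rows, H and W even) into four H/2 x W/2 grids."""
    return [
        [row[0::2] for row in img[0::2]],  # even rows, even cols -> x[..., ::2, ::2]
        [row[0::2] for row in img[1::2]],  # odd rows, even cols  -> x[..., 1::2, ::2]
        [row[1::2] for row in img[0::2]],  # even rows, odd cols  -> x[..., ::2, 1::2]
        [row[1::2] for row in img[1::2]],  # odd rows, odd cols   -> x[..., 1::2, 1::2]
    ]


img = [[r * 4 + c for c in range(4)] for r in range(4)]  # values 0..15
slices = focus_slices(img)

# Every input value appears exactly once across the four slices.
flat = sorted(v for s in slices for row in s for v in row)
print(slices[0], flat == list(range(16)))
```

Each slice is half the height and width of the input, and together they contain every input value exactly once, which is the sense in which the 2x downsampling loses no information.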

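Conv Structure

The Conv module is simple: a convolution followed by BN (batch normalization) and a SiLU activation. To speed up inference on edge devices, and if a small loss in accuracy is acceptable, SiLU can be replaced with ReLU, which the author reports gives a considerable speedup.

Code implementation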

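def autopad(k, p=None):  # kernel, padding
    # Pad to 'same' (helper from models/common.py in the v5.0 code)
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]  # auto-pad
    return p


class Conv(nn.Module):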
    # Standard convolution
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, act=True):  # ch_in, ch_out, kernel, stride, padding, groups
        super(Conv, self).__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p), groups=g, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.SiLU() if act is True else (act if isinstance(act, nn.Module) else nn.Identity())

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))
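
For a feel of what swapping SiLU for ReLU changes numerically, the following sketch evaluates both activations on a few sample points. It uses only the standard definitions SiLU(x) = x * sigmoid(x) and ReLU(x) = max(0, x); the sample values are arbitrary, and the sketch says nothing about the speed or accuracy trade-off itself.

```python
import math


def silu(x):
    """SiLU (also called swish): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def relu(x):
    """ReLU: clamp negative inputs to zero."""
    return max(0.0, x)


samples = [-2.0, -1.0, 0.0, 1.0, 2.0]
table = [(x, round(silu(x), 4), relu(x)) for x in samples]
for row in table:
    print(row)
```

SiLU is smooth and slightly negative for negative inputs, while ReLU needs only a comparison, which is one reason it tends to be cheaper on hardware without a fast exponential.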

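SPP Structure

Early object detectors such as R-CNN use Selective Search to pick candidate regions from the image, extract features from each candidate region with a CNN, classify the regions with an SVM, and refine the bounding-box position with a regression layer to improve localization accuracy. This requires the feature vector output by the CNN to have a fixed dimension, so every candidate region fed to the CNN must have a fixed size. Candidate regions of different sizes therefore have to be resized, cropped, or warped, which loses or distorts image information to some degree and limits recognition accuracy.

[Figure: cropping and warping candidate regions to a fixed input size]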

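SPP-Net addresses these shortcomings. The basic idea is to feed in the whole image, compute a feature map for the whole image once, then locate the candidate regions (region proposals) on that feature map, and finally use a spatial pyramid pooling layer to extract a feature for each region proposal.

[Figure: SPP-Net pipeline]

The spatial pyramid pooling layer divides the feature map into 4x4, 2x2, and 1x1 blocks (spatial bins) and pools over each block, giving 16 + 4 + 1 = 21 bins. With 256 channels this produces a 21x256-dimensional feature vector, so the output of the SPP layer does not change with the input size. Fast R-CNN and Faster R-CNN later adopted a single-level form of this idea, which is known as RoI Pooling.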
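
The fixed output length can be checked with a small pure-Python sketch that max-pools one channel into 4x4, 2x2, and 1x1 bins. The function name `pyramid_pool`, the floor/ceil bin boundaries, and the toy feature maps are assumptions chosen for illustration, not the SPP-Net reference implementation.

```python
import math


def pyramid_pool(fmap, levels=(4, 2, 1)):
    """Max-pool one channel (list of rows) into n x n bins per level; return a flat list."""
    h, w = len(fmap), len(fmap[0])
    out = []
    for n in levels:
        for i in range(n):
            # Row range of bin i: floor of the start, ceil of the end (bins may overlap).
            r0, r1 = (i * h) // n, math.ceil((i + 1) * h / n)
            for j in range(n):
                c0, c1 = (j * w) // n, math.ceil((j + 1) * w / n)
                out.append(max(fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)))
    return out


map_a = [[r * 13 + c for c in range(13)] for r in range(13)]  # 13 x 13 input
map_b = [[r * 10 + c for c in range(10)] for r in range(7)]   # 7 x 10 input
vec_a, vec_b = pyramid_pool(map_a), pyramid_pool(map_b)
print(len(vec_a), len(vec_b))
```

Both inputs yield 21 values per channel even though their spatial sizes differ; repeating this for 256 channels gives the 21x256 vector described above.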

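YOLOv5 has no region proposals, so the problem of proposal size affecting localization does not arise. Even so, SPP's way of aggregating features with pooling kernels of different sizes is still worth borrowing, and the Ultralytics team uses a spatial pyramid pooling (SPP) structure adapted to YOLO. The structure first applies a Conv module, then processes its output with four pooling kernels of sizes 13×13, 9×9, 5×5, and 1×1 (the 1×1 branch is simply the unpooled feature map, and every pool uses stride 1 with padding, so the spatial size is unchanged). It then concatenates the outputs of the different kernels along the channel dimension and feeds the result to a second Conv module that refines the features and produces the output of the SPP module. SPP enlarges the receptive field and fuses features at different scales with almost no increase in parameters or computation, which improves detection and helps the model cope with object detection in complex scenes.
[Figure: SPP module structure]

YOLOv5 v6.0 uses SPPF, an improved version of SPP that replaces SPP's parallel pooling branches with serial ones and runs faster. Chaining two and three 5×5 stride-1 max pools covers the same windows as 9×9 and 13×13 pools, which is why the code comment below describes SPPF as equivalent to SPP(k=(5, 9, 13)); the author has not run an ablation on accuracy.

Code implementation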

SPP

class SPP(nn.Module):
    # Spatial pyramid pooling layer used in YOLOv3-SPP
    def __init__(self, c1, c2, k=(5, 9, 13)):
        super(SPP, self).__init__()
        c_ = c1 // 2  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c_ * (len(k) + 1), c2, 1, 1)
        self.m = nn.ModuleList([nn.MaxPool2d(kernel_size=x, stride=1, padding=x // 2) for x in k])

    def forward(self, x):
        x = self.cv1(x)
        return self.cv2(torch.cat([x] + [m(x) for m in self.m], 1))
      
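# Simplified equivalent: pooling branches only, without the two Conv layers
# (reuses the name SPP for illustration and would shadow the class above)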
class SPP(nn.Module):
    def __init__(self):
        super().__init__()
        self.maxpool1 = nn.MaxPool2d(kernel_size = 5, stride = 1, padding=2)
        self.maxpool2 = nn.MaxPool2d(9, 1, padding=4)
        self.maxpool3 = nn.MaxPool2d(13, 1, padding=6)

    def forward(self, x):
        o1 = self.maxpool1(x)
        o2 = self.maxpool2(x)
        o3 = self.maxpool3(x)
        return torch.cat([x, o1, o2, o3], dim=1)

SPPF

import warnings


class SPPF(nn.Module):
    # Spatial Pyramid Pooling - Fast (SPPF) layer for YOLOv5 by Glenn Jocher
    def __init__(self, c1, c2, k=5):  # equivalent to SPP(k=(5, 9, 13))
        super().__init__()
        c_ = c1 // 2  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c_ * 4, c2, 1, 1)
        self.m = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        with warnings.catch_warnings():
            warnings.simplefilter('ignore')  # suppress torch 1.9.0 max_pool2d() warning
            y1 = self.m(x)
            y2 = self.m(y1)
            return self.cv2(torch.cat((x, y1, y2, self.m(y2)), 1))
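
# Simplified equivalent: pooling branch only, without the two Conv layers
# (reuses the name SPPF for illustration and would shadow the class above)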
class SPPF(nn.Module):
    def __init__(self):
        super().__init__()
        self.maxpool = nn.MaxPool2d(5, 1, padding=2)

    def forward(self, x):
        o1 = self.maxpool(x)
        o2 = self.maxpool(o1)
        o3 = self.maxpool(o2)
        return torch.cat([x, o1, o2, o3], dim=1)
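
The claim in the SPPF code comment, that chained 5-wide pools reproduce the 9-wide and 13-wide pools of SPP, can be checked with a small sketch. It is a one-dimensional pure-Python stand-in for `nn.MaxPool2d` (stride 1, padding k // 2); the function name `max_pool_1d` and the sample data are hypothetical. Truncating the window at the borders gives the same result as padding with negative infinity, which is how PyTorch pads max pooling.

```python
def max_pool_1d(xs, k):
    """Stride-1 max pooling that keeps the input length.

    Border windows are truncated, which is equivalent to -inf padding.
    """
    r = k // 2
    return [max(xs[max(0, i - r): i + r + 1]) for i in range(len(xs))]


xs = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3, 2, 3, 8, 4]

y1 = max_pool_1d(xs, 5)  # one 5-wide pool
y2 = max_pool_1d(y1, 5)  # two chained 5-wide pools
y3 = max_pool_1d(y2, 5)  # three chained 5-wide pools

print(y2 == max_pool_1d(xs, 9), y3 == max_pool_1d(xs, 13))
```

Because 2D max pooling over a square window is separable into row and column passes, the same argument carries over to the 2D case, so SPPF produces the same tensor as SPP while reusing intermediate results.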

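Neck Structure

[Figure: Neck structure (FPN + PAN)]

Inspired by PANet, Bochkovskiy et al. combined FPN with PAN in the Neck of YOLOv4 to form an improved feature pyramid. FPN passes deep semantic features down to the shallow layers, strengthening the semantic representation at every scale, while PAN passes shallow localization information up to the deep layers, strengthening localization at every scale. Extensive experiments have shown this structure to be effective, and the YOLOv5 series retains it.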

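Head Structure

The YOLOv5 head is the same as in v3 and v4. A 1x1 convolution adjusts the channel count of each of the three feature maps output by the Neck to (5 + number of classes) * number of anchors per feature map. YOLOv5 assigns 3 anchors to each feature map, so for COCO, which has 80 classes, the channel count is (5 + 80) * 3 = 255. The output is then decoded to map the predicted positions back onto the input image. The 5 values are the center x-coordinate, center y-coordinate, width, and height of the predicted box, plus a confidence score. The confidence expresses how reliable the predicted box is; it lies in (0, 1), and a larger value means the box is more likely to contain an object.

The 3 detection layers in the Head correspond to the 3 feature maps of different sizes produced by the Neck. Every grid cell of a feature map has 3 preset anchors with different aspect ratios, and the channel dimension of the feature map stores the position and class information predicted relative to these anchor priors, which is used to predict and regress the objects.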
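
The channel arithmetic above, and the total number of boxes the three layers emit, can be written out as a short sketch. The function names are hypothetical, and the strides (8, 16, 32) and the 640-pixel input are assumptions used for the example rather than values stated in this article.

```python
def head_channels(num_classes, anchors_per_layer=3):
    """Output channels of one detection layer: (x, y, w, h, conf + classes) per anchor."""
    return (5 + num_classes) * anchors_per_layer


def num_predictions(img_size, strides=(8, 16, 32), anchors_per_layer=3):
    """Total predicted boxes for a square input, summed over all detection layers."""
    return sum(anchors_per_layer * (img_size // s) ** 2 for s in strides)


coco_channels = head_channels(80)   # 80 COCO classes
total_boxes = num_predictions(640)  # assumed 640 x 640 input
print(coco_channels, total_boxes)
```

Under these assumptions a single image produces tens of thousands of candidate boxes, which is why confidence filtering and NMS are needed afterwards.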

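The box regression procedure on the prediction feature map is as follows:

[Figure: bounding-box regression relative to the grid cell and the anchor]

As shown in the figure:

(bx, by, bw, bh) are the center x and y coordinates, width, and height of the predicted box

(cx, cy) are the coordinates of the top-left corner of the grid cell that contains the center of the predicted box

(tx, ty) are the predicted offsets of the box center relative to the top-left corner of the grid cell

(tw, th) are the predicted scaling factors of the box width and height relative to the anchor width and height

(pw, ph) are the width and height of the anchor prior

The formulas are: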
$b_x=(2* \sigma(t_x)-0.5) + c_x$

$b_y=(2* \sigma(t_y)-0.5) + c_y$

$b_w=p_w*(2* \sigma(t_w))^2$

$b_h=p_h*(2* \sigma(t_h))^2$

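The Sigmoid function maps each raw offset into (0, 1). With the regression formulas above, the offset of the box center relative to the top-left corner of the grid cell, 2σ(t) − 0.5, therefore lies in (−0.5, 1.5), the purple region in the figure above, so the center stays close to the current grid cell but may extend up to half a cell into a neighboring one. The width and height of the predicted box are scaled relative to the anchor by the factor (2σ(t))^2, which lies in (0, 4).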
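
The four regression formulas can be traced with a minimal pure-Python sketch. The function name `decode_box`, the tuple layout of its arguments, and the optional `stride` factor (which converts grid units to pixels, as the detection head code does for the center coordinates) are illustrative choices, not code from the repository.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def decode_box(t, cell, anchor, stride=1.0):
    """Apply the regression formulas to raw outputs t = (tx, ty, tw, th)."""
    tx, ty, tw, th = t
    cx, cy = cell    # top-left corner of the grid cell, in grid units
    pw, ph = anchor  # anchor width and height
    bx = (2.0 * sigmoid(tx) - 0.5 + cx) * stride
    by = (2.0 * sigmoid(ty) - 0.5 + cy) * stride
    bw = pw * (2.0 * sigmoid(tw)) ** 2
    bh = ph * (2.0 * sigmoid(th)) ** 2
    return bx, by, bw, bh


neutral = decode_box((0, 0, 0, 0), cell=(3, 4), anchor=(10, 20))
lowest = decode_box((-50, -50, -50, -50), cell=(3, 4), anchor=(10, 20))
highest = decode_box((50, 50, 50, 50), cell=(3, 4), anchor=(10, 20))
print(neutral, lowest, highest)
```

Zero raw outputs place the center in the middle of the cell and return the anchor size unchanged, while very negative or very positive outputs approach the limits of the (−0.5, 1.5) offset range and the (0, 4) scale range.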

Detection head code

class Detect(nn.Module):
    stride = None  # strides computed during build
    export = False  # onnx export

    def __init__(self, nc=80, anchors=(), ch=()):  # detection layer
        super(Detect, self).__init__()
        self.nc = nc  # number of classes
        self.no = nc + 5  # number of outputs per anchor
        self.nl = len(anchors)  # number of detection layers
        self.na = len(anchors[0]) // 2  # number of anchors
        self.grid = [torch.zeros(1)] * self.nl  # init grid
        a = torch.tensor(anchors).float().view(self.nl, -1, 2)
        self.register_buffer('anchors', a)  # shape(nl,na,2)
        self.register_buffer('anchor_grid', a.clone().view(self.nl, 1, -1, 1, 1, 2))  # shape(nl,1,na,1,1,2)
        self.m = nn.ModuleList(nn.Conv2d(x, self.no * self.na, 1) for x in ch)  # output conv

    def forward(self, x):
        # x = x.copy()  # for profiling
        z = []  # inference output
        self.training |= self.export
        for i in range(self.nl):
            x[i] = self.m[i](x[i])  # conv
            bs, _, ny, nx = x[i].shape  # x(bs,255,20,20) to x(bs,3,20,20,85)
            x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()

            if not self.training:  # inference
                if self.grid[i].shape[2:4] != x[i].shape[2:4]:
                    self.grid[i] = self._make_grid(nx, ny).to(x[i].device)

                y = x[i].sigmoid()
                y[..., 0:2] = (y[..., 0:2] * 2. - 0.5 + self.grid[i]) * self.stride[i]  # xy
                y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
                z.append(y.view(bs, -1, self.no))

        return x if self.training else (torch.cat(z, 1), x)

    @staticmethod
    def _make_grid(nx=20, ny=20):
        yv, xv = torch.meshgrid([torch.arange(ny), torch.arange(nx)])
        return torch.stack((xv, yv), 2).view((1, 1, ny, nx, 2)).float()

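Non-Maximum Suppression (NMS)

A single object usually yields many redundant predicted boxes. The core idea of non-maximum suppression (NMS) is to suppress the boxes that are not local maxima, removing the redundancy and keeping only the best-scoring box in each neighborhood.

Among the many overlapping boxes, we therefore remove the less accurate ones and keep the most accurate one, using the confidence score as the measure. The main steps of NMS are:

1. For each class, sort the boxes by confidence in descending order and select the box with the highest confidence.
2. Go through all remaining boxes and compute the IoU between each of them and the selected box. If the IoU exceeds the chosen IoU threshold, the two boxes are taken to predict the same object, and the lower-confidence box is removed.
3. From the boxes not yet processed, again select the one with the highest confidence and repeat step 2, until no unprocessed boxes remain.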
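
The steps above can be sketched for a single class in pure Python. The function names `iou` and `greedy_nms`, the (x1, y1, x2, y2) box format, and the sample boxes are assumptions for illustration; this is a simplified sketch, not the implementation used by the repository.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_thres=0.45):
    """Return the indices of the kept boxes for ONE class, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:  # step 3: repeat until nothing is left to process
        best = order.pop(0)  # step 1: highest remaining confidence
        keep.append(best)
        # step 2: drop remaining boxes that overlap the kept box too much
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thres]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)
print(kept)
```

The first two boxes overlap with an IoU of about 0.68, so only the higher-scoring one survives at the 0.45 threshold, while the distant third box is kept.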

Code implementation:

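import time

import torch
import torchvision

# xywh2xyxy and box_iou are helpers from utils/general.py in the YOLOv5 repository


def non_max_suppression(prediction, conf_thres=0.25, iou_thres=0.45, classes=None, agnostic=False, multi_label=False,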
                        labels=()):
    """Runs Non-Maximum Suppression (NMS) on inference results

    Returns:
         list of detections, one (n,6) tensor per image [xyxy, conf, cls]
    """

    nc = prediction.shape[2] - 5  # number of classes
    xc = prediction[..., 4] > conf_thres  # candidates

    # Settings
    min_wh, max_wh = 2, 4096  # (pixels) minimum and maximum box width and height
    max_det = 300  # maximum number of detections per image
    max_nms = 30000  # maximum number of boxes into torchvision.ops.nms()
    time_limit = 10.0  # seconds to quit after
    redundant = True  # require redundant detections
    multi_label &= nc > 1  # multiple labels per box (adds 0.5ms/img)
    merge = False  # use merge-NMS

    t = time.time()
    output = [torch.zeros((0, 6), device=prediction.device)] * prediction.shape[0]
    for xi, x in enumerate(prediction):  # image index, image inference
        # Apply constraints
        # x[((x[..., 2:4] < min_wh) | (x[..., 2:4] > max_wh)).any(1), 4] = 0  # width-height
        x = x[xc[xi]]  # confidence

        # Cat apriori labels if autolabelling
        if labels and len(labels[xi]):
            l = labels[xi]
            v = torch.zeros((len(l), nc + 5), device=x.device)
            v[:, :4] = l[:, 1:5]  # box
            v[:, 4] = 1.0  # conf
            v[range(len(l)), l[:, 0].long() + 5] = 1.0  # cls
            x = torch.cat((x, v), 0)

        # If none remain process next image
        if not x.shape[0]:
            continue

        # Compute conf
        x[:, 5:] *= x[:, 4:5]  # conf = obj_conf * cls_conf

        # Box (center x, center y, width, height) to (x1, y1, x2, y2)
        box = xywh2xyxy(x[:, :4])

        # Detections matrix nx6 (xyxy, conf, cls)
        if multi_label:
            i, j = (x[:, 5:] > conf_thres).nonzero(as_tuple=False).T
            x = torch.cat((box[i], x[i, j + 5, None], j[:, None].float()), 1)
        else:  # best class only
            conf, j = x[:, 5:].max(1, keepdim=True)
            x = torch.cat((box, conf, j.float()), 1)[conf.view(-1) > conf_thres]

        # Filter by class
        if classes is not None:
            x = x[(x[:, 5:6] == torch.tensor(classes, device=x.device)).any(1)]

        # Apply finite constraint
        # if not torch.isfinite(x).all():
        #     x = x[torch.isfinite(x).all(1)]

        # Check shape
        n = x.shape[0]  # number of boxes
        if not n:  # no boxes
            continue
        elif n > max_nms:  # excess boxes
            x = x[x[:, 4].argsort(descending=True)[:max_nms]]  # sort by confidence

        # Batched NMS
        c = x[:, 5:6] * (0 if agnostic else max_wh)  # classes
        boxes, scores = x[:, :4] + c, x[:, 4]  # boxes (offset by class), scores
        i = torchvision.ops.nms(boxes, scores, iou_thres)  # NMS
        if i.shape[0] > max_det:  # limit detections
            i = i[:max_det]
        if merge and (1 < n < 3E3):  # Merge NMS (boxes merged using weighted mean)
            # update boxes as boxes(i,4) = weights(i,n) * boxes(n,4)
            iou = box_iou(boxes[i], boxes) > iou_thres  # iou matrix
            weights = iou * scores[None]  # box weights
            x[i, :4] = torch.mm(weights, x[:, :4]).float() / weights.sum(1, keepdim=True)  # merged boxes
            if redundant:
                i = i[iou.sum(1) > 1]  # require redundancy

        output[xi] = x[i]
        if (time.time() - t) > time_limit:
            print(f'WARNING: NMS time limit {time_limit}s exceeded')
            break  # time limit exceeded

    return output
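
One detail of the function above that is easy to miss is the "Batched NMS" step: each box is shifted by `class index * max_wh` before a single NMS call, so boxes of different classes can never overlap. The sketch below illustrates that idea in pure Python; the helper names `offset_by_class` and `overlap_area` and the sample box are hypothetical, and the trick assumes all box coordinates are smaller than `max_wh`.

```python
def offset_by_class(box, cls, max_wh=4096):
    """Shift an (x1, y1, x2, y2) box by cls * max_wh along both axes."""
    off = cls * max_wh
    return (box[0] + off, box[1] + off, box[2] + off, box[3] + off)


def overlap_area(a, b):
    """Intersection area of two (x1, y1, x2, y2) boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    return iw * ih


box = (100, 100, 200, 200)
same_class = overlap_area(offset_by_class(box, 1), offset_by_class(box, 1))
diff_class = overlap_area(offset_by_class(box, 0), offset_by_class(box, 1))
print(same_class, diff_class)
```

Identical boxes of the same class still overlap fully after the shift, while identical boxes of different classes no longer intersect, so one class-agnostic NMS pass behaves like a separate pass per class.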

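Loss Function

The YOLOv5 loss consists of three parts:

Classification loss: BCE loss, computed only for positive samples.
Objectness loss (confidence loss): also BCE loss, with the CIoU between the predicted box and the GT box as the label, computed over all samples.
Box regression loss: CIoU loss, computed only for positive samples.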

$Loss=\lambda_1L_{cls} + \lambda_2L_{obj} + \lambda_3L_{box}$

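Here $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the weights that balance the three parts. They are set through the cls, obj, and box entries of hyp.scratch.yaml in the project's data directory, with default values 0.5, 1.0, and 0.05 respectively.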

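Classification Loss

YOLOv5 uses binary cross-entropy for the classification loss by default. Binary cross-entropy is defined as
$L=-y\log p-(1-y)\log(1-p)=\begin{cases} -\log p, & y=1\\ -\log(1-p), & y=0 \end{cases}$
where $y$ is the label of the input sample (1 for a positive sample, 0 for a negative one) and $p$ is the probability the model assigns to the sample being positive. Define
$p_t=\begin{cases} p, & y=1\\ 1 - p, & y=0 \end{cases}$
Then the cross-entropy simplifies to
$L=-\log p_t$
YOLOv5 uses binary cross-entropy for both the class probabilities and the objectness score, so the labels are not mutually exclusive. Instead of a softmax, YOLOv5 uses several independent logistic classifiers, each estimating the probability that the input belongs to one particular label, and during training applies binary cross-entropy to each label. Avoiding the softmax also reduces computational complexity.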
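
The simplified form $L=-\log p_t$ and the per-label treatment can be illustrated with a few lines of pure Python. The function names `bce` and `multilabel_bce` and the sample probabilities are hypothetical, and averaging over the labels is an assumption made here to mirror a mean reduction.

```python
import math


def bce(p, y):
    """Binary cross-entropy for one label: -log(p_t)."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def multilabel_bce(probs, labels):
    """Mean of independent per-label BCE terms (no softmax across labels)."""
    return sum(bce(p, y) for p, y in zip(probs, labels)) / len(probs)


confident_right = bce(0.9, 1)
confident_wrong = bce(0.1, 1)
# Three independent class probabilities; they do not need to sum to 1.
loss = multilabel_bce([0.9, 0.2, 0.7], [1, 0, 1])
print(confident_right, confident_wrong, loss)
```

Each label contributes its own term, so a box can score highly for more than one class without the probabilities competing as they would under a softmax.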

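Confidence Loss (Objectness Loss)

The confidence of a predicted box expresses how reliable it is: the larger the value, the more reliable the box and the closer it is to the ground-truth box. For the confidence label, earlier YOLO versions set the label to 1 for every grid cell that contains an object (positive samples) and to 0 for all other cells (negative samples). The problem is that some predicted boxes only lie near the object and do not localize it precisely, yet still receive a label of 1.

YOLOv5 instead uses the CIoU between the predicted box of a grid cell and the ground-truth box as the confidence label of that box. As with the classification loss, YOLOv5 uses binary cross-entropy for the objectness loss by default.

In addition, the objectness loss is weighted differently on the different prediction layers:
$L_{obj}=4.0*L^{small}_{obj}+1.0*L^{medium}_{obj}+0.4*L^{large}_{obj}$
In the source code, the weight is 4.0 for the layer that predicts small objects (the largest feature map), 1.0 for the layer that predicts medium objects, and 0.4 for the layer that predicts large objects (the smallest feature map). According to the YOLOv5 author, these hyperparameters were tuned for the COCO dataset.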
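
Putting the per-layer objectness weights together with the three overall weights from the loss formula earlier gives the following sketch. The function names and the sample loss values are hypothetical; the default weights are the ones quoted in the text.

```python
def balanced_obj_loss(per_layer_losses, balance=(4.0, 1.0, 0.4)):
    """Weighted sum of objectness losses, ordered small, medium, large objects."""
    return sum(w * l for w, l in zip(balance, per_layer_losses))


def total_loss(lcls, lobj, lbox, w_cls=0.5, w_obj=1.0, w_box=0.05):
    """Loss = lambda_1 * L_cls + lambda_2 * L_obj + lambda_3 * L_box."""
    return w_cls * lcls + w_obj * lobj + w_box * lbox


# Made-up per-layer objectness losses for illustration only.
lobj = balanced_obj_loss([0.5, 0.2, 0.1])
loss = total_loss(lcls=2.0, lobj=lobj, lbox=4.0)
print(lobj, loss)
```

With equal raw losses on every layer, the small-object layer contributes ten times as much as the large-object layer.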

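Box Regression Loss

YOLOv5 uses CIoU for the box loss by default. CIoU builds on DIoU, which takes the overlap between the predicted box and the ground-truth box, the distance between them, and their scale into account, making box regression more stable. The DIoU loss is
$L_{DIoU}=1-IoU+ \frac{ \rho^2(b,b^{gt})}{c^2}$
where $b$ and $b^{gt}$ are the center points of the predicted box and the ground-truth box (gt stands for ground truth), $\rho$ is the Euclidean distance between the two centers, and $c$ is the diagonal length of the smallest box enclosing both the predicted box and the ground-truth box, as shown in the figure below:

[Figure: center distance ρ and enclosing-box diagonal c used by DIoU]

CIoU adds a term $\alpha v$ to the DIoU penalty, which accounts for the difference between the aspect ratios of the predicted box and the ground-truth box. The CIoU loss is
$L_{CIoU}=1-IoU+ \frac{ \rho^2(b,b^{gt})}{c^2} + \alpha v$
where $\alpha$ and $v$ are computed as
$\alpha=\frac{v}{(1-IoU) + v}$

$v=\frac{4}{\pi^2}(\arctan\frac{w^{gt}}{h^{gt}}-\arctan\frac{w}{h})^2$

Here $w^{gt}$ and $h^{gt}$ are the width and height of the ground-truth box, and $w$ and $h$ are the width and height of the predicted box.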
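
The CIoU formulas can be followed term by term in a small pure-Python sketch. The function name `ciou_loss`, the (x1, y1, x2, y2) box format, and the small `eps` added to avoid division by zero are assumptions for illustration; this is not the repository's `bbox_iou` function.

```python
import math


def ciou_loss(pred, gt, eps=1e-9):
    """CIoU loss for two boxes given as (x1, y1, x2, y2)."""
    w, h = pred[2] - pred[0], pred[3] - pred[1]
    wg, hg = gt[2] - gt[0], gt[3] - gt[1]

    # IoU term
    iw = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    ih = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = iw * ih
    iou = inter / (w * h + wg * hg - inter + eps)

    # rho^2: squared distance between the two box centers
    rho2 = ((pred[0] + pred[2] - gt[0] - gt[2]) ** 2
            + (pred[1] + pred[3] - gt[1] - gt[3]) ** 2) / 4.0
    # c^2: squared diagonal of the smallest enclosing box
    cw = max(pred[2], gt[2]) - min(pred[0], gt[0])
    ch = max(pred[3], gt[3]) - min(pred[1], gt[1])
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio term v and its weight alpha
    v = (4.0 / math.pi ** 2) * (math.atan(wg / (hg + eps)) - math.atan(w / (h + eps))) ** 2
    alpha = v / ((1.0 - iou) + v + eps)

    return 1.0 - iou + rho2 / c2 + alpha * v


same = ciou_loss((0, 0, 10, 10), (0, 0, 10, 10))
apart = ciou_loss((0, 0, 10, 10), (20, 20, 30, 30))
print(same, apart)
```

Identical boxes give a loss of essentially zero, while disjoint boxes still receive a distance penalty on top of 1 − IoU, so the loss keeps providing a useful signal even when the boxes do not overlap.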

Loss computation code

class ComputeLoss:
    # Compute losses
    def __init__(self, model, autobalance=False):
        super(ComputeLoss, self).__init__()
        device = next(model.parameters()).device  # get model device
        h = model.hyp  # hyperparameters

        # Define criteria
        BCEcls = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([h['cls_pw']], device=device))
        BCEobj = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([h['obj_pw']], device=device))

        # Class label smoothing https://arxiv.org/pdf/1902.04103.pdf eqn 3
        self.cp, self.cn = smooth_BCE(eps=h.get('label_smoothing', 0.0))  # positive, negative BCE targets

        # Focal loss
        g = h['fl_gamma']  # focal loss gamma
        if g > 0:
            BCEcls, BCEobj = FocalLoss(BCEcls, g), FocalLoss(BCEobj, g)

        det = model.module.model[-1] if is_parallel(model) else model.model[-1]  # Detect() module
        self.balance = {3: [4.0, 1.0, 0.4]}.get(det.nl, [4.0, 1.0, 0.25, 0.06, .02])  # P3-P7
        self.ssi = list(det.stride).index(16) if autobalance else 0  # stride 16 index
        self.BCEcls, self.BCEobj, self.gr, self.hyp, self.autobalance = BCEcls, BCEobj, model.gr, h, autobalance
        for k in 'na', 'nc', 'nl', 'anchors':
            setattr(self, k, getattr(det, k))

    def __call__(self, p, targets):  # predictions, targets, model
        device = targets.device
        lcls, lbox, lobj = torch.zeros(1, device=device), torch.zeros(1, device=device), torch.zeros(1, device=device)
        tcls, tbox, indices, anchors = self.build_targets(p, targets)  # targets

        # Losses
        for i, pi in enumerate(p):  # layer index, layer predictions
            b, a, gj, gi = indices[i]  # image, anchor, gridy, gridx
            tobj = torch.zeros_like(pi[..., 0], device=device)  # target obj

            n = b.shape[0]  # number of targets
            if n:
                ps = pi[b, a, gj, gi]  # prediction subset corresponding to targets

                # Regression
                pxy = ps[:, :2].sigmoid() * 2. - 0.5
                pwh = (ps[:, 2:4].sigmoid() * 2) ** 2 * anchors[i]
                pbox = torch.cat((pxy, pwh), 1)  # predicted box
                iou = bbox_iou(pbox.T, tbox[i], x1y1x2y2=False, CIoU=True)  # iou(prediction, target)
                lbox += (1.0 - iou).mean()  # iou loss

                # Objectness
                tobj[b, a, gj, gi] = (1.0 - self.gr) + self.gr * iou.detach().clamp(0).type(tobj.dtype)  # iou ratio

                # Classification
                if self.nc > 1:  # cls loss (only if multiple classes)
                    t = torch.full_like(ps[:, 5:], self.cn, device=device)  # targets
                    t[range(n), tcls[i]] = self.cp
                    lcls += self.BCEcls(ps[:, 5:], t)  # BCE

                # Append targets to text file
                # with open('targets.txt', 'a') as file:
                #     [file.write('%11.5g ' * 4 % tuple(x) + '\n') for x in torch.cat((txy[i], twh[i]), 1)]

            obji = self.BCEobj(pi[..., 4], tobj)
            lobj += obji * self.balance[i]  # obj loss
            if self.autobalance:
                self.balance[i] = self.balance[i] * 0.9999 + 0.0001 / obji.detach().item()

        if self.autobalance:
            self.balance = [x / self.balance[self.ssi] for x in self.balance]
        lbox *= self.hyp['box']
        lobj *= self.hyp['obj']
        lcls *= self.hyp['cls']
        bs = tobj.shape[0]  # batch size

        loss = lbox + lobj + lcls
        return loss * bs, torch.cat((lbox, lobj, lcls, loss)).detach()