PyTorch | YOLOv3 Principles and Code Walkthrough (Part 2)

NotFound1911

Category: Self-study. Tags: pytorch, yolov3

Article link: https://blog.csdn.net/qq_24739717/article/details/96705055

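Suggested reading beforehand:

PyTorch | YOLOv3 Principles and Code Walkthrough (Part 1)

https://blog.csdn.net/qq_24739717/article/details/92399359

Code analyzed:

https://github.com/eriklindernoren/PyTorch-YOLOv3

Part 1 explains the YOLOv3 inference (detection) process.

This article focuses on how YOLOv3 is trained, i.e. it walks through train.py.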

2. Loading the data

2.1 Preparation before training (train.py, part 1)

2.1.1 Initialization

from __future__ import division

from models import *
from utils.logger import *
from utils.utils import *
from utils.datasets import *
from utils.parse_config import *
#from test import evaluate

from terminaltables import AsciiTable

import os
import sys
import time
import datetime
import argparse

import torch
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision import transforms
from torch.autograd import Variable
import torch.optim as optim

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=100, help="number of epochs")
    parser.add_argument("--batch_size", type=int, default=8, help="size of each image batch")
    parser.add_argument("--gradient_accumulations", type=int, default=2, help="number of gradient accums before step")
    parser.add_argument("--model_def", type=str, default="config/yolov3_myself.cfg", help="path to model definition file")
    parser.add_argument("--data_config", type=str, default="config/voc_myself.data", help="path to data config file")
    parser.add_argument("--pretrained_weights", type=str, default="weights/darknet53.conv.74", help="if specified starts from checkpoint model")
    parser.add_argument("--n_cpu", type=int, default=0, help="number of cpu threads to use during batch generation")
    parser.add_argument("--img_size", type=int, default=416, help="size of each image dimension")
    parser.add_argument("--checkpoint_interval", type=int, default=1, help="interval between saving model weights")
    parser.add_argument("--evaluation_interval", type=int, default=1, help="interval evaluations on validation set")
    parser.add_argument("--compute_map", default=False, help="if True computes mAP every tenth batch")
    parser.add_argument("--multiscale_training", default=True, help="allow for multi-scale training")
    opt = parser.parse_args()
    print(opt)

    logger = Logger("logs")

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    os.makedirs("output", exist_ok=True)
    os.makedirs("checkpoints", exist_ok=True)

2.1.2 Loading the network:

    # Get data configuration
    data_config = parse_data_config(opt.data_config)
    train_path = data_config["train"]
    valid_path = data_config["valid"]
    class_names = load_classes(data_config["names"])

    # Initiate model
    model = Darknet(opt.model_def).to(device)
    model.apply(weights_init_normal)

    # If specified we start from checkpoint
    if opt.pretrained_weights:
        if opt.pretrained_weights.endswith(".pth"):
            model.load_state_dict(torch.load(opt.pretrained_weights))
        else:
            model.load_darknet_weights(opt.pretrained_weights)

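The training path, the validation path, and the class names are parsed from the data config file (opt.data_config), and the Darknet (YOLOv3) model described by the .cfg file (opt.model_def) is loaded into model. model.apply(weights_init_normal) applies a custom weight initialization: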

def weights_init_normal(m):
    classname = m.__class__.__name__
    if classname.find("Conv") != -1:
        torch.nn.init.normal_(m.weight.data, 0.0, 0.02)
    elif classname.find("BatchNorm2d") != -1:
        torch.nn.init.normal_(m.weight.data, 1.0, 0.02)
        torch.nn.init.constant_(m.bias.data, 0.0)

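In practice, however, training usually starts from pretrained weights: a .pth checkpoint is loaded with model.load_state_dict(torch.load(opt.pretrained_weights)), and any other file (such as the default weights/darknet53.conv.74) with model.load_darknet_weights(opt.pretrained_weights).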

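2.1.3 Wrapping the dataset in a DataLoader

    # dataset is the ListDataset built from train_path (its construction is not shown in this excerpt)
    # The collate_fn argument of DataLoader produces custom batch output
    # - shuffle: when True, the dataset is reshuffled at every epoch
    # - collate_fn: how samples are merged into a batch; we can supply our own function to get exactly the behavior we want
    # - drop_last: what to do with the samples left over when the dataset length is not divisible by batch_size; True drops them, otherwise they are kept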
    dataloader = torch.utils.data.DataLoader(
        dataset,
        batch_size=opt.batch_size,
        shuffle=True,
        num_workers=opt.n_cpu,
        pin_memory=True,
        collate_fn=dataset.collate_fn,
    )
    # Optimizer
    optimizer = torch.optim.Adam(model.parameters())

    metrics = [
        "grid_size",
        "loss",
        "x",
        "y",
        "w",
        "h",
        "conf",
        "cls",
        "cls_acc",
        "recall50",
        "recall75",
        "precision",
        "conf_obj",
        "conf_noobj",
    ]

2.2 Training and computing the loss (train.py, part 2)

2.2.1 Starting the iterations

The iteration code, up to the optimizer step, is as follows:

    for epoch in range(opt.epochs):
        model.train()
        start_time = time.time()
        print("len(dataloader):\n",len(dataloader))
        for batch_i, (_, imgs, targets) in enumerate(dataloader):
            batches_done = len(dataloader) * epoch + batch_i
            print("batch_i:\n",batch_i)
            print("imgs.shape:\n",imgs.shape)
            print("batches_done:\n",batches_done)
            imgs = Variable(imgs.to(device))
            targets = Variable(targets.to(device), requires_grad=False)

            loss, outputs = model(imgs, targets)
            loss.backward()

            if batches_done % opt.gradient_accumulations:
                # Accumulates gradient before each step
                optimizer.step()
                optimizer.zero_grad()

2.2.2 Getting images from the batch and targets from the labels

To understand for batch_i, (_, imgs, targets) in enumerate(dataloader):, look at __getitem__ in ListDataset and at the collate_fn passed to the DataLoader.

__getitem__ in ListDataset (excerpt):

        if os.path.exists(label_path):
            
            boxes = torch.from_numpy(np.loadtxt(label_path).reshape(-1, 5))
            # Extract coordinates for unpadded + unscaled image
            x1 = w_factor * (boxes[:, 1] - boxes[:, 3] / 2)+1#xmin
            y1 = h_factor * (boxes[:, 2] - boxes[:, 4] / 2)+1#ymin
            x2 = w_factor * (boxes[:, 1] + boxes[:, 3] / 2)+1#xmax
            y2 = h_factor * (boxes[:, 2] + boxes[:, 4] / 2)+1#ymax
            # Adjust for added padding
            # Shift the annotated boxes by the padding
            x1 += pad[0]  # left
            y1 += pad[2]  # top
            x2 += pad[1]  # right
            y2 += pad[3]  # bottom
            # Returns (x, y, w, h), rescaled to the padded image
            boxes[:, 1] = ((x1 + x2) / 2) / padded_w
            boxes[:, 2] = ((y1 + y2) / 2) / padded_h
            boxes[:, 3] *= w_factor / padded_w
            boxes[:, 4] *= h_factor / padded_h

            targets = torch.zeros((len(boxes), 6))
            targets[:, 1:] = boxes
            print("len(boxes)：",len(boxes))
            print("boxes:\n",boxes)
            print("targets:\n",targets)

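These lines parse the coordinates from the annotation .txt files. The script that generates those .txt annotations for a VOC dataset is voc_label.py; the complete code is as follows: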

import xml.etree.ElementTree as ET
import pickle
import os
from os import listdir, getcwd
from os.path import join

sets=[('', 'train'), ('', 'val'), ('', 'test')]

classes = ["nodule"]


def convert(size, box):
    dw = 1./(size[0])
    dh = 1./(size[1])
    x = (box[0] + box[1])/2.0 - 1
    y = (box[2] + box[3])/2.0 - 1
    w = box[1] - box[0]
    h = box[3] - box[2]
    x = x*dw
    w = w*dw
    y = y*dh
    h = h*dh
    return (x,y,w,h)

def convert_annotation(year, image_id):
    in_file = open('VOCdevkit/VOC%s/Annotations/%s.xml'%(year, image_id))
    out_file = open('VOCdevkit/VOC%s/labels/%s.txt'%(year, image_id), 'w')
    tree=ET.parse(in_file)
    root = tree.getroot()
    size = root.find('size')
    w = int(size.find('width').text)
    h = int(size.find('height').text)

    for obj in root.iter('object'):
        #difficult = obj.find('difficult').text
        difficult = 0
        cls = obj.find('name').text
        if cls not in classes or int(difficult)==1:
            continue
        cls_id = classes.index(cls)
        xmlbox = obj.find('bndbox')
        b = (float(xmlbox.find('xmin').text), float(xmlbox.find('xmax').text), float(xmlbox.find('ymin').text), float(xmlbox.find('ymax').text))
        bb = convert((w,h), b)
        out_file.write(str(cls_id) + " " + " ".join([str(a) for a in bb]) + '\n')

wd = getcwd()

for year, image_set in sets:
    if not os.path.exists('VOCdevkit/VOC%s/labels/'%(year)):
        os.makedirs('VOCdevkit/VOC%s/labels/'%(year))
    image_ids = open('VOCdevkit/VOC%s/ImageSets/Main/%s.txt'%(year, image_set)).read().strip().split()
    list_file = open('%s_%s.txt'%(year, image_set), 'w')
    for image_id in image_ids:
        list_file.write('%s/VOCdevkit/VOC%s/JPEGImages/%s.png\n'%(wd, year, image_id))
        convert_annotation(year, image_id)
    list_file.close()

os.system("cat 2007_train.txt 2007_val.txt 2012_train.txt 2012_val.txt > train.txt")
os.system("cat 2007_train.txt 2007_val.txt 2007_test.txt 2012_train.txt 2012_val.txt > train.all.txt")

The part to focus on is the convert function, together with these statements:

        b = (float(xmlbox.find('xmin').text), float(xmlbox.find('xmax').text), float(xmlbox.find('ymin').text), float(xmlbox.find('ymax').text))
        bb = convert((w,h), b)
        out_file.write(str(cls_id) + " " + " ".join([str(a) for a in bb]) + '\n')

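This script converts xmin, xmax, ymin, and ymax into the center coordinates of the bounding box and normalizes them, together with the width and height, to the range 0~1. During training, these box coordinates and sizes have to be parsed back and placed into a tensor named targets; see the figure below for how the coordinates are converted. (Note: w_factor and h_factor in __getitem__ are the width and height of the loaded image.) What finally goes into targets is the ground-truth center coordinates plus w and h, all normalized by padded_w and padded_h. targets is used again below, in the coordinate prediction step.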
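
The round trip described above can be checked with plain arithmetic. The following is a minimal sketch for a single box, not the repository's tensor code: voc_to_yolo mirrors convert(), and yolo_to_padded mirrors the box arithmetic in __getitem__. It assumes the image is padded symmetrically along its shorter side until it is square, so that pad = (left, right, top, bottom); both function names and the 400x300 example image are hypothetical.

```python
def voc_to_yolo(img_w, img_h, xmin, xmax, ymin, ymax):
    """Mirror of convert() in voc_label.py: corners -> normalized center/size."""
    x = ((xmin + xmax) / 2.0 - 1) / img_w
    y = ((ymin + ymax) / 2.0 - 1) / img_h
    w = (xmax - xmin) / img_w
    h = (ymax - ymin) / img_h
    return x, y, w, h


def yolo_to_padded(label, img_w, img_h):
    """Mirror of the box arithmetic in __getitem__ for one (x, y, w, h) label.

    Assumption: the image is padded along its shorter side until it is
    square, and pad = (left, right, top, bottom).
    """
    x, y, w, h = label
    diff = abs(img_h - img_w)
    pad1, pad2 = diff // 2, diff - diff // 2
    pad = (0, 0, pad1, pad2) if img_h <= img_w else (pad1, pad2, 0, 0)
    padded_w = img_w + pad[0] + pad[1]
    padded_h = img_h + pad[2] + pad[3]

    # Back to pixel corners (the +1 undoes the -1 in convert), then shift by the padding.
    x1 = img_w * (x - w / 2) + 1 + pad[0]
    y1 = img_h * (y - h / 2) + 1 + pad[2]
    x2 = img_w * (x + w / 2) + 1 + pad[1]
    y2 = img_h * (y + h / 2) + 1 + pad[3]

    # Re-normalize against the padded (square) image.
    return (
        (x1 + x2) / 2 / padded_w,
        (y1 + y2) / 2 / padded_h,
        w * img_w / padded_w,
        h * img_h / padded_h,
    )


label = voc_to_yolo(400, 300, xmin=100, xmax=200, ymin=50, ymax=150)
target = yolo_to_padded(label, 400, 300)
print(label)   # normalized against the original 400x300 image
print(target)  # normalized against the padded 400x400 image
```

For this example the padded target comes out at approximately (0.375, 0.375, 0.25, 0.25): the 100x100 pixel box keeps its aspect ratio once both axes are normalized by the same padded size.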

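The main job of collate_fn is to resize imgs. YOLOv3 uses multi-scale training, repeatedly changing the image resolution during training, so that it copes well with detection on images of various resolutions. The complete collate_fn code is as follows: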

    def collate_fn(self, batch):
        paths, imgs, targets = list(zip(*batch))
        # Remove empty placeholder targets
        targets = [boxes for boxes in targets if boxes is not None]
        # Add sample index to targets
        for i, boxes in enumerate(targets):
            boxes[:, 0] = i
        targets = torch.cat(targets, 0)
        # Selects new image size every tenth batch
        if self.multiscale and self.batch_count % 10 == 0:
            # Rescale the images: pick a new resolution
            self.img_size = random.choice(range(self.min_size, self.max_size + 1, 32))
        # Resize images to input shape
        imgs = torch.stack([resize(img, self.img_size) for img in imgs])
        self.batch_count += 1
        return paths, imgs, targets

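Pay attention to how targets changes. In ListDataset.__getitem__, the first column of targets is 0. What is that column for? Each __getitem__ call returns a tensor holding the n targets of one image, with target[:, 0] = 0 (the zero first column mentioned above), so a batch reaches collate_fn as a list with one such tensor per image. target[:, 0] is meant to hold the ID of the image the target belongs to. During training, collate_fn writes the image's index within the batch into this column (boxes[:, 0] = i) and then merges all targets into a single tensor (targets = torch.cat(targets, 0)). After that, the first column is the only way to tell which image a target belongs to.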

            targets = torch.zeros((len(boxes), 6))
            targets[:, 1:] = boxes

collate_fn is also the reason why your images may be 512x512 while training actually runs at, for example, 384x384: the input size is changed in steps of 32 pixels.
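
The two things collate_fn does to the labels and the image size can be sketched without tensors. This is an illustrative plain-Python version, not the repository's code: collate_targets and pick_img_size are hypothetical names, rows are ordinary lists of the form [sample_index, class_id, x, y, w, h], and the size bounds 320 and 512 are assumed values chosen only for the example.

```python
import random


def collate_targets(per_image_targets):
    """Stamp each row with its image's position in the batch, then concatenate.

    Every row is [sample_index, class_id, x, y, w, h]; column 0 arrives as 0.
    """
    merged = []
    for i, rows in enumerate(per_image_targets):
        for row in rows:
            merged.append([i] + list(row[1:]))
    return merged


def pick_img_size(min_size, max_size, rng=random):
    """Choose a training resolution between min_size and max_size in steps of 32."""
    return rng.choice(range(min_size, max_size + 1, 32))


batch = [
    [[0, 0, 0.50, 0.50, 0.20, 0.30]],                                  # image 0: one box
    [[0, 0, 0.25, 0.40, 0.10, 0.10], [0, 0, 0.70, 0.60, 0.30, 0.20]],  # image 1: two boxes
]
targets = collate_targets(batch)
print([row[0] for row in targets])    # [0, 1, 1]
print(list(range(320, 512 + 1, 32)))  # [320, 352, 384, 416, 448, 480, 512]
print(pick_img_size(320, 512))        # one of the sizes listed above
```

After merging, column 0 is the only record of which image each of the three rows came from, which is exactly what build_targets later reads as b.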

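2.2.3 Computing the loss

loss, outputs = model(imgs, targets) is where the loss is computed. The computation actually happens inside the yolo layers. This is easy to understand: the yolo layers are responsible for detection and must output each object's class, coordinates, and size, so the loss is computed there.

This can be seen in the forward pass of the Darknet class (during training targets has a value, i.e. it is not None):

The yolo layer itself is implemented in YOLOLayer; its forward function shows how the loss is computed. The relevant part of YOLOLayer is as follows: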

        if targets is None:
            return output, 0
        else:
            iou_scores, class_mask, obj_mask, noobj_mask, tx, ty, tw, th, tcls, tconf = build_targets(
                pred_boxes=pred_boxes,
                pred_cls=pred_cls,
                target=targets,
                anchors=self.scaled_anchors,
                ignore_thres=self.ignore_thres,
            )

            # Loss : Mask outputs to ignore non-existing objects (except with conf. loss)
            loss_x = self.mse_loss(x[obj_mask], tx[obj_mask])
            loss_y = self.mse_loss(y[obj_mask], ty[obj_mask])
            loss_w = self.mse_loss(w[obj_mask], tw[obj_mask])
            loss_h = self.mse_loss(h[obj_mask], th[obj_mask])
            loss_conf_obj = self.bce_loss(pred_conf[obj_mask], tconf[obj_mask])
            loss_conf_noobj = self.bce_loss(pred_conf[noobj_mask], tconf[noobj_mask])
            loss_conf = self.obj_scale * loss_conf_obj + self.noobj_scale * loss_conf_noobj
            loss_cls = self.bce_loss(pred_cls[obj_mask], tcls[obj_mask])
            total_loss = loss_x + loss_y + loss_w + loss_h + loss_conf + loss_cls

            # Metrics
            cls_acc = 100 * class_mask[obj_mask].mean()
            conf_obj = pred_conf[obj_mask].mean()
            conf_noobj = pred_conf[noobj_mask].mean()
            conf50 = (pred_conf > 0.5).float()
            iou50 = (iou_scores > 0.5).float()
            iou75 = (iou_scores > 0.75).float()
            detected_mask = conf50 * class_mask * tconf
            precision = torch.sum(iou50 * detected_mask) / (conf50.sum() + 1e-16)
            recall50 = torch.sum(iou50 * detected_mask) / (obj_mask.sum() + 1e-16)
            recall75 = torch.sum(iou75 * detected_mask) / (obj_mask.sum() + 1e-16)

            self.metrics = {
                "loss": to_cpu(total_loss).item(),
                "x": to_cpu(loss_x).item(),
                "y": to_cpu(loss_y).item(),
                "w": to_cpu(loss_w).item(),
                "h": to_cpu(loss_h).item(),
                "conf": to_cpu(loss_conf).item(),
                "cls": to_cpu(loss_cls).item(),
                "cls_acc": to_cpu(cls_acc).item(),
                "recall50": to_cpu(recall50).item(),
                "recall75": to_cpu(recall75).item(),
                "precision": to_cpu(precision).item(),
                "conf_obj": to_cpu(conf_obj).item(),
                "conf_noobj": to_cpu(conf_noobj).item(),
                "grid_size": grid_size,
            }

            return output, total_loss

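Here the batch size is 8, and we can see that the images were resized to [352, 352]. After 8x, 16x, and 32x downsampling, the corresponding feature-map shapes are [44, 44], [22, 22], and [11, 11].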
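
The grid sizes follow directly from integer division of the input size by each stride. A minimal sketch (grid_sizes is a hypothetical helper, and the strides 8, 16, 32 are the three downsampling factors named above):

```python
def grid_sizes(img_size, strides=(8, 16, 32)):
    """Side length of each yolo layer's feature map for a square input."""
    return [img_size // s for s in strides]


print(grid_sizes(352))  # [44, 22, 11]
print(grid_sizes(416))  # [52, 26, 13]
```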

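The build_targets function then returns iou_scores, class_mask, obj_mask, noobj_mask, tx, ty, tw, th, tcls, and tconf.

obj_mask marks the feature-map cells (and anchors) that an object falls into. It is initialized to 0, and the position corresponding to a cell is set to 1 when an object falls into that cell. Hence the code: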

obj_mask = ByteTensor(nB, nA, nG, nG).fill_(0)
........
obj_mask[b, best_n, gj, gi] = 1

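Similarly, noobj_mask marks the cells that no object falls into. It is initialized to 1, and the position is set to 0 when an object does fall into that cell. In addition, if an anchor's IoU with the target is large (greater than the threshold ignore_thres), that anchor at that cell is also treated as containing an object and set to 0. Hence the code: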

noobj_mask = ByteTensor(nB, nA, nG, nG).fill_(1)
.......   
noobj_mask[b, best_n, gj, gi] = 0    
# Set noobj mask to zero where iou exceeds ignore threshold    
for i, anchor_ious in enumerate(ious.t()):        
    noobj_mask[b[i], anchor_ious > ignore_thres, gj[i], gi[i]] = 0
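
The combined effect of the two masks at the target's cell can be summarized per anchor. The sketch below is an illustration in plain Python rather than the repository's indexing code; anchor_roles is a hypothetical helper, and the default ignore_thres=0.5 and the IoU values are assumed example numbers.

```python
def anchor_roles(anchor_ious, ignore_thres=0.5):
    """Classify each anchor at the target's cell for one ground-truth box.

    anchor_ious[a] is the shape-only IoU between anchor a and the target.
    """
    best = max(range(len(anchor_ious)), key=lambda a: anchor_ious[a])
    roles = []
    for a, iou in enumerate(anchor_ious):
        if a == best:
            roles.append("obj")      # obj_mask = 1, noobj_mask = 0
        elif iou > ignore_thres:
            roles.append("ignored")  # obj_mask = 0, noobj_mask = 0
        else:
            roles.append("noobj")    # obj_mask = 0, noobj_mask = 1
    return roles


print(anchor_roles([0.29, 0.93, 0.55]))  # ['noobj', 'obj', 'ignored']
```

An "ignored" anchor contributes to neither confidence term: it is not the best match, but it overlaps the target too well to be penalized as background.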

The code of build_targets is as follows:

def build_targets(pred_boxes, pred_cls, target, anchors, ignore_thres):

    ByteTensor = torch.cuda.ByteTensor if pred_boxes.is_cuda else torch.ByteTensor
    FloatTensor = torch.cuda.FloatTensor if pred_boxes.is_cuda else torch.FloatTensor

    nB = pred_boxes.size(0)
    nA = pred_boxes.size(1)
    nC = pred_cls.size(-1)
    nG = pred_boxes.size(2)

    # Output tensors
    obj_mask = ByteTensor(nB, nA, nG, nG).fill_(0)
    noobj_mask = ByteTensor(nB, nA, nG, nG).fill_(1)
    class_mask = FloatTensor(nB, nA, nG, nG).fill_(0)
    iou_scores = FloatTensor(nB, nA, nG, nG).fill_(0)
    tx = FloatTensor(nB, nA, nG, nG).fill_(0)
    ty = FloatTensor(nB, nA, nG, nG).fill_(0)
    tw = FloatTensor(nB, nA, nG, nG).fill_(0)
    th = FloatTensor(nB, nA, nG, nG).fill_(0)
    tcls = FloatTensor(nB, nA, nG, nG, nC).fill_(0)

    # Convert to position relative to box
    target_boxes = target[:, 2:6] * nG
    gxy = target_boxes[:, :2]
    gwh = target_boxes[:, 2:]
    # Get anchors with best iou
    ious = torch.stack([bbox_wh_iou(anchor, gwh) for anchor in anchors])
    best_ious, best_n = ious.max(0)
    # Separate target values
    b, target_labels = target[:, :2].long().t()
    gx, gy = gxy.t()
    gw, gh = gwh.t()
    gi, gj = gxy.long().t()
    # Set masks
    obj_mask[b, best_n, gj, gi] = 1
    noobj_mask[b, best_n, gj, gi] = 0

    # Set noobj mask to zero where iou exceeds ignore threshold
    for i, anchor_ious in enumerate(ious.t()):
        noobj_mask[b[i], anchor_ious > ignore_thres, gj[i], gi[i]] = 0

    # Coordinates
    tx[b, best_n, gj, gi] = gx - gx.floor()
    ty[b, best_n, gj, gi] = gy - gy.floor()
    # Width and height
    tw[b, best_n, gj, gi] = torch.log(gw / anchors[best_n][:, 0] + 1e-16)
    th[b, best_n, gj, gi] = torch.log(gh / anchors[best_n][:, 1] + 1e-16)
    # One-hot encoding of label
    tcls[b, best_n, gj, gi, target_labels] = 1
    # Compute label correctness and iou at best anchor
    class_mask[b, best_n, gj, gi] = (pred_cls[b, best_n, gj, gi].argmax(-1) == target_labels).float()
    iou_scores[b, best_n, gj, gi] = bbox_iou(pred_boxes[b, best_n, gj, gi], target_boxes, x1y1x2y2=False)

    tconf = obj_mask.float()
    return iou_scores, class_mask, obj_mask, noobj_mask, tx, ty, tw, th, tcls, tconf

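With the figure below this is easy to understand. nB: the batch size. nA: the number of anchors. nC: the number of classes being trained; here I train only one class, so it is 1. nG: the grid size, i.e. how many cells each row (or column) is divided into. For the 3-class case, see the picture in the lower-right corner (added later, when I used a different dataset).

The coordinate information is then extracted from targets into the tensors gxy and gwh. It is multiplied by nG because the coordinates are normalized to 0~1 and must be scaled up to the feature-map grid.

The next step is to compute IoU values against the anchors.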

    # Get anchors with best iou
    ious = torch.stack([bbox_wh_iou(anchor, gwh) for anchor in anchors])
    best_ious, best_n = ious.max(0)

This is done by the function bbox_wh_iou:

def bbox_wh_iou(wh1, wh2):
    wh2 = wh2.t()
    w1, h1 = wh1[0], wh1[1]
    w2, h2 = wh2[0], wh2[1]
    inter_area = torch.min(w1, w2) * torch.min(h1, h2)
    union_area = (w1 * h1 + 1e-16) + w2 * h2 - inter_area
    return inter_area / union_area

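The result is as follows, again with a batch size of 8. ious.shape is [3, 8] because there are three anchors and each anchor's IoU is computed against every labeled box, to see which anchor is closest to the ground truth (the real, annotated bounding box). Note: the 8 in [3, 8] is not the batch size but the number of targets. Here every image happens to have exactly one target, hence 8, but an image often contains several targets.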
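
For a single target, the anchor selection reduces to a few lines. This is a scalar sketch of the same formula as bbox_wh_iou, with hypothetical names (wh_iou, gwh) and example anchors that are assumed to be already divided by the stride:

```python
def wh_iou(anchor, box):
    """IoU of two boxes that share the same center; only (w, h) matter."""
    (w1, h1), (w2, h2) = anchor, box
    inter = min(w1, w2) * min(h1, h2)
    union = w1 * h1 + 1e-16 + w2 * h2 - inter
    return inter / union


anchors = [(1.25, 1.625), (2.0, 3.75), (4.125, 2.875)]  # assumed, in grid units
gwh = (2.0, 3.5)                                        # target width/height in grid units
ious = [wh_iou(a, gwh) for a in anchors]
best_n = max(range(len(anchors)), key=lambda n: ious[n])
print([round(v, 3) for v in ious], best_n)
```

Here the second anchor wins (best_n is 1) because its shape is nearly identical to the target's; position plays no part in this comparison.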

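gxy.t() changes the shape from n x 2 to 2 x n. gi, gj = gxy.long().t() drops the fractional part with .long(), keeping the integer part. With these the masks can be set. b is the index, within the batch, of the image each target belongs to, and gi, gj are the coordinates of the top-left corner of the feature-map cell that contains the target's center.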

    # Set masks
    obj_mask[b, best_n, gj, gi] = 1
    noobj_mask[b, best_n, gj, gi] = 0

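Next comes coordinate prediction; see https://blog.csdn.net/qq_34199326/article/details/84109828 and the YOLOv3 coordinate prediction figure.

gx is the exact x coordinate and gx.floor() rounds it down; their difference is the offset.

tx, ty, tw, th (and e^tw, e^th) in the figure above are computed as follows. gx and gy are the coordinates of the box center, and gx.floor() and gy.floor() are the coordinates of the cell (the grid corner up and to the left of the center). The core idea of YOLO is that the cell in which an object's center falls is the one that predicts the object. Some may ask why no sigmoid is applied here. If you look closely at the forward pass of YOLOLayer (forward()), the sigmoid has already been applied before build_targets is called; the same holds for the confidence and the class probabilities. The code is as follows.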

        x = torch.sigmoid(prediction[..., 0])  # Center x        
        y = torch.sigmoid(prediction[..., 1])  # Center y
        w = prediction[..., 2]  # Width
        h = prediction[..., 3]  # Height
        pred_conf = torch.sigmoid(prediction[..., 4])  # Conf
        pred_cls = torch.sigmoid(prediction[..., 5:])  # Cls pred.
        if grid_size != self.grid_size:
            print("··················different··················")
            self.compute_grid_offsets(grid_size, cuda=x.is_cuda)

        # Add offset and scale with anchors
        pred_boxes = FloatTensor(prediction[..., :4].shape)
        pred_boxes[..., 0] = x.data + self.grid_x
        pred_boxes[..., 1] = y.data + self.grid_y
        pred_boxes[..., 2] = torch.exp(w.data) * self.anchor_w
        pred_boxes[..., 3] = torch.exp(h.data) * self.anchor_h

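So why are gw and gh divided by the anchor's w and h? To quote a few passages from the blog post above:

“The labels of YOLOv3 training data are normalized by the original image size. This is done so that large boxes do not have more influence than small ones; after normalization, large and small boxes are treated equally, and training converges more easily.”

“In YOLOv3, the top-left corner of the grid cell, Cx, Cy, is subtracted from Gx, Gy. The x, y offsets are not computed relative to the anchor box, so there is no need to divide by Pw, Ph.

That is, tx = Gx - Cx

ty = Gy - Cy

This directly gives the offset of the bbox center from the top-left corner of the grid cell.

The formulas for tw and th are the same in YOLOv3 and the Faster R-CNN family: they are based on the ratio between the object's box size and the anchor box size. Neither Faster R-CNN nor YOLO regresses the bounding-box width and height directly; instead the scale is moved into log space, to avoid unstable gradients during training. Without the transform, predicting the relative deformation tw directly would require tw > 0, because a box cannot have negative width or height. That would be an optimization problem with an inequality constraint, which SGD cannot handle directly. Taking the logarithm first removes the inequality constraint.

Here an important question arises: one scale's feature map has three anchors, so for a given ground-truth box, which anchor is matched to it? As in YOLOv1, if the center of a ground truth in a training image falls inside some cell, the 3 anchor boxes of that cell are responsible for predicting it. Which of them actually predicts it is decided during training: the anchor box with the largest IoU with the ground truth predicts it, and the remaining 2 anchor boxes are not matched to that ground truth. YOLOv3 has to assume that each cell contains at most one ground truth, and in practice more than one almost never occurs. The anchor box matched to the ground truth contributes coordinate error, confidence error (with target 1), and classification error, while the other anchor boxes contribute only confidence error (with target 0).

Only with the translation (tx, ty) and the scaling (tw, th) can the anchor box be fine-tuned to coincide with the ground truth. As in the figure, the red box is the anchor box and the green box is the ground truth; translation plus scaling moves the solid red box to the dashed red box and then scales it to the green box. The simplest idea for box regression is exactly this fine-tuning by translation plus scaling.”

This also explains why anchors[best_n] is used for the scaling: the best anchor is needed to match the predicted bounding boxes. "Best" means the largest IoU, i.e. the closest shape. Note that this IoU ignores position and considers only the w and h of the anchor and of the target, as the IoU code above shows.

The training here is quite clever: bw and bh are not trained directly; tw and th are. Note how the code is written: in build_targets, gw and gh are the width w and height h of the annotated ground truth (target) at the scale of the current feature map.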

    # Convert to position relative to box
    target_boxes = target[:, 2:6] * nG
    gxy = target_boxes[:, :2]
    gwh = target_boxes[:, 2:]

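gw and gh are then scaled into tw and th. Note the arguments anchors[best_n][:, 0] and anchors[best_n][:, 1] in the code below: they are the anchors' w and h expressed at the scale of the current feature map, because the anchors passed to this function are self.scaled_anchors. The relevant code:

self.scaled_anchors = FloatTensor([(a_w / self.stride, a_h / self.stride) for a_w, a_h in self.anchors])

So tw and th are the natural logarithms of the ratios between the annotated ground-truth (target) w and h at this feature-map scale and the w and h of the anchor used for detection at the same scale.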

    # Coordinates
    tx[b, best_n, gj, gi] = gx - gx.floor()
    ty[b, best_n, gj, gi] = gy - gy.floor()
    # Width and height
    tw[b, best_n, gj, gi] = torch.log(gw / anchors[best_n][:, 0] + 1e-16)
    th[b, best_n, gj, gi] = torch.log(gh / anchors[best_n][:, 1] + 1e-16)
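
To see that this encoding loses nothing, it helps to encode one box and decode it again. The following is a scalar sketch under stated assumptions, not the repository's code: encode and decode are hypothetical helpers, the grid size 11 and the anchor (2.0, 3.75) are example values in grid units, and decode takes tx and ty as already-squashed offsets in [0, 1), the role the sigmoid outputs x and y play in the network.

```python
import math


def encode(gx, gy, gw, gh, anchor_w, anchor_h):
    """Ground-truth box (grid units) -> regression targets, as in build_targets."""
    gi, gj = int(gx), int(gy)  # column/row of the responsible cell
    tx, ty = gx - math.floor(gx), gy - math.floor(gy)
    tw = math.log(gw / anchor_w + 1e-16)
    th = math.log(gh / anchor_h + 1e-16)
    return (gi, gj), (tx, ty, tw, th)


def decode(cell, t, anchor_w, anchor_h):
    """Inverse mapping: cell offset plus anchor scaling gives the box back."""
    (gi, gj), (tx, ty, tw, th) = cell, t
    return gi + tx, gj + ty, math.exp(tw) * anchor_w, math.exp(th) * anchor_h


n_g = 11  # assumed grid size (352 / 32)
gx, gy, gw, gh = 0.48 * n_g, 0.71 * n_g, 0.18 * n_g, 0.32 * n_g
cell, t = encode(gx, gy, gw, gh, anchor_w=2.0, anchor_h=3.75)
print(cell, [round(v, 4) for v in t])
print([round(v, 4) for v in decode(cell, t, 2.0, 3.75)])
```

The decoded values match gx, gy, gw, gh up to floating-point error, which is why regressing tw and th is equivalent to regressing the box size itself.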

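To make this easier to follow in one go, let us jump ahead a little to how the w and h losses are computed:

            loss_w = self.mse_loss(w[obj_mask], tw[obj_mask])
            loss_h = self.mse_loss(h[obj_mask], th[obj_mask])

We now know how tw and th are obtained, so let us see how w and h are used: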

        pred_boxes[..., 2] = torch.exp(w.data) * self.anchor_w
        pred_boxes[..., 3] = torch.exp(h.data) * self.anchor_h

Do self.anchor_w and self.anchor_h look familiar? They are simply the two columns of self.scaled_anchors.

        self.anchor_w = self.scaled_anchors[:, 0:1].view((1, self.num_anchors, 1, 1))
        self.anchor_h = self.scaled_anchors[:, 1:2].view((1, self.num_anchors, 1, 1))

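Reference formulas:

$$b_x = \sigma(t_x) + c_x, \quad b_y = \sigma(t_y) + c_y, \quad b_w = p_w e^{t_w}, \quad b_h = p_h e^{t_h}$$

Here the ground-truth gw and gh can be regarded as the true values and bw and bh as the predicted values. During training, however, YOLOv3 does not regress the bounding box directly; it regresses w against tw and h against th and computes the loss on those. Once tw and th are known, bw and bh follow, because:

$$b_w = p_w e^{t_w} \iff t_w = \ln \frac{b_w}{p_w}$$

The same holds for th.

The following line states that, for the image with index b, the anchor with index best_n is responsible for predicting an object of class target_labels. Inspecting the values of b and target_labels makes this easier to understand.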

    # One-hot encoding of label
    tcls[b, best_n, gj, gi, target_labels] = 1

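Next, class_mask and iou_scores are computed and returned. See the figure below for the class_mask computation. b holds the image ID of each target, as explained above; here b has length 20, meaning there are 20 targets. Each target has a corresponding entry in target_labels, its class label. Three classes are used here, so target_labels takes values 0~2, which the shape of pred_cls also confirms. .argmax(-1) returns the index of the maximum along the last dimension. Note that pred_cls[b, best_n, gj, gi].shape is [20, 3], unlike the original pred_cls.shape of [8, 3, 12, 12, 3]. The values of pred_cls[b, best_n, gj, gi] are shown in the figure below. Put a little more abstractly, [b, best_n, gj, gi] are indices, and pred_cls[b, best_n, gj, gi] is the stack of the tensors found at those indices. If pred_cls[b, best_n, gj, gi].argmax(-1) equals target_labels, the corresponding position of class_mask is set to 1, meaning that the cell in row gj, column gi of the feature map predicted the correct class.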

    # Compute label correctness and iou at best anchor
    class_mask[b, best_n, gj, gi] = (pred_cls[b, best_n, gj, gi].argmax(-1) == target_labels).float()
    iou_scores[b, best_n, gj, gi] = bbox_iou(pred_boxes[b, best_n, gj, gi], target_boxes, x1y1x2y2=False)
    tconf = obj_mask.float()
    return iou_scores, class_mask, obj_mask, noobj_mask, tx, ty, tw, th, tcls, tconf
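
The class_mask comparison is an arg-max over each target's class scores followed by an equality test. A small list-based sketch (class_correct is a hypothetical helper, and the scores and labels are made-up example values for three targets and three classes):

```python
def class_correct(pred_cls_rows, target_labels):
    """1.0 where the arg-max class score matches the label, else 0.0."""
    flags = []
    for scores, label in zip(pred_cls_rows, target_labels):
        predicted = max(range(len(scores)), key=lambda c: scores[c])
        flags.append(1.0 if predicted == label else 0.0)
    return flags


pred_rows = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1], [0.2, 0.2, 0.9]]  # one row per target
labels = [1, 2, 2]
print(class_correct(pred_rows, labels))  # [1.0, 0.0, 1.0]
```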

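Next the IoU is computed with the bbox_iou function, and the result is stored in iou_scores. Recall that the IoU above considered only w and h; here w, h, and the coordinates x, y all have to be considered.

The reasons: 1. The IoU above measures how well the shapes of an anchor and a target match, so position is irrelevant: the goal there is only to find the anchor whose shape is closest to the ground truth (target) and use it for prediction (detection). Because that IoU is high, the anchor can then be fine-tuned by translation and scaling, i.e. bounding-box regression.

2. Here the IoU is a score for the prediction, so the coordinates of the predicted box and the ground-truth box must be taken into account.

The complete code is as follows: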

def bbox_iou(box1, box2, x1y1x2y2=True):
    """
    Returns the IoU of two bounding boxes
    """
    if not x1y1x2y2:
        # Transform from center and width to exact coordinates
        b1_x1, b1_x2 = box1[:, 0] - box1[:, 2] / 2, box1[:, 0] + box1[:, 2] / 2
        b1_y1, b1_y2 = box1[:, 1] - box1[:, 3] / 2, box1[:, 1] + box1[:, 3] / 2
        b2_x1, b2_x2 = box2[:, 0] - box2[:, 2] / 2, box2[:, 0] + box2[:, 2] / 2
        b2_y1, b2_y2 = box2[:, 1] - box2[:, 3] / 2, box2[:, 1] + box2[:, 3] / 2
    else:
        # Get the coordinates of bounding boxes
        b1_x1, b1_y1, b1_x2, b1_y2 = box1[:, 0], box1[:, 1], box1[:, 2], box1[:, 3]
        b2_x1, b2_y1, b2_x2, b2_y2 = box2[:, 0], box2[:, 1], box2[:, 2], box2[:, 3]

    # Get the coordinates of the intersection rectangle
    inter_rect_x1 = torch.max(b1_x1, b2_x1)
    inter_rect_y1 = torch.max(b1_y1, b2_y1)
    inter_rect_x2 = torch.min(b1_x2, b2_x2)
    inter_rect_y2 = torch.min(b1_y2, b2_y2)
    # Intersection area
    # torch.clamp(input, min, max, out=None) -> Tensor
    # clamps every element of input to the range [min, max] and returns the result as a new tensor.
    inter_area = torch.clamp(inter_rect_x2 - inter_rect_x1 + 1, min=0) * torch.clamp(
        inter_rect_y2 - inter_rect_y1 + 1, min=0
    )
    
    # Union Area
    b1_area = (b1_x2 - b1_x1 + 1) * (b1_y2 - b1_y1 + 1)
    b2_area = (b2_x2 - b2_x1 + 1) * (b2_y2 - b2_y1 + 1)

    iou = inter_area / (b1_area + b2_area - inter_area + 1e-16)

    return iou
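
The same computation for one pair of center-format boxes can be written with plain floats. This is a sketch, not the repository's function: iou_xywh is a hypothetical name, and the offset parameter is added here only to make the effect of the "+ 1" terms visible (offset=1.0 reproduces them, offset=0.0 gives the plain geometric IoU).

```python
def iou_xywh(box1, box2, offset=1.0):
    """IoU of two (cx, cy, w, h) boxes.

    offset=1.0 reproduces the "+ 1" terms in bbox_iou above;
    offset=0.0 gives the plain geometric IoU.
    """
    def corners(b):
        cx, cy, w, h = b
        return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

    a_x1, a_y1, a_x2, a_y2 = corners(box1)
    b_x1, b_y1, b_x2, b_y2 = corners(box2)
    inter_w = max(min(a_x2, b_x2) - max(a_x1, b_x1) + offset, 0.0)
    inter_h = max(min(a_y2, b_y2) - max(a_y1, b_y1) + offset, 0.0)
    inter = inter_w * inter_h
    area_a = (a_x2 - a_x1 + offset) * (a_y2 - a_y1 + offset)
    area_b = (b_x2 - b_x1 + offset) * (b_y2 - b_y1 + offset)
    return inter / (area_a + area_b - inter + 1e-16)


print(iou_xywh((5, 5, 4, 4), (6, 6, 4, 4)))              # 16 / 34 with the +1 terms
print(iou_xywh((5, 5, 4, 4), (6, 6, 4, 4), offset=0.0))  # 9 / 23 without them
```

Because the boxes here are measured in grid cells rather than pixels, the "+ 1" terms noticeably change the score for small boxes, which is worth keeping in mind when reading the recall and precision values logged during training.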

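Next comes the loss computation. The first part is the bounding-box loss, covering x, y, w, and h. The second part is the confidence loss. The third part is the class loss. From the code below we can write out the YOLOv3 loss function (please credit the source when reposting).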

            loss_x = self.mse_loss(x[obj_mask], tx[obj_mask])
            loss_y = self.mse_loss(y[obj_mask], ty[obj_mask])
            loss_w = self.mse_loss(w[obj_mask], tw[obj_mask])
            loss_h = self.mse_loss(h[obj_mask], th[obj_mask])
            loss_conf_obj = self.bce_loss(pred_conf[obj_mask], tconf[obj_mask])
            loss_conf_noobj = self.bce_loss(pred_conf[noobj_mask], tconf[noobj_mask])
            loss_conf = self.obj_scale * loss_conf_obj + self.noobj_scale * loss_conf_noobj
            loss_cls = self.bce_loss(pred_cls[obj_mask], tcls[obj_mask])
            total_loss = loss_x + loss_y + loss_w + loss_h + loss_conf + loss_cls

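Written term by term from the code above, and assuming mse_loss and bce_loss use PyTorch's default mean reduction, the loss of one yolo layer is:

$$
\begin{aligned}
Loss ={} & \frac{1}{N_{obj}} \sum_{i}^{batch} \sum_{j}^{anchor} \sum_{k}^{grid} 1_{ijk}^{obj} \left[ (x_{ijk} - t_{x,ijk})^2 + (y_{ijk} - t_{y,ijk})^2 + (w_{ijk} - t_{w,ijk})^2 + (h_{ijk} - t_{h,ijk})^2 \right] \\
& + \lambda_{obj} \frac{1}{N_{obj}} \sum_{i}^{batch} \sum_{j}^{anchor} \sum_{k}^{grid} 1_{ijk}^{obj} \, \mathrm{BCE}(C_{ijk}, \hat{C}_{ijk}) + \lambda_{noobj} \frac{1}{N_{noobj}} \sum_{i}^{batch} \sum_{j}^{anchor} \sum_{k}^{grid} 1_{ijk}^{noobj} \, \mathrm{BCE}(C_{ijk}, \hat{C}_{ijk}) \\
& + \frac{1}{N_{obj} \, n_C} \sum_{i}^{batch} \sum_{j}^{anchor} \sum_{k}^{grid} 1_{ijk}^{obj} \sum_{c}^{classes} \mathrm{BCE}(p_{ijk}(c), \hat{p}_{ijk}(c))
\end{aligned}
$$

where $\mathrm{BCE}(p, \hat{p}) = -[\hat{p} \ln p + (1 - \hat{p}) \ln(1 - p)]$, $C$ is pred_conf, $\hat{C}$ is tconf, $p(c)$ is pred_cls, $\hat{p}(c)$ is tcls, $n_C$ is the number of classes, and $N_{obj}$ and $N_{noobj}$ are the numbers of positions selected by obj_mask and noobj_mask.

(Note that this is the loss of a single yolo layer. YOLOv3 uses 3 yolo layers, so the final loss is the sum of the three layers' losses. Likewise, yolov3-tiny uses only 2 yolo layers, and its final loss is the sum of the two.) In the formula, batch is the batch size, anchor refers to the boxes used for prediction (each yolo layer has 3 anchors), and grid is the size of the feature map. $1_{ijk}^{obj}$ indicates that, for the i-th sample in the batch, the j-th anchor at the k-th cell of the feature map is responsible for an object. $\lambda_{obj}$ and $\lambda_{noobj}$ are the penalty factors, self.obj_scale and self.noobj_scale in the code. For the concept of confidence prediction, see https://zhuanlan.zhihu.com/p/37850811. The figure below helps with understanding: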
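
The structure of the per-layer loss can be reproduced on a handful of numbers. The sketch below is an illustration in plain Python, not the repository's code: mse and bce are hand-written stand-ins that average over their inputs (assuming the mean reduction that PyTorch's MSELoss and BCELoss use by default), yolo_layer_loss is a hypothetical helper, the scale factors 1.0 and 100.0 are assumed example values, and all predictions are made up.

```python
import math


def mse(pred, target):
    """Mean squared error over paired values (mean reduction)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def bce(pred, target):
    """Binary cross-entropy over paired values (mean reduction); pred must lie in (0, 1)."""
    return -sum(
        t * math.log(p) + (1 - t) * math.log(1 - p) for p, t in zip(pred, target)
    ) / len(pred)


def yolo_layer_loss(obj, noobj, obj_scale=1.0, noobj_scale=100.0):
    """obj / noobj hold the values already selected by obj_mask / noobj_mask."""
    box = sum(mse(obj[k], obj["t" + k]) for k in ("x", "y", "w", "h"))
    conf_obj = bce(obj["conf"], [1.0] * len(obj["conf"]))        # tconf is 1 where obj_mask is set
    conf_noobj = bce(noobj["conf"], [0.0] * len(noobj["conf"]))  # tconf is 0 where noobj_mask is set
    cls = bce(obj["cls"], obj["tcls"])
    return box + obj_scale * conf_obj + noobj_scale * conf_noobj + cls


# One responsible anchor (three classes) and three background anchors.
obj = {
    "x": [0.4], "tx": [0.3], "y": [0.8], "ty": [0.8],
    "w": [0.1], "tw": [0.0], "h": [-0.1], "th": [-0.06],
    "conf": [0.7],
    "cls": [0.9, 0.2, 0.1], "tcls": [1.0, 0.0, 0.0],
}
noobj = {"conf": [0.01, 0.02, 0.05]}
print(yolo_layer_loss(obj, noobj))
```

With these example numbers the no-object confidence term dominates the total, which shows how strongly the assumed noobj_scale penalizes even small confidences on background anchors.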

The next part computes the various metrics:

            # Metrics
            cls_acc = 100 * class_mask[obj_mask].mean()
            conf_obj = pred_conf[obj_mask].mean()
            conf_noobj = pred_conf[noobj_mask].mean()
            conf50 = (pred_conf > 0.5).float()
            iou50 = (iou_scores > 0.5).float()
            iou75 = (iou_scores > 0.75).float()
            detected_mask = conf50 * class_mask * tconf
            precision = torch.sum(iou50 * detected_mask) / (conf50.sum() + 1e-16)
            recall50 = torch.sum(iou50 * detected_mask) / (obj_mask.sum() + 1e-16)
            recall75 = torch.sum(iou75 * detected_mask) / (obj_mask.sum() + 1e-16)
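
Precision and recall here are simple ratios over the flattened anchor positions. A list-based sketch (detection_metrics is a hypothetical helper and the five example positions are made up; it uses the fact that tconf equals obj_mask as floats, so sum(tconf) stands in for obj_mask.sum()):

```python
def detection_metrics(conf, iou, class_ok, tconf):
    """Precision and recall at IoU 0.5 over flattened per-anchor values."""
    conf50 = [1.0 if c > 0.5 else 0.0 for c in conf]
    iou50 = [1.0 if v > 0.5 else 0.0 for v in iou]
    # Confident, correctly classified, and actually responsible for an object.
    detected = [c * k * t for c, k, t in zip(conf50, class_ok, tconf)]
    hits = sum(i * d for i, d in zip(iou50, detected))
    precision = hits / (sum(conf50) + 1e-16)
    recall50 = hits / (sum(tconf) + 1e-16)
    return precision, recall50


conf = [0.9, 0.8, 0.3, 0.6, 0.1]      # predicted confidence
tconf = [1.0, 1.0, 1.0, 0.0, 0.0]     # 1 where an object is assigned
class_ok = [1.0, 0.0, 1.0, 0.0, 0.0]  # class_mask
iou = [0.7, 0.6, 0.8, 0.0, 0.0]       # iou_scores
print(detection_metrics(conf, iou, class_ok, tconf))
```

Only the first position counts as a hit: the second has the wrong class and the third is not confident enough, so both precision (1 of 3 confident predictions) and recall (1 of 3 objects) come out at about one third.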

Once the loss has been computed, backpropagation and the gradient update follow.

            loss, outputs = model(imgs, targets)          
            
            loss.backward()

            if batches_done % opt.gradient_accumulations:
                # Accumulates gradient before each step
                optimizer.step()
                optimizer.zero_grad()

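One puzzling point is that the parameters are not updated on every batch but once every two batches. This is because gradient_accumulations defaults to 2, and batches_done % 2 is nonzero only for odd values of batches_done. Note that with the condition as written, setting gradient_accumulations to 1 does not produce an update on every batch: batches_done % 1 is always 0, so optimizer.step() would never run.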
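
Which batches trigger an update can be checked by simulating the condition alone. In the sketch below, step_batches is a hypothetical helper; fixed=False evaluates the condition exactly as it appears in the loop above, while fixed=True uses (batches_done + 1) % accumulations == 0, which is offered only as one possible alternative and is not taken from the source.

```python
def step_batches(num_batches, accumulations, fixed=False):
    """Return the batch indices at which optimizer.step() would run."""
    steps = []
    for batches_done in range(num_batches):
        if fixed:
            # Hypothetical alternative: step after every `accumulations` batches.
            do_step = (batches_done + 1) % accumulations == 0
        else:
            # Condition as written in the training loop above.
            do_step = bool(batches_done % accumulations)
        if do_step:
            steps.append(batches_done)
    return steps


print(step_batches(6, 2))              # [1, 3, 5]
print(step_batches(6, 1))              # []
print(step_batches(6, 3))              # [1, 2, 4, 5]
print(step_batches(6, 3, fixed=True))  # [2, 5]
print(step_batches(6, 1, fixed=True))  # [0, 1, 2, 3, 4, 5]
```

With the original condition, a value of 2 behaves as intended, a value of 1 never steps, and a value of 3 steps on two batches out of every three instead of one.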

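This article is already long, so for now it only organizes the loss computation in the order in which the program executes it. The small remainder and a summary are left for Part 3.

PyTorch | YOLOv3 Principles and Code Walkthrough (Part 3)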

https://blog.csdn.net/qq_24739717/article/details/96966371