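PyTorch model building (Part 2): the datasets component, data augmentation and other data processing

Original link: https://blog.csdn.net/weixin_39263657/article/details/121658068

I. Introduction

1. Purpose and categories of image data augmentation:
Purpose:

Increase the amount of training data
Enrich its diversity
Improve the model's generalization (make the training data distribution match the real-world distribution as closely as possible).

Principles:

Keep labels consistent (transform the labels along with the image)
Fit the business scenario (for example, do not use flips for road-sign arrow detection)
Do not introduce irrelevant data.

Methods:

Single-sample:
Mainly flipping, rotation, distortion, affine transformation, scaling, compression, random crop, random padding, HSV transformation, noise, blur, random erasing of image (or feature map) regions such as Cutout, style transfer (GAN), GAN-based generation, and so on.
Multi-sample:
Mainly Mosaic, MixUp, CutMix, and so on.

2. Common image augmentation methods covered here (more will be added as they come up):
Mosaic, MixUp, Cutout, CutMix, random_perspective (random rotation, scaling, translation, shear, and perspective transformation), augment_hsv, and flipping.

3. Rectangular training and inference:

Purpose: reduce redundant information, reduce the number of meaningless boxes the network produces, and speed up training and inference.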

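II. Building the data augmentations

1. Mosaic

Principle: randomly crop four images and stitch them into a single image for training.
Purpose: enriches image backgrounds and effectively increases the batch size (a single GPU may only afford a small batch_size, which hurts training; with mosaic a single GPU effectively trains on a larger batch), and BN statistics are computed over the content of four images at once.
Steps:
1. Randomly pick the center (xc, yc) of the stitched image. The code samples each coordinate from [-b, 2 * img_size + b], with b taken from self.mosaic_border; assuming mosaic_border = (-img_size // 2, -img_size // 2) as in YOLOv5, this is the range [0.5 * img_size, 1.5 * img_size];
2. Take the current image, randomly choose three more from the dataset, and resize each to the model input size img_size;
3. Use NumPy to create a new image of shape (2 * img_size, 2 * img_size, channel), filled with 114;
4. With (xc, yc) as the center, compute where each of the four source images lands in the new image;
5. Compute the region to crop from each source image (the cropped region is what gets pasted into the new image);
6. Paste the four cropped regions into their positions in the new (mosaic) image;
7. Compute the offset between each source image and the mosaic image along w and h, which is needed later to map the labels;
8. Apply further random augmentation (flip, scale, color-space changes); in the code below this step is the random_perspective call.
Code: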

def load_mosaic(self, index):
    """Used in __getitem__ of LoadImagesAndLabels to apply mosaic augmentation.
    Stitches four images into one mosaic image  (loads images in a 4-mosaic)
    :param index: index of the image to load
    :return: img4: the mosaic image after random perspective transform  numpy(640, 640, 3)
             labels4: targets of img4  [M, cls+x1y1x2y2]
    """
    # labels4: box labels of the stitched image; segments4: its polygon labels
    labels4, segments4 = [], []
    s = self.img_size  # network input size, e.g. 640
    # 1. Randomly pick the mosaic center; each coordinate is sampled from [-x, 2*s + x]
    #    with x taken from self.mosaic_border (x is negative, so the range shrinks toward the middle)
    yc, xc = [int(random.uniform(-x, 2 * s + x)) for x in self.mosaic_border]  # mosaic center x, y

    # 2. Randomly choose the indices of three additional images from the dataset
    indices = [index] + random.choices(self.indices, k=3)

    # Loop over the four images (they may differ in size) and stitch them
    for i, index in enumerate(indices):
        # load image: fetch one image, already resized by load_image; (h, w) is its resized size
        img, _, (h, w) = load_image(self, index)

        # place img in img4
        if i == 0:  # top left  e.g. original [375, 500, 3], after load_image [552, 736, 3] (HWC)
            # 3. Create the mosaic canvas, shape=(2*img_size, 2*img_size, num_channel), filled with 114
            img4 = np.full((s * 2, s * 2, img.shape[2]), 114, dtype=np.uint8)  # base image with 4 tiles
            # 4. Destination region in the mosaic: (x1a, y1a) top-left, (x2a, y2a) bottom-right  (e.g. w=736, h=552)
            x1a, y1a, x2a, y2a = max(xc - w, 0), max(yc - h, 0), xc, yc  # xmin, ymin, xmax, ymax (large image)
            # 5. Source region in the image: its bottom-right corner sits at (xc, yc); parts outside the canvas are dropped
            x1b, y1b, x2b, y2b = w - (x2a - x1a), h - (y2a - y1a), w, h  # xmin, ymin, xmax, ymax (small image)
        elif i == 1:  # top right
            # Destination region in the mosaic
            x1a, y1a, x2a, y2a = xc, max(yc - h, 0), min(xc + w, s * 2), yc
            # Source region: the image's bottom-left corner sits at (xc, yc)
            x1b, y1b, x2b, y2b = 0, h - (y2a - y1a), min(w, x2a - x1a), h
        elif i == 2:  # bottom left
            # Destination region in the mosaic
            x1a, y1a, x2a, y2a = max(xc - w, 0), yc, xc, min(s * 2, yc + h)
            # Source region: the image's top-right corner sits at (xc, yc)
            x1b, y1b, x2b, y2b = w - (x2a - x1a), 0, w, min(y2a - y1a, h)
        elif i == 3:  # bottom right
            # Destination region in the mosaic
            x1a, y1a, x2a, y2a = xc, yc, min(xc + w, s * 2), min(s * 2, yc + h)
            # Source region: the image's top-left corner sits at (xc, yc)
            x1b, y1b, x2b, y2b = 0, 0, min(w, x2a - x1a), min(y2a - y1a, h)

        # 6. Paste the source region [(x1b, y1b), (x2b, y2b)] of img into the
        #    destination region [(x1a, y1a), (x2a, y2a)] of the mosaic   img4[h, w, c]
        img4[y1a:y2a, x1a:x2a] = img[y1b:y2b, x1b:x2b]  # img4[ymin:ymax, xmin:xmax]
        # 7. Offset between this image's origin and the mosaic origin (negative when the image
        #    is cropped at the canvas border); used below to map the labels
        padw = x1a - x1b   # offset along w
        padh = y1a - y1b   # offset along h

        # labels: box labels of this image (polygon labels, if any, have been converted to boxes)
        # segments: polygon labels of this image
        labels, segments = self.labels[index].copy(), self.segments[index].copy()
        if labels.size:
            if self.hyp['edges'] == 4:
                # scale normalized x and y coordinates to pixels and shift by the offset
                labels[:, 1::2] = labels[:, 1::2] * w + padw
                labels[:, 2::2] = labels[:, 2::2] * h + padh
                # segments = [xyn2xy(x, w, h, padw, padh) for x in segments]
            else:
                # normalized xywh to pixel xyxy format
                labels[:, 1:] = xywhn2xyxy(labels[:, 1:], w, h, padw, padh)
                segments = [xyn2xy(x, w, h, padw, padh) for x in segments]

        labels4.append(labels)      # update labels4
        segments4.extend(segments)  # update segments4

    # Concat/clip labels4, e.g. shapes [(2, 5), (1, 5), (3, 5), (1, 5)] => (7, 5)
    labels4 = np.concatenate(labels4, 0)
    # Clip in place (out=x): every coordinate in labels4[:, 1:] must lie in [0, 2*s]
    for x in (labels4[:, 1:], *segments4):
        np.clip(x, 0, 2 * s, out=x)  # clip when using random_perspective()

    # 8. Augment the stitched mosaic image
    img4, labels4 = random_perspective(img4, labels4, segments4,
                                       degrees=self.hyp['degrees'],
                                       translate=self.hyp['translate'],
                                       scale=self.hyp['scale'],
                                       shear=self.hyp['shear'],
                                       perspective=self.hyp['perspective'],
                                       border=self.mosaic_border)  # border to remove

    return img4, labels4
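
The placement arithmetic in steps 4 to 7 can be checked on a toy canvas. The following is a minimal sketch, not the original loader: the helper name `mosaic_slots` is hypothetical, and it assumes all four tiles share the same size (h, w), which the real code does not require.

```python
import numpy as np

def mosaic_slots(xc, yc, w, h, s):
    """Return (dst, src) xyxy regions for the four tiles of a 2s x 2s canvas.

    Hypothetical helper: assumes every tile has the same size (h, w).
    """
    slots = []
    # top-left tile: its bottom-right corner sits at (xc, yc)
    x1a, y1a, x2a, y2a = max(xc - w, 0), max(yc - h, 0), xc, yc
    slots.append(((x1a, y1a, x2a, y2a), (w - (x2a - x1a), h - (y2a - y1a), w, h)))
    # top-right tile: its bottom-left corner sits at (xc, yc)
    x1a, y1a, x2a, y2a = xc, max(yc - h, 0), min(xc + w, 2 * s), yc
    slots.append(((x1a, y1a, x2a, y2a), (0, h - (y2a - y1a), min(w, x2a - x1a), h)))
    # bottom-left tile: its top-right corner sits at (xc, yc)
    x1a, y1a, x2a, y2a = max(xc - w, 0), yc, xc, min(2 * s, yc + h)
    slots.append(((x1a, y1a, x2a, y2a), (w - (x2a - x1a), 0, w, min(y2a - y1a, h))))
    # bottom-right tile: its top-left corner sits at (xc, yc)
    x1a, y1a, x2a, y2a = xc, yc, min(xc + w, 2 * s), min(2 * s, yc + h)
    slots.append(((x1a, y1a, x2a, y2a), (0, 0, min(w, x2a - x1a), min(y2a - y1a, h))))
    return slots

s = 4                                              # toy "img_size"
canvas = np.full((2 * s, 2 * s), 114, dtype=np.uint8)
tiles = [np.full((s, s), v, dtype=np.uint8) for v in (10, 20, 30, 40)]
for tile, (dst, src) in zip(tiles, mosaic_slots(xc=3, yc=5, w=s, h=s, s=s)):
    x1a, y1a, x2a, y2a = dst
    x1b, y1b, x2b, y2b = src
    canvas[y1a:y2a, x1a:x2a] = tile[y1b:y2b, x1b:x2b]  # same paste as step 6
```

Each destination region has exactly the same width and height as its source region, which is why the slice assignment never fails; pixels that no tile reaches keep the fill value 114.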

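2. MixUp

Principle: blend two random samples with a random ratio. For classification, the labels are mixed with the same ratio. For detection, the labels of both images are merged (optionally an alpha weight is also applied when computing the loss).
Purpose: improve generalization and reduce overfitting.
Code:

def mixup(im, labels, im2, labels2):
    """
    :params im: image 1  numpy (640, 640, 3)
    :params labels: [N, 5]=[N, cls+x1y1x2y2]
    :params im2: image 2  (640, 640, 3)
    :params labels2: [M, 5]=[M, cls+x1y1x2y2]
    :return img: the blended image (640, 640, 3)
    :return labels: the merged labels of both images [M+N, cls+x1y1x2y2]
    """
    # Sample the mixing ratio from a beta distribution, range [0, 1]
    r = np.random.beta(32.0, 32.0)  # mixup ratio, alpha=beta=32.0
    # Blend the two images with that ratio
    im = (im * r + im2 * (1 - r)).astype(np.uint8)
    # Concatenate the labels of the two images
    labels = np.concatenate((labels, labels2), 0)
    return im, labels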
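
The function above covers the detection case, where labels are concatenated. For the classification case mentioned in the principle, the labels are blended with the same ratio as the pixels. A minimal sketch, assuming one-hot label vectors; the function name `mixup_classification` and the `rng` parameter are illustrative choices, not from the original code:

```python
import numpy as np

def mixup_classification(im1, y1, im2, y2, alpha=32.0, rng=None):
    """Blend two images and their one-hot labels with the same ratio r."""
    rng = np.random.default_rng() if rng is None else rng
    r = rng.beta(alpha, alpha)                      # mixing ratio in (0, 1)
    im = (im1.astype(np.float32) * r
          + im2.astype(np.float32) * (1 - r)).astype(np.uint8)
    y = r * y1 + (1 - r) * y2                       # soft label, still sums to 1
    return im, y, r

rng = np.random.default_rng(0)
im1 = np.full((4, 4, 3), 200, dtype=np.uint8)
im2 = np.full((4, 4, 3), 100, dtype=np.uint8)
y1 = np.array([1.0, 0.0])
y2 = np.array([0.0, 1.0])
im, y, r = mixup_classification(im1, y1, im2, y2, rng=rng)
```

With alpha = beta = 32 the beta distribution is concentrated around 0.5, so both images contribute roughly equally; smaller values push r toward 0 or 1.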

3. Cutout

Principle: randomly cut out part of a sample and fill it with zero-valued pixels; the label is unchanged.
1. Targeted erasing: erase the important feature regions of the image. This requires saving the maximally activated feature map of each image, upsampling it to the original image size in the next epoch, binarizing it with a threshold, and applying the result as a mask on the original image before training. (The important regions differ from epoch to epoch, which further improves robustness.)
2. Random erasing: with a given probability, pick a random square region and fill it with zeros.
Purpose: improve the robustness of the network.
Code:

import torch
import numpy as np

class Cutout(object):
    """Randomly mask out one or more patches from an image.
    Args:
        n_holes (int): Number of patches to cut out of each image.
        length (int): The length (in pixels) of each square patch.
    """
    def __init__(self, n_holes, length=8):
        # Number of regions to cut out
        self.n_holes = n_holes

        # Side length of each cutout region
        self.length = length

    def __call__(self, img):
        """
        Args:
            img (Tensor): Tensor image of size (C, H, W).
        Returns:
            Tensor: Image with n_holes of dimension length x length cut out of it.
        """
        h = img.size(1)
        w = img.size(2)

        mask = np.ones((h, w), np.float32)

        for n in range(self.n_holes):
            # Pick a random point as the center of the cutout square
            y = np.random.randint(h)
            x = np.random.randint(w)
            # Corner coordinates of the square, clipped to the image
            y1 = np.clip(y - self.length // 2, 0, h)
            y2 = np.clip(y + self.length // 2, 0, h)
            x1 = np.clip(x - self.length // 2, 0, w)
            x2 = np.clip(x + self.length // 2, 0, w)
            # Set the mask to 0 inside the cutout region
            mask[y1: y2, x1: x2] = 0.

        mask = torch.from_numpy(mask)
        # Expand the mask to the size of img; expand repeats the existing mask values across channels
        mask = mask.expand_as(img)
        img = img * mask

        return img
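
The same idea works directly on a NumPy image before it is converted to a tensor. A minimal sketch, assuming an (H, W, C) uint8 array; the function name `cutout_numpy` and the `rng` argument are hypothetical:

```python
import numpy as np

def cutout_numpy(img, n_holes=1, length=8, rng=None):
    """Zero out n_holes square patches of side `length` in an HWC image."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = img.shape[:2]
    out = img.copy()                        # leave the input untouched
    for _ in range(n_holes):
        cy, cx = int(rng.integers(h)), int(rng.integers(w))
        # The square is clipped at the image border, so patches near an
        # edge erase fewer than length * length pixels.
        y1, y2 = max(cy - length // 2, 0), min(cy + length // 2, h)
        x1, x2 = max(cx - length // 2, 0), min(cx + length // 2, w)
        out[y1:y2, x1:x2] = 0
    return out

img = np.full((32, 32, 3), 255, dtype=np.uint8)
erased = cutout_numpy(img, n_holes=1, length=8, rng=np.random.default_rng(0))
n_zero = int((erased[..., 0] == 0).sum())   # number of erased pixels
```

Because the center is sampled anywhere in the image, the erased area varies between a quarter of the square (center in a corner) and the full square.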

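4. CutMix

Principle: the randomly cut region is not filled with zeros; it is filled with the pixels of the same region from another training sample, and the labels are mixed in proportion to the areas.
Purpose: improve generalization and reduce overfitting.
Code:

import random

import numpy as np

def cut_mix(dataset, index, num_mix=1, beta=1., prob=1.0):
    '''
    For use when loading the dataset or when iterating over a dataloader; untested.
    For classification the labels are mixed by ratio; for detection all box coordinates are merged instead.
    This is only the core logic (close to pseudocode), meant to explain the principle.
    :param dataset: a batch or the whole dataset with labels; images are assumed to be (C, H, W) torch tensors
    :param index: index of the current image
    :param num_mix: number of images to mix in
    :param beta: parameter of the beta distribution
    :param prob: probability of applying cut mix
    :return: the mixed image and label
    '''
    def rand_bbox(size, lam):
        # Height and width of the original image, size = (C, H, W)
        H, W = size[-2], size[-1]

        # Side ratio of the cut region relative to the original w and h
        cut_rat = np.sqrt(1. - lam)
        # Width and height of the cut region (np.int was removed from NumPy, use int)
        cut_w = int(W * cut_rat)
        cut_h = int(H * cut_rat)

        # Pick a random center; center plus cut w/h determines the final cut region
        cx = np.random.randint(W)
        cy = np.random.randint(H)
        # Clip the region to the image
        bbx1 = int(np.clip(cx - cut_w // 2, 0, W))
        bby1 = int(np.clip(cy - cut_h // 2, 0, H))
        bbx2 = int(np.clip(cx + cut_w // 2, 0, W))
        bby2 = int(np.clip(cy + cut_h // 2, 0, H))
        # Return the coordinates of the region taken from the second image
        return bbx1, bby1, bbx2, bby2

    # Current image and label; a classification label is expected to be one-hot
    img, label = dataset[index]
    img = img.clone()  # avoid modifying the tensor stored in the dataset

    # Loop over the number of images to mix in
    for _ in range(num_mix):
        # Draw a random number to decide whether to mix this time
        r = np.random.rand()
        if beta <= 0 or r > prob:
            continue

        # Randomly choose another image whose region will be pasted into the current one
        rand_index = random.randrange(len(dataset))
        img2, label2 = dataset[rand_index]

        # Sample lam from the beta distribution; it controls the cut size relative to w and h
        lam = np.random.beta(beta, beta)
        # Coordinates of the cut mix region
        bbx1, bby1, bbx2, bby2 = rand_bbox(img.size(), lam)
        # Apply cut mix; the tensor is (C, H, W), so rows are indexed by y and columns by x
        img[:, bby1:bby2, bbx1:bbx2] = img2[:, bby1:bby2, bbx1:bbx2]

        # Recompute lam from the actual (clipped) area and mix the labels
        lam = 1 - ((bbx2 - bbx1) * (bby2 - bby1) / (img.size()[-1] * img.size()[-2]))
        label = label * lam + label2 * (1. - lam)

    return img, label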
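
The reason lam is recomputed after the box is clipped can be seen on a small NumPy example. This is a sketch under stated assumptions: arrays are (C, H, W), labels are one-hot, and the helper `rand_bbox_hw` is a hypothetical NumPy-only stand-in for the box sampler.

```python
import numpy as np

def rand_bbox_hw(h, w, lam, rng):
    """Sample a box covering roughly (1 - lam) of an h x w image."""
    cut_rat = np.sqrt(1.0 - lam)
    cut_w, cut_h = int(w * cut_rat), int(h * cut_rat)
    cx, cy = int(rng.integers(w)), int(rng.integers(h))
    x1 = int(np.clip(cx - cut_w // 2, 0, w))
    y1 = int(np.clip(cy - cut_h // 2, 0, h))
    x2 = int(np.clip(cx + cut_w // 2, 0, w))
    y2 = int(np.clip(cy + cut_h // 2, 0, h))
    return x1, y1, x2, y2

rng = np.random.default_rng(0)
img_a = np.zeros((3, 8, 8), dtype=np.float32)      # "current" image
img_b = np.ones((3, 8, 8), dtype=np.float32)       # image pasted in
label_a = np.array([1.0, 0.0])
label_b = np.array([0.0, 1.0])

lam = rng.beta(1.0, 1.0)
x1, y1, x2, y2 = rand_bbox_hw(8, 8, lam, rng)
mixed = img_a.copy()
mixed[:, y1:y2, x1:x2] = img_b[:, y1:y2, x1:x2]    # rows = y, columns = x

# Clipping and integer rounding change the pasted area, so the label
# weight is taken from the area that was actually replaced.
lam_adj = 1.0 - (x2 - x1) * (y2 - y1) / (8 * 8)
label = lam_adj * label_a + (1.0 - lam_adj) * label_b
```

Since `img_a` is all zeros and `img_b` all ones, the mean of `mixed` equals the pasted fraction, which matches the weight given to `label_b`.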

5. random_perspective

Principle: apply random rotation, translation, scaling, shear, and perspective transformation to the image. The box labels must be transformed accordingly; after the transform, draw the image together with its labels to check for mistakes.
Purpose: improve generalization and reduce overfitting.
Code:

import math
import random

import cv2
import numpy as np

def random_perspective(img, targets=(), segments=(), degrees=10, translate=.1,
                       scale=.1, shear=10, perspective=0.0, border=(0, 0)):
    """Called in load_mosaic right after the mosaic step.
    Random perspective transform: randomly rotates, scales, translates and shears the mosaic image,
    optionally applies a perspective transform, and crops it to the input size img_size
    :params img: the mosaic image img4 [2*img_size, 2*img_size]
    :params targets: box labels of the mosaic image, labels4 (polygon labels are converted to boxes by segments2boxes) [N, cls+xyxy]
    :params segments: polygon labels of the mosaic image; empty when the dataset has only box labels
    :params degrees: rotation range in degrees
    :params translate: translation parameter
    :params scale: scaling parameter
    :params shear: shear range in degrees
    :params perspective: perspective parameter
    :params border: determines the output size; usually (-img_size // 2, -img_size // 2), which yields an output of [img_size, img_size]
    :return img: the image after the perspective/affine transform [img_size, img_size]
    :return targets: the transformed (and filtered) labels [n, cls+x1y1x2y2]
    """
    # Output H and W
    # border = -s // 2, so each side is halved: [img_size, img_size, 3]
    height = img.shape[0] + border[0] * 2  # output image H
    width = img.shape[1] + border[1] * 2   # output image W

    # ============================ transform =============================
    # OpenCV implements the warp itself; we only need to build the transform matrix M
    # Center: translate the image center to the origin
    C = np.eye(3)
    C[0, 2] = -img.shape[1] / 2  # x translation (pixels)
    C[1, 2] = -img.shape[0] / 2  # y translation (pixels)

    # Perspective matrix
    P = np.eye(3)
    P[2, 0] = random.uniform(-perspective, perspective)  # x perspective (about y)
    P[2, 1] = random.uniform(-perspective, perspective)  # y perspective (about x)

    # Rotation and Scale matrix
    R = np.eye(3)    # R = [[1,0,0], [0,1,0], [0,0,1]]    (3, 3)
    # a: random rotation angle in (-degrees, degrees)
    a = random.uniform(-degrees, degrees)
    # a += random.choice([-180, -90, 0, 90])  # add 90deg rotations to small rotations
    # s: random scale factor in (1 - scale, 1 + scale)
    s = random.uniform(1 - scale, 1 + scale)
    # s = 2 ** random.uniform(-scale, scale)
    # cv2.getRotationMatrix2D builds a 2D rotation + scale matrix
    # angle: rotation angle  center: rotation center (the origin here, since C already centered the image)  scale: scale factor
    R[:2] = cv2.getRotationMatrix2D(angle=a, center=(0, 0), scale=s)

    # Shear matrix
    S = np.eye(3)  # S = [[1,0,0], [0,1,0], [0,0,1]]
    S[0, 1] = math.tan(random.uniform(-shear, shear) * math.pi / 180)  # x shear (deg)
    S[1, 0] = math.tan(random.uniform(-shear, shear) * math.pi / 180)  # y shear (deg)

    # Translation matrix
    T = np.eye(3)  # T = [[1,0,0], [0,1,0], [0,0,1]]    (3, 3)
    T[0, 2] = random.uniform(0.5 - translate, 0.5 + translate) * width  # x translation (pixels)
    T[1, 2] = random.uniform(0.5 - translate, 0.5 + translate) * height  # y translation (pixels)

    # Combined matrix (@ is matrix multiplication)
    M = T @ S @ R @ P @ C  # order of operations (right to left) is IMPORTANT
    # Apply M to the image
    if (border[0] != 0) or (border[1] != 0) or (M != np.eye(3)).any():  # image changed
        if perspective:
            # Perspective warp: parallel lines need not stay parallel
            # Arguments are analogous to warpAffine below
            img = cv2.warpPerspective(img, M, dsize=(width, height), borderValue=(114, 114, 114))
        else:
            # Affine warp: parallel lines stay parallel
            # e.g. img [1472, 1472, 3] => [736, 736, 3]
            # cv2.warpAffine arguments: img: source image  M: 2x3 matrix  dsize: output size
            #       borderValue: fill value for the border (important; defaults to 0)
            img = cv2.warpAffine(img, M[:2], dsize=(width, height), borderValue=(114, 114, 114))

    # Transform label coordinates
    n = len(targets)
    if n:
        # use_segments is True only when segments is non-empty, i.e. the dataset has polygon labels
        use_segments = any(x.any() for x in segments)
        new = np.zeros((n, 4))

        # warp boxes (the segment-based path is omitted here; the box labels in targets are used)
        # Rotation and perspective move every corner differently, so all four corners are transformed
        xy = np.ones((n * 4, 3))
        xy[:, :2] = targets[:, [1, 2, 3, 4, 1, 4, 3, 2]].reshape(n * 4, 2)  # x1y1, x2y2, x1y2, x2y1
        xy = xy @ M.T  # transform every corner
        xy = (xy[:, :2] / xy[:, 2:3] if perspective else xy[:, :2]).reshape(n, 8)  # perspective rescale or affine

        # create new boxes: axis-aligned box around the transformed corners
        x = xy[:, [0, 2, 4, 6]]
        y = xy[:, [1, 3, 5, 7]]
        new = np.concatenate((x.min(1), y.min(1), x.max(1), y.max(1))).reshape(4, n).T

        # clip boxes to the output image
        new[:, [0, 2]] = new[:, [0, 2]].clip(0, width)
        new[:, [1, 3]] = new[:, [1, 3]].clip(0, height)

        # filter candidates with box_candidates (a YOLOv5 helper, not shown here)
        # It keeps boxes whose width and height exceed wh_thr pixels, whose area is at least area_thr
        # of the area before the transform, and whose aspect ratio lies within (1/ar_thr, ar_thr)
        # i is a boolean array of length n; indexing with it keeps only the True boxes
        i = box_candidates(box1=targets[:, 1:5].T * s, box2=new.T, area_thr=0.01 if use_segments else 0.10)
        # Keep the targets that pass the filter
        targets = targets[i]
        targets[:, 1:5] = new[i]

    return img, targets
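
The label mapping at the end of the function (transform all four corners, then take their axis-aligned bounding box) can be tried without OpenCV. A minimal sketch, assuming boxes are given as [x1, y1, x2, y2] without a class column; the helper name `warp_boxes` is hypothetical:

```python
import numpy as np

def warp_boxes(boxes, M):
    """Map xyxy boxes through a 3x3 homogeneous matrix M."""
    n = len(boxes)
    xy = np.ones((n * 4, 3))
    # corners in the order x1y1, x2y2, x1y2, x2y1
    xy[:, :2] = boxes[:, [0, 1, 2, 3, 0, 3, 2, 1]].reshape(n * 4, 2)
    xy = xy @ M.T                                    # transform every corner
    xy = (xy[:, :2] / xy[:, 2:3]).reshape(n, 8)      # homogeneous divide
    x, y = xy[:, [0, 2, 4, 6]], xy[:, [1, 3, 5, 7]]
    # axis-aligned box enclosing the four transformed corners
    return np.stack((x.min(1), y.min(1), x.max(1), y.max(1)), axis=1)

boxes = np.array([[0.0, 0.0, 4.0, 2.0]])

T = np.eye(3)                    # pure translation by (+10, +5)
T[0, 2], T[1, 2] = 10.0, 5.0
moved = warp_boxes(boxes, T)

R = np.array([[0.0, -1.0, 0.0],  # 90 degree rotation about the origin
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])
rotated = warp_boxes(boxes, R)
```

Translation only shifts the box, while the rotation swaps its width and height; for rotations that are not multiples of 90 degrees the enclosing box becomes larger than the object, which is one reason the real code filters candidates afterwards.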

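6. augment_hsv

Principle: color-space augmentation. Split the image into its H, S and V channels, randomly scale each channel, merge the channels again, and convert the result back to a BGR image.
Purpose: improve generalization under different color conditions and reduce overfitting.
Code:

import cv2
import numpy as np

def augment_hsv(img, hgain=0.5, sgain=0.5, vgain=0.5):
    """Typically used in __getitem__ of LoadImagesAndLabels.
    HSV color-space augmentation; modifies the image in place and leaves the labels untouched
    :param img: image to process  BGR [736, 736]
    :param hgain: gain for the h channel, used to build the new h channel
    :param sgain: gain for the s channel, used to build the new s channel
    :param vgain: gain for the v channel, used to build the new v channel
    :return: None; img is overwritten with the augmented image
    """
    if hgain or sgain or vgain:
        # Draw three random numbers in [-1, 1], scale by the hsv gains from hyp, and add 1
        r = np.random.uniform(-1, 1, 3) * [hgain, sgain, vgain] + 1  # random gains
        # cv2.split(img) splits the channels
        hue, sat, val = cv2.split(cv2.cvtColor(img, cv2.COLOR_BGR2HSV))  # h, s, v channels
        dtype = img.dtype  # uint8

        x = np.arange(0, 256, dtype=r.dtype)
        lut_hue = ((x * r[0]) % 180).astype(dtype)         # lookup table for the new h channel
        lut_sat = np.clip(x * r[1], 0, 255).astype(dtype)  # lookup table for the new s channel
        lut_val = np.clip(x * r[2], 0, 255).astype(dtype)  # lookup table for the new v channel

        # Merge the channels again: img_hsv = h + s + v after the random adjustment
        # cv2.LUT(hue, lut_hue) maps each value of the old channel through the lookup table
        # cv2.merge merges the channels
        img_hsv = cv2.merge((cv2.LUT(hue, lut_hue), cv2.LUT(sat, lut_sat), cv2.LUT(val, lut_val)))
        # dst=img writes the result into the input image, so no return is needed
        cv2.cvtColor(img_hsv, cv2.COLOR_HSV2BGR, dst=img)  # no return needed  hsv->bgr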
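
The lookup tables treat hue differently from saturation and value: hue wraps around (OpenCV stores 8-bit hue in 0..179), while the other two saturate at 255. A minimal NumPy-only sketch with a fixed gain of 1.5 chosen for illustration; indexing a table with a uint8 array is used here as a stand-in for what `cv2.LUT` does on a single channel:

```python
import numpy as np

x = np.arange(0, 256, dtype=np.float64)
gain = 1.5                                              # illustrative fixed gain

lut_hue = ((x * gain) % 180).astype(np.uint8)           # hue wraps modulo 180
lut_sat = np.clip(x * gain, 0, 255).astype(np.uint8)    # saturation clips at 255

hue = np.array([[0, 100, 170]], dtype=np.uint8)
new_hue = lut_hue[hue]       # table lookup, one output value per input value

sat = np.array([[100, 200]], dtype=np.uint8)
new_sat = lut_sat[sat]
```

A hue of 170 scaled by 1.5 gives 255, which wraps to 75 instead of clipping; a saturation of 200 scaled by 1.5 gives 300, which clips to 255.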

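7. Flipping

Principle: use NumPy's flip functions; remember that the box labels must be flipped as well.
Purpose: improve generalization and reduce overfitting.
Code:

import numpy as np

def flip(img, labels, is_flipud=True, is_fliplr=True):
    '''
    :param img: image as an (H, W, C) array
    :param labels: [N, 4] boxes as xywh (center x, center y, width, height), normalized to 0~1
                   (if a class column comes first, shift the column indices below by one)
    :param is_flipud: whether to flip vertically
    :param is_fliplr: whether to flip horizontally
    :return: the flipped image and labels
    '''
    # Vertical flip: x coordinates are unchanged
    if is_flipud:
        img = np.flipud(img)
        # Mirror the center y; width and height do not change
        labels[:, 1] = 1 - labels[:, 1]

    # Horizontal flip: y coordinates are unchanged
    if is_fliplr:
        img = np.fliplr(img)   # np.fliplr reverses the array left to right
        # Mirror the center x; width and height do not change
        labels[:, 0] = 1 - labels[:, 0]

    return img, labels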
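
For normalized xywh labels only the box center moves under a flip; width and height stay the same. A minimal sketch of the horizontal case, assuming labels are [x_center, y_center, w, h] without a class column; the function name `flip_lr_xywhn` is hypothetical:

```python
import numpy as np

def flip_lr_xywhn(img, labels):
    """Flip an HWC image left-right and mirror normalized xywh labels."""
    img = np.fliplr(img)
    labels = labels.copy()
    labels[:, 0] = 1.0 - labels[:, 0]   # mirror center x; y, w, h unchanged
    return img, labels

img = np.zeros((4, 10, 3), dtype=np.uint8)
img[:, 2] = 255                          # bright column, center at x = 2.5 / 10
labels = np.array([[0.25, 0.5, 0.1, 0.2]])

flipped, new_labels = flip_lr_xywhn(img, labels)
```

The bright column moves from pixel column 2 to column 7, whose center is 7.5 / 10 = 0.75, which agrees with the mirrored label.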

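8. DropBlock and Dropout

Principle:
1. Dropout:
Dropout is used on fully connected layers to address overfitting and slow training. It randomly deactivates some neurons on each step and is only active during training. A typical arrangement is nn.Linear() -> nn.ReLU() -> nn.Dropout() -> nn.Linear() -> nn.Dropout(). A manual implementation usually builds a mask with randomly placed zeros and multiplies it with the original tensor.
2. DropBlock:
Dropout applied to convolutional layers is much less effective against overfitting: convolutional features are spatially correlated, so when one element is dropped its neighbors still carry the same information. DropBlock can be seen as a two-dimensional dropout for convolutions. Dropout deactivates random, isolated elements, whereas DropBlock deactivates a whole contiguous local region. DropBlock acts on feature maps (not on the input image; on the input image the counterpart is Cutout).
- Using an independent DropBlock mask for each feature channel tends to work better;
- There are two main values, block_size and gamma. block_size is the size of the dropped block, and gamma controls how many elements are deactivated; it is derived from the drop probability rather than being used as the probability directly. The mask is computed with a max-pooling trick.
Purpose: improve generalization and reduce overfitting.
Code:

Dropout:

nn.Dropout(p=p)

DropBlock:
Install the library with pip
pip install dropblock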

import torch
from dropblock import DropBlock2D

# For 2D inputs (DropBlock2D):
# (bsize, n_feats, height, width)
x = torch.rand(100, 10, 16, 16)
drop_block = DropBlock2D(block_size=3, drop_prob=0.3)
regularized_x = drop_block(x)


# For 3D inputs (DropBlock3D):
import torch
from dropblock import DropBlock3D
# (bsize, n_feats, depth, height, width)
x = torch.rand(100, 10, 16, 16, 16)
drop_block = DropBlock3D(block_size=3, drop_prob=0.3)
regularized_x = drop_block(x)
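
To make the roles of block_size and gamma concrete, here is a minimal NumPy sketch of a DropBlock mask for a single feature map. It is not the `dropblock` package's implementation. It assumes an odd block_size, uses the gamma formula from the DropBlock paper, and, as a simplification, samples seed points over the whole map instead of only where a full block fits. The function name `dropblock_mask` is hypothetical.

```python
import numpy as np

def dropblock_mask(h, w, block_size, drop_prob, rng):
    """Return a boolean keep-mask of shape (h, w); False marks dropped cells."""
    # gamma: seed probability chosen so that roughly drop_prob of all
    # cells end up dropped once each seed grows into a full block
    gamma = (drop_prob / block_size ** 2) * (h * w) / (
        (h - block_size + 1) * (w - block_size + 1))
    seeds = rng.random((h, w)) < gamma
    # Grow every seed into a block_size x block_size square. OR-ing the
    # shifted copies is a stride-1 max pooling over a boolean map.
    pad = block_size // 2
    padded = np.pad(seeds, pad, mode="constant")
    drop = np.zeros((h, w), dtype=bool)
    for dy in range(block_size):
        for dx in range(block_size):
            drop |= padded[dy:dy + h, dx:dx + w]
    return ~drop

rng = np.random.default_rng(0)
feat = rng.random((16, 16)).astype(np.float32)
keep = dropblock_mask(16, 16, block_size=3, drop_prob=0.3, rng=rng)
# Rescale the surviving activations so their expected sum is preserved
out = feat * keep * (keep.size / max(int(keep.sum()), 1))

keep_none_dropped = dropblock_mask(16, 16, block_size=3, drop_prob=0.0, rng=rng)
```

With drop_prob = 0 no seed is ever drawn, so the mask keeps everything; in a real layer a separate mask would be drawn per channel and the layer would be bypassed at inference time.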

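III. Rectangular training and inference

Overview:
Purpose of rectangular inference: reduce redundant information and speed up training and inference. (Images fed to the network are usually resized to a square of the configured img_size first. Even though the longest side is scaled to img_size and the other side keeps the original aspect ratio, the rest of the square must be padded with gray or black. Large padded areas carry redundant information and slow the network down.)

The input image only needs sides that are multiples of the stride (32); it does not need to be square.
The usual training input is img_size = 640 x 640. Without rectangular training/inference, the image is resized so that its longest side equals img_size. With mosaic, the stitched image has size 2 * img_size.
Steps for rectangular training and inference:

Prepare the resize shape: (training) within one batch all inputs must share the same shape (different batches may use different shapes), so choosing a suitable shape per batch matters. All images are sorted by aspect ratio and the sort index is kept so the matching data can be fetched. After sorting, each batch contains images of similar aspect ratio, so a shape that fits the whole batch is cheap; the rule is not simply the maximum, minimum or mean aspect ratio, see the code for the exact choice. (Inference/testing) at inference time the configured size is used directly.
Choose the scale ratio: compare the image's w and h with the target shape's w and h and take the smaller of the two ratios, so that the scaled image fits inside the target shape; scale w and h by this ratio.
Compute the padding: from the differences dw and dh between the scaled shape and the requested new_shape, compute the padding along width and height. The padding is split between the two sides, so each value is divided by 2. cv2.copyMakeBorder() does the padding, usually with gray (pixel value 114) or black (pixel value 0).

Code: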

# 1. Training: prepare the resize shapes
        if self.rect:
            # Sort by aspect ratio
            s = self.shapes  # wh
            ar = s[:, 1] / s[:, 0]  # aspect ratio (h / w)
            irect = ar.argsort()  # sort by height/width ratio
            self.img_files = [self.img_files[i] for i in irect]      # img_files in sorted order
            self.label_files = [self.label_files[i] for i in irect]  # label_files in sorted order
            self.labels = [self.labels[i] for i in irect]            # labels in sorted order
            self.shapes = s[irect]                                   # wh in sorted order
            ar = ar[irect]                                           # aspect ratios in sorted order

            # Compute one common shape per batch  (Set training image shapes)
            shapes = [[1, 1]] * nb    # nb: number of batches
            for i in range(nb):
                ari = ar[bi == i]     # bi: batch index
                mini, maxi = ari.min(), ari.max()   # smallest and largest h/w ratio in batch i
                # If every h/w is below 1 (w > h), set w to img_size and scale h by the largest ratio
                if maxi < 1:
                    shapes[i] = [maxi, 1]   # maxi: h relative to img_size  1: w relative to img_size
                # If every h/w is above 1 (w < h), set h to img_size and scale w accordingly
                elif mini > 1:
                    shapes[i] = [1, 1 / mini]

            # Shape fed to the network for each batch, rounded up to a multiple of stride:
            # divide by stride, round up, multiply by stride again
            # (np.int was removed from NumPy; the builtin int is used instead)
            self.batch_shapes = np.ceil(np.array(shapes) * img_size / stride + pad).astype(int) * stride
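
The fragment above relies on variables defined elsewhere in the dataset class (nb, bi, pad). The following is a self-contained sketch of the same computation on a toy list of image sizes. The function name `rect_batch_shapes` is hypothetical, and `bi` is assumed to assign consecutive sorted images to batches of `batch_size`.

```python
import numpy as np

def rect_batch_shapes(shapes_wh, batch_size, img_size=640, stride=32, pad=0.0):
    """Sort images by h/w ratio and pick one (h, w) input shape per batch."""
    shapes_wh = np.asarray(shapes_wh, dtype=np.float64)
    ar = shapes_wh[:, 1] / shapes_wh[:, 0]          # aspect ratio h / w
    order = ar.argsort()                            # indices in sorted order
    ar = ar[order]
    bi = np.floor(np.arange(len(ar)) / batch_size).astype(int)  # batch index
    nb = int(bi[-1]) + 1                            # number of batches
    shapes = [[1.0, 1.0] for _ in range(nb)]        # [h ratio, w ratio]
    for i in range(nb):
        ari = ar[bi == i]
        mini, maxi = ari.min(), ari.max()
        if maxi < 1:                                # all landscape: shrink h
            shapes[i] = [maxi, 1.0]
        elif mini > 1:                              # all portrait: shrink w
            shapes[i] = [1.0, 1.0 / mini]
    batch_shapes = np.ceil(np.array(shapes) * img_size / stride + pad).astype(int) * stride
    return order, batch_shapes

# four images given as (w, h): two landscape, two portrait
order, batch_shapes = rect_batch_shapes(
    [(640, 480), (480, 700), (1280, 720), (500, 800)], batch_size=2)
```

In this example the two landscape images form the first batch with shape (480, 640) and the two portrait images form the second with shape (640, 448), so neither batch is padded up to the full 640 x 640 square.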


def letterbox(img, new_shape=(640, 640), color=(114, 114, 114), auto=True, scaleFill=False, scaleup=True, stride=32):
    """Used in __getitem__ of LoadImagesAndLabels when mosaic is not applied (for example during validation),
    and by the inference loaders.
    Resize the image to the requested size
    Resize and pad image while meeting stride-multiple constraints
    https://github.com/ultralytics/yolov3/issues/232
    :param img: source image, HWC
    :param new_shape: target size (an int means a square of that size)
    :param color: padding color
    :param auto: True  scale the image to fit new_shape while keeping its aspect ratio, then pad only up to the
                       nearest multiple of stride (minimum rectangle, no distortion)
                 False scale the same way, then pad the shorter side all the way up to new_shape (no distortion)
    :param scaleFill: True stretch the image straight to new_shape; a plain resize without padding (distorts)
    :param scaleup: True  images smaller than new_shape may be scaled up
                    False only scale down; smaller images keep their size
    :return: img: the letterboxed image, HWC
             ratio: wh ratios
             (dw, dh): padding along w and h (per side)
    """
    shape = img.shape[:2]  # size after the first resize [h, w] = [343, 512] (load_image already resizes once toward the configured size)
    if isinstance(new_shape, int):
        new_shape = (new_shape, new_shape)  # (512, 512)

    # 2. Choose the scale ratio
    # scale ratio (new / old)   example: new_shape=(384, 512)
    r = min(new_shape[0] / shape[0], new_shape[1] / shape[1])   # r=1

    # Downscale only, because upscaling blurs the image
    # (for better test mAP) scaleup=False: images larger than new_shape (r<1) are scaled, smaller ones (r>1) are kept
    if not scaleup:  # only scale down, do not scale up (for better test mAP)
        r = min(r, 1.0)

    # Compute padding
    ratio = r, r  # width, height ratios   (1, 1)
    # 3. Compute the padding along w and h
    new_unpad = int(round(shape[1] * r)), int(round(shape[0] * r))  # wh (512, 343), aspect ratio preserved
    dw, dh = new_shape[1] - new_unpad[0], new_shape[0] - new_unpad[1]  # wh padding  dw=0 dh=41
    if auto:  # minimum rectangle
        # The modulo keeps only the padding needed to reach the next multiple of stride
        dw, dh = np.mod(dw, stride), np.mod(dh, stride)  # wh padding  dw=0 dh=9
    elif scaleFill:  # stretch straight to the requested size
        dw, dh = 0.0, 0.0
        new_unpad = (new_shape[1], new_shape[0])
        ratio = new_shape[1] / shape[1], new_shape[0] / shape[0]  # width, height ratios

    # Pad both sides of the shorter dimension rather than one side only
    dw /= 2  # divide padding into 2 sides (left/right)  dw=0
    dh /= 2  # top/bottom  (dh=20.5 in the auto=False example)

    # shape: [h, w]  new_unpad: [w, h]
    if shape[::-1] != new_unpad:  # resize to new_unpad (same aspect ratio as the source)
        img = cv2.resize(img, new_unpad, interpolation=cv2.INTER_LINEAR)
    top, bottom = int(round(dh - 0.1)), int(round(dh + 0.1))  # top/bottom padding  (auto=False example: top=20 bottom=21)
    left, right = int(round(dw - 0.1)), int(round(dw + 0.1))  # left/right padding  (left=0 right=0)

    # add border/pad
    img = cv2.copyMakeBorder(img, top, bottom, left, right, cv2.BORDER_CONSTANT, value=color)  # add border

    # auto=False example: img (384, 512, 3), ratio=(1.0, 1.0) (no scaling), (dw, dh)=(0.0, 20.5)
    return img, ratio, (dw, dh)
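
The difference between auto=True and auto=False is easiest to see from the numbers alone. The sketch below repeats only the geometry of letterbox, without touching any pixels; the function name `letterbox_geometry` is hypothetical and the scaleFill branch is left out.

```python
def letterbox_geometry(shape_hw, new_shape=(640, 640), auto=True,
                       scaleup=True, stride=32):
    """Return (r, new_unpad_wh, (top, bottom, left, right), out_hw)."""
    h, w = shape_hw
    r = min(new_shape[0] / h, new_shape[1] / w)      # scale ratio (new / old)
    if not scaleup:
        r = min(r, 1.0)                              # only scale down
    new_unpad = (int(round(w * r)), int(round(h * r)))   # (w, h) after resize
    dw = new_shape[1] - new_unpad[0]
    dh = new_shape[0] - new_unpad[1]
    if auto:                                         # pad to a stride multiple only
        dw, dh = dw % stride, dh % stride
    dw, dh = dw / 2, dh / 2                          # split between both sides
    top, bottom = int(round(dh - 0.1)), int(round(dh + 0.1))
    left, right = int(round(dw - 0.1)), int(round(dw + 0.1))
    out_hw = (new_unpad[1] + top + bottom, new_unpad[0] + left + right)
    return r, new_unpad, (top, bottom, left, right), out_hw

# the example from the comments: a 343 x 512 image, new_shape = (384, 512)
full = letterbox_geometry((343, 512), (384, 512), auto=False)
rect = letterbox_geometry((343, 512), (384, 512), auto=True)
```

With auto=False the image is padded by 20 and 21 rows up to 384 x 512. With auto=True only 9 rows are added, giving 352 x 512, which is still a multiple of 32 in both dimensions but has less padding.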

IV. Loading images, videos and streams at detection time

Loading images:

class LoadImages:  # for inference
    """Used in detect.py
    Loads the images/videos in a folder
    Defines an iterator for detect.py
    """
    def __init__(self, path, img_size=640, stride=32):
        p = str(Path(path).absolute())  # os-agnostic absolute path
        # glob.glob returns a list of all matching file paths   files: all collected paths
        if '*' in p:
            # p is a wildcard pattern: use glob to collect the matching files
            files = sorted(glob.glob(p, recursive=True))  # glob
        elif os.path.isdir(p):
            # p is a directory: use glob to collect every file in it
            files = sorted(glob.glob(os.path.join(p, '*.*')))  # dir
        elif os.path.isfile(p):
            # p is a single file: use it directly
            files = [p]  # files
        else:
            raise Exception(f'ERROR: {p} does not exist')

        # images: paths of all image files  videos: paths of all video files
        images = [x for x in files if x.split('.')[-1].lower() in img_formats]
        videos = [x for x in files if x.split('.')[-1].lower() in vid_formats]
        # Number of images and videos
        ni, nv = len(images), len(videos)

        self.img_size = img_size
        self.stride = stride   # largest downsampling stride
        self.files = images + videos  # image and video paths in one list
        self.nf = ni + nv  # number of files
        self.video_flag = [False] * ni + [True] * nv  # whether each file is a video
        self.mode = 'image'  # image mode by default
        if any(videos):
            # There are video files: initialize the OpenCV video reader (cap = cv2.VideoCapture, etc.)
            self.new_video(videos[0])  # new video
        else:
            self.cap = None
        assert self.nf > 0, f'No images or videos found in {p}. ' \
                            f'Supported formats are:\nimages: {img_formats}\nvideos: {vid_formats}'

    def __iter__(self):
        """Reset the counter and return the iterator"""
        self.count = 0
        return self

    def __next__(self):
        """Return the next item; called repeatedly after __iter__"""
        if self.count == self.nf:  # all data has been read
            raise StopIteration
        path = self.files[self.count]  # path of the current file

        if self.video_flag[self.count]:  # is the current file a video?
            # Read video
            self.mode = 'video'
            # Grab the current frame; ret_val stays True until the video ends
            ret_val, img0 = self.cap.read()
            # If the current video has ended, move on to the next one
            if not ret_val:
                self.count += 1
                self.cap.release()
                # self.count == self.nf means all videos have been read
                if self.count == self.nf:  # last video
                    raise StopIteration
                else:
                    path = self.files[self.count]
                    self.new_video(path)
                    ret_val, img0 = self.cap.read()

            self.frame += 1  # index of the frame just read
            print(f'video {self.count + 1}/{self.nf} ({self.frame}/{self.frames}) {path}: ', end='')

        else:
            # Read image
            self.count += 1
            img0 = cv2.imread(path)  # BGR
            assert img0 is not None, 'Image Not Found ' + path
            print(f'image {self.count}/{self.nf} {path}: ', end='')

        # Padded resize
        img = letterbox(img0, self.img_size, stride=self.stride)[0]

        # Convert
        img = img[:, :, ::-1].transpose(2, 0, 1)  # BGR to RGB and HWC to CHW
        img = np.ascontiguousarray(img)

        # Return the path, the resized and padded image, the original image, and the video object
        return path, img, img0, self.cap

    def new_video(self, path):
        # Reset the frame counter
        self.frame = 0
        # Create the video capture object
        self.cap = cv2.VideoCapture(path)
        # Total number of frames in the video file
        self.frames = int(self.cap.get(cv2.CAP_PROP_FRAME_COUNT))

    def __len__(self):
        return self.nf  # number of files
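
The "Convert" step in __next__ does two things in one expression: it reverses the channel order (BGR to RGB) and moves the channel axis to the front (HWC to CHW). A minimal sketch on a tiny synthetic image, where only the blue channel is set so the effect is easy to follow:

```python
import numpy as np

# a 2 x 3 image in OpenCV layout: HWC with channels ordered B, G, R
img0 = np.zeros((2, 3, 3), dtype=np.uint8)
img0[..., 0] = 255                       # channel 0 is blue in BGR

img = img0[:, :, ::-1]                   # reverse channels: BGR -> RGB
img = img.transpose(2, 0, 1)             # HWC -> CHW
# Both steps above only create views with unusual strides;
# ascontiguousarray copies them into a plain C-ordered buffer.
img = np.ascontiguousarray(img)
```

After the conversion the array has shape (3, 2, 3) and the blue values sit in the last channel plane, as expected for RGB order. The contiguous copy matters because functions such as torch.from_numpy do not accept arrays with negative strides.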

Loading video streams:

class LoadStreams:
    """
    Loads video streams
    multiple IP or RTSP cameras
    Defines an iterator for detect.py
    """
    def __init__(self, sources='streams.txt', img_size=640, stride=32):
        self.mode = 'stream'  # the mode is 'stream'
        self.img_size = img_size
        self.stride = stride  # largest downsampling stride

        # If sources is a file listing several streams, read each one into a list
        if os.path.isfile(sources):
            with open(sources, 'r') as f:
                sources = [x.strip() for x in f.read().strip().splitlines() if len(x.strip())]
        else:
            # Otherwise there is a single stream; store it directly
            sources = [sources]

        n = len(sources)  # number of streams
        # Initialize images, fps, frame counts and threads
        self.imgs, self.fps, self.frames, self.threads = [None] * n, [0] * n, [0] * n, [None] * n
        self.sources = [clean_str(x) for x in sources]  # clean source names for later

        # Loop over the streams
        for i, s in enumerate(sources):  # index, source
            # Start thread to read frames from video stream
            # Print current index / number of streams / stream address
            print(f'{i + 1}/{n}: {s}... ', end='')
            if 'youtube.com/' in s or 'youtu.be/' in s:  # if source is YouTube video
                check_requirements(('pafy', 'youtube_dl'))
                import pafy
                s = pafy.new(s).getbest(preftype="mp4").url  # YouTube URL
            s = int(s) if s.isnumeric() else s  # i.e. s = '0' local webcam
            # s=0 opens the local camera, otherwise the stream address is opened
            cap = cv2.VideoCapture(s)
            assert cap.isOpened(), f'Failed to open {s}'
            # Width and height of the video
            w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
            h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
            # Frame rate of the video
            self.fps[i] = max(cap.get(cv2.CAP_PROP_FPS) % 100, 0) or 30.0  # 30 FPS fallback
            # Number of frames
            self.frames[i] = max(int(cap.get(cv2.CAP_PROP_FRAME_COUNT)), 0) or float('inf')  # infinite stream fallback

            # Read the current frame
            _, self.imgs[i] = cap.read()  # guarantee first frame
            # Read the stream in a separate thread; daemon=True ends the thread when the main thread exits
            self.threads[i] = Thread(target=self.update, args=([i, cap]), daemon=True)
            print(f" success ({self.frames[i]} frames {w}x{h} at {self.fps[i]:.2f} FPS)")
            self.threads[i].start()
        print('')  # newline

        # check for common shapes
        # Shapes after resize + pad; letterbox pads for rectangular inference by default (auto=True)
        s = np.stack([letterbox(x, self.img_size, stride=self.stride)[0].shape for x in self.imgs], 0)  # shapes
        self.rect = np.unique(s, axis=0).shape[0] == 1  # rect inference if all shapes equal
        if not self.rect:
            print('WARNING: Different stream shapes detected. For optimal performance supply similarly-shaped streams.')

    def update(self, i, cap):
        # Read stream `i` frames in daemon thread
        n, f = 0, self.frames[i]
        while cap.isOpened() and n < f:
            n += 1
            # _, self.imgs[index] = cap.read()
            cap.grab()
            # Decode only every 4th frame (a bare `n % 4` would be true for the other three)
            if n % 4 == 0:  # read every 4th frame
                success, im = cap.retrieve()
                self.imgs[i] = im if success else self.imgs[i] * 0
            time.sleep(1 / self.fps[i])  # wait time

    def __iter__(self):
        self.count = -1
        return self

    def __next__(self):
        self.count += 1
        if not all(x.is_alive() for x in self.threads) or cv2.waitKey(1) == ord('q'):  # q to quit
            cv2.destroyAllWindows()
            raise StopIteration

        # Letterbox
        img0 = self.imgs.copy()
        img = [letterbox(x, self.img_size, auto=self.rect, stride=self.stride)[0] for x in img0]

        # Stack the frames of all streams into one batch
        img = np.stack(img, 0)

        # Convert
        img = img[:, :, :, ::-1].transpose(0, 3, 1, 2)  # BGR to RGB and BHWC to BCHW
        img = np.ascontiguousarray(img)

        return self.sources, img, img0, None

    def __len__(self):
        return 0  # 1E12 frames = 32 streams at 30 FPS for 30 years
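
The frame-skipping condition in update is easy to get backwards, so a quick check of which frame numbers each form of the test selects may help. This is a pure-Python sketch with hypothetical helper names; no video is involved.

```python
def frames_selected(n_frames, every=4):
    """Frame numbers for which `n % every == 0` holds (one in `every`)."""
    return [n for n in range(1, n_frames + 1) if n % every == 0]

def frames_selected_truthy(n_frames, every=4):
    """Frame numbers for which a bare `n % every` is truthy (all the others)."""
    return [n for n in range(1, n_frames + 1) if n % every]

one_in_four = frames_selected(8)
three_in_four = frames_selected_truthy(8)
```

Comparing against zero decodes frames 4 and 8, while testing the remainder for truthiness decodes frames 1, 2, 3, 5, 6 and 7. A comment that says "read every 4th frame" matches only the first form.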

Loading a webcam:

class LoadWebcam:  # for inference
    """Rarely used; loads frames from a local webcam or an IP camera"""
    def __init__(self, pipe='0', img_size=640, stride=32):
        self.img_size = img_size
        self.stride = stride

        if pipe.isnumeric():
            pipe = int(pipe)  # local camera
        # pipe = 'rtsp://192.168.1.64/1'  # IP camera
        # pipe = 'rtsp://username:password@192.168.1.64/1'  # IP camera with login
        # pipe = 'http://wmccpinetop.axiscam.net/mjpg/video.mjpg'  # IP golf camera

        self.pipe = pipe
        self.cap = cv2.VideoCapture(pipe)  # video capture object
        self.cap.set(cv2.CAP_PROP_BUFFERSIZE, 3)  # set buffer size

    def __iter__(self):
        self.count = -1
        return self

    def __next__(self):
        self.count += 1
        if cv2.waitKey(1) == ord('q'):  # q to quit
            self.cap.release()
            cv2.destroyAllWindows()
            raise StopIteration

        # Read frame
        if self.pipe == 0:  # local camera
            ret_val, img0 = self.cap.read()
            img0 = cv2.flip(img0, 1)  # flip left-right
        else:  # IP camera
            n = 0
            while True:
                n += 1
                self.cap.grab()
                if n % 30 == 0:  # skip frames
                    ret_val, img0 = self.cap.retrieve()
                    if ret_val:
                        break

        # Print
        assert ret_val, f'Camera Error {self.pipe}'
        img_path = 'webcam.jpg'
        print(f'webcam {self.count}: ', end='')

        # Padded resize
        img = letterbox(img0, self.img_size, stride=self.stride)[0]

        # Convert
        img = img[:, :, ::-1].transpose(2, 0, 1)  # BGR to RGB and HWC to CHW
        img = np.ascontiguousarray(img)

        return img_path, img, img0, None

    def __len__(self):
        return 0