Reproducing yolov4_u5 - 4. Training (train.py)

1. Random seeds
Setting the random seeds makes the experimental results reproducible.

cudnn.deterministic
With torch.backends.cudnn.deterministic = True, cuDNN always returns the same convolution algorithm (the default one), so the convolution results are deterministic across runs.

cudnn.benchmark
Setting torch.backends.cudnn.benchmark = True makes the program spend a little extra time at startup searching for the fastest convolution implementation for every convolution layer in the network, which speeds up training afterwards. This is appropriate when the network structure is fixed (not dynamic) and the input shape (batch size, image size, number of channels) does not change, which covers most common cases. Conversely, if the convolution configuration keeps changing, the program keeps re-benchmarking and ends up wasting time instead.
This code uses the following settings (faster, less reproducible):
cudnn.deterministic = False
cudnn.benchmark = True

def init_seeds(seed=0):
    random.seed(seed)
    np.random.seed(seed)
    torch_utils.init_seeds(seed=seed)
    
def init_seeds(seed=0):    # this is torch_utils.init_seeds, called by the function above
    # fix torch's random number generator
    torch.manual_seed(seed)

    # Speed-reproducibility tradeoff https://pytorch.org/docs/stable/notes/randomness.html
    if seed == 0:  # slower, more reproducible
        cudnn.deterministic = True
        cudnn.benchmark = False
    else:  # faster, less reproducible
        cudnn.deterministic = False
        cudnn.benchmark = True

2. Nominal batch size
nominal batch size = 64 is the nominal batch size. If the actual batch size is 16, then 64 / 16 = 4, so the optimizer step that updates the weights runs only once every 4 iterations while the gradients from the intermediate iterations accumulate. This imitates a larger batch while saving GPU memory; a self-contained sketch follows the code excerpt below.

# define nominal batch size, set weight_decay
    nominal_batch_size = 64
    accumulate = max(round(nominal_batch_size / total_batch_size), 1)
    hyp['weight_decay'] *= total_batch_size * accumulate / nominal_batch_size
# optimizer
            if batch_index_now % accumulate == 0:
                optimizer.step()
                optimizer.zero_grad()
                if ema is not None:
                    ema.update(model)
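
As a sanity check of the accumulation logic, here is a minimal, self-contained sketch (the model, data and batch sizes are made up for illustration; only the accumulate arithmetic mirrors the excerpt above):

import torch
import torch.nn as nn

total_batch_size = 16                      # actual batch size (illustrative)
nominal_batch_size = 64                    # nominal batch size
accumulate = max(round(nominal_batch_size / total_batch_size), 1)   # -> 4

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

optimizer.zero_grad()
for i in range(20):                        # pretend these are 20 training batches
    x, y = torch.randn(total_batch_size, 10), torch.randn(total_batch_size, 1)
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()                        # gradients keep accumulating across iterations
    if (i + 1) % accumulate == 0:          # weights are updated only once every accumulate batches
        optimizer.step()
        optimizer.zero_grad()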

3. Optimizer
SGD, Momentum, RMSProp, Adam

SGD is the most basic optimizer and offers no acceleration on its own. Momentum is an improved SGD that adds a momentum term; RMSProp builds on that idea, and Adam in turn builds on RMSProp. Still, in the comparison linked below, Adam actually performs slightly worse than RMSProp, so a more advanced optimizer does not guarantee better results. Try different optimizers in your own experiments and pick the one that best fits your data and network. https://blog.csdn.net/qq_34690929/article/details/79932416

Momentum
See https://blog.csdn.net/weixin_43378396/article/details/90741645

With the traditional update, the parameter W is changed by adding the negative learning rate times the gradient (dx) to the original W. The path this takes is rather roundabout.

Momentum is like taking that walker from flat ground and putting him on a slope: as soon as he takes a small step downhill, inertia keeps carrying him in that direction, so he takes fewer detours. That is the Momentum parameter update.

The term "momentum" comes from mechanics, where it describes the cumulative effect of a force over time.
In plain gradient descent, x += v, each update v is v = -dx * lr, where dx is the first derivative of the objective func(x) with respect to x.
With momentum, each update v becomes the sum of this step's gradient-descent term -dx * lr and the previous update v scaled by a factor momentum in [0, 1]:

v = -dx * lr + v * momentum

When the current descent term -dx * lr points in the same direction as the previous update v, the previous update accelerates the current search.
When the current descent term -dx * lr points in the opposite direction to the previous update v, the previous update damps the current step.
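
A tiny numerical sketch of the two update rules above, minimizing func(x) = x * x (so dx = 2x); the learning rate and momentum values are arbitrary examples:

lr, momentum = 0.1, 0.9

# plain gradient descent: v = -dx * lr
x = 5.0
for _ in range(100):
    dx = 2 * x
    x += -dx * lr

# with momentum: v = -dx * lr + v * momentum
x_m, v = 5.0, 0.0
for _ in range(100):
    dx = 2 * x_m
    v = -dx * lr + v * momentum
    x_m += v

print(x, x_m)   # both end up close to the minimum at x = 0; the momentum run
                # overshoots and oscillates through 0 before settling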

Nesterov Accelerated Gradient (an improved version of Momentum, generally better)
https://zhuanlan.zhihu.com/p/22810533

learning rate
If the learning rate is too small, convergence to the optimum is slow.
If the learning rate is too large, the search tends to oscillate.

learning rate decay
When using gradient descent to minimize the objective func(x) = x * x, the update rule is x += v, where each update v = -dx * lr and dx is the first derivative of func(x) with respect to x. If lr is allowed to decay as the iterations go on, the step size keeps shrinking, which damps the oscillation. This is what the learning-rate decay factor does:

lr_i = lr_start * 1.0 / (1.0 + decay * i)

The smaller decay is, the more slowly the learning rate decays; when decay = 0 the learning rate stays constant.
The larger decay is, the faster the learning rate decays; decay = 1 gives the fastest decay.

(The section above is adapted from an external source.)
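
A quick numerical check of the decay formula (lr_start and the decay values below are arbitrary examples):

lr_start = 0.1

def lr_at(i, decay):
    # lr_i = lr_start * 1.0 / (1.0 + decay * i)
    return lr_start * 1.0 / (1.0 + decay * i)

for decay in (0.0, 0.01, 1.0):
    print(decay, [round(lr_at(i, decay), 4) for i in (0, 10, 100)])
# decay = 0    -> the learning rate never changes
# decay = 0.01 -> slow decay: 0.1, 0.0909, 0.05
# decay = 1    -> fast decay: 0.1, 0.0091, 0.001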

Weight decay
L2 regularization; see https://blog.csdn.net/program_developer/article/details/80867468

Why L2 regularization is rarely used now that BatchNorm exists

Since BatchNorm appeared, L2 regularization is rarely seen in practice. The intuition: BatchNorm was introduced to fix the large differences in data distribution between layers, and with BatchNorm each layer's data is allowed to vary only within a certain range. L2 regularization instead shrinks the values to prevent large input differences from causing the error to accumulate too much as it propagates layer by layer. In that sense BatchNorm solves the problem at its root, whereas L2 regularization is more of a trick.

https://baijiahao.baidu.com/s?id=1653085297096293714&wfr=spider&for=pc

Why optimizer.step() updates the parameters of Model()
optimizer.step() directly updates the parameters stored in optimizer.param_groups. Those parameters come from optimizer.add_param_group(), which takes the parameters read from model.named_parameters() and places them into optimizer.param_groups according to their intended use. So why do the model's parameters change after optimizer.step()? Because what gets passed around are references to the same tensors: whether they are stored in a dict or anywhere else, every update operates on the same underlying tensor data.
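
A small check of the reference-sharing argument above: the tensors in optimizer.param_groups are the very same objects as the model's parameters, so stepping the optimizer changes the model in place (a minimal sketch with a toy linear model):

import torch
import torch.nn as nn

model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# the param group stores references to the same tensor objects, not copies
print(optimizer.param_groups[0]['params'][0] is model.weight)   # True

loss = model(torch.randn(3, 4)).sum()
loss.backward()
before = model.weight.clone()
optimizer.step()                                   # updates the shared tensors in place
print(torch.equal(before, model.weight))           # False: model.weight has changed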

Adam
https://www.cnblogs.com/yifdu25/p/8183587.html

pg0, pg1, pg2 = [], [], []  # optimizer parameter groups
    for k, v in model.named_parameters():
        if v.requires_grad:
            if '.bias' in k:
                pg2.append(v)  # biases
                # In general, weight decay could be applied to every learnable parameter in the network.
                # In practice, applying it only to conv and fully-connected weights, and not to biases
                # or the BN gamma/beta parameters, works better.
            elif '.weight' in k and '.bn' not in k:
                pg1.append(v)  # apply weight decay
            else:
                pg0.append(v)  # all else

    if hyp['optimizer'] == 'adam':  # https://pytorch.org/docs/stable/_modules/torch/optim/lr_scheduler.html#OneCycleLR
        optimizer = optim.Adam(pg0, lr=hyp['lr0'], betas=(hyp['momentum'], 0.999))  # adjust beta1 to momentum
    else:
        optimizer = optim.SGD(pg0, lr=hyp['lr0'], momentum=hyp['momentum'], nesterov=True)

    optimizer.add_param_group({'params': pg1, 'weight_decay': hyp['weight_decay']})  # add pg1 with weight_decay
    optimizer.add_param_group({'params': pg2})  # add pg2 (biases)
    print('Optimizer groups: %g .bias, %g conv.weight, %g other' % (len(pg2), len(pg1), len(pg0)))
    del pg0, pg1, pg2

4. Exponential moving average
An exponential moving average (also called an exponentially weighted moving average) estimates the local mean of a variable, so that the tracked value depends on a window of recent history rather than on the latest observation alone.
https://www.cnblogs.com/wuliytTaotao/p/9479958.html
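
Before the ModelEMA code below, a one-variable sketch of the idea: the shadow value is a decayed mix of its previous value and the new observation, so it tracks a smoothed local mean (the decay and input sequence are arbitrary examples):

decay = 0.9
ema = 0.0
for x in [1.0, 2.0, 3.0, 100.0, 3.0, 2.0]:     # a sequence with one outlier
    ema = decay * ema + (1 - decay) * x        # shadow value follows x slowly
    print(round(ema, 3))
# the spike to 100 pulls the EMA up only part of the way instead of jumping to 100;
# note also the bias toward the initial value 0 early on, which is why ModelEMA below
# ramps its decay up from a small value during the first updates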

# Exponential moving average
    ema = torch_utils.ModelEMA(model) if rank in [-1, 0] else None

class ModelEMA:
    """ Model Exponential Moving Average from https://github.com/rwightman/pytorch-image-models
    Keep a moving average of everything in the model state_dict (parameters and buffers).
    This is intended to allow functionality like
    https://www.tensorflow.org/api_docs/python/tf/train/ExponentialMovingAverage
    A smoothed version of the weights is necessary for some training schemes to perform well.
    This class is sensitive where it is initialized in the sequence of model init,
    GPU assignment and distributed training wrappers.
    """
    def __init__(self, model, decay=0.9999, updates=0):
        self.ema = deepcopy(model.module if is_parallel(model) else model).eval()  # FP32 EMA
        self.decay = lambda x: decay * (1 - math.exp(-x / 2000))  # decay exponential ramp (to help early epochs)
        self.updates = updates  # number of EMA updates
        for p in self.ema.parameters():
            p.requires_grad_(False)

    def update(self, model):
        # Update EMA parameters
        with torch.no_grad():
            self.updates += 1
            beta = self.decay(self.updates)

            model_state_dict = model.module.state_dict() if is_parallel(model) else model.state_dict()

            for k, v in self.ema.state_dict().items():
                if v.dtype.is_floating_point:
                    v *= beta
                    v += (1. - beta) * model_state_dict[k].detach()

    def update_attr(self, model, include=(), exclude=('process_group', 'reducer')):
        # Update EMA attributes
        copy_attr(self.ema, model, include, exclude)

5. DP vs. DDP
DP (DataParallel): single machine, multiple GPUs
DDP (DistributedDataParallel): multiple machines, multiple GPUs
https://zhuanlan.zhihu.com/p/68717029

# DP mode
    if device.type != 'cpu' and rank == -1 and torch.cuda.device_count() > 1:
        model = torch.nn.DataParallel(model)
# DDP mode
    if device.type != 'cpu' and rank != -1:
        model = DDP(model, device_ids=[rank], output_device=rank)

6. Mixed precision training
To improve training efficiency in PyTorch, NVIDIA provides the mixed-precision training tool Apex, which claims to speed up training by 2-4x and roughly halve GPU memory consumption without hurting accuracy. The implementation here uses Apex's amp tool; the code change amounts to one wrapping call, model, optimizer = amp.initialize(model, optimizer, opt_level='O1'), plus the loss scaling below.
The actual flow: call amp.initialize to configure the model and optimizer according to the chosen opt_level, then wrap the backward pass with amp.scale_loss when back-propagating the loss.

    if mixed_precision:
        model, optimizer = amp.initialize(model, optimizer, opt_level='O1', verbosity=0)
# Backward
            if mixed_precision:
                with amp.scale_loss(loss, optimizer) as scaled_loss:
                    scaled_loss.backward()
            else:
                loss.backward()

7. Cosine annealing

# equivalent to lr_scheduler.CosineAnnealingLR with the minimum learning-rate factor set to 0.2
    lf = lambda x: (((1 + math.cos(x * math.pi / epochs)) / 2) ** 1.0) * 0.8 + 0.2  # cosine
    scheduler = lr_scheduler.LambdaLR(optimizer, lr_lambda=lf)
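
A quick check of the cosine factor: with this lambda the learning-rate multiplier ramps from 1.0 at epoch 0 down to 0.2 at the final epoch (a minimal sketch, assuming epochs = 300 as in the argparse default below):

import math

epochs = 300
lf = lambda x: (((1 + math.cos(x * math.pi / epochs)) / 2) ** 1.0) * 0.8 + 0.2

for x in (0, 75, 150, 225, 300):
    print(x, round(lf(x), 3))
# 0 -> 1.0, 75 -> 0.883, 150 -> 0.6, 225 -> 0.317, 300 -> 0.2
# the actual lr of each param group is initial_lr * lf(epoch), applied via LambdaLR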

8. Warm-up
(I do not quite understand why the bias learning rate here is ramped from 0.1 down to lr0.)

number_warmup = max(3 * number_batches, 1e3)       # number of warmup iterations, max(3 epochs, 1k iterations)
# warm up
            if batch_index_now <= number_warmup:
                xp = [0, number_warmup]

                accumulate = max(1, np.interp(batch_index_now, xp, [1, nominal_batch_size / total_batch_size]).round())

                for j, param in enumerate(optimizer.param_groups):
                    # set bias lr from 0.1 to lr0,
                    param['lr'] = np.interp(batch_index_now, xp,
                                            [0.1 if j == 2 else 0.0, param['initial_lr'] * lr_cosine(epoch)])
                    if 'momentum' in param:
                        param['momentum'] = np.interp(batch_index_now, xp, [0.9, hyp['momentum']])
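
A small sketch of how np.interp drives the ramp above: inside the warm-up window each value moves linearly between its two endpoints, e.g. the bias group's lr from 0.1 down to the scheduled lr (approximated here as lr0 = 0.01) and the momentum from 0.9 up to hyp['momentum'] = 0.937; the iteration counts are illustrative:

import numpy as np

number_warmup = 1000
xp = [0, number_warmup]

for batch_index_now in (0, 250, 500, 1000):
    bias_lr  = np.interp(batch_index_now, xp, [0.1, 0.01])    # pg2 (biases): 0.1 -> lr0
    other_lr = np.interp(batch_index_now, xp, [0.0, 0.01])    # other groups: 0 -> lr0
    momentum = np.interp(batch_index_now, xp, [0.9, 0.937])   # momentum: 0.9 -> hyp['momentum']
    print(batch_index_now, round(bias_lr, 4), round(other_lr, 4), round(momentum, 4))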

9. Class imbalance
Class weights and image weights
https://blog.csdn.net/Graceying/article/details/120288985

def labels_to_class_weights(labels, number_class=80):
    if labels[0] is None:
        return torch.Tensor()
    labels = np.concatenate(labels, 0)
    class_label = labels[:, 0].astype(int)  # labels = [class xywh]; class ids must be ints for np.bincount
    class_frequency = np.bincount(class_label, minlength=number_class)  # occurrences per class

    class_frequency[class_frequency == 0] = 1
    # use the reciprocal of each class's sample count as the class weight: rarer classes get larger weights
    class_weights = 1 / class_frequency
    # normalize the weights
    class_weights_norm = class_weights / class_weights.sum()
    return torch.from_numpy(class_weights_norm)

def labels_to_image_weights(labels, number_class=80, class_weights=np.ones(80)):
    # labels[i][:, 0] holds the classes of all targets in image i;
    # class_counts counts how many targets of each class appear in each image of the dataset
    class_counts = [np.bincount(labels[i][:, 0].astype(int), minlength=number_class) for i in range(len(labels))]
    # multiply each image's per-class target counts by the global class weights and sum
    # to obtain that image's weight
    image_weights = np.array(class_counts * class_weights.reshape(1, number_class)).sum(1)
    return image_weights
model.class_weights = labels_to_class_weights(dataset.labels, number_class).to(device)  # attach class weights
        if dataset.image_weights:
            # Generate indices.
            if rank in [-1, 0]:
                # as I understand it, this puts more weight on classes with poorer detection results (lower mAP)
                class_weights = model.class_weights.cpu().numpy() * (1 - maps) ** 2  # class weights
                image_weights = labels_to_image_weights(dataset.labels, number_class, class_weights)
                # re-sample the image indices according to the image weights
                dataset.indices = random.choices(range(dataset.num_files), weights=image_weights, k=dataset.num_files)
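
A tiny worked example of the two functions above (the labels here are made up; each row is [class, x, y, w, h] as in the excerpt, with class 0 common and class 1 rare):

import numpy as np

labels = [
    np.array([[0, .5, .5, .1, .1], [0, .2, .2, .1, .1], [0, .8, .8, .1, .1]]),  # image 0: three class-0 boxes
    np.array([[1, .5, .5, .2, .2]]),                                            # image 1: one class-1 box (rare class)
    np.array([[0, .3, .3, .2, .2]]),                                            # image 2: one class-0 box
]
number_class = 2

class_weights = labels_to_class_weights(labels, number_class)          # class counts [4, 1] -> weights [0.2, 0.8]
image_weights = labels_to_image_weights(labels, number_class, class_weights.numpy())
print(class_weights)     # tensor([0.2000, 0.8000], dtype=torch.float64)
print(image_weights)     # [0.6 0.8 0.2]: image 1 is the most likely to be re-sampled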

10. Loss computation (see the next chapter)

train.py code

# Hyperparameters
hyp = {'optimizer': 'SGD',  # ['adam', 'SGD', None] if none, default is SGD
       'lr0': 0.01,  # initial learning rate (SGD=1E-2, Adam=1E-3)
       'momentum': 0.937,  # SGD momentum/Adam beta1
       'weight_decay': 5e-4,  # optimizer weight decay
       'giou': 0.05,  # giou loss gain
       'cls': 0.5,  # cls loss gain
       'cls_pw': 1.0,  # cls BCELoss positive_weight
       'obj': 1.0,  # obj loss gain (*=img_size/320 if img_size != 320)
       'obj_pw': 1.0,  # obj BCELoss positive_weight
       'iou_t': 0.20,  # iou training threshold
       'anchor_t': 4.0,  # anchor-multiple threshold
       'fl_gamma': 0.0,  # focal loss gamma (efficientDet default is gamma=1.5)
       'hsv_h': 0.015,  # image HSV-Hue augmentation (fraction)
       'hsv_s': 0.7,  # image HSV-Saturation augmentation (fraction)
       'hsv_v': 0.4,  # image HSV-Value augmentation (fraction)
       'degrees': 0.0,  # image rotation (+/- deg)
       'translate': 0.0,  # image translation (+/- fraction)
       'scale': 0.5,  # image scale (+/- gain)
       'shear': 0.0}  # image shear (+/- deg)


def train(hyp, tb_writer, opt, device):
    # --------------------------------------------------------------------------------------------------------------
    # create results dir
    print(f'Hyperparameters {hyp}')
    log_dir = tb_writer.log_dir if tb_writer else 'runs/result'
    weight_dir = os.path.join(log_dir, 'weights')
    os.makedirs(weight_dir, exist_ok=True)
    best_pt_dir = os.path.join(weight_dir, 'best.pt')
    last_pt_dir = os.path.join(weight_dir, 'last.pt')
    results_file = os.path.join(log_dir, 'results.txt')

    with open(os.path.join(log_dir, 'hyp.yaml'), 'w') as f:
        yaml.dump(hyp, f, sort_keys=False)
    with open(os.path.join(log_dir, 'opt.yaml'), 'w') as f:
        yaml.dump(vars(opt), f, sort_keys=False)

    # TODO: Init DDP logging. Only the first process is allowed to log.
    epochs, batch_size, total_batch_size, weights, rank = \
        opt.epochs, opt.batch_size, opt.total_batch_size, opt.weights, opt.local_rank

    # Remove previous results
    if rank in [-1, 0]:
        for f in glob.glob(log_dir + os.sep + '*_batch*.jpg') + glob.glob(results_file):
            os.remove(f)

    # --------------------------------------------------------------------------------------------------------------
    # init random seeds
    init_seeds(2 + rank)

    # --------------------------------------------------------------------------------------------------------------
    # create model
    with open(opt.data, 'r') as f:
        data_dict = yaml.load(f, Loader=yaml.FullLoader)
    number_class, names = (1, ['item']) if opt.single_cls else (int(data_dict['nc']), data_dict['names'])
    assert len(names) == number_class, '%g names found for nc=%g dataset in %s' % (len(names), number_class, opt.data)  # check

    model = Model(opt.cfg, number_class=number_class).to(device)

    # img_size is multiple of max_stride
    max_stride = int(max(model.stride))
    img_size, img_size_val = [check_img_size(size, max_stride=max_stride) for size in opt.img_size]

    # --------------------------------------------------------------------------------------------------------------
    # define nominal batch size, set weight_decay
    nominal_batch_size = 64
    accumulate = max(round(nominal_batch_size / total_batch_size), 1)
    hyp['weight_decay'] *= total_batch_size * accumulate / nominal_batch_size

    # create optimizer parameter groups
    pg0, pg1, pg2 = [], [], []
    for k, v in model.named_parameters():
        if v.requires_grad:
            if '.bias' in k:    #bias
                pg2.append(v)
            elif '.weight' in k and 'bn' not in k:           # weight need weight decay
                pg1.append(v)
            else:
                pg0.append(v)

    # --------------------------------------------------------------------------------------------------------------
    # create optimizer
    if hyp['optimizer'] == 'adam':
        optimizer = optim.Adam(pg0, lr=hyp['lr0'], betas=(hyp['momentum'], 0.999))
    else:
        optimizer = optim.SGD(pg0, lr=hyp['lr0'], momentum=hyp['momentum'], nesterov=True)

    optimizer.add_param_group({'params': pg1, 'weight_decay': hyp['weight_decay']})  # add pg1 with weight_decay
    optimizer.add_param_group({'params': pg2})   # biases
    print('Optimizer groups: %g .bias, %g conv.weight, %g other' % (len(pg2), len(pg1), len(pg0)))
    del pg0, pg1, pg2

    # --------------------------------------------------------------------------------------------------------------
    # load model
    if weights == True:
        weights = ''  # train from scratch

    start_epoch, best_fitness = 0, 0.0

    if weights.endswith('.pt'):
        ckpt = torch.load(weights, map_location=device)
        try:
            exclude = ['anchor']
            ckpt['model'] = {k: v for k, v in ckpt['model'].float().state_dict().items()
                             if k in model.state_dict() and not any([x in k for x in exclude])
                             and model.state_dict()[k].shape == v.shape}
            model.load_state_dict(ckpt['model'], strict=False)
            print('Transferred %g/%g items from %s' % (len(ckpt['model']), len(model.state_dict()), weights))
        except KeyError as e:
            s = "%s is not compatible with %s. This may be due to model differences or %s may be out of date. " \
                "Please delete or update %s and try again, or use --weights '' to train from scratch." \
                % (weights, opt.cfg, weights, weights)
            raise KeyError(s) from e

        # load optimizer
        if ckpt['optimizer'] is not None:
            optimizer.load_state_dict(ckpt['optimizer'])
            best_fitness = ckpt['best_fitness']

        # load results
        if ckpt['training_results'] is not None:
            with open(results_file, 'w') as f:
                f.write(ckpt['training_results'])          # write results.txt

        # epochs
        start_epoch = ckpt['epoch'] + 1
        if epochs < start_epoch:
            print('%s has been trained for %g epochs. Fine-tuning for %g additional epochs.' %
                  (weights, ckpt['epoch'], epochs))
            epochs += ckpt['epoch']       # finetune additional epochs

        del ckpt

    # --------------------------------------------------------------------------------------------------------------
    # Mixed precision training https://github.com/NVIDIA/apex
    if mixed_precision:
        model, optimizer = amp.initialize(model, optimizer, opt_level="O1", verbosity=0)

    # --------------------------------------------------------------------------------------------------------------
    # Scheduler(learning rate decay) https://arxiv.org/pdf/1812.01187.pdf
    # lr_scheduler.CosineAnnealingLR
    lr_cosine = lambda x: (((1 + math.cos(x * math.pi / epochs)) / 2) ** 1.0) * 0.8 + 0.2  # cosine
    scheduler = lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_cosine)

    # --------------------------------------------------------------------------------------------------------------
    # DataParallel
    if device.type != 'cpu' and rank == -1 and torch.cuda.device_count() > 1:
        model = torch.nn.DataParallel(model)

    # --------------------------------------------------------------------------------------------------------------
    # Exponential moving average
    ema = torch_utils.ModelEMA(model) if rank in [-1, 0] else None

    # --------------------------------------------------------------------------------------------------------------
    # DistributedDataParallel
    if device.type != 'cpu' and rank != -1:
        if opt.sync_bn:
            model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
        model = DDP(model, device_ids=[rank], output_device=rank)

    # --------------------------------------------------------------------------------------------------------------
    # prepare data
    # trainloader
    train_data_dir = data_dict['train']
    dataloader, dataset = create_dataloader(train_data_dir, img_size, batch_size, max_stride, opt, hyp,
                                            augment=True, cache=opt.cache_images, rect=opt.rect,
                                            local_rank=rank, world_size=opt.world_size)
    max_label_class = np.concatenate(dataset.labels, 0)[:, 0].max()
    number_batches = len(dataloader)
    assert max_label_class < number_class, 'Label class %g exceeds nc=%g in %s. Possible class labels are 0-%g' % \
                                           (max_label_class, number_class, opt.data, number_class - 1)

    # testloader
    val_data_dir = data_dict['val']
    if rank in [-1, 0]:
        valloader = create_dataloader(val_data_dir, img_size_val, total_batch_size, max_stride, opt, hyp,
                                      augment=False, cache=opt.cache_images, rect=True, local_rank=-1,
                                      world_size=opt.world_size)[0]

    # --------------------------------------------------------------------------------------------------------------
    # Model parameters
    hyp['cls'] *= number_class / 80.  # scale coco-tuned hyp['cls'] to current dataset
    model.number_class = number_class
    model.hyp = hyp
    model.iou_ratio = 1.0
    model.class_weights = labels_to_class_weights(dataset.labels, number_class).to(device)  # attach class weights
    model.names = names

    # --------------------------------------------------------------------------------------------------------------
    # plot labels class frequency
    if rank in [-1, 0]:
        labels = np.concatenate(dataset.labels, 0)
        c = torch.tensor(labels[:, 0])   # class
        plot_labels(labels, save_dir=log_dir)
        if tb_writer:
            tb_writer.add_histogram('classes', c, 0)

    # --------------------------------------------------------------------------------------------------------------
    # start training
    t0 = time.time()
    number_warmup = max(3 * number_batches, 1e3)       # number of warmup iterations, max(3 epochs, 1k iterations)
    maps = np.zeros(number_class)   # mAP per class
    results = (0, 0, 0, 0, 0, 0, 0)    # 'P', 'R', 'mAP', 'F1', 'val GIoU', 'val Objectness', 'val Classification'
    scheduler.last_epoch = start_epoch - 1
    if rank in [-1, 0]:
        print('Image sizes %g train, %g test' % (img_size, img_size_val))
        print('Using %g dataloader workers' % dataloader.num_workers)
        print('Starting training for %g epochs...' % epochs)

    for epoch in range(start_epoch, epochs):

        model.train()

        # Update image weights (optional)
        # When in DDP mode, the generated indices will be broadcasted to synchronize dataset.
        if dataset.image_weights:
            # Generate indices.
            if rank in [-1, 0]:
                class_weights = model.class_weights.cpu().numpy() * (1 - maps) ** 2  # class weights
                image_weights = labels_to_image_weights(dataset.labels, number_class, class_weights)
                dataset.indices = random.choices(range(dataset.num_files), weights=image_weights, k=dataset.num_files)

            # Broadcast if DDP
            if rank != -1:
                indices = torch.zeros([dataset.num_files], dtype=torch.int)
                if rank == 0:
                    indices[:] = torch.tensor(dataset.indices, dtype=torch.int)
                dist.broadcast(indices, 0)
                if rank != 0:
                    dataset.indices = indices.cpu().numpy()

        mloss = torch.zeros(4, device=device)  # mean losses

        # DDP random indices
        if rank != -1:
            dataloader.sampler.set_epoch(epoch)

        pbar = enumerate(dataloader)
        if rank in [-1, 0]:
            print(('\n' + '%10s' * 8) % ('Epoch', 'gpu_mem', 'CIoU_add', 'obj', 'cls', 'total', 'targets', 'img_size'))
            pbar = tqdm(pbar, total=number_batches)  # progress bar

        optimizer.zero_grad()

        for i, (img_batch, label_batch, paths, _) in pbar:

            batch_index_now = i + number_batches * epoch   # number integrated batches (since train start)
            img_batch = img_batch.to(device, non_blocking=True).float() / 255.0 # uint8 to float32, 0 - 255 to 0.0 - 1.0

            # ----------------------------------------------------------------------------------------------------------
            # warm up
            if batch_index_now <= number_warmup:
                xp = [0, number_warmup]

                accumulate = max(1, np.interp(batch_index_now, xp, [1, nominal_batch_size / total_batch_size]).round())

                for j, param in enumerate(optimizer.param_groups):
                    # set bias lr from 0.1 to lr0,
                    param['lr'] = np.interp(batch_index_now, xp,
                                            [0.1 if j == 2 else 0.0, param['initial_lr'] * lr_cosine(epoch)])
                    if 'momentum' in param:
                        param['momentum'] = np.interp(batch_index_now, xp, [0.9, hyp['momentum']])

            # ----------------------------------------------------------------------------------------------------------
            # Multi-scale
            if opt.multi_scale:
                scale_size = random.randrange(int(img_size * 0.5), int(img_size * 1.5 + max_stride)) // max_stride * max_stride
                ratio = scale_size / max(img_batch.shape[2:])
                if ratio != 1:
                    new_shape = [math.ceil(x * ratio / max_stride) * max_stride for x in img_batch.shape[2:]]
                    img_batch = F.interpolate(img_batch, size=new_shape, mode='bilinear', align_corners=False)

            # ----------------------------------------------------------------------------------------------------------
            # forward
            pred = model(img_batch)

            # ----------------------------------------------------------------------------------------------------------
            # compute loss
            loss, loss_items = compute_loss(pred, label_batch.to(device), model)

            if rank != -1:
                loss *= opt.world_size    # gradient averaged between devices in DDP mode
            if not torch.isfinite(loss):
                print('WARNING: non-finite loss, ending training ', loss_items)
                return results

            # ----------------------------------------------------------------------------------------------------------
            # loss back propagation
            if mixed_precision:
                with amp.scale_loss(loss, optimizer) as scaled_loss:
                    scaled_loss.backward()
            else:
                loss.backward()

            # ----------------------------------------------------------------------------------------------------------
            # optimizer
            if batch_index_now % accumulate == 0:
                optimizer.step()
                optimizer.zero_grad()
                if ema is not None:
                    ema.update(model)

            # ----------------------------------------------------------------------------------------------------------
            # print
            if rank in [-1, 0]:
                mloss = (mloss * i + loss_items) / (i + 1)  # update mean losses
                mem = '%.3gG' % (torch.cuda.memory_cached() / 1E9 if torch.cuda.is_available() else 0)  # (GB)
                s = ('%10s' * 2 + '%10.4g' * 6) % (
                    '%g/%g' % (epoch, epochs - 1), mem, *mloss, label_batch.shape[0], img_batch.shape[-1])
                pbar.set_description(s)

            # ----------------------------------------------------------------------------------------------------------
            # Plot
            if batch_index_now < 3:
                f = os.path.join(log_dir, 'train_batch%g.jpg' % batch_index_now)   # filename
                result = plot_images(images=img_batch, targets=label_batch, paths=paths, fname=f)
                if tb_writer and result is not None:
                    tb_writer.add_image(f, result, dataformats='HWC', global_step=epoch)

            # end batch ------------------------------------------------------------------------------------------------

        # --------------------------------------------------------------------------------------------------------------
        # Scheduler
        scheduler.step()

        # --------------------------------------------------------------------------------------------------------------
        # validation

        # Only the first process in DDP mode is allowed to log or save checkpoints.
        if rank in [-1, 0]:
            if ema is not None:
                ema.update_attr(model, include=['model_yaml', 'number_class', 'hyp', 'giou_ratio', 'names', 'stride'])
            final_epoch = epoch + 1 == epochs
            if not opt.notest or final_epoch:  # Calculate mAP
                results, maps, times = test.test()

            # write
            with open(results_file, 'a') as f:
                f.write(s + '%10.4g' * 8 % results + '\n')    # P, R, mAP, F1, test_losses=(GIoU, obj, cls)

            # tensorboard
            if tb_writer:
                tags = ['train/giou_loss', 'train/obj_loss', 'train/cls_loss',
                        'metrics/precision', 'metrics/recall', 'metrics/mAP_0.5', 'metrics/mAP_0.75',
                        'metrics/mAP_0.5:0.95', 'val/giou_loss', 'val/obj_loss', 'val/cls_loss']
                for x, tag in zip(list(mloss[:-1]) + list(results), tags):
                    tb_writer.add_scalar(tag, x, epoch)

            # Update best mAP
            fi = results[4]
            if fi > best_fitness:
                best_fitness = fi

            # ----------------------------------------------------------------------------------------------------------
            # save model
            save = (not opt.nosave) or final_epoch
            if save:
                with open(results_file, 'r') as f:  # create checkpoint
                    ckpt = {'epoch': epoch,
                            'best_fitness': best_fitness,
                            'training_results': f.read(),
                            'model': ema.ema.module if hasattr(ema, 'module') else ema.ema,
                            'optimizer': optimizer.state_dict() if not final_epoch else None}
                # Save last, best and delete
                torch.save(ckpt, last_pt_dir)
                if (best_fitness == fi) and not final_epoch:
                    torch.save(ckpt, best_pt_dir)
                del ckpt
        # end epoch ----------------------------------------------------------------------------------------------------
    # end training

    if rank in [-1, 0]:
        # Strip optimizers
        # isnumeric() checks whether the string consists of digits only (defined on str/unicode objects)
        n = ('_' if len(opt.name) and not opt.name.isnumeric() else '') + opt.name
        fresults, flast, fbest = 'results%s.txt' % n, os.path.join(weight_dir, 'last%s.pt' % n), os.path.join(weight_dir, 'best%s.pt' % n)
        for f1, f2 in zip([os.path.join(weight_dir, 'last.pt'), os.path.join(weight_dir, 'best.pt'), 'results.txt'], [flast, fbest, fresults]):
            if os.path.exists(f1):
                os.rename(f1, f2)  # rename
                ispt = f2.endswith('.pt')  # is *.pt
                strip_optimizer(f2) if ispt else None  # strip optimizer
        # Finish
        plot_results(save_dir=log_dir)   # save as results.png
        print('%g epochs completed in %.3f hours.\n' % (epoch - start_epoch + 1, (time.time() - t0) / 3600))
    dist.destroy_process_group() if rank not in [-1, 0] else None
    torch.cuda.empty_cache()
    return results
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--cfg', type=str, default='models/yolov4l-mish.yaml', help='model.yaml path')
    parser.add_argument('--data', type=str, default='data/dota.yaml', help='data.yaml path')
    parser.add_argument('--hyp', type=str, default='', help='hyp.yaml path (optional)')
    parser.add_argument('--epochs', type=int, default=300)
    parser.add_argument('--batch-size', type=int, default=8, help="Total batch size for all gpus.")  # 16
    parser.add_argument('--img-size', nargs='+', type=int, default=[1024, 1024], help='train,test sizes')
    parser.add_argument('--resume', nargs='?', const='get_last', default=False,
                        help='resume from given path/to/last.pt, or most recent run if blank.')
    parser.add_argument('--weights', type=str, default='weights/yolov4s-mish.pt', help='initial weights path')
    parser.add_argument('--device', default='', help='cuda device, i.e. 0 or 0,1,2,3 or cpu')

    parser.add_argument('--rect', action='store_true', help='rectangular training')
    parser.add_argument('--nosave', action='store_true', help='only save final checkpoint')
    parser.add_argument('--notest', action='store_true', help='only test final epoch')
    parser.add_argument('--noautoanchor', action='store_true', help='disable autoanchor check')
    parser.add_argument('--bucket', type=str, default='', help='gsutil bucket')
    parser.add_argument('--cache-images', action='store_true', help='cache images for faster training')
    parser.add_argument('--name', default='', help='renames results.txt to results_name.txt if supplied')
    parser.add_argument('--multi-scale', action='store_true', help='vary img-size +/- 50%%')
    parser.add_argument('--single-cls', action='store_true', help='train as single-class dataset')
    parser.add_argument('--sync-bn', action='store_true', help='use SyncBatchNorm, only available in DDP mode')
    parser.add_argument('--local_rank', type=int, default=-1, help='DistributedDataParallel parameter, do not modify')
    opt = parser.parse_args()

    # --------------------------------------------------------------------------------------------------------------
    # resume from most recent run
    last = get_latest_run() if opt.resume == 'get_last' and not opt.weights else opt.resume

    if last and not opt.weights:
        print(f'resume from {last}')
    opt.weights = last if opt.resume and not opt.weights else opt.weights

    # --------------------------------------------------------------------------------------------------------------
    # check_git_status check file
    if opt.local_rank in [0, -1]:
        check_git_status()

    opt.cfg = check_file(opt.cfg)

    opt.data = check_file(opt.data)

    # --------------------------------------------------------------------------------------------------------------
    # update hyps
    if opt.hyp:
        opt.hyp = check_file(opt.hyp)
        with open(opt.hyp) as f:
            hyp.update(yaml.load(f, Loader=yaml.FullLoader))

    # --------------------------------------------------------------------------------------------------------------
    # extend img_size to 2 sizes (train, test)
    opt.img_size.extend([opt.img_size[-1]] * (2 - len(opt.img_size)))

    # --------------------------------------------------------------------------------------------------------------
    # device, total_batch_size, DDP mode
    device = select_device(opt.device, apex=mixed_precision, batch_size=opt.batch_size)

    opt.total_batch_size = opt.batch_size
    opt.world_size = 1

    if device.type == 'cpu':
        mixed_precision = False
    elif opt.local_rank != -1:
        # DDP mode
        assert torch.cuda.device_count() > opt.local_rank
        torch.cuda.set_device(opt.local_rank)
        device = torch.device('cuda', opt.local_rank)
        # distributed backend
        dist.init_process_group(backend='nccl', init_method='env://')
    print(opt)

    # --------------------------------------------------------------------------------------------------------------
    # tensorboard
    if opt.local_rank in [-1, 0]:
        tb_writer = SummaryWriter(log_dir=increment_dir('runs/exp', opt.name))
    else:
        tb_writer = None

    # --------------------------------------------------------------------------------------------------------------
    # train
    train(hyp, tb_writer, opt, device)
    # --------------------------------------------------------------------------------------------------------------