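If you have already finished 100 epochs of training and want to train for some additional epochs on top of that, the following steps work. They are based on the official ultralytics tutorial, and I have tested them myself:
Step 1: In ultralytics/engine/trainer.py, comment out self.epochs = self.args.epochs and set self.epochs to the total number of epochs you want. Here it is changed to 200, which adds 100 epochs to the 100 already completed.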
class BaseTrainer:
    def __init__(self, cfg=DEFAULT_CFG, overrides=None, _callbacks=None):
        self.args = get_cfg(cfg, overrides)
        self.check_resume(overrides)
        self.device = select_device(self.args.device, self.args.batch)
        self.validator = None
        self.metrics = None
        self.plots = {}
        init_seeds(self.args.seed + 1 + RANK, deterministic=self.args.deterministic)
        self.save_dir = get_save_dir(self.args)
        self.args.name = self.save_dir.name  # update name for loggers
        self.wdir = self.save_dir / "weights"  # weights dir
        if RANK in {-1, 0}:
            self.wdir.mkdir(parents=True, exist_ok=True)  # make dir
            self.args.save_dir = str(self.save_dir)
            yaml_save(self.save_dir / "args.yaml", vars(self.args))  # save run args
        self.last, self.best = self.wdir / "last.pt", self.wdir / "best.pt"  # checkpoint paths
        self.save_period = self.args.save_period
        self.batch_size = self.args.batch
        # self.epochs = self.args.epochs
        self.epochs = 200  # changed here: new total number of epochs
        self.start_epoch = 0
        if RANK == -1:
            print_args(vars(self.args))
Step 2: In ultralytics/engine/trainer.py, find the resume_training function and set start_epoch to the number of epochs already completed. Since 100 epochs are done, set start_epoch = 100.
def resume_training(self, ckpt):
    """Resume YOLO training from given epoch and best fitness."""
    # ckpt = torch.load('runs/obb/train/weights/last.pt')
    if ckpt is None or not self.resume:
        return
    best_fitness = 0.0
    # start_epoch = ckpt.get("epoch", -1) + 1
    start_epoch = 100
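To see what the commented-out line computes, it can be tried on plain dictionaries. This is only a minimal sketch, not the library's code: the helper name next_epoch and both dictionaries are hypothetical stand-ins for a loaded checkpoint, and the idea that the last.pt of a finished run stores an epoch of -1 is an assumption you should check against your own file.

```python
def next_epoch(ckpt: dict) -> int:
    """Mirror the original line: start_epoch = ckpt.get("epoch", -1) + 1."""
    return ckpt.get("epoch", -1) + 1


# Hypothetical stand-ins for checkpoints loaded from last.pt.
interrupted = {"epoch": 49}  # run stopped after epoch index 49
finished = {"epoch": -1}     # assumption: a completed run stores -1

print(next_epoch(interrupted))  # 50
print(next_epoch(finished))     # 0
```

Under that assumption the computed start epoch is 0, so there is nothing for the trainer to resume from, which would explain why the value is hard-coded to 100 here.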
Step 3: Following the official example, write a resume script and start training.
from ultralytics import YOLO
# Load a model
model = YOLO('runs/detect/train6/weights/last.pt') # load a partially trained model
results = model.train(resume=True)
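Training resumes successfully.
Step 4: Important! Once training is finished, revert every change made to ultralytics/engine/trainer.py. Otherwise the epochs argument will have no effect the next time you train.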
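One way to make the revert hard to forget is to copy the file before editing it and copy it back afterwards. The following is a rough sketch using only the standard library. The helper names backup and restore and the ".bak" suffix are my own choices, and the demo runs on a throwaway file rather than on the real trainer.py.

```python
import shutil
import tempfile
from pathlib import Path


def backup(path: Path) -> Path:
    """Copy `path` to `<name>.bak` next to it and return the backup path."""
    bak = path.with_name(path.name + ".bak")
    shutil.copy2(path, bak)
    return bak


def restore(path: Path) -> None:
    """Overwrite `path` with its `.bak` copy, then delete the copy."""
    bak = path.with_name(path.name + ".bak")
    shutil.copy2(bak, path)
    bak.unlink()


# Demo on a temporary file; for real use, point `target` at your local trainer.py.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "trainer.py"
    target.write_text("self.epochs = self.args.epochs\n")
    backup(target)
    target.write_text("self.epochs = 200\n")  # the temporary edit
    restore(target)
    restored = target.read_text()

print(restored)
```

If ultralytics is installed from a git checkout, reverting the file with git achieves the same thing.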
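Summary: adding epochs to a model whose training has already finished still relies on the resume mechanism. Running the resume script directly, however, makes YOLO report an error saying that training is already finished and there is nothing to resume. That is why self.epochs (the total number of training epochs) is forced to the new, larger total and start_epoch is forced to the number of epochs already completed.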
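The two hard-coded numbers follow from simple arithmetic, sketched below. The helper extension_plan is hypothetical and not part of ultralytics. It only computes the values that would be written into trainer.py by hand.

```python
from typing import Tuple


def extension_plan(completed: int, extra: int) -> Tuple[int, int]:
    """Return (total_epochs, start_epoch) for extending a finished run."""
    if completed < 0 or extra <= 0:
        raise ValueError("completed must be >= 0 and extra must be > 0")
    total_epochs = completed + extra  # value to hard-code for self.epochs
    start_epoch = completed           # value to hard-code for start_epoch
    return total_epochs, start_epoch


total, start = extension_plan(completed=100, extra=100)
print(total, start)              # 200 100
print(len(range(start, total)))  # 100 epochs left to run
```

For the case in this post, 100 completed epochs plus 100 extra epochs gives self.epochs = 200 and start_epoch = 100.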