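1. The API changes a lot between versions, so when consulting the docs, make sure you are reading the latest version.
2. 'ddp' was not faster than the default parallel strategy. Since the speed was already satisfactory, I did not look further into ddp optimization.
3. Adding an image to the TensorBoard logger through self.log raises an error. You can call the underlying TensorBoard method directly instead: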
self.logger.experiment.add_image("target image", target_img_plot, self.global_step, dataformats='NCHW')
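A minimal sketch of wrapping that call in a helper, assuming the logger's `experiment` attribute is a TensorBoard `SummaryWriter`, as it appears to be in the line above. The helper name `log_image_batch` is hypothetical. It returns False instead of raising when the underlying object has no `add_image` method, for example when a different logger backend is in use.

```python
def log_image_batch(logger, tag, images, step):
    """Log an NCHW image batch through the logger's underlying writer.

    `logger` is assumed to expose a TensorBoard SummaryWriter as `.experiment`.
    Returns True if the image was logged, False if the writer cannot log images.
    """
    writer = getattr(logger, "experiment", None)
    add_image = getattr(writer, "add_image", None)
    if add_image is None:
        # No logger attached, or the backend is not TensorBoard-like.
        return False
    add_image(tag, images, step, dataformats="NCHW")
    return True
```

Inside a LightningModule it could be called as `log_image_batch(self.logger, "target image", target_img_plot, self.global_step)`.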
4. global_step counts optimizer updates, not plain iterations, so with n optimizers its value is n times the iteration count.
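As a rough illustration of the arithmetic, assuming every optimizer steps exactly once per training batch (which may not hold with manual optimization or gradient accumulation), the two helpers below convert between iterations and global_step. Both function names are hypothetical.

```python
def iterations_to_global_step(iterations, n_optimizers):
    """global_step reached after `iterations` batches, one update per optimizer per batch."""
    return iterations * n_optimizers


def global_step_to_iterations(global_step, n_optimizers):
    """Number of completed batches that a given global_step corresponds to."""
    return global_step // n_optimizers


print(iterations_to_global_step(50000, 2))   # 100000
print(global_step_to_iterations(100000, 2))  # 50000
```

This matters when reading step-indexed curves: with two optimizers, a point logged at step 100000 corresponds to iteration 50000.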
5. To save the model every N iterations:
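import os

# Assuming the classic `pytorch_lightning` package name.
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# Save a checkpoint every 50000 training steps and keep all of them.
checkpoint_callback = ModelCheckpoint(
    every_n_train_steps=50000,
    every_n_epochs=0,  # turn off the epoch-based trigger
    auto_insert_metric_name=False,  # do not prefix the value with "step="
    dirpath=os.path.join(opt.checkpoints_dir, opt.name),
    save_top_k=-1,  # save all models
    filename="step_{step}",
)
trainer = Trainer(
    ...
    callbacks=[checkpoint_callback],
    ...
)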
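To resume from the most recent of these checkpoints, the step number can be parsed out of the file name. A minimal sketch, assuming the files end up named like `step_50000.ckpt`; the `.ckpt` extension and the helper name `latest_step_checkpoint` are assumptions, not taken from the notes above. It compares steps numerically, since a plain string sort would put `step_100000.ckpt` before `step_50000.ckpt`.

```python
import re

# Matches names such as "step_50000.ckpt" and captures the step number.
_STEP_RE = re.compile(r"^step_(\d+)\.ckpt$")


def latest_step_checkpoint(filenames):
    """Return the file name with the highest step number, or None if none match."""
    best_step, best_name = -1, None
    for name in filenames:
        match = _STEP_RE.match(name)
        if match is None:
            continue  # ignore files that do not follow the step_<N>.ckpt pattern
        step = int(match.group(1))
        if step > best_step:
            best_step, best_name = step, name
    return best_name
```

It could be used as `latest_step_checkpoint(os.listdir(ckpt_dir))`, where `ckpt_dir` is the same directory passed as `dirpath`.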