[Distributed Training (4)] A Deeper Look at accelerator.sync_gradients and Checkpointing

Latest recommended article published on 2024-10-15 21:54:01

多恩Stone

Category: Programming Study. Tags: distributed, AIGC, python, AI, deep learning, neural networks

Article link: https://blog.csdn.net/weixin_44212848/article/details/142929819

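[Distributed Training Debug] VS Code Debug Tips: Useful launch.json Parameters
[Distributed Training (2)] Understanding DeepSpeed's ZeRO Memory Optimization Strategy (Differences Between the Three Stages)
[Distributed Training (3)] Fixing the accelerator + deepspeed Debug Error "Timed out waiting for debuggee to spawn" ✅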

Contents

- accelerator.sync_gradients
- checkpointing

accelerator.sync_gradients

sync_gradients (gradient synchronization)

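Gradient synchronization is the step in distributed training where gradients are combined across multiple training nodes (or GPUs).
Each node computes its own gradients (the partial derivatives of the loss with respect to the model parameters) on its own shard of the data, and these gradients are then aggregated, typically by averaging, so that every replica applies the same update and the model parameters stay consistent.
In Accelerate, accelerator.sync_gradients is not the synchronization routine itself. It is a boolean attribute of the Accelerator object that reports whether gradients are being synchronized on the current iteration. With gradient accumulation, synchronization is skipped on intermediate micro-batches and happens only on the iteration where the optimizer actually updates the parameters.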
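To make the aggregation step concrete, here is a minimal pure-Python sketch of averaging gradients across workers before an update. It is only an illustration: the helper names all_reduce_mean and sgd_step are hypothetical, the workers are simulated as plain lists, and a real setup would rely on the collective operations of the distributed backend rather than code like this.

```python
from typing import List


def all_reduce_mean(per_worker_grads: List[List[float]]) -> List[float]:
    """Average gradients element-wise across workers (hypothetical helper)."""
    num_workers = len(per_worker_grads)
    return [sum(values) / num_workers for values in zip(*per_worker_grads)]


def sgd_step(params: List[float], grads: List[float], lr: float) -> List[float]:
    """Plain SGD update; stands in for whatever optimizer is actually used."""
    return [p - lr * g for p, g in zip(params, grads)]


# Two simulated workers start from identical parameters
# but compute different gradients on different data shards.
params = [1.0, 2.0]
worker_grads = [[0.5, -1.0], [1.5, 3.0]]

# After synchronization every worker holds the same averaged gradient...
synced = all_reduce_mean(worker_grads)

# ...so applying the same update keeps all replicas identical.
new_params = [sgd_step(params, synced, lr=0.5) for _ in worker_grads]
print(synced, new_params)
```

Because every worker applies the same averaged gradient to the same starting parameters, the replicas cannot drift apart, which is the consistency property described above.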

# Checks if the accelerator has performed an optimization step behind the scenes
if accelerator.sync_gradients:
    progress_bar.update(1)
    global_step += 1
    accelerator.log({"train_loss": train_loss}, step=global_step)
    train_loss = 0.0

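In this snippet, accelerator.sync_gradients is a boolean flag. It is True only when gradients were synchronized on the current iteration, which means an optimization step has actually been performed.
Only in that case does the code advance the progress bar, increment global_step, log the accumulated train_loss, and reset train_loss to 0.0 for the next accumulation window.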
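The relationship between micro-batches and optimization steps can be sketched without any library. The ToyAccelerator class below is a hypothetical stand-in that mimics only the flag, under the simplifying assumption that synchronization happens exactly once every gradient_accumulation_steps micro-batches; the real library handles additional cases (for example the end of a dataloader) that this sketch ignores.

```python
class ToyAccelerator:
    """Hypothetical stand-in that mimics only the sync_gradients flag."""

    def __init__(self, gradient_accumulation_steps: int) -> None:
        self.gradient_accumulation_steps = gradient_accumulation_steps
        self.sync_gradients = False
        self._micro_step = 0

    def accumulate_step(self) -> None:
        """Process one micro-batch and update the flag."""
        self._micro_step += 1
        # Assumption: sync only when an accumulation window is complete.
        self.sync_gradients = (
            self._micro_step % self.gradient_accumulation_steps == 0
        )


def run(num_micro_batches: int, accumulation_steps: int) -> int:
    """Count how many optimization steps a training loop would record."""
    accelerator = ToyAccelerator(accumulation_steps)
    global_step = 0
    for _ in range(num_micro_batches):
        accelerator.accumulate_step()
        # Same guard as in the training loop: count real optimizer steps only.
        if accelerator.sync_gradients:
            global_step += 1
    return global_step


steps_taken = run(num_micro_batches=8, accumulation_steps=4)
print(steps_taken)
```

With 8 micro-batches and an accumulation window of 4, global_step advances twice rather than eight times, which is why the progress bar and the logging are guarded by the flag instead of running on every iteration.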

checkpointing

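Checkpointing is a mechanism for saving the key state of a training run so that training can resume from that point after a failure or a deliberate interruption.
In deep learning, a checkpoint usually contains the model parameters (weights and biases), the optimizer state (such as momentum terms), and often the current loss value and iteration count.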
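As a rough illustration of what such a checkpoint holds, the sketch below saves and restores a small state dictionary with the standard library. Everything here is an assumption made for the example: the function names save_checkpoint and load_checkpoint, the JSON format, and the field names are hypothetical, and real training code would use the framework's own serialization for tensors and optimizer state.

```python
import json
import os
import tempfile


def save_checkpoint(path, model_state, optimizer_state, step, loss):
    """Write the pieces needed to resume training into one file."""
    checkpoint = {
        "model_state": model_state,          # weights and biases
        "optimizer_state": optimizer_state,  # e.g. momentum buffers
        "step": step,                        # current iteration count
        "loss": loss,                        # last recorded loss value
    }
    with open(path, "w") as f:
        json.dump(checkpoint, f)


def load_checkpoint(path):
    """Read a checkpoint back so training can continue from it."""
    with open(path) as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as tmp_dir:
    ckpt_path = os.path.join(tmp_dir, "checkpoint-100.json")
    save_checkpoint(
        ckpt_path,
        model_state={"weight": [0.1, -0.2], "bias": [0.0]},
        optimizer_state={"momentum": [0.01, 0.02]},
        step=100,
        loss=0.35,
    )
    restored = load_checkpoint(ckpt_path)

print(restored["step"], restored["model_state"])
```

Resuming then amounts to loading the restored values back into the model and optimizer and continuing the loop from the saved step.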

if global_step % args.checkpointing_steps == 0:
    if accelerator.is_main_process:
        # _before_ saving state, check if this save would set us over the `checkpoints_total_limit`
        if args.checkpoints_total_limit is not None:
            checkpoints = os.listdir(args.output_dir)
            checkpoints = [d for d in checkpoints if d.startswith("checkpoint")]
            # sort by the step number in the directory name, oldest first
            checkpoints = sorted(checkpoints, key=lambda x: int(x.split("-")[1]))

            # before we save the new checkpoint, we need to have at _most_ `checkpoints_total_limit - 1` checkpoints
            if len(checkpoints) >= args.checkpoints_total_limit:
                num_to_remove = len(checkpoints) - args.checkpoints_total_limit + 1
                removing_checkpoints = checkpoints[0:num_to_remove]

                logger.info(
                    f"{len(checkpoints)} checkpoints already exist, removing {len(removing_checkpoints)} checkpoints"
                )
                logger.info(f"removing checkpoints: {', '.join(removing_checkpoints)}")

                for removing_checkpoint in removing_checkpoints:
                    removing_checkpoint = os.path.join(args.output_dir, removing_checkpoint)
                    shutil.rmtree(removing_checkpoint)