DeepSpeed multi-node, multi-GPU: resuming pretraining from a checkpoint

Latest recommended article published on 2024-08-16 15:53:37

AI生成曾小健2

Article link: https://blog.csdn.net/zengxiaojian2/article/details/140578978

Below is a detailed breakdown of the log you provided and of what it means for training:

Log analysis

Resuming training

gpu009: [INFO|trainer.py:2109] 2024-07-20 22:38:47,510 >>   Continuing training from checkpoint, will skip to saved global_step
gpu009: [INFO|trainer.py:2110] 2024-07-20 22:38:47,510 >>   Continuing training from epoch 0
gpu009: [INFO|trainer.py:2111] 2024-07-20 22:38:47,510 >>   Continuing training from global step 2760
gpu009: [INFO|trainer.py:2113] 2024-07-20 22:38:47,510 >>   Will skip the first 0 epochs then the first 2760 batches in the first epoch.

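Meaning: training resumes from the last saved checkpoint:
- epoch 0: training continues within epoch 0, so no complete epochs are skipped.
- global step 2760: training continues from global step 2760, the step recorded in the checkpoint.
- Skipped batches: the first 2760 batches of epoch 0 are skipped, so training picks up at the point in the data where the checkpoint was saved.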
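
The arithmetic behind these log lines can be sketched as follows. This is a minimal illustration, not the Trainer's actual code: the function name `resume_position` and the figure of 10000 optimizer steps per epoch are assumptions made up for the example. Only the global step 2760 comes from the log.

```python
def resume_position(global_step, updates_per_epoch, grad_accum_steps=1):
    """Return (epochs_to_skip, batches_to_skip) for a resumed run.

    global_step       -- optimizer steps completed when the checkpoint was saved
    updates_per_epoch -- optimizer steps in one full epoch (assumed known)
    grad_accum_steps  -- batches consumed per optimizer step
    """
    if updates_per_epoch <= 0 or grad_accum_steps <= 0:
        raise ValueError("updates_per_epoch and grad_accum_steps must be positive")
    # Whole epochs already finished before the checkpoint.
    epochs_to_skip = global_step // updates_per_epoch
    # Optimizer steps already taken inside the partially finished epoch.
    updates_into_epoch = global_step % updates_per_epoch
    # Each optimizer step consumed grad_accum_steps batches.
    batches_to_skip = updates_into_epoch * grad_accum_steps
    return epochs_to_skip, batches_to_skip


if __name__ == "__main__":
    # 10000 updates per epoch is a hypothetical value; 2760 is from the log.
    epochs, batches = resume_position(2760, updates_per_epoch=10000)
    print(f"Will skip the first {epochs} epochs then the first {batches} batches")
```

With any epoch length above 2760 optimizer steps and no gradient accumulation, this reproduces the "0 epochs, 2760 batches" reported in the log.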

Argument mismatch warning

gpu009: Warning: The following arguments do not match the ones in the `trainer_state.json` within the checkpoint directory: 
gpu009: 	save_steps: 1200 (from args) != 120 (from trainer_state.json)
gpu004: Warning: The following arguments do not match the ones in the `trainer_state.json` within the checkpoint directory: 
gpu004: 	save_steps: 1200 (from args) != 120 (from trainer_state.json)
gpu004: Warning: The following arguments do not match the ones in the `trainer_state.json` within the checkpoint directory: 
gpu004: 	save_steps: 1200 (from args) != 120 (from trainer_state.json)

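Meaning: an argument does not match the checkpoint:
- save_steps: the current configuration sets the save interval to 1200 steps, while the checkpoint's trainer_state.json records 120. The resumed run was therefore launched with a different checkpoint-saving frequency from the one in effect when the checkpoint was written.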
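
A check of this kind can be run before launching the resumed job. The sketch below is an assumption-laden illustration rather than the Trainer's own comparison code: the helper names `load_trainer_state` and `find_mismatches` are hypothetical, and of the compared keys only `save_steps` is confirmed by the log above. The other keys are guesses, and are skipped if absent.

```python
import json
from pathlib import Path


def load_trainer_state(checkpoint_dir):
    """Read trainer_state.json from a checkpoint directory."""
    state_path = Path(checkpoint_dir) / "trainer_state.json"
    with open(state_path, encoding="utf-8") as f:
        return json.load(f)


def find_mismatches(current_args, saved_state,
                    keys=("save_steps", "logging_steps", "eval_steps")):
    """Return one message per key whose current value differs from the saved one.

    Keys missing from either side are ignored, since not every version
    of the state file necessarily stores all of them.
    """
    messages = []
    for key in keys:
        if key not in current_args or key not in saved_state:
            continue
        if current_args[key] != saved_state[key]:
            messages.append(
                f"{key}: {current_args[key]} (from args) != "
                f"{saved_state[key]} (from trainer_state.json)"
            )
    return messages


if __name__ == "__main__":
    # Values taken from the warning in the log; no file is read here.
    for line in find_mismatches({"save_steps": 1200}, {"save_steps": 120}):
        print(line)
```

In practice the second argument would come from `load_trainer_state` pointed at the checkpoint directory being resumed.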

Impact on training

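Resuming training: resuming from a checkpoint normally has no negative effect on training, provided the restored state is valid. If the checkpoint is complete and uncorrupted, training continues from where it stopped.
Argument mismatch:
- Potential impact: in general, mismatched arguments can make a resumed run behave differently from the original one. Here the mismatched argument is save_steps, which only controls how often checkpoints are written and does not change how the model is optimized. A different save interval does change when checkpoints become available: with a longer interval, fewer checkpoints are written and more progress can be lost if the run is interrupted, which affects debugging and later resumption.
- Recommendation: keep the training arguments consistent with those stored in the checkpoint to avoid surprises. If an argument needs to change, change it deliberately before launching the resumed run.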
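
To make the effect of the two save intervals concrete, the sketch below computes the next step at which a checkpoint would be written after resuming at step 2760. It assumes checkpoints are saved whenever the global step is a multiple of the interval. The function name `next_save_step` is hypothetical, and which of the two intervals actually applies after resuming should be verified on the installed library version.

```python
def next_save_step(global_step, save_steps):
    """Return the first step after global_step that is a multiple of save_steps."""
    if save_steps <= 0:
        raise ValueError("save_steps must be positive")
    return (global_step // save_steps + 1) * save_steps


if __name__ == "__main__":
    resumed_at = 2760  # global step reported in the log
    for interval in (120, 1200):
        nxt = next_save_step(resumed_at, interval)
        print(f"save_steps={interval}: next checkpoint at step {nxt}, "
              f"{nxt - resumed_at} steps after resuming")
```

Under this assumption the first new checkpoint appears at step 2880 with an interval of 120, but not until step 3600 with an interval of 1200, so an interruption shortly after resuming would cost more work with the larger interval.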