模型保存相关
1.如果不想在训练过程中保存模型而只是在结束的时候保存最好的模型,可以设置save_strategy为no并且load_best_model_at_end为True。
2. sft 使用deepspeed会保存一个很大的global_step文件,如果不需要的话可以通过设置–save_only_model来取消。
save_strategy (`str` or [`~trainer_utils.IntervalStrategy`], *optional*, defaults to `"steps"`):
The checkpoint save strategy to adopt during training. Possible values are:
- `"no"`: No save is done during training.
- `"epoch"`: Save is done at the end of each epoch.
- `"steps"`: Save is done every `save_steps`.
save_steps (`int` or `float`, *optional*, defaults to 500):
Number of updates steps before two checkpoint saves if `save_strategy="steps"`. Should be an integer or a
float in range `[0,1)`. If smaller than 1, will be interpreted as ratio of total training steps.
save_total_limit (`int`, *optional*):
If a value is passed, will limit the total amount of checkpoints. Deletes the older checkpoints in
`output_dir`. When `load_best_model_at_end` is enabled, the "best" checkpoint according to
`metric_for_best_model` will always be retained in addition to the most recent ones. For example, for
`save_total_limit=5` and `load_best_model_at_end`, the four last checkpoints will always be retained
alongside the best model. When `save_total_limit=1` and `load_best_model_at_end`, it is possible that two
checkpoints are saved: the last one and the best one (if they are different).
save_safetensors (`bool`, *optional*, defaults to `True`):
Use [safetensors](https://huggingface.co/docs/safetensors) saving and loading for state dicts instead of
default `torch.load` and `torch.save`.
save_on_each_node (`bool`, *optional*, defaults to `False`):
When doing multi-node distributed training, whether to save models and checkpoints on each node, or only on
the main one.
This should not be activated when the different nodes use the same storage as the files will be saved with
the same names for each node.
save_only_model (`bool`, *optional*, defaults to `False`):
When checkpointing, whether to only save the model, or also the optimizer, scheduler & rng state.
Note that when this is true, you won't be able to resume training from checkpoint.
This enables you to save storage by not storing the optimizer, scheduler & rng state.
You can only load the model using `from_pretrained` with this option set to `True`.
load_best_model_at_end (`bool`, *optional*, defaults to `False`):
Whether or not to load the best model found during training at the end of training. When this option is
enabled, the best checkpoint will always be saved. See
[`save_total_limit`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.save_total_limit)
for more.
更详细的内容可以通过阅读https://github.com/huggingface/transformers/blob/main/src/transformers/training_args.py#L208来查找