HuggingFace's transformers library has DeepSpeed integrated.
See Qwen's finetune.py script:
from transformers import Trainer, GPTQConfig, deepspeed
Note that the DeepSpeed integration in transformers does not support every feature completely; the integration notes, along with a lot of practical tips, are documented here: transformers/docs/source/zh/main_classes/deepspeed.md at eb5b968c5d80271ecb29917dffecc8f4c00247a8 · huggingface/transformers · GitHub
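The integration is driven by the deepspeed argument of TrainingArguments: point it at a DeepSpeed JSON config and Trainer builds the DeepSpeed engine internally. A minimal sketch (the output_dir and other values here are placeholders, not taken from the Qwen script):

from transformers import TrainingArguments

# Passing a DeepSpeed JSON config here is what switches the integration on;
# Trainer then wraps the model in a DeepSpeed engine inside train().
training_args = TrainingArguments(
    output_dir="output_qwen_sketch",            # placeholder path
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    bf16=True,
    deepspeed="finetune/ds_config_zero2.json",  # same ZeRO-2 config as in the launch config below
)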
Stepping into transformers/trainer.py via trainer.train(),
it looks like self.model_wrapped is the new model converted for DeepSpeed; see the comment around line 2122:
# important: at this point:
# self.model is the Transformers Model
# self.model_wrapped is DDP(Transformers Model), Deepspeed(Transformers Model),
# FSDP(Transformers Model), Dynamo Optimized Module(Transformers Model) etc.
model = self._wrap_model(self.model_wrapped)
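To confirm this while paused at that point, you can check the types in the debug console. A minimal sketch (assuming a trainer instance is in scope and DeepSpeed was enabled via TrainingArguments):

import deepspeed

# self.model stays the plain Transformers model, while self.model_wrapped
# becomes the DeepSpeed engine once the deepspeed config is active.
print(type(trainer.model))           # the Transformers model class
print(type(trainer.model_wrapped))   # deepspeed.runtime.engine.DeepSpeedEngine
print(isinstance(trainer.model_wrapped, deepspeed.DeepSpeedEngine))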
Debugging torchrun
{
    // Use IntelliSense to learn about possible attributes.
    // Hover to view descriptions of existing attributes.
    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Python: Module",
            "type": "python",
            "request": "launch",
            "module": "torch.distributed.run",
            "cwd": "/mnt/workspace/workgroup/dlz/tmp2/Qwen",
            "args": [
                "--nproc_per_node", "2",
                "--nnodes", "1",
                "--node_rank", "0",
                "--master_addr", "localhost",
                "--master_port", "6819",
                "finetune.py",
                "--model_name_or_path", "/path/to/models/Qwen1.5-7B",
                "--data_path", "/path/to/finetune_full/data_100000.json",
                "--bf16", "True",
                "--output_dir", "output_qwen_full_1gpu_bs1",
                "--num_train_epochs", "2",
                "--per_device_train_batch_size", "2",
                "--per_device_eval_batch_size", "1",
                "--gradient_accumulation_steps", "4",
                "--evaluation_strategy", "no",
                "--save_strategy", "steps",
                "--save_steps", "1000",
                "--save_total_limit", "10",
                "--learning_rate", "1e-5",
                "--weight_decay", "0.1",
                "--adam_beta2", "0.95",
                "--warmup_ratio", "0.01",
                "--lr_scheduler_type", "cosine",
                "--logging_steps", "1",
                "--report_to", "none",
                "--model_max_length", "2048",
                "--gradient_checkpointing", "True",
                "--lazy_preprocess", "True",
                "--deepspeed", "finetune/ds_config_zero2.json"
            ],
            "env": {
                "CUDA_VISIBLE_DEVICES": "0,1" // add environment variables
            },
            "justMyCode": false
        }
    ]
}
With two GPUs, two worker processes show up in the debugger; to inspect data inside a particular one, just click into the line of the corresponding function in that worker's call stack.
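If you only want a single worker to stop, one trick is to gate the breakpoint on the rank. A minimal sketch (LOCAL_RANK is the per-worker environment variable set by torchrun):

import os

# torchrun exports LOCAL_RANK to each worker process; only rank 0 pauses
# here, so the debugger follows one predictable worker.
if int(os.environ.get("LOCAL_RANK", "0")) == 0:
    breakpoint()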
model.optimizer on rank 0 (index 0):
Here it is split into two partitions of exactly the same size, holding the 16-bit parameters,
which matches the data shown inside them.
This is the total parameter count of the whole model (taken from the local variables after entering _inner_training_loop, not from the model itself):
There are also two attributes, params_in_partition and params_not_in_partition, holding the parameters that do and do not fall into this rank's partition.
model.optimizer on rank 1 (index 1), where we can see:
Looking at param_names in the model, there are 1 + 32*12 + 2 = 387 parameter tensors in total, so the two partitions above are split around the 193rd one, with the 193rd tensor itself split partly into each?
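The same numbers can be printed from the debug console instead of read off the variables pane. A minimal sketch (model is the DeepSpeed engine; params_in_partition / params_not_in_partition are internals of the ZeRO-2 optimizer and may change across DeepSpeed versions):

# model.optimizer is the ZeRO-2 wrapper inspected above; each attribute is a
# per-parameter-group list of tensor lists.
opt = model.optimizer
for group_idx, params in enumerate(opt.params_in_partition):
    in_part = sum(p.numel() for p in params)
    not_in_part = sum(p.numel() for p in opt.params_not_in_partition[group_idx])
    print(f"group {group_idx}: {len(params)} tensors / {in_part} elements in this partition, "
          f"{not_in_part} elements not in this partition")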
With 4 GPUs: