When running a Python program with deepspeed, one cause of the error
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
is an underlying ninja failure:
  File "xxx/.local/bin/ninja", line 5, in <module>
    from ninja import ninja
ModuleNotFoundError: No module named 'ninja'
The analysis: the problem comes from where ninja is resolved. The traceback above shows that Python is picking up the ninja launcher from the user-local path (~/.local/bin) instead of from the conda environment.
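To confirm where ninja is being resolved from, a quick diagnostic like the following can be run (this snippet is an added sketch, not part of the original troubleshooting):

import shutil
import importlib.util

# Which 'ninja' launcher is first on PATH; a path under ~/.local/bin means the
# user-local shim is shadowing the conda environment's ninja.
print(shutil.which("ninja"))

# None here means the 'ninja' Python package is not importable in the current
# environment, matching the ModuleNotFoundError above.
print(importlib.util.find_spec("ninja"))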
Test environment: Python 3.12 + CUDA 12.1
accelerate==0.34.2
deepspeed==0.15.1
ninja==1.11.1.1
torch==2.4.0
transformers==4.44.2
trl==0.11.1
Solution
- Rename the ninja in the user-local path as a backup:
mv ~/.local/bin/ninja ~/.local/bin/ninja.bak
- Install ninja with pip:
pip install ninja
- Reload the conda environment:
source ~/.bashrc
conda activate ENV_NAME
That resolves the error.
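To verify the fix, the CPU Adam extension can be built directly; instantiating the optimizer triggers the ninja/JIT compilation that originally failed (a minimal sanity check, not part of the original post):

import torch
from deepspeed.ops.adam import DeepSpeedCPUAdam

# Instantiating the optimizer JIT-compiles the cpu_adam op via ninja; if the
# build succeeds, the original 'ds_opt_adam' AttributeError should be gone.
params = [torch.nn.Parameter(torch.zeros(8))]
optimizer = DeepSpeedCPUAdam(params, lr=1e-3)
print(type(optimizer).__name__, "built successfully")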
Related code and commands are attached below for reference.
deepspeed_config: set bf16 to True to avoid the error "Current loss scale already at minimum - cannot decrease scale anymore. Exiting run."
deepspeed_config = {
    "fp16": {
        "enabled": False,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        # bf16 instead of fp16 avoids the loss-scale error quoted above
        "enabled": True,
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        # ZeRO stage 2 partitions optimizer states and gradients across GPUs
        "stage": 2,
        "offload_optimizer": {
            # CPU offload is what pulls in DeepSpeedCPUAdam (and its ninja JIT build)
            "device": "cpu",
            "pin_memory": True
        },
        "allgather_partitions": True,
        "allgather_bucket_size": 2e8,
        "overlap_comm": True,
        "reduce_scatter": True,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": True
    },
    "steps_per_print": 100,
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "activation_checkpointing": {
        "partition_activations": True,
        "contiguous_memory_optimization": True
    },
    "wall_clock_breakdown": False
}
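As a side note (an addition, not shown in the original post): transformers.TrainingArguments accepts either this in-memory dict or the path to a JSON file for its deepspeed argument, so the same config can also be written to disk and reused; ds_config.json below is just an example filename.

import json

# deepspeed_config is the dict defined above; Python booleans and numbers
# serialize to valid DeepSpeed JSON.
with open("ds_config.json", "w") as f:
    json.dump(deepspeed_config, f, indent=2)

# Then pass the file path instead of the dict:
# transformers.TrainingArguments(deepspeed="ds_config.json", ...)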
finetune.py
import os

import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer

# device_map, num_epochs, learning_rate, output_dir, val_set_size, eval_steps,
# save_steps, group_by_length, train_data and val_data come from the script's
# CLI arguments and data preparation, which are omitted here.
base_model = '/path/to/Meta-Llama-3.1-8B-Instruct'
# the tokenizer is needed by the data collator below
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    device_map=device_map,
)

train_args = transformers.TrainingArguments(
    deepspeed=deepspeed_config,
    num_train_epochs=num_epochs,
    learning_rate=learning_rate,
    fp16=deepspeed_config['fp16']['enabled'],
    bf16=deepspeed_config['bf16']['enabled'],
    save_safetensors=False,
    logging_steps=50,
    logging_dir=os.path.join(output_dir, 'logs'),
    evaluation_strategy="steps" if val_set_size > 0 else "no",
    save_strategy="steps",
    eval_steps=eval_steps if val_set_size > 0 else None,
    save_steps=save_steps,
    output_dir=output_dir,
    save_total_limit=2,
    group_by_length=group_by_length,
    save_only_model=True,
    seed=42,
    data_seed=42,
)

# trainer = transformers.Trainer(
trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=val_data,
    args=train_args,
    data_collator=transformers.DataCollatorForSeq2Seq(
        tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True
    ),
)

model.config.use_cache = False
trainer.train()
model.save_pretrained(output_dir, safe_serialization=False, max_shard_size='5GB')
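After training, the checkpoint saved above can be reloaded for inference roughly like this (a sketch, not part of the original script):

from transformers import AutoModelForCausalLM, AutoTokenizer

# output_dir holds the pytorch_model-*.bin shards written by save_pretrained
# above; the tokenizer was not saved there, so it is reloaded from base_model.
model = AutoModelForCausalLM.from_pretrained(output_dir, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(base_model)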
Run command: launch with deepspeed (not python). CUDA_VISIBLE_DEVICES does not work with deepspeed; specify the GPU via localhost in --include instead.
LR=5e-4
GPUID=2
MODEL_DIR=/path/to/Meta-Llama-3.1-8B-Instruct
OUTPUT_DIR=/path/to/output_dir

deepspeed --include localhost:${GPUID} finetune.py \
    --base_model ${MODEL_DIR} \
    --output_dir ${OUTPUT_DIR} \
    --batch_size 4 \
    --micro_batch_size 4 \
    --num_epochs 20 \
    --learning_rate ${LR} \
    --cutoff_len 150 \
    --eval_steps 2000 \
    --save_steps 2000 \
    --val_set_size 6000 \
    --train_on_inputs false \
    --group_by_length
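The original post does not show how finetune.py consumes these flags; a minimal argparse sketch of the assumed interface (names simply mirror the flags above) could look like:

import argparse

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--base_model", type=str, required=True)
    parser.add_argument("--output_dir", type=str, required=True)
    parser.add_argument("--batch_size", type=int, default=4)
    parser.add_argument("--micro_batch_size", type=int, default=4)
    parser.add_argument("--num_epochs", type=int, default=20)
    parser.add_argument("--learning_rate", type=float, default=5e-4)
    parser.add_argument("--cutoff_len", type=int, default=150)
    parser.add_argument("--eval_steps", type=int, default=2000)
    parser.add_argument("--save_steps", type=int, default=2000)
    parser.add_argument("--val_set_size", type=int, default=6000)
    parser.add_argument("--train_on_inputs", type=lambda s: s.lower() == "true", default=True)
    parser.add_argument("--group_by_length", action="store_true")
    # the deepspeed launcher injects --local_rank for each worker process
    parser.add_argument("--local_rank", type=int, default=-1)
    return parser.parse_args()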