微调命令
CUDA_VISIBLE_DEVICES=0 python /aaa/LLaMA-Factory/src/train_bash.py \
--stage sft \
--model_name_or_path /aaa/LLaMA-Factory/models/chatglm2-6b \
--do_train \
--dataset bbbccc \
--template chatglm2 \
--finetuning_type lora \
--lora_target query_key_value \
--output_dir output/dddeee/ \
--overwrite_cache \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 4 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--save_steps 10 \
--learning_rate 5e-5 \
--num_train_epochs 3.0 \
--plot_loss
已经从huggingface下载完整的模型并配置正确路径,也对自定义数据集仿照alpaca_gpt4_data_zh.json在dataset_info.json中写入相关配置。但运行如上命令还是有报错如下:
[INFO|training_args.py:1798] 2023-11-02 16:00:19,165 >> PyTorch: setting up devices
Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
[INFO|trainer.py:1760] 2023-11-02 16:00:19,402 >> ***** Running training *****
[INFO|trainer.py:1761] 2023-11-02 16:00:19,402 >> Num examples = 1,372
[INFO|trainer.py:1762] 2023-11-02 16:00:19,402 >> Num Epochs = 3
[INFO|trainer.py:1763] 2023-11-02 16:00:19,402 >> Instantaneous batch size per device = 4
[INFO|trainer.py:1766] 2023-11-02 16:00:19,402 >> Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:1767] 2023-11-02 16:00:19,403 >> Gradient Accumulation steps = 4
[INFO|trainer.py:1768] 2023-11-02 16:00:19,403 >> Total optimization steps = 255
[INFO|trainer.py:1769] 2023-11-02 16:00:19,404 >> Number of trainable parameters = 1,949,696
0%| | 0/255 [00:00<?, ?it/s]/aaa/envs/bbb_llama_factory_py310/lib/python3.10/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
Traceback (most recent call last):
File "/aaa/LLaMA-Factory/src/train_bash.py", line 14, in <module>
main()
File "/aaa/LLaMA-Factory/src/train_bash.py", line 5, in main
run_exp()
File "/aaa/LLaMA-Factory/src/llmtuner/tuner/tune.py", line 26, in run_exp
run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
File "/aaa/LLaMA-Factory/src/llmtuner/tuner/sft/workflow.py", line 67, in run_sft
train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
File "/envs/llama_factory_py310/lib/python3.10/site-packages/transformers/trainer.py", line 1591, in train
return inner_training_loop(
File "/envs/llama_factory_py310/lib/python3.10/site-packages/transformers/trainer.py", line 1892, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/envs/llama_factory_py310/lib/python3.10/site-packages/transformers/trainer.py", line 2776, in training_step
loss = self.compute_loss(model, inputs)
File "/envs/llama_factory_py310/lib/python3.10/site-packages/transformers/trainer.py", line 2801, in compute_loss
outputs = model(**inputs)
File "/envs/llama_factory_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/envs/llama_factory_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/envs/llama_factory_py310/lib/python3.10/site-packages/peft/peft_model.py", line 918, in forward
return self.base_model(
File "/envs/llama_factory_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/envs/llama_factory_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/envs/llama_factory_py310/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 94, in forward
return self.model.forward(*args, **kwargs)
File "/xxxcache/huggingface/modules/transformers_modules/chatglm2-6b/modeling_chatglm.py", line 937, in forward
transformer_outputs = self.transformer(
File "/envs/llama_factory_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/envs/llama_factory_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/xxxcache/huggingface/modules/transformers_modules/chatglm2-6b/modeling_chatglm.py", line 830, in forward
hidden_states, presents, all_hidden_states, all_self_attentions = self.encoder(
File "/envs/llama_factory_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/envs/llama_factory_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/xxxcache/huggingface/modules/transformers_modules/chatglm2-6b/modeling_chatglm.py", line 631, in forward
layer_ret = torch.utils.checkpoint.checkpoint(
File "/envs/llama_factory_py310/lib/python3.10/site-packages/torch/_compile.py", line 24, in inner
return torch._dynamo.disable(fn, recursive)(*args, **kwargs)
File "/envs/llama_factory_py310/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 328, in _fn
return fn(*args, **kwargs)
File "/envs/llama_factory_py310/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 17, in inner
return fn(*args, **kwargs)
File "/envs/llama_factory_py310/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 451, in checkpoint
return CheckpointFunction.apply(function, preserve, *args)
File "/envs/llama_factory_py310/lib/python3.10/site-packages/torch/autograd/function.py", line 539, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/envs/llama_factory_py310/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 230, in forward
outputs = run_function(*args)
File "/envs/llama_factory_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/envs/llama_factory_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/xxxcache/huggingface/modules/transformers_modules/chatglm2-6b/modeling_chatglm.py", line 544, in forward
attention_output, kv_cache = self.self_attention(
File "/envs/llama_factory_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/envs/llama_factory_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/xxxcache/huggingface/modules/transformers_modules/chatglm2-6b/modeling_chatglm.py", line 376, in forward
mixed_x_layer = self.query_key_value(hidden_states)
File "/envs/llama_factory_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/envs/llama_factory_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/envs/llama_factory_py310/lib/python3.10/site-packages/peft/tuners/lora.py", line 902, in forward
result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'
0%| | 0/255 [00:00<?, ?it/s]
命令运行过程中,看上去已经成功加载模型了,应该是训练第1个epoch时的报错。我--fp16
加到上面的命令中运行,也有报错。
这是与开源社区交流的记录: https://github.com/hiyouga/LLaMA-Factory/issues/1359
- 原因:cuda 环境问题
- 解决方案:
pip install torch==2.0.1
- 排查:打log看
torch.cuda.is_available()
输出为False说明CUDA环境有问题