单卡4090上用最新LLaMA-Factory微调qwen3 最新模型-CSDN博客

本文链接：https://blog.csdn.net/Python_cocola/article/details/148056843

实测用最新的 LLaMA-Factory 项目 SFT 微调最新的 qwen3 模型，只需如下几步：（LLaMA-Factory 在5.1之前以第一时间完成了 Qwen3 的深层次优化，包括训练和推理逻辑）

1.镜像构建

LLaMA-Factory 项目提供了镜像构建的 dockerfile： https://github.com/hiyouga/LLaMA-Factory/blob/main/docker/docker-cuda/Dockerfile

只需根据自己的环境略作修改即可。比如我的环境cuda驱动是12.1，所以修改 Dockerfile 的基础镜像为

FROM pytorch/pytorch:2.5.1-cuda12.1-cudnn9-devel

构建好的镜像，启动的容器中，主要库的版本：

accelerate                1.6.0datasets                  3.5.0llamafactory              0.9.3.dev0   /appnvidia-cuda-cupti-cu12    12.1.105nvidia-cuda-nvrtc-cu12    12.1.105nvidia-cuda-runtime-cu12  12.1.105opencv-python-headless    4.5.5.64peft                      0.15.1stack-data                0.6.2torch                     2.5.1+cu121torchaudio                2.5.1+cu121torchelastic              0.2.2torchvision               0.20.1+cu121transformers              4.51.3trl                       0.9.6types-dataclasses         0.6.6tzdata                    2025.2uvicorn                   0.34.2

模型和数据集准备

为了便于快速测试验证，采用尺寸最小的qwen3模型：https://modelscope.cn/models/Qwen/Qwen3-0.6B-Base

数据集：https://huggingface.co/datasets/qihoo360/Light-R1-SFTData/stage2-3k.json

编写 LLaMA-Factory 要求格式的 dataset_info.json 放到容器挂载的 /datasets 根目录下：

{    "Light-R1-SFT-stage2": {      "file_name": "qihoo360/Light-R1-SFTData/stage2-3k.json",      "file_sha1": "481cd356262d36b9d16ac49f7fc8ff3d4c9f349c",      "formatting": "sharegpt",      "columns": {        "messages": "conversations"      },      "tags": {        "role_tag": "from",        "content_tag": "value",        "user_tag": "user",        "assistant_tag": "assistant"      },      "ranking": false,      "field": "auto"    }}

训练执行

容器启动后，在界面表单上选择如下内容：

模型路径：/root/.cache/modelscope/hub/qwen/Qwen3-0.6B
微调方法： Lora
数据路径：data
数据集：Light-R1-SFT-stage2

界面设置好后，点击 “Preview command” 按钮（或预览命令）按钮，显示的命令内容应如下所示：

llamafactory-cli train \    --stage sft \    --do_train True \    --model_name_or_path /root/.cache/modelscope/hub/qwen/Qwen3-0.6B \    --preprocessing_num_workers 16 \    --finetuning_type lora \    --template default \    --flash_attn auto \    --dataset_dir data \    --dataset Light-R1-SFT-stage2 \    --cutoff_len 2048 \    --learning_rate 5e-05 \    --num_train_epochs 3.0 \    --max_samples 100000 \    --per_device_train_batch_size 2 \    --gradient_accumulation_steps 8 \    --lr_scheduler_type cosine \    --max_grad_norm 1.0 \    --logging_steps 5 \    --save_steps 100 \    --warmup_steps 0 \    --packing False \    --report_to none \    --output_dir saves/Qwen3-0.6B-Base/lora/train_2025-05-16-03-58-21 \    --bf16 True \    --plot_loss True \    --trust_remote_code True \    --ddp_timeout 180000000 \    --include_num_input_tokens_seen True \    --optim adamw_torch \    --lora_rank 8 \    --lora_alpha 16 \    --lora_dropout 0 \    --lora_target all

点击开始按钮，启动训练，容器后台可以看到如下日志：

...[INFO|configuration_utils.py:691] 2025-05-16 03:59:17,683 >> loading configuration file /root/.cache/modelscope/hub/qwen/Qwen3-0.6B/config.json[INFO|configuration_utils.py:765] 2025-05-16 03:59:17,685 >> Model config Qwen3Config {  "architectures": [    "Qwen3ForCausalLM"  ],  "attention_bias": false,  "attention_dropout": 0.0,  "bos_token_id": 151643,  "eos_token_id": 151645,  "head_dim": 128,  "hidden_act": "silu",  "hidden_size": 1024,  "initializer_range": 0.02,  "intermediate_size": 3072,  "max_position_embeddings": 40960,  "max_window_layers": 28,  "model_type": "qwen3",  "num_attention_heads": 16,  "num_hidden_layers": 28,  "num_key_value_heads": 8,  "rms_norm_eps": 1e-06,  "rope_scaling": null,  "rope_theta": 1000000,  "sliding_window": null,  "tie_word_embeddings": true,  "torch_dtype": "bfloat16",  "transformers_version": "4.51.3",  "use_cache": true,  "use_sliding_window": false,  "vocab_size": 151936}...[INFO|2025-05-16 03:59:18] llamafactory.data.loader:143 >> Loading dataset qihoo360/Light-R1-SFTData/stage2/stage2-3k.json...Setting num_proc from 16 back to 1 for the train split to disable multiprocessing as it only contains one shard.Generating train split: 3259 examples [00:01, 2540.89 examples/s]Converting format of dataset (num_proc=16): 100%|██████████| 3259/3259 [00:00<00:00, 8510.00 examples/s]Running tokenizer on dataset (num_proc=16): 100%|██████████| 3259/3259 [00:07<00:00, 459.41 examples/s]training example:input_ids:...[INFO|trainer.py:2414] 2025-05-16 03:59:30,187 >> ***** Running training *****[INFO|trainer.py:2415] 2025-05-16 03:59:30,187 >>   Num examples = 3,259[INFO|trainer.py:2416] 2025-05-16 03:59:30,187 >>   Num Epochs = 3[INFO|trainer.py:2417] 2025-05-16 03:59:30,187 >>   Instantaneous batch size per device = 2[INFO|trainer.py:2420] 2025-05-16 03:59:30,187 >>   Total train batch size (w. parallel, distributed & accumulation) = 16[INFO|trainer.py:2421] 2025-05-16 03:59:30,187 >>   Gradient Accumulation steps = 8[INFO|trainer.py:2422] 2025-05-16 03:59:30,187 >>   Total optimization steps = 609[INFO|trainer.py:2423] 2025-05-16 03:59:30,190 >>   Number of trainable parameters = 5,046,272[INFO|2025-05-16 03:59:51] llamafactory.train.callbacks:143 >> {'loss': 0.6663, 'learning_rate': 4.9995e-05, 'epoch': 0.02, 'throughput': 7542.93}{'loss': 0.6663, 'grad_norm': 0.36491671204566956, 'learning_rate': 4.999467794024707e-05, 'epoch': 0.02, 'num_input_tokens_seen': 163840}...[INFO|2025-05-16 04:06:20] llamafactory.train.callbacks:143 >> {'loss': 0.6069, 'learning_rate': 4.6810e-05, 'epoch': 0.49, 'throughput': 7971.24} 16%|█▋        | 100/609 [06:50<34:33,  4.07s/it][INFO|trainer.py:3984] 2025-05-16 04:06:20,824 >> Saving model checkpoint to saves/Qwen3-0.6B-Base/lora/train_2025-05-16-03-58-21/checkpoint-100...

训练结果

日志：

***** train metrics *****  epoch                    =     2.9963  num_input_tokens_seen    =   19960896  total_flos               = 49692689GF  train_loss               =     0.6036  train_runtime            = 0:41:45.61  train_samples_per_second =      3.902  train_steps_per_second   =      0.243Figure saved at: saves/Qwen3-0.6B-Base/lora/train_2025-05-16-03-58-21/training_loss.png

如何学习大模型 AI ？

由于新岗位的生产效率，要优于被取代岗位的生产效率，所以实际上整个社会的生产效率是提升的。

但是具体到个人，只能说是：

“最先掌握AI的人，将会比较晚掌握AI的人有竞争优势”。

这句话，放在计算机、互联网、移动互联网的开局时期，都是一样的道理。

我在一线互联网企业工作十余年里，指导过不少同行后辈。帮助很多人得到了学习和成长。

我意识到有很多经验和知识值得分享给大家，也可以通过我们的能力和经验解答大家在人工智能学习中的很多困惑，所以在工作繁忙的情况下还是坚持各种整理和分享。但苦于知识传播途径有限，很多互联网行业朋友无法获得正确的资料得到学习提升，故此将并将重要的AI大模型资料包括AI大模型入门学习思维导图、精品AI大模型学习书籍手册、视频教程、实战学习等录播视频免费分享出来。

在这里插入图片描述