LLM Training and Fine-Tuning with Unsloth: Hands-On Qwen3 Fine-Tuning

Latest recommended article published on 2025-08-12 11:58:37

Python编程杰哥

License: CC 4.0 BY-SA

Tags: python, programming languages, artificial intelligence, machine learning, algorithms, langchain, java

Article link: https://blog.csdn.net/xx_nm98/article/details/149407016

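The new Qwen3 models deliver state-of-the-art advances in reasoning, instruction following, agent capabilities, and multilingual support. Unsloth makes Qwen3 fine-tuning 2x faster, cuts VRAM usage by 70%, and supports 8x longer context lengths.

Because Qwen3 supports both reasoning and non-reasoning modes, you can fine-tune it on a non-reasoning dataset, but doing so may weaken its reasoning ability. If you want to preserve that ability (optional), mix direct-answer examples with chain-of-thought examples. Using 75% reasoning data and 25% non-reasoning data in the dataset lets the model keep its reasoning ability.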

1. Install Dependencies

%%capture
import os

if "COLAB_" not in "".join(os.environ.keys()):
    print("not COLAB")
    # !conda env list
    # !conda activate jupyter
    !pip install --user unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    print("COLAB")
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1" huggingface_hub hf_transfer
    !pip install --no-deps unsloth
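The install cell branches on whether the notebook is running in Colab. As a minimal sketch, the same heuristic can be wrapped in a small function so it can be reused and checked outside a notebook. The function name running_in_colab is a hypothetical helper, not part of Unsloth.

```python
import os

def running_in_colab(environ=None) -> bool:
    """Same heuristic as the install cell: look for 'COLAB_' in the joined
    environment variable names."""
    env = os.environ if environ is None else environ
    return "COLAB_" in "".join(env.keys())

print("COLAB" if running_in_colab() else "not COLAB")
```

Passing a plain dict instead of os.environ makes the check easy to exercise without changing the real environment.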
2. Load the Model and Tokenizer

from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    # model_name = "unsloth/Qwen3-14B",      # Hub model ID, as an alternative
    model_name = "/home/models/Qwen3-0.6B",  # Local path to the model directory
    max_seq_length = 2048,   # Context length - can be longer, but uses more memory
    load_in_4bit = True,     # 4bit uses much less memory
    load_in_8bit = False,    # A bit more accurate, uses 2x memory
    full_finetuning = False, # We have full finetuning now!
    # token = "hf_...",      # use one if using gated models
)
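To put the load_in_4bit and load_in_8bit comments in perspective, here is a rough back-of-the-envelope sketch of the memory needed for the weights alone at different precisions. The parameter counts are nominal values read off the model names (0.6B and 14B), not exact counts, and the estimate ignores quantization metadata, activations, gradients, and optimizer state, so treat the results as a lower bound.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 1024**3

# Nominal parameter counts taken from the model names (assumption, not exact).
NOMINAL_PARAMS = {"Qwen3-0.6B": 0.6e9, "Qwen3-14B": 14e9}

for name, n_params in NOMINAL_PARAMS.items():
    for bits in (4, 8, 16):
        print(f"{name} @ {bits}-bit: {weight_memory_gib(n_params, bits):.2f} GiB")
```

The 8-bit figure is exactly twice the 4-bit figure, which is what the "uses 2x memory" comment refers to.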

Add LoRA adapters so that only 1% to 10% of all parameters need to be updated.

model = FastLanguageModel.get_peft_model(
    model,
    r = 32,           # Choose any number > 0! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 32,  # Best to choose alpha = rank or rank*2
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,   # We support rank stabilized LoRA
    loftq_config = None,  # And LoftQ
)
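To see why LoRA trains so few parameters, the sketch below counts what a rank-r adapter adds to a single linear layer: a matrix A of shape (r, d_in) and a matrix B of shape (d_out, r). The 4096 x 4096 layer is a hypothetical size chosen for illustration, not Qwen3's actual shape. The scaling helper follows the usual LoRA convention (lora_alpha / r, or lora_alpha / sqrt(r) for rank-stabilized LoRA).

```python
import math

def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters LoRA adds to one linear layer: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

def lora_scaling(lora_alpha: float, r: int, use_rslora: bool = False) -> float:
    """Factor applied to the low-rank update B @ A."""
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r

# Hypothetical 4096 x 4096 projection; not Qwen3's real layer shape.
d_in = d_out = 4096
added = lora_trainable_params(d_in, d_out, r=32)
print("added parameters:", added)
print("fraction of the full layer:", added / (d_in * d_out))
print("scaling with r = 32, lora_alpha = 32:", lora_scaling(32, 32))
```

With lora_alpha equal to the rank, as in the configuration above, the scaling factor is 1, so the low-rank update is applied unscaled.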

2. Prepare the Training Data

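Qwen3 has both a reasoning mode and a non-reasoning mode, so we use two datasets:

1. The Open Math Reasoning dataset, which was used to win the AIMO (AI Mathematical Olympiad, Progress Prize 2) challenge. We sample 10% of the verifiable reasoning traces that were generated with DeepSeek R1 and have an accuracy above 95%.

2. Maxime Labonne's FineTome-100k dataset, which is in ShareGPT format. It therefore has to be converted to the standard Hugging Face multi-turn conversation format.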

from datasets import load_dataset

reasoning_dataset = load_dataset("unsloth/OpenMathReasoning-mini", split = "cot")
non_reasoning_dataset = load_dataset("mlabonne/FineTome-100k", split = "train")

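In terms of structure, the reasoning dataset provides problem and generated_solution columns, while FineTome-100k stores each dialogue in a conversations column.

The following code converts the reasoning dataset into conversation format: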

def generate_conversation(examples):
    problems  = examples["problem"]
    solutions = examples["generated_solution"]
    conversations = []
    for problem, solution in zip(problems, solutions):
        conversations.append([
            {"role" : "user",      "content" : problem},
            {"role" : "assistant", "content" : solution},
        ])
    return { "conversations": conversations, }

reasoning_conversations = tokenizer.apply_chat_template(
    reasoning_dataset.map(generate_conversation, batched = True)["conversations"],
    tokenize = False,
)
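To make the batched mapping concrete, the sketch below runs the same conversion on a tiny hand-made batch, with no dataset download or tokenizer needed. The two rows are made up for illustration and do not come from the real dataset.

```python
def generate_conversation(examples):
    """Turn a batch of problem/solution columns into user/assistant message lists."""
    conversations = []
    for problem, solution in zip(examples["problem"], examples["generated_solution"]):
        conversations.append([
            {"role": "user", "content": problem},
            {"role": "assistant", "content": solution},
        ])
    return {"conversations": conversations}

# Toy batch in the column-oriented layout that a batched map receives (made-up rows).
toy_batch = {
    "problem": ["What is 1 + 1?", "Solve x - 3 = 0."],
    "generated_solution": ["1 + 1 = 2.", "Adding 3 to both sides gives x = 3."],
}

result = generate_conversation(toy_batch)
for conversation in result["conversations"]:
    print(conversation)
```

Each input row becomes one two-message conversation, which is the shape apply_chat_template expects.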

The following code converts the non-reasoning dataset into conversation format using the standardize_sharegpt function provided by Unsloth:

from unsloth.chat_templates import standardize_sharegpt

dataset = standardize_sharegpt(non_reasoning_dataset)

non_reasoning_conversations = tokenizer.apply_chat_template(
    dataset["conversations"],
    tokenize = False,
)
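For intuition about what this standardization involves, here is a hand-written approximation of the ShareGPT-to-messages conversion. It is only a sketch: the role map and the function sharegpt_to_messages are assumptions for illustration, not Unsloth's actual implementation, which may handle more role names and edge cases.

```python
# Hand-written approximation for illustration; not Unsloth's actual implementation.
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}

def sharegpt_to_messages(turns):
    """Convert ShareGPT-style turns ('from'/'value') to 'role'/'content' messages."""
    return [{"role": ROLE_MAP[t["from"]], "content": t["value"]} for t in turns]

# Made-up ShareGPT-style dialogue.
sharegpt_example = [
    {"from": "human", "value": "Hi!"},
    {"from": "gpt", "value": "Hello, how can I help?"},
]

messages = sharegpt_to_messages(sharegpt_example)
print(messages)
```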

Check the dataset sizes. The output shows 19,252 reasoning examples and 100,000 non-reasoning examples:

print(len(reasoning_conversations))
print(len(non_reasoning_conversations))

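The non-reasoning dataset is much larger. Suppose we want the model to keep some reasoning ability while our main goal is to build a chat model.

To do that, we fix the share of chat-only data, that is, we define a mixing ratio for the two datasets.

Here we choose 75% reasoning data and 25% chat data: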

import pandas as pd

chat_percentage = 0.25

non_reasoning_subset = pd.Series(non_reasoning_conversations)
non_reasoning_subset = non_reasoning_subset.sample(
    int(len(reasoning_conversations)*(chat_percentage/(1 - chat_percentage))),
    random_state = 2407,
)
print(len(reasoning_conversations))
print(len(non_reasoning_subset))
print(len(non_reasoning_subset) / (len(non_reasoning_subset) + len(reasoning_conversations)))
# Output: 19252 6417 0.2499902606256574
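The sample size in that code comes from solving c / (c + r) = p for c, where r is the number of reasoning examples, c the number of chat examples, and p the target chat share. This gives c = r * p / (1 - p). The sketch below isolates that arithmetic; chat_subset_size is a hypothetical helper name.

```python
def chat_subset_size(n_reasoning: int, chat_fraction: float) -> int:
    """Number of chat samples so that chat makes up `chat_fraction` of the mix."""
    return int(n_reasoning * chat_fraction / (1 - chat_fraction))

n_reasoning = 19252
n_chat = chat_subset_size(n_reasoning, 0.25)
print(n_chat)                              # chat samples to draw
print(n_chat / (n_chat + n_reasoning))     # resulting chat share, just under 0.25
print(n_chat + n_reasoning)                # size of the combined dataset
```

Because the result is truncated to an integer, the final share lands slightly below the 25% target rather than exactly on it.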

Combining the two datasets completes the data preparation:

from datasets import Dataset

data = pd.concat([
    pd.Series(reasoning_conversations),
    pd.Series(non_reasoning_subset)
])
data.name = "text"

combined_dataset = Dataset.from_pandas(pd.DataFrame(data))
combined_dataset = combined_dataset.shuffle(seed = 3407)

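3. Train the Model

The code below uses Hugging Face TRL's SFTTrainer. It runs only 30 training steps to keep things fast, but you can set num_train_epochs = 1 for a full pass over the data and turn off the step limit with max_steps = None.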

from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = combined_dataset,
    eval_dataset = None, # Can set up evaluation!
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4, # Use GA to mimic batch size!
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 30,
        learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "none", # Use this for WandB etc
    ),
)

trainer_stats = trainer.train()
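The batch size, accumulation, warmup, and step settings interact, so a small sketch of the arithmetic may help. It assumes a single GPU, uses the combined dataset size from the mixing step (19,252 + 6,417), and models the "linear" schedule as a linear warmup followed by a linear decay to zero. The helper functions are illustrative and are not TRL APIs; the trainer's own step counting may differ by a step or so.

```python
import math

def effective_batch_size(per_device: int, grad_accum: int, num_devices: int = 1) -> int:
    """Sequences contributing to each optimizer step."""
    return per_device * grad_accum * num_devices

def linear_lr(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup from 0 to base_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

batch = effective_batch_size(per_device=2, grad_accum=4)  # single GPU assumed
dataset_size = 19252 + 6417                               # sizes printed during data mixing

print("effective batch size:", batch)
print("samples seen in 30 steps:", batch * 30)
print("approx. steps for one epoch:", math.ceil(dataset_size / batch))
for step in (0, 5, 15, 30):
    print(step, linear_lr(step, base_lr=2e-4, warmup_steps=5, total_steps=30))
```

The 30-step run therefore touches only a small slice of the data, which is why it is suitable as a smoke test rather than a full training run.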

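4. Run the Model

Use Unsloth's native interface to run the model. Following the Qwen team's recommendations, use temperature = 0.6, top_p = 0.95, top_k = 20 in reasoning mode and temperature = 0.7, top_p = 0.8, top_k = 20 in chat mode.

You can also set enable_thinking to True to use reasoning mode. The outputs of the two modes look quite different: both give the correct answer to the equation, and only the process differs.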

from transformers import TextStreamer

messages = [
    {"role" : "user", "content" : "Solve (x + 2)^2 = 0."}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True, # Must add for generation
    enable_thinking = False, # Disable thinking
)

_ = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    max_new_tokens = 256, # Increase for longer outputs!
    temperature = 0.7, top_p = 0.8, top_k = 20, # For non thinking
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)
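Since the two modes use different sampling settings, one way to avoid mixing them up is to keep both sets in one place and select by mode. The sketch below does this with a hypothetical helper, generation_kwargs. The larger max_new_tokens for reasoning mode is an assumption, on the basis that reasoning traces tend to be longer.

```python
# Sampling settings quoted in the text above for each mode.
SAMPLING = {
    True:  {"temperature": 0.6, "top_p": 0.95, "top_k": 20},  # reasoning (thinking) mode
    False: {"temperature": 0.7, "top_p": 0.8,  "top_k": 20},  # chat (non-thinking) mode
}

def generation_kwargs(enable_thinking: bool, max_new_tokens: int = 256) -> dict:
    """Keyword arguments for generate() matching the chosen mode (hypothetical helper)."""
    return {"max_new_tokens": max_new_tokens, **SAMPLING[enable_thinking]}

print(generation_kwargs(True, max_new_tokens=1024))
print(generation_kwargs(False))

# Intended use with the objects from the example above (not executed here):
# text = tokenizer.apply_chat_template(messages, tokenize = False,
#                                      add_generation_prompt = True, enable_thinking = True)
# _ = model.generate(**tokenizer(text, return_tensors = "pt").to("cuda"),
#                    **generation_kwargs(True, max_new_tokens = 1024))
```

Remember to pass the same enable_thinking value to apply_chat_template, otherwise the prompt and the sampling settings will not match.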

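5. Save and Load the Model

5.1 Save the LoRA Adapter

There are two ways to save the final LoRA adapter:

Upload it to the cloud with Hugging Face's push_to_hub
Save it locally with save_pretrained

[Note] This saves only the LoRA adapter, not the full model. To save the full model in 16-bit or GGUF format, see sections 5.2 and 5.3.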

model.save_pretrained("lora_model")  # Local saving
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

Code to load the saved model:

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
    max_seq_length = 2048,
    load_in_4bit = True,
)
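Before loading, it can be useful to confirm that the directory really contains an adapter. The sketch below checks for the files a PEFT-style save typically writes. The file names are an assumption based on common PEFT behavior (older versions may write adapter_model.bin instead of the safetensors file), and looks_like_lora_adapter is a hypothetical helper.

```python
from pathlib import Path

# File names a PEFT-style save typically writes (assumption; may vary by version).
CONFIG_FILE = "adapter_config.json"
WEIGHT_FILES = ("adapter_model.safetensors", "adapter_model.bin")

def looks_like_lora_adapter(directory: str) -> bool:
    """True if the directory has an adapter config and at least one weight file."""
    path = Path(directory)
    has_config = (path / CONFIG_FILE).is_file()
    has_weights = any((path / name).is_file() for name in WEIGHT_FILES)
    return has_config and has_weights

print(looks_like_lora_adapter("lora_model"))
```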

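5.2 Merge and Save in float16 Format

Use merged_16bit to save in float16 format
Use merged_4bit to save in int4 format

Saving only the LoRA adapter is also available as a fallback.

Note: if False is just a coding trick that keeps every snippet disabled. Change it to if True for the format you need.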

# Merge to 16bit
if False:
    model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: # Pushing to HF Hub
    model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False:
    model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: # Pushing to HF Hub
    model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False:
    model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: # Pushing to HF Hub
    model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")
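Conceptually, merging folds the low-rank update back into the base weight: W' = W + (lora_alpha / r) * B A. The toy sketch below shows that arithmetic on a 2 x 2 layer with rank 1 in plain Python. The matrices are made up, the scaling assumes standard (non-rank-stabilized) LoRA, and this is not how Unsloth implements the merge internally.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(W, A, B, lora_alpha, r):
    """Return W + (lora_alpha / r) * (B @ A), the standard (non-rsLoRA) merge."""
    scale = lora_alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# Toy 2 x 2 layer with rank r = 1 (values are made up).
W = [[1.0, 0.0],
     [0.0, 1.0]]
A = [[1.0, 2.0]]      # shape (r, d_in)  = (1, 2)
B = [[0.5],
     [0.0]]           # shape (d_out, r) = (2, 1)

merged = merge_lora(W, A, B, lora_alpha=2, r=1)
print(merged)
```

After merging, the adapter matrices are no longer needed at inference time, which is why the merged model can be exported as a single set of weights.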

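5.3 Merge and Save in GGUF Format

Unsloth supports saving to GGUF / llama.cpp format and uses q8_0 quantization by default. All quantization methods, including q4_k_m, are supported:

Save locally: use save_pretrained_gguf
Upload to Hugging Face: use push_to_hub_gguf

Some of the supported quantization methods:

q8_0 - Fast conversion. Higher resource usage, but generally acceptable.
q4_k_m - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, and Q4_K for the rest.
q5_k_m - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, and Q5_K for the rest.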

# Save to 8bit Q8_0
if False:
    model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False:
    model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False:
    model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: # Pushing to HF Hub
    model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False:
    model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: # Pushing to HF Hub
    model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "", # Get a token at https://huggingface.co/settings/tokens
    )
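When choosing a quantization method, a rough size estimate can help. The sketch below derives bits per weight from the block layouts llama.cpp uses for a few tensor types (bytes per block times 8, divided by weights per block) and applies them to a nominal 0.6B parameter count. These figures are background knowledge about the formats rather than anything reported by Unsloth, the parameter count is a round number taken from the model name, and real files also contain metadata and tensors stored at other precisions.

```python
# Bits per weight implied by the block layouts (approximate background figures).
BITS_PER_WEIGHT = {
    "f16":  16.0,
    "q8_0": 34 * 8 / 32,    # 34-byte block holding 32 weights
    "q6_k": 210 * 8 / 256,  # 210-byte super-block holding 256 weights
    "q5_k": 176 * 8 / 256,  # 176-byte super-block holding 256 weights
    "q4_k": 144 * 8 / 256,  # 144-byte super-block holding 256 weights
}

def approx_size_mb(num_params: float, tensor_type: str) -> float:
    """Rough size in MB if every weight used the given tensor type."""
    return num_params * BITS_PER_WEIGHT[tensor_type] / 8 / 1e6

nominal_params = 0.6e9  # round number read off the model name (assumption)
for tensor_type in BITS_PER_WEIGHT:
    print(f"{tensor_type}: ~{approx_size_mb(nominal_params, tensor_type):.0f} MB")
```

Mixed methods such as q4_k_m and q5_k_m combine several tensor types, so their file sizes fall between the pure-type estimates.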

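You can now load the resulting model file, model-unsloth.gguf or model-unsloth-Q4_K_M.gguf, in llama.cpp or in a graphical front end such as Jan or Open WebUI.

That completes the fine-tuning of the Qwen3 model. In a real project, simply replace the datasets with your own. After fine-tuning, evaluate the result on a test dataset.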
