1. Background
Introduction to Unsloth
Unsloth is an open-source project focused on making fine-tuning of large language models (LLMs) more efficient. Developed by a team led by Daniel Han and Michael Han, it aims to give developers a fast, low-memory fine-tuning solution.
Here are some of Unsloth's main features:
- Fast fine-tuning: through optimized kernels and a hand-written backpropagation engine, Unsloth trains about 2x faster than conventional fine-tuning while using 70-80% less memory, so larger datasets and bigger models fit on the same hardware.
- Broad model support: Unsloth works with many popular LLMs, including Llama 3.3, Mistral, Phi-4, Qwen 2.5 and Gemma, covering everything from the 70B-parameter Llama 3.3 down to the 9B-parameter Gemma.
- Beginner-friendly free notebooks: Unsloth ships several free example notebooks; just plug in your own dataset and run all the cells to get a fine-tuned model.
- Dynamic 4-bit quantization: by selectively leaving certain parameters unquantized, this technique noticeably improves accuracy while using less than 10% more VRAM than standard 4-bit quantization (see the sketch after this list).
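As a rough illustration only: loading one of Unsloth's pre-quantized dynamic 4-bit checkpoints looks the same as any other Unsloth load. The repository name below is an assumption, standing in for whichever "-unsloth-bnb-4bit" upload you actually use.
from unsloth import FastLanguageModel
# Hypothetical example repo; substitute an actual Unsloth dynamic 4-bit upload.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/DeepSeek-R1-Distill-Qwen-1.5B-unsloth-bnb-4bit",
    max_seq_length = 2048,
    load_in_4bit = True,
)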
In addition, Unsloth provides:
- Export of fine-tuned models to GGUF, Ollama or vLLM formats, or upload to Hugging Face.
- Long-context support, e.g. an 89K context window for Llama 3.3 (70B) and a 342K context window for Llama 3.1 (8B).
- Vision model support, such as Llama 3.2 Vision (11B), Qwen 2.5 VL (7B) and Pixtral (12B).
- Several inference optimizations that significantly speed up generation.
Installing and using Unsloth is fairly straightforward: both Conda and pip installs are supported, and detailed documentation and tutorials help you get started quickly.
2. Installing Unsloth
I struggled with the Unsloth installation at first and couldn't work out the right steps. Now that I've sorted them out, I'm writing them down here, both to share and for my own reference. (I recommend using conda to manage the environment; I used Python 3.11, and 3.10 should work too.)
step1: Driver installation
Install the CUDA driver first; there are plenty of guides online, so I won't repeat them here. Once the driver is installed, check the CUDA version (for example with nvidia-smi); mine is 12.2.
step2: Install unsloth
I installed torch 2.4.0 with the matching torchvision 0.19, built for cu122 (if your CUDA is 12.1, use 121; adjust to your setup). The xformers version also has to match the torch version: if you install 0.0.28, it will pull in newer torch and CUDA pip packages, which is a hassle.
pip install --upgrade pip
pip install torch==2.4.0
pip install torchvision==0.19
pip install "unsloth[cu122-torch240] @ git+https://github.com/unslothai/unsloth.git"
pip install --upgrade --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth-zoo.git
pip install xformers==0.0.27.post2  # if your torch version differs, look up the matching xformers version
step3: Test unsloth
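A quick way to verify the install is to import Unsloth and check that PyTorch sees the GPU; a minimal sketch:
import torch
from unsloth import FastLanguageModel  # importing unsloth prints its startup banner
print(torch.__version__, torch.version.cuda, torch.cuda.is_available())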
If the import succeeds and Unsloth prints its startup banner, congratulations: the installation worked and you can start training.
3. Model training
step1: Model preparation
My GPU is fairly weak, with only 6 GB of VRAM, so I fine-tuned the 1.5B version as practice. Model download command:
git clone https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
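If git LFS is not set up, the same model can be fetched with huggingface_hub instead (an optional alternative; it assumes the huggingface_hub package is installed):
from huggingface_hub import snapshot_download
# Downloads the full repo into the same local folder used by the training code below.
snapshot_download(repo_id="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
                  local_dir="./DeepSeek-R1-Distill-Qwen-1.5B")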
step2: Data preparation
A Chinese-language medical corpus; dataset download link (clone it locally, e.g. into ./FreedomIntelligence/medical-o1-reasoning-SFT, to match the path used in the training code):
https://huggingface.co/datasets/FreedomIntelligence/medical-o1-reasoning-SFT/tree/main
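Alternatively, the dataset can be loaded straight from the Hub without a local clone (a quick inspection sketch; the field names are the ones used by the training code below):
from datasets import load_dataset
ds = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "zh", split="train[0:5]")
print(ds.column_names)  # expect Question, Complex_CoT, Response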
step3: Training code
For an explanation of the parameters, see:
https://docs.unsloth.ai/get-started/fine-tuning-guide
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "./DeepSeek-R1-Distill-Qwen-1.5B",
max_seq_length = max_seq_length,
dtype = dtype,
load_in_4bit = load_in_4bit,
# token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.
### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning.
Please answer the following medical question.
### Question:
{}
### Response:
<think>{}"""
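# prompt_style has two {} slots: the question, and the (initially empty) start of the
# model's <think> block. Below we first run the *base* model on a sample question to
# get a baseline answer before any fine-tuning.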
question = "一个患有急性阑尾炎的病人已经发病5天,腹痛稍有减轻但仍然发热,在体检时发现右下腹有压痛的包块,此时应如何处理?"
FastLanguageModel.for_inference(model)
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")
outputs = model.generate(
input_ids=inputs.input_ids,
attention_mask=inputs.attention_mask,
max_new_tokens=1200,
use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])
train_prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.
### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning.
Please answer the following medical question.
### Question:
{}
### Response:
<think>
{}
</think>
{}"""
EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
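# Build the "text" field consumed by SFTTrainer: fill Question / Complex_CoT / Response
# into the training prompt and append EOS so the model learns where to stop.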
def formatting_prompts_func(examples):
inputs = examples["Question"]
cots = examples["Complex_CoT"]
outputs = examples["Response"]
texts = []
for input, cot, output in zip(inputs, cots, outputs):
text = train_prompt_style.format(input, cot, output) + EOS_TOKEN
texts.append(text)
return {
"text": texts,
}
from datasets import load_dataset
dataset = load_dataset("./FreedomIntelligence/medical-o1-reasoning-SFT", 'zh', split = "train[0:500]", trust_remote_code=True)
print(dataset.column_names)
dataset = dataset.map(formatting_prompts_func, batched = True)
dataset["text"][0]
model = FastLanguageModel.get_peft_model(
model,
r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",],
lora_alpha = 16,
lora_dropout = 0, # Supports any, but = 0 is optimized
bias = "none", # Supports any, but = "none" is optimized
# [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
random_state = 3407,
use_rslora = False, # We support rank stabilized LoRA
loftq_config = None, # And LoftQ
)
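# Optional sanity check: PEFT models expose print_trainable_parameters(), which shows
# that only the LoRA weights (a small fraction of the model) are trainable.
model.print_trainable_parameters()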
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
trainer = SFTTrainer(
model = model,
tokenizer = tokenizer,
train_dataset = dataset,
dataset_text_field = "text",
max_seq_length = max_seq_length,
dataset_num_proc = 2,
packing = False, # Can make training 5x faster for short sequences.
args = TrainingArguments(
per_device_train_batch_size = 2,
gradient_accumulation_steps = 4,
warmup_steps = 5,
max_steps = 60,
# num_train_epochs = 1, # For longer training runs!
learning_rate = 2e-4,
fp16 = not is_bfloat16_supported(),
bf16 = is_bfloat16_supported(),
logging_steps = 1,
optim = "adamw_8bit",
weight_decay = 0.01,
lr_scheduler_type = "linear",
seed = 3407,
output_dir = "outputs",
report_to = "none", # Use this for WandB etc
),
)
trainer_stats = trainer.train()
print(question)
FastLanguageModel.for_inference(model) # Unsloth has 2x faster inference!
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")
outputs = model.generate(
input_ids=inputs.input_ids,
attention_mask=inputs.attention_mask,
max_new_tokens=1200,
use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])
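Besides the checkpoints written to outputs, you can also save the LoRA adapters explicitly with the standard save_pretrained call (the directory name lora_model is just an example):
model.save_pretrained("lora_model")      # saves only the LoRA adapter weights and config
tokenizer.save_pretrained("lora_model")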
step4: The fine-tuned LoRA model
After fine-tuning, an outputs directory is created; the checkpoint inside it is the LoRA model. Load that checkpoint for testing (the base model, DeepSeek-R1-Distill-Qwen-1.5B, must still be kept on disk, otherwise inference will not work).
from unsloth import FastLanguageModel
max_seq_length = 2048
dtype = None
load_in_4bit = False
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "./outputs/checkpoint-60",
max_seq_length = max_seq_length,
dtype = dtype,
load_in_4bit = load_in_4bit,
)
prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.
### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning.
Please answer the following medical question.
### Question:
{}
### Response:
<think>{}"""
question = "一个患有急性阑尾炎的病人已经发病5天,腹痛稍有减轻但仍然发热,在体检时发现右下腹有压痛的包块,此时应如何处理?"
FastLanguageModel.for_inference(model) # Unsloth has 2x faster inference!
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")
outputs = model.generate(
input_ids=inputs.input_ids,
attention_mask=inputs.attention_mask,
max_new_tokens=1200,
use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])
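Section 1 mentioned exporting fine-tuned models to GGUF or merged weights; Unsloth has helpers for this. A hedged sketch based on Unsloth's documented save methods (the output directory names and the q4_k_m quantization choice are just examples):
# Merge the LoRA adapters into the base weights and save as 16-bit safetensors.
model.save_pretrained_merged("merged_model", tokenizer, save_method = "merged_16bit")
# Export a GGUF file for llama.cpp / Ollama; q4_k_m is one common quantization option.
model.save_pretrained_gguf("gguf_model", tokenizer, quantization_method = "q4_k_m")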