Fine-Tuning in Practice: Fine-Tuning QwQ-32B 4-bit with Unsloth (Single RTX 4090)

This article is based on the video tutorial from 赋范课堂: "QwQ-32B efficient fine-tuning with only 20 GB of VRAM! Four fine-tuning tools explained in depth! Knowledge injection + Q&A-style fine-tuning, fine-tuning DeepSeek R1-style reasoning models + hands-on CoT dataset creation to build a custom LLM!"
https://www.bilibili.com/video/BV1YoQoYQEwF/
Course materials: https://kq4b3vgg5b.feishu.cn/wiki/LxI9wmuFmiaLCkkoiCIcKvOan7Q
This article has been edited and abridged from that material.

赋范课堂 offers excellent courses that are well worth watching.



I. Basic Preparation

1. Install unsloth

pip install unsloth
pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

2. Install and register wandb

wandb is similar to TensorBoard, but more stable.

Sign up: https://wandb.ai/site
API key: https://wandb.ai/authorize (shown after you log in)

For registration and usage details, see: https://blog.csdn.net/lovechris00/article/details/146437418


Install the library

pip install wandb

Log in and enter your API key

wandb login
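Alternatively, for non-interactive use, wandb also reads the API key from the WANDB_API_KEY environment variable (the value below is a placeholder; use your own key):

export WANDB_API_KEY=your-api-key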

3. Download the model

https://huggingface.co/unsloth/QwQ-32B-unsloth-bnb-4bit


Install huggingface_hub
pip install huggingface_hub

Use screen to start a persistent session

Downloading the model can take 0.5-1 hour; a persistent session prevents the download from being interrupted if the terminal session is closed.


Install screen

sudo apt install screen

screen -S qwq
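If the connection drops, the download keeps running inside the screen session; detach with Ctrl-A then D, and reattach later with:

screen -r qwq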

Configure a China mirror for Hugging Face downloads

On Linux, add this environment variable to ~/.bashrc:

export HF_ENDPOINT='https://hf-mirror.com' 

Download the model
huggingface-cli download --resume-download  unsloth/QwQ-32B-unsloth-bnb-4bit

Change the default download location

Models are downloaded to ~/.cache/huggingface/hub/ by default; to store them elsewhere, set HF_HOME:

export HF_HOME="/root/xx/HF_download"

II. Model Invocation Test

Calling the model with modelscope

from modelscope import AutoModelForCausalLM, AutoTokenizer

model_name = "unsloth/QwQ-32B-unsloth-bnb-4bit"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)


prompt = "你好,好久不见!"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)

generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

Calling the model with Ollama

from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1/',
    api_key='ollama',  # required but ignored
)

prompt = "你好,好久不见!"
messages = [
    {"role": "user", "content": prompt}
]

response = client.chat.completions.create(
    messages=messages,
    model='qwq-32b-bnb',
)

print(response.choices[0].message.content)


Model registration
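A minimal sketch of registering a local GGUF build of the model with Ollama (the GGUF path below is a placeholder, not from the original article; the model name matches the one used in the request code above). Create a Modelfile containing:

FROM ./qwq-32b-q4_k_m.gguf

Then register it:

ollama create qwq-32b-bnb -f Modelfile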



Check that registration succeeded

ollama list 

Send a request with the openai library

from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1/',
    api_key='ollama',  # required but ignored
)

prompt = "你好,好久不见!"
messages = [
    {"role": "user", "content": prompt}
]

response = client.chat.completions.create(
    messages=messages,
    model='qwq-32b-bnb',
)

print(response.choices[0].message.content)


Calling the model with vLLM

vllm serve /root/autodl-tmp/QwQ-32B-unsloth-bnb-4bit \
--quantization bitsandbytes \
--load-format bitsandbytes \
--max-model-len 2048

Request test
from openai import OpenAI
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

prompt = "你好,好久不见!"
messages = [
    {"role": "user", "content": prompt}
]

response = client.chat.completions.create(
    model="/root/autodl-tmp/QwQ-32B-unsloth-bnb-4bit",
    messages=messages,
)

print(response.choices[0].message.content)

III. Download the Fine-Tuning Dataset

Response structure of reasoning models and requirements for the fine-tuning dataset

Like DeepSeek R1, QwQ-32B surfaces its reasoning process directly in the response: the output contains both a reasoning section and a final answer section, and the reasoning section is delimited by <think>...</think> tags, special tokens injected during model training.
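For example, a response typically has roughly this shape (an illustrative sketch, not actual model output):

<think>
... step-by-step reasoning ...
</think>

... final answer ...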


Download the NuminaMath CoT dataset

https://huggingface.co/datasets/AI-MO/NuminaMath-CoT

huggingface-cli download AI-MO/NuminaMath-CoT --repo-type dataset

Besides NuminaMath CoT, there are other CoT datasets such as APPs (coding), TACO (coding), and long_form_thought_data_5k (general Q&A), all of which can be used to fine-tune reasoning models. For an introduction to these datasets, see the open course "Model Distillation with DeepSeek R1: A Hands-On Introduction" | https://www.bilibili.com/video/BV1X1FoeBEgW/



Download the medical-o1-reasoning-SFT dataset

https://huggingface.co/datasets/FreedomIntelligence/medical-o1-reasoning-SFT

huggingface-cli download FreedomIntelligence/medical-o1-reasoning-SFT --repo-type dataset

You can also download it with the Python datasets library:

from datasets import load_dataset

# Downloading just the first 500 examples is enough for this experiment
dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT","en", split = "train[0:500]",trust_remote_code=True)

# Inspect the dataset
dataset[0]

IV. Load the Model

from unsloth import FastLanguageModel 

max_seq_length = 2048 
dtype = None 
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/QwQ-32B-unsloth-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)


GPU memory usage at this point: 22016 MB


V. Tests Before Fine-Tuning

Inspect the model

>>> model
Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(152064, 5120, padding_idx=151654)
    (layers): ModuleList(
      (0): Qwen2DecoderLayer(
        ...
      (62): Qwen2DecoderLayer(
        ...
      )
      (63): Qwen2DecoderLayer(
        ...
    )
    (norm): Qwen2RMSNorm((5120,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=5120, out_features=152064, bias=False)
)

Tokenizer info

>>> tokenizer
Qwen2TokenizerFast(name_or_path='unsloth/QwQ-32B-unsloth-bnb-4bit', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|vision_pad|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  ...
	151667: AddedToken("<think>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
	151668: AddedToken("</think>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
}
)

Basic Q&A test

# Switch the model to inference mode
FastLanguageModel.for_inference(model)

# Answer using the chat prompt template

prompt_style_chat = """请写出一个恰当的回答来完成当前对话任务。
***
### Instruction:
你是一名助人为乐的助手。
***
### Question:
{}
***
### Response:
<think>{}"""

question = "你好,好久不见!"
prompt = [prompt_style_chat.format(question, "")] 

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    max_new_tokens=2048,
    use_cache=True,
)

# GPU memory usage rises to 22412 MB
'''
>>> outputs
tensor([[ 14880, 112672,  46944, 112449, 111423,  36407,  60548,  67949, 105051,
          ...
          35946, 106128,  99245, 101037,  11319, 144236, 151645]],
       device='cuda:0')
'''

response = tokenizer.batch_decode(outputs)
# response --> ['请写出一个恰当的回答来完成当前对话任务。\n***\n### Instruction:\n你是一名助人为乐的助手。\n***\n### Question:\n你好,好久不见!\n***\n### Response:\n<think>:\n好的,用户发来问候“你好,好久不见!”,我需要回应并延续对话。首先,应该友好回应他们的问候,比如“你好!确实很久没联系了,希望你一切都好!”这样既回应了对方,也表达了关心。接下来,可能需要询问对方近况,或者引导对话继续下去。比如可以问:“最近有什么新鲜事吗?或者你有什么需要帮助的吗?”这样可以让对话更自然,也符合助人为乐的角色设定。还要注意语气要亲切,保持口语化,避免过于正式。另外,用户可能希望得到情感上的回应,所以需要体现出关心和愿意帮助的态度。检查有没有语法错误,确保句子流畅。最后,确定回应简洁但足够友好,符合对话的流程。\n</think>\n\n你好!确实好久不见了,希望你一切都好!最近有什么新鲜事分享,或者需要我帮忙什么吗?😊<|im_end|>']

print(response[0].split("### Response:")[1])


A more complex test

question = "请证明根号2是无理数。"

inputs = tokenizer([prompt_style_chat.format(question, "")], return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    max_new_tokens=1200,
    use_cache=True,
)

# GPU memory usage: 22552 MiB

response = tokenizer.batch_decode(outputs)

print(response[0].split("### Response:")[1])


Medical Q&A with the original model

# Redefine the prompt template
prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context. 
Write a response that appropriately completes the request. 
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.
***
### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning. 
Please answer the following medical question. 
***
### Question:
{}
***
### Response:
<think>{}"""

question_1 = "A 61-year-old woman with a long history of involuntary urine loss during activities like coughing or sneezing but no leakage at night undergoes a gynecological exam and Q-tip test. Based on these findings, what would cystometry most likely reveal about her residual volume and detrusor contractions?"

question_2 = "Given a patient who experiences sudden-onset chest pain radiating to the neck and left arm, with a past medical history of hypercholesterolemia and coronary artery disease, elevated troponin I levels, and tachycardia, what is the most likely coronary artery involved based on this presentation?"


inputs1 = tokenizer([prompt_style.format(question_1, "")], return_tensors="pt").to("cuda")

outputs1 = model.generate(
    input_ids=inputs1.input_ids,
    max_new_tokens=1200,
    use_cache=True,
)

response1 = tokenizer.batch_decode(outputs1)

print(response1[0].split("### Response:")[1])

 

inputs2 = tokenizer([prompt_style.format(question_2, "")], return_tensors="pt").to("cuda")

outputs2 = model.generate(
    input_ids=inputs2.input_ids,
    max_new_tokens=1200,
    use_cache=True,
)
# GPU 22842 MiB 

response2 = tokenizer.batch_decode(outputs2)

print(response2[0].split("### Response:")[1])

VI. Minimum Viable Experiment

Next, we attempt the actual fine-tuning.

With this dataset, we can fine-tune on a subset of the original data, or on the full data over several epochs.

For most fine-tuning experiments, it is best to start with a minimum viable experiment: fine-tune on a small amount of data first and observe the effect.

If that runs smoothly and shows an improvement, then consider a larger run with more data.


Define the prompt template

import os
from datasets import load_dataset

train_prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context. 
Write a response that appropriately completes the request. 
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.
***
### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning. 
Please answer the following medical question. 
***
### Question:
{}
***
### Response:
<think>
{}
</think>
{}"""

EOS_TOKEN = tokenizer.eos_token  # '<|im_end|>'  



Define the dataset processing function

It reshapes the medical-o1-reasoning-SFT dataset by inserting the Complex_CoT and Response columns into the prompt template and appending the end-of-text token:

def formatting_prompts_func(examples):
    inputs = examples["Question"]
    cots = examples["Complex_CoT"]
    outputs = examples["Response"]
    texts = []
    for input, cot, output in zip(inputs, cots, outputs):
        text = train_prompt_style.format(input, cot, output) + EOS_TOKEN
        texts.append(text)
    return {
        "text": texts,
    }

Prepare the data

dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT","en", split = "train[0:500]",trust_remote_code=True)  
''' 
{
	'Question': 'A 61-year-old ... contractions?',
	'Complex_CoT': "Okay, let's ... incontinence.",
	'Response': 'Cystometry in ... the test.' 
}
'''

# Apply the structured formatting
dataset = dataset.map(formatting_prompts_func, batched = True,) 

# Inspect the formatted text
dataset["text"][0]
'''
Below is an instruction that ... response.
***
### Instruction:
You are a medical ... medical question. 
***
### Question:
A 61-year-old woman ... contractions?
***
### Response:
<think>
Okay,...Yup, I think that makes sense given her symptoms and the typical presentations of stress urinary incontinence.
</think>
Cystometry ... is primarily related to physical e
'''

Start fine-tuning

model = FastLanguageModel.get_peft_model(
    model,
    r=16,  
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,  
    bias="none",  
    use_gradient_checkpointing="unsloth",  # True or "unsloth" for very long context
    random_state=3407,
    use_rslora=False,  
    loftq_config=None,
)

from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

# Create the supervised fine-tuning trainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length, 
    dataset_num_proc=2,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        # Use num_train_epochs = 1, warmup_ratio for full training runs!
        warmup_steps=5,
        max_steps=60,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
    ),
)


Notes on the fine-tuning setup

This code uses SFTTrainer for supervised fine-tuning (SFT) of models in the transformers / Unsloth ecosystem:

Related libraries
  • SFTTrainer (from the trl library):
    • trl (Transformer Reinforcement Learning) is a Hugging Face library that provides supervised fine-tuning (SFT) and reinforcement learning (RLHF) functionality.
    • SFTTrainer is mainly used for supervised fine-tuning and works with low-rank adaptation methods such as LoRA.
  • TrainingArguments (from the transformers library):
    • This class defines the training hyperparameters, such as batch size, learning rate, optimizer, and number of training steps.
  • is_bfloat16_supported() (from unsloth):
    • Checks whether the current GPU supports bfloat16 (BF16); returns True if it does, otherwise False.
    • bfloat16 is a more efficient numeric format that performs better on newer NVIDIA GPUs such as the A100/H100.

Fine-tuning parameter breakdown

SFTTrainer arguments
  • model=model: the pretrained model to fine-tune
  • tokenizer=tokenizer: the tokenizer used to process the text data
  • train_dataset=dataset: the training dataset
  • dataset_text_field="text": which dataset column holds the training text (produced by formatting_prompts_func)
  • max_seq_length=max_seq_length: maximum sequence length, caps the number of input tokens
  • dataset_num_proc=2: number of parallel processes for data preprocessing

TrainingArguments
  • per_device_train_batch_size=2: training batch size per GPU/device (small values suit large models)
  • gradient_accumulation_steps=4: gradient accumulation steps (effective batch size = 2 × 4 = 8)
  • warmup_steps=5: warmup steps (the learning rate starts low and ramps up)
  • max_steps=60: maximum number of training steps (here roughly 60 × 8 = 480 training examples are consumed)
  • learning_rate=2e-4: learning rate (2e-4 = 0.0002), controls the size of weight updates
  • fp16=not is_bfloat16_supported(): use fp16 (16-bit floats) if the GPU does not support bfloat16
  • bf16=is_bfloat16_supported(): enable bfloat16 if the GPU supports it (more stable training)
  • logging_steps=10: log training metrics every 10 steps
  • optim="adamw_8bit": use the 8-bit AdamW optimizer to reduce VRAM usage
  • weight_decay=0.01: weight decay (L2 regularization) to reduce overfitting
  • lr_scheduler_type="linear": learning rate schedule (linear decay)
  • seed=3407: random seed for reproducibility
  • output_dir="outputs": output directory for training artifacts

Set up wandb and start fine-tuning

import wandb
wandb.login(key="8c7...242bd")
run = wandb.init(project='Fine-tune-QwQ-32B-4bit on Medical COT Dataset', )

# Start fine-tuning
trainer_stats = trainer.train()


If you run into CUDA out of memory, adjust the parameters accordingly.

You can try the following code (for testing only; results are not guaranteed):

import torch
torch.cuda.empty_cache()

import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True" 

from unsloth import FastLanguageModel 

max_seq_length = 1024
dtype = None 
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/QwQ-32B-unsloth-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

import os
from datasets import load_dataset

train_prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context. 
Write a response that appropriately completes the request. 
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.
***
### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning. 
Please answer the following medical question. 
***
### Question:
{}
***
### Response:
<think>
{}
</think>
{}"""

EOS_TOKEN = tokenizer.eos_token  # '<|im_end|>'  

def formatting_prompts_func(examples):
    inputs = examples["Question"]
    cots = examples["Complex_CoT"]
    outputs = examples["Response"]
    texts = []
    for input, cot, output in zip(inputs, cots, outputs):
        text = train_prompt_style.format(input, cot, output) + EOS_TOKEN
        texts.append(text)
    return {
        "text": texts,
    }

dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT","en", split = "train[0:200]",trust_remote_code=True)  


# Apply the structured formatting
dataset = dataset.map(formatting_prompts_func, batched = True,) 

# Set up fine-tuning (LoRA configuration)
model = FastLanguageModel.get_peft_model(
    model,
    r=8,  
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    lora_alpha=8,
    lora_dropout=0,  
    bias="none",  
    use_gradient_checkpointing="unsloth",  # True or "unsloth" for very long context
    random_state=3407,
    use_rslora=False,  
    loftq_config=None,
)

from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

# Create the supervised fine-tuning trainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length, 
    dataset_num_proc=2,
    args=TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        # Use num_train_epochs = 1, warmup_ratio for full training runs!
        warmup_steps=5,
        max_steps=60,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=20,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
    ),
)

import wandb
wandb.login(key="your-wandb-api-key")  # replace with your own wandb API key
run = wandb.init(project='Fine-tune-QwQ-32B-4bit on Medical COT Dataset', )

# Start fine-tuning
trainer_stats = trainer.train()



Check the results

After fine-tuning finishes, Unsloth automatically applies the updated weights (kept in memory), so the fine-tuned model can be called directly without manually merging weights:

trainer_stats
# TrainOutput(global_step=60, training_loss=1.3152311007181803, metrics={'train_runtime': 709.9004, 'train_samples_per_second': 0.676, 'train_steps_per_second': 0.085, 'total_flos': 6.676294205826048e+16, 'train_loss': 1.3152311007181803})

# Switch to inference mode
FastLanguageModel.for_inference(model)

# Check the Q&A results again
inputs = tokenizer([prompt_style.format(question_1, "")], return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=2048,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])

inputs = tokenizer([prompt_style.format(question_2, "")], return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=2048,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])


Merge the model

save_path = 'QwQ-Medical-COT-Tiny'
model.save_pretrained_merged(save_path, tokenizer, save_method = "merged_4bit",) 

Save as GGUF

This makes it convenient to run inference with Ollama (see the example after the export command below).

Exporting and merging takes a while (roughly 20 minutes).

save_path = 'QwQ-Medical-COT-Tiny-GGUF'
model.save_pretrained_gguf(save_path, tokenizer, quantization_method = "q4_k_m") 
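As a rough sketch (not from the original article), the exported GGUF can then be registered with Ollama in the same way as before; the exact .gguf filename that Unsloth writes is an assumption here, so check the QwQ-Medical-COT-Tiny-GGUF directory first:

# Modelfile: point FROM at the actual .gguf file in the export directory
FROM ./QwQ-Medical-COT-Tiny-GGUF/unsloth.Q4_K_M.gguf

ollama create qwq-medical-cot -f Modelfile
ollama run qwq-medical-cot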

VII. Full Efficient Fine-Tuning Experiment

Finally, fine-tune on the full dataset to further improve the results.


# Define the training prompt template
train_prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context. 
Write a response that appropriately completes the request. 
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.
***
### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning. 
Please answer the following medical question. 
***
### Question:
{}
***
### Response:
<think>
{}
</think>
{}"""


EOS_TOKEN = tokenizer.eos_token  # Must add EOS_TOKEN

def formatting_prompts_func(examples):
    inputs = examples["Question"]
    cots = examples["Complex_CoT"]
    outputs = examples["Response"]
    texts = []
    for input, cot, output in zip(inputs, cots, outputs):
        text = train_prompt_style.format(input, cot, output) + EOS_TOKEN
        texts.append(text)
    return {
        "text": texts,
    }


# Load the full dataset
dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT","en", split = "train",trust_remote_code=True)
dataset = dataset.map(formatting_prompts_func, batched = True,)
dataset["text"][0]

# Attach LoRA adapters to the loaded model
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,  
    bias="none",  
    use_gradient_checkpointing="unsloth",  # True or "unsloth" for very long context
    random_state=3407,
    use_rslora=False,  
    loftq_config=None,
)


from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

# Set num_train_epochs to 3, i.e. iterate over the dataset 3 times:
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs = 3,
        warmup_steps=5,
        # max_steps=60,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
    ),
)

# Map (num_proc=2):   0%| | 0/25371 [00:00<?, ? examples/s] 

trainer_stats = trainer.train()

[ 389/9513 13:44 < 5:24:01, 0.47 it/s, Epoch 0.12/3]

Step    Training Loss
10      1.285900
20      1.262500
...
370     1.201200
380     1.215600

The full run took about 5.6 hours in total (see train_runtime in the stats below).


trainer_stats

TrainOutput(global_step=9513, training_loss=1.0824475168592858, metrics={'train_runtime': 20193.217, 'train_samples_per_second': 3.769, 'train_steps_per_second': 0.471, 'total_flos': 2.7936033274397737e+18, 'train_loss': 1.0824475168592858, 'epoch': 2.9992117294655527})

Testing

Testing again with the two questions, both now produce good answers:


question = "A 61-year-old ... contractions?"

FastLanguageModel.for_inference(model)  # Unsloth has 2x faster inference!
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])


question = "Given a patient who experiences sudden-onset chest pain radiating to the neck and left arm, with a past medical history of hypercholesterolemia and coronary artery disease, elevated troponin I levels, and tachycardia, what is the most likely coronary artery involved based on this presentation?"

FastLanguageModel.for_inference(model)  # Unsloth has 2x faster inference!
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])


2025-03-22 (Sat)
