Example: LoRA fine-tuning Qwen2.5-0.5B on Windows 11, then running the custom model with Ollama

lwprain

Last modified 2025-05-24 14:57:50


Tags: ollama, lora, qwen, custom model

First published 2025-03-19 15:18:24

Article link: https://blog.csdn.net/lwprain/article/details/146367778


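I recently had to do some validation work in this area, so I am writing the steps down for future reference. I will keep updating the post if I find problems later.
Reference article:
https://blog.csdn.net/spiderwower/article/details/138755776
Following that example only got me about halfway, so I looked up other material to complete the whole workflow.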

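1. Download the base model from the ModelScope community (https://modelscope.cn)
① First find the base model you want on ModelScope. Here I use Alibaba's Qwen2.5-0.5B model; its full name is Qwen/Qwen2.5-0.5B-Instruct.
② Make sure a suitable conda and Python environment is installed. I use Python 3.10 here and will not go into the details.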

conda create -n nlp310 python=3.10

Switch to the nlp310 environment:

conda activate nlp310

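③ Install PyTorch. If you have an NVIDIA GPU, be sure to install the GPU build. My laptop is limited (an MX350 with 2 GB of VRAM), so I had no choice but to pick a small model. CUDA must be installed beforehand; I plan to use CUDA 11.8 here.
For reference, see the article "win11+4060配置CUDA11.8+pytorch2.0.0" (configuring CUDA 11.8 and PyTorch 2.0.0 on Windows 11 with a 4060).

Once CUDA is installed, PyTorch can be installed from the Alibaba Cloud mirror: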

pip3 install torch==2.6.0 torchvision torchaudio -f https://mirrors.aliyun.com/pytorch-wheels/cu118/

You can check the installed version and whether CUDA is available with the following command:

python -c "import torch;print(torch.__version__, torch.cuda.is_available())"

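Other versions can be installed as well.
Once that is done, install the pip packages used for fine-tuning (for example, save the list below as requirements.txt and run pip install -r requirements.txt):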

transformers==4.49.0
streamlit==1.24.0
sentencepiece==0.1.99
accelerate==0.29.3
datasets==2.14.5
peft==0.10.0
tiktoken==0.9.0
modelscope[framework]

An error may also occur because of the transformers version.
In that case the error message suggests installing tf-keras:

pip install tf-keras
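
Since version mismatches are the usual cause of the errors mentioned above, it can help to compare what is actually installed against the pinned list. The following is a minimal sketch using only the standard library; the helper names parse_pins and report are hypothetical, and the unpinned modelscope[framework] entry is deliberately left out.

```python
from importlib import metadata

# The pinned packages from the list above (modelscope is unpinned, so it is omitted)
PINS = """
transformers==4.49.0
streamlit==1.24.0
sentencepiece==0.1.99
accelerate==0.29.3
datasets==2.14.5
peft==0.10.0
tiktoken==0.9.0
"""

def parse_pins(text):
    """Map package name -> pinned version for lines of the form name==version."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins

def report(pins):
    """Return (name, pinned, installed) for each mismatch; installed is None if missing."""
    mismatches = []
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != pinned:
            mismatches.append((name, pinned, installed))
    return mismatches

for name, pinned, installed in report(parse_pins(PINS)):
    print(f"{name}: pinned {pinned}, installed {installed}")
```

If the script prints nothing, every pinned package is installed at the listed version.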

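④ Download the base model with the code below.
modelscope must already be installed (it is in the package list above).

Set ModelScope's default storage location in an environment variable beforehand.
I moved the download path to d:/cuda/modelscope.

Variable name: MODELSCOPE_CACHE
Value: d:/cuda/modelscope

Model download script: md_download1.py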

from modelscope import snapshot_download
 
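# Where to store the model (optional)
# model_path = '/root/autodl-tmp'
# Model name
name = 'Qwen/Qwen2.5-0.5B-Instruct'
# model_dir = snapshot_download(name, cache_dir=model_path, revision='master')
model_dir = snapshot_download(name)
print(model_dir)  # local directory the model was downloaded to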

Note that name is the full name of the model you want to download. Run the script to start the download:

python md_download1.py
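
The training script later refers to the model by its local path, D:/cuda/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct, in which each dot of the model name appears as three underscores. Below is a minimal sketch of that mapping. It is inferred only from the paths used in this article, the helper name local_model_dir is hypothetical, and the layout may differ between ModelScope versions, so the value returned by snapshot_download remains the authoritative path.

```python
from pathlib import PurePosixPath

def local_model_dir(model_id: str, cache_root: str) -> str:
    """Guess the local directory for a model id such as 'Qwen/Qwen2.5-0.5B-Instruct'.

    Assumption (from the paths in this article): models live under
    <cache_root>/models/<owner>/<name>, with '.' in the name replaced by '___'.
    """
    owner, name = model_id.split("/", 1)
    return str(PurePosixPath(cache_root) / "models" / owner / name.replace(".", "___"))

print(local_model_dir("Qwen/Qwen2.5-0.5B-Instruct", "D:/cuda/modelscope"))
```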


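2. Download the training set
I use the Chinese Ruozhiba dataset, about 1,500 dialogue samples, as the example. Its full name is kigner/ruozhiba-llama3-tt.
It is available at:
https://huggingface.co/datasets/kigner/ruozhiba-llama3-tt/tree/main
I downloaded the files manually to d:/cuda/ruozhiba-llama3-tt.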
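
Before training it is worth confirming that the downloaded file has the shape the training script expects, namely records with instruction, input and output fields. The sketch below assumes the file is a JSON array of objects; the helper names check_records and check_file are hypothetical.

```python
import json

REQUIRED_KEYS = ("instruction", "input", "output")

def check_records(records):
    """Return the indexes of records that lack a required string field."""
    bad = []
    for i, rec in enumerate(records):
        if not all(isinstance(rec.get(k), str) for k in REQUIRED_KEYS):
            bad.append(i)
    return bad

def check_file(path):
    """Load a JSON array from disk and validate every record in it."""
    with open(path, encoding="utf-8") as f:
        return check_records(json.load(f))

# Tiny in-memory demonstration; the second record has no "input" field
sample = [
    {"instruction": "q1", "input": "", "output": "a1"},
    {"instruction": "q2", "output": "a2"},
]
print(check_records(sample))  # [1]
```

For the real data, check_file('D:/cuda/ruozhiba-llama3-tt/ruozhiba_qa.json') should return an empty list if every record is usable.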
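3. Fine-tune with LoRA
① Install the required Python packages (see step 1). The training code comes from another article; I only changed the model name. train.py: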

import os
os.environ["TF_ENABLE_ONEDNN_OPTS"]="0"
from datasets import Dataset
import pandas as pd
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    DataCollatorForSeq2Seq,
    TrainingArguments,
    Trainer, )
import torch
from peft import LoraConfig, TaskType, get_peft_model
import warnings
warnings.filterwarnings("ignore", category=UserWarning)  # suppress UserWarning messages
 
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print("device:"+device)
print(f"PyTorch version: {torch.__version__}")
print(f"GPU available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU count: {torch.cuda.device_count()}")
    print(f"Current GPU: {torch.cuda.current_device()}")
    print(f"GPU name: {torch.cuda.get_device_name(0)}")
# Path to the base model files
model_path = r'D:/cuda/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct'
# Where training outputs (checkpoints) are saved
name = 'ruozhiba'
output_dir = f'./output/qwen-0.5B-{name}'
# Set to True to resume training from the last checkpoint
train_with_checkpoint = False
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
# Load the dataset
df = pd.read_json('D:/cuda/ruozhiba-llama3-tt/ruozhiba_qa.json')
ds = Dataset.from_pandas(df)
print(ds)
# Convert each record into the token sequence used for training.
# Note: the prompt template below is Llama-style ([INST] ... [/INST]); Qwen2.5's own chat format
# is ChatML (<|im_start|> ... <|im_end|>), which is what the Ollama TEMPLATE in step 5 uses.
def process_func_mistral(example):
    MAX_LENGTH = 384  # one Chinese character may be split into several tokens, so leave headroom to keep samples complete
    instruction = tokenizer(
        f"<s>[INST] <<SYS>>\n\n<</SYS>>\n\n{example['instruction']+example['input']}[/INST]",add_special_tokens=False)  # add_special_tokens=False: do not prepend special tokens
    response = tokenizer(f"{example['output']}", add_special_tokens=False)
    input_ids = instruction["input_ids"] + response["input_ids"] + [tokenizer.pad_token_id]
    attention_mask = instruction["attention_mask"] + response["attention_mask"] + [1]  # the trailing token (pad_token is set to eos_token above) must be attended to, so its mask is 1
    labels = [-100] * len(instruction["input_ids"]) + response["input_ids"] + [tokenizer.pad_token_id]
    if len(input_ids) > MAX_LENGTH:  # truncate over-long samples
        input_ids = input_ids[:MAX_LENGTH]
        attention_mask = attention_mask[:MAX_LENGTH]
        labels = labels[:MAX_LENGTH]
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels
    }
 
inputs_id = ds.map(process_func_mistral, remove_columns=ds.column_names)
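# Load the model
# Pick a data type based on GPU capability
if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8:
    # A100, H100, RTX 40 series, etc. support bfloat16
    torch_dtype = torch.bfloat16
else:
    # Other GPUs use float16
    torch_dtype = torch.float16

print(f"Data type selected by the capability check: {torch_dtype}")

if not os.path.exists(model_path):
    raise ValueError(f"Model path does not exist: {model_path}")

# List the files under the path to check that the config files are present
print(f"Files under the model path: {os.listdir(model_path)}")
# Note: the model is loaded in bfloat16 here regardless of the torch_dtype selected above
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device, torch_dtype=torch.bfloat16, use_cache=False)
model.enable_input_require_grads()  # required when gradient checkpointing is enabled
print(model)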
# LoRA configuration
config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    inference_mode=False,  # training mode
    r=8,  # LoRA rank
    lora_alpha=32,  # LoRA alpha; see how LoRA works for its role
    lora_dropout=0.1  # dropout rate
)
# Wrap the model with LoRA, then set the training arguments
model = get_peft_model(model, config)
model.print_trainable_parameters()
args = TrainingArguments(
    output_dir=output_dir,  # where checkpoints are saved
    per_device_train_batch_size=2,  # training batch size. Previously 2: with 2 GB of VRAM available, actual peak memory use reached 4.6 GB; at 32, 15 GB; at 16, 15 GB; at 8, 11.7 GB; at 4, 6.7 GB
    gradient_accumulation_steps=2,  # gradient accumulation steps
    logging_steps=10,  # log every N steps
    num_train_epochs=2,  # number of training epochs (previously 2)
    save_steps=25,  # save a checkpoint every N steps
    save_total_limit=2,  # maximum number of checkpoints to keep
    learning_rate=1e-4,  # learning rate
    save_on_each_node=True,  # in multi-node training, save on every node
    gradient_checkpointing=True,  # enable gradient checkpointing
    label_names=["labels"],  # label column names
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=inputs_id,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),
)
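# Start training
# If training was interrupted, it can resume from the last saved checkpoint
if train_with_checkpoint:
    # Pick the checkpoint with the highest step number (os.listdir order is not guaranteed)
    checkpoint = max((f for f in os.listdir(output_dir) if f.startswith('checkpoint-')), key=lambda f: int(f.split('-')[-1]))
    last_checkpoint = f'{output_dir}/{checkpoint}'
    print(last_checkpoint)
    trainer.train(resume_from_checkpoint=last_checkpoint)
else:
    trainer.train()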

Run it with python train.py.
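
The core of process_func_mistral is the label mask: prompt tokens get the label -100 so that the loss ignores them, and only the response plus the end token are learned. Here is a tokenizer-free sketch of that step on plain integer lists; the function name build_example and the token ids are made up for illustration.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips

def build_example(prompt_ids, response_ids, end_id, max_length=384):
    """Concatenate prompt, response and end token; mask the prompt in the labels."""
    input_ids = prompt_ids + response_ids + [end_id]
    attention_mask = [1] * len(input_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids + [end_id]
    # Truncate all three lists together so they stay aligned
    return {
        "input_ids": input_ids[:max_length],
        "attention_mask": attention_mask[:max_length],
        "labels": labels[:max_length],
    }

example = build_example(prompt_ids=[11, 12, 13], response_ids=[21, 22], end_id=2)
print(example["labels"])  # [-100, -100, -100, 21, 22, 2]
```

Note that truncation cuts from the end, so a very long sample can lose its end token; the 384-token limit is what keeps that rare for short question-answer pairs.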

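This produces several checkpoints. On my machine, 2 epochs at batch size 2 took about 50 minutes, and the resulting accuracy was poor.
Later I got access to better hardware and tried other settings.
With batch size 8 and 100 epochs on an A4000 (16 GB of VRAM), the loss dropped to 0.005: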

{'loss': 0.005, 'grad_norm': 0.1689453125, 'learning_rate': 0.0, 'epoch': 98.94}
{'train_runtime': 4730.9062, 'train_samples_per_second': 31.622, 'train_steps_per_second': 1.966, 'train_loss': 0.09664655848256042, 'epoch': 98.94}
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9300/9300 [1:18:50<00:00,  1.97it/s]
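
The step count in the log can be cross-checked with a little arithmetic. The sketch below assumes the dataset holds 1,496 samples (a figure inferred from train_samples_per_second × train_runtime / 100 epochs in the log, not stated in the article), that gradient_accumulation_steps stayed at 2 as in train.py, and that a single GPU was used. The helper name optimizer_steps is hypothetical.

```python
import math

def optimizer_steps(num_samples, per_device_batch, grad_accum, epochs):
    """Approximate the number of optimizer steps for a single-GPU run."""
    batches_per_epoch = math.ceil(num_samples / per_device_batch)
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch * epochs

print(optimizer_steps(1496, 8, 2, 100))  # 9300, matching the progress bar
print(optimizer_steps(1496, 2, 2, 2))    # 748 for the first run on the laptop
```

With save_steps=25 and save_total_limit=2, only the two most recent checkpoints out of these steps remain on disk.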

② Convert the checkpoint to a LoRA adapter
checkpoint_to_lora.py
Code:

from transformers import AutoModelForCausalLM, AutoTokenizer
import os
 
# Where to save the LoRA adapter
lora_path= "d:/cuda/lora/qwen-0__5B-lora-ruozhiba"
# Base model path
model_path = 'D:/cuda/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct'
# Checkpoint directory
checkpoint_dir = 'output/qwen-0.5B-ruozhiba'
# Pick the checkpoint with the highest step number (os.listdir order is not guaranteed)
checkpoint = max((f for f in os.listdir(checkpoint_dir) if f.startswith('checkpoint-')), key=lambda f: int(f.split('-')[-1]))
# The adapter was trained for causal language modeling, so load it with the causal-LM class
model = AutoModelForCausalLM.from_pretrained(f'{checkpoint_dir}/{checkpoint}')
# Save the adapter
model.save_pretrained(lora_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
# Save the tokenizer
tokenizer.save_pretrained(lora_path)

Run it with python checkpoint_to_lora.py.
The adapter is written to d:/cuda/lora/qwen-0__5B-lora-ruozhiba.

③ Merge the model
merge.py
Code:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from peft import PeftModel
from peft import LoraConfig, TaskType, get_peft_model
 
model_path = 'D:/cuda/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct'
lora_path = "d:/cuda/lora/qwen-0__5B-lora-ruozhiba"
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# Output path for the merged model
output_path = r'd:/cuda/merge'
 
# Same config as used for training
config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    inference_mode=False,  # training mode
    r=8,  # LoRA rank
    lora_alpha=32,  # LoRA alpha; see how LoRA works for its role
    lora_dropout=0.1  # dropout rate
)
 
base = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True)
base_tokenizer = AutoTokenizer.from_pretrained(model_path)
lora_model = PeftModel.from_pretrained(
    base,
    lora_path,
    torch_dtype=torch.float16,
    config=config
)
model = lora_model.merge_and_unload()
model.save_pretrained(output_path)
base_tokenizer.save_pretrained(output_path)

Run it:
python merge.py
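
Conceptually, merge_and_unload folds each LoRA update back into its base weight: W' = W + (lora_alpha / r) · B·A, which with the settings in this article gives a scale of 32 / 8 = 4. The following is a minimal pure-Python sketch on a toy 2×2 weight; the function names and toy matrices are illustrative only and are not PEFT's implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(weight, lora_b, lora_a, lora_alpha, r):
    """Return weight + (lora_alpha / r) * (lora_b @ lora_a)."""
    scale = lora_alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]  # toy base weight (2 x 2)
B = [[0.5], [0.25]]           # 2 x r, with r = 1 in this toy case
A = [[1.0, 2.0]]              # r x 2
merged = merge_lora(W, B, A, lora_alpha=4, r=1)
print(merged)  # [[3.0, 4.0], [1.0, 3.0]]
```

After merging, the result is an ordinary weight matrix of the original shape, which is why the merged model can be saved and converted like any other Hugging Face checkpoint.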

The merged model is written to d:/cuda/merge.

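④ Convert the model to GGUF format
Download the source code:
https://github.com/ggml-org/llama.cpp

Use the converter script it contains (the output is a GGUF file even though it is named .bin here):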

python convert_hf_to_gguf.py D:/cuda/merge --outtype f16 --outfile d:/cuda/qwen-ruozhiba.bin
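
Because the converted file carries a .bin extension, a quick way to confirm that it really is GGUF is to look at its header: a GGUF file starts with the four magic bytes GGUF followed by a little-endian 32-bit version number. Below is a small sketch of that check; the helper name read_gguf_header is hypothetical, and the demonstration uses a tiny fake file rather than a real model.

```python
import os
import struct
import tempfile

def read_gguf_header(path):
    """Return (is_gguf, version) based on the first 8 bytes of the file."""
    with open(path, "rb") as f:
        head = f.read(8)
    if len(head) < 8 or head[:4] != b"GGUF":
        return False, None
    (version,) = struct.unpack("<I", head[4:8])
    return True, version

# Self-check on a tiny fake file (not a real model): magic bytes plus version 3
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as tmp:
    tmp.write(b"GGUF" + struct.pack("<I", 3))
print(read_gguf_header(tmp.name))  # (True, 3)
os.remove(tmp.name)
```

For the file produced above, read_gguf_header('d:/cuda/qwen-ruozhiba.bin') should report True along with whatever version the converter wrote.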

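4. Quantize the model
First download a prebuilt llama.cpp release:
https://github.com/ggml-org/llama.cpp/releases
Choose the package that matches your CPU. Mine, for example, is a 10th-generation Intel processor; CPU-Z shows which instruction sets it supports.

It supports AVX and AVX2, so I can use the AVX2 package.
After downloading and unzipping it, you get the prebuilt binaries.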

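You can add this folder to PATH, but watch out for conflicts. PS: after I later upgraded Ollama a conflict did appear, and removing the folder from PATH fixed it.
Then run the quantization in D:/cuda:
llama-quantize qwen-ruozhiba.bin q5_k_m
This produces a new file, ggml-model-Q5_K_M.gguf.
Move the file to a new directory: D:/cuda/qwen2.5-new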
5. Install Ollama
Download
https://ollama.com/download/OllamaSetup.exe
and install it.
Then create a new file named ModelFile in that folder (D:/cuda/qwen2.5-new) with the following content:

FROM D:/cuda/qwen2.5-new/ggml-model-Q5_K_M.gguf

# set the temperature to 0.7 [higher is more creative, lower is more coherent]
PARAMETER temperature 0.7
PARAMETER top_p 0.8
PARAMETER repeat_penalty 1.05
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
{{ .Response }}<|im_end|>"""
# set the system message
SYSTEM """
You are a helpful assistant.
"""

Create the model:

ollama create qwen2_0.5b_demo --file ./ModelFile
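
For a single turn, the TEMPLATE in the ModelFile above produces ChatML text. The sketch below reproduces that rendering in Python to show what the model receives before it starts generating. render_chatml is a hypothetical helper, and its whitespace handling approximates Ollama's Go template rather than reimplementing it exactly. Note that train.py formats its samples with a Llama-style [INST] template instead, so the fine-tuned weights are prompted at inference time in a format that differs from the one seen during training.

```python
def render_chatml(prompt, system=""):
    """Build the ChatML text that precedes the model's reply."""
    parts = []
    if system:
        parts.append(f"<|im_start|>system\n{system}<|im_end|>\n")
    if prompt:
        parts.append(f"<|im_start|>user\n{prompt}<|im_end|>\n")
    # The assistant header is left open; the model writes its reply after it
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

print(render_chatml("Hello", system="You are a helpful assistant."))
```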

Finally, run the model:

ollama run qwen2_0.5b_demo
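
Besides the interactive prompt, a running Ollama server also exposes a local HTTP API, which is convenient for scripted tests of the new model. Here is a minimal standard-library sketch, assuming the default address http://localhost:11434 and the model name created above; the helper names build_generate_request and generate are hypothetical.

```python
import json
import urllib.request

def build_generate_request(prompt, model="qwen2_0.5b_demo", host="http://localhost:11434"):
    """Prepare a non-streaming POST to Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(prompt, **kwargs):
    """Send the request and return the generated text (needs a running Ollama server)."""
    with urllib.request.urlopen(build_generate_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]

# Usage (only works while the Ollama server is running):
#   print(generate("Hello"))
```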

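Now you can have fun with it.
Because this model is so small, the first result was of little use; the model I retrained later was somewhat more useful. If you have time, you could also try a larger base model.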

For ideas on adding web access, see:
https://blog.csdn.net/ChinaLiaoTian/article/details/145504774
For knowledge-base capabilities, see:
https://blog.csdn.net/sheex2012/article/details/138339166