Fine-Tuning Qwen-7B on Your Own Dataset

Latest recommended article published on 2025-04-10 13:29:12

跆拳道~跆拳小生~小陶


Article tags: language models, python, artificial intelligence

Article link: https://blog.csdn.net/txf1931783593/article/details/136210021


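Prerequisites

Before fine-tuning, make sure the Qwen-7B model has already been deployed correctly. The deployment steps are covered in my previous article,
"Deploying Tongyi Qianwen Qwen-7B on Ubuntu 20.04" (【Ubuntu20.04部署通义千问Qwen-7B】).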

Environment required for fine-tuning

pip install peft deepspeed openpyxl -i http://pypi.doubanio.com/simple/ --trusted-host pypi.doubanio.com

Data preparation

Fine-tuning expects the training data in the following format:

[
  {
    "id": "identity_0",
    "conversations": [
      {
        "from": "user",
        "value": "你好"
      },
      {
        "from": "assistant",
        "value": "我是一个语言模型，我叫通义千问。"
      }
    ]
  }
]
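A malformed record is easier to catch before training than halfway through it. The sketch below checks a list of records against the format shown above. The function name `validate_records` is my own, and the rule that turns alternate between "user" and "assistant" is an assumption based on the sample, not something taken from the Qwen code.

```python
def validate_records(records):
    """Return a list of human-readable problems; an empty list means the data looks fine."""
    problems = []
    for index, record in enumerate(records):
        if not isinstance(record, dict) or "conversations" not in record:
            problems.append(f"record #{index}: missing 'conversations'")
            continue
        label = record.get("id", f"record #{index}")
        for turn_number, turn in enumerate(record["conversations"]):
            if not isinstance(turn, dict):
                problems.append(f"{label}: turn {turn_number} is not an object")
                continue
            # Assumption: turns alternate user, assistant, user, ... as in the sample.
            expected = "user" if turn_number % 2 == 0 else "assistant"
            if turn.get("from") != expected:
                problems.append(f"{label}: turn {turn_number} should be from '{expected}'")
            value = turn.get("value")
            if not isinstance(value, str) or not value.strip():
                problems.append(f"{label}: turn {turn_number} has an empty or non-string 'value'")
    return problems


sample = [{
    "id": "identity_0",
    "conversations": [
        {"from": "user", "value": "你好"},
        {"from": "assistant", "value": "我是一个语言模型，我叫通义千问。"},
    ],
}]
print(validate_records(sample))  # prints []
```

The same function can be run on the result of `json.load` once the training file exists.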

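When preparing my own dataset, I put the question/answer pairs in an Excel file and used a script to convert them into the format the model requires. The script is as follows: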

# -*- coding:utf-8 -*-
"""
Date: 2024-02-21
Name:
Purpose: convert Excel spreadsheet data into the training data format the model requires
"""
import json

import numpy as np
import pandas as pd


class Engineer:

    def get_excel(self, file_name):
        """
        Read the questions and answers from the spreadsheet.
        :param file_name: path to the Excel file
        :return: DataFrame with the columns "input" and "target"
        """
        self.dfxm_all = pd.read_excel(file_name, sheet_name=0)  # read the first sheet
        # np.NaN was removed in NumPy 2.0; np.nan works in every version
        self.dfxm_all = self.dfxm_all.replace("-", np.nan)
        self.dfxm_all.columns = ['input', 'target']  # the sheet must have exactly two columns
        # Drop rows with a missing question or answer so that no NaN ends up in the JSON
        self.dfxm_all = self.dfxm_all.dropna().reset_index(drop=True)
        return self.dfxm_all

    def get_value(self, dfxm_value):
        """
        Build the conversation records from the DataFrame and write them to ./data.json.
        :param dfxm_value: DataFrame returned by get_excel
        :return: list of conversation records
        """
        conversations_list = []  # all conversation records
        for i in range(len(dfxm_value)):
            # str() keeps numeric cells JSON-serializable
            question = str(dfxm_value.loc[i, "input"])
            answer = str(dfxm_value.loc[i, "target"])
            identity_dict = {
                "id": "identity_" + str(i),
                "conversations": self.create_data(question, answer),
            }
            conversations_list.append(identity_dict)
        print(conversations_list)
        # Serialize to a JSON string; ensure_ascii=False keeps Chinese text readable
        json_data = json.dumps(conversations_list, ensure_ascii=False, indent=4)
        # Write the JSON string to a file
        with open('./data.json', 'w', encoding='utf-8') as file:
            file.write(json_data)
        return conversations_list

    def create_data(self, question, answer):
        """Build one single-turn conversation in the user/assistant format."""
        return [
            {"from": "user", "value": question},
            {"from": "assistant", "value": answer},
        ]


def main():
    file_name = "./data.xlsx"
    eng = Engineer()
    dfxm_value = eng.get_excel(file_name)  # DataFrame of question/answer pairs
    eng.get_value(dfxm_value)


if __name__ == '__main__':
    main()

The input looks like this (an Excel file whose columns are named "input" and "target"):
[Image: the input Excel sheet]

The output looks like this; the JSON file is generated in the same directory as the script.
[Image: the generated JSON file]
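Before training, a quick spot-check of the generated file can be useful. The sketch below counts the records and reports the longest one. The helper names `summarize` and `summarize_file` are my own, and the character count is only a rough proxy, since the real length limit at training time is measured in tokens.

```python
import json


def summarize(records):
    """Return (number of records, id of the longest record, its length in characters)."""
    longest_id, longest_len = None, 0
    for record in records:
        # Character count is only a rough proxy; the tokenizer decides the real length.
        length = sum(len(str(turn["value"])) for turn in record["conversations"])
        if length > longest_len:
            longest_id, longest_len = record["id"], length
    return len(records), longest_id, longest_len


def summarize_file(path="./data.json"):
    """Load the JSON file written by the conversion script and summarize it."""
    with open(path, encoding="utf-8") as file:
        return summarize(json.load(file))
```

Calling `summarize_file()` from the script's directory returns the three values for `data.json`. An unusually long record is worth a look, because it may be cut off by the length limit set at training time.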

The official repository describes several fine-tuning methods, so let's try them one by one.

Method 1: Q-LoRA

Modify the configuration

export CUDA_DEVICE_MAX_CONNECTIONS=1 
export CUDA_VISIBLE_DEVICES=0

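Edit the script finetune/finetune_qlora_single_gpu.sh and change the lines below to set the training data path; leave everything else unchanged.

# My model was downloaded under the "qwen" directory, so "Qwen" in the path becomes "qwen"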
MODEL="/home/taoxifa/.cache/modelscope/hub/qwen/Qwen-7B-Chat-Int4"
DATA="code/data.json"

pip install mpi4py -i http://pypi.doubanio.com/simple/ --trusted-host pypi.doubanio.com

The installation failed with this error: ERROR: Could not build wheels for mpi4py, which is required to install pyproject.toml-based projects

sudo apt update
sudo apt-get install libopenmpi-dev
# After running the two commands above, installing mpi4py with pip still failed, so I installed it with conda instead
conda install mpi4py

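Start training

Install peft 0.7.0 first. I ran into this pitfall during training (described further down), so I have moved the installation ahead of the training step.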
pip install peft==0.7.0 -i http://pypi.doubanio.com/simple/ --trusted-host pypi.doubanio.com
cd Qwen
bash finetune/finetune_qlora_single_gpu.sh

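When the following output appears, training is running normally; wait for it to finish.
[Image: training progress output]
Once training is complete, test the trained model with the code below. output_qwen is the folder where the model is saved after training.

# The same test code is used each time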
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

model = AutoPeftModelForCausalLM.from_pretrained("output_qwen", device_map="auto", trust_remote_code=True).eval()
tokenizer = AutoTokenizer.from_pretrained("output_qwen", trust_remote_code=True)
response, history = model.chat(tokenizer, "你好", history=None)
print(response, history)

Running it raised this error:
ValueError: Tokenizer class QWenTokenizer does not exist or is not currently imported

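Note: with peft>=0.8.0, loading the model also tries to load the tokenizer, but peft does not set trust_remote_code=True internally, which causes ValueError: Tokenizer class QWenTokenizer does not exist or is not currently imported. To work around this, either downgrade to peft<0.8.0 or move the tokenizer files to another folder.

I tried moving the tokenizer files to another folder, but that caused other errors, so I chose to downgrade peft.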

# Uninstall peft
pip uninstall peft
# Install peft 0.7.0
pip install peft==0.7.0 -i http://pypi.doubanio.com/simple/ --trusted-host pypi.doubanio.com
# Note: after downgrading you must retrain, because the earlier run used the newer peft and its output cannot be used
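To avoid discovering the version problem only after a full training run, the installed peft version can be checked up front. This is a minimal sketch using only the standard library. The function names are my own, and the comparison looks only at the major and minor numbers, so a pre-release such as 0.8.0.dev0 is treated like 0.8.0.

```python
import re
from importlib import metadata


def peft_below_0_8(version):
    """True if a version string such as '0.7.0' is older than 0.8.0."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) < (0, 8)


def installed_peft_version():
    """Return the installed peft version, or None if peft is not installed."""
    try:
        return metadata.version("peft")
    except metadata.PackageNotFoundError:
        return None


version = installed_peft_version()
if version is None:
    print("peft is not installed")
elif peft_below_0_8(version):
    print(f"peft {version} is below 0.8.0")
else:
    print(f"peft {version} is 0.8.0 or newer; the tokenizer error described above may appear")
```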

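After retraining, the test failed with: ValueError: Target module QuantLinear() is not supported. Currently, only the following modules are supported: torch.nn.Linear, torch.nn.Embedding, torch.nn.Conv2d, transformers.pytorch_utils.Conv1D.
Does the Int4 model not support fine-tuning? The official site says it does, and I have not figured this out. If you know the answer, please share it in the comments.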

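Method 2: LoRA with Qwen-1_8B

Because a single GPU does not have enough memory, I used the 1.8B model for LoRA fine-tuning.
Edit the script finetune/finetune_lora_single_gpu.sh and change the lines below to set the training data path; leave everything else unchanged.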

# Training command. It uses fp16, so DeepSpeed must be enabled
python finetune.py \
  --model_name_or_path /home/taoxifa/.cache/modelscope/hub/qwen/Qwen-1_8B-Chat \
  --data_path code/data.json \
  --fp16 True \
  --output_dir output_qwen \
  --num_train_epochs 5 \
  --per_device_train_batch_size 2 \
  --per_device_eval_batch_size 1 \
  --gradient_accumulation_steps 8 \
  --evaluation_strategy "no" \
  --save_strategy "steps" \
  --save_steps 1000 \
  --save_total_limit 10 \
  --learning_rate 3e-4 \
  --weight_decay 0.1 \
  --adam_beta2 0.95 \
  --warmup_ratio 0.01 \
  --lr_scheduler_type "cosine" \
  --logging_steps 1 \
  --report_to "none" \
  --model_max_length 512 \
  --lazy_preprocess True \
  --gradient_checkpointing \
  --use_lora \
  --deepspeed finetune/ds_config_zero2.json
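The batch-size flags in this command interact, so it helps to work out how many optimizer steps a run will take. The sketch below is an estimate under stated assumptions: a single GPU, a hypothetical dataset of 1000 samples, and a rounding rule that roughly follows how the Hugging Face Trainer counts steps (the exact rounding may differ between versions).

```python
import math


def training_steps(num_samples, per_device_batch_size=2, gradient_accumulation_steps=8,
                   num_gpus=1, num_epochs=5):
    """Estimate the effective batch size and the total number of optimizer steps."""
    # Samples consumed per optimizer step across all devices.
    effective_batch = per_device_batch_size * gradient_accumulation_steps * num_gpus
    # Mini-batches seen per epoch, then grouped into accumulated optimizer steps.
    batches_per_epoch = math.ceil(num_samples / (per_device_batch_size * num_gpus))
    updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
    return effective_batch, updates_per_epoch * num_epochs


# 1000 samples is a made-up figure for illustration.
effective_batch, total_steps = training_steps(num_samples=1000)
print(effective_batch, total_steps)  # prints: 16 310
```

With these numbers the run takes about 310 steps, which is well short of the --save_steps 1000 interval, so a dataset of this size would never reach an intermediate checkpoint. Lower --save_steps if you want checkpoints along the way.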

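After training, merge and save the model, then test it. Adjust the folder paths for your own setup; my output folder is in the parent directory, hence "../output_qwen".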

from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Load the base model together with the LoRA adapter saved by training
model = AutoPeftModelForCausalLM.from_pretrained("../output_qwen", device_map="auto", trust_remote_code=True).eval()
# Fold the LoRA weights into the base model so it can be used without peft
merged_model = model.merge_and_unload()
merged_model.save_pretrained("output_model/qwen-1_8b-finetune", max_shard_size="2048MB", safe_serialization=True)  # shard files are capped at 2 GB
# Load the tokenizer from the same training output folder as the adapter
tokenizer = AutoTokenizer.from_pretrained("../output_qwen", trust_remote_code=True)

tokenizer.save_pretrained("output_model/qwen-1_8b-finetune")

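To test the fine-tuned model, first copy three files into the folder where the model was saved above: qwen.tiktoken, tokenization_qwen.py, and tokenizer_config.json.
[Image: the three files in the model folder]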
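The copy step can also be scripted. This is a minimal sketch with a helper name of my own, `copy_tokenizer_files`. It assumes the three files are taken from the original base-model directory, which the text above does not state explicitly.

```python
import shutil
from pathlib import Path

# The three file names listed above.
TOKENIZER_FILES = ["qwen.tiktoken", "tokenization_qwen.py", "tokenizer_config.json"]


def copy_tokenizer_files(source_dir, target_dir):
    """Copy the tokenizer files into the merged-model folder; return the names not found."""
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    missing = []
    for name in TOKENIZER_FILES:
        source = source_dir / name
        if source.is_file():
            shutil.copy2(source, target_dir / name)
        else:
            missing.append(name)
    return missing


# Example call (the source path is an assumption; use your own base-model folder):
# copy_tokenizer_files("/home/taoxifa/.cache/modelscope/hub/qwen/Qwen-1_8B-Chat",
#                      "output_model/qwen-1_8b-finetune")
```

An empty return value means all three files were copied. Any names in the returned list still need to be located by hand.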

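# Test code
from transformers import AutoModelForCausalLM, AutoTokenizer

# Replace the placeholder with the full path to your own model folder
tokenizer = AutoTokenizer.from_pretrained("<full path to your model>/qwen-1_8b-finetune", trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained("<full path to your model>/qwen-1_8b-finetune", device_map="auto",
                                             trust_remote_code=True).eval()
# Ask something that only your custom data covers, to confirm the fine-tuning took effect
response, history = model.chat(tokenizer, "a question from your custom knowledge base", history=None)
print(response, history)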
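One question is rarely enough to judge a fine-tuned model, so it is convenient to run a batch of them. The sketch below wraps the `model.chat` call from the test code in a loop. The helper `ask_all` and the stand-in class `EchoModel` are hypothetical names of my own; the stand-in exists only so the loop can be tried without loading a real model.

```python
def ask_all(model, tokenizer, questions, keep_history=False):
    """Send each question to model.chat and collect (question, response) pairs."""
    results = []
    history = None
    for question in questions:
        response, new_history = model.chat(tokenizer, question, history=history)
        if keep_history:
            # Carry the conversation forward so later questions see earlier turns.
            history = new_history
        results.append((question, response))
    return results


class EchoModel:
    """Hypothetical stand-in that mimics the chat(tokenizer, query, history) call."""

    def chat(self, tokenizer, query, history=None):
        history = list(history or []) + [(query, f"echo: {query}")]
        return history[-1][1], history


print(ask_all(EchoModel(), tokenizer=None, questions=["q1", "q2"]))
# prints [('q1', 'echo: q1'), ('q2', 'echo: q2')]
```

With the real model and tokenizer loaded as above, pass them in place of the stand-in. Leaving `keep_history=False` treats every question as a fresh conversation, which suits checking individual question/answer pairs from the training data.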