Human-Machine Dialogue with the ChatGLM Model: LoRA Fine-Tuning in Practice

Zsomnus_

Last modified 2024-02-12 22:25:02

Category: Large Models. Tags: natural language processing, python, language models

First published 2024-02-12 22:22:41

Article link: https://blog.csdn.net/wo1234234/article/details/136102725

Contents

Downloading the chatglm-lora fine-tuning source code
Installing dependencies
Fine-tuning
- Data preprocessing
Inference

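This post builds on the previous one about ChatGLM P-Tuning fine-tuning and moves on to LoRA fine-tuning. The first two sections are the same as in that post; see "ChatGLM P-Tuning Fine-Tuning in Practice" for details.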

Downloading the chatglm-lora fine-tuning source code

git clone https://gitcode.com/mymusise/ChatGLM-Tuning.git

Installing dependencies

After the download, a ChatGLM-Tuning folder appears in the current directory. cd into that folder, then run:

pip install -r requirements.txt

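Fine-tuning

Unlike P-Tuning fine-tuning, LoRA fine-tuning requires a data preprocessing step first. The data format is the same as the data in the official repository.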

Data preprocessing

Convert the JSON dataset to JSONL, using the official data as an example:

# Change both paths to your own
python cover_alpaca2jsonl.py \
    --data_path data/alpaca_data.json \
    --save_path data/alpaca_data.jsonl
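
To make the conversion step more concrete, here is a minimal sketch of what such a script might do. It is not the repository's actual code. It assumes Alpaca-style records with `instruction`, `input`, and `output` keys, and the prompt template inside the hypothetical helper `to_context_target` is an assumption for illustration. Only the `context` and `target` field names come from this post.

```python
import json


def to_context_target(example: dict) -> dict:
    """Turn one Alpaca-style record into a context/target pair.

    The prompt template below is an assumption for illustration only.
    """
    context = f"Instruction: {example['instruction']}\n"
    if example.get("input"):
        # The input field is optional, so it is only added when non-empty
        context += f"Input: {example['input']}\n"
    context += "Answer: "
    return {"context": context, "target": example["output"]}


def json_to_jsonl(data_path: str, save_path: str) -> int:
    """Read a JSON array of records and write one JSON object per line."""
    with open(data_path, encoding="utf-8") as f:
        examples = json.load(f)
    with open(save_path, "w", encoding="utf-8") as f:
        for example in examples:
            # ensure_ascii=False keeps non-ASCII text readable in the output file
            row = to_context_target(example)
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return len(examples)
```

Each output line is then a standalone JSON object, which lets the next step process the file line by line.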

Tokenization

python tokenize_dataset_rows.py \
    --jsonl_path data/alpaca_data.jsonl \
    --save_path data/alpaca \
    --max_seq_length 200 \
    --skip_overlength False \
    --chatglm_path /homebak/home_new/heyiwei/ChatGLM2-6B/chatglm2-6b \
    --version v2

When the script finishes, the data/alpaca directory is created.

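--jsonl_path: path to the fine-tuning data in JSONL format; the ['context'] and ['target'] fields of each line are encoded
--save_path: output path
--max_seq_length: maximum sample length
--chatglm_path: path of the model to load (either a chatglm or a chatglm2 path)
--version: model version (v1 for chatglm, v2 for chatglm2)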
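
The interaction between --max_seq_length and --skip_overlength can be sketched roughly as follows. This is an illustration under stated assumptions, not the script itself: `toy_encode` is a hypothetical whitespace stand-in for the real ChatGLM tokenizer, `EOS_ID` is a made-up end-of-sequence id, and truncating samples that are not skipped is an assumption about the behavior.

```python
from typing import Callable, Dict, Iterable, Iterator, List

EOS_ID = 0  # hypothetical end-of-sequence id


def toy_encode(text: str) -> List[int]:
    """Hypothetical tokenizer stand-in: one id per whitespace-separated word."""
    return [len(word) for word in text.split()]


def build_features(
    rows: Iterable[Dict[str, str]],
    max_seq_length: int,
    skip_overlength: bool = False,
    encode: Callable[[str], List[int]] = toy_encode,
) -> Iterator[Dict]:
    """Encode context and target, then skip or truncate overlong samples."""
    for row in rows:
        prompt_ids = encode(row["context"])
        target_ids = encode(row["target"])
        input_ids = prompt_ids + target_ids + [EOS_ID]
        if skip_overlength and len(input_ids) > max_seq_length:
            continue  # drop the sample entirely
        # Otherwise keep the sample, cut down to the maximum length
        yield {"input_ids": input_ids[:max_seq_length], "seq_len": len(prompt_ids)}
```

With skipping enabled, long samples disappear from the dataset. With it disabled, they are kept but may lose the end of the target, which is worth keeping in mind when choosing --max_seq_length.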

Training

python finetune.py \
    --dataset_path data/alpaca \
    --lora_rank 8 \
    --per_device_train_batch_size 6 \
    --gradient_accumulation_steps 1 \
    --max_steps 52000 \
    --save_steps 1000 \
    --save_total_limit 2 \
    --learning_rate 1e-4 \
    --fp16 \
    --remove_unused_columns false \
    --logging_steps 50 \
    --output_dir output \
    --chatglm_path /homebak/home_new/heyiwei/ChatGLM2-6B/chatglm2-6b
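
To get a feel for the numbers in this command, the small calculation below may help. LoRA freezes a weight matrix of shape d_out × d_in and trains two low-rank factors of shapes d_out × r and r × d_in instead, so it adds r · (d_in + d_out) trainable parameters per adapted matrix. The 4096 × 4096 shape is a hypothetical example, not a claim about which ChatGLM layers the script adapts, and both function names are made up for illustration.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two LoRA factors (d_out x rank and rank x d_in)."""
    return rank * (d_in + d_out)


def samples_seen(per_device_batch_size: int, grad_accum_steps: int,
                 max_steps: int, num_devices: int = 1) -> int:
    """Total training samples consumed over the whole run."""
    return per_device_batch_size * grad_accum_steps * max_steps * num_devices


full = 4096 * 4096                           # frozen weights in the example matrix
lora = lora_trainable_params(4096, 4096, 8)  # --lora_rank 8
print(f"{lora} trainable vs {full} frozen ({lora / full:.2%})")

# --per_device_train_batch_size 6, --gradient_accumulation_steps 1, --max_steps 52000
print(samples_seen(6, 1, 52000), "samples on a single device")
```

For the example matrix, rank 8 trains well under 1% of the frozen weights, which is why LoRA needs far less optimizer memory than full fine-tuning.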

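Note: this only works with the unquantized 6B model. Quantized weights are stored as integer tensors, which cannot require gradients, so training a quantized model fails with: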

RuntimeError: Only Tensors of floating point and complex dtype can require gradients

Inference

from transformers import AutoModel, AutoTokenizer
import torch
from peft import PeftModel
import json
from cover_alpaca2jsonl import format_example

device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")

# Load the base model and tokenizer, then attach the LoRA weights
model = AutoModel.from_pretrained("xxx/chatglm-6b", trust_remote_code=True, load_in_8bit=True, device_map='auto', revision="")
tokenizer = AutoTokenizer.from_pretrained("xxx/chatglm-6b", trust_remote_code=True, revision="")
model = PeftModel.from_pretrained(model, "path to the fine-tuned model")


# Load the data
with open("xxx/alpaca_data_cleaned.json", encoding="utf-8") as f:
    instructions = json.load(f)
answers = []


with torch.no_grad():
    for idx, item in enumerate(instructions[:3]):
        # Build the prompt with the same helper used during preprocessing
        feature = format_example(item)
        input_text = feature['context']
        ids = tokenizer.encode(input_text)
        input_ids = torch.LongTensor([ids]).to(device)
        # Greedy decoding; temperature has no effect when do_sample=False
        out = model.generate(
            input_ids=input_ids,
            max_length=150,
            do_sample=False
        )
        out_text = tokenizer.decode(out[0])
        # Strip the prompt and the END marker to keep only the generated answer
        answer = out_text.replace(input_text, "").replace("\nEND", "").strip()
        item['infer_answer'] = answer
        print(out_text)
        # Print the reference answer from the dataset for comparison
        print(f"### {idx+1}.Answer:\n", item.get('output'), '\n\n')
        answers.append({'index': idx, **item})
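
The loop collects its results in `answers` but does not write them anywhere. One possible way to persist them for later inspection is sketched below. The helper names `answers_to_jsonl` and `save_answers` and the output file name are hypothetical, not part of the repository.

```python
import json
from typing import Dict, List


def answers_to_jsonl(answers: List[Dict]) -> str:
    """Serialize each answer dict as one JSON object per line."""
    # ensure_ascii=False keeps non-ASCII model output readable
    return "".join(json.dumps(row, ensure_ascii=False) + "\n" for row in answers)


def save_answers(answers: List[Dict], path: str) -> None:
    """Write the collected answers to a JSONL file."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(answers_to_jsonl(answers))


# Example usage after the inference loop (file name is a placeholder):
# save_answers(answers, "infer_answers.jsonl")
```

Keeping the reference `output` and the generated `infer_answer` in the same record makes it easy to compare them side by side afterwards.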