Datawhale AI Summer Camp, Season 4 (ModelScope) Task04: Hands-On Fine-Tuning of the Yuan Large Model - Knowledge Notes - CSDN Blog

Article link: https://blog.csdn.net/JayOxford/article/details/141404577

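Introduction to Fine-Tuning Large Models

Model fine-tuning is also known as instruction tuning or supervised fine-tuning (SFT). It trains the model on paired task inputs and expected outputs so that it learns to answer in a question-and-answer format, which unlocks its ability to solve tasks. After instruction tuning, a large language model shows strong instruction-following ability and can handle many downstream tasks in a zero-shot manner.

Note, however, that instruction tuning does not teach the model new knowledge from scratch. It acts more like a catalyst that activates capabilities the model already has, rather than simply injecting information.

Compared with the massive data needed for pretraining, instruction tuning needs far less: anywhere from a few hundred thousand to over a million examples can effectively elicit general task-solving ability, and some studies suggest that even a small amount of high-quality instruction data (a few thousand to tens of thousands of examples) can give satisfactory results. This lowers the dependence on compute and makes fine-tuning more flexible and efficient.

However, because large models have so many parameters, full-parameter fine-tuning consumes a great deal of compute. To address this, researchers proposed parameter-efficient fine-tuning (PEFT), also called lightweight fine-tuning. These methods train only a very small number of parameters while keeping the fine-tuned model's performance comparable to full fine-tuning.

Common lightweight fine-tuning techniques include LoRA, Adapter and Prompt Tuning.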

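LoRA is based on low-rank matrix decomposition: it adds a bypass matrix alongside the original weight matrix and updates only the parameters of that bypass. In practice, low-rank matrices are added to key layers of the pretrained model. Because they live in a low-dimensional parameter space, the model can be fine-tuned without changing its overall structure. In essence, a large matrix (here, the bypass that holds the update to a weight matrix) is written as the product of two small matrices with $k$ much smaller than $R$ and $C$, which reduces the number of trainable parameters and the computational cost:
$M_{R \times C} = A_{R \times k} \cdot B_{k \times C}$
Even with only a small amount of labeled data in a specific domain, LoRA can effectively customize a model so that it adapts quickly to a new domain or task. It helps preserve the model's ability to generalize to unseen data while still learning task-specific knowledge. By training only a small number of weights instead of the whole model, LoRA reduces the compute and storage required.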
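
To make the saving from the factorization $M_{R \times C} = A_{R \times k} \cdot B_{k \times C}$ concrete, here is a minimal sketch that only counts parameters. The layer shape (2048 x 2048) and the rank (8) are assumptions chosen for illustration, not values read from the Yuan model.

```python
def full_update_params(rows: int, cols: int) -> int:
    """Parameters needed to train an R x C update matrix directly."""
    return rows * cols


def low_rank_update_params(rows: int, cols: int, rank: int) -> int:
    """Parameters in the two factors A (R x k) and B (k x C)."""
    return rows * rank + rank * cols


# Hypothetical layer shape and rank, chosen only for illustration.
R, C, k = 2048, 2048, 8

full = full_update_params(R, C)
low_rank = low_rank_update_params(R, C, k)

print(f"full update:      {full} parameters")
print(f"low-rank update:  {low_rank} parameters")
print(f"fraction trained: {low_rank / full:.2%}")
```

Under these assumed sizes the two factors hold well under 1% of the parameters of the full matrix, which is why only a small fraction of the model ends up trainable.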

Fine-Tuning Yuan2.0-2B in Practice

Environment Setup

After opening a terminal, set up the environment:

git lfs install
git clone https://www.modelscope.cn/datasets/Datawhale/AICamp_yuan_baseline.git
cp AICamp_yuan_baseline/Task\ 4：源大模型微调实战/* .

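To build the demo later, streamlit also needs to be installed in the environment:

# List the installed dependencies
pip list
# Install streamlit
pip install streamlit==1.24.0

Model Download

Run the cell below to download the model: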

from modelscope import snapshot_download
model_dir = snapshot_download('IEITYuan/Yuan2-2B-Mars-hf', cache_dir='.')

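This uses the snapshot_download function from modelscope. The first argument is the model name, IEITYuan/Yuan2-2B-Mars-hf. The second argument, cache_dir, is the directory where the model is saved; here . means the current directory.

The model is about 4.1 GB. Because it is downloaded directly from ModelScope, the download is very fast.

When the download finishes, a new folder named IEITYuan appears in the current directory. Its Yuan2-2B-Mars-hf subfolder holds the downloaded Yuan model.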

Data Processing

# Import dependencies
import torch
import pandas as pd
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, DataCollatorForSeq2Seq, TrainingArguments, Trainer

Running this cell as-is raises an error:

RuntimeError: Failed to import transformers.trainer because of the following error (look up to see its traceback):
Failed to import transformers.integrations.integration_utils because of the following error (look up to see its traceback):
Failed to import transformers.modeling_tf_utils because of the following error (look up to see its traceback):
Your currently installed version of Keras is Keras 3, but this is not yet supported in Transformers. Please install the backwards-compatible tf-keras package with `pip install tf-keras`.

As the message suggests, install the missing dependency:

!pip install tf-keras

Then run the cells under section 2.3, Data Processing.

We read the data with pandas and convert it to the Dataset format:

# Read the data
df = pd.read_json('./data.json')
ds = Dataset.from_pandas(df)

# Inspect the data
len(ds)
ds[:1]

# Load the tokenizer
path = './IEITYuan/Yuan2-2B-Mars-hf'

tokenizer = AutoTokenizer.from_pretrained(path, add_eos_token=False, add_bos_token=False, eos_token='<eod>')
tokenizer.add_tokens(['<sep>', '<pad>', '<mask>', '<predict>', '<FIM_SUFFIX>', '<FIM_PREFIX>', '<FIM_MIDDLE>','<commit_before>','<commit_msg>','<commit_after>','<jupyter_start>','<jupyter_text>','<jupyter_code>','<jupyter_output>','<empty_output>'], special_tokens=True)
tokenizer.pad_token = tokenizer.eos_token

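Before training, the data has to be preprocessed. We define a data processing function, process_func:

# Define the data processing function
def process_func(example):
    MAX_LENGTH = 384    # The Llama tokenizer may split one Chinese character into several tokens, so allow a generous maximum length to keep samples intact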

    instruction = tokenizer(f"{example['input']}<sep>")
    response = tokenizer(f"{example['output']}<eod>")
    input_ids = instruction["input_ids"] + response["input_ids"]
    attention_mask = [1] * len(input_ids)
    labels = [-100] * len(instruction["input_ids"]) + response["input_ids"]  # the instruction part does not contribute to the loss

    if len(input_ids) > MAX_LENGTH:  # truncate overly long samples
        input_ids = input_ids[:MAX_LENGTH]
        attention_mask = attention_mask[:MAX_LENGTH]
        labels = labels[:MAX_LENGTH]

    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels
    }

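Specifically, the tokenizer converts the text into ids, and the input and output are concatenated to form input_ids and attention_mask. The labels are set to -100 over the input part so that only the output contributes to the loss.
As the code shows, the Yuan model expects the special token <sep> after the input and the special token <eod> after the output.
To guard against overly long samples, the sequences are also truncated.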
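
The label masking and truncation described above can be sketched without a real tokenizer. In the example below, toy_tokenize is a hypothetical stand-in that emits one id per character, and the characters | and $ play the roles of <sep> and <eod>; the real tokenizer produces different ids and treats the special tokens as single tokens.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def toy_tokenize(text):
    """Hypothetical stand-in for a real tokenizer: one id per character."""
    return [ord(ch) for ch in text]


def build_example(prompt, answer, max_length):
    """Concatenate prompt and answer ids; mask the prompt part in the labels."""
    prompt_ids = toy_tokenize(prompt)
    answer_ids = toy_tokenize(answer)

    input_ids = prompt_ids + answer_ids
    attention_mask = [1] * len(input_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + answer_ids

    # Truncate all three lists to the same length.
    return {
        "input_ids": input_ids[:max_length],
        "attention_mask": attention_mask[:max_length],
        "labels": labels[:max_length],
    }


# "|" stands in for <sep> and "$" for <eod> in this toy setup.
example = build_example("ab|", "xy$", max_length=5)
print(example)
```

With max_length=5 the final id (the stand-in for <eod>) is cut off. This illustrates one consequence of truncation: a sample longer than MAX_LENGTH loses its end marker, so it is worth checking how many samples exceed the limit.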

# Process the dataset
tokenized_id = ds.map(process_func, remove_columns=ds.column_names)
tokenized_id

# Sanity-check the data
tokenizer.decode(tokenized_id[0]['input_ids'])

tokenizer.decode(list(filter(lambda x: x != -100, tokenized_id[0]["labels"])))

Model Training

First, load the Yuan model parameters and print the model structure:

# Load the model
model = AutoModelForCausalLM.from_pretrained(path, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
model

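As the printout shows, the Yuan model contains 24 YuanDecoderLayer layers, each with self_attn, mlp and layernorm.

Before training, model.enable_input_require_grads() must also be called:

model.enable_input_require_grads()  # required when gradient_checkpointing is enabled

In this section we use LoRA for lightweight fine-tuning. First, configure LoraConfig:

# Check the model's data type
model.dtype

# Configure LoRA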
from peft import LoraConfig, TaskType, get_peft_model

config = LoraConfig(
    task_type=TaskType.CAUSAL_LM, 
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    inference_mode=False,  # training mode
    r=8,  # LoRA rank
    lora_alpha=32,  # LoRA alpha; see how LoRA works for its exact role
    lora_dropout=0.1,  # dropout probability
)
config
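
To show what r and lora_alpha do, here is a minimal pure-Python sketch of a LoRA forward pass, assuming the usual formulation in which the low-rank path is scaled by lora_alpha / r and added to the frozen layer's output. The matrices are tiny, hand-picked values with rank 2; lora_alpha=8 with r=2 is chosen so that the scale (4.0) matches the r=8, lora_alpha=32 configuration. Dropout is omitted.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(weight, lora_a, lora_b, x, r, lora_alpha):
    """Frozen base output plus the scaled low-rank update."""
    scaling = lora_alpha / r
    base = matvec(weight, x)                      # frozen pretrained path
    low_rank = matvec(lora_b, matvec(lora_a, x))  # trainable bypass
    return [b + scaling * d for b, d in zip(base, low_rank)]


# Toy values for illustration only.
weight = [[1.0, 0.0], [0.0, 1.0]]        # frozen 2 x 2 weight
lora_a = [[1.0, 0.0], [0.0, 1.0]]        # r x in_features, with r = 2
lora_b_zero = [[0.0, 0.0], [0.0, 0.0]]   # out_features x r, zero at the start
lora_b = [[0.5, 0.0], [0.0, 0.25]]       # the same factor after some training
x = [1.0, 2.0]

before = lora_forward(weight, lora_a, lora_b_zero, x, r=2, lora_alpha=8)
after = lora_forward(weight, lora_a, lora_b, x, r=2, lora_alpha=8)
print(before, after)
```

When one factor starts at zero, the adapted layer initially behaves exactly like the pretrained one, and training moves it away gradually. A larger lora_alpha relative to r amplifies the contribution of the bypass.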

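Then build a PeftModel:

# Build the PeftModel
model = get_peft_model(model, config)
model

model.print_trainable_parameters() shows what share of all parameters will be trained:

# Print the trainable parameters
model.print_trainable_parameters()

Then set the training arguments with TrainingArguments:

# Set the training arguments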
args = TrainingArguments(
    output_dir="./output/Yuan2.0-2B_lora_bf16",
    per_device_train_batch_size=12,
    gradient_accumulation_steps=1,
    logging_steps=1,
    save_strategy="epoch",
    num_train_epochs=3,
    learning_rate=5e-5,
    save_on_each_node=True,
    gradient_checkpointing=True,
    bf16=True
)

Next, initialize a Trainer:

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_id,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),
)
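
DataCollatorForSeq2Seq with padding=True pads every sample in a batch to the length of the longest one. The sketch below imitates that behaviour in plain Python under two assumptions: padding is applied on the right, and labels are padded with -100 so that padded positions are ignored by the loss. The function name pad_batch and the pad id 0 are hypothetical.

```python
IGNORE_INDEX = -100  # padded label positions are skipped by the loss


def pad_batch(features, pad_token_id):
    """Right-pad input_ids, attention_mask and labels to the longest sample."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        gap = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * gap)
        batch["attention_mask"].append(f["attention_mask"] + [0] * gap)
        batch["labels"].append(f["labels"] + [IGNORE_INDEX] * gap)
    return batch


# Two toy samples of different lengths (ids are made up).
features = [
    {"input_ids": [5, 6, 7], "attention_mask": [1, 1, 1], "labels": [-100, 6, 7]},
    {"input_ids": [8], "attention_mask": [1], "labels": [8]},
]
batch = pad_batch(features, pad_token_id=0)
print(batch)
```

Padding per batch rather than to a fixed global length keeps batches as short as possible, which saves memory when most samples are far below MAX_LENGTH.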

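Finally, run trainer.train() to start training:

# Train the model
trainer.train()

During training the loss is printed at each logging step. Watching how it decreases tells us whether the model is converging.

When training finishes, a summary of the run is printed.

Three folders also appear under the output folder in the file panel on the left, one checkpoint for each completed epoch.

Taking epoch 3 as an example, the folder contains the training config, state, ckpt and related files.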
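
The checkpoint folders are named after the global training step. The sketch below works out the step counts from the training arguments, assuming a single GPU, gradient_accumulation_steps=1 and that the last partial batch is kept. The dataset size of 200 is hypothetical (the post does not state it); it is picked because it yields checkpoint-51 after three epochs, the checkpoint that the demo script later in this post loads.

```python
import math


def steps_per_epoch(num_examples, per_device_batch_size, num_devices=1):
    """Optimizer steps per epoch when the last partial batch is kept."""
    return math.ceil(num_examples / (per_device_batch_size * num_devices))


num_examples = 200  # hypothetical dataset size, not taken from the post
per_epoch = steps_per_epoch(num_examples, per_device_batch_size=12)

# With save_strategy="epoch", one checkpoint is written at the end of each epoch.
checkpoints = [f"checkpoint-{per_epoch * epoch}" for epoch in range(1, 4)]
print(per_epoch, checkpoints)
```

Any dataset size from 193 to 204 gives the same 17 steps per epoch at batch size 12, so the final folder name alone does not pin down the exact number of samples.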

Evaluating the Results

After training, we define a generation function, generate(), to check the results:

# Define the generation function
def generate(prompt):
    prompt = prompt + "<sep>"
    inputs = tokenizer(prompt, return_tensors="pt")["input_ids"].cuda()
    outputs = model.generate(inputs, do_sample=False, max_length=256)
    output = tokenizer.decode(outputs[0])
    print(output.split("<sep>")[-1])

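We also define the input prompt template, which must match the one used in training. It is therefore kept in Chinese: it asks the model to act as an AI resume assistant, identify the named entities in a resume (name, nationality, ethnicity, job title, education, major, organization, location) and return them as JSON.

# Input prompt template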
template = '''
# 任务描述
假设你是一个AI简历助手，能从简历中识别出所有的命名实体，并以json格式返回结果。

# 任务要求
实体的类别包括：姓名、国籍、种族、职位、教育背景、专业、组织名、地名。
返回的json格式是一个字典，其中每个键是实体的类别，值是一个列表，包含实体的文本。

# 样例
输入：
张三，男，中国籍，工程师
输出：
{"姓名": ["张三"], "国籍": ["中国"], "职位": ["工程师"]}

# 当前简历
input_str

# 任务重述
请参考样例，按照任务要求，识别出当前简历中所有的命名实体，并以json格式返回结果。
'''

Finally, test with a sample input:

input_str = '高欢，男，鲜卑化汉人，CEO'
prompt = template.replace('input_str', input_str).strip()
generate(prompt)

As the output shows, after fine-tuning the model has acquired the target capability.
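
generate() prints everything after the last <sep>, which still includes the trailing <eod> marker. If the result is to be used programmatically, it helps to strip the marker and parse the JSON defensively. The helper below is a minimal sketch; extract_entities is a hypothetical name, and the decoded string is a hand-written example, not actual model output.

```python
import json


def extract_entities(decoded):
    """Return the JSON dict after the last <sep>, or None if it is not valid."""
    answer = decoded.split("<sep>")[-1].replace("<eod>", "").strip()
    try:
        parsed = json.loads(answer)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


# Hand-written example strings for illustration.
good = 'some prompt text<sep> {"姓名": ["高欢"], "职位": ["CEO"]}<eod>'
bad = 'some prompt text<sep> not a json answer<eod>'

print(extract_entities(good))
print(extract_entities(bad))
```

Returning None instead of raising lets the caller decide how to handle a malformed answer, for example by showing the raw text.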

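Running the AI Resume Assistant

With training complete, we use the trained model to build a demo.

First, click Restart Kernel to free GPU memory.
(The demo also needs GPU memory; if it is not freed first, the demo fails with an out-of-memory error.)

Then enter the following command in the terminal to start the streamlit service: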

streamlit run Task\ 4\ 案例：AI简历助手.py --server.address 127.0.0.1 --server.port 6006

Source code:

# Import the required libraries
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import streamlit as st
from peft import PeftModel
import json
import pandas as pd

# Create the page title
st.title("💬 Yuan2.0 AI简历助手")

# Download the Yuan model
from modelscope import snapshot_download
model_dir = snapshot_download('IEITYuan/Yuan2-2B-Mars-hf', cache_dir='./')

# Define the model paths
path = './IEITYuan/Yuan2-2B-Mars-hf'
lora_path = './output/Yuan2.0-2B_lora_bf16/checkpoint-51'

# Define the model data type
torch_dtype = torch.bfloat16 # A10
# torch_dtype = torch.float16 # P100

# Define a function that returns the model and tokenizer
@st.cache_resource
def get_model():
    print("Create tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(path, add_eos_token=False, add_bos_token=False, eos_token='<eod>')
    tokenizer.add_tokens(['<sep>', '<pad>', '<mask>', '<predict>', '<FIM_SUFFIX>', '<FIM_PREFIX>', '<FIM_MIDDLE>','<commit_before>','<commit_msg>','<commit_after>','<jupyter_start>','<jupyter_text>','<jupyter_code>','<jupyter_output>','<empty_output>'], special_tokens=True)

    print("Create model...")
    model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch_dtype, trust_remote_code=True).cuda()
    model = PeftModel.from_pretrained(model, model_id=lora_path)

    return tokenizer, model

# Load the model and tokenizer
tokenizer, model = get_model()

template = '''
# 任务描述
假设你是一个AI简历助手，能从简历中识别出所有的命名实体，并以json格式返回结果。

# 任务要求
实体的类别包括：姓名、国籍、种族、职位、教育背景、专业、组织名、地名。
返回的json格式是一个字典，其中每个键是实体的类别，值是一个列表，包含实体的文本。

# 样例
输入：
张三，男，中国籍，工程师
输出：
{"姓名": ["张三"], "国籍": ["中国"], "职位": ["工程师"]}

# 当前简历
query

# 任务重述
请参考样例，按照任务要求，识别出当前简历中所有的命名实体，并以json格式返回结果。
'''

# Show the assistant's opening message ("Please enter the resume text:") in the chat UI
st.chat_message("assistant").write("请输入简历文本：")

# When the user enters something in the chat input box, do the following
if query := st.chat_input():

    # Show the user's input in the chat UI
    st.chat_message("user").write(query)

    # Call the model
    prompt = template.replace('query', query).strip()
    prompt += "<sep>"
    inputs = tokenizer(prompt, return_tensors="pt")["input_ids"].cuda()
    outputs = model.generate(inputs, do_sample=False, max_length=1024)  # set the decoding strategy and the maximum generation length
    output = tokenizer.decode(outputs[0])
    response = output.split("<sep>")[-1].replace("<eod>", '').strip()

    # Show a status message ("Extracting resume information, please wait...") in the chat UI
    st.chat_message("assistant").write("正在提取简历信息，请稍候...")

    st.chat_message("assistant").table(pd.DataFrame(json.loads(response)))

After running it, the demo extracts the information from the resume and displays it in a structured table.

This completes our AI resume assistant.