What Is Large-Model Fine-Tuning, and What Is It For?

程序员维他命

Article link: https://blog.csdn.net/h1453586413/article/details/141816188

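I. Definition of Large Models

A large model is a machine learning model with a very large number of parameters and a complex computational structure. Such models are usually built from deep neural networks and contain billions or even hundreds of billions of parameters. They are designed to increase expressive capacity and predictive performance so that they can handle more complex tasks and data. Large models are widely used in natural language processing, computer vision, speech recognition, recommender systems, and other fields. By training on massive amounts of data they learn complex patterns and features, which tends to give them stronger generalization, that is, better predictions on data they have not seen before.

II. What Is Large-Model Fine-Tuning

Large-model fine-tuning means starting from a large pretrained model and, for a specific task or dataset, making relatively small adjustments to the model's parameters.

III. What Fine-Tuning Does

Fine-tuning uses new, task-specific data to adjust the parameters of a pretrained model so that it better fits the characteristics and requirements of that task. For example, in sentiment analysis, a pretrained language model can be fine-tuned on sentiment-labeled data to improve the accuracy of its sentiment judgments.

IV. Advantages of Fine-Tuning

Compared with training a brand-new model from scratch, fine-tuning greatly reduces training time and compute cost. Because it reuses the general knowledge the pretrained model has already learned, it also often achieves good performance.

V. Instruction Fine-Tuning

Instruction fine-tuning of a large model is _a method that fine-tunes the parameters of a pretrained large language model on instruction data written in natural language, with the aim of improving the model's ability to adapt to specific tasks._ Instruction-formatted examples are collected or constructed, and the model's parameters are then fine-tuned on them in a supervised manner, so that the model follows instructions well and can solve a variety of downstream tasks.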
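
To make the idea of "instruction-formatted examples" concrete, here is a minimal sketch of one such record and how it might be flattened into a prompt/response pair for supervised training. The field names `instruction`, `input`, and `output` follow the convention used in the Qwen2 example later in this article; the `build_prompt` helper, the template wording, and the sample record are illustrative assumptions, not a fixed standard.

```python
def build_prompt(record):
    """Join the instruction and the optional input into a single prompt string."""
    prompt = f"Instruction: {record['instruction']}\n"
    if record.get("input"):
        prompt += f"Input: {record['input']}\n"
    prompt += "Response: "
    return prompt


# A hypothetical instruction-style training record
record = {
    "instruction": "Extract all person names from the sentence.",
    "input": "Alice met Bob in Paris.",
    "output": '["Alice", "Bob"]',
}

prompt = build_prompt(record)
target = record["output"]  # the text the model is trained to generate after the prompt
print(prompt + target)
```

During supervised fine-tuning the model sees the prompt and is trained to produce the target; real chat models usually apply their own chat template instead of the plain template assumed here.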

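VI. The Fine-Tuning Workflow

Fine-tuning is a technique that takes a pretrained model and trains it further on a task-specific dataset to improve its performance on that task. The detailed steps are as follows:

1. Choosing a Pretrained Model

Model selection: choose a suitable pretrained model for the task. For text classification, for example, BERT or RoBERTa can be used.
Pretraining task: understand what objective the model was pretrained on, such as language modeling or masked language modeling.

2. Data Preparation

Data collection: gather a dataset relevant to the fine-tuning task.
Data cleaning: remove erroneous, duplicated, and irrelevant records.
Data labeling: for supervised tasks, make sure the dataset is labeled correctly.
Data splitting: the dataset is usually divided into training, validation, and test sets; a 70% / 15% / 15% split is common.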
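
The 70% / 15% / 15% split can be sketched with the standard library alone. The function name `split_dataset`, its parameters, and the fixed seed are assumptions made for illustration; libraries such as scikit-learn provide equivalent utilities.

```python
import random


def split_dataset(samples, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle a copy of `samples` and cut it into train/validation/test lists."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    n_train = round(len(items) * train_frac)
    n_val = round(len(items) * val_frac)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # whatever remains becomes the test set
    return train, val, test


train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```

Shuffling before cutting matters when the file is ordered (for example by source or label); otherwise the three subsets may have different distributions.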

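3. Data Preprocessing

Text preprocessing: for NLP tasks, tokenize the text. Steps such as stop-word removal and stemming belong to classical pipelines; with a pretrained model, the text is normally passed through that model's own tokenizer instead.
Data formatting: convert the data into a form the model can consume, for example batches served by a PyTorch DataLoader.
Encoding: convert text or image data into numeric representations the model can process.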
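
As a toy illustration of the encoding step, the sketch below maps characters to integer ids and pads every sequence to a fixed length with an attention mask. It is deliberately simplified: real pretrained models ship their own subword tokenizers, and the helpers `build_vocab` and `encode` are hypothetical names used only here.

```python
def build_vocab(texts):
    """Assign an integer id to every character; 0 is padding, 1 is unknown."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for text in texts:
        for ch in text:
            vocab.setdefault(ch, len(vocab))
    return vocab


def encode(text, vocab, max_length):
    """Return (input_ids, attention_mask), truncated or padded to max_length."""
    ids = [vocab.get(ch, vocab["<unk>"]) for ch in text][:max_length]
    mask = [1] * len(ids)  # 1 marks real tokens
    pad = max_length - len(ids)
    return ids + [vocab["<pad>"]] * pad, mask + [0] * pad  # 0 marks padding


vocab = build_vocab(["abc"])
input_ids, attention_mask = encode("abz", vocab, max_length=5)
print(input_ids, attention_mask)
```

The attention mask tells the model which positions hold real tokens and which are padding, the same role it plays in the Qwen2 example later in this article.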

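4. Model Preparation

Architecture adjustment: adapt the model structure to the task, for example by changing the number of output-layer neurons to match the number of classes.
Loading pretrained weights: load the weights of the pretrained model as the starting point for fine-tuning.

5. Fine-Tuning Strategy

Learning rate: a small learning rate is normally used, so that the knowledge already stored in the pretrained weights is adjusted gently rather than overwritten.
Layer freezing: the weights of some layers can be frozen so that only the remaining layers are trained, which reduces compute and helps prevent overfitting.
Regularization: apply techniques such as dropout and weight decay to prevent overfitting.

6. Model Training

Loss function: choose a loss suited to the task, such as cross-entropy for classification.
Optimizer: choose an optimizer, such as AdamW, and set its hyperparameters appropriately.
Training loop:
- Forward pass: feed the input through the model to obtain predictions.
- Loss computation: compute the loss from the predictions and the true labels.
- Backward pass: compute the gradient of the loss with respect to the model parameters.
- Parameter update: update the model parameters using the gradients.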
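
The four steps of the training loop can be seen end to end in a tiny, dependency-free sketch. To keep the gradients writable by hand it fits a one-variable linear model with mean squared error and plain gradient descent; these are stand-ins for the cross-entropy loss and AdamW optimizer mentioned above, and the function name `train_linear` and its hyperparameters are illustrative assumptions.

```python
def train_linear(xs, ys, lr=0.05, epochs=500):
    """Fit y = w * x + b by gradient descent, recording the loss at each epoch."""
    w, b = 0.0, 0.0
    n = len(xs)
    losses = []
    for _ in range(epochs):
        # Forward pass: compute predictions
        preds = [w * x + b for x in xs]
        # Loss computation: mean squared error
        loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / n
        # Backward pass: gradients of the loss with respect to w and b
        grad_w = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        grad_b = sum(2 * (p - y) for p, y in zip(preds, ys)) / n
        # Parameter update: step against the gradient
        w -= lr * grad_w
        b -= lr * grad_b
        losses.append(loss)
    return w, b, losses


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1
w, b, losses = train_linear(xs, ys)
print(round(w, 3), round(b, 3), losses[0], losses[-1])
```

In a deep learning framework the backward pass is computed automatically, but the structure of the loop is the same.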

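7. Hyperparameter Tuning

Validation monitoring: evaluate the model on the validation set at regular intervals during training.
Early stopping: if validation performance has not improved for a set number of evaluations, stop training to prevent overfitting.
Learning rate scheduling: adjust the learning rate during training, for example with warmup at the start and decay afterwards, or by lowering it when validation performance plateaus.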
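
Early stopping and a warmup-then-decay schedule are both small pieces of bookkeeping. The sketch below is one possible shape for them; the class `EarlyStopping`, the function `lr_at`, and the linear warmup/decay form are assumptions chosen for illustration, not the behavior of any particular library.

```python
class EarlyStopping:
    """Signal a stop after `patience` evaluations without improvement."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_evals = 0

    def step(self, val_loss):
        """Record one validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


def lr_at(step, base_lr, warmup_steps, total_steps):
    """Linear warmup up to base_lr, then linear decay down to zero."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / max(total_steps - warmup_steps, 1)


stopper = EarlyStopping(patience=2)
val_losses = [0.9, 0.7, 0.71, 0.72, 0.6]  # made-up validation losses
stopped_at = None
for i, loss in enumerate(val_losses):
    if stopper.step(loss):
        stopped_at = i
        break
print(stopped_at, stopper.best)
```

With `patience=2`, training stops at the second consecutive evaluation that fails to beat the best loss seen so far, so the later improvement to 0.6 is never reached; a larger patience trades extra compute for tolerance to noisy validation curves.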

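8. Model Evaluation

Performance metrics: evaluate the model with metrics such as accuracy, precision, recall, and F1 score.
Error analysis: examine the cases the model gets wrong to find areas for improvement.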
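
For a binary task, the four metrics follow directly from the counts of true positives, false positives, and false negatives. The sketch below computes them by hand; the function name `binary_metrics` and the sample labels are illustrative, and in practice a library such as scikit-learn would typically be used.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


y_true = [1, 0, 1, 1, 0, 1]  # made-up ground-truth labels
y_pred = [1, 0, 0, 1, 1, 1]  # made-up predictions
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Here there are 3 true positives, 1 false positive, and 1 false negative, so precision and recall are both 3/4 while accuracy is 4/6.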

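9. Saving and Deploying the Model

Saving: save the trained model together with its parameters.
Deployment: deploy the model to a server, a mobile device, or a browser.

10. Continual Learning

Data updates: fine-tune the model periodically as new data becomes available.
Model iteration: keep iterating on and optimizing the model based on application feedback and performance monitoring.

The key to fine-tuning is to reuse the general feature representations of the pretrained model and refine them through further training on a specific task, so that the model better fits the needs of that domain or task.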

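VII. Example: The Qwen Large Model

1. Introduction

Qwen2 is a large language model recently open-sourced by the Tongyi Qianwen (Qwen) team. This example uses Qwen2 as the base model and applies instruction fine-tuning to perform named entity recognition (NER). It is intended as an entry-level exercise for learning LLM fine-tuning and building an intuition for how it works.

Named entity recognition (NER) is an important task in natural language processing. Its main goal is to identify entities with specific meaning in text, such as person names, place names, organization names, times, dates, and monetary amounts.

2. Environment Setup

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple swanlab modelscope transformers datasets peft pandas accelerate

3. Dataset

The dataset is chinese_ner_sft from HuggingFace, which is used for training named entity recognition models.

Download the file ccfbdci.jsonl into the same directory as the Python script.

4. Loading the Model

Download the Qwen2-1.5B-Instruct model with modelscope: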

import torch
from modelscope import snapshot_download, AutoTokenizer
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForSeq2Seq

model_id = "qwen/Qwen2-1.5B-Instruct"
model_dir = "./qwen/Qwen2-1___5B-Instruct"

# Download the Qwen model from ModelScope into the local directory
model_dir = snapshot_download(model_id, cache_dir="./", revision="master")

# Load the tokenizer and model weights with Transformers
tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=False, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", torch_dtype=torch.bfloat16)
model.enable_input_require_grads()  # required when gradient checkpointing is enabled

5. Visualization Tool

SwanLab is used to monitor the whole training process and to evaluate the final model.

from swanlab.integration.huggingface import SwanLabCallback

swanlab_callback = SwanLabCallback(...)

trainer = Trainer(
    ...
    callbacks=[swanlab_callback],
)

6. train.py

The complete code is as follows:

import json
import pandas as pd
import torch
from datasets import Dataset
from modelscope import snapshot_download, AutoTokenizer
from swanlab.integration.huggingface import SwanLabCallback
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForSeq2Seq
import os
import swanlab


def dataset_jsonl_transfer(origin_path, new_path):
    """
    Convert the original dataset into the instruction format used for fine-tuning.
    """
    messages = []
    match_names = ["地点", "人名", "地理实体", "组织"]

    # Read the original JSONL file
    with open(origin_path, "r", encoding="utf-8") as file:
        for line in file:
            # Parse the JSON record on this line
            data = json.loads(line)
            input_text = data["text"]
            entities = data["entities"]

            entity_sentence = ""
            for entity in entities:
                entity_json = dict(entity)
                entity_text = entity_json["entity_text"]
                entity_names = entity_json["entity_names"]
                # Use the first name that is one of the four target labels;
                # skip entities that carry none of them
                entity_label = next((name for name in entity_names if name in match_names), None)
                if entity_label is None:
                    continue

                # json.dumps escapes quotes inside the entity text correctly
                entity_sentence += json.dumps(
                    {"entity_text": entity_text, "entity_label": entity_label}, ensure_ascii=False
                )

            if entity_sentence == "":
                entity_sentence = "没有找到任何实体"

            message = {
                "instruction": """你是一个文本实体识别领域的专家，你需要从给定的句子中提取 地点; 人名; 地理实体; 组织 实体. 以 json 格式输出, 如 {"entity_text": "南京", "entity_label": "地理实体"} 注意: 1. 输出的每一行都必须是正确的 json 字符串. 2. 找不到任何实体时, 输出"没有找到任何实体". """,
                "input": f"文本:{input_text}",
                "output": entity_sentence,
            }

            messages.append(message)

    # Save the converted JSONL file
    with open(new_path, "w", encoding="utf-8") as file:
        for message in messages:
            file.write(json.dumps(message, ensure_ascii=False) + "\n")
            
            
def process_func(example):
    """
    Tokenize one example and build input_ids, attention_mask, and labels for training.
    """

    MAX_LENGTH = 384 
    input_ids, attention_mask, labels = [], [], []
    system_prompt = """你是一个文本实体识别领域的专家，你需要从给定的句子中提取 地点; 人名; 地理实体; 组织 实体. 以 json 格式输出, 如 {"entity_text": "南京", "entity_label": "地理实体"} 注意: 1. 输出的每一行都必须是正确的 json 字符串. 2. 找不到任何实体时, 输出"没有找到任何实体"."""
    
    instruction = tokenizer(
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n<|im_start|>user\n{example['input']}<|im_end|>\n<|im_start|>assistant\n",
        add_special_tokens=False,
    )
    response = tokenizer(f"{example['output']}", add_special_tokens=False)
    input_ids = instruction["input_ids"] + response["input_ids"] + [tokenizer.pad_token_id]
    attention_mask = (
        instruction["attention_mask"] + response["attention_mask"] + [1]
    )
    labels = [-100] * len(instruction["input_ids"]) + response["input_ids"] + [tokenizer.pad_token_id]
    if len(input_ids) > MAX_LENGTH:  # truncate to the maximum length
        input_ids = input_ids[:MAX_LENGTH]
        attention_mask = attention_mask[:MAX_LENGTH]
        labels = labels[:MAX_LENGTH]
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}   


def predict(messages, model, tokenizer):
    device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU when CUDA is unavailable
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(device)

    generated_ids = model.generate(
        model_inputs.input_ids,
        max_new_tokens=512
    )
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    
    print(response)
     
    return response

model_id = "qwen/Qwen2-1.5B-Instruct"    
model_dir = "./qwen/Qwen2-1___5B-Instruct"

# Download the Qwen model from ModelScope into the local directory
model_dir = snapshot_download(model_id, cache_dir="./", revision="master")

# Load the tokenizer and model weights with Transformers
tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=False, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", torch_dtype=torch.bfloat16)
model.enable_input_require_grads()  # required when gradient checkpointing is enabled

# Load and convert the dataset
train_dataset_path = "ccfbdci.jsonl"
train_jsonl_new_path = "ccf_train.jsonl"

if not os.path.exists(train_jsonl_new_path):
    dataset_jsonl_transfer(train_dataset_path, train_jsonl_new_path)

# Build the training set: the first 10% of records is held out for testing, the rest is used for training
total_df = pd.read_json(train_jsonl_new_path, lines=True)
train_df = total_df[int(len(total_df) * 0.1):]
train_ds = Dataset.from_pandas(train_df)
train_dataset = train_ds.map(process_func, remove_columns=train_ds.column_names)

config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    inference_mode=False,  # training mode
    r=8,  # LoRA rank
    lora_alpha=32,  # LoRA alpha (scaling factor); see the LoRA method for details
    lora_dropout=0.1,  # dropout rate
)

model = get_peft_model(model, config)

args = TrainingArguments(
    output_dir="./output/Qwen2-NER",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    logging_steps=10,
    num_train_epochs=2,
    save_steps=100,
    learning_rate=1e-4,
    save_on_each_node=True,
    gradient_checkpointing=True,
    report_to="none",
)

swanlab_callback = SwanLabCallback(
    project="Qwen2-NER-fintune",
    experiment_name="Qwen2-1.5B-Instruct",
    description="Fine-tune the Tongyi Qianwen Qwen2-1.5B-Instruct model on an NER dataset for key entity recognition.",
    config={
        "model": model_id,
        "model_dir": model_dir,
        "dataset": "qgyd2021/chinese_ner_sft",
    },
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),
    callbacks=[swanlab_callback],
)

trainer.train()

# Test the model on 20 random samples drawn from the held-out test split
test_df = total_df[:int(len(total_df) * 0.1)].sample(n=20)

test_text_list = []
for index, row in test_df.iterrows():
    instruction = row['instruction']
    input_value = row['input']
    
    messages = [
        {"role": "system", "content": f"{instruction}"},
        {"role": "user", "content": f"{input_value}"}
    ]

    response = predict(messages, model, tokenizer)
    messages.append({"role": "assistant", "content": f"{response}"})
    result_text = f"{messages[0]}\n\n{messages[1]}\n\n{messages[2]}"
    test_text_list.append(swanlab.Text(result_text, caption=response))
    
swanlab.log({"Prediction": test_text_list})
swanlab.finish()
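
The `LoraConfig` in train.py sets `r=8` and `lora_alpha=32`. The arithmetic behind those two numbers can be sketched as follows: for a weight matrix of shape d_out x d_in, LoRA trains two low-rank factors with r * (d_in + d_out) parameters in total, and the update is scaled by lora_alpha / r. The 1024 x 1024 matrix size below is an illustrative assumption, not Qwen2's actual layer dimension, and `lora_extra_params` is a hypothetical helper.

```python
def lora_extra_params(d_in, d_out, r):
    """Trainable parameters LoRA adds for one weight matrix: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


d_in, d_out = 1024, 1024  # illustrative layer size (assumption)
r, lora_alpha = 8, 32     # values taken from the LoraConfig in train.py

full = d_in * d_out                        # parameters in the frozen weight matrix
extra = lora_extra_params(d_in, d_out, r)  # parameters actually trained
ratio = extra / full
scaling = lora_alpha / r                   # factor applied to the low-rank update

print(f"full: {full}, LoRA: {extra}, ratio: {ratio:.4f}, scaling: {scaling}")
```

Under these assumed dimensions, only about 1.6% as many parameters are trained per adapted matrix, which is why LoRA fine-tuning fits on much smaller hardware than full fine-tuning.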

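7. Viewing the Training Results

View the final training results on SwanLab.

After two epochs, the loss of the fine-tuned Qwen2 has dropped substantially and levelled off. On a number of test samples, the fine-tuned model also produces accurate entity extraction results. This completes the instruction fine-tuning of Qwen2 on the NER task.

8. Inference Test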

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

def predict(messages, model, tokenizer):
    device = "cuda" if torch.cuda.is_available() else "cpu"

    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    model_inputs = tokenizer([text], return_tensors="pt").to(device)

    generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512)
    generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)]
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

    return response

# Load the tokenizer and base model from the original download path
tokenizer = AutoTokenizer.from_pretrained("./qwen/Qwen2-1___5B-Instruct/", use_fast=False, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("./qwen/Qwen2-1___5B-Instruct/", device_map="auto", torch_dtype=torch.bfloat16)

# Load the trained LoRA adapter; replace checkpoint-1700 below with the actual checkpoint directory name from your run
model = PeftModel.from_pretrained(model, model_id="./output/Qwen2-NER/checkpoint-1700")

input_text = "国会外有大约２００名警察驻守，防止抗议人群闯入国会。"
test_texts = {
    "instruction": """你是一个文本实体识别领域的专家，你需要从给定的句子中提取 地点; 人名; 地理实体; 组织 实体. 以 json 格式输出, 如 {"entity_text": "南京", "entity_label": "地理实体"} 注意: 1. 输出的每一行都必须是正确的 json 字符串. 2. 找不到任何实体时, 输出"没有找到任何实体". """,
    "input": f"文本:{input_text}"
}

instruction = test_texts['instruction']
input_value = test_texts['input']

messages = [
    {"role": "system", "content": f"{instruction}"},
    {"role": "user", "content": f"{input_value}"}
]

response = predict(messages, model, tokenizer)
print(response)

The model will then extract the entities from the sentence "国会外有大约２００名警察驻守，防止抗议人群闯入国会。" according to the instruction.
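
Because the training targets are JSON objects written one after another, the response string usually needs to be parsed before it can be used downstream. The sketch below shows one way to do that with the standard library; `parse_entities` is a hypothetical helper, and the sample string is a made-up response in the training format, not actual model output.

```python
import json


def parse_entities(response):
    """Extract every JSON object from concatenated or newline-separated output."""
    decoder = json.JSONDecoder()
    entities, pos = [], 0
    while True:
        start = response.find("{", pos)
        if start == -1:
            break
        try:
            # raw_decode parses one JSON value and reports where it ended
            obj, end = decoder.raw_decode(response, start)
        except json.JSONDecodeError:
            pos = start + 1  # skip a malformed fragment and keep scanning
            continue
        entities.append(obj)
        pos = end
    return entities


# Hypothetical response in the same format as the training targets
sample = '{"entity_text": "国会", "entity_label": "组织"}{"entity_text": "南京", "entity_label": "地理实体"}'
parsed = parse_entities(sample)
print(parsed)
print(parse_entities("没有找到任何实体"))
```

Scanning with `raw_decode` tolerates both back-to-back objects and one object per line, and it returns an empty list when the model reports that no entities were found.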
