Llama 3.1 fine-tuning and deployment example (starting from model download)

山地车撒旦

Last modified 2024-09-05 14:28:55

Tags: deep learning, artificial intelligence, language model, llama

First published 2024-08-26 15:10:57

Article link: https://blog.csdn.net/sandichesadan/article/details/141562188

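This article walks through fine-tuning and deploying Llama 3.1, which was released on July 23, 2024.

Fine-tuning a large model involves many steps. The ones below are those I consider essential:

I. Loading the model

Llama 3.1 was developed outside China, and its Chinese ability is weak. The models recently released by the Llama team do have some Chinese capability, but in practice their output is still interspersed with English.

The base model should therefore be one that has already been trained on Chinese corpora. The one chosen here is the Hugging Face model shenzhi-wang/Llama3.1-8B-Chinese-Chat.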

from huggingface_hub import snapshot_download
snapshot_download(repo_id="shenzhi-wang/Llama3.1-8B-Chinese-Chat",cache_dir='./Llama3.1-8B-Chinese-Chat', ignore_patterns=["*.gguf"])  # Download our BF16 model without downloading GGUF models.

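Downloads from Hugging Face may be slow because of network issues, so here is an alternative model from the ModelScope community. It is not a Chinese-tuned model, but it is used in exactly the same way, and you can swap in any other model you download later.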

from modelscope import snapshot_download
model_dir = snapshot_download('LLM-Research/Meta-Llama-3.1-8B-Instruct', cache_dir='./llama3.1_8b_chinese', revision='master')

The file structure of the downloaded pretrained model is shown below.
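To inspect that structure on your own machine, a minimal sketch like the following walks the download directory and prints every file with its size. `list_model_files` is a hypothetical helper name, and the path assumes the `cache_dir` used in the ModelScope example above; adjust it if you downloaded somewhere else.

```python
from pathlib import Path


def list_model_files(root):
    """Return (relative_path, size_in_bytes) for every file under root, sorted by path."""
    root = Path(root)
    files = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            files.append((path.relative_to(root).as_posix(), path.stat().st_size))
    return files


if __name__ == "__main__":
    # Assumed location: the cache_dir passed to snapshot_download above.
    model_root = Path("./llama3.1_8b_chinese")
    if model_root.exists():
        for name, size in list_model_files(model_root):
            print(f"{size / 1e6:10.1f} MB  {name}")
```

You would normally expect to see the config and tokenizer files next to the weight shards, but the exact listing depends on the repository you downloaded.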

The model loading code is as follows:

def get_model():
    model = AutoModelForCausalLM.from_pretrained('./llama3.1_8b_chinese/LLM-Research/Meta-Llama-3___1-8B-Instruct', device_map="auto",torch_dtype=torch.bfloat16)
    model.enable_input_require_grads() # must be called when gradient checkpointing is enabled
    return model

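II. Obtaining and preparing the dataset

The dataset chosen here is the Zhen Huan dialogue dataset.

Download: https://huggingface.co/api/datasets/uITimeCia/Zhenhuan/parquet/default/train/0.parquet

The dataset has to be processed before it can be fed to the model for training. The code below does this by rewriting the three fields of each record (instruction, input, output) according to a template.

The complete dataset code is as follows: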

def get_dataset():
    df=pd.read_parquet('甄嬛传.parquet')
    # df=df[:100]
    ds = Dataset.from_pandas(df)

    tokenizer = AutoTokenizer.from_pretrained('./llama3.1_8b_chinese/LLM-Research/Meta-Llama-3___1-8B-Instruct', use_fast=False, trust_remote_code=True)
    tokenizer.pad_token = tokenizer.eos_token

    EOS_TOKEN = tokenizer.eos_token  # EOS_TOKEN must be appended to every sample
    alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

    ### Instruction:
    {}
    ### Input:
    {}
    ### Response:
    {}"""
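    def formatting_prompts_func(examples):
        instructions = examples["instruction"]
        # The dataset's "instruction" column is also used as the prompt's Input field;
        # the Instruction field is replaced by a fixed role prompt below.
        inputs = examples["instruction"]
        outputs = examples["output"]
        texts = []
        for instruction, input, output in zip(instructions, inputs, outputs):
            instruction = '你是皇帝的女人--甄嬛'  # fixed role prompt: "You are the emperor's woman, Zhen Huan"
            # EOS_TOKEN must be appended, otherwise generation never stops
            text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
            texts.append(text)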

        return {"text": texts, }

    dataset = ds.map(formatting_prompts_func, batched=True, )

    return dataset,tokenizer
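To make the template step concrete, here is a small stand-alone sketch of what one formatted training sample looks like. It does not load the tokenizer or the dataset: the EOS string is a placeholder for `tokenizer.eos_token`, the row is invented for illustration, and `format_example` is a hypothetical helper name.

```python
# Same Alpaca-style layout as in get_dataset(), written without leading indentation.
ALPACA_PROMPT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{}\n### Input:\n{}\n### Response:\n{}"
)

EOS_TOKEN = "<|eot_id|>"  # assumption: placeholder; the real value is tokenizer.eos_token


def format_example(role_prompt, user_line, reply):
    """Fill the template and append EOS so the model learns where to stop."""
    return ALPACA_PROMPT.format(role_prompt, user_line, reply) + EOS_TOKEN


# Hypothetical row, invented for illustration only (not taken from the dataset).
row = {"instruction": "你今天过得如何？", "input": "", "output": "臣妾一切安好。"}
text = format_example("你是皇帝的女人--甄嬛", row["instruction"], row["output"])
print(text)
```

Printing a few formatted samples like this before training is a cheap way to check that the fields ended up in the slots you intended.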

III. Training parameter settings

LoRA fine-tuning requires setting a large number of hyperparameters.

def get_train(model,datas,tokenizer):
    # LoRA parameters for PEFT
    config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
        inference_mode=False,  # training mode
        r=8,  # LoRA rank
        lora_alpha=32,  # LoRA alpha; see how LoRA works for its exact role
        lora_dropout=0.1  # dropout rate
    )

    peft_model = get_peft_model(model, config)
    peft_model.print_trainable_parameters()  # prints the summary itself and returns None

    # Training arguments
    args = TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        # max_steps=60,  # number of fine-tuning steps
        learning_rate=2e-4,  # learning rate
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        num_train_epochs=3,
        save_steps=100,
        logging_steps=3,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
    )

    # Start training
    trainer = SFTTrainer(
        model=peft_model,
        train_dataset=datas,
        dataset_text_field="text",
        max_seq_length=2048,
        tokenizer=tokenizer,
        args=args
    )
    trainer.train()
    # Save the model (the LoRA adapter weights)
    peft_model.save_pretrained("lora")
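As a rough sanity check on these settings, the sketch below estimates how many weights LoRA adds with `r=8` on the seven target modules. It assumes the Llama 3.1 8B layer shapes (hidden size 4096, MLP intermediate size 14336, 32 layers, 8 key/value heads of dimension 128); `lora_param_count` is a hypothetical helper, not part of PEFT.

```python
# Assumed Llama 3.1 8B shapes, as listed in the model's config.json.
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 8 * 128   # 8 key/value heads, head dimension 128
NUM_LAYERS = 32

# (in_features, out_features) of each linear layer named in target_modules.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(r, shapes=TARGET_SHAPES, num_layers=NUM_LAYERS):
    """LoRA adds A (r x in) and B (out x r) per adapted layer: r * (in + out) weights."""
    per_layer = sum(r * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
    return per_layer * num_layers


r, lora_alpha = 8, 32
trainable = lora_param_count(r)
scaling = lora_alpha / r      # factor applied to the low-rank update B @ A
effective_batch = 1 * 4       # per_device_train_batch_size * gradient_accumulation_steps

print(f"trainable LoRA parameters: {trainable:,}")    # 20,971,520
print(f"LoRA scaling (alpha / r):  {scaling}")         # 4.0
print(f"effective batch per device: {effective_batch}")
```

Under these assumptions the adapter is roughly 21 million weights, a fraction of a percent of the 8B base model, which is why only the small `lora` directory needs to be saved after training.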

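For details on these parameters, see the following two articles:

LoraConfig: "PEFT LoraConfig parameters explained" (CSDN blog)

TrainingArguments: "Section 12: Hugging Face TrainingArguments and Trainer parameters explained" (CSDN blog)

Here is another way to prepare the dataset, using a data collator: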

Dataset({
    features: ['instruction', 'input', 'output'],
    num_rows: 20022
})
# Data format
{'instruction': 'Create a function that takes a specific input and produces a specific output using any mathematical operators. Write corresponding code in Python.',
 'input': '',
 'output': 'def f(x):\n    """\n    Takes a specific input and produces a specific output using any mathematical operators\n    """\n    return x**2 + 3*x'}
 
def formatting_prompts_func(example):
    output_texts = []
    for i in range(len(example['instruction'])):
        text = f"### Question: {example['instruction'][i]}\n ### Answer: {example['output'][i]}"
        output_texts.append(text)
    return output_texts


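'''
DataCollatorForCompletionOnlyLM
Finds the index of the last occurrence of the response_template tokens in labels (batch['labels']) and uses it as response_token_ids_start_idx, then sets every label from the start of the sequence through the last token of response_template to -100, so those positions are excluded from the loss.
'''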
response_template = " ### Answer:"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)

trainer = SFTTrainer(
    model,
    train_dataset=dataset,
    args=SFTConfig(output_dir="/tmp"),
    formatting_func=formatting_prompts_func,
    data_collator=collator,
)

trainer.train()
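The masking described in the docstring above can be sketched in plain Python on a list of token ids. This is only an illustration of the idea, not TRL's actual implementation: `mask_prompt_labels` is a hypothetical function, the token ids are made up, and the fallback of masking the whole example when the template is missing is my understanding of the collator's behavior.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def mask_prompt_labels(labels, response_template_ids):
    """Return a copy of labels with everything up to the end of the last
    occurrence of response_template_ids replaced by IGNORE_INDEX."""
    n = len(response_template_ids)
    start = None
    for i in range(len(labels) - n + 1):
        if labels[i:i + n] == response_template_ids:
            start = i  # keep overwriting so the last match wins
    if start is None:
        # Template not found: nothing in this example contributes to the loss.
        return [IGNORE_INDEX] * len(labels)
    end = start + n
    return [IGNORE_INDEX] * end + labels[end:]


# Toy token ids (made up): the pair 7, 8 stands for the " ### Answer:" template.
labels = [1, 5, 6, 7, 8, 20, 21, 2]
print(mask_prompt_labels(labels, [7, 8]))
# [-100, -100, -100, -100, -100, 20, 21, 2]
```

Only the answer tokens (20, 21 and the final 2) keep their labels, so the model is trained on the response and not on reproducing the question.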

Putting it together, the script only has to call the three helper functions above to start training.

import json
from datasets import Dataset
import pandas as pd
from transformers import AutoTokenizer, AutoModelForCausalLM, DataCollatorForSeq2Seq, TrainingArguments, Trainer, GenerationConfig
import torch
from peft import LoraConfig, TaskType, get_peft_model
from trl import SFTTrainer



def main():
    datas,tokenizer=get_dataset()
    model=get_model()
    get_train(model,datas,tokenizer)


if __name__ == '__main__':
    main()

IV. Merging the LoRA model

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer


mode_path = './llama3.1_8b_chinese/LLM-Research/Meta-Llama-3___1-8B-Instruct'
lora_path = './lora'


def apply_lora(model_name_or_path, output_path, lora_path):
    print(f"Loading the base model from {model_name_or_path}")
    base_tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

    base = AutoModelForCausalLM.from_pretrained(
        model_name_or_path, device_map="auto",torch_dtype=torch.bfloat16
    )


    print(f"Loading the LoRA adapter from {lora_path}")

    lora_model = PeftModel.from_pretrained(
        base,
        lora_path,
        torch_dtype=torch.float16,
    )

    print("Applying the LoRA")
    model = lora_model.merge_and_unload()

    print(f"Saving the target model to {output_path}")
    model.save_pretrained(output_path)
    base_tokenizer.save_pretrained(output_path)


apply_lora(model_name_or_path=mode_path,output_path='./models/甄嬛传大模型',lora_path=lora_path)

The merged model that is saved at the end is the same size as the original base model.
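The reason the size does not change is that merging folds the low-rank update back into each existing weight matrix, W' = W + (lora_alpha / r) * B A, so no new tensors remain. The toy sketch below shows this with plain Python lists; `matmul` and `merge_lora` are hypothetical helpers and all numbers are made up.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, lora_alpha, r):
    """Return W + (lora_alpha / r) * (B @ A); the result has the same shape as W."""
    scaling = lora_alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w + scaling * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy 2x3 weight with a rank-1 adapter (all values made up).
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]      # r x in_features
B = [[0.5], [-1.0]]        # out_features x r

merged = merge_lora(W, A, B, lora_alpha=2, r=1)
print(merged)   # [[2.0, 2.0, 5.0], [-2.0, -3.0, -6.0]]
```

The merged matrix is still 2x3, just like W, which mirrors what happens to every adapted layer of the real model.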

V. Quantizing the model with llama.cpp

Download llama.cpp from GitHub and build it:

git clone --recursive https://github.com/ggerganov/llama.cpp
cd llama.cpp && make clean && make all -j
pip install -r requirements/requirements-convert_hf_to_gguf.txt
# If cmake is not installed, install it first
cmake -B build
cmake --build build --config Release

Use the conversion tools that ship with llama.cpp to convert the model files:

1. Convert the model to an f16 GGUF file

./llama.cpp/convert_hf_to_gguf.py ./models/甄嬛传大模型 --outtype f16 --outfile ./models/my_llama3_1.gguf

2. Quantize the GGUF file to 4-bit

./llama.cpp/llama-quantize ./models/my_llama3_1.gguf ./models/my_llama3_1-q4_0.gguf q4_0
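For a rough idea of what the two steps do to file size, the sketch below estimates the f16 and q4_0 sizes from the parameter count. It assumes about 8.03 billion parameters and the q4_0 layout of 32-weight blocks, each holding one 16-bit scale plus 32 four-bit values. Real files will differ somewhat, because metadata is ignored here and the quantizer may keep some tensors at higher precision.

```python
N_PARAMS = 8_030_261_248  # approximate parameter count of Llama 3.1 8B

# q4_0 stores weights in blocks of 32: one 16-bit scale plus 32 four-bit values.
Q4_0_BITS_PER_WEIGHT = (16 + 32 * 4) / 32   # 4.5
F16_BITS_PER_WEIGHT = 16


def estimate_size_gb(n_params, bits_per_weight):
    """Rough file size in GB (10^9 bytes), ignoring metadata and mixed-precision tensors."""
    return n_params * bits_per_weight / 8 / 1e9


print(f"f16:  {estimate_size_gb(N_PARAMS, F16_BITS_PER_WEIGHT):.1f} GB")    # 16.1 GB
print(f"q4_0: {estimate_size_gb(N_PARAMS, Q4_0_BITS_PER_WEIGHT):.1f} GB")   # 4.5 GB
```

So the 4-bit file should come out at a bit more than a quarter of the f16 file, which is what makes it practical to run locally.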

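VI. Importing the GGUF file into Ollama

Ollama is a Docker-like tool for packaging and running models locally, which makes downloading and deploying them very convenient. It is an independent open-source project, not an official Llama release.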

curl -fsSL https://ollama.com/install.sh | sh

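To import the GGUF file into Ollama, you also need to write a config file (a Modelfile) that describes the model's prompt template. Put the following content into a txt file.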

FROM "./models/my_llama3_1-q4_0.gguf"

TEMPLATE """{{- if .System }}
<|im_start|>system {{ .System }}<|im_end|>
{{- end }}
<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""

SYSTEM """"""

PARAMETER stop <|im_start|>
PARAMETER stop <|im_end|>

ollama create zhenhuanzhuan -f llama3_1_8b_chinese_config.txt
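Once the model has been created, it can be tried with `ollama run zhenhuanzhuan` or queried over Ollama's local HTTP API. The sketch below uses only the standard library and assumes the server is listening on the default address `http://localhost:11434`; `build_generate_request` and `generate` are hypothetical helper names.

```python
import json
import urllib.request


def build_generate_request(prompt, model="zhenhuanzhuan", host="http://localhost:11434"):
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the generated text (needs a running Ollama server)."""
    with urllib.request.urlopen(build_generate_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]


# Usage, once the Ollama server is running and the model has been created:
# print(generate("你是谁？"))
```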

This completes fine-tuning and deployment.

The inference script is as follows:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from peft import PeftModel
# from unsloth import FastLanguageModel

mode_path = './llama3.1_8b_chinese/LLM-Research/Meta-Llama-3___1-8B-Instruct'
lora_path = './outputs/checkpoint-2796' # change this to the checkpoint path of your own LoRA output

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(mode_path, trust_remote_code=True)

# Load the model
model = AutoModelForCausalLM.from_pretrained(mode_path, device_map="auto", torch_dtype=torch.bfloat16,
                                             trust_remote_code=True).eval()

# Load the LoRA weights
model = PeftModel.from_pretrained(model, model_id=lora_path)

while 1:
    prompt=input('>')
    messages = [
        {"role": "system", "content": "假设你是皇帝身边的女人--甄嬛。"},
        {"role": "user", "content": prompt}
    ]
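    # add_generation_prompt=True appends the assistant header so the model replies as the assistant
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    model_inputs = tokenizer([text], return_tensors="pt").to('cuda')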
    generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512)
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    print(response)
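The list comprehension near the end of the script exists because `generate` returns the prompt tokens followed by the newly generated ones, so the prompt has to be sliced off before decoding. A toy version with made-up token ids, where `strip_prompt` is a hypothetical helper name:

```python
def strip_prompt(prompt_batch, generated_batch):
    """Drop the leading prompt tokens from each generated sequence."""
    return [out[len(inp):] for inp, out in zip(prompt_batch, generated_batch)]


# Toy ids (made up): each generated row is the prompt plus its continuation.
prompt_ids = [[101, 7, 8, 9]]
generated = [[101, 7, 8, 9, 42, 43, 2]]
print(strip_prompt(prompt_ids, generated))   # [[42, 43, 2]]
```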