用 Unsloth 微调 LLaMA 3 8B

最新推荐文章于 2024-07-24 17:39:55 发布

liugddx

最新推荐文章于 2024-07-24 17:39:55 发布

阅读量2.2k

点赞数 30

分类专栏： AI 微调大模型文章标签： llama

本文链接：https://blog.csdn.net/CSDNDN/article/details/139522752

版权

AI 同时被 3 个专栏收录

5 篇文章 0 订阅

订阅专栏

大模型

5 篇文章 0 订阅

订阅专栏

微调

2 篇文章 0 订阅

订阅专栏

用 Unsloth 微调 LLaMA 3 8B

今年4月份，Meta 公司发布了功能强大的大型语言模型（LLM）Llama-3，为从事各种 NLP 任务的开发人员提供了功能强大可以在普通机器上运行的开源LLM。然而，传统的 LLM 微调过程既耗时又耗费资源。但是，Unsloth 的出现改变了这一局面，大大加快了 Llama-3 的微调速度。

本文将探讨 Unsloth 如何帮助您以极高的速度和效率，根据具体需求对 Llama-3 进行微调。我们将深入探讨 Unsloth 的优势，并提供 Llama-3 微调流程的流程指南。

名词解释

什么是Unsloth
Unsloth 是一个功能强大的库，旨在加速大型语言模型（LLM）的微调，同时减少内存使用。它由 Daniel 和 Michael Han 创建，通过优化反向传播和将 PyTorch 模块重写为 Triton 内核，训练速度提高了 30 倍，内存消耗降低了 60-80%。Unsloth 支持广泛的 NVIDIA GPU，与 Hugging Face 生态系统无缝集成，使其与 LLaMA 和 Mistral 等各种 LLM 架构兼容。值得注意的是，与传统方法相比，它在准确度方面保持了 0% 的下降，为微调 LLM 提供了有效的解决方案。
Llama-3
Llama-3是Meta公司开发的一款开源大语言模型，是Llama系列的最新版本。它提供了两个版本：8B版本适用于消费级GPU上的高效部署和开发，70B版本则专为大规模AI应用设计。每个版本都包括基础和指令调优两种形式。
Ollama
Ollama 是一款能让用户在本地机器上运行开放式 LLM 的工具，无需云服务。它是 llama.cpp 的前端，可以加载 GGUF 模型。该工具设计简单易用，提供了简单的 API、OpenAI 端点兼容性（例如可与任何支持 OpenAI 规范的模型都可以使用）。Ollama 可在 macOS、Linux 和 Windows 上运行，可使用 CPU 和 GPU，并与 LangChain、LiteLLM 等流行框架无缝集成。通过提供本地执行，Ollama 可以确保数据隐私并减少延迟，是希望高效利用高级 NLP 功能的开发人员和研究人员的理想选择。

为什么要使用 Unsloth 对 Llama-3 进行微调？

Unsloth 为微调 Llama-3 这样的大型模型提供了非常棒的解决方案。有如下几个原因：

速度提升：与传统方法相比，Unsloth 可显著提高速度，将训练时间最多缩短 5 倍。这样就可以更快地进行实验，更快地优化微调模型。
减少内存占用：Unsloth 在提高运行速度的同时，最大限度地减少了内存占用。这对于占用大量内存的 Llama-3 尤为有利。Unsloth 可以让你在性能较弱的硬件上进行训练，使 LLM 微调更容易实现。
提升训练精度：Unsloth 可以同时提高准确性与速度。与某些优化技术不同，Unsloth 采用精确计算方法，确保经过微调的 Llama-3 模型与经过传统训练的模型性能相当。

使用 Unsloth 微调 Llama-3 的指南

安装 Unsloth：

%%capture
import torch
major_version, minor_version = torch.cuda.get_device_capability()
# Must install separately since Colab has torch 2.2.1, which breaks packages
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
if major_version >= 8:
    # Use this for new GPUs like Ampere, Hopper GPUs (RTX 30xx, RTX 40xx, A100, H100, L40)
    !pip install --no-deps packaging ninja einops flash-attn xformers trl peft accelerate bitsandbytes
else:
    # Use this for older GPUs (V100, Tesla T4, RTX 20xx)
    !pip install --no-deps xformers trl peft accelerate bitsandbytes
pass

导入必要的库

from unsloth import FastLanguageModel
import torch
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

加载 Llama-3 模型和分词器：

max_seq_length = 2048

#unsloth support llama-3 by default, you can also use any model available on huggingface 
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

准备微调数据集：
数据集使用[1]，这是斯坦福大学发布的原始Alpaca数据集的净化版本。

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
pass

dataset = load_dataset("yahma/alpaca-cleaned", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)

定义 PEFT 配置：
PEFT（Parameter-Efficient Fine-Tuning）是一种用于微调大型语言模型（LLM）的技术，旨在通过调整少量参数来实现高效的模型适应，而不是对整个模型进行全面微调。这样的方法在保持模型性能的同时，显著减少了计算资源和时间的消耗。
PEFT的优势：
1. 参数高效性：PEFT 只微调模型的一小部分参数，而不是所有参数。这显著减少了需要更新和存储的参数数量，从而降低了计算和内存需求。
2. 速度和资源优化：由于只微调少量参数，训练速度更快，所需的计算资源和内存也大大减少，使得在有限资源下也能进行高效的模型微调。
3. 迁移学习：PEFT 利用预训练模型的知识，通过微调少量参数使模型适应特定任务，从而实现高效的迁移学习。
4. 应用广泛：PEFT 可以应用于多种大型语言模型和任务，包括文本分类、生成、翻译等。

model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

配置 Unsloth Trainer：

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

查询内存使用情况：

#Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

训练经过微调的 Llama-3 模型：

 trainer_stats = trainer.train()

使用微调模型进行推理：

 # alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Continue the fibonnaci sequence.", # instruction
        "1, 1, 2, 3, 5, 8", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

使用 Hugging Face Transformers 微调模型进行推理：

    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")

保存模型：
要将最终模型保存为 LoRA 适配器，可使用 Huggingface 的 push_to_hub 进行在线保存，或使用 save_pretrained 进行本地保存。

#[NOTE] This ONLY saves the LoRA adapters, and not the full model.
model.save_pretrained("lora_model") # Local saving
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving

将模型保存为 16 位或 GGUF:
GGUF（即通用图形通用格式）是一种文件格式，旨在提高部署 LLM（如 LLaMA-3 8B）的效率和灵活性。 GGUF 由 llama.cpp 团队推出，对其前身 GGML、单文件部署、可扩展性和内存映射进行了改进，以实现更快的模型加载。这种格式特别适合量化模型，可以在不影响性能的情况下减少计算资源需求。对于希望简化跨各种平台（包括 CPU 和 Apple 设备）的 LLM 部署和推理的开发人员来说，GGUF 是理想的选择。[2]

# Save to 8bit Q8_0
# Fast conversion. High resource use, but generally acceptable.
if False: model.save_pretrained_gguf("model", tokenizer,)
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
# 
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
#q4_k_m - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
# model-unsloth-Q4_K_M.gguf
if True: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

使用ollama运行训练后的模型：
在 Ollama 中加载 GGUF 时，需要自定义一个带有系统信息和模板的 Modelfile，记住要将 GGUF 文件替换为您正在使用的量化级别，这里我们使用的是 model-unsloth-Q4_K_M.gguf，您可以在导入 Ollama 时将其命名为任何您想要的名称：

 ollama create llama3:8b-instruct-q4_K_M -f Modelfile  ollama run llama3:8b-instruct-q4_K_M

结论

Unsloth 为像 -Llama-3 这样功能强大的 LLM 的高效微调打开了大门。凭借其速度和内存优化能力，Unsloth 可帮助开发人员探索各种 NLP 应用，加速该领域的创新！

参考

[1]https://huggingface.co/datasets/yahma/alpaca-cleaned

[2]https://huggingface.co/docs/hub/en/gguf

[unsloth]跳转中...

liugddx

关注

30
点赞
踩
54

收藏

觉得还不错? 一键收藏
打赏
0
评论
用 Unsloth 微调 LLaMA 3 8B

今年4月份，Meta 公司发布了功能强大的大型语言模型（LLM）Llama-3，为从事各种 NLP 任务的开发人员提供了功能强大可以在普通机器上运行的开源LLM。然而，传统的 LLM 微调过程既耗时又耗费资源。但是，Unsloth 的出现改变了这一局面，大大加快了 Llama-3 的微调速度。本文将探讨 Unsloth 如何帮助您以极高的速度和效率，根据具体需求对 Llama-3 进行微调。我们将深入探讨 Unsloth 的优势，并提供 Llama-3 微调流程的流程指南。
复制链接

扫一扫