Fine-tuning Llama 3: PEFT fine-tuning and full fine-tuning

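1. Llama 3 fine-tuning basics

1.1 Introduction to Llama 3

Official blog
Llama 3 currently comes in two sizes: 8B and 70B. The 8B version has 8.03B parameters and is small enough to run locally on consumer hardware.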

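Llama 3 has the same architecture as Llama 2 but a much larger vocabulary: 128k entries, compared with 32k for Llama 2. According to Meta, the larger vocabulary significantly improves model performance. About 5% of Llama 3's pretraining data is high-quality non-English data. Note: Meta's model card still states that Llama 3 is better suited to English tasks.

On the other hand, a larger vocabulary means the token embeddings need more data to be trained accurately. Meta trained Llama 3 on 15T tokens. By comparison, Llama 2 was trained on only 2T tokens and Google Gemma on 6T tokens, which already seemed like a lot at the time.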

Model performance is shown in the figure below:
[Figure: Llama 3 performance results; image not available]

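1.2 Memory usage of full fine-tuning for Llama 3 8B

Fully fine-tuning an LLM updates all of its parameters, which requires a large amount of memory.

  • The model must be fully loaded into GPU memory.
  • In addition, AdamW, the optimizer commonly used to fine-tune LLMs, creates and stores 2 extra values for every model parameter in GPU memory.
  • We also need to store the tensors created during fine-tuning, i.e. the activations, so that backpropagation can use them to compute the gradients that update the model parameters.

For example, fine-tuning Llama 3 8B with a batch size of 8 and a sequence length of 512 consumes 128.87 GB of GPU memory. Note: this figure is an estimate and does not account for any optimization such as gradient checkpointing or tensor parallelism.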

Model        Loading the model   Optimizer states   Activations   Total
Llama 3 8B   14.96 GB            59.83 GB           54.08 GB      128.87 GB
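
The first two columns of the table can be reproduced with a few lines of arithmetic. The sketch below assumes 16-bit weights (2 bytes per parameter), two float32 AdamW states per parameter, and that "GB" in the table means 2^30 bytes. None of these assumptions is stated in the text, but together they match its figures. The activation estimate depends on architecture details and is not recomputed here.

```python
# Rough memory estimate for loading Llama 3 8B and holding AdamW states.
# Assumptions (not stated in the original text): 16-bit weights,
# float32 optimizer states, and "GB" meaning 2**30 bytes.

N_PARAMS = 8.03e9   # parameter count quoted in section 1.1
GIB = 2**30         # bytes per "GB" as used in the table


def weights_gib(n_params: float, bytes_per_param: float = 2.0) -> float:
    """Memory needed to hold the model weights."""
    return n_params * bytes_per_param / GIB


def adamw_states_gib(n_params: float, bytes_per_state: float = 4.0) -> float:
    """Memory for AdamW's two per-parameter states."""
    return n_params * 2 * bytes_per_state / GIB


if __name__ == "__main__":
    print(f"model weights:    {weights_gib(N_PARAMS):.2f} GB")
    print(f"optimizer states: {adamw_states_gib(N_PARAMS):.2f} GB")
```

Changing `bytes_per_param` or `bytes_per_state` gives a quick way to explore other precisions.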

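How to estimate the memory consumption of large language models (LLMs)

Fortunately, the memory consumption of all three components can be reduced fairly easily:

  • Optimizer states: by default, AdamW's states are float32, at 4 bytes each. AdamW-8bit is a good alternative: it quantizes the states to 8 bits, cutting memory consumption from 59.8 GB to 15 GB. Memory consumption can be much lower still if the framework does not keep a duplicate copy of the model parameters.
  • Model: we can quantize the model to 4 bits. This divides its memory consumption by almost 4, from 15 GB to 4 GB. In practice, not every module of the LLM is quantized, in order to preserve its performance.
  • Activations: activations must be stored to compute the gradients. With gradient checkpointing, however, they are recomputed on the fly during backpropagation instead of being kept for the whole training step. This greatly reduces activation memory, from 54 GB to 10 GB.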
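
As an illustration, the three reductions above could be expressed with Hugging Face `transformers` settings roughly as follows. This is a configuration sketch under assumptions, not the author's setup: the output directory is hypothetical, the NF4 quantization type and bfloat16 compute dtype are choices made here, and the packages `torch`, `transformers`, and `bitsandbytes` are assumed to be installed. Note that in this stack 4-bit weights are normally kept frozen and trained through adapters, as discussed in section 1.3.

```python
import torch
from transformers import BitsAndBytesConfig, TrainingArguments

# Model: load the weights in 4-bit via bitsandbytes (assumed NF4 variant).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

training_args = TrainingArguments(
    output_dir="./llama3-8b-finetune",  # hypothetical path
    per_device_train_batch_size=8,      # batch size used in the estimate
    optim="adamw_bnb_8bit",             # optimizer states: 8-bit AdamW
    gradient_checkpointing=True,        # activations: recompute in backward
)
```

`bnb_config` would be passed as `quantization_config` to `AutoModelForCausalLM.from_pretrained`, and `training_args` to the trainer.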

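With all of these optimizations applied, fine-tuning needs about 29 GB of memory. That is still too much for a single consumer GPU, but it at least makes fine-tuning possible on two 24 GB GPUs.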
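
The 29 GB figure can be sanity-checked as follows. The sketch assumes 1 byte per 8-bit optimizer state, 0.5 byte per 4-bit weight, and "GB" meaning 2^30 bytes; it takes the 10 GB activation figure directly from the text. The `fits` helper is a hypothetical simplification that ignores per-GPU overhead and assumes memory can be split evenly across devices.

```python
# Recompute the reduced memory budget under the stated assumptions.
GIB = 2**30
N_PARAMS = 8.03e9  # parameter count quoted in section 1.1


def optimized_budget_gib(n_params: float = N_PARAMS,
                         activations_gib: float = 10.0) -> dict:
    """Memory per component after the three optimizations."""
    optimizer = n_params * 2 * 1.0 / GIB  # two AdamW states, 1 byte each
    model = n_params * 0.5 / GIB          # 4 bits = 0.5 byte per parameter
    return {
        "optimizer": optimizer,
        "model": model,
        "activations": activations_gib,   # figure quoted in the text
        "total": optimizer + model + activations_gib,
    }


def fits(total_gib: float, n_gpus: int, gib_per_gpu: float = 24.0) -> bool:
    """Naive check: does the total fit in the combined GPU memory?"""
    return total_gib <= n_gpus * gib_per_gpu


if __name__ == "__main__":
    budget = optimized_budget_gib()
    for name, value in budget.items():
        print(f"{name:12s} {value:6.2f} GB")
    print("fits on 1 x 24 GB:", fits(budget["total"], 1))
    print("fits on 2 x 24 GB:", fits(budget["total"], 2))
```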

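1.3 Memory usage of PEFT fine-tuning for Llama 3 8B

With a PEFT method such as LoRA, we fine-tune an adapter on top of the model instead of retraining the whole model, which reduces memory consumption further:

  1. With LoRA, a GPU with 24 GB of memory is enough to fine-tune Llama 3;
  2. With QLoRA, a GPU with 16 GB of memory is enough.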
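
Counting the adapter's parameters shows why it is so much cheaper to train. The sketch below assumes the published Llama 3 8B dimensions (hidden size 4096, 32 layers, 8 key/value heads of dimension 128) and a hypothetical LoRA setup with rank 16 on `q_proj` and `v_proj`. Neither the rank nor the choice of target modules comes from the text.

```python
# Count trainable LoRA parameters for an assumed Llama 3 8B configuration.
HIDDEN = 4096        # hidden size (assumed from the published config)
N_LAYERS = 32        # number of transformer layers (assumed)
KV_DIM = 8 * 128     # 8 key/value heads x head dimension 128 (assumed)
N_PARAMS = 8.03e9    # total parameter count quoted in section 1.1


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in_features) and B (out_features x rank)."""
    return rank * (in_features + out_features)


def total_lora_params(rank: int) -> int:
    """Adapters on q_proj (4096 -> 4096) and v_proj (4096 -> 1024) per layer."""
    per_layer = (lora_params(HIDDEN, HIDDEN, rank)
                 + lora_params(HIDDEN, KV_DIM, rank))
    return per_layer * N_LAYERS


if __name__ == "__main__":
    n = total_lora_params(rank=16)  # hypothetical rank
    print(f"trainable LoRA parameters: {n:,}")
    print(f"share of the full model:   {100 * n / N_PARAMS:.3f}%")
```

Because only these adapter weights receive gradients and optimizer states, the optimizer's share of memory becomes very small, leaving the frozen base weights and the activations as the main costs.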

2. Fine-tuning Llama 3 with PEFT

1) QLoRA applies LoRA on top of a quantized LLM. To fine-tune Llama 3 8B with this method, install the following packages:

pip install --upgrade bitsandbytes transformers peft accelerate datasets trl

2) Then import the required packages:

import torch, os
from datasets import load_dataset
from peft import LoraConfig, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl

### LLaMA Model PEFT Parameter Efficient Fine-Tuning Guide

Parameter-efficient fine-tuning (PEFT) methods allow models like LLaMA to be adapted for specific tasks without requiring the retraining of all parameters, thus saving computational resources and time. One approach that has gained attention is UniPELT, which integrates LoRA, Prefix-Tuning, and Adapters with a gating mechanism to achieve parameter efficiency during fine-tuning[^2].

For implementing PEFT on an LLaMA model specifically:

#### Selecting Appropriate Techniques

Choosing between Adapter-based tuning, Low-Rank Adaptation (LoRA), or Prefix Tuning depends largely upon the application scenario as well as available hardware constraints. Each technique offers unique advantages when it comes to optimizing performance versus resource usage.

#### Applying LoRA Methodology

Low-rank adaptation involves modifying only certain parts of pre-trained weights by adding low-rank matrices while keeping others frozen. This method reduces memory footprint significantly compared to full network training but still manages to capture task-specific information effectively[^1].

```python
from peft import get_peft_model, LoraConfig

config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
)
# `model` is assumed to be an already loaded base model.
model = get_peft_model(model, config)
```

#### Utilizing Adapter Mechanisms

Adapter modules are inserted into existing layers within transformer architectures such as those found in LLaMA models. These lightweight structures learn transformations applied directly onto hidden states rather than altering original weight values entirely. Consequently, this leads to faster convergence times alongside reduced storage requirements since adapters typically contain far fewer trainable elements relative to their base counterparts.

#### Implementing Prefix Tuning Strategy

Prefix tuning focuses on introducing additional tokens at input sequences' beginnings before passing them through encoders/decoders. By doing so, these prefixes act similarly to prompts guiding subsequent generation processes towards desired outcomes based on learned patterns from datasets used during adjustment phases[^3].