[Fine-tuning in Detail] LoRA: Low-Rank Adaptation (with Official Code and Tutorial)


Paper: LoRA: Low-Rank Adaptation of Large Language Models

Code: LoRA: Code for loralib


1. Why do we need LoRA?

Large models such as GPT and Llama have enormous numbers of parameters, so adapting them to a specific downstream task with full fine-tuning, i.e. retraining all weights, is extremely expensive and clearly out of reach for most researchers. To address this, Microsoft proposed LoRA (Low-Rank Adaptation), a low-cost method for fine-tuning large models. LoRA follows the adapter idea: it specializes the model to a downstream task by learning a small external module, and its learnable rank-decomposition matrices keep both the fine-tuning and the storage overhead low.


2. Advantages of LoRA

  1. The pretrained model is shared: a separate, lightweight LoRA module can be built for each downstream task.
  2. LoRA lowers the hardware cost of fine-tuning, since only the low-rank matrices are optimized.
  3. At deployment, the trainable matrices are merged into the frozen pretrained weights, so LoRA avoids the "inference latency" of adapters that extend the model depth or reduce the usable sequence length (see the merge sketch after this list).
  4. LoRA can be combined with other methods.
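
To make point 3 concrete, here is a minimal plain-PyTorch sketch (not the official loralib code, which appears in Section 6) showing that folding the low-rank update into the frozen weight leaves the layer output unchanged, so nothing extra is computed at inference time:

```python
import torch

d, k, r, alpha = 64, 64, 4, 4
W0 = torch.randn(d, k)                       # frozen pretrained weight
A = torch.randn(r, k)                        # pretend A, B have already been trained
B = torch.randn(d, r)
scaling = alpha / r
x = torch.randn(8, k)

# Serving without merging: two matrix multiplications per forward pass
y_unmerged = x @ W0.T + (x @ A.T @ B.T) * scaling

# Deployment: fold BA into W0 once, then a single matrix multiplication per forward pass
W_merged = W0 + (B @ A) * scaling
y_merged = x @ W_merged.T

print(torch.allclose(y_unmerged, y_merged, atol=1e-4))  # True: merging adds no latency and changes nothing
```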

3. How LoRA Works

Aghajanyan et al. showed that pretrained language models have a low "intrinsic dimension": even when their features are randomly projected into a much smaller subspace, they can still learn effectively. (Author's aside: the same idea appears in Stable Diffusion, where a perceptual-compression model such as a VAE moves features into a low-dimensional latent space before iterative denoising; the singular-value-decomposition idea also shows up in many models, for example the final multi-stage linear layers of Faster R-CNN.) Inspired by this, LoRA assumes that the weight update learned during adaptation likewise has a low "intrinsic rank". LoRA therefore represents the update $\Delta W$ as the product $BA$, where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ are the trainable parameter matrices:
$$h = W_0 x + \Delta W x = W_0 x + BAx$$

LoRA initializes $A$ with a random Gaussian and $B$ with zeros, so $\Delta W = BA$ is zero at the beginning of training. The update $\Delta W x$ is then scaled by $\frac{\alpha}{r}$, where $\alpha$ is a constant in $r$. When optimizing with Adam, tuning $\alpha$ is roughly equivalent to tuning the learning rate if the initialization is scaled appropriately, so the authors simply set $\alpha$ to the first $r$ they try and do not tune it further. This scaling reduces the need to re-tune hyperparameters when $r$ is changed.
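
As a concrete illustration of the reparameterization, initialization, and $\frac{\alpha}{r}$ scaling described above, here is a minimal LoRA linear layer written from scratch; it is a simplified sketch (class and variable names are illustrative), not the official implementation reproduced in Section 6:

```python
import torch
import torch.nn as nn

class MinimalLoRALinear(nn.Module):
    """h = W0 x + (alpha / r) * B A x, with W0 frozen and only A, B trained."""
    def __init__(self, in_features: int, out_features: int, r: int = 4, alpha: int = 4):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.kaiming_uniform_(self.weight)        # stands in for the pretrained W0
        self.weight.requires_grad = False            # freeze the pretrained weight
        # A ~ random Gaussian, B = 0, so Delta W = BA is zero at the start of training
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r                     # alpha is a constant in r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        frozen = x @ self.weight.T
        update = (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
        return frozen + update

layer = MinimalLoRALinear(128, 128, r=8, alpha=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # only A and B are trainable: 8*128 + 128*8 = 2048 parameters
```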


4. Understanding LoRA in Depth

The authors pose three questions and answer them with experiments, to build a deeper understanding of LoRA:

  1. Which weight matrices in a pretrained Transformer should LoRA be applied to?
  2. What is the optimal $r$ in practice?
  3. What is the connection between $\Delta W$ and $W$? Is $\Delta W$ highly correlated with $W$? How large is $\Delta W$ compared to $W$?

Experimental setup: in the paper, LoRA is applied to the Transformer self-attention projection matrices $W_q$, $W_k$, $W_v$, $W_o$, while the feed-forward (MLP) part is kept completely frozen (a sketch of this setup follows below).
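
A hedged sketch of this setup with loralib (the module names and dimensions are illustrative; real models store their attention projections differently): wrap only the attention projections with LoRA layers and leave everything else, including the MLP, frozen.

```python
import torch.nn as nn
import loralib as lora

d_model, r = 768, 8   # illustrative sizes

# Hypothetical Transformer block: the four attention projections get LoRA adapters...
attn = nn.ModuleDict({
    "q_proj": lora.Linear(d_model, d_model, r=r),
    "k_proj": lora.Linear(d_model, d_model, r=r),
    "v_proj": lora.Linear(d_model, d_model, r=r),
    "o_proj": lora.Linear(d_model, d_model, r=r),
})
# ...while the MLP stays a plain nn.Linear stack and is not adapted
mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
block = nn.ModuleDict({"attn": attn, "mlp": mlp})

# Freeze everything that is not a LoRA matrix: the pretrained projections and the whole MLP
lora.mark_only_lora_as_trainable(block)
print([n for n, p in block.named_parameters() if p.requires_grad])  # only lora_A / lora_B tensors
```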

Question 1: Which weight matrices in a pretrained Transformer should LoRA be applied to?


The paper runs this experiment on GPT-3 175B with a fixed parameter budget over all self-attention layers: adapting a single type of attention weight corresponds to $r = 8$, adapting two types to $r = 4$, and adapting all four types to $r = 2$. The results show that putting all of the budget into $\Delta W_q$ or $\Delta W_k$ alone degrades performance significantly, while adapting both $W_q$ and $W_v$ works best. In other words, even with $r = 4$, a $\Delta W$ spread over several attention weight matrices captures more useful information than adapting a single type of attention weight with a larger rank.

Question 2: What is the optimal $r$ in practice?


The paper then investigates whether LoRA really operates in a low "intrinsic rank" subspace and what the optimal $r$ is. The experiments show that LoRA remains strong even with a very small $r$ (and, as before, adapting several weight types outperforms adapting a single one). This suggests that the update matrix $\Delta W$ may have a very small "intrinsic rank". To further support this hypothesis, the paper measures how much the subspaces learned with different $r$ and different random seeds overlap.

Subspace similarity between different values of $r$

Given the adaptation matrices $A_{r=8}$ and $A_{r=64}$ learned with ranks $r = 8$ and $r = 64$ on the same pretrained model, the paper performs singular value decomposition and obtains the right-singular unitary matrices $U_{A_{r=8}}$ and $U_{A_{r=64}}$. The question is: how much of the subspace spanned by the top $i$ singular vectors of $U_{A_{r=8}}$ (for $1 \le i \le 8$) is contained in the subspace spanned by the top $j$ singular vectors of $U_{A_{r=64}}$ (for $1 \le j \le 64$)? The overlap is quantified with a normalized subspace similarity based on the Grassmann distance; the definition and a small computation sketch follow below.
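
Concretely, the paper's normalized subspace similarity is $\phi(A_{r=8}, A_{r=64}, i, j) = \frac{\lVert U_{A_{r=8}}^{i\top} U_{A_{r=64}}^{j} \rVert_F^2}{\min(i, j)} \in [0, 1]$, where $U^i$ collects the top $i$ right-singular vectors. Below is a minimal PyTorch sketch of this measure; the matrices are random stand-ins for trained LoRA $A$ matrices:

```python
import torch

def subspace_similarity(A1: torch.Tensor, A2: torch.Tensor, i: int, j: int) -> float:
    """Normalized subspace similarity between the top-i and top-j right-singular directions."""
    _, _, Vh1 = torch.linalg.svd(A1, full_matrices=False)   # rows of Vh are right-singular vectors
    _, _, Vh2 = torch.linalg.svd(A2, full_matrices=False)
    U1, U2 = Vh1[:i], Vh2[:j]                                # shapes (i, k) and (j, k)
    overlap = U1 @ U2.T                                      # (i, j) matrix of inner products
    return (overlap.norm(p='fro') ** 2 / min(i, j)).item()

# Random stand-ins for A_{r=8} and A_{r=64} learned on the same layer (k = 1024 here);
# for random matrices the similarity is near 0, while trained A matrices share their top directions
A_r8, A_r64 = torch.randn(8, 1024), torch.randn(64, 1024)
print(subspace_similarity(A_r8, A_r64, i=8, j=8))
```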


The results show that the directions of the top singular vectors of $A_{r=8}$ and $A_{r=64}$ overlap significantly, while the remaining directions do not. This indicates that the top singular-vector directions are the most useful ones, whereas the other directions likely contain mostly random noise accumulated during training. The adaptation matrix can therefore indeed have a very low rank.

Subspace similarity between different random seeds

The paper confirms this further by plotting the normalized subspace similarity between two $r = 64$ runs with different random seeds. The results (left and middle panels of the paper's figure) show that $\Delta W_q$ appears to have a higher "intrinsic rank" than $\Delta W_v$, since the two runs share more common singular-value directions for $\Delta W_q$, which is consistent with the empirical observation in Table 6 of the paper. As a comparison, the paper also plots two random Gaussian matrices (right panel), which share no common singular-value directions at all.


Question 3: What is the connection between $\Delta W$ and $W$? Is $\Delta W$ highly correlated with $W$? How large is $\Delta W$ compared to $W$?

To further reveal the mechanism behind adapting a pretrained model, the paper studies the relationship between $\Delta W$ and $W$. It projects $W$ onto the $r$-dimensional subspace of $\Delta W$ by computing $U^\top W V^\top$, where $U$ and $V$ are the left- and right-singular-vector matrices of $\Delta W$, and then compares the Frobenius norms $\lVert U^\top W V^\top \rVert_F$ and $\lVert W \rVert_F$ (the Frobenius norm is the square root of the sum of squared entries of a matrix). As a baseline, the paper repeats the computation with $U$ and $V$ replaced by the top $r$ singular vectors of $W$ itself or by a random matrix. A small computation sketch follows, and the paper's findings are summarized after it.
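
A minimal PyTorch sketch of this projection and of the amplification factor discussed below, following the paper's convention that the rows of $V$ are the right-singular vectors (the matrices here are random stand-ins for a trained $\Delta W = BA$ and a pretrained $W$):

```python
import torch

def projection_norms(delta_W: torch.Tensor, W: torch.Tensor, r: int):
    """Project W onto the top-r singular directions of delta_W and compare Frobenius norms."""
    U, S, Vh = torch.linalg.svd(delta_W, full_matrices=False)
    U_r, Vh_r = U[:, :r], Vh[:r, :]                      # top-r left / right singular vectors of delta_W
    projected = U_r.T @ W @ Vh_r.T                       # U^T W V^T, an r x r matrix
    proj_norm = projected.norm(p='fro')                  # ||U^T W V^T||_F
    amplification = delta_W.norm(p='fro') / proj_norm    # ||delta_W||_F / ||U^T W V^T||_F
    return proj_norm.item(), W.norm(p='fro').item(), amplification.item()

# Random stand-ins; with the trained GPT-3 matrices the paper reports, e.g., 6.91 / 0.32 ≈ 21.5 for r = 4
W = torch.randn(1024, 1024)
delta_W = torch.randn(1024, 4) @ torch.randn(4, 1024)
print(projection_norms(delta_W, W, r=4))
```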


The experiments show that:

  1. Compared with a random matrix, $\Delta W$ is much more strongly correlated with $W$, which shows that $\Delta W$ amplifies features that are already present in $W$.
  2. Rather than repeating the top singular directions of $W$, $\Delta W$ only amplifies directions that are not emphasized in $W$.
  3. The amplification factor is very large: for $r = 4$ it is $21.5 \approx 6.91 \div 0.32$.

5. Summary

LoRA adapts large models to downstream tasks with low hardware requirements while introducing no inference latency, not shortening the input sequence length, and maintaining high model quality. Through extensive experiments, the authors also probe LoRA's inner mechanism, giving readers a fuller basis for understanding and applying it.


6. Code (from the official loralib)

import torch
import torch.nn as nn
import torch.nn.functional as F

import math
from typing import Optional, List, Dict

class LoRALayer():
    def __init__(
        self, 
        r: int, 
        lora_alpha: int, 
        lora_dropout: float,
        merge_weights: bool,
    ):
        self.r = r
        self.lora_alpha = lora_alpha
        # Optional dropout
        if lora_dropout > 0.:
            self.lora_dropout = nn.Dropout(p=lora_dropout)
        else:
            self.lora_dropout = lambda x: x
        # Mark the weight as unmerged
        self.merged = False
        self.merge_weights = merge_weights


class Embedding(nn.Embedding, LoRALayer):
    # LoRA implemented in a dense layer
    def __init__(
        self,
        num_embeddings: int,
        embedding_dim: int,
        r: int = 0,
        lora_alpha: int = 1,
        merge_weights: bool = True,
        **kwargs
    ):
        nn.Embedding.__init__(self, num_embeddings, embedding_dim, **kwargs)
        LoRALayer.__init__(self, r=r, lora_alpha=lora_alpha, lora_dropout=0,
                           merge_weights=merge_weights)
        # Actual trainable parameters
        if r > 0:
            self.lora_A = nn.Parameter(self.weight.new_zeros((r, num_embeddings)))
            self.lora_B = nn.Parameter(self.weight.new_zeros((embedding_dim, r)))
            self.scaling = self.lora_alpha / self.r
            # Freezing the pre-trained weight matrix
            self.weight.requires_grad = False
        self.reset_parameters()

    def reset_parameters(self):
        nn.Embedding.reset_parameters(self)
        if hasattr(self, 'lora_A'):
            # initialize A to zero and B with a normal distribution
            # (note: the roles of A and B here are the reverse of the Linear layer below)
            nn.init.zeros_(self.lora_A)
            nn.init.normal_(self.lora_B)

    def train(self, mode: bool = True):
        nn.Embedding.train(self, mode)
        if mode:
            if self.merge_weights and self.merged:
                # Make sure that the weights are not merged
                if self.r > 0:
                    self.weight.data -= (self.lora_B @ self.lora_A).transpose(0, 1) * self.scaling
                self.merged = False
        else:
            if self.merge_weights and not self.merged:
                # Merge the weights and mark it
                if self.r > 0:
                    self.weight.data += (self.lora_B @ self.lora_A).transpose(0, 1) * self.scaling
                self.merged = True
        
    def forward(self, x: torch.Tensor):
        if self.r > 0 and not self.merged:
            result = nn.Embedding.forward(self, x)
            after_A = F.embedding(
                x, self.lora_A.transpose(0, 1), self.padding_idx, self.max_norm,
                self.norm_type, self.scale_grad_by_freq, self.sparse
            )
            result += (after_A @ self.lora_B.transpose(0, 1)) * self.scaling
            return result
        else:
            return nn.Embedding.forward(self, x)
            

class Linear(nn.Linear, LoRALayer):
    # LoRA implemented in a dense layer
    def __init__(
        self, 
        in_features: int, 
        out_features: int, 
        r: int = 0, 
        lora_alpha: int = 1, 
        lora_dropout: float = 0.,
        fan_in_fan_out: bool = False, # Set this to True if the layer to replace stores weight like (fan_in, fan_out)
        merge_weights: bool = True,
        **kwargs
    ):
        nn.Linear.__init__(self, in_features, out_features, **kwargs)
        LoRALayer.__init__(self, r=r, lora_alpha=lora_alpha, lora_dropout=lora_dropout,
                           merge_weights=merge_weights)

        self.fan_in_fan_out = fan_in_fan_out
        # Actual trainable parameters
        if r > 0:
            self.lora_A = nn.Parameter(self.weight.new_zeros((r, in_features)))
            self.lora_B = nn.Parameter(self.weight.new_zeros((out_features, r)))
            self.scaling = self.lora_alpha / self.r
            # Freezing the pre-trained weight matrix
            self.weight.requires_grad = False
        self.reset_parameters()
        if fan_in_fan_out:
            self.weight.data = self.weight.data.transpose(0, 1)

    def reset_parameters(self):
        nn.Linear.reset_parameters(self)
        if hasattr(self, 'lora_A'):
            # initialize A the same way as the default for nn.Linear and B to zero
            # this is different than what is described in the paper but should not affect performance
            nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
            nn.init.zeros_(self.lora_B)

    def train(self, mode: bool = True):
        def T(w):
            return w.transpose(0, 1) if self.fan_in_fan_out else w
        nn.Linear.train(self, mode)
        if mode:
            if self.merge_weights and self.merged:
                # Make sure that the weights are not merged
                if self.r > 0:
                    self.weight.data -= T(self.lora_B @ self.lora_A) * self.scaling
                self.merged = False
        else:
            if self.merge_weights and not self.merged:
                # Merge the weights and mark it
                if self.r > 0:
                    self.weight.data += T(self.lora_B @ self.lora_A) * self.scaling
                self.merged = True       

    def forward(self, x: torch.Tensor):
        def T(w):
            return w.transpose(0, 1) if self.fan_in_fan_out else w
        if self.r > 0 and not self.merged:
            result = F.linear(x, T(self.weight), bias=self.bias)            
            result += (self.lora_dropout(x) @ self.lora_A.transpose(0, 1) @ self.lora_B.transpose(0, 1)) * self.scaling
            return result
        else:
            return F.linear(x, T(self.weight), bias=self.bias)


class MergedLinear(nn.Linear, LoRALayer):
    # LoRA implemented in a dense layer
    def __init__(
        self, 
        in_features: int, 
        out_features: int, 
        r: int = 0, 
        lora_alpha: int = 1, 
        lora_dropout: float = 0.,
        enable_lora: List[bool] = [False],
        fan_in_fan_out: bool = False,
        merge_weights: bool = True,
        **kwargs
    ):
        nn.Linear.__init__(self, in_features, out_features, **kwargs)
        LoRALayer.__init__(self, r=r, lora_alpha=lora_alpha, lora_dropout=lora_dropout,
                           merge_weights=merge_weights)
        assert out_features % len(enable_lora) == 0, \
            'The length of enable_lora must divide out_features'
        self.enable_lora = enable_lora
        self.fan_in_fan_out = fan_in_fan_out
        # Actual trainable parameters
        if r > 0 and any(enable_lora):
            self.lora_A = nn.Parameter(
                self.weight.new_zeros((r * sum(enable_lora), in_features)))
            self.lora_B = nn.Parameter(
                self.weight.new_zeros((out_features // len(enable_lora) * sum(enable_lora), r))
            ) # weights for Conv1D with groups=sum(enable_lora)
            self.scaling = self.lora_alpha / self.r
            # Freezing the pre-trained weight matrix
            self.weight.requires_grad = False
            # Compute the indices
            self.lora_ind = self.weight.new_zeros(
                (out_features, ), dtype=torch.bool
            ).view(len(enable_lora), -1)
            self.lora_ind[enable_lora, :] = True
            self.lora_ind = self.lora_ind.view(-1)
        self.reset_parameters()
        if fan_in_fan_out:
            self.weight.data = self.weight.data.transpose(0, 1)

    def reset_parameters(self):
        nn.Linear.reset_parameters(self)
        if hasattr(self, 'lora_A'):
            # initialize A the same way as the default for nn.Linear and B to zero
            nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
            nn.init.zeros_(self.lora_B)

    def zero_pad(self, x):
        result = x.new_zeros((len(self.lora_ind), *x.shape[1:]))
        result[self.lora_ind] = x
        return result

    def merge_AB(self):
        def T(w):
            return w.transpose(0, 1) if self.fan_in_fan_out else w
        delta_w = F.conv1d(
            self.lora_A.unsqueeze(0), 
            self.lora_B.unsqueeze(-1), 
            groups=sum(self.enable_lora)
        ).squeeze(0)
        return T(self.zero_pad(delta_w))

    def train(self, mode: bool = True):
        def T(w):
            return w.transpose(0, 1) if self.fan_in_fan_out else w
        nn.Linear.train(self, mode)
        if mode:
            if self.merge_weights and self.merged:
                # Make sure that the weights are not merged
                if self.r > 0 and any(self.enable_lora):
                    self.weight.data -= self.merge_AB() * self.scaling
                self.merged = False
        else:
            if self.merge_weights and not self.merged:
                # Merge the weights and mark it
                if self.r > 0 and any(self.enable_lora):
                    self.weight.data += self.merge_AB() * self.scaling
                self.merged = True        

    def forward(self, x: torch.Tensor):
        def T(w):
            return w.transpose(0, 1) if self.fan_in_fan_out else w
        if self.merged:
            return F.linear(x, T(self.weight), bias=self.bias)
        else:
            result = F.linear(x, T(self.weight), bias=self.bias)
            if self.r > 0:
                result += self.lora_dropout(x) @ T(self.merge_AB().T) * self.scaling
            return result

class ConvLoRA(nn.Module, LoRALayer):
    def __init__(self, conv_module, in_channels, out_channels, kernel_size, r=0, lora_alpha=1, lora_dropout=0., merge_weights=True, **kwargs):
        super(ConvLoRA, self).__init__()
        self.conv = conv_module(in_channels, out_channels, kernel_size, **kwargs)
        LoRALayer.__init__(self, r=r, lora_alpha=lora_alpha, lora_dropout=lora_dropout, merge_weights=merge_weights)
        assert isinstance(kernel_size, int)
        # Actual trainable parameters
        if r > 0:
            self.lora_A = nn.Parameter(
                self.conv.weight.new_zeros((r * kernel_size, in_channels * kernel_size))
            )
            self.lora_B = nn.Parameter(
              self.conv.weight.new_zeros((out_channels//self.conv.groups*kernel_size, r*kernel_size))
            )
            self.scaling = self.lora_alpha / self.r
            # Freezing the pre-trained weight matrix
            self.conv.weight.requires_grad = False
        self.reset_parameters()
        self.merged = False

    def reset_parameters(self):
        self.conv.reset_parameters()
        if hasattr(self, 'lora_A'):
            # initialize A the same way as the default for nn.Linear and B to zero
            nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
            nn.init.zeros_(self.lora_B)

    def train(self, mode=True):
        super(ConvLoRA, self).train(mode)
        if mode:
            if self.merge_weights and self.merged:
                if self.r > 0:
                    # Make sure that the weights are not merged
                    self.conv.weight.data -= (self.lora_B @ self.lora_A).view(self.conv.weight.shape) * self.scaling
                self.merged = False
        else:
            if self.merge_weights and not self.merged:
                if self.r > 0:
                    # Merge the weights and mark it
                    self.conv.weight.data += (self.lora_B @ self.lora_A).view(self.conv.weight.shape) * self.scaling
                self.merged = True

    def forward(self, x):
        if self.r > 0 and not self.merged:
            return self.conv._conv_forward(
                x, 
                self.conv.weight + (self.lora_B @ self.lora_A).view(self.conv.weight.shape) * self.scaling,
                self.conv.bias
            )
        return self.conv(x)

class Conv2d(ConvLoRA):
    def __init__(self, *args, **kwargs):
        super(Conv2d, self).__init__(nn.Conv2d, *args, **kwargs)

class Conv1d(ConvLoRA):
    def __init__(self, *args, **kwargs):
        super(Conv1d, self).__init__(nn.Conv1d, *args, **kwargs)

# Can Extend to other ones like this

class Conv3d(ConvLoRA):
    def __init__(self, *args, **kwargs):
        super(Conv3d, self).__init__(nn.Conv3d, *args, **kwargs)
        
        
####################################################################################


def mark_only_lora_as_trainable(model: nn.Module, bias: str = 'none') -> None:
    for n, p in model.named_parameters():
        if 'lora_' not in n:
            p.requires_grad = False
    if bias == 'none':
        return
    elif bias == 'all':
        for n, p in model.named_parameters():
            if 'bias' in n:
                p.requires_grad = True
    elif bias == 'lora_only':
        for m in model.modules():
            if isinstance(m, LoRALayer) and \
                hasattr(m, 'bias') and \
                m.bias is not None:
                    m.bias.requires_grad = True
    else:
        raise NotImplementedError


def lora_state_dict(model: nn.Module, bias: str = 'none') -> Dict[str, torch.Tensor]:
    my_state_dict = model.state_dict()
    if bias == 'none':
        return {k: my_state_dict[k] for k in my_state_dict if 'lora_' in k}
    elif bias == 'all':
        return {k: my_state_dict[k] for k in my_state_dict if 'lora_' in k or 'bias' in k}
    elif bias == 'lora_only':
        to_return = {}
        for k in my_state_dict:
            if 'lora_' in k:
                to_return[k] = my_state_dict[k]
                bias_name = k.split('lora_')[0]+'bias'
                if bias_name in my_state_dict:
                    to_return[bias_name] = my_state_dict[bias_name]
        return to_return
    else:
        raise NotImplementedError

7. Usage (from the official README)

  1. Install loralib.

    pip install loralib
    # Alternatively
    # pip install git+https://github.com/microsoft/LoRA
    
  2. You can choose to adapt some layers by replacing them with the counterparts implemented in loralib. Currently only nn.Linear, nn.Embedding, and nn.Conv2d are supported. A MergedLinear is also provided for the case where a single nn.Linear represents more than one layer, as in some implementations of the attention qkv projection (see the additional notes for more).

    # ===== Before =====
    # layer = nn.Linear(in_features, out_features)
    
    # ===== After ======
    import loralib as lora
    # Add a pair of low-rank adaptation matrices with rank r=16
    layer = lora.Linear(in_features, out_features, r=16)
    
  3. Before the training loop begins, mark only the LoRA parameters as trainable.

    import loralib as lora
    model = BigModel()
    # This sets requires_grad to False for all parameters without the string "lora_" in their names
    lora.mark_only_lora_as_trainable(model)
    # Training loop
    for batch in dataloader:
       ...
    
  4. When saving a checkpoint, generate a state_dict that contains only the LoRA parameters.

    # ===== Before =====
    # torch.save(model.state_dict(), checkpoint_path)
    # ===== After =====
    torch.save(lora.lora_state_dict(model), checkpoint_path)
    
  5. When loading a checkpoint with load_state_dict, be sure to set strict=False.

    # Load the pretrained checkpoint first
    model.load_state_dict(torch.load('ckpt_pretrained.pt'), strict=False)
    # Then load the LoRA checkpoint
    model.load_state_dict(torch.load('ckpt_lora.pt'), strict=False)
    

Additional notes

  1. While this work focuses on a simple yet effective setting, namely only adapting the q and v projections in a Transformer, in the examples LoRA can be applied to any subset of the pretrained weights. We encourage you to explore different configurations, such as adapting the embedding layer by replacing nn.Embedding with lora.Embedding, or adapting the MLP layers. The optimal configuration is likely to vary across model architectures and tasks.

  2. Some Transformer implementations use a single nn.Linear for the query, key, and value projection matrices. If you wish to constrain the rank of the updates to the individual matrices, you must either break it up into three separate matrices or use lora.MergedLinear. If you choose to break up the layer, make sure to modify the pretrained checkpoint accordingly.

    # ===== Before =====
    # qkv_proj = nn.Linear(d_model, 3*d_model)
    # ===== After =====
    # Break it up (remember to modify the pretrained checkpoint accordingly)
    q_proj = lora.Linear(d_model, d_model, r=8)
    k_proj = nn.Linear(d_model, d_model)
    v_proj = lora.Linear(d_model, d_model, r=8)
    # Alternatively, use lora.MergedLinear (recommended)
    qkv_proj = lora.MergedLinear(d_model, 3*d_model, r=8, enable_lora=[True, False, True])
    
  3. Training bias vectors alongside the LoRA parameters can be a cost-efficient way to squeeze out extra task performance (if you tune the learning rate carefully). While the paper does not study its effect thoroughly, it is easy to try in loralib. You can mark some biases as trainable by passing "all" or "lora_only" to bias= when calling mark_only_lora_as_trainable. Remember to pass the corresponding bias= argument to lora_state_dict when saving a checkpoint.

    # ===== Before =====
    # lora.mark_only_lora_as_trainable(model) # Not training any bias vectors
    # ===== After =====
    # Training all bias vectors associated with modules we apply LoRA to 
    lora.mark_only_lora_as_trainable(model, bias='lora_only')
    # Alternatively, we can train *all* bias vectors in the model, including LayerNorm biases
    lora.mark_only_lora_as_trainable(model, bias='all')
    # When saving a checkpoint, use the same bias= ('all' or 'lora_only')
    torch.save(lora.lora_state_dict(model, bias='all'), checkpoint_path)
    
  4. Calling model.eval() will trigger merging the LoRA parameters with the corresponding pretrained parameters, which eliminates the extra latency in subsequent forward passes. Calling model.train() again will undo the merge. This behavior can be disabled by passing merge_weights=False to the LoRA layers. A short usage sketch follows.
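
    A minimal sketch of this merge-on-eval behavior, using a standalone lora.Linear layer (the layer sizes are illustrative):

    import loralib as lora

    layer = lora.Linear(64, 64, r=8)   # merge_weights defaults to True
    layer.eval()     # merges W += (alpha / r) * B A into the frozen weight; subsequent forwards use one matmul
    layer.train()    # un-merges, restoring the separate low-rank path for further training
    # Pass merge_weights=False to keep the weights permanently un-merged:
    layer_no_merge = lora.Linear(64, 64, r=8, merge_weights=False)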
