[Fine-tuning in Detail] LoRA: Low-Rank Adaptation (with Official Code and Tutorial)


Paper: LoRA: Low-Rank Adaptation of Large Language Models

Code: LoRA: Code for loralib


1. Why do we need LoRA?

Large models such as GPT and Llama have enormous numbers of parameters, so adapting them to a specific downstream task with full fine-tuning, i.e. retraining all weights, is extremely expensive and clearly out of reach for most researchers. To address this, Microsoft proposed LoRA (Low-Rank Adaptation), a low-cost method for fine-tuning large models. LoRA follows the adapter idea: it specializes the model to a downstream task by learning a small external module, and its learnable rank-decomposition matrices keep both the fine-tuning and the storage overhead low.


2. Advantages of LoRA

  1. The pretrained model is shared: a separate, lightweight LoRA module can be built for each downstream task.
  2. LoRA lowers the hardware cost of fine-tuning, since only the low-rank matrices are optimized.
  3. At deployment, the trainable matrices are merged into the frozen pretrained weights, so LoRA avoids the "inference latency" of adapters that extend the model depth or reduce the usable sequence length (see the merge sketch after this list).
  4. LoRA can be combined with other methods.
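
To make point 3 concrete, here is a minimal plain-PyTorch sketch (not the official loralib code, which appears in Section 6) showing that folding the low-rank update into the frozen weight leaves the layer output unchanged, so nothing extra is computed at inference time:

```python
import torch

d, k, r, alpha = 64, 64, 4, 4
W0 = torch.randn(d, k)                       # frozen pretrained weight
A = torch.randn(r, k)                        # pretend A, B have already been trained
B = torch.randn(d, r)
scaling = alpha / r
x = torch.randn(8, k)

# Serving without merging: two matrix multiplications per forward pass
y_unmerged = x @ W0.T + (x @ A.T @ B.T) * scaling

# Deployment: fold BA into W0 once, then a single matrix multiplication per forward pass
W_merged = W0 + (B @ A) * scaling
y_merged = x @ W_merged.T

print(torch.allclose(y_unmerged, y_merged, atol=1e-4))  # True: merging adds no latency and changes nothing
```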

3. How LoRA Works

Aghajanyan et al. showed that pretrained language models have a low "intrinsic dimension": even when their features are randomly projected into a much smaller subspace, they can still learn effectively. (Author's aside: the same idea appears in Stable Diffusion, where a perceptual-compression model such as a VAE moves features into a low-dimensional latent space before iterative denoising; the singular-value-decomposition idea also shows up in many models, for example the final multi-stage linear layers of Faster R-CNN.) Inspired by this, LoRA assumes that the weight update learned during adaptation likewise has a low "intrinsic rank". LoRA therefore represents the update $\Delta W$ as the product $BA$, where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ are the trainable parameter matrices:
$$h = W_0 x + \Delta W x = W_0 x + BAx$$

LoRA initializes $A$ with a random Gaussian and $B$ with zeros, so $\Delta W = BA$ is zero at the beginning of training. The update $\Delta W x$ is then scaled by $\frac{\alpha}{r}$, where $\alpha$ is a constant in $r$. When optimizing with Adam, tuning $\alpha$ is roughly equivalent to tuning the learning rate if the initialization is scaled appropriately, so the authors simply set $\alpha$ to the first $r$ they try and do not tune it further. This scaling reduces the need to re-tune hyperparameters when $r$ is changed.
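
As a concrete illustration of the reparameterization, initialization, and $\frac{\alpha}{r}$ scaling described above, here is a minimal LoRA linear layer written from scratch; it is a simplified sketch (class and variable names are illustrative), not the official implementation reproduced in Section 6:

```python
import torch
import torch.nn as nn

class MinimalLoRALinear(nn.Module):
    """h = W0 x + (alpha / r) * B A x, with W0 frozen and only A, B trained."""
    def __init__(self, in_features: int, out_features: int, r: int = 4, alpha: int = 4):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.kaiming_uniform_(self.weight)        # stands in for the pretrained W0
        self.weight.requires_grad = False            # freeze the pretrained weight
        # A ~ random Gaussian, B = 0, so Delta W = BA is zero at the start of training
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r                     # alpha is a constant in r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        frozen = x @ self.weight.T
        update = (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
        return frozen + update

layer = MinimalLoRALinear(128, 128, r=8, alpha=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # only A and B are trainable: 8*128 + 128*8 = 2048 parameters
```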


4. Understanding LoRA in Depth

The authors pose three questions and answer them with experiments, to build a deeper understanding of LoRA:

  1. Which weight matrices in a pretrained Transformer should LoRA be applied to?
  2. What is the optimal $r$ in practice?
  3. What is the connection between $\Delta W$ and $W$? Is $\Delta W$ highly correlated with $W$? How large is $\Delta W$ compared to $W$?

Experimental setup: in the paper, LoRA is applied to the Transformer self-attention projection matrices $W_q$, $W_k$, $W_v$, $W_o$, while the feed-forward (MLP) part is kept completely frozen (a sketch of this setup follows below).
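
A hedged sketch of this setup with loralib (the module names and dimensions are illustrative; real models store their attention projections differently): wrap only the attention projections with LoRA layers and leave everything else, including the MLP, frozen.

```python
import torch.nn as nn
import loralib as lora

d_model, r = 768, 8   # illustrative sizes

# Hypothetical Transformer block: the four attention projections get LoRA adapters...
attn = nn.ModuleDict({
    "q_proj": lora.Linear(d_model, d_model, r=r),
    "k_proj": lora.Linear(d_model, d_model, r=r),
    "v_proj": lora.Linear(d_model, d_model, r=r),
    "o_proj": lora.Linear(d_model, d_model, r=r),
})
# ...while the MLP stays a plain nn.Linear stack and is not adapted
mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
block = nn.ModuleDict({"attn": attn, "mlp": mlp})

# Freeze everything that is not a LoRA matrix: the pretrained projections and the whole MLP
lora.mark_only_lora_as_trainable(block)
print([n for n, p in block.named_parameters() if p.requires_grad])  # only lora_A / lora_B tensors
```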

Question 1: Which weight matrices in a pretrained Transformer should LoRA be applied to?


The paper runs this experiment on GPT-3 175B with a fixed parameter budget over all self-attention layers: adapting a single type of attention weight corresponds to $r = 8$, adapting two types to $r = 4$, and adapting all four types to $r = 2$. The results show that putting all of the budget into $\Delta W_q$ or $\Delta W_k$ alone degrades performance significantly, while adapting both $W_q$ and $W_v$ works best. In other words, even with $r = 4$, a $\Delta W$ spread over several attention weight matrices captures more useful information than adapting a single type of attention weight with a larger rank.

Question 2: What is the optimal $r$ in practice?


The paper then investigates whether LoRA really operates in a low "intrinsic rank" subspace and what the optimal $r$ is. The experiments show that LoRA remains strong even with a very small $r$ (and, as before, adapting several weight types outperforms adapting a single one). This suggests that the update matrix $\Delta W$ may have a very small "intrinsic rank". To further support this hypothesis, the paper measures how much the subspaces learned with different $r$ and different random seeds overlap.

Subspace similarity between different values of $r$

Given the adaptation matrices $A_{r=8}$ and $A_{r=64}$ learned with ranks $r = 8$ and $r = 64$ on the same pretrained model, the paper performs singular value decomposition and obtains the right-singular unitary matrices $U_{A_{r=8}}$ and $U_{A_{r=64}}$. The question is: how much of the subspace spanned by the top $i$ singular vectors of $U_{A_{r=8}}$ (for $1 \le i \le 8$) is contained in the subspace spanned by the top $j$ singular vectors of $U_{A_{r=64}}$ (for $1 \le j \le 64$)? The overlap is quantified with a normalized subspace similarity based on the Grassmann distance; the definition and a small computation sketch follow below.
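
Concretely, the paper's normalized subspace similarity is $\phi(A_{r=8}, A_{r=64}, i, j) = \frac{\lVert U_{A_{r=8}}^{i\top} U_{A_{r=64}}^{j} \rVert_F^2}{\min(i, j)} \in [0, 1]$, where $U^i$ collects the top $i$ right-singular vectors. Below is a minimal PyTorch sketch of this measure; the matrices are random stand-ins for trained LoRA $A$ matrices:

```python
import torch

def subspace_similarity(A1: torch.Tensor, A2: torch.Tensor, i: int, j: int) -> float:
    """Normalized subspace similarity between the top-i and top-j right-singular directions."""
    _, _, Vh1 = torch.linalg.svd(A1, full_matrices=False)   # rows of Vh are right-singular vectors
    _, _, Vh2 = torch.linalg.svd(A2, full_matrices=False)
    U1, U2 = Vh1[:i], Vh2[:j]                                # shapes (i, k) and (j, k)
    overlap = U1 @ U2.T                                      # (i, j) matrix of inner products
    return (overlap.norm(p='fro') ** 2 / min(i, j)).item()

# Random stand-ins for A_{r=8} and A_{r=64} learned on the same layer (k = 1024 here);
# for random matrices the similarity is near 0, while trained A matrices share their top directions
A_r8, A_r64 = torch.randn(8, 1024), torch.randn(64, 1024)
print(subspace_similarity(A_r8, A_r64, i=8, j=8))
```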


The results show that the directions of the top singular vectors of $A_{r=8}$ and $A_{r=64}$ overlap significantly, while the remaining directions do not. This indicates that the top singular-vector directions are the most useful ones, whereas the other directions likely contain mostly random noise accumulated during training. The adaptation matrix can therefore indeed have a very low rank.

Subspace similarity between different random seeds

The paper confirms this further by plotting the normalized subspace similarity between two $r = 64$ runs with different random seeds. The results (left and middle panels of the paper's figure) show that $\Delta W_q$ appears to have a higher "intrinsic rank" than $\Delta W_v$, since the two runs share more common singular-value directions for $\Delta W_q$, which is consistent with the empirical observation in Table 6 of the paper. As a comparison, the paper also plots two random Gaussian matrices (right panel), which share no common singular-value directions at all.


Question 3: What is the connection between $\Delta W$ and $W$? Is $\Delta W$ highly correlated with $W$? How large is $\Delta W$ compared to $W$?

To further reveal the mechanism behind adapting a pretrained model, the paper studies the relationship between $\Delta W$ and $W$. It projects $W$ onto the $r$-dimensional subspace of $\Delta W$ by computing $U^\top W V^\top$, where $U$ and $V$ are the left- and right-singular-vector matrices of $\Delta W$, and then compares the Frobenius norms $\lVert U^\top W V^\top \rVert_F$ and $\lVert W \rVert_F$ (the Frobenius norm is the square root of the sum of squared entries of a matrix). As a baseline, the paper repeats the computation with $U$ and $V$ replaced by the top $r$ singular vectors of $W$ itself or by a random matrix. A small computation sketch follows, and the paper's findings are summarized after it.
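
A minimal PyTorch sketch of this projection and of the amplification factor discussed below, following the paper's convention that the rows of $V$ are the right-singular vectors (the matrices here are random stand-ins for a trained $\Delta W = BA$ and a pretrained $W$):

```python
import torch

def projection_norms(delta_W: torch.Tensor, W: torch.Tensor, r: int):
    """Project W onto the top-r singular directions of delta_W and compare Frobenius norms."""
    U, S, Vh = torch.linalg.svd(delta_W, full_matrices=False)
    U_r, Vh_r = U[:, :r], Vh[:r, :]                      # top-r left / right singular vectors of delta_W
    projected = U_r.T @ W @ Vh_r.T                       # U^T W V^T, an r x r matrix
    proj_norm = projected.norm(p='fro')                  # ||U^T W V^T||_F
    amplification = delta_W.norm(p='fro') / proj_norm    # ||delta_W||_F / ||U^T W V^T||_F
    return proj_norm.item(), W.norm(p='fro').item(), amplification.item()

# Random stand-ins; with the trained GPT-3 matrices the paper reports, e.g., 6.91 / 0.32 ≈ 21.5 for r = 4
W = torch.randn(1024, 1024)
delta_W = torch.randn(1024, 4) @ torch.randn(4, 1024)
print(projection_norms(delta_W, W, r=4))
```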


The experiments show that:

  1. Compared with a random matrix, $\Delta W$ is much more strongly correlated with $W$, which shows that $\Delta W$ amplifies features that are already present in $W$.
  2. Rather than repeating the top singular directions of $W$, $\Delta W$ only amplifies directions that are not emphasized in $W$.
  3. The amplification factor is very large: for $r = 4$ it is $21.5 \approx 6.91 \div 0.32$.

5. Summary

LoRA adapts large models to downstream tasks with low hardware requirements while introducing no inference latency, not shortening the input sequence length, and maintaining high model quality. Through extensive experiments, the authors also probe LoRA's inner mechanism, giving readers a fuller basis for understanding and applying it.


6. Code (from the official loralib)

import torch
import torch.nn as nn
import torch.nn.functional as F

import math
from typing import Optional, List, Dict

class LoRALayer():
    def __init__(
        self, 
        r: int, 
        lora_alpha: int, 
        lora_dropout: float,
        merge_weights: bool,
    ):
        self.r = r
        self.lora_alpha = lora_alpha
        # Optional dropout
        if lora_dropout > 0.:
            self.lora_dropout = nn.Dropout(p=lora_dropout)
        else:
            self.lora_dropout = lambda x: x
        # Mark the weight as unmerged
        self.merged = False
        self.merge_weights = merge_weights


class Embedding(nn.Embedding, LoRALayer):
    # LoRA implemented in a dense layer
    def __init__(
        self,
        num_embeddings: int,
        embedding_dim: int,
        r: int = 0,
        lora_alpha: int = 1,
        merge_weights: bool = True,
        **kwargs
    ):
        nn.Embedding.__init__(self, num_embeddings, embedding_dim, **kwargs)
        LoRALayer.__init__(self, r=r, lora_alpha=lora_alpha, lora_dropout=0,
                           merge_weights=merge_weights)
        # Actual trainable parameters
        if r > 0:
            self.lora_A = nn.Parameter(self.weight.new_zeros((r, num_embeddings)))
            self.lora_B = nn.Parameter(self.weight.new_zeros((embedding_dim, r)))
            self.scaling = self.lora_alpha / self.r
            # Freezing the pre-trained weight matrix
            self.weight.requires_grad = False
        self.reset_parameters()

    def reset_parameters(self):
        nn.Embedding.reset_parameters(self)
        if hasattr(self, 'lora_A'):
            # initialize A to zero and B with a normal distribution
            # (note: the roles of A and B here are the reverse of the Linear layer below)
            nn.init.zeros_(self.lora_A)
            nn.init.normal_(self.lora_B)

    def train(self, mode: bool = True):
        nn.Embedding.train(self, mode)
        if mode:
            if self.merge_weights and self.merged:
                # Make sure that the weights are not merged
                if self.r > 0:
                    self.weight.data -= (self.lora_B @ self.lora_A).transpose(0, 1) * self.scaling
                self.merged = False
        else:
            if self.merge_weights and not self.merged:
                # Merge the weights and mark it
                if self.r > 0:
                    self.weight.data += (self.lora_B @ self.lora_A).transpose(0, 1) * self.scaling
                self.merged = True
        
    def forward(self, x: torch.Tensor):
        if self.r > 0 and not self.merged:
            result = nn.Embedding.forward(self, x)
            after_A = F.embedding(
                x, self.lora_A.transpose(0, 1), self.padding_idx, self.max_norm,
                self.norm_type, self.scale_grad_by_freq, self.sparse
            )
            result += (after_A @ self.lora_B.transpose(0, 1)) * self.scaling
            return result
        else:
            return nn.Embedding.forward(self, x)
            

class Linear(nn.Linear, LoRALayer):
    # LoRA implemented in a dense layer
    def __init__(
        self, 
        in_features: int, 
        out_features: int, 
        r: int = 0, 
        lora_alpha: int = 1, 
        lora_dropout: float = 0.,
        fan_in_fan_out: bool = False, # Set this to True if the layer to replace stores weight like (fan_in, fan_out)
        merge_weights: bool = True,
        **kwargs
    ):
        nn.Linear.__init__(self, in_features, out_features, **kwargs)
        LoRALayer.__init__(self, r=r, lora_alpha=lora_alpha, lora_dropout=lora_dropout,
                           merge_weights=merge_weights)

        self.fan_in_fan_out = fan_in_fan_out
        # Actual trainable parameters
        if r > 0:
            self.lora_A = nn.Parameter(self.weight.new_zeros((r, in_features)))
            self.lora_B = nn.Parameter(self.weight.new_zeros((out_features, r)))
            self.scaling = self.lora_alpha / self.r
            # Freezing the pre-trained weight matrix
            self.weight.requires_grad = False
        self.reset_parameters()
        if fan_in_fan_out:
            self.weight.data = self.weight.data.transpose(0, 1)

    def reset_parameters(self):
        nn.Linear.reset_parameters(self)
        if hasattr(self, 'lora_A'):
            # initialize A the same way as the default for nn.Linear and B to zero
            # this is different than what is described in the paper but should not affect performance
            nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
            nn.init.zeros_(self.lora_B)

    def train(self, mode: bool = True):
        def T(w):
            return w.transpose(0, 1) if self.fan_in_fan_out else w
        nn.Linear.train(self, mode)
        if mode:
            if self.merge_weights and self.merged:
                # Make sure that the weights are not merged
                if self.r > 0:
                    self.weight.data -= T(self.lora_B @ self.lora_A) * self.scaling
                self.merged = False
        else:
            if self.merge_weights and not self.merged:
                # Merge the weights and mark it
                if self.r > 0:
                    self.weight.data += T(self.lora_B @ self.lora_A) * self.scaling
                self.merged = True       

    def forward(self, x: torch.Tensor):
        def T(w):
            return w.transpose(0, 1) if self.fan_in_fan_out else w
        if self.r > 0 and not self.merged:
            result = F.linear(x, T(self.weight), bias=self.bias)            
            result += (self.lora_dropout(x) @ self.lora_A.transpose(0, 1) @ self.lora_B.transpose(0, 1)) * self.scaling
            return result
        else:
            return F.linear(x, T(self.weight), bias=self.bias)


class MergedLinear(nn.Linear, LoRALayer):
    # LoRA implemented in a dense layer
    def __init__(
        self, 
        in_features: int, 
        out_features: int, 
        r: int = 0, 
        lora_alpha: int = 1, 
        lora_dropout: float = 0.,
        enable_lora: List[bool] = [False],
        fan_in_fan_out: bool = False,
        merge_weights: bool = True,
        **kwargs
    ):
        nn.Linear.__init__(self, in_features, out_features, **kwargs)
        LoRALayer.__init__(self, r=r, lora_alpha=lora_alpha, lora_dropout=lora_dropout,
                           merge_weights=merge_weights)
        assert out_features % len(enable_lora) == 0, \
            'The length of enable_lora must divide out_features'
        self.enable_lora = enable_lora
        self.fan_in_fan_out = fan_in_fan_out
        # Actual trainable parameters
        if r > 0 and any(enable_lora):
            self.lora_A = nn.Parameter(
                self.weight.new_zeros((r * sum(enable_lora), in_features)))
            self.lora_B = nn.Parameter(
                self.weight.new_zeros((out_features // len(enable_lora) * sum(enable_lora), r))
            ) # weights for Conv1D with groups=sum(enable_lora)
            self.scaling = self.lora_alpha / self.r
            # Freezing the pre-trained weight matrix
            self.weight.requires_grad = False
            # Compute the indices
            self.lora_ind = self.weight.new_zeros(
                (out_features, ), dtype=torch.bool
            ).view(len(enable_lora), -1)
            self.lora_ind[enable_lora, :] = True
            self.lora_ind = self.lora_ind.view(-1)
        self.reset_parameters()
        if fan_in_fan_out:
            self.weight.data = self.weight.data.transpose(0, 1)

    def reset_parameters(self):
        nn.Linear.reset_parameters(self)
        if hasattr(self, 'lora_A'):
            # initialize A the same way as the default for nn.Linear and B to zero
            nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
            nn.init.zeros_(self.lora_B)

    def zero_pad(self, x):
        result = x.new_zeros((len(self.lora_ind), *x.shape[1:]))
        result[self.lora_ind] = x
        return result

    def merge_AB(self):
        def T(w):
            return w.transpose(0, 1) if self.fan_in_fan_out else w
        delta_w = F.conv1d(
            self.lora_A.unsqueeze(0), 
            self.lora_B.unsqueeze(-1), 
            groups=sum(self.enable_lora)
        ).squeeze(0)
        return T(self.zero_pad(delta_w))

    def train(self, mode: bool = True):
        def T(w):
            return w.transpose(0, 1) if self.fan_in_fan_out else w
        nn.Linear.train(self, mode)
        if mode:
            if self.merge_weights and self.merged:
                # Make sure that the weights are not merged
                if self.r > 0 and any(self.enable_lora):
                    self.weight.data -= self.merge_AB() * self.scaling
                self.merged = False
        else:
            if self.merge_weights and not self.merged:
                # Merge the weights and mark it
                if self.r > 0 and any(self.enable_lora):
                    self.weight.data += self.merge_AB() * self.scaling
                self.merged = True        

    def forward(self, x: torch.Tensor):
        def T(w):
            return w.transpose(0, 1) if self.fan_in_fan_out else w
        if self.merged:
            return F.linear(x, T(self.weight), bias=self.bias)
        else:
            result = F.linear(x, T(self.weight), bias=self.bias)
            if self.r > 0:
                result += self.lora_dropout(x) @ T(self.merge_AB().T) * self.scaling
            return result

class ConvLoRA(nn.Module, LoRALayer):
    def __init__(self, conv_module, in_channels, out_channels, kernel_size, r=0, lora_alpha=1, lora_dropout=0., merge_weights=True, **kwargs):
        super(ConvLoRA, self).__init__()
        self.conv = conv_module(in_channels, out_channels, kernel_size, **kwargs)
        LoRALayer.__init__(self, r=r, lora_alpha=lora_alpha, lora_dropout=lora_dropout, merge_weights=merge_weights)
        assert isinstance(kernel_size, int)
        # Actual trainable parameters
        if r > 0:
            self.lora_A = nn.Parameter(
                self.conv.weight.new_zeros((r * kernel_size, in_channels * kernel_size))
            )
            self.lora_B = nn.Parameter(
              self.conv.weight.new_zeros((out_channels//self.conv.groups*kernel_size, r*kernel_size))
            )
            self.scaling = self.lora_alpha / self.r
            # Freezing the pre-trained weight matrix
            self.conv.weight.requires_grad = False
        self.reset_parameters()
        self.merged = False

    def reset_parameters(self):
        self.conv.reset_parameters()
        if hasattr(self, 'lora_A'):
            # initialize A the same way as the default for nn.Linear and B to zero
            nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
            nn.init.zeros_(self.lora_B)

    def train(self, mode=True):
        super(ConvLoRA, self).train(mode)
        if mode:
            if self.merge_weights and self.merged:
                if self.r > 0:
                    # Make sure that the weights are not merged
                    self.conv.weight.data -= (self.lora_B @ self.lora_A).view(self.conv.weight.shape) * self.scaling
                self.merged = False
        else:
            if self.merge_weights and not self.merged:
                if self.r > 0:
                    # Merge the weights and mark it
                    self.conv.weight.data += (self.lora_B @ self.lora_A).view(self.conv.weight.shape) * self.scaling
                self.merged = True

    def forward(self, x):
        if self.r > 0 and not self.merged:
            return self.conv._conv_forward(
                x, 
                self.conv.weight + (self.lora_B @ self.lora_A).view(self.conv.weight.shape) * self.scaling,
                self.conv.bias
            )
        return self.conv(x)

class Conv2d(ConvLoRA):
    def __init__(self, *args, **kwargs):
        super(Conv2d, self).__init__(nn.Conv2d, *args, **kwargs)

class Conv1d(ConvLoRA):
    def __init__(self, *args, **kwargs):
        super(Conv1d, self).__init__(nn.Conv1d, *args, **kwargs)

# Can Extend to other ones like this

class Conv3d(ConvLoRA):
    def __init__(self, *args, **kwargs):
        super(Conv3d, self).__init__(nn.Conv3d, *args, **kwargs)
        
        
####################################################################################


def mark_only_lora_as_trainable(model: nn.Module, bias: str = 'none') -> None:
    for n, p in model.named_parameters():
        if 'lora_' not in n:
            p.requires_grad = False
    if bias == 'none':
        return
    elif bias == 'all':
        for n, p in model.named_parameters():
            if 'bias' in n:
                p.requires_grad = True
    elif bias == 'lora_only':
        for m in model.modules():
            if isinstance(m, LoRALayer) and \
                hasattr(m, 'bias') and \
                m.bias is not None:
                    m.bias.requires_grad = True
    else:
        raise NotImplementedError


def lora_state_dict(model: nn.Module, bias: str = 'none') -> Dict[str, torch.Tensor]:
    my_state_dict = model.state_dict()
    if bias == 'none':
        return {k: my_state_dict[k] for k in my_state_dict if 'lora_' in k}
    elif bias == 'all':
        return {k: my_state_dict[k] for k in my_state_dict if 'lora_' in k or 'bias' in k}
    elif bias == 'lora_only':
        to_return = {}
        for k in my_state_dict:
            if 'lora_' in k:
                to_return[k] = my_state_dict[k]
                bias_name = k.split('lora_')[0]+'bias'
                if bias_name in my_state_dict:
                    to_return[bias_name] = my_state_dict[bias_name]
        return to_return
    else:
        raise NotImplementedError

7. Usage (from the official README)

  1. Install loralib.

    pip install loralib
    # Alternatively
    # pip install git+https://github.com/microsoft/LoRA
    
  2. You can choose to adapt some layers by replacing them with the counterparts implemented in loralib. Currently only nn.Linear, nn.Embedding, and nn.Conv2d are supported. A MergedLinear is also provided for the case where a single nn.Linear represents more than one layer, as in some implementations of the attention qkv projection (see the additional notes for more).

    # ===== Before =====
    # layer = nn.Linear(in_features, out_features)
    
    # ===== After ======
    import loralib as lora
    # Add a pair of low-rank adaptation matrices with rank r=16
    layer = lora.Linear(in_features, out_features, r=16)
    
  3. Before the training loop begins, mark only the LoRA parameters as trainable.

    import loralib as lora
    model = BigModel()
    # This sets requires_grad to False for all parameters without the string "lora_" in their names
    lora.mark_only_lora_as_trainable(model)
    # Training loop
    for batch in dataloader:
       ...
    
  4. When saving a checkpoint, generate a state_dict that contains only the LoRA parameters.

    # ===== Before =====
    # torch.save(model.state_dict(), checkpoint_path)
    # ===== After =====
    torch.save(lora.lora_state_dict(model), checkpoint_path)
    
  5. When loading a checkpoint with load_state_dict, be sure to set strict=False.

    # Load the pretrained checkpoint first
    model.load_state_dict(torch.load('ckpt_pretrained.pt'), strict=False)
    # Then load the LoRA checkpoint
    model.load_state_dict(torch.load('ckpt_lora.pt'), strict=False)
    

Additional notes

  1. While this work focuses on a simple yet effective setting, namely only adapting the q and v projections in a Transformer, in the examples LoRA can be applied to any subset of the pretrained weights. We encourage you to explore different configurations, such as adapting the embedding layer by replacing nn.Embedding with lora.Embedding, or adapting the MLP layers. The optimal configuration is likely to vary across model architectures and tasks.

  2. Some Transformer implementations use a single nn.Linear for the query, key, and value projection matrices. If you wish to constrain the rank of the updates to the individual matrices, you must either break it up into three separate matrices or use lora.MergedLinear. If you choose to break up the layer, make sure to modify the pretrained checkpoint accordingly.

    # ===== Before =====
    # qkv_proj = nn.Linear(d_model, 3*d_model)
    # ===== After =====
    # Break it up (remember to modify the pretrained checkpoint accordingly)
    q_proj = lora.Linear(d_model, d_model, r=8)
    k_proj = nn.Linear(d_model, d_model)
    v_proj = lora.Linear(d_model, d_model, r=8)
    # Alternatively, use lora.MergedLinear (recommended)
    qkv_proj = lora.MergedLinear(d_model, 3*d_model, r=8, enable_lora=[True, False, True])
    
  3. Training bias vectors alongside the LoRA parameters can be a cost-efficient way to squeeze out extra task performance (if you tune the learning rate carefully). While the paper does not study its effect thoroughly, it is easy to try in loralib. You can mark some biases as trainable by passing "all" or "lora_only" to bias= when calling mark_only_lora_as_trainable. Remember to pass the corresponding bias= argument to lora_state_dict when saving a checkpoint.

    # ===== Before =====
    # lora.mark_only_lora_as_trainable(model) # Not training any bias vectors
    # ===== After =====
    # Training all bias vectors associated with modules we apply LoRA to 
    lora.mark_only_lora_as_trainable(model, bias='lora_only')
    # Alternatively, we can train *all* bias vectors in the model, including LayerNorm biases
    lora.mark_only_lora_as_trainable(model, bias='all')
    # When saving a checkpoint, use the same bias= ('all' or 'lora_only')
    torch.save(lora.lora_state_dict(model, bias='all'), checkpoint_path)
    
  4. Calling model.eval() will trigger merging the LoRA parameters with the corresponding pretrained parameters, which eliminates the extra latency in subsequent forward passes. Calling model.train() again will undo the merge. This behavior can be disabled by passing merge_weights=False to the LoRA layers. A short usage sketch follows.
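
    A minimal sketch of this merge-on-eval behavior, using a standalone lora.Linear layer (the layer sizes are illustrative):

    import loralib as lora

    layer = lora.Linear(64, 64, r=8)   # merge_weights defaults to True
    layer.eval()     # merges W += (alpha / r) * B A into the frozen weight; subsequent forwards use one matmul
    layer.train()    # un-merges, restoring the separate low-rank path for further training
    # Pass merge_weights=False to keep the weights permanently un-merged:
    layer_no_merge = lora.Linear(64, 64, r=8, merge_weights=False)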
