Paper: LoRA: Low-Rank Adaptation of Large Language Models
Code: LoRA: Code for loralib
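1. Why LoRA?
Large models such as GPT and Llama have enormous parameter counts. Adapting one to a specific downstream task through full fine-tuning, which retrains every parameter, is very expensive and out of reach for most individual researchers. Microsoft therefore proposed LoRA (Low-Rank Adaptation), a low-cost fine-tuning method for large models. LoRA builds on the adapter idea: it learns a small external module for each downstream task, and it parameterizes that module as a pair of trainable rank-decomposition matrices, which cuts both the fine-tuning cost and the storage cost.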
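2. Advantages of LoRA
- The pretrained model is shared; a separate LoRA module can be built for each downstream task.
- LoRA lowers the hardware cost of fine-tuning, because only the low-rank matrices are optimized.
- At deployment time the trainable matrices are merged into the frozen pretrained weights, so LoRA adds no "inference latency" (unlike approaches that extend model depth or reduce the usable sequence length).
- LoRA can be combined with other methods.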
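To make the hardware saving in the list above concrete, the sketch below counts the trainable parameters of a single weight matrix under full fine-tuning and under a rank-r LoRA update (two thin matrices of shape d × r and r × k, introduced in Section 3). The helper names, the 12288 × 12288 shape, and r = 8 are illustrative assumptions rather than values taken from this article.

```python
def full_finetune_params(d: int, k: int) -> int:
    """Trainable parameters when the whole d x k weight matrix is updated."""
    return d * k


def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters for a rank-r update: B is d x r, A is r x k."""
    return d * r + r * k


# Illustrative (assumed) sizes: one square projection matrix and a small rank.
d, k, r = 12288, 12288, 8
full = full_finetune_params(d, k)
lora = lora_params(d, k, r)
print(f"full fine-tuning: {full:,} parameters")
print(f"LoRA (r={r}):     {lora:,} parameters")
print(f"reduction factor: {full / lora:.0f}x")
```

The saving grows with the matrix size, since the LoRA count is linear in d and k while the full count is their product.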
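3. How LoRA works
Aghajanyan et al. showed that pretrained language models have a low "intrinsic dimension": they can still learn effectively after being randomly projected into a smaller subspace. (An aside: Stable Diffusion reflects a similar idea, using a perceptual compression model such as a VAE to move features into a low-dimensional latent space where the iterative denoising runs. Singular-value-decomposition ideas also show up in many other models, for example in the multi-level linear layers at the end of Faster R-CNN.) Inspired by this, LoRA hypothesizes that the weight update learned during adaptation likewise has a low "intrinsic rank". LoRA therefore decomposes the update $\Delta W$ into a product $BA$, where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ are the trainable parameter matrices, $r \ll \min(d, k)$, and the pretrained weight $W_0$ stays frozen:

$$h = W_0 x + \Delta W x = W_0 x + BAx$$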
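LoRA initializes $A$ with a random Gaussian and $B$ with zeros, so $\Delta W = BA$ is zero at the start of training. $\Delta W x$ is then scaled by $\frac{\alpha}{r}$, where $\alpha$ is a constant in $r$. When training with an optimizer such as Adam, tuning $\alpha$ is roughly equivalent to tuning the learning rate, provided the initialization is scaled appropriately. The authors therefore simply set $\alpha$ to the first $r$ they try and do not tune it further. This scaling reduces the need to retune hyperparameters when $r$ changes.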
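The forward pass, the zero initialization, and the $\frac{\alpha}{r}$ scaling described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the loralib implementation; the toy sizes, the value of `alpha`, and the name `lora_forward` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 6, 5, 2, 4     # assumed toy sizes
W0 = rng.normal(size=(d, k))    # frozen pretrained weight
A = rng.normal(size=(r, k))     # random Gaussian init
B = np.zeros((d, r))            # zero init, so BA = 0 at the start
scaling = alpha / r


def lora_forward(x, W0, A, B, scaling):
    """h = W0 x + (alpha / r) * B A x, without ever forming the d x k update."""
    return W0 @ x + scaling * (B @ (A @ x))


x = rng.normal(size=k)
h_start = lora_forward(x, W0, A, B, scaling)
starts_at_pretrained = np.allclose(h_start, W0 @ x)

# Pretend a few training steps made B non-zero (the values are arbitrary).
B_trained = rng.normal(size=(d, r))
h_adapted = lora_forward(x, W0, A, B_trained, scaling)

# For deployment the update can be folded into a single matrix.
W_merged = W0 + scaling * (B_trained @ A)
merge_matches = np.allclose(W_merged @ x, h_adapted)
print(starts_at_pretrained, merge_matches)
```

Because $B$ starts at zero, the adapted model reproduces the pretrained model exactly before any training step, and folding the update into the weight afterwards gives the same output with a single matrix product.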
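4. Understanding LoRA in depth
The authors pose three questions and answer them experimentally to give a deeper understanding of LoRA:
- Which weight matrices in a pretrained Transformer should LoRA be applied to?
- What is the best $r$ in practice?
- How are $\Delta W$ and $W$ related? Is $\Delta W$ highly correlated with $W$? How large is $\Delta W$ relative to $W$?
Experimental setup: in the paper the authors apply LoRA to a Transformer and adapt only the self-attention weights $W_q$, $W_k$, $W_v$, and $W_o$ (through the updates $\Delta W_q$, $\Delta W_k$, $\Delta W_v$, $\Delta W_o$), while the feed-forward (MLP) modules are kept completely frozen.
Question 1: Which weight matrices in a pretrained Transformer should LoRA be applied to?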
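The paper runs this experiment on GPT-3 175B under a fixed parameter budget. Across all self-attention layers, the budget corresponds to $r = 8$ when one type of attention weight is adapted, $r = 4$ when two types are adapted, and $r = 2$ when all four types are adapted. The results show that putting all the parameters into $\Delta W_q$ or $\Delta W_k$ alone gives significantly lower performance, while adapting $W_q$ and $W_v$ together gives the best result. This suggests that even at $r = 4$, $\Delta W$ captures enough information that adapting several weight matrices is preferable to adapting a single type with a larger $r$.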
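The three configurations above trade rank against the number of adapted matrices while keeping the number of trainable parameters the same. The sketch below checks that arithmetic; the 96 layers and the hidden size of 12288 are assumed values for GPT-3 175B, and `lora_budget` is a hypothetical helper.

```python
def lora_budget(n_layers: int, d_model: int, n_types: int, r: int) -> int:
    """Trainable LoRA parameters when n_types square d_model x d_model
    attention matrices per layer each get a rank-r update (B: d x r, A: r x d)."""
    per_matrix = r * (d_model + d_model)
    return n_layers * n_types * per_matrix


# Assumed GPT-3 175B shape: 96 layers, d_model = 12288.
N_LAYERS, D_MODEL = 96, 12288
configs = {"one type, r=8": (1, 8), "two types, r=4": (2, 4), "four types, r=2": (4, 2)}
budgets = {name: lora_budget(N_LAYERS, D_MODEL, n, r) for name, (n, r) in configs.items()}
for name, total in budgets.items():
    print(f"{name}: {total:,} trainable parameters")
```

Under these assumptions every configuration comes to the same total, so the comparison in the paragraph above isolates where the parameters are spent rather than how many there are.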
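Question 2: What is the best $r$ in practice?
The paper goes on to test whether LoRA really has a low "intrinsic rank" and what the best $r$ is. The results show that LoRA remains competitive even with a very small $r$ (and adapting several weight types performs better than adapting a single type). This suggests that the update matrix $\Delta W$ may have a very small "intrinsic rank". To support this hypothesis further, the paper measures the overlap between the subspaces learned with different values of $r$ and with different random seeds.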
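Subspace similarity between different r
The paper takes the learned adaptation matrices $A_{r=8}$ and $A_{r=64}$, trained with rank $r=8$ and $r=64$ from the same pretrained model, and applies singular value decomposition to obtain the right-singular unitary matrices $U_{A_{r=8}}$ and $U_{A_{r=64}}$. The question is: how much of the subspace spanned by the top $i$ singular vectors of $U_{A_{r=8}}$ ($1 \le i \le 8$) is contained in the subspace spanned by the top $j$ singular vectors of $U_{A_{r=64}}$ ($1 \le j \le 64$)? The overlap is quantified with a normalized subspace similarity based on the Grassmann distance.
The results show that the directions of the top singular vectors overlap significantly between $A_{r=8}$ and $A_{r=64}$, while the other directions do not. This suggests that the top singular-vector directions of $A_{r=8}$ and $A_{r=64}$ are the most useful ones, and that the remaining directions probably contain mostly random noise accumulated during training. The adaptation matrix can therefore indeed have a very low rank.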
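The normalized subspace similarity used in this comparison can be written as $\phi(A_1, A_2, i, j) = \lVert U_1^{i\top} U_2^{j} \rVert_F^2 / \min(i, j)$, where $U^{i}$ holds the top $i$ right singular vectors as columns. The NumPy sketch below computes it; the function name and the synthetic matrices (one shared dominant direction plus small noise) are assumptions chosen to mimic the "top directions overlap" finding, not data from the paper.

```python
import numpy as np


def subspace_similarity(A1: np.ndarray, A2: np.ndarray, i: int, j: int) -> float:
    """Normalized overlap between the span of the top-i right singular vectors
    of A1 and the span of the top-j right singular vectors of A2 (in [0, 1])."""
    # np.linalg.svd returns Vh whose rows are the right singular vectors,
    # sorted by decreasing singular value.
    _, _, vh1 = np.linalg.svd(A1, full_matrices=False)
    _, _, vh2 = np.linalg.svd(A2, full_matrices=False)
    U1 = vh1[:i].T  # shape (k, i)
    U2 = vh2[:j].T  # shape (k, j)
    return float(np.linalg.norm(U1.T @ U2, "fro") ** 2 / min(i, j))


rng = np.random.default_rng(0)
k = 128
shared = rng.normal(size=k)
shared /= np.linalg.norm(shared)
# Two stand-in "learned" matrices that share one dominant direction plus small noise.
A_small = 10.0 * np.outer(rng.normal(size=8), shared) + 0.1 * rng.normal(size=(8, k))
A_large = 10.0 * np.outer(rng.normal(size=64), shared) + 0.1 * rng.normal(size=(64, k))

top_overlap = subspace_similarity(A_small, A_large, 1, 1)
random_overlap = subspace_similarity(rng.normal(size=(8, k)), rng.normal(size=(64, k)), 1, 1)
print(f"shared top direction: {top_overlap:.3f}")
print(f"unrelated Gaussian matrices: {random_overlap:.3f}")
```

A value near 1 means one subspace is almost contained in the other, and a value near 0 means the two are nearly orthogonal.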
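Subspace similarity between different random seeds
The paper further confirms this by plotting the normalized subspace similarity between two runs with $r = 64$ and different random seeds. The results (left and middle panels of the figure) show that $\Delta W_q$ appears to have a higher **"intrinsic rank"** than $\Delta W_v$, because the two runs learn more common singular-value directions for $\Delta W_q$, which is consistent with the empirical observation in Table 6 of the paper. For comparison, the paper also plots two random Gaussian matrices (right panel), which share no common singular-value directions.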
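Question 3: How are $\Delta W$ and $W$ related? Is $\Delta W$ highly correlated with $W$? How large is $\Delta W$ relative to $W$?
To reveal more of the mechanism behind adapting a pretrained model, the paper studies the relationship between $\Delta W$ and $W$. It projects $W$ onto the $r$-dimensional subspace of $\Delta W$ by computing $U^T W V^T$, where $U$ and $V$ are the left and right singular-vector matrices of $\Delta W$. It then compares $\|U^T W V^T\|_F$ with $\|W\|_F$, where $\|\cdot\|_F$ is the Frobenius norm (the square root of the sum of the squares of all entries). As a baseline, the paper also replaces $U$ and $V$ in $\|U^T W V^T\|_F$ with the top $r$ singular vectors of $W$ or with those of a random matrix.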
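The results show:
- Compared with a random matrix, $\Delta W$ correlates more strongly with $W$, which indicates that $\Delta W$ amplifies some features that are already present in $W$.
- $\Delta W$ does not repeat the top singular directions of $W$; it only amplifies directions that $W$ does not emphasize.
- The amplification factor is large: for $r = 4$ it is $21.5 \approx 6.91 \div 0.32$.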
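The projection and the amplification factor discussed above can be sketched as follows. The code uses $U^\top W V$ with the right singular vectors stored as the columns of $V$, which corresponds to what the text writes as $U^T W V^T$. The matrices are random stand-ins, so the printed numbers illustrate the computation only and do not reproduce the paper's values; `projected_norm` is a hypothetical helper name.

```python
import numpy as np


def projected_norm(W: np.ndarray, basis_from: np.ndarray, r: int) -> float:
    """||U^T W V||_F, where U and V hold the top-r left and right singular
    vectors of `basis_from` as columns."""
    u, _, vh = np.linalg.svd(basis_from, full_matrices=False)
    U = u[:, :r]     # shape (d, r)
    V = vh[:r].T     # shape (k, r)
    return float(np.linalg.norm(U.T @ W @ V, "fro"))


rng = np.random.default_rng(0)
d, r = 64, 4
W = rng.normal(size=(d, d))                                          # stand-in pretrained weight
delta_W = 0.1 * (rng.normal(size=(d, r)) @ rng.normal(size=(r, d)))  # stand-in rank-r update

norm_in_delta_dirs = projected_norm(W, delta_W, r)    # W seen through delta_W's directions
norm_in_own_dirs = projected_norm(W, W, r)            # W seen through its own top directions
norm_in_random_dirs = projected_norm(W, rng.normal(size=(d, d)), r)

# Amplification factor: size of the update divided by how much of W
# already lies in the update's directions.
amplification = float(np.linalg.norm(delta_W, "fro")) / norm_in_delta_dirs
print(norm_in_delta_dirs, norm_in_own_dirs, norm_in_random_dirs, amplification)
```

Projecting $W$ onto its own top-$r$ directions always gives the largest possible value, so that case serves as an upper reference when reading the other two.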
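5. Summary
LoRA adapts large models to downstream tasks with low hardware requirements while adding no inference latency, leaving the input sequence length untouched, and keeping model quality high. The authors also explore LoRA's inner workings through extensive experiments, which helps readers understand and apply it.
6. Code (from the official repository)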
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
from typing import Optional, List, Dict


class LoRALayer():
    def __init__(
        self,
        r: int,
        lora_alpha: int,
        lora_dropout: float,
        merge_weights: bool,
    ):
        self.r = r
        self.lora_alpha = lora_alpha
        # Optional dropout
        if lora_dropout > 0.:
            self.lora_dropout = nn.Dropout(p=lora_dropout)
        else:
            self.lora_dropout = lambda x: x
        # Mark the weight as unmerged
        self.merged = False
        self.merge_weights = merge_weights


class Embedding(nn.Embedding, LoRALayer):
    # LoRA implemented in an embedding layer
    def __init__(
        self,
        num_embeddings: int,
        embedding_dim: int,
        r: int = 0,
        lora_alpha: int = 1,
        merge_weights: bool = True,
        **kwargs
    ):
        nn.Embedding.__init__(self, num_embeddings, embedding_dim, **kwargs)
        LoRALayer.__init__(self, r=r, lora_alpha=lora_alpha, lora_dropout=0,
                           merge_weights=merge_weights)
        # Actual trainable parameters
        if r > 0:
            self.lora_A = nn.Parameter(self.weight.new_zeros((r, num_embeddings)))
            self.lora_B = nn.Parameter(self.weight.new_zeros((embedding_dim, r)))
            self.scaling = self.lora_alpha / self.r
            # Freezing the pre-trained weight matrix
            self.weight.requires_grad = False
        self.reset_parameters()

    def reset_parameters(self):
        nn.Embedding.reset_parameters(self)
        if hasattr(self, 'lora_A'):
            # A is set to zero and B is Gaussian here (the reverse of the paper's
            # description); B @ A still starts at zero
            nn.init.zeros_(self.lora_A)
            nn.init.normal_(self.lora_B)

    def train(self, mode: bool = True):
        nn.Embedding.train(self, mode)
        if mode:
            if self.merge_weights and self.merged:
                # Make sure that the weights are not merged
                if self.r > 0:
                    self.weight.data -= (self.lora_B @ self.lora_A).transpose(0, 1) * self.scaling
                self.merged = False
        else:
            if self.merge_weights and not self.merged:
                # Merge the weights and mark it
                if self.r > 0:
                    self.weight.data += (self.lora_B @ self.lora_A).transpose(0, 1) * self.scaling
                self.merged = True

    def forward(self, x: torch.Tensor):
        if self.r > 0 and not self.merged:
            result = nn.Embedding.forward(self, x)
            after_A = F.embedding(
                x, self.lora_A.transpose(0, 1), self.padding_idx, self.max_norm,
                self.norm_type, self.scale_grad_by_freq, self.sparse
            )
            result += (after_A @ self.lora_B.transpose(0, 1)) * self.scaling
            return result
        else:
            return nn.Embedding.forward(self, x)


class Linear(nn.Linear, LoRALayer):
    # LoRA implemented in a dense layer
    def __init__(
        self,
        in_features: int,
        out_features: int,
        r: int = 0,
        lora_alpha: int = 1,
        lora_dropout: float = 0.,
        fan_in_fan_out: bool = False,  # Set this to True if the layer to replace stores weight like (fan_in, fan_out)
        merge_weights: bool = True,
        **kwargs
    ):
        nn.Linear.__init__(self, in_features, out_features, **kwargs)
        LoRALayer.__init__(self, r=r, lora_alpha=lora_alpha, lora_dropout=lora_dropout,
                           merge_weights=merge_weights)
        self.fan_in_fan_out = fan_in_fan_out
        # Actual trainable parameters
        if r > 0:
            self.lora_A = nn.Parameter(self.weight.new_zeros((r, in_features)))
            self.lora_B = nn.Parameter(self.weight.new_zeros((out_features, r)))
            self.scaling = self.lora_alpha / self.r
            # Freezing the pre-trained weight matrix
            self.weight.requires_grad = False
        self.reset_parameters()
        if fan_in_fan_out:
            self.weight.data = self.weight.data.transpose(0, 1)

    def reset_parameters(self):
        nn.Linear.reset_parameters(self)
        if hasattr(self, 'lora_A'):
            # initialize A the same way as the default for nn.Linear and B to zero
            # this is different than what is described in the paper but should not affect performance
            nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
            nn.init.zeros_(self.lora_B)

    def train(self, mode: bool = True):
        def T(w):
            return w.transpose(0, 1) if self.fan_in_fan_out else w
        nn.Linear.train(self, mode)
        if mode:
            if self.merge_weights and self.merged:
                # Make sure that the weights are not merged
                if self.r > 0:
                    self.weight.data -= T(self.lora_B @ self.lora_A) * self.scaling
                self.merged = False
        else:
            if self.merge_weights and not self.merged:
                # Merge the weights and mark it
                if self.r > 0:
                    self.weight.data += T(self.lora_B @ self.lora_A) * self.scaling
                self.merged = True

    def forward(self, x: torch.Tensor):
        def T(w):
            return w.transpose(0, 1) if self.fan_in_fan_out else w
        if self.r > 0 and not self.merged:
            result = F.linear(x, T(self.weight), bias=self.bias)
            result += (self.lora_dropout(x) @ self.lora_A.transpose(0, 1) @ self.lora_B.transpose(0, 1)) * self.scaling
            return result
        else:
            return F.linear(x, T(self.weight), bias=self.bias)


class MergedLinear(nn.Linear, LoRALayer):
    # LoRA implemented in a dense layer
    def __init__(
        self,
        in_features: int,
        out_features: int,
        r: int = 0,
        lora_alpha: int = 1,
        lora_dropout: float = 0.,
        enable_lora: List[bool] = [False],
        fan_in_fan_out: bool = False,
        merge_weights: bool = True,
        **kwargs
    ):
        nn.Linear.__init__(self, in_features, out_features, **kwargs)
        LoRALayer.__init__(self, r=r, lora_alpha=lora_alpha, lora_dropout=lora_dropout,
                           merge_weights=merge_weights)
        assert out_features % len(enable_lora) == 0, \
            'The length of enable_lora must divide out_features'
        self.enable_lora = enable_lora
        self.fan_in_fan_out = fan_in_fan_out
        # Actual trainable parameters
        if r > 0 and any(enable_lora):
            self.lora_A = nn.Parameter(
                self.weight.new_zeros((r * sum(enable_lora), in_features)))
            self.lora_B = nn.Parameter(
                self.weight.new_zeros((out_features // len(enable_lora) * sum(enable_lora), r))
            )  # weights for Conv1D with groups=sum(enable_lora)
            self.scaling = self.lora_alpha / self.r
            # Freezing the pre-trained weight matrix
            self.weight.requires_grad = False
            # Compute the indices
            self.lora_ind = self.weight.new_zeros(
                (out_features, ), dtype=torch.bool
            ).view(len(enable_lora), -1)
            self.lora_ind[enable_lora, :] = True
            self.lora_ind = self.lora_ind.view(-1)
        self.reset_parameters()
        if fan_in_fan_out:
            self.weight.data = self.weight.data.transpose(0, 1)

    def reset_parameters(self):
        nn.Linear.reset_parameters(self)
        if hasattr(self, 'lora_A'):
            # initialize A the same way as the default for nn.Linear and B to zero
            nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
            nn.init.zeros_(self.lora_B)

    def zero_pad(self, x):
        result = x.new_zeros((len(self.lora_ind), *x.shape[1:]))
        result[self.lora_ind] = x
        return result

    def merge_AB(self):
        def T(w):
            return w.transpose(0, 1) if self.fan_in_fan_out else w
        delta_w = F.conv1d(
            self.lora_A.unsqueeze(0),
            self.lora_B.unsqueeze(-1),
            groups=sum(self.enable_lora)
        ).squeeze(0)
        return T(self.zero_pad(delta_w))

    def train(self, mode: bool = True):
        def T(w):
            return w.transpose(0, 1) if self.fan_in_fan_out else w
        nn.Linear.train(self, mode)
        if mode:
            if self.merge_weights and self.merged:
                # Make sure that the weights are not merged
                if self.r > 0 and any(self.enable_lora):
                    self.weight.data -= self.merge_AB() * self.scaling
                self.merged = False
        else:
            if self.merge_weights and not self.merged:
                # Merge the weights and mark it
                if self.r > 0 and any(self.enable_lora):
                    self.weight.data += self.merge_AB() * self.scaling
                self.merged = True

    def forward(self, x: torch.Tensor):
        def T(w):
            return w.transpose(0, 1) if self.fan_in_fan_out else w
        if self.merged:
            return F.linear(x, T(self.weight), bias=self.bias)
        else:
            result = F.linear(x, T(self.weight), bias=self.bias)
            if self.r > 0:
                result += self.lora_dropout(x) @ T(self.merge_AB().T) * self.scaling
            return result


class ConvLoRA(nn.Module, LoRALayer):
    def __init__(self, conv_module, in_channels, out_channels, kernel_size, r=0, lora_alpha=1, lora_dropout=0., merge_weights=True, **kwargs):
        super(ConvLoRA, self).__init__()
        self.conv = conv_module(in_channels, out_channels, kernel_size, **kwargs)
        LoRALayer.__init__(self, r=r, lora_alpha=lora_alpha, lora_dropout=lora_dropout, merge_weights=merge_weights)
        assert isinstance(kernel_size, int)
        # Actual trainable parameters
        if r > 0:
            self.lora_A = nn.Parameter(
                self.conv.weight.new_zeros((r * kernel_size, in_channels * kernel_size))
            )
            self.lora_B = nn.Parameter(
                self.conv.weight.new_zeros((out_channels//self.conv.groups*kernel_size, r*kernel_size))
            )
            self.scaling = self.lora_alpha / self.r
            # Freezing the pre-trained weight matrix
            self.conv.weight.requires_grad = False
        self.reset_parameters()
        self.merged = False

    def reset_parameters(self):
        self.conv.reset_parameters()
        if hasattr(self, 'lora_A'):
            # initialize A the same way as the default for nn.Linear and B to zero
            nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
            nn.init.zeros_(self.lora_B)

    def train(self, mode=True):
        super(ConvLoRA, self).train(mode)
        if mode:
            if self.merge_weights and self.merged:
                if self.r > 0:
                    # Make sure that the weights are not merged
                    self.conv.weight.data -= (self.lora_B @ self.lora_A).view(self.conv.weight.shape) * self.scaling
                self.merged = False
        else:
            if self.merge_weights and not self.merged:
                if self.r > 0:
                    # Merge the weights and mark it
                    self.conv.weight.data += (self.lora_B @ self.lora_A).view(self.conv.weight.shape) * self.scaling
                self.merged = True

    def forward(self, x):
        if self.r > 0 and not self.merged:
            return self.conv._conv_forward(
                x,
                self.conv.weight + (self.lora_B @ self.lora_A).view(self.conv.weight.shape) * self.scaling,
                self.conv.bias
            )
        return self.conv(x)


class Conv2d(ConvLoRA):
    def __init__(self, *args, **kwargs):
        super(Conv2d, self).__init__(nn.Conv2d, *args, **kwargs)


class Conv1d(ConvLoRA):
    def __init__(self, *args, **kwargs):
        super(Conv1d, self).__init__(nn.Conv1d, *args, **kwargs)


# Can Extend to other ones like this
class Conv3d(ConvLoRA):
    def __init__(self, *args, **kwargs):
        super(Conv3d, self).__init__(nn.Conv3d, *args, **kwargs)

####################################################################################

def mark_only_lora_as_trainable(model: nn.Module, bias: str = 'none') -> None:
    for n, p in model.named_parameters():
        if 'lora_' not in n:
            p.requires_grad = False
    if bias == 'none':
        return
    elif bias == 'all':
        for n, p in model.named_parameters():
            if 'bias' in n:
                p.requires_grad = True
    elif bias == 'lora_only':
        for m in model.modules():
            if isinstance(m, LoRALayer) and \
                    hasattr(m, 'bias') and \
                    m.bias is not None:
                m.bias.requires_grad = True
    else:
        raise NotImplementedError


def lora_state_dict(model: nn.Module, bias: str = 'none') -> Dict[str, torch.Tensor]:
    my_state_dict = model.state_dict()
    if bias == 'none':
        return {k: my_state_dict[k] for k in my_state_dict if 'lora_' in k}
    elif bias == 'all':
        return {k: my_state_dict[k] for k in my_state_dict if 'lora_' in k or 'bias' in k}
    elif bias == 'lora_only':
        to_return = {}
        for k in my_state_dict:
            if 'lora_' in k:
                to_return[k] = my_state_dict[k]
                bias_name = k.split('lora_')[0]+'bias'
                if bias_name in my_state_dict:
                    to_return[bias_name] = my_state_dict[bias_name]
        return to_return
    else:
        raise NotImplementedError
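
The index bookkeeping in `MergedLinear` (`lora_ind` and `zero_pad`) is easier to follow on a tiny example. The NumPy sketch below mirrors that logic for a fused qkv projection in which only q and v receive an update. It is an illustration under assumed toy sizes, not part of loralib, and `lora_row_mask` is a hypothetical name.

```python
import numpy as np


def lora_row_mask(out_features: int, enable_lora: list) -> np.ndarray:
    """Boolean mask over output rows: True where a LoRA update is applied."""
    assert out_features % len(enable_lora) == 0
    mask = np.zeros((len(enable_lora), out_features // len(enable_lora)), dtype=bool)
    mask[np.array(enable_lora, dtype=bool), :] = True
    return mask.reshape(-1)


def zero_pad(delta: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Scatter the rows of `delta` into a full-height matrix, zeros elsewhere."""
    padded = np.zeros((mask.size, *delta.shape[1:]), dtype=delta.dtype)
    padded[mask] = delta
    return padded


# Toy fused qkv projection: d_model = 2, so out_features = 6; adapt q and v only.
mask = lora_row_mask(6, [True, False, True])
delta_qv = np.arange(1, 13, dtype=float).reshape(4, 3)  # 4 adapted rows, in_features = 3
delta_full = zero_pad(delta_qv, mask)
print(mask)
print(delta_full)
```

The rows that belong to k stay at zero, so adding `delta_full` to the fused weight leaves the k projection untouched.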
7. Tutorial (from the official repository)
- Install `loralib`.

      pip install loralib
      # Alternatively
      # pip install git+https://github.com/microsoft/LoRA
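- You can adapt selected layers by replacing them with their counterparts implemented in `loralib`. We currently support only `nn.Linear`, `nn.Embedding`, and `nn.Conv2d`. We also support `MergedLinear` for cases where a single `nn.Linear` represents more than one layer, as in some implementations of the attention qkv projection (see the additional notes for more).

      # ===== Before =====
      # layer = nn.Linear(in_features, out_features)

      # ===== After ======
      import loralib as lora
      # Add a pair of low-rank adaptation matrices with rank r=16
      layer = lora.Linear(in_features, out_features, r=16)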
- Before the training loop begins, mark only the LoRA parameters as trainable.

      import loralib as lora
      model = BigModel()
      # This sets requires_grad to False for all parameters without the string "lora_" in their names
      lora.mark_only_lora_as_trainable(model)
      # Training loop
      for batch in dataloader:
          ...
- When saving a checkpoint, generate a `state_dict` that contains only the LoRA parameters.

      # ===== Before =====
      # torch.save(model.state_dict(), checkpoint_path)
      # ===== After =====
      torch.save(lora.lora_state_dict(model), checkpoint_path)
- When loading a checkpoint with `load_state_dict`, be sure to set `strict=False`.

      # Load the pretrained checkpoint first
      model.load_state_dict(torch.load('ckpt_pretrained.pt'), strict=False)
      # Then load the LoRA checkpoint
      model.load_state_dict(torch.load('ckpt_lora.pt'), strict=False)
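
The reason for `strict=False` becomes visible once the keys of a LoRA-only checkpoint are listed: it holds only the `lora_` entries, so every pretrained weight is absent from it, and the pretrained checkpoint in turn has no `lora_` entries. The sketch below imitates the key filtering on plain strings. The parameter names are hypothetical, and the helper covers only the `'none'` and `'all'` bias modes.

```python
def lora_only_keys(state_keys, bias="none"):
    """Pick the checkpoint keys a LoRA-only checkpoint would keep.
    Simplified: handles only bias='none' and bias='all'."""
    if bias == "none":
        return [k for k in state_keys if "lora_" in k]
    if bias == "all":
        return [k for k in state_keys if "lora_" in k or "bias" in k]
    raise NotImplementedError(bias)


# Hypothetical parameter names for a model with one LoRA-adapted projection.
full_keys = [
    "attn.q_proj.weight", "attn.q_proj.bias",
    "attn.q_proj.lora_A", "attn.q_proj.lora_B",
    "mlp.fc.weight", "mlp.fc.bias",
]
saved = lora_only_keys(full_keys)
missing_on_load = [k for k in full_keys if k not in saved]
print("saved in LoRA checkpoint:", saved)
print("absent from it:", missing_on_load)
```

With strict loading, the absent keys would be reported as errors, which is why the two-step load above relaxes the check.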
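Additional notes
- Although the study focuses on a simple yet effective setup, namely adapting only the `q` and `v` projections in a `Transformer`, in our examples LoRA can be applied to any subset of the pretrained weights. We encourage you to explore different configurations, such as adapting the embedding layer by replacing `nn.Embedding` with `lora.Embedding`, or adapting the `MLP` layers. The best configuration very likely differs across model architectures and tasks.
- Some `Transformer` implementations use a single `nn.Linear` for the `query`, `key`, and `value` projection matrices. If you want to constrain the rank of the update to each individual matrix, you have to either break the layer up into three separate matrices or use `lora.MergedLinear`. If you break up the layer, make sure to modify the checkpoint accordingly.

      # ===== Before =====
      # qkv_proj = nn.Linear(d_model, 3*d_model)
      # ===== After =====
      # Break it up (remember to modify the pretrained checkpoint accordingly)
      q_proj = lora.Linear(d_model, d_model, r=8)
      k_proj = nn.Linear(d_model, d_model)
      v_proj = lora.Linear(d_model, d_model, r=8)
      # Alternatively, use lora.MergedLinear (recommended)
      qkv_proj = lora.MergedLinear(d_model, 3*d_model, r=8, enable_lora=[True, False, True])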
- Training bias vectors together with LoRA may be a cost-efficient way to squeeze out extra task performance (if you tune the learning rate carefully). Although the paper does not study the effect thoroughly, it is easy to try in `lora`. When calling `mark_only_lora_as_trainable`, you can mark some biases as trainable by passing **"all"** or **"lora_only"** to `bias=`. When saving a checkpoint, remember to pass the corresponding `bias=` argument to `lora_state_dict`.

      # ===== Before =====
      # lora.mark_only_lora_as_trainable(model) # Not training any bias vectors
      # ===== After =====
      # Training all bias vectors associated with modules we apply LoRA to
      lora.mark_only_lora_as_trainable(model, bias='lora_only')
      # Alternatively, we can train *all* bias vectors in the model, including LayerNorm biases
      lora.mark_only_lora_as_trainable(model, bias='all')
      # When saving a checkpoint, use the same bias= ('all' or 'lora_only')
      torch.save(lora.lora_state_dict(model, bias='all'), checkpoint_path)
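- Calling `model.eval()` triggers the merging of the LoRA parameters into the corresponding pretrained parameters, which removes the extra latency from subsequent forward passes. Calling `model.train()` again undoes the merge. This behavior can be disabled by passing `merge_weights=False` to the LoRA layers.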
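
The merge and unmerge behavior described in the last note can be imitated with a small NumPy class. This is a simplified stand-in written for illustration, not the loralib layer: `ToyLoRALinear` and its toy sizes are assumptions, and only the weight bookkeeping is modeled.

```python
import numpy as np


class ToyLoRALinear:
    """Minimal stand-in for the merge logic; not the loralib class."""

    def __init__(self, W0, A, B, alpha, merge_weights=True):
        self.weight = W0.copy()
        self.A, self.B = A, B
        self.scaling = alpha / A.shape[0]
        self.merge_weights = merge_weights
        self.merged = False

    def train(self, mode=True):
        delta = self.scaling * (self.B @ self.A)
        if mode and self.merge_weights and self.merged:
            self.weight -= delta      # undo the merge before training resumes
            self.merged = False
        elif not mode and self.merge_weights and not self.merged:
            self.weight += delta      # fold the update into the weight
            self.merged = True
        return self

    def eval(self):
        return self.train(False)

    def forward(self, x):
        h = self.weight @ x
        if not self.merged:
            h = h + self.scaling * (self.B @ (self.A @ x))
        return h


rng = np.random.default_rng(0)
W0 = rng.normal(size=(4, 3))
layer = ToyLoRALinear(W0, rng.normal(size=(2, 3)), rng.normal(size=(4, 2)), alpha=2)
x = rng.normal(size=3)
h_train = layer.forward(x)                 # two paths: frozen weight plus low-rank branch
h_eval = layer.eval().forward(x)           # one matmul after merging, same output
restored = np.allclose(layer.train().weight, W0)
print(np.allclose(h_train, h_eval), restored)
```

After `eval()` the forward pass is a single matrix product, and switching back to training restores the original frozen weight up to floating-point error.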