The origin of LoRA (Low-Rank Adaptation):
Full-parameter fine-tuning needs more machines than most of us have, Adapter Tuning introduces extra training and inference latency, and Prefix Tuning reduces the sequence length left for the actual training text. Is there a fine-tuning method that fixes these shortcomings?
That is how LoRA, the low-rank adapter, was born.
The idea behind LoRA:
- Full fine-tuning = existing knowledge + newly learned knowledge (redundant)
- LoRA = existing knowledge + newly learned knowledge (not redundant)
Why can LoRA cut the number of trainable parameters so dramatically?
First, the Q, K and V projections in a Transformer are essentially linear layers.
Each projection weight matrix has shape d_model x d_k; in the original Transformer d_model, the word-embedding dimension, is 512, and if we assume d_k is also 512, then fully fine-tuning one such matrix means updating 512 * 512 = 262,144 parameters.
LoRA's idea: instead of learning that full (512, 512) update, learn two small matrices of shapes (512, 2) and (2, 512) and use their product as the update, which is only 512 * 2 + 2 * 512 = 2,048 trainable parameters.
The 2 here is the rank r, which is generally much smaller than d_k, so per weight matrix the trainable-parameter count drops by 262,144 / 2,048 = 128x, and the ratio only grows as d_model and d_k get larger.
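A quick back-of-the-envelope check of those numbers (a minimal sketch; the variable names are just illustrative):

    d_model, d_k, r = 512, 512, 2                    # r is the LoRA rank

    full_update_params = d_model * d_k               # fine-tuning W directly: 262,144
    lora_update_params = d_model * r + r * d_k       # B (512x2) plus A (2x512): 2,048

    print(full_update_params // lora_update_params)  # 128x fewer trainable parameters per matrix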
LoRA is also simple to implement: it only modifies the linear layers inside the Transformer. The code below follows the DeepSpeed-Chat source.
import math

import torch
import torch.nn as nn
import torch.nn.functional as F
# recursive_getattr / recursive_setattr are small DeepSpeed helpers
# (deepspeed.compression.helper) used to look up and replace submodules by name
from deepspeed.compression.helper import recursive_getattr, recursive_setattr

def convert_linear_layer_to_lora(model,
                                 part_module_name,
                                 lora_dim=0,
                                 lora_scaling=1,
                                 lora_droppout=0):
    replace_name = []
    # find all linear layers whose name contains part_module_name
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear) and part_module_name in name:
            replace_name.append(name)
    # replace every matched linear layer with a LoRA layer
    for name in replace_name:
        module = recursive_getattr(model, name)
        # LinearLayer_LoRA wraps an ordinary linear layer's weight with a LoRA branch
        tmp = LinearLayer_LoRA(module.weight, lora_dim, lora_scaling,
                               lora_droppout, module.bias).to(
                                   module.weight.device).to(module.weight.dtype)
        recursive_setattr(model, name, tmp)
    return model
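Before looking at LinearLayer_LoRA itself, here is roughly how this converter is used (the name filter "decoder.layers." and rank 8 are illustrative choices; DeepSpeed-Chat pairs the conversion with an only_optimize_lora_parameters helper that does the freezing shown below):

    model = convert_linear_layer_to_lora(model, part_module_name="decoder.layers.",
                                         lora_dim=8)

    # keep only the LoRA matrices trainable; everything else stays frozen
    for name, param in model.named_parameters():
        if "lora_right_weight" not in name and "lora_left_weight" not in name:
            param.requires_grad = False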
The implementation of LinearLayer_LoRA:
class LinearLayer_LoRA(nn.Module):
    def __init__(self, weight, lora_dim=0, lora_scaling=1, lora_droppout=0, bias=None):
        super(LinearLayer_LoRA, self).__init__()
        self.weight = weight
        self.bias = bias
        if lora_dim <= 0:
            raise ValueError(
                "You are training to use LoRA, whose reduced dim should be larger than 1"
            )
        try:
            # under ZeRO stage 3 the weight is partitioned across devices,
            # so the full shape has to be read from weight.ds_shape
            # (see my earlier blog post for an explanation of the three ZeRO stages)
            rows, columns = weight.ds_shape
        except:
            rows, columns = weight.shape
        # right-hand LoRA weight; lora_dim is the rank r we choose
        # (stored transposed so that forward does not need to transpose again)
        self.lora_right_weight = nn.Parameter(torch.zeros(columns, lora_dim))
        # left-hand LoRA weight
        self.lora_left_weight = nn.Parameter(torch.zeros(lora_dim, rows))
        self.lora_scaling = lora_scaling / lora_dim
        # dropout
        if lora_droppout > 0:
            self.lora_dropout = nn.Dropout(lora_droppout)
        else:
            self.lora_dropout = nn.Identity()
        self.reset_parameters()
        # disable the original weight gradient
        self.weight.requires_grad = False
        # fuse LoRA into the original weight: off by default
        self.fuse_lora = False

    def reset_parameters(self):
        # the right weight gets a random init, the left weight starts at zero,
        # so the LoRA branch is an exact zero update before training
        nn.init.kaiming_uniform_(self.lora_right_weight, a=math.sqrt(5))
        nn.init.zeros_(self.lora_left_weight)

    # @ is matrix multiplication
    def forward(self, input):
        if self.fuse_lora:  # False by default
            return F.linear(input, self.weight, self.bias)
        else:
            return F.linear(
                input, self.weight,
                self.bias) + (self.lora_dropout(input) @ self.lora_right_weight
                              @ self.lora_left_weight) * self.lora_scaling
You can see from this line of code

    return F.linear(
        input, self.weight,
        self.bias) + (self.lora_dropout(input) @ self.lora_right_weight
                      @ self.lora_left_weight) * self.lora_scaling

that the output is the original weight's output plus the LoRA branch's output.
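Because reset_parameters initializes lora_left_weight to zero, the LoRA branch contributes exactly nothing right after conversion, so a freshly wrapped layer reproduces the original one. A minimal sanity-check sketch (the 512-dim shapes and rank 2 are just illustrative values):

    torch.manual_seed(0)
    base = nn.Linear(512, 512)
    lora = LinearLayer_LoRA(base.weight, lora_dim=2, lora_scaling=1,
                            lora_droppout=0, bias=base.bias)

    x = torch.randn(4, 512)
    # before any training step, B @ A is all zeros, so both outputs match exactly
    print(torch.allclose(base(x), lora(x)))  # True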
Why can the outputs of different layers (neurons) of a neural network simply be added together?
Because a model can essentially be viewed as a directed graph, where the weight on an edge describes how much one node contributes to the next, and the contributions arriving at a node are accumulated by summation.
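To make that intuition concrete for the linear layers LoRA actually touches: matrix multiplication distributes over addition, so summing the two branch outputs is the same as first folding the low-rank update into the weight. A small sketch with arbitrary shapes:

    W = torch.randn(512, 512)       # frozen weight, shape (out_features, in_features)
    right = torch.randn(512, 2)     # lora_right_weight, shape (in_features, r)
    left = torch.randn(2, 512)      # lora_left_weight,  shape (r, out_features)
    x = torch.randn(4, 512)

    two_branches = x @ W.T + x @ right @ left    # original output + LoRA output
    fused = x @ (W + (right @ left).T).T         # the same result with the update merged into W

    print(torch.allclose(two_branches, fused, atol=1e-4))  # True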