The origin of LoRA (Low-Rank Adaptation):
Full-parameter fine-tuning needs more machines than most of us have, Adapter Tuning introduces extra training and inference latency, and Prefix Tuning reduces the sequence length left for the actual training text. Is there a fine-tuning method that fixes these shortcomings?
That is how LoRA, the low-rank adapter, was born.
The idea behind LoRA:
- Full fine-tuning = existing knowledge + newly learned knowledge (redundant)
- LoRA = existing knowledge + newly learned knowledge (not redundant)
Why can LoRA cut the number of trainable parameters so dramatically?
First, the Q, K and V projections in a Transformer are essentially linear layers.
Each projection weight matrix has shape d_model x d_k; in the original Transformer d_model, the word-embedding dimension, is 512, and if we assume d_k is also 512, then fully fine-tuning one such matrix means updating 512 * 512 = 262,144 parameters.
LoRA's idea: instead of learning that full (512, 512) update, learn two small matrices of shapes (512, 2) and (2, 512) and use their product as the update, which is only 512 * 2 + 2 * 512 = 2,048 trainable parameters.
The 2 here is the rank r, which is generally much smaller than d_k, so per weight matrix the trainable-parameter count drops by 262,144 / 2,048 = 128x, and the ratio only grows as d_model and d_k get larger.
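A quick back-of-the-envelope check of those numbers (a minimal sketch; the variable names are just illustrative):

    d_model, d_k, r = 512, 512, 2                    # r is the LoRA rank

    full_update_params = d_model * d_k               # fine-tuning W directly: 262,144
    lora_update_params = d_model * r + r * d_k       # B (512x2) plus A (2x512): 2,048

    print(full_update_params // lora_update_params)  # 128x fewer trainable parameters per matrix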
LoRA is also simple to implement: it only modifies the linear layers inside the Transformer. The code below follows the DeepSpeed-Chat source.
import math

import torch
import torch.nn as nn
import torch.nn.functional as F
# recursive_getattr / recursive_setattr are small DeepSpeed helpers
# (deepspeed.compression.helper) used to look up and replace submodules by name
from deepspeed.compression.helper import recursive_getattr, recursive_setattr

def convert_linear_layer_to_lora(model,
                                 part_module_name,
                                 lora_dim=0,
                                 lora_scaling=1,
                                 lora_droppout=0):
    replace_name = []
    # find all linear layers whose name contains part_module_name
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear) and part_module_name in name:
            replace_name.append(name)
    # replace every matched linear layer with a LoRA layer
    for name in replace_name:
        module = recursive_getattr(model, name)
        # LinearLayer_LoRA wraps an ordinary linear layer's weight with a LoRA branch
        tmp = LinearLayer_LoRA(module.weight, lora_dim, lora_scaling,
                               lora_droppout, module.bias).to(
                                   module.weight.device).to(module.weight.dtype)
        recursive_setattr(model, name, tmp)
    return model
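Before looking at LinearLayer_LoRA itself, here is roughly how this converter is used (the name filter "decoder.layers." and rank 8 are illustrative choices; DeepSpeed-Chat pairs the conversion with an only_optimize_lora_parameters helper that does the freezing shown below):

    model = convert_linear_layer_to_lora(model, part_module_name="decoder.layers.",
                                         lora_dim=8)

    # keep only the LoRA matrices trainable; everything else stays frozen
    for name, param in model.named_parameters():
        if "lora_right_weight" not in name and "lora_left_weight" not in name:
            param.requires_grad = False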
The implementation of LinearLayer_LoRA:
class LinearLayer_LoRA(nn.Module):
    def __init__(self, weight, lora_dim=0, lora_scaling=1, lora_droppout=0, bias=None):
        super(LinearLayer_LoRA, self).__init__()
        self.weight = weight
        self.bias = bias
        if lora_dim <= 0:
            raise ValueError(
                "You are training to use LoRA, whose reduced dim should be larger than 1"
            )
        try:
            # under ZeRO stage 3 the weight is partitioned across devices,
            # so the full shape has to be read from weight.ds_shape
            # (see my earlier blog post for an explanation of the three ZeRO stages)
            rows, columns = weight.ds_shape
        except:
            rows, columns = weight.shape
        # right-hand LoRA weight; lora_dim is the rank r we choose
        # (stored transposed so that forward does not need to transpose again)
        self.lora_right_weight = nn.Parameter(torch.zeros(columns, lora_dim))
        # left-hand LoRA weight
        self.lora_left_weight = nn.Parameter(torch.zeros(lora_dim, rows))
        self.lora_scaling = lora_scaling / lora_dim
        # dropout
        if lora_droppout > 0:
            self.lora_dropout = nn.Dropout(lora_droppout)
        else:
            self.lora_dropout = nn.Identity()
        self.reset_parameters()
        # disable the original weight gradient
        self.weight.requires_grad = False
        # fuse LoRA into the original weight: off by default
        self.fuse_lora = False

    def reset_parameters(self):
        # the right weight gets a random init, the left weight starts at zero,
        # so the LoRA branch is an exact zero update before training
        nn.init.kaiming_uniform_(self.lora_right_weight, a=math.sqrt(5))
        nn.init.zeros_(self.lora_left_weight)

    # @ is matrix multiplication
    def forward(self, input):
        if self.fuse_lora:  # False by default
            return F.linear(input, self.weight, self.bias)
        else:
            return F.linear(
                input, self.weight,
                self.bias) + (self.lora_dropout(input) @ self.lora_right_weight
                              @ self.lora_left_weight) * self.lora_scaling
You can see from this line of code

    return F.linear(
        input, self.weight,
        self.bias) + (self.lora_dropout(input) @ self.lora_right_weight
                      @ self.lora_left_weight) * self.lora_scaling

that the output is the original weight's output plus the LoRA branch's output.
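Because reset_parameters initializes lora_left_weight to zero, the LoRA branch contributes exactly nothing right after conversion, so a freshly wrapped layer reproduces the original one. A minimal sanity-check sketch (the 512-dim shapes and rank 2 are just illustrative values):

    torch.manual_seed(0)
    base = nn.Linear(512, 512)
    lora = LinearLayer_LoRA(base.weight, lora_dim=2, lora_scaling=1,
                            lora_droppout=0, bias=base.bias)

    x = torch.randn(4, 512)
    # before any training step, B @ A is all zeros, so both outputs match exactly
    print(torch.allclose(base(x), lora(x)))  # True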
Why can the outputs of different layers (neurons) of a neural network simply be added together?
Because a model can essentially be viewed as a directed graph, where the weight on an edge describes how much one node contributes to the next, and the contributions arriving at a node are accumulated by summation.
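To make that intuition concrete for the linear layers LoRA actually touches: matrix multiplication distributes over addition, so summing the two branch outputs is the same as first folding the low-rank update into the weight. A small sketch with arbitrary shapes:

    W = torch.randn(512, 512)       # frozen weight, shape (out_features, in_features)
    right = torch.randn(512, 2)     # lora_right_weight, shape (in_features, r)
    left = torch.randn(2, 512)      # lora_left_weight,  shape (r, out_features)
    x = torch.randn(4, 512)

    two_branches = x @ W.T + x @ right @ left    # original output + LoRA output
    fused = x @ (W + (right @ left).T).T         # the same result with the update merged into W

    print(torch.allclose(two_branches, fused, atol=1e-4))  # True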