LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS (Personal Notes)

“As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes less feasible.”

“1 INTRODUCTION”

“This way, we only need to store and load a small number of task-specific parameters in addition to the pre-trained model for each task, greatly boosting the operational efficiency when deployed.”

“We hypothesize that the change in weights during model adaptation also has a low “intrinsic rank”, leading to our proposed Low-Rank Adaptation (LoRA) approach.”

“LoRA allows us to train some dense layers in a neural network indirectly by optimizing rank decomposition matrices of the dense layers’ change during adaptation instead, while keeping the pre-trained weights frozen, as shown in Figure 1.”
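A minimal PyTorch sketch of this mechanism (my own illustration, not the paper's reference implementation; in practice `weight` would hold loaded pre-trained values rather than random ones):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained weight W0 plus a trainable low-rank update BA."""
    def __init__(self, d_in, d_out, r=8, alpha=16):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in), requires_grad=False)  # frozen W0
        self.lora_A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # A: random Gaussian init
        self.lora_B = nn.Parameter(torch.zeros(d_out, r))        # B: zeros, so BA = 0 at start
        self.scaling = alpha / r

    def forward(self, x):
        # h = W0 x + (alpha / r) * B A x; only lora_A and lora_B receive gradients
        return x @ self.weight.T + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
```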

“LoRA possesses several key advantages.”

  • “A pre-trained model can be shared and used to build many small LoRA modules for different tasks.”

  • “LoRA makes training more efficient and lowers the hardware barrier”

  • “Our simple linear design allows us to merge the trainable matrices with the frozen weights when deployed”

  • “LoRA is orthogonal to many prior methods and can be combined with many of them, such as prefix-tuning.”

“2 PROBLEM STATEMENT”

The paper uses language modeling as its running example.

“Each downstream task is represented by a training dataset of context-target pairs”

“During full fine-tuning, the model is initialized to pre-trained weights Φ0 and updated to Φ0 + ∆Φ by repeatedly following the gradient to maximize the conditional language modeling objective:”
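The objective it refers to (Eq. 1 in the paper, reconstructed here):

$$\max_{\Phi}\; \sum_{(x,y)\in\mathcal{Z}}\; \sum_{t=1}^{|y|} \log\big(P_{\Phi}(y_t \mid x, y_{<t})\big)$$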

“In this paper, we adopt a more parameter-efficient approach, where the task-specific parameter increment ∆Φ = ∆Φ(Θ) is further encoded by a much smaller-sized set of parameters Θ with |Θ| << |Φ0|. The task of finding ∆Φ thus becomes optimizing over Θ:”
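The corresponding objective over the much smaller Θ (Eq. 2 in the paper, reconstructed here):

$$\max_{\Theta}\; \sum_{(x,y)\in\mathcal{Z}}\; \sum_{t=1}^{|y|} \log\big(P_{\Phi_0 + \Delta\Phi(\Theta)}(y_t \mid x, y_{<t})\big)$$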

“When the pre-trained model is GPT-3 175B, the number of trainable parameters |Θ| can be as small as 0.01% of |Φ0|.”

“3 AREN’T EXISTING SOLUTIONS GOOD ENOUGH?”

“Using language modeling as an example, there are two prominent strategies when it comes to efficient adaptations”

“Adapter Layers Introduce Inference Latency”

“large neural networks rely on hardware parallelism to keep the latency low, and adapter layers have to be processed sequentially.”

“Directly Optimizing the Prompt is Hard”  

“prefix tuning is difficult to optimize and that its performance changes non-monotonically in trainable parameters”

“reserving a part of the sequence length for adaptation necessarily reduces the sequence length available to process a downstream task”

“4 OUR METHOD”

“4.1 LOW-RANK-PARAMETRIZED UPDATE MATRICES”

“we hypothesize the updates to the weights also have a low “intrinsic rank” during adaptation”

“we constrain its update by representing the latter with a low-rank decomposition”
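Concretely, for a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, the update is constrained to a product of two small matrices (Sec. 4.1 of the paper):

$$h = W_0 x + \Delta W x = W_0 x + BAx, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)$$

A is initialized with a random Gaussian and B with zeros, so $\Delta W = BA$ is zero at the start of training; $\Delta W x$ is then scaled by $\alpha / r$.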

“A Generalization of Full Fine-tuning”

“LoRA takes a step further and does not require the accumulated gradient update to weight matrices to have full-rank during adaptation.”

“No Additional Inference Latency.”

“When we need to switch to another downstream task, we can recover W0 by subtracting BA and then adding a different B′A′, a quick operation with very little memory overhead.”
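In code, merging and task switching are just in-place updates on the merged weight; a sketch assuming the `LoRALinear` fields defined above:

```python
import torch

@torch.no_grad()
def merge(layer):
    """Fold BA into the weight so inference has no extra latency."""
    layer.weight += (layer.lora_B @ layer.lora_A) * layer.scaling  # W = W0 + BA

@torch.no_grad()
def switch_task(layer, new_A, new_B):
    """Recover W0 by subtracting BA, then merge the new task's B'A'."""
    layer.weight -= (layer.lora_B @ layer.lora_A) * layer.scaling
    layer.lora_A, layer.lora_B = new_A, new_B  # nn.Parameter factors for the new task
    layer.weight += (layer.lora_B @ layer.lora_A) * layer.scaling
```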

“4.2 APPLYING LORA TO TRANSFORMER”

“We limit our study to only adapting the attention weights for downstream tasks and freeze the MLP modules (so they are not trained in downstream tasks) both for simplicity and parameter-efficiency.”
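In practice this amounts to marking only the LoRA factors as trainable, e.g. (a sketch assuming LoRA parameters carry a `lora_` name prefix, as in the `LoRALinear` above):

```python
# Freeze everything, including the MLP modules; train only the LoRA factors.
for name, param in model.named_parameters():
    param.requires_grad = "lora_" in name
```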

“Practical Benefits and Limitations.”

“The most significant benefit comes from the reduction in memory and storage usage.”

“we can switch between tasks while deployed at a much lower cost by only swapping the LoRA weights as opposed to all the parameters.”

“it is not straightforward to batch inputs to different tasks with different A and B in a single forward pass, if one chooses to absorb A and B into W to eliminate additional inference latency.”
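Keeping A and B unmerged restores that flexibility at the cost of computing BAx explicitly; an illustrative sketch (all shapes, names, and task assignments here are made up):

```python
# Serve a mixed batch where each example uses a different task's (A, B)
# pair by keeping the factors unmerged and indexing them per example.
import torch

d, k, r, n_tasks, batch = 64, 64, 4, 3, 5
W0 = torch.randn(d, k)                 # shared frozen weight
A = torch.randn(n_tasks, r, k) * 0.01  # per-task factors
B = torch.zeros(n_tasks, d, r)
x = torch.randn(batch, k)
task = torch.tensor([0, 2, 1, 0, 2])   # task id of each example

h = x @ W0.T + torch.einsum("bdr,brk,bk->bd", B[task], A[task], x)
```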

“5 EMPIRICAL EXPERIMENTS”

“6 RELATED WORKS”

  • “Transformer Language Models.”

    “Transformer-based language models have dominated NLP, achieving the state-of-the-art in many tasks.”

  • “Prompt Engineering and Fine-Tuning.”

  • “Parameter-Efficient Adaptation.”

  • “Low-Rank Structures in Deep Learning.”

“7 UNDERSTANDING THE LOW-RANK UPDATES”

“Note that the low-rank structure not only lowers the hardware barrier to entry which allows us to run multiple experiments in parallel, but also gives better interpretability of how the update weights are correlated with the pre-trained weights”

“7.1 WHICH WEIGHT MATRICES IN TRANSFORMER SHOULD WE APPLY LORA TO?”

“We set a parameter budget of 18M (roughly 35MB if stored in FP16) on GPT-3 175B, which corresponds to r = 8 if we adapt one type of attention weights or r = 4 if we adapt two types, for all 96 layers.”
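A quick sanity check of that budget (my own arithmetic; GPT-3 175B has d_model = 12288 and 96 layers):

```python
# Parameter count for LoRA: r * (d_in + d_out) per adapted matrix,
# times matrices per layer, times the number of layers.
d_model, n_layers = 12288, 96

one_type = 8 * (d_model + d_model) * 1 * n_layers   # r=8, one attention weight type
two_types = 4 * (d_model + d_model) * 2 * n_layers  # r=4, two weight types
print(one_type, two_types)               # both 18,874,368, i.e. ~18M parameters
print(one_type * 2 / 1e6, "MB in FP16")  # ~37.7 MB, on the order of the quoted 35MB
```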

“7.2 WHAT IS THE OPTIMAL RANK r FOR LORA?”

“We argue that increasing r does not cover a more meaningful subspace, which suggests that a low-rank adaptation matrix is sufficient.”
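The paper supports this by comparing the top singular directions learned at different ranks; a sketch of such a comparison (the normalization follows the paper's Grassmann-style similarity as I recall it, so treat the details as assumptions):

```python
# Overlap between the top-i singular directions of adapter A1 and the
# top-j directions of A2; values near 1 mean the subspaces coincide.
import torch

def subspace_similarity(A1, A2, i, j):
    V1 = torch.linalg.svd(A1, full_matrices=False).Vh[:i]  # top-i right-singular dirs
    V2 = torch.linalg.svd(A2, full_matrices=False).Vh[:j]
    return (V1 @ V2.T).norm() ** 2 / min(i, j)             # in [0, 1]
```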

“8 CONCLUSION AND FUTURE WORK”

“We propose LoRA, an efficient adaptation strategy that neither introduces inference latency nor reduces input sequence length while retaining high model quality.”

“Importantly, it allows for quick task-switching when deployed as a service by sharing the vast majority of the model parameters.”

“There are many directions for future works.”

  • “LoRA can be combined with other efficient adaptation methods, potentially providing orthogonal improvement.”

  • “The mechanism behind fine-tuning or LoRA is far from clear”

  • “We mostly depend on heuristics to select the weight matrices to apply LoRA to.”
