[Paper Reading Notes] O-LoRA: Orthogonal Subspace Learning for Language Model Continual Learning

Paper Information

Title

Orthogonal Subspace Learning for Language Model Continual Learning

Venue

EMNLP 2023

Authors / Affiliation

Fudan University

Keywords

Continual Learning, LLMs, Orthogonal Subspace

Paper Structure

O-LoRA
Introduction
Background
Continual Learning Setup
LoRA
Orthogonal Low-rank Adaptation
Instruction Schema
Continual Learning in Orthogonal Subspaces
Comparisons Between O-LoRA and Other Methods
Experiments
Experimental Setup
Datasets
Metrics
Baselines
Implementation Details
Main Results
Discussions
Related Work
Conclusion

Introduction

Research Motivation

The performance of LLMs degrades in scenarios where multiple tasks are encountered sequentially (catastrophic forgetting, CF).

Task Background

  • The first paragraph introduces CL, LLMs, and CF.
  • The second paragraph introduces the three classic families of methods: rehearsal, regularization, and architecture-based, and points out that all three perform poorly on unseen tasks, which is exactly where this paper's LLM-based approach has an advantage.
  • The third paragraph introduces traditional orthogonal-gradient methods for tackling CF, which are impractical for large models because they require storing historical data or gradients of historical data.

Proposed Method

  • O-LoRA is a simple and efficient approach with orthogonal low-rank adaptation for continual learning in language models to alleviate catastrophic forgetting.
  • Theoretical basis: LoRA parameters can characterize the gradient subspace, so each LoRA block corresponds to the gradient subspace of one task. Constraining the LoRA blocks to be mutually orthogonal amounts to optimizing in mutually orthogonal gradient subspaces, so that different tasks do not interfere with each other.
    (figure omitted) In the figure, the blue slanted line is the original gradient update direction, the orange line is the orthogonal update direction, and the dashed blue line with its light-blue plane is the orthogonal subspace of past tasks. Decomposing the original direction into the component orthogonal to the past-task subspace means learning a new task does not disturb past tasks, which mitigates CF.

Technical Background

Continual Learning Setup

In this study, we tackle a more challenging setting. During the training phase, the model is prohibited from accessing any historical data. In the testing phase, the model predicts a sample’s label without knowing which task it belongs to.
This setup adds restrictions on top of the Progressive Prompt setting.

LoRA

Not elaborated here.

Proposed Method

(figure omitted)

Instruction Schema

Unlike conventional training, because of how large models are fine-tuned, the input text no longer needs to be encoded and the labels mapped to integers for a separate classification head. Instead, training is done directly on text via instruction tuning: the model is given the task definition, the candidate options, and the input text, and is asked to generate the answer, which is then used to measure accuracy. This leverages the strong few-shot / in-context learning ability of large models and therefore generalizes better to unseen tasks (an illustrative prompt sketch follows below).
Question: can T5 also be instruction-tuned directly on text like this?
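As a rough illustration only (the exact template used in the paper is not reproduced in these notes, so the field names below are my own), an instruction-style input for a classification task could be assembled like this:

```python
def build_prompt(task_definition: str, options: list, text: str) -> str:
    """Assemble an instruction-style input: the model is told the task definition,
    the candidate labels, and the input text, and generates the answer as plain text."""
    return (
        f"Task definition: {task_definition}\n"
        f"Options: {', '.join(options)}\n"
        f"Input: {text}\n"
        f"Answer:"
    )

# Example: a sentiment-classification instance phrased as an instruction.
prompt = build_prompt(
    task_definition="Classify the sentiment of the given review as positive or negative.",
    options=["positive", "negative"],
    text="The movie was a delightful surprise from start to finish.",
)
print(prompt)
```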

CL in Orthogonal Subspaces

Traditional orthogonal gradient descent and its limitations

  • Orthogonal Gradient Descent (OGD) constrains the parameters to move within the orthogonal space to the gradients of previous tasks.
  • With limited access to previous task data, OGD approximates the current gradient of previous data with the gradient at the previously converged parameters.
  • However, OGD needs to store gradients of all previous data. This can be especially intractable for large-scale language models with billions of parameters, which have become a standard in the NLP field.
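To make this concrete, here is a minimal sketch of an OGD-style projection (my own illustration under simplifying assumptions, not code from the cited papers): the current gradient is projected onto the orthogonal complement of stored previous-task gradient directions, which are assumed to already form an orthonormal basis of flat vectors.

```python
import torch

def ogd_project(grad: torch.Tensor, prev_basis: list) -> torch.Tensor:
    """Project `grad` onto the orthogonal complement of the subspace spanned by
    `prev_basis` (flat, orthonormal gradient directions saved from past tasks)."""
    g = grad.flatten().clone()
    for v in prev_basis:
        g = g - torch.dot(g, v) * v  # remove the component along each stored direction
    return g.view_as(grad)
```

Having to save such a basis of past-task gradients for every parameter is exactly what becomes intractable at LLM scale, which motivates the low-rank surrogate introduced next.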

The improvement proposed in this paper

  • Motivation: large pre-trained models primarily fine-tune within a specific low-rank subspace. This characteristic behavior suggests that the LoRA parameters are not mere numerical adjustments but encapsulate crucial model update directions. In other words, a LoRA block carries not just a numerical correction but also directional information about the model update. Worth pondering: does a LoRA block really capture the update direction, how is that manifested, and how else could it be exploited? Besides the orthogonality relation used here (to keep tasks from interfering), are there other relations between multiple LoRA blocks that could serve new goals?

  • For each task, we introduce a set of LoRA parameters denoted as $\{A_t, B_t\}$, where $A \in \mathbb{R}^{d \times r}$, $B \in \mathbb{R}^{r \times k}$, and the rank $r \ll \min(d, k)$. We approximate the parameter update subspace $\mathcal{U}_t$ by the column vectors of $A_t$:
    $$A_t = \left[a_t^1, a_t^2, \ldots, a_t^r\right], \qquad \mathcal{U}_t = \operatorname{span}\left\{a_t^1, a_t^2, \ldots, a_t^r\right\}$$
    Let $B_t = \left[b_t^1, b_t^2, \ldots, b_t^r\right]$, where $b_t^i \in B_t$ represents the linear weighting coefficients of the column vectors in $A_t$.

  • To ensure orthogonality between the subspace $\mathcal{U}$ and the subspace $\mathcal{W}$, we need to satisfy $\langle u, w \rangle = 0,\ \forall u \in \mathcal{U}, w \in \mathcal{W}$. Therefore, achieving orthogonality between the LoRA subspaces of task $i$ ($\mathcal{U}_i$) and task $t$ ($\mathcal{U}_t$) can be expressed as $O_{i,t} = A_i^{T} A_t = 0$.
    Finally, the training objective is defined as (see the code sketch after this list):
    $$\sum_{x, y \in \mathcal{D}_t} \log p_{\Theta}(y \mid x) + \lambda_1 \sum_{i=1}^{t-1} L_{\text{orth}}\left(A_i, A_t\right), \qquad L_{\text{orth}}\left(A_i, A_t\right) = \sum_{j, k}\left\| O_{i,t}[j, k] \right\|^2$$
    where $O_{i,t}[j, k]$ denotes the element at the $j$-th row and $k$-th column of $O_{i,t}$, and $\lambda_1$ is the weight of the orthogonality loss.

  • During the training process, to mitigate forgetting of past knowledge, we fix the previous LoRA parameters $\{A_i, B_i \mid i < t\}$. Following Hu et al. (2021), we only apply LoRA to the attention weights of queries ($W_q$) and values ($W_v$). It is worth checking the cited paper for why LoRA is applied only to Q and V and not to K.

  • While the number of LoRA parameters grows with the number of tasks during training, we can merge the updates corresponding to the LoRA parameters into the initial parameters to avoid GPU memory inflation: $W_{\text{init}} := W_{\text{init}} + \sum_{i=1}^{t} A_i B_i$
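A minimal PyTorch sketch of the two operations above as I read them (my own illustration, not the authors' released code; shapes follow the notation in these notes, and the model, data handling, and training loop are omitted):

```python
import torch

def orthogonality_loss(prev_As: list, A_t: torch.Tensor) -> torch.Tensor:
    """L_orth: sum over previous tasks i of the squared entries of O_{i,t} = A_i^T A_t.
    Each A has shape (d, r); the previous A_i are frozen, A_t is trainable for task t."""
    loss = A_t.new_zeros(())
    for A_i in prev_As:
        O = A_i.detach().T @ A_t      # (r, r) overlap between the two column spaces
        loss = loss + (O ** 2).sum()
    return loss

@torch.no_grad()
def merge_lora(W_init: torch.Tensor, loras: list) -> torch.Tensor:
    """Fold finished LoRA updates into the base weight: W := W_init + sum_i A_i B_i,
    with A_i of shape (d, r) and B_i of shape (r, k)."""
    W = W_init.clone()
    for A_i, B_i in loras:
        W += A_i @ B_i
    return W
```

During training the penalty would simply be added to the task loss with weight $\lambda_1$, e.g. `loss = task_loss + lambda1 * orthogonality_loss(prev_As, A_t)`.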

Comparisons Between O-LoRA and Other Methods

  • Data privacy-friendliness (without storing historical data)
  • Model parameter-friendliness (from LoRA)
  • Generalization-friendliness (from LLMs + instruction tuning)

Experiments

Experimental Setup

Datasets

  • Standard CL Benchmark (same as Progressive Prompt)
  • Large number of tasks (same as Progressive Prompt)
  • Generalization to unseen tasks
    • To assess the impact of our approach on LLMs’ generalization ability, we initially train an LLM on the Alpaca dataset, an open-source multitask instruction tuning dataset. We then use the pretrained LLM for sequential training on the standard CL benchmark.
    • Our zero-shot benchmark, MMLU, covers 57 subjects across various domains such as STEM, humanities, and social sciences, assessing world knowledge and problem-solving abilities across various difficulty levels.

Metrics

Average Accuracy (AA)
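For reference, the usual definition of this metric in the continual-learning literature (the notes do not spell it out, so this follows the common convention rather than the paper's exact wording):
$$AA_T = \frac{1}{T} \sum_{i=1}^{T} a_{T,i}$$
where $a_{T,i}$ is the test accuracy on task $i$ after the model has been trained sequentially on all $T$ tasks.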

Baselines

The closest baselines are LoRA-based variants, matching the method itself, rather than prompt-based adaptations of it:

  • SeqFT: train all model parameters on a sequence of tasks
  • SeqLoRA: fixed-size LoRA parameters are trained on a sequence of tasks (a single LoRA module shared across the task sequence)
  • IncLoRA: incremental learning of new LoRA parameters on a sequential series of tasks (a new LoRA module per task, but without the orthogonality constraint)
  • Replay, EWC, LwF, L2P, LFPT5, ProgressivePrompt, PerTaskFT, MTL

Implementation Details

  • T5-large (encoder-decoder)
  • LLaMA-7B (decoder-only) for unseen tasks
  • Results are averaged over 3 runs

Results

Main Results

(figure omitted)

  • Group 1: SeqFT and SeqLoRA, which keep training within a fixed set of parameters, perform the worst and sit at the bottom; IncLoRA, which adds a new LoRA module per task, improves noticeably and moves up a tier.
  • Group 2: the traditional methods Replay, EWC, LwF, and L2P perform comparably to one another and form the second tier.
  • Group 3: LFPT5, O-LoRA, ProgPrompt, PerTaskFT, and MTL perform best and form the first tier. MTL (multi-task learning) is the upper bound. O-LoRA is roughly on par with LFPT5, slightly ahead of it, and slightly ahead of ProgPrompt on the standard CL benchmark, but much worse than ProgPrompt on the large-number-of-tasks setting (the authors attribute this to ProgPrompt requiring task IDs; the other methods also lag far behind there, which suggests existing methods genuinely struggle when the number of tasks is large).
  • My own observation: O-LoRA does not discuss forward transfer, and the results also suggest that training on later tasks does not benefit from the previously learned LoRA blocks.
    (figure omitted)
  • Performance of Alpaca-LoRA-LLaMA on MMLU and on the CL benchmark. Looking at MMLU (unseen tasks): LLaMA-7B evaluated directly on MMLU scores 34.4; after LoRA fine-tuning on the Alpaca dataset this rises to 37.5; continuing from the Alpaca-tuned model and training sequentially on the standard CL benchmark with a single fixed LoRA block, it collapses to 23.3; switching to incremental LoRA blocks recovers to 28.6; making the sequential LoRA blocks orthogonal raises it further to 33.6. On the CL side, training with a fixed LoRA gives 46.7 while incremental LoRA drops to 33.1, which I do not fully understand; finally O-LoRA jumps to 76.8, roughly doubling the score.
  • On the LoRA-CL vs. Inc-LoRA-CL comparison: Table 1 shows that continually fine-tuning a single fixed LoRA breaks down when the number of tasks is very large, but here the number of tasks is presumably small enough that its performance is not terrible, and it even beats fine-tuning a new LoRA block per task.

Ablation 1: Effect of the orthogonality constraint on the loss

(figure omitted)
The paper uses a figure to show how the loss behaves in O-LoRA. Recall the objective:
$$\sum_{x, y \in \mathcal{D}_t} \log p_{\Theta}(y \mid x) + \lambda_1 \sum_{i=1}^{t-1} L_{\text{orth}}\left(A_i, A_t\right), \qquad L_{\text{orth}}\left(A_i, A_t\right) = \sum_{j, k}\left\| O_{i,t}[j, k] \right\|^2$$
When $\lambda_1 = 0$ this reduces to the plain prediction loss, shown in orange in the figure. The x-axis is the loss and the y-axis is the number of batches. I can see the intended message, that with the orthogonality term (blue) the prediction loss is lower than without it (orange), but the plot is hard to read. First, what is the batch count on the y-axis meant to convey: training progress, with larger values meaning more training? A loss curve over training would make the point more directly. Second, it is counter-intuitive that adding an extra penalty term lowers the prediction loss; this deserves further explanation.

Ablation 2: Effect of orthogonality on what the encoder and decoder learn

(figure omitted)
Dark blue is without O-LoRA, light blue is with O-LoRA. With O-LoRA, the lower encoder layers barely change, which the authors interpret as the encoder having learned features shared across tasks, while the upper layers change a lot, which they interpret as capturing task-specific differences that keep shifting as tasks change. For the decoder, the authors argue it can extract relevant information from these rich semantic representations, proving the minimal impact of the method on past tasks.

Ablation 3: Effect of the LoRA rank $r$

(figure omitted)

  • T5-base on standard CL benchmark.
  • Increasing the rank r improves the average accuracy of the model to a certain extent.
  • There is not a significant difference in performance between r=2 and r=16, indicating that the gradient space of the model has a relatively low intrinsic dimensionality.
    In short, increasing the rank improves accuracy somewhat, but not by much.

Ablation 4: Effect of model size

(figure omitted)

  • For T5, CL performance improves as the model size grows, while MTL changes little.
  • It would help to label each model's parameter count; only LLaMA's 7B is shown.

Related Work and Future Directions

Related Work

  • CL
    • Rehearsal (ER), Regularization (OGD, C-LoRA), Architecture (Progressive Prompt)
  • PEFT
    • adapters, prompt learning, LoRA, fine-tuning subsets of the model

Future Directions

  • hundreds of tasks
  • without task identification during training and inference

References

CL

  • A-GEM, MBPA++, OGD, EWC, LwF, L2P, IDBR, LFPT5, EIP, PP
  • Incremental intent detection for medical domain with contrast replay networks.
  • Continual learning with tiny episodic memories.
  • Continual relation learning via episodic memory activation and reconsolidation.
  • Towards a unified view of parameter-efficient transfer learning.
  • Continual learning for text classification with information disentanglement based regularization.
  • Continual learning of natural language processing tasks: A survey.
  • Investigating forgetting in pre-trained representations through continual learning.
  • Catastrophic interference in connectionist networks: The sequential learning problem.
  • Gradient projection memory for continual learning.
  • Continual diffusion: Continual customization of text-to-image diffusion with C-LoRA. (C-LoRA)
  • A comprehensive survey of continual learning: Theory, method and application.
  • TRACE: A comprehensive benchmark for continual learning in large language models.
  • Rehearsal-free continual language learning via efficient parameter isolation.
  • Learning to prompt for continual learning.
  • Pretrained language model in continual learning: A comparative study.
  • The rise and potential of large language model based agents: A survey.

PEFT

  • LoRA
  • Parameter-efficient transfer learning for NLP.
  • The power of scale for parameter-efficient prompt tuning.
  • BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language models.

Models

  • GPT-2, BERT, GPT-4, T5, Alpaca, LLaMA, SuperGLUE, GLUE, Super-NaturalInstructions
  • Measuring massive multitask language understanding.
  • Learning word vectors for sentiment analysis.
  • Training language models to follow instructions with human feedback.
  • MINER: Improving out-of-vocabulary named entity recognition from an information theoretic perspective.
  • Farewell to aimless large-scale pretraining: Influential subset selection for language model.
  • InstructUIE: Multi-task instruction tuning for unified information extraction.
  • Character-level convolutional networks for text classification.

Reading Reflections

Must multiple LoRA blocks be arranged orthogonally? Is there an arrangement that both avoids interference and enables forward transfer? What would O-LoRA + C-LoRA look like?
