LoRA is usually described as a parameter-efficient fine-tuning scheme that adds a bypass to the original model, freezes the original weights during training, and trains only the bypass. But concretely, how does LoRA inject the two matrices A and B into the model? Which layers get a LoRA bypass? And once the LoRA parameters are trained, how are they merged back? Let's start from the code and find out, step by step.
I used Liu Cong's ChatGLM fine-tuning code directly. Since he may change the code later, at which point I might no longer be able to follow it, I forked a copy and added some annotations. Repository: https://github.com/illusions-LYY/ChatGLM-Finetuning/tree/master
Launching the fine-tuning code
First, clone the repository to your local machine.
Next, run the following command to start LoRA fine-tuning:
bash scripts/train_chatglm.sh
- Note: reading the script shows that I fine-tune ChatGLM2-6B, not the newest ChatGLM3
- peft.__version__ = "0.8.2" (this matters: the peft line numbers cited throughout this post are from this version!)
Preparing the LoRA model
1. Loading the base model & LoraConfig:
Step through train.py down to line-101 (code below). Since we chose the lora fine-tuning mode, execution enters through this if branch, which loads the model (ChatGLM2-6B), builds the LoraConfig, and so on.
if args.train_type == "lora":
    model = MODE[args.mode]["model"].from_pretrained(args.model_name_or_path)
    lora_module_name = args.lora_module_name.split(",")
    config = LoraConfig(r=args.lora_dim,
                        lora_alpha=args.lora_alpha,
                        target_modules=lora_module_name,
                        lora_dropout=args.lora_dropout,
                        bias="none",
                        task_type="CAUSAL_LM",
                        inference_mode=False,
                        )
    model = get_peft_model(model, config)  # wrap the model as a PEFT (LoRA) model
Meaning of the main LoraConfig parameters:
Parameter | Meaning |
---|---|
r | Rank of the LoRA matrices: the inner width connecting matrix A and matrix B, with r << d |
lora_alpha | Scaling (normalization) hyperparameter: the LoRA branch output is divided by r and multiplied by lora_alpha, i.e. scaled by lora_alpha / r |
target_modules | List of modules that receive LoRA training; already defined in the train_chatglm.sh script |
lora_dropout | Dropout rate inside the LoRA branch |
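To make r and lora_alpha concrete, here is a minimal numeric sketch. The shapes are taken from query_key_value below and r = 16 matches this run; lora_alpha = 32 is an assumed value for illustration:

import torch

# hypothetical numbers: a 4096 -> 4608 weight (like query_key_value), r = 16
d_out, d_in, r, lora_alpha = 4608, 4096, 16, 32  # lora_alpha = 32 is assumed

B = torch.zeros(d_out, r)           # lora_B weight, initialized to zero
A = torch.randn(r, d_in)            # lora_A weight
scaling = lora_alpha / r            # the normalization from the table above

delta_W = (B @ A) * scaling         # low-rank update, same shape as the frozen W
print(delta_W.shape)                # torch.Size([4608, 4096])
print((d_out * r + r * d_in) / (d_out * d_in))  # ~0.007: the bypass trains <1% of W's parameters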
Then, on the last line of this snippet (line-112), model = get_peft_model(model, config) completes the transformation from the original ChatGLM into the LoRA-wrapped ChatGLM. The get_peft_model function comes from Hugging Face's official peft package, and this is where we start digging. First, let's review ChatGLM's model structure so we can later compare it against the LoRA-wrapped Lora-ChatGLM:
ChatGLM model structure
ChatGLMForConditionalGeneration(
(transformer): ChatGLMModel(
(embedding): Embedding(
(word_embeddings): Embedding(65024, 4096)
)
(rotary_pos_emb): RotaryEmbedding()
(encoder): GLMTransformer(
(layers): ModuleList(
(0-27): 28 x GLMBlock(
(input_layernorm): RMSNorm()
(self_attention): SelfAttention(
(query_key_value): Linear(in_features=4096, out_features=4608, bias=True)
(core_attention): CoreAttention(
(attention_dropout): Dropout(p=0.0, inplace=False)
)
(dense): Linear(in_features=4096, out_features=4096, bias=False)
)
(post_attention_layernorm): RMSNorm()
(mlp): MLP(
(dense_h_to_4h): Linear(in_features=4096, out_features=27392, bias=False)
(dense_4h_to_h): Linear(in_features=13696, out_features=4096, bias=False)
)
)
)
(final_layernorm): RMSNorm()
)
(output_layer): Linear(in_features=4096, out_features=65024, bias=False)
)
)
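Both structure dumps in this post can be reproduced with a plain print(model); on the wrapped model, peft additionally offers print_trainable_parameters():

print(model)  # prints the module tree: run before get_peft_model for the dump above,
              # and after it for the Lora-ChatGLM dump further below
# available on the PeftModel returned by get_peft_model:
model.print_trainable_parameters()  # reports trainable params vs. all params and the trainable %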
2. Inside get_peft_model():
def get_peft_model(
model: PreTrainedModel, peft_config: PeftConfig, adapter_name: str = "default", mixed: bool = False
) -> PeftModel | PeftMixedModel:
...
return MODEL_TYPE_TO_PEFT_MODEL_MAPPING[peft_config.task_type](model, peft_config, adapter_name=adapter_name)
As the code above shows, once inside get_peft_model(), the function selects which model class to instantiate based on the pre-specified peft_config.task_type (peft/mapping.py, line-137). Since we know task_type="CAUSAL_LM", we follow the jump to the corresponding PeftModelForCausalLM. Several further jumps follow, all inside the peft package:
peft/mapping.py(137)             -> # picks CausalLM (note that ChatGLM, too, is used as a CausalLM, not a PrefixLM!)
peft/peft_model.py(1051)         -> # first initializes the parent class PeftModel
peft/peft_model.py(127)          -> # following `PEFT_TYPE_TO_MODEL_MAPPING`, jumps to `LoraModel` and initializes it
peft/tuners/lora/model.py(109)   -> # inside LoraModel, first initializes the parent class BaseTuner
peft/tuners/tuners_utils.py(148) -> # injects the LoRA bypass modules
peft/tuners/tuners_utils.py(280) -> # iterates over all ChatGLM layers and applies the LoRA retrofit
After this long chain of jumps, we have finally arrived at the place where the LoRA bypass modules are actually inserted.
3. Iterating over all ChatGLM layers and applying the LoRA retrofit
for key in key_list:
    # Check for modules_to_save in case
    if _check_for_modules_to_save and any(
        key.endswith(f"{module_to_save}") for module_to_save in peft_config.modules_to_save
    ):
        # Optionally set the modules to save
        parent, target, target_name = _get_submodules(model, key)
        if not isinstance(target, ModulesToSaveWrapper):
            new_module = ModulesToSaveWrapper(target, adapter_name)
            setattr(parent, target_name, new_module)
        else:
            target.update(adapter_name)
        _has_modules_to_save = True
        continue

    if not self._check_target_module_exists(peft_config, key):  # is this module within the LoRA fine-tuning scope?
        continue

    self.targeted_module_names.append(key)  # within scope: prepare the LoRA parameters
    is_target_modules_in_base_model = True
    parent, target, target_name = _get_submodules(model, key)  # fetch both the module itself and its parent module for this key
    self._create_and_replace(peft_config, adapter_name, target, target_name, parent, current_key=key)  # replace the layer to be tuned in-place with an adapter layer; which layers are targeted depends on `target_name`
As shown above (this sits at lines 280~303 of the original file), the key logic is the check if not self._check_target_module_exists(peft_config, key): it decides whether the ChatGLM module currently being visited falls within the LoRA fine-tuning scope. If it doesn't, the loop simply continues; if it does, the rest of the logic runs. Conceptually, the check boils down to suffix matching of the module name against target_modules, as sketched below.
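A simplified sketch of that matching (the real peft implementation also accepts a single regex string for target_modules, which it matches with re.fullmatch):

import re

def is_target(key: str, target_modules) -> bool:
    # e.g. key = "transformer.encoder.layers.0.self_attention.query_key_value"
    # matches the configured target name "query_key_value"
    if isinstance(target_modules, str):
        return re.fullmatch(target_modules, key) is not None
    return any(key == t or key.endswith("." + t) for t in target_modules)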
Next, the _get_submodules(model, key) call (key is simply the module's dotted name) fetches both the module itself and its parent module. The module itself is what gets the LoRA bypass attached, yielding the new module; the parent is needed so that the old child can later be swapped for its new LoRA version inside the parent. A sketch of the helper follows.
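_get_submodules itself is tiny; it essentially does the following (a sketch matching peft's internal utils):

def _get_submodules(model, key):
    # "a.b.c" -> parent = model.a.b, target_name = "c", target = model.a.b.c
    parent = model.get_submodule(".".join(key.split(".")[:-1]))
    target_name = key.split(".")[-1]
    target = model.get_submodule(key)
    return parent, target, target_name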
We then step into self._create_and_replace, i.e. the in-place replacement of the original layer with its LoRA-equipped version. The jump chain here is:
peft/tuners/tuners_utils.py(303)   -> # create the new LoRA-equipped module
peft/tuners/lora/model.py(176)     -> # create the new module with the LoRA bypass
peft/tuners/lora/model.py(247~251) -> # iterates over 3 dispatchers, but always ends up picking the last one, `dispatch_default`
peft/tuners/lora/layer.py(679)     -> # enter dispatch_default
Let's take a quick look at what is defined there:
def dispatch_default(
    target: torch.nn.Module,
    adapter_name: str,
    lora_config: LoraConfig,
    **kwargs,
) -> Optional[torch.nn.Module]:
    new_module = None

    if isinstance(target, BaseTunerLayer):
        target_base_layer = target.get_base_layer()
    else:
        target_base_layer = target  # keep a handle on the layer being replaced: the LoRA bypass coexists with the original layer rather than replacing it, so the old layer must be preserved

    if isinstance(target_base_layer, torch.nn.Embedding):
        embedding_kwargs = kwargs.copy()
        embedding_kwargs.pop("fan_in_fan_out", None)
        embedding_kwargs.update(lora_config.loftq_config)
        new_module = Embedding(target, adapter_name, **embedding_kwargs)
    elif isinstance(target_base_layer, torch.nn.Conv2d):
        kwargs.update(lora_config.loftq_config)
        new_module = Conv2d(target, adapter_name, **kwargs)
    elif isinstance(target_base_layer, torch.nn.Linear):
        if kwargs["fan_in_fan_out"]:
            warnings.warn(
                "fan_in_fan_out is set to True but the target module is `torch.nn.Linear`. "
                "Setting fan_in_fan_out to False."
            )
            kwargs["fan_in_fan_out"] = lora_config.fan_in_fan_out = False
        kwargs.update(lora_config.loftq_config)
        new_module = Linear(target, adapter_name, **kwargs)  # Linear/Embedding/Conv layers each have their own LoRA-augmented definition
    elif isinstance(target_base_layer, Conv1D):
        if not kwargs["fan_in_fan_out"]:
            warnings.warn(
                "fan_in_fan_out is set to False but the target module is `Conv1D`. " "Setting fan_in_fan_out to True."
            )
            kwargs["fan_in_fan_out"] = lora_config.fan_in_fan_out = True
        kwargs.update(lora_config.loftq_config)
        new_module = Linear(target, adapter_name, is_target_conv_1d_layer=True, **kwargs)

    return new_module
As the code shows, the peft package defines a total of four LoRA-retrofitted layer types: Embedding, Conv2d, Linear, and Conv1D.
In scripts/train_chatglm.sh we have already declared the layers that take part in fine-tuning:
--lora_module_name "query_key_value,dense_h_to_4h,dense_4h_to_h,dense"
Comparing these names against the ChatGLM model structure above shows that all four are Linear layers, so we naturally enter the torch.nn.Linear branch of dispatch_default. A quick sanity check confirms it, as shown below.
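This is easy to verify on a freshly loaded ChatGLM2-6B, before get_peft_model wraps it (the attribute path comes from the structure dump above):

import torch

blk = model.transformer.encoder.layers[0]
for m in (blk.self_attention.query_key_value, blk.self_attention.dense,
          blk.mlp.dense_h_to_4h, blk.mlp.dense_4h_to_h):
    print(isinstance(m, torch.nn.Linear))  # True for all four target modules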
4. Details of building the LoRA bypass for Linear
Again it starts with a chain of jumps:
peft/tuners/lora/layer.py(207) -> # LoraLayer initialization; LoraLayer is one of Linear's parent classes (strictly speaking, this Linear should be called LoraLinear)
peft/tuners/lora/layer.py(211) -> # update_layer(): defines the LoraLinear parameters, such as the lora_A and lora_B matrices
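Before the forward pass, it is worth seeing what update_layer actually creates. A simplified standalone sketch for a Linear base layer (shapes match query_key_value; lora_alpha = 32 is assumed, lora_dropout = 0.1 comes from the dump below):

import math
import torch.nn as nn

in_features, out_features, r, lora_alpha, lora_dropout = 4096, 4608, 16, 32, 0.1

dropout = nn.Dropout(p=lora_dropout)
lora_A = nn.Linear(in_features, r, bias=False)   # down-projection: d -> r
lora_B = nn.Linear(r, out_features, bias=False)  # up-projection:   r -> d_out
scaling = lora_alpha / r

# default initialization: A ~ Kaiming-uniform, B = 0, so B(A(x)) == 0 at step 0
# and the freshly wrapped model behaves exactly like the original one
nn.init.kaiming_uniform_(lora_A.weight, a=math.sqrt(5))
nn.init.zeros_(lora_B.weight)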
Let's now look at how, once LoRA has defined the A and B matrices, Linear operates on the hidden states:
def forward(self, x: torch.Tensor, *args: Any, **kwargs: Any) -> torch.Tensor:
    previous_dtype = x.dtype

    if self.disable_adapters:
        if self.merged:
            self.unmerge()
        result = self.base_layer(x, *args, **kwargs)
    elif self.merged:
        result = self.base_layer(x, *args, **kwargs)
    else:
        result = self.base_layer(x, *args, **kwargs)  # normal forward pass of the original layer
        for active_adapter in self.active_adapters:
            if active_adapter not in self.lora_A.keys():
                continue
            lora_A = self.lora_A[active_adapter]
            lora_B = self.lora_B[active_adapter]
            dropout = self.lora_dropout[active_adapter]
            scaling = self.scaling[active_adapter]
            x = x.to(lora_A.weight.dtype)
            result += lora_B(lora_A(dropout(x))) * scaling  # add the scaled bypass output onto the base result

    result = result.to(previous_dtype)
    return result
This code lives at peft/tuners/lora/layer.py, lines 297~319. Let's go straight to the else branch, which reads as follows:
- result = self.base_layer(x, *args, **kwargs): the base_layer is the layer from the original model, e.g. query_key_value, so the first step is one normal forward pass through the original layer;
- the already-initialized lora_A, lora_B, and dropout belonging to the current layer and active adapter are fetched (their initialization was sketched in step 4 above);
- dropout(x) -> lora_A transform -> lora_B transform -> multiply by scaling -> add onto the original layer's forward result;
- return.
So the forward output of the LoRA bypass is simply added to the original layer's output. This works because lora_B's out_features equals the base layer's out_features: after the A and B linear transforms, the bypass output has exactly the same shape as the base result.
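A quick numeric check of that claim, with standalone layers of the same shapes as query_key_value (hypothetical demo values; scaling assumed as lora_alpha / r = 32 / 16):

import torch
import torch.nn as nn

x = torch.randn(2, 4096)                 # (batch, hidden)
base = nn.Linear(4096, 4608, bias=True)  # stands in for the frozen base_layer
lora_A = nn.Linear(4096, 16, bias=False)
lora_B = nn.Linear(16, 4608, bias=False)
scaling = 32 / 16                        # lora_alpha / r, assumed values

result = base(x) + lora_B(lora_A(x)) * scaling
print(result.shape == base(x).shape)     # True: the bypass never changes the output shape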
At this point the LoRA bypass structure is fully built and the new module is returned; next comes the replacement step.
5. Returning the LoRA-fied new module and swapping the corresponding layers in the original ChatGLM model
peft/tuners/lora/model.py(176)   -> # back here after the return; we now hold new_module
peft/tuners/lora/model.py(180)   -> # replace the original module with new_module (= original module + LoRA bypass)
peft/tuners/lora/model.py(183)   -> # setattr(parent, child_name, new_module), i.e. parent.child_name = new_module; the actual swap happens here
peft/tuners/tuners_utils.py(303) -> # self._create_and_replace() is done; back here
peft/tuners/tuners_utils.py(311) -> peft/tuners/lora/model.py(209) -> # decide, for every module in the model, whether it is updated during fine-tuning (no non-LoRA module gets parameter updates or backprop); a sketch follows
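That last step is conceptually very simple; a simplified sketch of what "mark only adapters as trainable" amounts to:

def mark_only_lora_as_trainable(model):
    for name, param in model.named_parameters():
        # every original (non-LoRA) weight is frozen; only lora_A / lora_B train
        param.requires_grad = "lora_" in name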
At this point, all target modules have been LoRA-retrofitted one by one in the loop and swapped back in. So what does the fully converted model look like:
Lora-ChatGLM structure
LoraModel(
(model): ChatGLMForConditionalGeneration(
(transformer): ChatGLMModel(
(embedding): Embedding(
(word_embeddings): Embedding(65024, 4096)
)
(rotary_pos_emb): RotaryEmbedding()
(encoder): GLMTransformer(
(layers): ModuleList(
(0-27): 28 x GLMBlock(
(input_layernorm): RMSNorm()
(self_attention): SelfAttention(
(query_key_value): lora.Linear(
(base_layer): Linear(in_features=4096, out_features=4608, bias=True)
(lora_dropout): ModuleDict(
(default): Dropout(p=0.1, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=4096, out_features=16, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=16, out_features=4608, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
)
(core_attention): CoreAttention(
(attention_dropout): Dropout(p=0.0, inplace=False)
)
(dense): lora.Linear(
(base_layer): Linear(in_features=4096, out_features=4096, bias=False)
(lora_dropout): ModuleDict(
(default): Dropout(p=0.1, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=4096, out_features=16, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=16, out_features=4096, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
)
)
(post_attention_layernorm): RMSNorm()
(mlp): MLP(
(dense_h_to_4h): lora.Linear(
(base_layer): Linear(in_features=4096, out_features=27392, bias=False)
(lora_dropout): ModuleDict(
(default): Dropout(p=0.1, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=4096, out_features=16, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=16, out_features=27392, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
)
(dense_4h_to_h): lora.Linear(
(base_layer): Linear(in_features=13696, out_features=4096, bias=False)
(lora_dropout): ModuleDict(
(default): Dropout(p=0.1, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=13696, out_features=16, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=16, out_features=4096, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
)
)
)
)
(final_layernorm): RMSNorm()
)
(output_layer): Linear(in_features=4096, out_features=65024, bias=False)
)
)
)
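Finally, back to the merging question from the opening paragraph. Because the bypass is purely additive, merging after training just folds each product into its frozen weight: W_merged = W + B A * (lora_alpha / r). In peft this is exposed as merge_and_unload() on the wrapped model; a minimal usage sketch (the output path is hypothetical):

# fold every LoRA bypass into its base layer and drop the adapter modules,
# so inference costs exactly the same as the original ChatGLM
merged_model = model.merge_and_unload()
merged_model.save_pretrained("output/chatglm2-6b-lora-merged")  # hypothetical path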