Adapter Tuning

        Adapter tuning optimizes a model for a specific task by introducing adapter modules: small sets of lightweight parameters inserted into the model's intermediate layers so that the original pre-trained parameters are left untouched. The goal is to adapt to a new task by training only the adapter parameters, without changing the overall model structure.

        Adapter tuning has two main advantages over full fine-tuning. First, it reduces the number of parameters that must be trained while maintaining performance comparable to full fine-tuning: on the GLUE benchmark, adapter tuning nearly matched fully fine-tuned BERT while training only 3% as many task-specific parameters (Houlsby et al., 2019). This makes model adaptation much more practical when resources are limited. Second, adapter modules preserve more of the pre-trained knowledge for the target task, because the main pre-trained parameters stay frozen and are never substantially altered.

        Adapter tuning proceeds in two steps: insert adapter modules into the intermediate layers of the pre-trained model and initialize their weights so that each module approximates the identity function, which makes the model behave almost exactly like the original at the start of training; then fine-tune on the target task, updating only the adapter weights.

        As shown in Figure 1, adapter modules are typically inserted after each of the two feed-forward sublayers inside a Transformer block. Each adapter module consists of an input, an output, a down-projection feed-forward layer, an up-projection feed-forward layer, a nonlinearity, and a skip connection from input to output. During training, only the green parts of Figure 1 are updated: the adapter's down-projection, up-projection, and nonlinearity, together with the parameters of the two layer-normalization layers in the Transformer block. The adapter works by projecting the incoming d-dimensional feature vector down to an r-dimensional vector (r << d) through the down-projection layer (a d×r matrix), applying the nonlinearity, and then projecting back up to a d-dimensional vector through the up-projection layer (an r×d matrix).

Figure 1: Structure of the adapter module and its integration into the Transformer block
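
        The description above fully determines the adapter's forward computation, so it can be captured in a few lines of PyTorch. Below is a minimal sketch of such a bottleneck adapter; the class name, the dimensions d=768 and r=48, and the small initialization scale are illustrative assumptions, not the Adapters library's implementation.

import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, skip connection."""
    def __init__(self, d: int = 768, r: int = 48):
        super().__init__()
        self.down_proj = nn.Linear(d, r)   # d x r down-projection
        self.non_linearity = nn.ReLU()
        self.up_proj = nn.Linear(r, d)     # r x d up-projection
        # Near-identity initialization: with near-zero projections, the module
        # initially passes its input through almost unchanged via the skip connection
        for layer in (self.down_proj, self.up_proj):
            nn.init.normal_(layer.weight, std=1e-3)
            nn.init.zeros_(layer.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., d) feature vectors from the preceding feed-forward sublayer
        return x + self.up_proj(self.non_linearity(self.down_proj(x)))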

        As shown in Figure 2, Houlsby et al. (2019) compare adapter tuning against conventional transfer learning (fine-tuning the few layers closest to the output) when the same number of parameters is trained. The left plot shows BERT-large fine-tuned on the GLUE benchmark: the x-axis is the number of trained parameters, and the y-axis is accuracy relative to full fine-tuning. The right plot shows BERT-base fine-tuned on several other text-classification tasks (including 20 Newsgroups, Crowdflower Airline, and the Customer Complaint Database), with the same x-axis and with the y-axis showing accuracy relative to full fine-tuning, averaged over all of those tasks. The comparison shows that adapter tuning reaches performance comparable to conventional transfer learning while training only 1/100th, or even 1/1000th, as many parameters.

Figure 2: Performance comparison between adapter tuning and conventional transfer learning

        The open-source Adapters project (https://github.com/adapter-hub/adapters) provides an out-of-the-box implementation of adapter tuning: with a little configuration we can train and load adapter models. Adapters also integrates seamlessly with Hugging Face's Transformers package, so models hosted on Hugging Face can be loaded directly for adapter fine-tuning.

        The following example fine-tunes a pre-trained BERT model for a text-classification task to show how adapter tuning is applied to a large language model. First, load the pre-trained BERT model and its tokenizer.

from transformers import AutoTokenizer, AutoConfig
from adapters import AutoAdapterModel

model_path = "bert-base-chinese"
tokenizer = AutoTokenizer.from_pretrained(model_path)
config = AutoConfig.from_pretrained(model_path, num_labels=3)
model = AutoAdapterModel.from_pretrained(model_path, config=config)
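
        The training step further below also needs a train_dataset. As a minimal, hypothetical sketch (the texts and labels are invented for a three-class task), a labeled dataset could be wrapped like this:

import torch

texts = ["服务器无法启动", "网络连接超时", "磁盘空间不足"]  # hypothetical examples
labels = [0, 1, 2]                                          # one label per text, 3 classes

encodings = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")

class SimpleDataset(torch.utils.data.Dataset):
    """Wraps tokenizer output and labels in the mapping format the trainer expects."""
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

train_dataset = SimpleDataset(encodings, labels)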

        Next, attach an adapter to the pre-trained model. Note that in the Adapters package, the adapter structure described in this section is called a bottleneck adapter and is configured with the BnConfig class. The adapter needs a name, which is used later to activate or deactivate it.

from adapters import BnConfig

adapter_name = "trouble_shooting"

# Add a new adapter of type Bn adapter, i.e. a bottleneck adapter
config = BnConfig(mh_adapter=True, output_adapter=True, reduction_factor=16, non_linearity="relu")
model.add_adapter(adapter_name, config=config)

# Add a classification head
model.add_classification_head(adapter_name, num_labels=3, activation_function="relu")

# Activate the adapter: freeze the base model and train only the adapter
model.train_adapter(adapter_name)

The main configuration parameters are:

    1. mh_adapter: whether to add an adapter after the multi-head attention sublayer (the lower adapter on the left side of Figure 1).
    2. output_adapter: whether to add an adapter at the output of the Transformer block's feed-forward sublayer (the upper adapter on the left side of Figure 1).
    3. reduction_factor: the ratio of the model's hidden dimension d to the adapter's bottleneck dimension r, i.e. r = d / reduction_factor (see the numeric sketch after this list).
    4. non_linearity: the activation function used in the nonlinear part on the right side of Figure 1.
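
        As a quick illustrative calculation of what reduction_factor=16 implies (assuming BERT-base's hidden size of 768; the variable names are ours):

d = 768                    # hidden size of bert-base-chinese
reduction_factor = 16
r = d // reduction_factor  # bottleneck dimension: 48

# down-projection (d*r weights + r biases) plus up-projection (r*d weights + d biases)
params_per_adapter = (d * r + r) + (r * d + d)
print(params_per_adapter)  # 74544, i.e. roughly 75K trainable parameters per adapter module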

        Finally, set the training arguments and train with the AdapterTrainer class from the adapters package; calling the trainer's save_model method afterwards saves the trained adapter (not the base model) to local disk.

import torch
from transformers import TrainingArguments
from adapters import AdapterTrainer

training_args = TrainingArguments(
    num_train_epochs=5,
    per_device_train_batch_size=16,
    logging_steps=2,
    save_steps=10,
    gradient_accumulation_steps=4,
    output_dir="/LLM/BERT/bert-adapter",
)

# Assumed: an AdamW optimizer over the trainable (adapter) parameters with an
# illustrative learning rate; omit the optimizers argument below to let
# AdapterTrainer create a default optimizer instead
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

trainer = AdapterTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train_dataset,
    optimizers=(optimizer, None),
)

trainer.train()       # start training
trainer.save_model()  # save the trained adapter

        The training arguments used here are:

  1. per_device_train_batch_size: the number of training examples in each iteration's batch.
  2. gradient_accumulation_steps: an optimization strategy that reduces how often the weights are updated: instead of updating on every iteration, gradients are accumulated over this many iterations before each update. Each weight update is called a base step (see the arithmetic sketch after this list).
  3. logging_steps: how many base steps between progress logs printed to the console.
  4. save_steps: how many base steps between saved model checkpoints.
  5. num_train_epochs: an epoch is one full pass over the training data; this sets how many passes to make in total.
  6. output_dir: the directory where checkpoints and the final model are stored.
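
        As a quick arithmetic check of how these settings interact, the number of training examples consumed per base step (per device) is:

per_device_train_batch_size = 16
gradient_accumulation_steps = 4
# one base step (weight update) accumulates gradients over 4 batches of 16 examples
examples_per_base_step = per_device_train_batch_size * gradient_accumulation_steps
print(examples_per_base_step)  # 64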

        At inference time, loading the base model alone is not enough: the adapter must also be loaded to form the complete model. The code below loads an adapter from a local path and activates it.

model.load_adapter("/LLM/BERT/bert-adapter", set_active=True)
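
        As a brief usage sketch for inference (the input text is made up, and we assume the classification head saved during training was loaded along with the adapter, so the model output carries logits):

import torch

inputs = tokenizer("数据库连接超时", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
predicted_label = outputs.logits.argmax(dim=-1).item()  # index of the predicted class
print(predicted_label)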

References:

Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., de Laroussilhe, Q., Gesmundo, A., Attariyan, M., & Gelly, S. (2019). Parameter-Efficient Transfer Learning for NLP. Proceedings of ICML.
