LLMs | (IA)³: Translation and Notes on “Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning”

Table of Contents

“Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning”: Translation and Notes

Abstract

1 Introduction

2 Background

2.1 Few-shot in-context learning (ICL)

2.2 Parameter-efficient fine-tuning

3 Designing the T-Few Recipe

3.3 Parameter-efficient fine-tuning with (IA)³

5 Conclusion


“Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning”: Translation and Notes

Link

Paper: https://arxiv.org/abs/2205.05638

Date

May 11, 2022

Authors

Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, Colin Raffel

Summary

This article compares few-shot ICL and PEFT for adapting language models on small amounts of data.

Background and pain points

>> ICL enables task switching with no training cost, but inference is expensive and accuracy lags behind fine-tuning;

>> Conventional fine-tuning updates a large number of parameters, which is hard to justify in few-shot scenarios.

The solution: T-Few

>> Uses (IA)³, a PEFT method that rescales the activations of the attention and feed-forward networks in each Transformer layer by multiplying them with learned vectors, while adding very few parameters;

>> Fine-tunes on top of T0, adding an unlikelihood loss and a length-normalization loss to improve accuracy;

>> Pre-trains the (IA)³ parameters before fine-tuning.

Key features and advantages of T-Few

>> Higher accuracy than conventional fine-tuning while updating far fewer parameters;

>> Higher accuracy than ICL at a much lower inference cost;

>> Supports mixed-task batches, making it a better fit for multitask settings;

>> Needs no per-dataset hyperparameter tuning and can be applied directly to new tasks;

>> Outperforms both ICL methods and the human baseline on real-world tasks such as RAFT, demonstrating its effectiveness with little data.

Through comparative experiments and analysis, the paper shows that PEFT methods beat ICL in few-shot settings. The concrete T-Few recipe also performs strongly on multiple benchmarks, offering a practical blueprint for applying language models with little data.

Abstract

Few-shot in-context learning (ICL) enables pre-trained language models to perform a previously-unseen task without any gradient-based training by feeding a small number of training examples as part of the input. ICL incurs substantial computational, memory, and storage costs because it involves processing all of the training examples every time a prediction is made. Parameter-efficient fine-tuning (PEFT) (e.g. adapter modules, prompt tuning, sparse update methods, etc.) offers an alternative paradigm where a small set of parameters are trained to enable a model to perform the new task. In this paper, we rigorously compare few-shot ICL and PEFT and demonstrate that the latter offers better accuracy as well as dramatically lower computational costs. Along the way, we introduce a new PEFT method called (IA)³ that scales activations by learned vectors, attaining stronger performance while only introducing a relatively tiny amount of new parameters. We also propose a simple recipe based on the T0 model [1] called T-Few that can be applied to new tasks without task-specific tuning or modifications. We validate the effectiveness of T-Few on completely unseen tasks by applying it to the RAFT benchmark [2], attaining super-human performance for the first time and outperforming the state-of-the-art by 6% absolute. All of the code used in our experiments is publicly available.

1 Introduction

Pre-trained language models have become a cornerstone of natural language processing, thanks to the fact that they can dramatically improve data efficiency on tasks of interest – i.e., using a pre-trained language model for initialization often produces better results with less labeled data. A historically common approach has been to use the pre-trained model’s parameters for initialization before performing gradient-based fine-tuning on a downstream task of interest. While fine-tuning has produced many state-of-the-art results [1], it results in a model that is specialized for a single task with an entirely new set of parameter values, which can become impractical when fine-tuning a model on many downstream tasks.

An alternative approach popularized by [3, 4] is in-context learning (ICL), which induces a model to perform a downstream task by inputting prompted examples. Few-shot prompting converts a small collection of input-target pairs into (typically) human-understandable instructions and examples [3, 4], along with a single unlabeled example for which a prediction is desired. Notably, ICL requires no gradient-based training and therefore allows a single model to immediately perform a wide variety of tasks. Performing ICL therefore solely relies on the capabilities that a model learned during pre-training. These characteristics have led to a great deal of recent interest in ICL methods [5–10].

Despite the practical benefits of ICL, it has several major drawbacks. First, processing all prompted input-target pairs every time the model makes a prediction incurs significant compute costs. Second, ICL typically produces inferior performance compared to fine-tuning [4]. Finally, the exact formatting of the prompt (including the wording [11] and ordering of examples [12]) can have significant and unpredictable impact on the model’s performance, far beyond inter-run variation of fine-tuning. Recent work has also demonstrated that ICL can perform well even when provided with incorrect labels, raising questions as to how much learning is taking place at all [9].

An additional paradigm for enabling a model to perform a new task with minimal updates is parameter-efficient fine-tuning (PEFT), where a pre-trained model is fine-tuned by only updating a small number of added or selected parameters. Recent methods have matched the performance of fine-tuning the full model while only updating or adding a small fraction (e.g. 0.01%) of the full model’s parameters [13, 14]. Furthermore, certain PEFT methods allow mixed-task batches where different examples in a batch are processed differently [14], making both PEFT and ICL viable for multitask models.

While the benefits of PEFT address some shortcomings of fine-tuning (when compared to ICL), there has been relatively little focus on whether PEFT methods work well when very little labeled data is available. Our primary goal in this paper is to close this gap by proposing a recipe – i.e., a model, a PEFT method, and a fixed set of hyperparameters – that attains strong performance on novel, unseen tasks while only updating a tiny fraction of the model’s parameters. Specifically, we base our approach on the T0 model [1], a variant of T5 [15] fine-tuned on a multitask mixture of prompted datasets. To improve performance on classification and multiple-choice tasks, we add unlikelihood [16, 17] and length normalization-based [4] loss terms. In addition, we develop (IA)³, a PEFT method that multiplies intermediate activations by learned vectors. (IA)³ attains stronger performance than full-model fine-tuning while updating up to 10,000× fewer parameters. Finally, we demonstrate the benefits of pre-training the (IA)³ parameters before fine-tuning [18, 19]. Our overall recipe, which we dub “T-Few”, performs significantly better than ICL (even against 16× larger models) and outperforms humans for the first time on the real-world few-shot learning benchmark RAFT [2] while requiring dramatically less compute and allowing for mixed-task batches during inference. To facilitate the use of T-Few on new problems and future research on PEFT, we release our code.

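To make the two extra loss terms concrete, here is our reading of them in a minimal PyTorch sketch (an illustration, not the authors’ released implementation; the tensor shapes and helper names are assumptions). The unlikelihood term pushes down the probability of tokens in incorrect answer choices, and the length-normalized term scores every candidate by its average per-token log-probability before a softmax cross-entropy:

```python
import torch
import torch.nn.functional as F

def seq_log_probs(logits, targets):
    # Per-token log-probabilities of a target sequence.
    # logits: (T, vocab) model outputs; targets: (T,) token ids.
    return F.log_softmax(logits, dim=-1).gather(
        -1, targets.unsqueeze(-1)).squeeze(-1)

def unlikelihood_loss(wrong):
    # wrong: list of (logits, targets) pairs, one per incorrect choice.
    # Penalize each wrong token with -log(1 - p(token)), averaged over tokens.
    terms, n_tokens = [], 0
    for logits, targets in wrong:
        p = seq_log_probs(logits, targets).exp()
        terms.append(-torch.log1p(-p).sum())
        n_tokens += targets.numel()
    return torch.stack(terms).sum() / n_tokens

def length_normalized_loss(correct, wrong):
    # Score each choice by its length-normalized log-likelihood, then apply
    # cross-entropy so the correct choice (index 0) gets the highest score.
    scores = [seq_log_probs(*correct).mean()]
    scores += [seq_log_probs(l, t).mean() for l, t in wrong]
    return -F.log_softmax(torch.stack(scores), dim=0)[0]
```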

After providing background on ICL and PEFT in the following section, we discuss the design of T-Few in section 3. In section 4, we present experiments comparing T-Few to strong ICL baselines. Finally, we discuss related work in appendix B and conclude in section 5.

2 Background

In this section, we provide an overview of ICL and PEFT with a focus on characterizing the computation, memory, and on-disk storage costs of making a prediction. Real-world costs depend on implementation and hardware, so we report costs in terms of FLOPs for computation and bytes for memory and storage, respectively. Additional related work is discussed in appendix B.

2.1 Few-shot in-context learning (ICL)

ICL [3, 4] aims to induce a model to perform a task by feeding in concatenated and prompted input-target examples (called “shots”) along with an unlabeled query example. Taking the cycled letter task from Brown et al. [4] as an example, a 4-shot input or context would be “Please unscramble the letters into a word, and write that word: asinoc = casino, yfrogg = froggy, plesim = simple, iggestb = biggest, astedro =”, for which the desired output would be “roasted”. ICL induces an autoregressive language model to perform this task by feeding in the context and sampling from the model. For classification tasks, each label is associated with a string (e.g. “positive” and “negative” for sentiment analysis) and a label is assigned by choosing the label string that the model assigns the highest probability to. For multiple-choice tasks (e.g. choosing between N possible answers to a question), the model’s prediction is similarly determined by determining which choice is assigned the highest probability.

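To make that scoring rule concrete, here is a minimal sketch of rank classification with a causal language model, using gpt2 through Hugging Face transformers purely as a stand-in model; the prompt and label strings are illustrative, not from the paper:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def label_log_prob(context: str, label: str) -> float:
    # Sum of log-probabilities the model assigns to the label tokens
    # when they follow the prompted context.
    ctx_ids = tok(context, return_tensors="pt").input_ids
    lab_ids = tok(label, return_tensors="pt").input_ids
    ids = torch.cat([ctx_ids, lab_ids], dim=1)
    with torch.no_grad():
        logits = model(ids).logits
    # Logits at position t predict token t + 1, so slice the label span.
    logp = torch.log_softmax(logits[0, ctx_ids.size(1) - 1 : -1], dim=-1)
    return logp.gather(-1, lab_ids[0].unsqueeze(-1)).sum().item()

context = "Review: A thoroughly enjoyable film.\nSentiment:"
pred = max([" positive", " negative"], key=lambda s: label_log_prob(context, s))
print(pred)  # the label string assigned the highest probability
```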

The primary advantage of ICL is that it enables a single model to perform many tasks immediately without fine-tuning. This also enables mixed-task batches, where different examples in a batch of data correspond to different tasks by using different contexts in the input. ICL is also typically performed with only a limited number of labeled examples – called few-shot learning – making it data-efficient.

Despite these advantages, ICL comes with significant practical drawbacks: First, making a prediction is dramatically more expensive because the model needs to process all of the in-context labeled examples. Specifically, ignoring the quadratic complexity of self-attention operations in Transformer language models (which are typically small compared to the costs of the rest of the model [20]), processing the k training examples for k-shot ICL increases the computational cost by approximately k + 1 times compared to processing the unlabeled example alone. Memory costs similarly scale approximately linearly with k, though during inference the memory costs are typically dominated by storing the model’s parameters. Separately, there is a small amount of on-disk storage required for storing the in-context examples for a given task. For example, storing 32 examples for a task where the prompted input and target for each example is 512 tokens long would require about 66 kilobytes of storage on disk (32 examples × 512 tokens × 32 bits).

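The approximate scaling described above is easy to sanity-check numerically; a small worked example in pure Python, with the numbers taken from the paragraph:

```python
# k-shot ICL cost sketch, using the figures quoted above.
k, tokens_per_example, bits_per_token = 32, 512, 32

# Processing k shots plus the query costs roughly (k + 1)x the FLOPs of
# processing the query alone (ignoring attention's quadratic term).
relative_flops = k + 1                     # about 33x for 32-shot ICL

# On-disk storage for the in-context examples themselves.
storage_bytes = k * tokens_per_example * bits_per_token // 8
print(relative_flops, storage_bytes)       # 33, 65536 bytes (~66 kB)
```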

Beyond the aforementioned costs, ICL also exhibits unintuitive behavior. Zhao et al. [12] showed that the ordering of examples in the context heavily influences the model’s predictions. Min et al. [9] showed that ICL can still perform well even if the labels of the in-context examples are swapped (i.e. made incorrect), which raises questions about whether ICL is really “learning” from the labeled examples.

Various approaches have been proposed to mitigate these issues. One way to decrease computational costs is to cache the key and value vectors for in-context examples. This is possible because decoder-only Transformer language models have a causal masking pattern, so the model’s activations for the context do not depend on the unlabeled example. In an extreme case, 32-shot ICL with 512 tokens per in-context example would result in over 144 gigabytes of cached key and value vectors for the GPT-3 model (32 examples × 512 tokens × 96 layers × 12288 (d_model) × 32 bits each for the key and value vectors). Separately, Min et al. [21] proposed ensemble ICL, where instead of using the output probability from concatenating the k training examples, the output probabilities of the model on each training example (i.e. 1-shot ICL for each of the k examples) are multiplied together. This lowers the non-parameter memory cost by a factor of k/2 but increases the computational cost by a factor of 2. In terms of task performance, Min et al. [21] find that ensemble ICL outperforms the standard concatenative variant.

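The 144-gigabyte figure follows directly from the quoted dimensions; a quick check:

```python
# Cached key/value size for 32-shot ICL with GPT-3, per the text above.
examples, tokens_per_example = 32, 512
layers, d_model = 96, 12288
bytes_per_float = 4          # 32-bit precision
key_and_value = 2            # one key vector and one value vector per token

cache_bytes = (examples * tokens_per_example * layers
               * d_model * key_and_value * bytes_per_float)
print(cache_bytes / 2**30)   # 144.0 gibibytes
```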

2.2 Parameter-efficient fine-tuning

While standard fine-tuning updates all parameters of the pre-trained model, it has been demonstrated that it is possible to instead update or add a relatively small number of parameters. Early methods proposed adding adapters [22–24], which are small trainable feed-forward networks inserted between the layers in the fixed pre-trained model. Since then, various sophisticated PEFT methods have been proposed, including methods that choose a sparse subset of parameters to train [25, 26], produce low-rank updates [13], perform optimization in a lower-dimensional subspace [27], add low-rank adapters using hypercomplex multiplication [28], and more. Relatedly, prompt tuning [14] and prefix tuning [29] concatenate learned continuous embeddings to the model’s input or activations to induce it to perform a task; this can be seen as a PEFT method [30]. State-of-the-art PEFT methods can match the performance of fine-tuning all of the model’s parameters while updating only a tiny fraction (e.g. 0.01%) of the model’s parameters.

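As a concrete illustration of the earliest of those approaches, here is a generic bottleneck-adapter sketch in PyTorch (our own minimal version of the idea in [22–24], not any specific released implementation; the bottleneck width is arbitrary):

```python
import torch.nn as nn

class Adapter(nn.Module):
    """A small trainable feed-forward network with a residual connection,
    inserted after a frozen sublayer of the pre-trained Transformer."""
    def __init__(self, d_model: int, bottleneck: int = 16):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)  # project down
        self.up = nn.Linear(bottleneck, d_model)    # project back up
        self.act = nn.ReLU()

    def forward(self, x):
        # Residual form: near-identity at initialization if `up` is zeroed.
        return x + self.up(self.act(self.down(x)))
```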

PEFT drastically reduces the memory and storage requirements for training and saving the model. In addition, certain PEFT methods straightforwardly allow mixed-task batches – for example, prompt tuning enables a single model to perform many tasks simply by concatenating different prompt embeddings to each example in the batch [14]. On the other hand, PEFT methods that re-parameterize the model (e.g. [27, 13]) are costly or onerous for mixed-task batches. Separately, different PEFT methods increase the computation and memory required to perform inference by different amounts. For example, adapters effectively add additional (small) layers to the model, resulting in small but non-negligible increases in computational costs and memory. An additional cost incurred by PEFT is the cost of fine-tuning itself, which must be performed once and is then amortized as the model is used for inference. However, we will show that PEFT can be dramatically more computationally efficient when considering both fine-tuning and inference while achieving better accuracy than ICL.

3 Designing the T-Few Recipe

Given that PEFT allows a model to be adapted to a new task with relatively small storage requirements and computational cost, we argue that PEFT presents a promising alternative to ICL. Our goal is therefore to develop a recipe that allows a model to attain high accuracy on new tasks with limited labeled examples while allowing mixed-task batches during inference and incurring minimal computational and storage costs. By recipe, we mean a specific model and hyperparameter setting that provides strong performance on any new task without manual tuning or per-task adjustments. In this way, we can ensure that our approach is a realistic option in few-shot settings where limited labeled data is available for evaluation [31, 32].

3.3 Parameter-efficient fine-tuning with (IA)³

In order to compare favorably to few-shot ICL, we need a PEFT method that has the following properties: First, it must add or update as few parameters as possible to avoid incurring storage and memory costs. Second, it should achieve strong accuracy after few-shot training on new tasks. Finally, it must allow for mixed-task batches, since that is a capability of ICL. In order to easily enable mixed-task batches, a PEFT method should ideally not modify the model itself. Otherwise, each example in a batch would effectively need to be processed by a different model or computational graph. A more convenient alternative is provided by methods that directly modify the activations of the model since this can be done independently and cheaply to each example in the batch according to which task the example corresponds to. Prompt tuning and prefix tuning methods [14, 29] work by concatenating learned vectors to activation or embedding sequences and are therefore examples of activation-modifying PEFT methods that allow for mixed-task batches. However, as we will discuss later, we were unable to attain reasonable accuracy with prompt tuning and found that the more performant PEFT methods did not allow for mixed-task batches. We therefore developed a new PEFT method that meets our desiderata.

As an alternative, we explored element-wise multiplication (i.e. rescaling) of the model’s activations against a learned vector. Specifically, we consider adaptation of the form l ⊙ x, where l ∈ ℝ^d is a learned task-specific vector, ⊙ represents element-wise multiplication, and x ∈ ℝ^{T×d} is a length-T sequence of activations. We use “broadcasting notation” [46] so that the (i, j)th entry of l ⊙ x is l_j x_{i,j}. In preliminary experiments, we found it was not necessary to introduce a learned rescaling vector for each set of activations in the Transformer model. Instead, we found it was sufficient to introduce rescaling vectors on the keys and values in self-attention and encoder-decoder attention mechanisms and on the intermediate activation of the position-wise feed-forward networks. Specifically, using the notation from Vaswani et al. [33], we introduce three learned vectors l_k ∈ ℝ^{d_k}, l_v ∈ ℝ^{d_v}, and l_ff ∈ ℝ^{d_ff}, which are introduced into the attention mechanisms as:

softmax( Q (l_k ⊙ K)^⊤ / √d_k ) (l_v ⊙ V)

and in the position-wise feed-forward networks as (l_ff ⊙ γ(W_1 x)) W_2, where γ is the feed-forward network nonlinearity. We introduce a separate set of l_k, l_v, and l_ff vectors in each Transformer layer block. This adds a total of L(d_k + d_v + d_ff) new parameters for an L-layer-block Transformer encoder and L(2d_k + 2d_v + d_ff) (with factors of 2 accounting for the presence of both self-attention and encoder-decoder attention) for an L-layer-block decoder. l_k, l_v, and l_ff are all initialized with ones so that the overall function computed by the model does not change when they are added. We call our method (IA)³, which stands for “Infused Adapter by Inhibiting and Amplifying Inner Activations”.

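The equations above translate almost line-for-line into code. Below is a minimal single-head PyTorch sketch of the (IA)³ rescaling (our own illustration under simplifying assumptions, not the released T-Few code: T0 uses multi-head attention and a different nonlinearity, here ReLU stands in for γ, and during fine-tuning only the l_* vectors would be trainable while the Linear weights stay frozen):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class IA3Attention(nn.Module):
    """Single-head attention with (IA)3 rescaling vectors l_k and l_v:
    softmax(Q (l_k * K)^T / sqrt(d_k)) @ (l_v * V)."""
    def __init__(self, d_model: int, d_k: int, d_v: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_k, bias=False)  # frozen in practice
        self.k = nn.Linear(d_model, d_k, bias=False)  # frozen in practice
        self.v = nn.Linear(d_model, d_v, bias=False)  # frozen in practice
        # Ones-initialized so the pre-trained model is unchanged at first.
        self.l_k = nn.Parameter(torch.ones(d_k))
        self.l_v = nn.Parameter(torch.ones(d_v))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (T, d_model)
        Q, K, V = self.q(x), self.k(x), self.v(x)
        scores = Q @ (self.l_k * K).T / math.sqrt(K.size(-1))
        return F.softmax(scores, dim=-1) @ (self.l_v * V)

class IA3FeedForward(nn.Module):
    """Position-wise feed-forward network with l_ff: (l_ff * gamma(W1 x)) W2."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)   # frozen in practice
        self.w2 = nn.Linear(d_ff, d_model, bias=False)   # frozen in practice
        self.l_ff = nn.Parameter(torch.ones(d_ff))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(self.l_ff * F.relu(self.w1(x)))   # ReLU stands in for gamma
```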

(IA)³ makes mixed-task batches possible because each sequence of activations in the batch can be separately and cheaply multiplied by its associated learned task vector. We also note that, in the event that a model will only be used on a single task, the modifications introduced by (IA)³ can also be applied to weight matrices permanently so that no element-wise multiplication is required and the model’s architecture remains unchanged. This is possible because the element-wise multiplications performed in (IA)³ always co-occur with a matrix multiplication, and l ⊙ Wx = (l ⊙ W)x. In this case, our method incurs no additional computational cost compared to the original model.

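A two-line numerical check of the identity l ⊙ Wx = (l ⊙ W)x that makes this fold-in possible (shapes chosen arbitrarily):

```python
import torch

d_out, d_in = 4, 3
W = torch.randn(d_out, d_in)
x = torch.randn(d_in)
l = torch.rand(d_out)

rescaled = l * (W @ x)             # (IA)3 applied at inference time
folded = (l.unsqueeze(1) * W) @ x  # l folded into the weight matrix once
assert torch.allclose(rescaled, folded)
```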

To validate (IA)³, we compare it to a large variety of existing adaptation methods in our setting of fine-tuning T0-3B on few-shot datasets from held-out tasks. Specifically, we compare with 9 strong PEFT methods: BitFit [47], which updates only the bias parameters; Adapters [23], which introduce task-specific layers after the self-attention and position-wise feed-forward networks; Compacter and Compacter++ [28], which improve upon adapters by using low-rank matrices and hypercomplex multiplication; prompt tuning [14], which learns task-specific prompt embeddings that are concatenated to the model’s input; FISH Mask [26], which chooses a subset of parameters to update based on their approximate Fisher information; Intrinsic SAID [27], which performs optimization in a low-dimensional subspace; prefix-tuning [29], which learns task-specific vectors that are concatenated to the model’s activations; and LoRA [13], which assigns low-rank updates to parameter matrices. Additionally, we include the baselines of full-model fine-tuning and updating only the layer normalization parameters. For certain methods that allow changing the parameter efficiency, we report results for different budgets: 0.2% and 0.02% sparsity for FISH Mask, 10 and 100 learned prompt vectors for prompt tuning, and 20,000- or 500,000-dimensional subspaces for Intrinsic SAID.

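Of those baselines, LoRA is perhaps the easiest to sketch. Below is a generic minimal version of the low-rank-update idea (our own illustration, not the experimental code used in the paper; the rank and initialization scales are arbitrary):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained weight plus a trainable low-rank update B @ A."""
    def __init__(self, d_in: int, d_out: int, rank: int = 4):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(d_out, d_in).normal_(std=0.02),
                                   requires_grad=False)  # pre-trained, frozen
        self.A = nn.Parameter(torch.empty(rank, d_in).normal_(std=0.01))
        self.B = nn.Parameter(torch.zeros(d_out, rank))  # zero init: no-op at start

    def forward(self, x):
        # Equivalent to x @ W.T with W replaced by W + B @ A.
        return x @ (self.weight + self.B @ self.A).T
```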

The results are shown in fig. 2, with detailed per-dataset results in appendix D. We find that (IA)³ is the only method that attains higher accuracy than the full-model-fine-tuning baseline. While other PEFT methods (e.g. Intrinsic SAID and prompt tuning) update or introduce fewer parameters, (IA)³ performs considerably better. Our results and setting differ from some past work on the PEFT methods we compare against. Mahabadi et al. [28] report that Compacter and Compacter++ outperform full-model fine-tuning, including in the few-shot setting. Lester et al. [14] found that prompt tuning could match full-model fine-tuning, and in subsequent work Wei et al. [48] found that prompt tuning performed well when applied to a multitask fine-tuned model in the few-shot setting. In both cases, we experimented with various hyperparameter choices to try to match past results. We hypothesize the disagreement comes from us using a different model and different datasets. For prompt tuning specifically, we noticed that the validation set performance could fluctuate wildly over the course of training, hinting at possible optimization issues.

5 Conclusion

We introduced T-Few, a parameter-efficient few-shot learning recipe that attains higher accuracy than few-shot ICL at a lower computational cost. T-Few uses (IA)³, a new PEFT method that rescales inner activations with learned vectors. Using (IA)³ produces better performance than fine-tuning the full model while only introducing a tiny amount of additional parameters. T-Few also uses two additional loss terms that encourage the model to output lower probabilities for incorrect choices and account for the length of different answer choices. When applying T-Few as-is (with no task-specific hyperparameter tuning or other changes) to the RAFT benchmark, we attained super-human performance for the first time and outperformed prior submissions by a large margin. Through detailed characterization of computational costs, we found that T-Few uses over 1,000× fewer FLOPs during inference than few-shot ICL with GPT-3 and only requires 30 minutes to train on a single NVIDIA A100 GPU. Since all of our experiments were on classification tasks, we are interested in applying T-Few to generative tasks such as summarization and question answering in future work. We hope our results provide a new perspective on how best to perform few-shot learning with large language models.
