Abstract
Unlike traditional supervised learning, which trains a model to take in an input x and predict an output y as P(y|x), prompt-based learning is based on language models that model the probability of text directly. To use these models to perform prediction tasks, the original input x is modified using a template into a textual string prompt x′ that has some unfilled slots, and then the language model is used to probabilistically fill the unfilled information to obtain a final string x̂, from which the final output y can be derived.
The authors' account of prompt learning in the abstract: the input x is passed through a template to obtain x′, a text string with unfilled slots; a language model then fills in the blanks to produce x̂, from which the final output y can be derived.
This feels somewhat similar to BERT's pre-training procedure; its advantage is that pre-training can be carried out on large-scale raw corpora.
In the schematic of prompt learning, the prompt appears to correspond to the "template" mentioned above, while the language model doing the fill-in corresponds to the existing "Sesame Street" family of pre-trained models.
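The x → x′ → x̂ → y pipeline from the abstract can be sketched at the string level. Everything here (the template, the slot marker [Z], and the hard-coded answer standing in for a real language model) is a hypothetical illustration, not the paper's implementation:

```python
def f_prompt(x: str) -> str:
    """Wrap input x in a cloze-style template with an answer slot [Z]."""
    return f"{x} I felt so [Z]."

x = "I missed the bus today."
x_prime = f_prompt(x)              # prompt x' with an unfilled slot
z = "sad"                          # a real LM would predict this slot filler
x_hat = x_prime.replace("[Z]", z)  # filled string x̂
y = "negative"                     # the answer z is then mapped to a label y
print(x_hat)  # -> "I missed the bus today. I felt so sad."
```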
Two major paradigm shifts in the development of NLP
Fully supervised learning, where a task-specific model is trained solely on a dataset of input-output examples for the target task, has long played a central role in many machine learning tasks, and natural language processing (NLP) was no exception. Because such fully supervised datasets are ever-insufficient for learning high-quality models, early NLP models relied heavily on feature engineering, where NLP researchers or engineers used their domain knowledge to define and extract salient features from raw data and provide models with the appropriate inductive bias to learn from this limited data.
With the advent of neural network models for NLP, salient features were learned jointly with the training of the model itself, and hence focus shifted to architecture engineering, where inductive bias was rather provided through the design of a suitable network architecture conducive to learning such features.
However, from 2017-2019 there was a sea change in the learning of NLP models, and this fully supervised paradigm is now playing an ever-shrinking role. Specifically, the standard shifted to the pre-train and fine-tune paradigm. In this paradigm, a model with a fixed architecture is pre-trained as a language model (LM), predicting the probability of observed textual data. Because the raw textual data necessary to train LMs is available in abundance, these LMs can be trained on large datasets, in the process learning robust general-purpose features of the language it is modeling. The above pre-trained LM will be then adapted to different downstream tasks by introducing additional parameters and fine-tuning them using task-specific objective functions. Within this paradigm, the focus turned mainly to objective engineering, designing the training objectives used at both the pre-training and fine-tuning stages. For example, Zhang et al. (2020a) show that introducing a loss function of predicting salient sentences from a document will lead to a better pre-trained model for text summarization. Notably, the main body of the pre-trained LM is generally (but not always) fine-tuned as well to make it more suitable for solving the downstream task.
Here the authors summarize the first major shift in NLP:
- Feature engineering (traditional machine learning): fully supervised learning dominated, but the annotated corpora it requires are hard to obtain. Research in this period focused on manually defining and extracting features, giving machine learning models a better inductive bias for learning from limited data.
- Architecture engineering (early neural networks): a major advantage of neural networks is that features need not be hand-designed; the network extracts them automatically. Research in this period focused instead on designing network architectures better suited to the task, strengthening the model's feature-extraction ability.
- Objective engineering (rise of pre-trained models): fully supervised learning is always constrained by limited labeled corpora, whereas the unsupervised pre-training of language models can exploit abundant unlabeled text without such limits. "Pre-train + fine-tune" therefore became the mainstream paradigm, and research shifted to designing training objectives (loss functions) for the pre-training and fine-tuning stages.
Now, as of this writing in 2021, we are in the middle of a second sea change, in which the “pre-train, fine-tune” procedure is replaced by one in which we dub “pre-train, prompt, and predict”. In this paradigm, instead of adapting pre-trained LMs to downstream tasks via objective engineering, downstream tasks are reformulated to look more like those solved during the original LM training with the help of a textual prompt.
For example, when recognizing the emotion of a social media post, "I missed the bus today.", we may continue with a prompt "I felt so ", and ask the LM to fill the blank with an emotion-bearing word. Or if we choose the prompt "English: I missed the bus today. French: ", an LM may be able to fill in the blank with a French translation. In this way, by selecting the appropriate prompts we can manipulate the model behavior so that the pre-trained LM itself can be used to predict the desired output, sometimes even without any additional task-specific training (Tab. 1). The advantage of this method is that, given a suite of appropriate prompts, a single LM trained in an entirely unsupervised fashion can be used to solve a great number of tasks.
However, as with most conceptually enticing prospects, there is a catch – this method introduces the necessity for prompt engineering, finding the most appropriate prompt to allow an LM to solve the task at hand.
The authors argue that we are now in the middle of this second major shift: the introduction of prompt learning.
Here they give two examples of prompt learning:
- "I missed the bus today." -> "I missed the bus today. I felt so ___" (sentiment classification)
- "I missed the bus today." -> "English: I missed the bus today. French: ______" (machine translation)
As we know, these pre-trained models are very good at filling in blanks, so we recast the problem slightly into a cloze task; then the pre-trained model alone can complete it (seemingly without any task-specific downstream training).
Research at this stage therefore focuses on prompt engineering: designing suitable prompts so that the pre-trained model solves the problem better.
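The two examples above differ only in their templates: one unsupervised LM can serve both tasks if we swap the prompt. A minimal sketch, with hypothetical template strings and a [Z] slot marker of my own choosing:

```python
# Hypothetical per-task templates; {x} is the input slot, [Z] the answer slot.
templates = {
    "sentiment":   "{x} I felt so [Z]",
    "translation": "English: {x} French: [Z]",
}

def make_prompt(task: str, x: str) -> str:
    """Reformulate input x as a fill-in-the-blank prompt for the given task."""
    return templates[task].format(x=x)

print(make_prompt("sentiment", "I missed the bus today."))
# -> "I missed the bus today. I felt so [Z]"
print(make_prompt("translation", "I missed the bus today."))
# -> "English: I missed the bus today. French: [Z]"
```

The same LM would then fill [Z] with an emotion word in the first case and a French sentence in the second.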
A formal description of prompt learning
In the Prompt Addition step, a prompting function f_prompt(x) is applied to modify the input text x into a prompt x′ = f_prompt(x). In the majority of previous work (Kumar et al., 2016; McCann et al., 2018; Radford et al., 2019; Schick and Schütze, 2021a), this function consists of a two-step process:
- Apply a template, which is a textual string that has two slots: an input slot [X] for input x and an answer slot [Z] for an intermediate generated answer text z that will later be mapped into y.
- Fill slot [X] with the input text x.
In the Answer Search step, we search for the highest-scoring text ẑ that maximizes the score of the LM. We first define Z as a set of permissible values for z. Z could range from the entirety of the language in the case of generative tasks, or could be a small subset of the words in the language in the case of classification, such as defining Z = {"excellent", "good", "OK", "bad", "horrible"} to represent each of the classes in Y = {++, +, ~, -, --}.
We then define a function f_fill(x′, z) that fills in the location [Z] in prompt x′ with the potential answer z. We will call any prompt that has gone through this process a filled prompt. Particularly, if the prompt is filled with a true answer, we will refer to it as an answered prompt (Tab. 2 shows an example). Finally, we search over the set of potential answers z by calculating the probability of their corresponding filled prompts using a pre-trained LM.
This search function could be an argmax search that searches for the highest-scoring output, or sampling that randomly generates outputs following the probability distribution of the LM.
In the Answer Mapping step, we would like to go from the highest-scoring answer ẑ to the highest-scoring output ŷ. This is trivial in some cases, where the answer itself is the output (as in language generation tasks such as translation), but there are also other cases where multiple answers could result in the same output. For example, one may use multiple different sentiment-bearing words (e.g. "excellent", "fabulous", "wonderful") to represent a single class (e.g. "++"), in which case it is necessary to have a mapping between the searched answer and the output value.
The paper notes that prompt learning proceeds in roughly three steps:
- Prompt Addition: apply the template, i.e. add the prompt text around the input string
- Answer Search: during answer search, the size of the candidate set Z can differ depending on the task
- Answer Mapping: depending on the task, sometimes Output = Answer, and sometimes Output = Map(Answer)
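The three steps above can be sketched end-to-end for the paper's sentiment example. The template, the candidate set Z, the Z → Y mapping, and especially the toy `lm_score` function (a stand-in for a real pre-trained LM's probability of the filled prompt) are all illustrative assumptions:

```python
Z = ["excellent", "good", "OK", "bad", "horrible"]          # permissible answers
answer_to_label = {"excellent": "++", "good": "+", "OK": "~",
                   "bad": "-", "horrible": "--"}            # mapping Z -> Y

def f_prompt(x):
    """Step 1, Prompt Addition: wrap x in a template with answer slot [Z]."""
    return f"{x} Overall, it was a [Z] experience."

def f_fill(x_prime, z):
    """Fill slot [Z] in prompt x' with candidate answer z (a filled prompt)."""
    return x_prime.replace("[Z]", z)

def lm_score(text):
    """Toy stand-in for P_LM(filled prompt); a real LM would score `text`."""
    return 1.0 if "bad" in text else 0.1

def predict(x):
    x_prime = f_prompt(x)
    # Step 2, Answer Search: argmax over the scores of all filled prompts.
    z_hat = max(Z, key=lambda z: lm_score(f_fill(x_prime, z)))
    # Step 3, Answer Mapping: map the best answer ẑ to the output ŷ.
    return answer_to_label[z_hat]

print(predict("I missed the bus today."))  # -> "-"
```

As the paper notes, the argmax here could be replaced by sampling from the LM's distribution, and for generative tasks Z would be unrestricted text rather than a five-word list.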
An overview of some pre-trained language models
Standard Language Modeling (SLM)
Training the model to optimize the probability P(x) of text from a training corpus.
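In standard (autoregressive) language modeling, P(x) factorizes into per-token conditionals, P(x) = ∏_t P(x_t | x_<t), and training maximizes this likelihood. A minimal sketch of how such a probability is computed; the hard-coded bigram conditionals are a toy stand-in for a neural LM's predicted distributions:

```python
import math

def log_prob(tokens, cond_prob):
    """log P(x) = sum_t log P(x_t | x_<t); here each conditional depends
    only on the previous token (a bigram approximation)."""
    total = 0.0
    for prev, cur in zip(["<s>"] + tokens[:-1], tokens):
        total += math.log(cond_prob[(prev, cur)])
    return total

# Toy conditional probabilities P(cur | prev), keyed by (prev, cur).
cond_prob = {("<s>", "the"): 0.5, ("the", "cat"): 0.25, ("cat", "sat"): 0.5}
print(log_prob(["the", "cat", "sat"], cond_prob))  # log(0.5 * 0.25 * 0.5)
```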