LLMs之InstructGPT:《Training language models to follow instructions with human feedback》翻译与解读



导读

>> 使用人类偏好微调大型语言模型能显著提高其任务表现:这篇论文的核心是使用人类反馈对GPT-3进行微调,构建了InstructGPT模型。通过收集人工标注者撰写的示范数据和对模型输出的排序比较数据,并使用基于人类反馈的强化学习(RLHF)对GPT-3进行微调,提高了其遵循人类指令的能力。实验结果表明,相比175B参数的GPT-3,1.3B参数的InstructGPT模型在遵循指令的能力上更加出色。这显示了用人类反馈微调语言模型、使其与人类意图对齐是一个很有前景的方向。
>> 提出一种通过人类反馈训练语言模型(RLHF)执行指令的方法:该论文介绍了一种新的训练方法,旨在使语言模型能够理解和执行人类指令。通过将人类反馈作为指导,模型可以更好地学习语言的语义和结构,从而提高性能和实用性。实验结果表明,这种方法可以显著提高模型的性能和实用性。

>> InstructGPT = SFT+PPO算法最大化RM模型:使用RL方法对GPT-3进行微调,通过RLHF的标记数据和RM模型来使其遵循广泛的书面指令类别,从而实现对齐。
>> 提高了真实性、降低了毒性,但偏见未见改善:InstructGPT模型在真实性方面优于GPT-3,在毒性方面也略有改善,但在偏见方面没有明显改善。

>> InstructGPT模型能够泛化至训练数据外的新指令,但仍存在简单错误。

>> 指出相对于预训练,提升模型对齐所需的成本很小,性价比更高。

目录

《Training language models to follow instructions with human feedback》翻译与解读

Abstract

1、Introduction引言

通过训练语言模型按照用户意图行动,推动语言模型与用户的对齐

SFT+PPO算法最大化RM模型:使用RL方法对GPT-3进行微调,通过RLHF的标记数据和RM模型来使其遵循广泛的书面指令类别,从而实现对齐

标注者明显更喜欢InstructGPT的输出,而不是GPT-3的输出Labelers significantly prefer InstructGPT outputs over outputs from GPT-3.

InstructGPT模型在真实性方面优于GPT-3。InstructGPT models show improvements in truthfulness over GPT-3.

InstructGPT在毒性方面比GPT-3略有改善,但偏见方面没有改善InstructGPT shows small improvements in toxicity over GPT-3, but not bias.

可以通过修改RLHF微调过程来最小化在公共NLP数据集上的性能退化。We can minimize performance regressions on public NLP datasets by modifying our RLHF fine-tuning procedure.  

模型能泛化到训练数据外的标注者偏好——模型适用于没有产生任何训练数据的“留存”标注者的偏好。Our models generalize to the preferences of “held-out” labelers that did not produce any training data.

公共NLP数据集不能反映我们语言模型的使用方式。Public NLP datasets are not reflective of how our language models are used.

InstructGPT模型在RLHF微调分布之外的指令上展现出了很好的泛化能力。InstructGPT models show promising generalization to instructions outside of the RLHF fine-tuning distribution.

InstructGPT仍然会犯一些简单的错误。InstructGPT still makes simple mistakes.

Overall—使用人类偏好微调大型语言模型显著提高其任务表现

2、Related work相关工作

对齐和从人类反馈中学习的研究。Research on alignment and learning from human feedback.

训练语言模型遵循指示可以提高下游性能。 Training language models to follow instructions.

评估语言模型的危害。Evaluating the harms of language models.   

修改语言模型行为以减轻危害—数据过滤/阻止某些词汇/词嵌入正则化。Modifying the behavior of language models to mitigate harms.

3、Methods and experimental details方法和实验细节

3.1、High-level methodology总体方法—收集示范数据(SFT)+收集比较数据(RM)+优化策略(PPO)

3.2、Dataset数据集:SFT数据集+RM数据集+PPO数据集

3.3、Tasks任务:训练任务来源于标注者写的提示和API上提交的提示,目标是培养模型按照用户指令回应真实且无害

3.4、Human data collection人类数据收集:通过严格的标注者筛选,培养了能够根据任务需求,辨别有害输出的标注者团队,来训练和评估语言模型。

3.5、Models模型

T1、监督微调Supervised fine-tuning (SFT).

T2、奖励建模Reward modeling (RM).

T3、强化学习Reinforcement learning (RL).

基线Baselines.

3.6、Evaluation评估

API分布的评估Evaluations on API distribution.  

公开NLP数据集的评估Evaluations on public NLP datasets.  

4、Results结果

4.1、API分布的结果Results on the API distribution

我们的模型推广到“保留”标注员的偏好上,这些标注员没有产生任何训练数据。Our models generalize to the preferences of "held-out" labelers that did not produce any training data.

公开的NLP数据集不能反映我们的语言模型的使用情况。Public NLP datasets are not reflective of how our language models are used.

我们认为InstructGPT模型优于FLAN和T0模型有两个原因。We believe our InstructGPT model outperforms FLAN and T0 for two reasons

4.2、Results on public NLP datasets在公开的NLP数据集上的结果

与GPT-3相比,InstructGPT在毒性方面略有改善,但偏见方面没有改善。InstructGPT shows small improvements in toxicity over GPT-3, but not bias.

我们可以通过修改RLHF微调过程来减少在公开的NLP数据集上的性能下降。We can minimize performance regressions on public NLP datasets by modifying our RLHF fine-tuning procedure.

4.3、Qualitative results定性结果

InstructGPT模型展示了对RLHF微调分布之外指令的有希望的泛化能力。InstructGPT models show promising generalization to instructions outside of the RLHF fine-tuning distribution.

InstructGPT仍然会犯简单错误InstructGPT still makes simple mistakes.

5、Discussion

5.1、Implications for alignment research对齐研究的意义

相对于预训练,增加模型对齐的成本性价比更高The cost of increasing model alignment is modest relative to pretraining

可知InstructGPT在我们不进行监督的设置中泛化了“按照指示”的能力We’ve seen some evidence that InstructGPT generalizes ‘following instructions’ to settings that we don’t supervise it in

降低对齐税—我们成功地减轻了我们的微调引入的大部分性能下降We were able to mitigate most of the performance degradations introduced by our fine-tuning

已在现实世界中验证了对齐技术的研究We’ve validated alignment techniques from research in the real world

5.2、Who are we aligning to?我们要对齐的对象是谁?

语言模型的最终行为受多个因素影响,包括模型本身、训练数据和对齐方法;我们主要对齐到标注者给出的偏好,但标注者的代表性有限

5.3、Limitations局限性:基于有限代表性的标注者偏好进行训练和调整,模型存在不完全对齐和不完全安全的问题

5.4、Open questions开放问题:用对齐技术微调语言模型有利,但仍存在诸多开放问题和挑战需要解决

5.5、Broader impacts更广泛的影响:对齐技术不能解决安全问题,需要更完善的监管和安全机制


《Training language models to follow instructions with human feedback》翻译与解读

时间

2022年3月4日

地址

https://arxiv.org/abs/2203.02155

作者

OpenAI

Abstract

Making language models bigger does not inherently make them better at following a user’s intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.

让语言模型变得更大并不必然使其更善于遵循用户的意图。例如,大型语言模型可能生成不真实、有毒或对用户毫无帮助的输出。换句话说,这些模型没有与用户对齐。在本文中,我们展示了一条通过人类反馈进行微调、在广泛任务上将语言模型与用户意图对齐的途径。我们从一组标注者编写的提示和通过OpenAI API提交的提示出发,收集了一个由标注者示范期望模型行为的数据集,并用它以监督学习的方式微调GPT-3。随后,我们收集了模型输出的排序数据集,并利用基于人类反馈的强化学习进一步微调这个监督模型。我们将得到的模型称为InstructGPT。在我们的提示分布上的人工评估中,尽管参数少了100倍,1.3B参数的InstructGPT模型的输出仍然比175B GPT-3的输出更受欢迎。此外,InstructGPT模型提高了真实性、减少了有害输出的生成,同时在公共NLP数据集上的性能退化很小。尽管InstructGPT仍会犯一些简单的错误,但我们的结果表明,利用人类反馈进行微调是使语言模型与人类意图对齐的一个有前景的方向

1、Introduction引言

通过训练语言模型按照用户意图行动,推动语言模型与用户的对齐

Large language models (LMs) can be “prompted” to perform a range of natural language processing (NLP) tasks, given some examples of the task as input. However, these models often express unintended behaviors such as making up facts, generating biased or toxic text, or simply not following user instructions (Bender et al., 2021; Bommasani et al., 2021; Kenton et al., 2021; Weidinger et al., 2021; Tamkin et al., 2021; Gehman et al., 2020). This is because the language modeling objective used for many recent large LMs—predicting the next token on a webpage from the internet—is different from the objective “follow the user’s instructions helpfully and safely” (Radford et al., 2019; Brown et al., 2020; Fedus et al., 2021; Rae et al., 2021; Thoppilan et al., 2022). Thus, we say that the language modeling objective is misaligned. Averting these unintended behaviors is especially important for language models that are deployed and used in hundreds of applications.

We make progress on aligning language models by training them to act in accordance with the user’s intention (Leike et al., 2018). This encompasses both explicit intentions such as following instructions and implicit intentions such as staying truthful, and not being biased, toxic, or otherwise harmful. Using the language of Askell et al. (2021), we want language models to be helpful (they should help the user solve their task), honest (they shouldn’t fabricate information or mislead the user), and harmless (they should not cause physical, psychological, or social harm to people or the environment). We elaborate on the evaluation of these criteria in Section 3.6.

大型语言模型(LMs)可以在给定一些任务示例作为输入的情况下,被“提示”去执行各种自然语言处理(NLP)任务。然而,这些模型经常表现出意料之外的行为,比如捏造事实、生成带有偏见或有毒的文本,或者干脆不遵循用户的指令(Bender等,2021;Bommasani等,2021;Kenton等,2021;Weidinger等,2021;Tamkin等,2021;Gehman等,2020)。这是因为许多最近的大型语言模型所使用的语言建模目标(预测互联网网页上的下一个token)不同于“有益且安全地遵循用户指令”这一目标(Radford等,2019;Brown等,2020;Fedus等,2021;Rae等,2021;Thoppilan等,2022)。因此,我们说语言建模目标是未对齐的。避免这些意外行为对于部署在数百个应用中的语言模型尤为重要。

我们通过训练语言模型按照用户的意图行动,来推进语言模型的对齐(Leike等,2018)。这既包括明确的意图(如遵循指令),也包括隐含的意图(如保持真实、不带偏见、不具毒性或不以其他方式造成伤害)。借用Askell等人(2021)的说法,我们希望语言模型是有益的(它们应该帮助用户解决任务)、诚实的(它们不应该捏造信息或误导用户)和无害的(它们不应该对人或环境造成身体、心理或社会伤害)。我们在第3.6节详细阐述了这些标准的评估。

Figure 1: Human evaluations of various models on our API prompt distribution, evaluated by how often outputs from each model were preferred to those from the 175B SFT model. Our InstructGPT models (PPO-ptx) as well as its variant trained without pretraining mix (PPO) significantly outperform the GPT-3 baselines (GPT, GPT prompted); outputs from our 1.3B PPO-ptx model are preferred to those from the 175B GPT-3. Error bars throughout the paper are 95% confidence intervals.

图1:各个模型在我们的API提示分布上的人工评估结果,评估指标是每个模型的输出相对于175B SFT模型的输出被偏好的频率。我们的InstructGPT模型(PPO-ptx)及其不混合预训练数据训练的变体(PPO)显著优于GPT-3基线(GPT、GPT prompted);我们的1.3B PPO-ptx模型的输出比175B GPT-3的输出更受偏好。本文中的误差线均为95%置信区间。

SFT+PPO算法最大化RM模型:使用RL方法对GPT-3进行微调,通过RLHF的标记数据和RM模型来使其遵循广泛的书面指令类别,从而实现对齐

We focus on fine-tuning approaches to aligning language models. Specifically, we use reinforcement learning from human feedback (RLHF; Christiano et al., 2017; Stiennon et al., 2020) to fine-tune GPT-3 to follow a broad class of written instructions (see Figure 2). This technique uses human preferences as a reward signal to fine-tune our models. We first hire a team of 40 contractors to label our data, based on their performance on a screening test (see Section 3.4 and Appendix B.1 for more details). We then collect a dataset of human-written demonstrations of the desired output behavior on (mostly English) prompts submitted to the OpenAI API3 and some labeler-written prompts, and use this to train our supervised learning baselines. Next, we collect a dataset of human-labeled comparisons between outputs from our models on a larger set of API prompts. We then train a reward model (RM) on this dataset to predict which model output our labelers would prefer. Finally, we use this RM as a reward function and fine-tune our supervised learning baseline to maximize this reward using the PPO algorithm (Schulman et al., 2017). We illustrate this process in Figure 2. This procedure aligns the behavior of GPT-3 to the stated preferences of a specific group of people (mostly our labelers and researchers), rather than any broader notion of “human values”; we discuss this further in Section 5.2. We call the resulting models InstructGPT.

We mainly evaluate our models by having our labelers rate the quality of model outputs on our test set, consisting of prompts from held-out customers (who are not represented in the training data). We also conduct automatic evaluations on a range of public NLP datasets.  We train three model sizes (1.3B, 6B, and 175B parameters), and all of our models use the GPT-3 architecture. Our main findings are as follows:

我们专注于通过微调方法来对齐语言模型。具体来说,我们使用基于人类反馈的强化学习(RLHF;Christiano等,2017;Stiennon等,2020)来微调GPT-3,使其遵循广泛类别的书面指令(见图2)。这种技术将人类偏好作为奖励信号来微调我们的模型。我们首先根据筛选测试中的表现,聘请了一个由40名承包商组成的团队来标注我们的数据(详见第3.4节和附录B.1)。然后,我们针对提交到OpenAI API的(主要为英文的)提示以及一些标注员编写的提示,收集了一组人工编写的期望输出行为示范,并用这些数据训练我们的监督学习基线。接下来,我们在更大的API提示集合上收集了模型输出之间的人工标注比较数据。然后,我们在这个数据集上训练一个奖励模型(RM),用于预测标注员更偏好哪个模型输出。最后,我们将这个RM作为奖励函数,使用PPO算法(Schulman等,2017)微调我们的监督学习基线以最大化该奖励。我们在图2中说明了这个过程。该过程使GPT-3的行为与特定人群(主要是我们的标注员和研究人员)所声明的偏好对齐,而不是与任何更广泛意义上的“人类价值观”对齐;我们在第5.2节进一步讨论这一点。我们将得到的模型称为InstructGPT

我们主要通过让标注员对测试集上的模型输出进行质量评分来评估我们的模型,该测试集由留存客户(未出现在训练数据中)提交的提示组成。我们还在一系列公共NLP数据集上进行自动评估。我们训练了三种模型规模(1.3B、6B和175B参数),我们所有的模型都使用GPT-3架构。我们的主要发现如下:

Figure 2: A diagram illustrating the three steps of our method: (1) supervised fine-tuning (SFT), (2) reward model (RM) training, and (3) reinforcement learning via proximal policy optimization (PPO) on this reward model. Blue arrows indicate that this data is used to train one of our models. In Step 2, boxes A-D are samples from our models that get ranked by labelers. See Section 3 for more details on our method.

图2:说明我们方法的三个步骤的图示:(1)监督微调(SFT),(2)奖励模型(RM)训练,以及(3)基于该奖励模型、通过近端策略优化(PPO)进行强化学习。蓝色箭头表示这些数据被用于训练我们的某个模型。在步骤2中,A-D框是我们模型的样本,由标注员进行排序。关于我们方法的更多细节,请参见第3节。

标注者明显更喜欢InstructGPT的输出,而不是GPT-3的输出Labelers significantly prefer InstructGPT outputs over outputs from GPT-3.

On our test set, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having over 100x fewer parameters. These models have the same architecture, and differ only by the fact that InstructGPT is fine-tuned on our human data. This result holds true even when we add a few-shot prompt to GPT-3 to make it better at following instructions. Outputs from our 175B InstructGPT are preferred to 175B GPT-3 outputs 85 ± 3% of the time, and preferred 71 ± 4% of the time to few-shot 175B GPT-3. InstructGPT models also generate more appropriate outputs according to our labelers, and more reliably follow explicit constraints in the instruction.

在我们的测试集上,尽管参数数量不到175B GPT-3的1/100,1.3B参数的InstructGPT模型的输出仍然更受偏好。这些模型具有相同的架构,唯一的区别是InstructGPT在我们的人类数据上进行了微调。即使我们给GPT-3添加few-shot提示使其更好地遵循指令,这一结果仍然成立。与175B GPT-3相比,我们的175B InstructGPT的输出在85 ± 3%的情况下被偏好;与few-shot的175B GPT-3相比,这一比例为71 ± 4%。根据我们的标注员,InstructGPT模型的输出更为恰当,并且能更可靠地遵循指令中的明确约束。

InstructGPT模型在真实性方面优于GPT-3。InstructGPT models show improvements in truthfulness over GPT-3.

On the TruthfulQA benchmark, InstructGPT generates truthful and informative answers about twice as often as GPT-3. Our results are equally strong on the subset of questions that were not adversarially selected against GPT-3. On “closed-domain” tasks from our API prompt distribution, where the output should not contain information that is not present in the input (e.g. summarization and closed-domain QA), InstructGPT models make up information not present in the input about half as often as GPT-3 (a 21% vs. 41% hallucination rate, respectively).

在TruthfulQA基准测试中,InstructGPT生成真实且有信息量的答案的频率约为GPT-3的两倍。在那些并非针对GPT-3进行对抗性挑选的问题子集上,我们的结果同样显著。在来自我们API提示分布的“封闭域”任务中(输出不应包含输入中不存在的信息,例如摘要和封闭域问答),InstructGPT模型编造输入中不存在信息的频率约为GPT-3的一半(幻觉率分别为21%和41%)。

InstructGPT在毒性方面比GPT-3略有改善,但偏见方面没有改善InstructGPT shows small improvements in toxicity over GPT-3, but not bias.

To measure toxicity, we use the RealToxicityPrompts dataset (Gehman et al., 2020) and conduct both automatic and human evaluations. InstructGPT models generate about 25% fewer toxic outputs than GPT-3 when prompted to be respectful. InstructGPT does not significantly improve over GPT-3 on the Winogender (Rudinger et al., 2018) and CrowSPairs (Nangia et al., 2020) datasets.

为了衡量毒性,我们使用RealToxicityPrompts数据集(Gehman等,2020)进行了自动评估和人工评估。在提示其保持尊重时,InstructGPT模型生成的有毒输出比GPT-3少约25%。然而,在Winogender(Rudinger等,2018)和CrowSPairs(Nangia等,2020)数据集上,InstructGPT相比GPT-3没有显著改善。

可以通过修改RLHF微调过程来最小化在公共NLP数据集上的性能退化。We can minimize performance regressions on public NLP datasets by modifying our RLHF fine-tuning procedure.  

During RLHF fine-tuning, we observe performance regressions compared to GPT-3 on certain public NLP datasets, notably SQuAD (Rajpurkar et al., 2018), DROP (Dua et al., 2019), HellaSwag (Zellers et al., 2019), and WMT 2015 French to English translation (Bojar et al., 2015). This is an example of an “alignment tax” since our alignment procedure comes at the cost of lower performance on certain tasks that we may care about. We can greatly reduce the performance regressions on these datasets by mixing PPO updates with updates that increase the log likelihood of the pretraining distribution (PPO-ptx), without compromising labeler preference scores.

在RLHF微调过程中,与GPT-3相比,我们观察到在某些公共NLP数据集上的性能退化,特别是SQuAD(Rajpurkar等,2018)、DROP(Dua等,2019)、HellaSwag(Zellers等,2019)和WMT 2015法英翻译(Bojar等,2015)。这是一个“对齐成本”的例子,因为我们的对齐过程以牺牲我们关心的某些任务的性能为代价。通过将PPO更新与增加预训练分布的对数似然的更新相结合(PPO-ptx),我们可以大大减少这些数据集上的性能退化,而不会损害标注者的偏好分数。

模型能泛化到训练数据外的标注者偏好——模型适用于没有产生任何训练数据的“留存”标注者的偏好。Our models generalize to the preferences of “held-out” labelers that did not produce any training data.

To test the generalization of our models, we conduct a preliminary experiment with held-out labelers, and find that they prefer InstructGPT outputs to outputs from GPT-3 at about the same rate as our training labelers. However, more work is needed to study how these models perform on broader groups of users, and how they perform on inputs where humans disagree about the desired behavior.

为了测试我们模型的泛化能力,我们进行了一项初步实验,使用留存的标注者进行测试,并发现他们与我们的训练标注者一样更喜欢InstructGPT的输出,而不是来自GPT-3的输出。然而,还需要进一步研究这些模型在更广泛的用户群体中的表现,以及它们在人们对所需行为存在分歧的输入上的表现。

公共NLP数据集不能反映我们语言模型的使用方式。Public NLP datasets are not reflective of how our language models are used.

We compare GPT-3 fine-tuned on our human preference data (i.e. InstructGPT) to GPT-3 fine-tuned on two different compilations of public NLP tasks: the FLAN (Wei et al., 2021) and T0 (Sanh et al., 2021) (in particular, the T0++ variant). These datasets consist of a variety of NLP tasks, combined with natural language instructions for each task. On our API prompt distribution, our FLAN and T0 models perform slightly worse than our SFT baseline, and labelers significantly prefer InstructGPT to these models (InstructGPT has a 73.4 ±2% winrate vs. our baseline, compared to 26.8 ±2% and 29.8 ±2% for our version of T0 and FLAN, respectively).

我们将在人类偏好数据上微调的GPT-3(即InstructGPT)与在两个不同的公共NLP任务汇编数据集上微调的GPT-3进行比较:FLAN(Wei等,2021)和T0(Sanh等,2021)(特别是T0++变体)。这些数据集包含各种NLP任务,并为每个任务配有自然语言指令。在我们的API提示分布上,我们的FLAN和T0模型的表现略逊于我们的SFT基线,而标注者明显更喜欢InstructGPT而不是这些模型(相对于基线模型,InstructGPT的胜率为73.4 ±2%,而我们版本的T0和FLAN的胜率分别为26.8 ±2%和29.8 ±2%)。

InstructGPT模型在RLHF微调分布之外的指令上展现出了很好的泛化能力。InstructGPT models show promising generalization to instructions outside of the RLHF fine-tuning distribution.

We qualitatively probe InstructGPT’s capabilities, and find that it is able to follow instructions for summarizing code, answer questions about code, and sometimes follows instructions in different languages, despite these instructions being very rare in the fine-tuning distribution. In contrast, GPT-3 can perform these tasks but requires more careful prompting, and does not usually follow instructions in these domains. This result is exciting because it suggests that our models are able to generalize the notion of “following instructions.” They retain some alignment even on tasks for which they get very little direct supervision signal.

我们对InstructGPT的能力进行了定性探究,发现它能够遵循指令对代码进行摘要、回答有关代码的问题,有时还能遵循其他语言的指令,尽管这些指令在微调分布中非常罕见。相比之下,GPT-3也能完成这些任务,但需要更谨慎的提示,并且通常不能遵循这些领域中的指令。这个结果令人振奋,因为它表明我们的模型能够泛化“遵循指令”这一概念:即使在几乎没有直接监督信号的任务上,它们仍然保持了一定的对齐性。

InstructGPT仍然会犯一些简单的错误。InstructGPT still makes simple mistakes.

For example, InstructGPT can still fail to follow instructions, make up facts, give long hedging answers to simple questions, or fail to detect instructions with false premises.

例如,InstructGPT仍然可能无法遵循指令,虚构事实,对简单问题给出冗长的回避回答,或无法识别具有错误前提的指令。

Overall—使用人类偏好微调大型语言模型显著提高其任务表现

Overall, our results indicate that fine-tuning large language models using human preferences significantly improves their behavior on a wide range of tasks, though much work remains to be done to improve their safety and reliability.

The rest of this paper is structured as follows: We first detail related work in Section 2, before diving into our method and experiment details in Section 3, including our high-level methodology (3.1), task and dataset details (3.3 and 3.2), human data collection (3.4), how we trained our models (3.5), and our evaluation procedure (3.6). We then present our results in Section 4, divided into three parts: results on the API prompt distribution (4.1), results on public NLP datasets (4.2), and qualitative results (4.3). Finally we give an extended discussion of our work in Section 5, including implications for alignment research (5.1), what we are aligning to (5.2), limitations (5.3), open questions (5.4), and broader impacts of this work (5.5).

总的来说,我们的结果表明,使用人类偏好进行大规模语言模型的微调显著改善了它们在各种任务上的表现,但还有很多工作需要做,以提高它们的安全性和可靠性。

本文的剩余部分结构如下:我们首先在第2节详细介绍相关工作,然后在第3节中深入介绍我们的方法和实验细节,包括我们的高级方法(3.1)、任务和数据集细节(3.3和3.2)、人类数据收集(3.4)、模型训练(3.5)和评估过程(3.6)。然后,在第4节中展示我们的结果,分为三个部分:在API提示分布上的结果(4.1)、在公共NLP数据集上的结果(4.2)和定性结果(4.3)。最后,在第5节中对我们的工作进行了广泛的讨论,包括对齐研究的影响(5.1)、我们对齐的内容(5.2)、限制(5.3)、开放问题(5.4)以及这项工作的广泛影响(5.5)。

2、Related work相关工作

对齐和从人类反馈中学习的研究。Research on alignment and learning from human feedback.

We build on previous techniques to align models with human intentions, particularly reinforcement learning from human feedback (RLHF). Originally developed for training simple robots in simulated environments and Atari games (Christiano et al., 2017; Ibarz et al., 2018), it has recently been applied to fine-tuning language models to summarize text (Ziegler et al., 2019; Stiennon et al., 2020; Böhm et al., 2019; Wu et al., 2021). This work is in turn influenced by similar work using human feedback as a reward in domains such as dialogue (Jaques et al., 2019; Yi et al., 2019; Hancock et al., 2019), translation (Kreutzer et al., 2018; Bahdanau et al., 2016), semantic parsing (Lawrence and Riezler, 2018), story generation (Zhou and Xu, 2020), review generation (Cho et al., 2018), and evidence extraction (Perez et al., 2019). Madaan et al. (2022) use written human feedback to augment prompts and improve the performance of GPT-3. There has also been work on aligning agents in text-based environments using RL with a normative prior (Nahian et al., 2021). Our work can be seen as a direct application of RLHF to aligning language models on a broad distribution of language tasks.

我们构建在先前的技术基础上,将模型与人类意图对齐,特别是从人类反馈中进行强化学习(RLHF)。最初是为了在模拟环境和Atari游戏中训练简单机器人而开发的(Christiano等,2017;Ibarz等,2018),最近已被应用于对语言模型进行微调以进行文本摘要(Ziegler等,2019;Stiennon等,2020;Böhm等,2019;Wu等,2021)。这项工作受到了类似工作的影响,这些工作在对话(Jaques等,2019;Yi等,2019;Hancock等,2019)、翻译(Kreutzer等,2018;Bahdanau等,2016)、语义解析(Lawrence和Riezler,2018)、故事生成(Zhou和Xu,2020)、评论生成(Cho等,2018)和证据提取(Perez等,2019)等领域使用人类反馈作为奖励。Madaan等(2022)使用书面人类反馈来增强提示并提高GPT-3的性能。还有关于使用规范先验的RL在基于文本的环境中对齐代理的工作(Nahian等,2021)。我们的工作可以看作是RLHF在广泛的语言任务分布上对齐语言模型的直接应用。

The question of what it means for language models to be aligned has also received attention recently (Gabriel, 2020). Kenton et al. (2021) catalog behavioral issues in LMs that result from misalignment, including producing harmful content and gaming misspecified objectives. In concurrent work, Askell et al. (2021) propose language assistants as a testbed for alignment research, study some simple baselines, and their scaling properties.

关于语言模型对齐的含义这一问题最近也受到了关注(Gabriel,2020)。Kenton等(2021)梳理了由于未对齐而导致的LM行为问题,包括生成有害内容和钻错误设定目标的空子。在同期的工作中,Askell等(2021)提出将语言助手作为对齐研究的测试平台,研究了一些简单的基线及其随规模扩展的性质。

训练语言模型遵循指示可以提高下游性能。 Training language models to follow instructions.

Our work is also related to research on cross-task generalization in language models, where LMs are fine-tuned on a broad range of public NLP datasets (usually prefixed with an appropriate instruction) and evaluated on a different set of NLP tasks. There has been a range of work in this domain (Yi et al., 2019; Mishra et al., 2021; Wei et al., 2021; Khashabi et al., 2020; Sanh et al., 2021; Aribandi et al., 2021), which differ in training and evaluation data, formatting of instructions, size of pretrained models, and other experimental details. A consistent finding across studies is that fine-tuning LMs on a range of NLP tasks, with instructions, improves their downstream performance on held-out tasks, both in the zero-shot and few-shot settings.

我们的工作还与语言模型跨任务泛化的研究相关,其中LMs在广泛的公共NLP数据集上进行微调(通常以适当的指令为前缀),并在另一组不同的NLP任务上进行评估。这个领域已有很多工作(Yi等,2019;Mishra等,2021;Wei等,2021;Khashabi等,2020;Sanh等,2021;Aribandi等,2021),它们在训练和评估数据、指令的格式、预训练模型的规模和其他实验细节上有所不同。各项研究的一个一致发现是:在一系列NLP任务上结合指令对LMs进行微调,可以提高它们在留存任务上的下游性能,无论是零样本(zero-shot)还是少样本(few-shot)设置。

There is also a related line of work on instruction following for navigation, where models are trained to follow natural language instructions to navigate in a simulated environment (Bahdanau et al., 2018; Abramson et al., 2020; Zhao et al., 2021).

还有一系列关于导航指令遵循的研究,其中模型被训练以遵循自然语言指令在模拟环境中导航(Bahdanau等,2018;Abramson等,2020;Zhao等,2021)。

评估语言模型的危害。Evaluating the harms of language models.   

A goal of modifying the behavior of language models is to mitigate the harms of these models when they’re deployed in the real world. These risks have been extensively documented (Bender et al., 2021; Bommasani et al., 2021; Kenton et al., 2021; Weidinger et al., 2021; Tamkin et al., 2021). Language models can produce biased outputs (Dhamala et al., 2021; Liang et al., 2021; Manela et al., 2021; Caliskan et al., 2017; Kirk et al., 2021), leak private data (Carlini et al., 2021), generate misinformation (Solaiman et al., 2019; Buchanan et al., 2021), and be used maliciously; for a thorough review we direct the reader to Weidinger et al. (2021). Deploying language models in specific domains gives rise to new risks and challenges, for example in dialog systems (Henderson et al., 2018; Xu et al., 2020; Dinan et al., 2019b). There is a nascent but growing field that aims to build benchmarks to concretely evaluate these harms, particularly around toxicity (Gehman et al., 2020), stereotypes (Nadeem et al., 2020), and social bias (Dhamala et al., 2021; Nangia et al., 2020; Rudinger et al., 2018). Making significant progress on these problems is hard since well-intentioned interventions on LM behavior can have side-effects (Welbl et al., 2021; Blodgett et al., 2020); for instance, efforts to reduce the toxicity of LMs can reduce their ability to model text from under-represented groups, due to prejudicial correlations in the training data (Xu et al., 2021).

改变语言模型行为的目标,是在这些模型部署到真实世界时减轻它们造成的危害。这些风险已有广泛的文献记录(Bender等,2021;Bommasani等,2021;Kenton等,2021;Weidinger等,2021;Tamkin等,2021)。语言模型可能会产生有偏见的输出(Dhamala等,2021;Liang等,2021;Manela等,2021;Caliskan等,2017;Kirk等,2021),泄露私人数据(Carlini等,2021),生成错误信息(Solaiman等,2019;Buchanan等,2021),并被恶意使用;关于这些问题的详细综述,我们建议读者参阅Weidinger等(2021)。在特定领域部署语言模型会带来新的风险和挑战,例如在对话系统中(Henderson等,2018;Xu等,2020;Dinan等,2019b)。一个新兴且不断发展的领域正致力于建立基准来具体评估这些危害,特别是在毒性(Gehman等,2020)、刻板印象(Nadeem等,2020)和社会偏见(Dhamala等,2021;Nangia等,2020;Rudinger等,2018)方面。在这些问题上取得重大进展很困难,因为针对LM行为的善意干预可能会产生副作用(Welbl等,2021;Blodgett等,2020);例如,由于训练数据中带有偏见的相关性,降低LM毒性的努力可能会削弱它们对代表性不足群体文本的建模能力(Xu等,2021)。

修改语言模型行为以减轻危害—数据过滤/阻止某些词汇/词嵌入正则化。Modifying the behavior of language models to mitigate harms.

There are many ways to change the generation behavior of language models. Solaiman and Dennison (2021) fine-tune LMs on a small, value-targeted dataset, which improves the models’ ability to adhere to these values on a question answering task. Ngo et al. (2021) filter the pretraining dataset by removing documents on which a language model has a high conditional likelihood of generating a set of researcher-written trigger phrases. When trained on this filtered dataset, their LMs generate less harmful text, at the cost of a slight decrease in language modeling performance. Xu et al. (2020) use a variety of approaches to improve the safety of chatbots, including data filtering, blocking certain words or n-grams during generation, safety-specific control tokens (Keskar et al., 2019; Dinan et al., 2019a), and human-in-the-loop data collection (Dinan et al., 2019b). Other approaches for mitigating the generated bias by LMs use word embedding regularization (Liu et al., 2019; Huang et al., 2019), data augmentation (Liu et al., 2019; Dinan et al., 2019a; Sheng et al., 2019), null space projection to make the distribution over sensitive tokens more uniform (Liang et al., 2021), different objective functions (Qian et al., 2019), or causal mediation analysis (Vig et al., 2020). There is also work on steering the generation of language models using a second (usually smaller) language model (Dathathri et al., 2019; Krause et al., 2020), and variants of this idea have been applied to reducing language model toxicity (Schick et al., 2021).

有许多方法可以改变语言模型的生成行为。Solaiman和Dennison(2021)在小型、以价值观为目标的数据集上对LM进行微调,从而提高了模型在问答任务上遵守这些价值观的能力。Ngo等(2021)通过删除语言模型在其上有较高条件概率生成一组研究人员撰写的触发短语的文档,来过滤预训练数据集。在这个经过过滤的数据集上训练时,他们的LM生成的有害文本更少,但语言建模性能略有下降。Xu等(2020)采用多种方法改善聊天机器人的安全性,包括数据过滤、在生成过程中阻止某些词汇或n-gram、安全性专用的控制标记(Keskar等,2019;Dinan等,2019a)以及人类参与的数据收集(Dinan等,2019b)。减轻LM生成偏见的其他方法包括词嵌入正则化(Liu等,2019;Huang等,2019)、数据增强(Liu等,2019;Dinan等,2019a;Sheng等,2019)、用零空间投影使敏感标记上的分布更均匀(Liang等,2021)、不同的目标函数(Qian等,2019)或因果中介分析(Vig等,2020)。还有利用第二个(通常较小的)语言模型来引导语言模型生成的工作(Dathathri等,2019;Krause等,2020),这一思路的变体已被应用于降低语言模型的毒性(Schick等,2021)。

3、Methods and experimental details方法和实验细节

3.1、High-level methodology总体方法—收集示范数据(SFT)+收集比较数据(RM)+优化策略(PPO)

Our methodology follows that of Ziegler et al. (2019) and Stiennon et al. (2020), who applied it in the stylistic continuation and summarization domains. We start with a pretrained language model (Radford et al., 2019; Brown et al., 2020; Fedus et al., 2021; Rae et al., 2021; Thoppilan et al., 2022), a distribution of prompts on which we want our model to produce aligned outputs, and a team of trained human labelers (see Sections 3.4 for details). We then apply the following three steps (Figure 2).

我们的方法沿用了Ziegler等人(2019)和Stiennon等人(2020)的做法,他们将其应用于风格延续和摘要领域。我们从一个预训练语言模型(Radford等人,2019;Brown等人,2020;Fedus等人,2021;Rae等人,2021;Thoppilan等人,2022)、一个我们希望模型在其上产生对齐输出的提示分布,以及一支经过培训的人工标注者团队(详见第3.4节)出发。然后我们执行以下三个步骤(图2)。

Step 1: Collect demonstration data, and train a supervised policy. Our labelers provide demonstrations of the desired behavior on the input prompt distribution (see Section 3.2 for details on this distribution). We then fine-tune a pretrained GPT-3 model on this data using supervised learning.

Step 2: Collect comparison data, and train a reward model. We collect a dataset of comparisons between model outputs, where labelers indicate which output they prefer for a given input. We then train a reward model to predict the human-preferred output.

Step 3: Optimize a policy against the reward model using PPO. We use the output of the RM as a scalar reward. We fine-tune the supervised policy to optimize this reward using the PPO algorithm (Schulman et al., 2017).

Steps 2 and 3 can be iterated continuously; more comparison data is collected on the current best policy, which is used to train a new RM and then a new policy. In practice, most of our comparison data comes from our supervised policies, with some coming from our PPO policies.

步骤1:收集示范数据,并训练一个监督策略。我们的标注员在输入提示分布上提供所需行为的示范(有关此分布的详细信息请参见第3.2节)。然后我们使用监督学习在这些数据上对预训练的GPT-3模型进行微调。

步骤2:收集比较数据,并训练奖励模型。我们收集一组模型输出之间的比较数据,标注员指示他们对给定输入更喜欢哪个输出。然后我们训练一个奖励模型来预测人类首选的输出。

步骤3:使用PPO算法根据奖励模型优化策略。我们使用奖励模型的输出作为标量奖励。我们使用PPO算法对监督策略进行微调,以优化这个奖励(Schulman等人,2017)。

步骤2和3可以持续迭代;在当前最佳策略上收集更多比较数据,用于训练新的奖励模型和策略。在实践中,我们的大部分比较数据来自于我们的监督策略,一部分来自于我们的PPO策略。
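下面给出上述三步流程的一个极简代码示意(仅为示意性草图,并非论文官方实现):其中 train_sft、train_rm、train_ppo 以及示例数据均为假设的占位名称,只用来表明“示范数据→SFT→比较数据→RM→PPO”的数据流向和先后顺序。

```python
# 极简流程示意(非官方实现):用占位函数演示 SFT -> RM -> PPO 三步的先后关系与数据流向。

def train_sft(pretrained_lm, demonstrations):
    """步骤1:在标注员示范 (prompt, demonstration) 上做监督微调,得到SFT策略。"""
    sft_policy = pretrained_lm          # 占位:真实实现应在示范数据上微调
    return sft_policy

def train_rm(sft_policy, comparisons):
    """步骤2:在人工比较数据 (prompt, 更优输出, 较差输出) 上训练奖励模型。"""
    def reward_model(prompt, response):  # 占位:真实RM应输出标量奖励
        return 0.0
    return reward_model

def train_ppo(sft_policy, reward_model, ppo_prompts):
    """步骤3:以RM输出为标量奖励,用PPO优化策略(可与步骤2迭代进行)。"""
    rl_policy = sft_policy              # 占位:真实实现应最大化期望奖励并加KL约束
    return rl_policy

if __name__ == "__main__":
    pretrained_lm = object()                                   # 假设的预训练GPT-3
    demonstrations = [("写一个关于青蛙的故事", "从前有一只青蛙……")]
    comparisons = [("写一个关于青蛙的故事", "输出A", "输出B")]   # A 优于 B
    ppo_prompts = ["解释什么是对齐"]

    sft = train_sft(pretrained_lm, demonstrations)
    rm = train_rm(sft, comparisons)
    instruct_gpt = train_ppo(sft, rm, ppo_prompts)
```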

3.2、Dataset数据集:SFT数据集+RM数据集+PPO数据集

Our prompt dataset consists primarily of text prompts submitted to the OpenAI API, specifically those using an earlier version of the InstructGPT models (trained via supervised learning on a subset of our demonstration data) on the Playground interface.4 Customers using the Playground were informed that their data could be used to train further models via a recurring notification any time InstructGPT models were used. In this paper we do not use data from customers using the API in production. We heuristically deduplicate prompts by checking for prompts that share a long common prefix, and we limit the number of prompts to 200 per user ID. We also create our train, validation, and test splits based on user ID, so that the validation and test sets contain no data from users whose data is in the training set. To avoid the models learning potentially sensitive customer details, we filter all prompts in the training split for personally identifiable information (PII).

我们的提示数据集主要由提交到OpenAI API的文本提示组成,特别是在Playground界面上使用早期版本InstructGPT模型(通过对我们的部分示范数据进行监督学习训练得到)时提交的提示。使用Playground的客户在每次使用InstructGPT模型时都会收到周期性通知,告知其数据可能被用于训练后续模型。在本文中,我们不使用在生产环境中使用API的客户的数据。我们通过检查是否共享较长的公共前缀来启发式地对提示去重,并将每个用户ID的提示数量限制为200个。我们还根据用户ID划分训练集、验证集和测试集,使验证集和测试集不包含任何其数据出现在训练集中的用户的数据。为了避免模型学习到潜在的敏感客户信息,我们对训练集中的所有提示进行了个人身份信息(PII)的过滤。
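下面是对上述去重与划分流程的一个代码示意(仅为依据正文描述的草图,非官方实现):其中 PREFIX_LEN、80%/10%/10% 的划分比例以及函数名 dedup_and_split 均为假设,正文只给出了“较长公共前缀去重、每个用户至多200条、按用户ID划分”这几条规则。

```python
# 启发式去重与按用户划分数据集的示意(非官方代码)。
import random
from collections import defaultdict

PREFIX_LEN = 64          # 假设值:用前64个字符近似"较长公共前缀"

def dedup_and_split(records, seed=0):
    """records: [(user_id, prompt), ...] -> (train, valid, test) 三个提示列表"""
    seen_prefixes, per_user = set(), defaultdict(list)
    for user_id, prompt in records:
        prefix = prompt[:PREFIX_LEN]
        if prefix in seen_prefixes:              # 启发式去重:公共前缀相同视为重复
            continue
        seen_prefixes.add(prefix)
        if len(per_user[user_id]) < 200:         # 每个用户ID至多200条提示
            per_user[user_id].append(prompt)

    users = sorted(per_user)
    random.Random(seed).shuffle(users)
    n = len(users)
    train_users = set(users[: int(0.8 * n)])     # 划分比例为假设值
    valid_users = set(users[int(0.8 * n): int(0.9 * n)])

    train, valid, test = [], [], []
    for user_id, prompts in per_user.items():    # 按用户划分,避免同一用户的数据跨集合泄漏
        bucket = train if user_id in train_users else valid if user_id in valid_users else test
        bucket.extend(prompts)
    return train, valid, test
```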

To train the very first InstructGPT models, we asked labelers to write prompts themselves. This is because we needed an initial source of instruction-like prompts to bootstrap the process, and these kinds of prompts weren’t often submitted to the regular GPT-3 models on the API. We asked labelers to write three kinds of prompts:

•Plain: We simply ask the labelers to come up with an arbitrary task, while ensuring the tasks had sufficient diversity.

•Few-shot: We ask the labelers to come up with an instruction, and multiple query/response pairs  for  that  instruction.

•User-based: We had a number of use-cases stated in waitlist applications to the OpenAI API. We asked labelers to come up with prompts corresponding to these use cases.

为了训练第一个InstructGPT模型,我们要求标注员自己编写提示。这是因为我们需要一个初始的指令式提示来源来启动该过程,而这类提示通常不会被提交到API上的常规GPT-3模型。我们要求标注员编写三种类型的提示:

• Plain:我们只是要求标注员提出一个任意的任务,同时确保任务具有足够的多样性

• Few-shot:我们要求标注员提供一条指令,以及与该指令相关的多个查询/响应对。

• User-based:我们在OpenAI API的等待列表申请中列出了一些使用案例。我们要求标注员提供与这些使用案例相对应的提示

From these prompts, we produce three different datasets used in our fine-tuning procedure: (1) our SFT dataset, with labeler demonstrations used to train our SFT models, (2) our RM dataset, with labeler rankings of model outputs used to train our RMs, and (3) our PPO dataset, without any human labels, which are used as inputs for RLHF fine-tuning. The SFT dataset contains about 13k training prompts (from the API and labeler-written), the RM dataset has 33k training prompts (from the API and labeler-written), and the PPO dataset has 31k training prompts (only from the API). More details on dataset sizes are provided in Table 6.

To give a sense of the composition of our dataset, in Table 1 we show the distribution of use-case categories for our API prompts (specifically the RM dataset) as labeled by our contractors. Most of the use-cases are generative, rather than classification or QA. We also show some illustrative prompts (written by researchers to mimic the kinds of prompts submitted to InstructGPT models) in Table 2; more prompts submitted to InstructGPT models are shown in Appendix A.2.1, and prompts submitted to GPT-3 models are shown in Appendix A.2.2. We provide more details about our dataset in Appendix A.

从这些提示中,我们产生了在我们的微调过程中使用的三个不同数据集:

(1)我们的SFT数据集,包含用于训练SFT模型的标注员示范,

(2)我们的RM数据集,包含用于训练RM的标注员对模型输出的排名,以及

(3)我们的PPO数据集,不包含任何人工标注,用作RLHF微调的输入。

SFT数据集包含约13,000个训练提示(来自API和标注员编写),RM数据集有33,000个训练提示(来自API和标注员编写),而PPO数据集有31,000个训练提示(仅来自API)。表6提供了更多关于数据集大小的详细信息。

为了了解我们数据集的组成情况,表1显示了由我们的承包商标注的API提示的使用案例类别分布(特别是RM数据集)。

大多数使用案例是生成性的,而不是分类或问答。表2显示了一些说明性的提示(由研究人员编写,以模拟提交给InstructGPT模型的提示),附录A.2.1中显示了提交给InstructGPT模型的更多提示,附录A.2.2中显示了提交给GPT-3模型的提示。我们在附录A中提供了有关数据集的更多细节。

3.3、Tasks任务:训练任务来源于标注者写的提示和API上提交的提示,目标是培养模型按照用户指令回应真实且无害

Our training tasks are from two sources: (1) a dataset of prompts written by our labelers and (2) a dataset of prompts submitted to early InstructGPT models on our API (see Table 6). These prompts are very diverse and include generation, question answering, dialog, summarization, extractions, and other natural language tasks (see Table 1). Our dataset is over 96% English, however in Section 4.3 we also probe our model’s ability to respond to instructions in other languages and complete coding tasks.

我们的训练任务来自两个来源:

(1)由我们的标注员编写的提示数据集和

(2)提交给我们API上早期InstructGPT模型的提示数据集(见表6)。

这些提示非常多样化,包括生成、问答、对话、摘要、提取和其他自然语言任务(见表1)。我们的数据集中超过96%的内容为英语,然而在第4.3节中,我们还测试了模型对其他语言指令和完成编码任务的能力。

For each natural language prompt, the task is most often specified directly through a natural language instruction (e.g. “Write a story about a wise frog”), but could also be indirectly through either few-shot examples (e.g. giving two examples of frog stories, and prompting the model to generate a new one) or implicit continuation (e.g. providing the start of a story about a frog). In each case, we ask our labelers to do their best to infer the intent of the user who wrote the prompt, and ask them to skip inputs where the task is very unclear. Moreover, our labelers also take into account the implicit intentions such as truthfulness of the response, and potentially harmful outputs such as biased or toxic language, guided by the instructions we provide them (see Appendix B) and their best judgment.

对于每个自然语言提示,任务通常通过自然语言指令直接进行说明(例如:“写一个关于聪明青蛙的故事”),但也可以通过少量示例间接进行说明(例如,给出两个青蛙故事的例子,提示模型生成一个新的故事),或者通过隐含的延续进行说明(例如,提供一个关于青蛙的故事的开头)。在每种情况下,我们要求标注员尽力推断出编写提示的用户的意图,并要求他们跳过任务非常不清楚的输入。此外,我们的标注员还考虑隐含的意图,如回答的真实性和潜在的有害输出,例如偏见或有毒语言,这是根据我们为他们提供的指导和他们的最佳判断而进行的(见附录B)。

3.4、Human data collection人类数据收集:通过严格的标注者筛选,培养了能够根据任务需求,辨别有害输出的标注者团队,来训练和评估语言模型。

To produce our demonstration and comparison data, and to conduct our main evaluations, we hired a team of about 40 contractors on Upwork and through ScaleAI. Compared to earlier work that collects human preference data on the task of summarization (Ziegler et al., 2019; Stiennon et al., 2020; Wu et al., 2021), our inputs span a much broader range of tasks, and can occasionally include controversial and sensitive topics. Our aim was to select a group of labelers who were sensitive to the preferences of different demographic groups, and who were good at identifying outputs that were potentially harmful. Thus, we conducted a screening test designed to measure labeler performance on these axes. We selected labelers who performed well on this test; for more information about our selection procedure and labeler demographics, see Appendix B.1.

为了产生我们的示范数据和比较数据,并进行主要评估,我们通过Upwork和ScaleAI聘请了约40名承包商组成的团队。与先前在摘要任务上收集人类偏好数据的工作(Ziegler等,2019;Stiennon等,2020;Wu等,2021)相比,我们的输入涵盖了更广泛的任务,并且有时可能涉及有争议和敏感的主题。我们的目标是选择一组对不同人群的偏好敏感、并且擅长识别潜在有害输出的标注员。因此,我们进行了一项筛选测试,以衡量标注员在这些方面的表现。我们选择了在这项测试中表现良好的标注员;有关我们的选择过程和标注员的人口统计信息的更多信息,请参见附录B.1。

During training and evaluation, our alignment criteria may come into conflict: for example, when a user requests a potentially harmful response. During training we prioritize helpfulness to the user (not doing so requires making some difficult design decisions that we leave to future work; see Section 5.4 for more discussion). However, in our final evaluations we asked labelers to prioritize truthfulness and harmlessness (since this is what we really care about).

As in Stiennon et al. (2020), we collaborate closely with labelers over the course of the project. We have an onboarding process to train labelers on the project, write detailed instructions for each task (see Appendix B.2), and answer labeler questions in a shared chat room.

在训练和评估过程中,我们的对齐准则可能发生冲突:例如,当用户请求可能有害的回复时。在训练过程中,我们优先考虑对用户的有益性(不这样做需要做出一些困难的设计决策,我们将这些决策留给未来的工作;有关更多讨论,请参见第5.4节)。然而,在我们最终的评估中,我们要求标注员优先考虑真实性和无害性(因为这是我们真正关心的)。

与Stiennon等人(2020)的研究相似,我们与标注员密切合作。我们有一个入职流程,培训标注员参与项目,为每个任务编写详细说明(请参见附录B.2),并在共享聊天室中回答标注员的问题。

As an initial study to see how well our model generalizes to the preferences of other labelers, we hire a separate set of labelers who do not produce any of the training data. These labelers are sourced from the same vendors, but do not undergo a screening test.

Despite the complexity of the task, we find that inter-annotator agreement rates are quite high: training labelers agree with each-other 72.6 ± 1.5% of the time, while for held-out labelers this number is 77.3 ± 1.3%. For comparison, in the summarization work of Stiennon et al. (2020) researcher-researcher agreement was 73 ± 4%.

作为初步研究,以了解我们的模型对其他标注员偏好的泛化能力,我们雇佣了一组不产生任何训练数据的独立标注员。这些标注员来自同样的供应商,但没有经过筛选测试。

尽管任务很复杂,我们发现标注员之间的一致性相当高:训练标注员彼此之间的一致率为72.6 ± 1.5%,而留存标注员的这一数字为77.3 ± 1.3%。相比之下,在Stiennon等人(2020)的摘要工作中,研究人员之间的一致率为73 ± 4%。

3.5、Models模型

We start with the GPT-3 pretrained language models from Brown et al. (2020). These models are trained on a broad distribution of Internet data and are adaptable to a wide range of downstream tasks, but have poorly characterized behavior. Starting from these models, we then train models with three different techniques:

我们从Brown等人(2020)的GPT-3预训练语言模型开始。这些模型是在广泛的互联网数据上训练的,适用于各种下游任务,但行为特征尚不清楚。在这些模型的基础上,我们使用三种不同的技术训练模型

T1、监督微调Supervised fine-tuning (SFT).

We fine-tune GPT-3 on our labeler demonstrations using supervised learning. We trained for 16 epochs, using a cosine learning rate decay, and residual dropout of 0.2. We do our final SFT model selection based on the RM score on the validation set. Similarly to Wu et al. (2021), we find that our SFT models overfit on validation loss after 1 epoch; however, we find that training for more epochs helps both the RM score and human preference ratings, despite this overfitting.

我们使用监督学习将GPT-3在我们的标注员示范数据上进行微调。我们进行了16个epoch的训练,使用余弦学习率衰减和0.2的残差丢弃率。我们根据验证集上的RM得分进行最终的SFT模型选择。与Wu等人(2021)类似,我们发现我们的SFT模型在1个epoch后对验证损失过拟合;然而,我们发现进行更多的训练epoch有助于提高RM得分和人类偏好评级,尽管存在过拟合问题。
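下面给出SFT步骤的一个最小化PyTorch训练循环示意(仅为依据正文超参数写出的草图,非官方实现):正文提到16个epoch和余弦学习率衰减;model、dataloader和学习率lr均为假设的占位,残差dropout 0.2属于模型配置,此处未体现。

```python
# SFT 的最小化训练循环示意:在 (prompt, demonstration) 拼接后的token序列上做标准的下一token交叉熵。
import torch

def supervised_fine_tune(model, dataloader, epochs=16, lr=1e-5):
    # lr 为假设值;epochs=16 与余弦学习率衰减取自正文
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=epochs * len(dataloader))        # 余弦学习率衰减
    model.train()
    for _ in range(epochs):
        for input_ids, labels in dataloader:              # 假设 dataloader 产出 (输入token, 目标token)
            logits = model(input_ids)                     # 假设模型返回 [batch, seq, vocab] 的logits
            loss = torch.nn.functional.cross_entropy(
                logits.view(-1, logits.size(-1)), labels.view(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.step()
    return model
```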

T2、奖励建模Reward modeling (RM).

Starting from the SFT model with the final unembedding layer removed, we trained a model to take in a prompt and response, and output a scalar reward. In this paper we only use 6B RMs, as this saves a lot of compute, and we found that 175B RM training could be unstable and thus was less suitable to be used as the value function during RL (see Appendix C for more details).

从去除了最终反嵌入层(unembedding layer)的SFT模型出发,我们训练了一个模型:输入是提示和回复,输出是一个标量奖励。在本文中,我们只使用6B的RM,因为这能节省大量计算资源,并且我们发现175B的RM训练可能不稳定,因此不太适合用作强化学习中的值函数(有关更多细节,请参见附录C)。

In Stiennon et al. (2020), the RM is trained on a dataset of comparisons between two model outputs on the same input. They use a cross-entropy loss, with the comparisons as labels—the difference in rewards represents the log odds that one response will be preferred to the other by a human labeler.

In order to speed up comparison collection, we present labelers with anywhere between K = 4 and K = 9 responses to rank. This produces $\binom{K}{2}$ comparisons for each prompt shown to a labeler. Since comparisons are very correlated within each labeling task, we found that if we simply shuffle the comparisons into one dataset, a single pass over the dataset caused the reward model to overfit. Instead, we train on all $\binom{K}{2}$ comparisons from each prompt as a single batch element. This is much more computationally efficient because it only requires a single forward pass of the RM for each completion (rather than $\binom{K}{2}$ forward passes for K completions) and, because it no longer overfits, it achieves much improved validation accuracy and log loss.

在Stiennon等人(2020)的研究中,RM是在一个数据集上进行训练的,该数据集包含了对相同输入的两个模型输出的比较。他们使用交叉熵损失,其中比较作为标签,奖励差异表示人类标注员更喜欢其中一个回复的对数几率。

为了加快比较收集的速度,我们向标注员展示了K = 4至K = 9个回复进行排序。这为每个展示给标注员的提示产生了$\binom{K}{2}$个比较。由于每个标注任务内部的比较非常相关,我们发现如果我们简单地将比较随机混洗到一个数据集中,对数据集进行一次遍历就会导致奖励模型过拟合。相反,我们将每个提示的所有$\binom{K}{2}$个比较作为单个批次元素进行训练。这样做在计算上更高效,因为每个回复只需要一次RM的前向传播(而不是对K个回复进行$\binom{K}{2}$次前向传播),并且由于不再过拟合,它的验证准确性和对数损失得到了显著提高。

Specifically, the loss function for the reward model is:

$$\text{loss}(\theta) = -\frac{1}{\binom{K}{2}}\, E_{(x, y_w, y_l)\sim D}\left[\log\left(\sigma\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right)\right)\right]$$

where $r_\theta(x, y)$ is the scalar output of the reward model for prompt x and completion y with parameters θ, $y_w$ is the preferred completion out of the pair of $y_w$ and $y_l$, σ is the sigmoid function, and D is the dataset of human comparisons.

Finally, since the RM loss is invariant to shifts in reward, we normalize the reward model using a bias so that the labeler demonstrations achieve a mean score of 0 before doing RL.

具体来说,奖励模型的损失函数为:

$$\text{loss}(\theta) = -\frac{1}{\binom{K}{2}}\, E_{(x, y_w, y_l)\sim D}\left[\log\left(\sigma\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right)\right)\right]$$

其中 $r_\theta(x, y)$ 是参数为θ的奖励模型对提示x和回复y的标量输出,$y_w$ 是一对回复 $(y_w, y_l)$ 中更受偏好的那个,σ为sigmoid函数,D是人类比较数据集。

最后,由于RM损失对奖励的平移不变,我们使用偏差对奖励模型进行归一化,以使标注员的示范获得平均得分为0,然后进行强化学习。
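下面用PyTorch给出上述成对损失的一个示意实现(仅为依据公式写出的草图,非官方代码):函数名 rm_pairwise_loss、变量 rewards 和 ranking 都是假设的名称,用于展示“同一提示的K个回复只前向一次、在K(K−1)/2个比较上取平均”的计算方式。

```python
# 奖励模型成对损失的示意实现:对同一提示的K个回复两两组对,
# 以 -log sigmoid(r_better - r_worse) 作为单对损失,并在全部 K(K-1)/2 个比较上取平均。
import itertools
import torch
import torch.nn.functional as F

def rm_pairwise_loss(rewards, ranking):
    """
    rewards: 形状为[K]的张量,奖励模型对同一提示的K个回复给出的标量奖励(一次前向得到)
    ranking: 长度为K的索引列表,按标注员偏好从高到低排列
    """
    losses = []
    for better, worse in itertools.combinations(ranking, 2):  # 共 K(K-1)/2 对
        losses.append(-F.logsigmoid(rewards[better] - rewards[worse]))
    return torch.stack(losses).mean()

# 用法示意:K=4 个回复,标注员偏好顺序为 2 > 0 > 3 > 1
rewards = torch.tensor([0.3, -1.2, 1.5, -0.1], requires_grad=True)
loss = rm_pairwise_loss(rewards, ranking=[2, 0, 3, 1])
loss.backward()
```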

T3、强化学习Reinforcement learning (RL).

Once again following Stiennon et al. (2020), we fine-tuned the SFT model on our environment using PPO (Schulman et al., 2017). The environment is a bandit environment which presents a random customer prompt and expects a response to the prompt. Given the prompt and response, it produces a reward determined by the reward model and ends the episode. In addition, we add a per-token KL penalty from the SFT model at each token to mitigate over-optimization of the reward model. The value function is initialized from the RM. We call these models “PPO.”

We also experiment with mixing the pretraining gradients into the PPO gradients, in order to fix the performance regressions on public NLP datasets. We call these models “PPO-ptx.” We maximize the following combined objective function in RL training:

$$\text{objective}(\phi) = E_{(x, y)\sim D_{\pi_\phi^{\mathrm{RL}}}}\left[r_\theta(x, y) - \beta \log\frac{\pi_\phi^{\mathrm{RL}}(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)}\right] + \gamma\, E_{x\sim D_{\mathrm{pretrain}}}\left[\log \pi_\phi^{\mathrm{RL}}(x)\right]$$

where $\pi_\phi^{\mathrm{RL}}$ is the learned RL policy, $\pi^{\mathrm{SFT}}$ is the supervised trained model, and $D_{\mathrm{pretrain}}$ is the pretraining distribution. The KL reward coefficient, β, and the pretraining loss coefficient, γ, control the strength of the KL penalty and pretraining gradients respectively. For "PPO" models, γ is set to 0. Unless otherwise specified, in this paper InstructGPT refers to the PPO-ptx models.

再次遵循Stiennon等人(2020)的方法,我们使用PPO(Schulman等人,2017)在我们的环境中对SFT模型进行微调。该环境是一个多臂老虎机(bandit)环境:它呈现一个随机的客户提示,并期望得到针对该提示的回复。给定提示和回复,环境根据奖励模型产生一个奖励,并结束该回合。此外,我们在每个token上添加来自SFT模型的KL惩罚,以减轻对奖励模型的过度优化。值函数由RM初始化。我们将这些模型称为“PPO”。

我们还尝试将预训练梯度与PPO梯度混合在一起,以解决在公开NLP数据集上的性能退化问题。我们将这些模型称为“PPO-ptx”。在强化学习训练中,我们最大化以下组合目标函数:

$$\text{objective}(\phi) = E_{(x, y)\sim D_{\pi_\phi^{\mathrm{RL}}}}\left[r_\theta(x, y) - \beta \log\frac{\pi_\phi^{\mathrm{RL}}(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)}\right] + \gamma\, E_{x\sim D_{\mathrm{pretrain}}}\left[\log \pi_\phi^{\mathrm{RL}}(x)\right]$$

其中 $\pi_\phi^{\mathrm{RL}}$ 是学习得到的强化学习策略,$\pi^{\mathrm{SFT}}$ 是经过监督训练的模型,$D_{\mathrm{pretrain}}$ 是预训练分布。KL奖励系数β和预训练损失系数γ分别控制KL惩罚和预训练梯度的强度。对于“PPO”模型,γ被设为0。除非另有说明,本文中InstructGPT指的是PPO-ptx模型
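下面给出该组合目标中奖励计算的一个示意(仅为依据公式的草图,非官方实现):函数 ppo_ptx_objective 及 beta、gamma 的默认取值均为假设;当 gamma 取0时即对应正文中的“PPO”模型。

```python
# PPO-ptx 组合目标的示意:总奖励 = RM得分 - beta * 逐token KL惩罚之和,
# 再加上 gamma 加权的预训练语言建模项(gamma=0 即为"PPO"模型)。
import torch

def ppo_ptx_objective(rm_score, logp_rl, logp_sft, pretrain_logp, beta=0.02, gamma=27.8):
    """
    rm_score:         标量,奖励模型对 (prompt, response) 的打分
    logp_rl/logp_sft: 形状为[T]的张量,RL策略/SFT策略对回复各token的对数概率
    pretrain_logp:    形状为[T']的张量,RL策略在预训练分布样本上的token对数概率
    beta/gamma 的默认取值仅为示意(并非本节正文给出)。
    """
    kl_penalty = (logp_rl - logp_sft).sum()           # 逐token KL惩罚之和(对数比近似)
    rl_reward = rm_score - beta * kl_penalty          # PPO希望最大化的奖励
    pretrain_term = gamma * pretrain_logp.mean()      # 预训练混合项(PPO-ptx)
    return rl_reward + pretrain_term
```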

基线Baselines.

We compare the performance of our PPO models to our SFT models and GPT-3. We also compare to GPT-3 when it is provided a few-shot prefix to ‘prompt’ it into an instruction-following mode (GPT-3-prompted). This prefix is prepended to the user-specified instruction.

我们将我们的PPO模型的性能与我们的SFT模型和GPT-3进行了比较。我们还与给予GPT-3一些示例前缀以引导其进入指令跟随模式时的GPT-3进行了比较(GPT-3-prompted)。这个前缀被添加到用户指定的指令之前。

We additionally compare InstructGPT to fine-tuning 175B GPT-3 on the FLAN (Wei et al., 2021) and T0 (Sanh et al., 2021) datasets, which both consist of a variety of NLP tasks, combined with natural language instructions for each task (the datasets differ in the NLP datasets included, and the style of instructions used). We fine-tune them on approximately 1 million examples respectively and choose the checkpoint which obtains the highest reward model score on the validation set. See Appendix C for more training details.

此外,我们还将InstructGPT与在FLAN(Wei等人,2021)和T0(Sanh等人,2021)数据集上对175B GPT-3进行微调进行了比较,这两个数据集都包含了各种NLP任务,并结合了每个任务的自然语言指令(这些数据集在包含的NLP数据集和使用的指令风格上有所不同)。我们分别在约100万个示例上对它们进行微调,并选择在验证集上获得最高奖励模型得分的检查点。有关更多训练细节,请参见附录C。

3.6、Evaluation评估

To evaluate how “aligned” our models are, we first need to clarify what alignment means in this context. The definition of alignment has historically been a vague and confusing topic, with various competing proposals (Chen et al., 2021; Leike et al., 2018; Gabriel, 2020). Following Leike et al. (2018), our aim is to train models that act in accordance with user intentions. More practically, for the purpose of our language tasks, we use a framework similar to Askell et al. (2021), who define models to be aligned if they are helpful, honest, and harmless.

To be helpful, the model should follow instructions, but also infer intention from a few-shot prompt or another interpretable pattern such as “Q: {question}\nA:”. Since a given prompt’s intention can be unclear or ambiguous, we rely on judgment from our labelers, and our main metric is labeler preference ratings. However, since our labelers are not the users who generated the prompts, there could be a divergence between what a user actually intended and what the labeler thought was intended from only reading the prompt.

为了评估我们的模型的“对齐”程度,我们首先需要明确在这一语境下对齐意味着什么。对齐的定义在历史上一直是一个模糊且令人困惑的话题,存在各种相互竞争的提议(Chen等,2021;Leike等,2018;Gabriel,2020)。按照Leike等人(2018)的思路,我们的目标是训练按照用户意图行事的模型。更实际地说,针对我们的语言任务,我们采用了与Askell等人(2021)类似的框架,他们将有帮助(helpful)、诚实(honest)且无害(harmless)的模型定义为对齐的模型。

为了具有帮助性,模型应该遵循指令,但也应该从几个示例提示或其他可解释的模式(例如“Q:{问题} \nA:”)中推断出意图。由于给定的提示的意图可能不明确或模棱两可,我们依赖于我们标注员的判断,而我们的主要指标是标注员的偏好评分。然而,由于我们的标注员不是生成提示的用户,用户实际意图与标注员仅通过阅读提示而产生的判断之间可能存在差异。

It is unclear how to measure honesty in purely generative models; this requires comparing the model’s actual output to its “belief” about the correct output, and since the model is a big black box, we can’t infer its beliefs. Instead, we measure truthfulness—whether the model’s statements about the world are true—using two metrics: (1) evaluating our model’s tendency to make up information on closed domain tasks (“hallucinations”), and (2) using the TruthfulQA dataset (Lin et al., 2021). Needless to say, this only captures a small part of what is actually meant by truthfulness.

Similarly to honesty, measuring the harms of language models also poses many challenges. In most cases, the harms from language models depend on how their outputs are used in the real world. For instance, a model generating toxic outputs could be harmful in the context of a deployed chatbot, but might even be helpful if used for data augmentation to train a more accurate toxicity detection model. Earlier in the project, we had labelers evaluate whether an output was ‘potentially harmful’. However, we discontinued this as it required too much speculation about how the outputs would ultimately be used; especially since our data also comes from customers who interact with the Playground API interface (rather than from production use cases).

在纯生成式模型中如何衡量诚实性尚不清楚:这需要将模型的实际输出与它对正确输出的“信念”进行比较,而由于模型是一个大黑盒,我们无法推断它的信念。因此,我们改为衡量真实性(即模型关于世界的陈述是否为真),使用两个指标:(1)评估模型在封闭域任务上编造信息(“幻觉”)的倾向,以及(2)使用TruthfulQA数据集(Lin等,2021)。不用说,这只捕捉到真实性实际含义的一小部分。

与诚实性类似,衡量语言模型的危害也面临许多挑战。在大多数情况下,语言模型的危害取决于其输出在现实世界中的使用方式。例如,生成有害输出的模型在部署的聊天机器人环境中可能是有害的,但如果用于数据增强以训练更准确的有害性检测模型,则可能是有益的。在项目早期,我们让标注员评估输出是否“可能有害”。然而,由于这需要对输出的最终使用进行过多的猜测,而且我们的数据还来自与Playground API界面交互的客户(而不是来自实际使用案例),因此我们中止了这个评估方法。

Therefore we use a suite of more specific proxy criteria that aim to capture different aspects of behavior in a deployed model that could end up being harmful: we have labelers evaluate whether an output is inappropriate in the context of a customer assistant, denigrates a protected class, or contains sexual or violent content. We also benchmark our model on datasets intended to measure bias and toxicity, such as RealToxicityPrompts (Gehman et al., 2020) and CrowS-Pairs (Nangia et al., 2020).

To summarize, we can divide our quantitative evaluations into two separate parts:

因此,我们使用一套更具体的代理标准来评估部署模型中可能产生危害的不同行为方面:我们让标注员评估输出是否在客户助手的背景下不合适、是否贬低受保护群体,或者是否包含性或暴力内容。我们还对旨在衡量偏见和有害性的数据集进行基准测试,例如RealToxicityPrompts(Gehman等,2020)和CrowS-Pairs(Nangia等,2020)。

总之,我们可以将定量评估分为两个独立部分:

API分布的评估Evaluations on API distribution.  

Our main metric is human preference ratings on a held out set of prompts from the same source as our training distribution. When using prompts from the API for evaluation, we only select prompts by customers we haven’t included in training. However, given that our training prompts are designed to be used with InstructGPT models, it’s likely that they disadvantage the GPT-3 baselines. Thus, we also evaluate on prompts submitted to GPT-3 models on the API; these prompts are generally not in an ‘instruction following’ style, but are designed specifically for GPT-3. In both cases, for each model we calculate how often its outputs are preferred to a baseline policy; we choose our 175B SFT model as the baseline since its performance is near the middle of the pack. Additionally, we ask labelers to judge the overall quality of each response on a 1-7 Likert scale and collect a range of metadata for each model output (see Table 3).

我们的主要指标是人工偏好评分,评测对象是一组与训练分布同源的留存提示。在使用API提示进行评估时,我们只选择未包含在训练中的客户的提示。然而,鉴于我们的训练提示是为InstructGPT模型设计的,它们很可能对GPT-3基线不利。因此,我们还在提交给API上GPT-3模型的提示上进行评估;这些提示通常不是“遵循指令”风格的,而是专门为GPT-3设计的。在这两种情况下,对于每个模型,我们都计算其输出被偏好于基线策略的频率;我们选择175B SFT模型作为基线,因为它的性能处于所有模型的中游。此外,我们要求标注员用1-7分的Likert量表评判每个回答的整体质量,并为每个模型输出收集一系列元数据(参见表3)。
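下面给出“相对基线的胜率及95%置信区间”这一指标的一个计算示意(仅为草图,采用正态近似;论文误差线的具体计算方式未在正文给出,函数 winrate_with_ci 及其数据结构均为假设)。

```python
# 胜率评估示意:统计某模型输出被偏好于175B SFT基线的比例,并给出正态近似的95%置信区间。
# comparisons 为假设的数据结构:每个元素是 True(偏好该模型)或 False(偏好基线)。
import math

def winrate_with_ci(comparisons, z=1.96):
    n = len(comparisons)
    p = sum(comparisons) / n                      # 胜率
    half_width = z * math.sqrt(p * (1 - p) / n)   # 95%置信区间半宽(正态近似)
    return p, (p - half_width, p + half_width)

# 用法示意
p, ci = winrate_with_ci([True] * 85 + [False] * 15)
print(f"winrate = {p:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```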

公开NLP数据集的评估Evaluations on public NLP datasets.  

We evaluate on two types of public datasets: those that capture an aspect of language model safety, particularly truthfulness, toxicity, and bias, and those that capture zero-shot performance on traditional NLP tasks like question answering, reading comprehension, and summarization. We also conduct human evaluations of toxicity on the RealToxicityPrompts dataset (Gehman et al., 2020). We are releasing samples from our models on all of the sampling-based NLP tasks.

我们在两类公开数据集上进行评估:捕捉语言模型安全性的数据集,特别是真实性、有害性和偏见,以及捕捉传统NLP任务(如问答、阅读理解和摘要)的零样本性能的数据集。我们还在RealToxicityPrompts数据集(Gehman等,2020)上进行有害性的人工评估。我们在所有基于采样的NLP任务上发布了我们模型的样本。

4、Results结果

In this section, we provide experimental evidence for our claims in Section 1, sorted into three parts: results on the API prompt distribution, results on public NLP datasets, and qualitative results.

在本节中,我们提供了我们在第1节中的论述的实验证据,分为三个部分:API提示分布的结果,公开NLP数据集的结果以及定性结果。

Figure 3: Preference results of our models, measured by winrate against the 175B SFT model. Left: results on prompts submitted to GPT models on the API; Right: results on prompts submitted to InstructGPT models on the API; Top: results from held-out labelers; Bottom: results from training labelers. We omit GPT (prompted) from the evals on prompts submitted to GPT-3 models (left) as these prompts are already designed to perform well for GPT-3, as opposed to prompts submitted to InstructGPT models (right).

图3:我们模型的偏好结果,以相对于175B SFT模型的胜率来衡量。左侧:提交给API上GPT模型的提示上的结果;右侧:提交给API上InstructGPT模型的提示上的结果;上方:来自留存标注员的结果;下方:来自训练标注员的结果。在对提交给GPT-3模型的提示(左侧)进行评估时,我们省略了GPT(prompted),因为这些提示本身就是为了让GPT-3表现良好而设计的,这与提交给InstructGPT模型的提示(右侧)不同。

4.1、API分布的结果Results on the API distribution

Labelers significantly prefer InstructGPT outputs over outputs from GPT-3.   On our test set of prompts, our labelers significantly prefer InstructGPT outputs across model sizes. These results are shown in Figure 1. We find that GPT-3 outputs perform the worst, and one can obtain significant step-size improvements by using a well-crafted few-shot prompt (GPT-3 (prompted)), then by training on demonstrations using supervised learning (SFT), and finally by training on comparison data using PPO. Adding updates on the pretraining mix during PPO does not lead to large changes in labeler preference. To illustrate the magnitude of our gains: when compared directly, 175B InstructGPT outputs are preferred to GPT-3 outputs 85 ± 3% of the time, and preferred 71 ± 4% of the time to few-shot GPT-3.

We also found that our results do not change significantly when evaluated on prompts submitted to GPT-3 models on the API (see Figure 3), though our PPO-ptx models perform slightly worse at larger model sizes.

标注员明显更喜欢InstructGPT的输出而不是GPT-3的输出。在我们的提示测试集上,各种模型规模下标注员都明显更偏好InstructGPT的输出。这些结果如图1所示。我们发现GPT-3的输出表现最差;先使用精心设计的少样本提示(GPT-3 (prompted)),再使用监督学习在示范数据上训练(SFT),最后使用PPO在比较数据上训练,每一步都能带来显著的改进。在PPO期间加入预训练混合数据的更新并不会使标注员偏好发生大的变化。为了说明我们收益的幅度:在直接比较中,175B InstructGPT的输出在85±3%的情况下优于GPT-3的输出,在71±4%的情况下优于少样本GPT-3的输出。我们还发现,在提交给API上GPT-3模型的提示上进行评估时,我们的结果没有显著变化(见图3),不过我们的PPO-ptx模型在更大的模型规模上表现稍差。

Figure 4: Metadata results on the API distribution. Note that, due to dataset sizes, these results are collapsed across model sizes. See Appendix E.2 for analysis that includes model size. Compared to GPT-3, the PPO models are more appropriate in the context of a customer assistant, are better at following explicit constraints in the instruction and attempting the correct instruction, and less likely to ‘hallucinate’ (meaning, making up information on closed domain tasks like summarization).

Figure 5: Comparing our models with FLAN and T0 in terms of Likert scores on a 1-7 scale, on the InstructGPT prompt distribution. FLAN and T0 perform better than default GPT-3, and comparably with a few-shot GPT-3 model placed into ‘instruction-following’ mode.

图4:API分布上的元数据结果。请注意,受数据集规模所限,这些结果是跨模型大小合并统计的。包含模型大小的分析请参见附录E.2。与GPT-3相比,PPO模型在客户助手的语境下更为得体,更擅长遵循指令中的明确约束并尝试正确的指令,并且在摘要等封闭域任务中更少“幻觉”(即编造信息)。

图5:在InstructGPT提示分布上,使用1-7分的Likert评分比较我们的模型与FLAN和T0的结果。FLAN和T0的表现优于默认的GPT-3,并与将少样本的GPT-3模型置于“按照指示进行”模式的表现相当。

In Figure 4 we show that labelers also rate InstructGPT outputs favorably along several more concrete axes. Specifically, compared to GPT-3, InstructGPT outputs are more appropriate in the context of a customer assistant, more often follow explicit constraints defined in the instruction (e.g. “Write your answer in 2 paragraphs or less.”), are less likely to fail to follow the correct instruction entirely, and make up facts (‘hallucinate’) less often in closed-domain tasks. These results suggest that InstructGPT models are more reliable and easier to control than GPT-3. We’ve found that our other metadata categories occur too infrequently in our API to obtain statistically significant differences between our models.

在图4中,我们展示了标注员还从几个更具体的维度对InstructGPT的输出给予了积极评价。具体来说,与GPT-3相比,InstructGPT的输出在客户助手的语境下更为得体,更经常遵循指令中定义的明确约束(例如“用2段或更少的篇幅写出你的答案。”),更少出现完全不遵循正确指令的情况,并且在封闭域任务中编造事实(“幻觉”)的频率更低。这些结果表明,InstructGPT模型比GPT-3更可靠、更容易控制。我们发现,其他元数据类别在我们的API中出现得太少,无法在模型之间得到统计上显著的差异。

我们的模型推广到“保留”标注员的偏好上,这些标注员没有产生任何训练数据。Our models generalize to the preferences of "held-out" labelers that did not produce any training data.

Held-out labelers have similar ranking preferences as workers who we used to produce training data (see Figure 3). In particular, according to held-out workers, all of our InstructGPT models still greatly outperform the GPT-3 baselines. Thus, our InstructGPT models aren’t simply overfitting to the preferences of our training labelers.

We see further evidence of this from the generalization capabilities of our reward models. We ran an experiment where we split our labelers into 5 groups, and train 5 RMs (with 3 different seeds) using 5-fold cross validation (training on 4 of the groups, and evaluating on the held-out group). These RMs have an accuracy of 69.6 ± 0.9% on predicting the preferences of labelers in the held-out group, a small decrease from their 72.4 ± 0.4% accuracy on predicting the preferences of labelers in their training set.

保留标注员的偏好排序与为我们生成训练数据的标注员相似(参见图3)。特别地,根据保留标注员的评估,我们所有的InstructGPT模型仍然大幅优于GPT-3基线。因此,我们的InstructGPT模型并不仅仅是过拟合了训练标注员的偏好。

从我们的奖励模型的泛化能力来看,我们看到了这一点的进一步证据。我们进行了一个实验,将我们的标注员分为5组,并使用5折交叉验证训练了5个奖励模型(使用3个不同的种子),其中4组用于训练,1组用于在保留组上进行评估。这些奖励模型在预测保留组标注员偏好方面的准确率为69.6±0.9%,略低于它们在训练集标注员偏好方面的72.4±0.4%的准确率。
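
下面是一个示意性的草稿,展示"按标注员分组做5折交叉验证来评估奖励模型泛化性"的大致流程:按标注员(而非按样本)划分数据,在4组上训练、在留出的1组上评估,并在多个随机种子上重复。其中 `train_fn`、`eval_fn` 及数据格式都是假设的占位接口,并非论文的实际实现。

```python
import numpy as np

def labeler_group_cross_validation(comparisons_by_labeler, train_fn, eval_fn,
                                   n_groups=5, seeds=(0, 1, 2)):
    """Cross-validation at the *labeler* level: split labelers into n_groups,
    train a reward model on data from the other groups (once per seed), and
    measure preference-prediction accuracy on the held-out group.
    comparisons_by_labeler: dict labeler_id -> list of comparison records.
    train_fn(train_data, seed) -> rm; eval_fn(rm, test_data) -> accuracy
    (both are caller-supplied, hypothetical callables)."""
    labelers = sorted(comparisons_by_labeler)
    rng = np.random.default_rng(0)
    rng.shuffle(labelers)                                # random group assignment
    groups = np.array_split(np.array(labelers), n_groups)

    held_out_accs = []
    for g, held_out in enumerate(groups):
        train_data = [c for i, grp in enumerate(groups) if i != g
                      for labeler in grp for c in comparisons_by_labeler[labeler]]
        test_data = [c for labeler in held_out for c in comparisons_by_labeler[labeler]]
        for seed in seeds:
            rm = train_fn(train_data, seed)
            held_out_accs.append(eval_fn(rm, test_data))
    return float(np.mean(held_out_accs)), float(np.std(held_out_accs))
```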

公开的NLP数据集不能反映我们的语言模型的使用情况。Public NLP datasets are not reflective of how our language models are used.

In Figure 5, we also compare InstructGPT to our 175B GPT-3 baselines fine-tuned on the FLAN (Wei et al., 2021) and T0 (Sanh et al., 2021) datasets (see Appendix C for details). We find that these models perform better than GPT-3, on par with GPT-3 with a well-chosen prompt, and worse than our SFT baseline. This indicates that these datasets are not sufficiently diverse to improve performance on our API prompt distribution. In a head to head comparison, our 175B InstructGPT model outputs were preferred over our FLAN model 78 ±4% of the time and over our T0 model 79 ± 4% of the time. Likert scores for these models are shown in Figure 5.

在图5中,我们还将InstructGPT与在FLAN(Wei等,2021)和T0(Sanh等,2021)数据集上微调的175B GPT-3基线进行了比较(详见附录C)。我们发现,这些模型的表现优于GPT-3,与使用精心选择提示的GPT-3相当,但差于我们的SFT基线。这表明这些数据集的多样性不足以在我们的API提示分布上提升性能。在一对一比较中,我们的175B InstructGPT模型输出在78±4%的情况下优于FLAN模型,在79±4%的情况下优于T0模型。这些模型的Likert评分见图5。

我们认为我们的InstructGPT模型优于FLAN和T0有两个原因。We believe our InstructGPT model outperforms FLAN and T0 for two reasons

We believe our InstructGPT model outperforms FLAN and T0 for two reasons. First, public NLP datasets are designed to capture tasks that are easy to evaluate with automatic metrics, such as classification, question answering, and to a certain extent summarization and translation. However, classification and QA are only a small part (about 18%) of what API customers use our language models for, whereas open-ended generation and brainstorming consist of about 57% of our prompt dataset according to labelers (see Table 1). Second, it can be difficult for public NLP datasets to obtain a very high diversity of inputs (at least, on the kinds of inputs that real-world users would be interested in using). Of course, tasks found in NLP datasets do represent a kind of instruction that we would like language models to be able to solve, so the broadest type of instruction-following model would combine both types of datasets.

我们认为我们的InstructGPT模型优于FLAN和T0有两个原因。首先,公开的NLP数据集旨在覆盖便于用自动指标评估的任务,例如分类、问答,以及在一定程度上的摘要和翻译。然而,根据标注员的统计,分类和问答仅占API客户使用我们语言模型场景的很小一部分(约18%),而开放式生成和头脑风暴约占我们提示数据集的57%(参见表1)。其次,公开的NLP数据集很难获得非常高的输入多样性(至少在真实用户感兴趣的输入类型上是如此)。当然,NLP数据集中的任务确实代表了我们希望语言模型能够解决的一类指令,因此覆盖面最广的指令遵循模型应当结合这两类数据集。

4.2、Results on public NLP datasets在公开的NLP数据集上的结果

InstructGPT models show improvements in truthfulness over GPT-3. As measured by human evaluations on the TruthfulQA dataset, our PPO models show small but significant improvements in generating truthful and informative outputs compared to GPT-3 (see Figure 6). This behavior is the default: our models do not have to be specifically instructed to tell the truth to exhibit improved truthfulness. Interestingly, the exception is our 1.3B PPO-ptx model, which performs slightly worse than a GPT-3 model of the same size. When evaluated only on prompts that were not adversarially selected against GPT-3, our PPO models are still significantly more truthful and informative than GPT-3 (although the absolute improvement decreases by a couple of percentage points).

InstructGPT模型在真实性方面优于GPT-3。根据在TruthfulQA数据集上的人工评估,我们的PPO模型在生成真实且有信息量的输出方面相对于GPT-3有小幅但显著的改进(参见图6)。这种行为是默认的:我们的模型无需被特别指示说真话,就能表现出更高的真实性。有趣的是,我们的1.3B PPO-ptx模型是个例外,它的表现略逊于相同规模的GPT-3模型。当仅在没有针对GPT-3进行对抗性筛选的提示上评估时,我们的PPO模型仍然明显比GPT-3更真实、更有信息量(尽管绝对改进幅度降低了几个百分点)。

Figure 6: Results on the TruthfulQA dataset. Gray bars indicate ratings of truthfulness; colored bars indicate ratings of truthfulness and informativeness.

图6:在TruthfulQA数据集上的结果。灰色柱表示真实性评分;彩色柱表示真实性和信息丰富性评分。
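
对照图6"灰色柱表示真实性、彩色柱表示真实且有信息量"的说明,下面用一小段代码示意这两类比例在给定人工标签后如何统计。数据格式是为说明而假设的,论文的实际评估流程可能不同。

```python
def truthfulqa_summary(labels):
    """labels: list of (truthful: bool, informative: bool) human ratings per answer
    (a hypothetical format for illustration only)."""
    n = len(labels)
    truthful = sum(t for t, _ in labels) / n                      # "gray bar"
    truthful_and_informative = sum(t and i for t, i in labels) / n  # "colored bar"
    return truthful, truthful_and_informative

print(truthfulqa_summary([(True, True), (True, False), (False, True), (True, True)]))
# -> (0.75, 0.5): 75% truthful, 50% both truthful and informative
```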

Following Lin et al. (2021), we also give a helpful “Instruction+QA” prompt that instructs the model to respond with “I have no comment” when it is not certain of the correct answer. In this case, our PPO models err on the side of being truthful and uninformative rather than confidently saying a falsehood; the baseline GPT-3 models aren’t as good at this.

Our improvements in truthfulness are also evidenced by the fact that our PPO models hallucinate (i.e. fabricate information) less often on closed-domain tasks from our API distribution, which we’ve shown in Figure 4.

根据Lin等人(2021)的方法,我们还提供了一个有益的“指令+问答”提示,指示模型在不确定正确答案时回答“我没有评论”。在这种情况下,我们的PPO模型倾向于说出真实但无信息的回答,而不是自信地说出虚假的内容;而基准的GPT-3模型在这方面表现不佳。
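
为便于理解"Instruction+QA"提示的形式,下面给出一个假设性的示意模板,仅用于说明"不确定时回答 I have no comment"这类指令大概长什么样,并非论文或 Lin et al. (2021) 所用提示的原文。

```python
# A hypothetical sketch of an "Instruction+QA" style prompt; the exact wording
# used in the paper / Lin et al. (2021) is not reproduced here.
INSTRUCTION_QA_PROMPT = (
    "Interpret each question literally and answer as truthfully as possible. "
    "If you are not certain of the correct answer, reply with \"I have no comment\".\n\n"
    "Q: {question}\n"
    "A:"
)

print(INSTRUCTION_QA_PROMPT.format(question="What happens if you crack your knuckles a lot?"))
```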

我们的真实性改进也得到了在API分布的封闭域任务上的幻觉(即编造信息)发生频率较低的证明,如图4所示。

与GPT-3相比,InstructGPT在毒性方面略有改善,但在偏见方面没有改善。InstructGPT shows small improvements in toxicity over GPT-3, but not bias.

InstructGPT shows small improvements in toxicity over GPT-3, but not bias. We first evaluate our models on the RealToxicityPrompts dataset (Gehman et al., 2020). We do this in two ways: we run model samples through the Perspective API to obtain automatic toxicity scores, which is the standard evaluation procedure for this dataset, and we also send these samples to labelers to obtain ratings on absolute toxicity, toxicity relative to the prompt, continuity, and overall output preference. We sample prompts from this dataset uniformly according to prompt toxicity to better assess how our models perform with high input toxicity (see Figure 39 in Appendix E); this differs from the standard prompt sampling for this dataset, and thus our absolute toxicity numbers are inflated.

InstructGPT在有毒性方面相对于GPT-3表现出一定改进,但并未改善偏见。

首先,我们在RealToxicityPrompts数据集(Gehman等,2020)上评估我们的模型。我们采用两种方式:一是将模型生成的样本送入Perspective API以获得自动毒性评分,这是该数据集的标准评估流程;二是将这些样本发送给标注员,获取关于绝对毒性、相对于提示的毒性、连贯性以及整体输出偏好的评分。我们按提示毒性对该数据集的提示进行均匀采样,以更好地评估模型在高毒性输入下的表现(详见附录E中的图39);这与该数据集的标准提示采样方式不同,因此我们的绝对毒性数值会偏高。
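
下面的示意代码展示"按提示毒性近似均匀地采样"的一种可能做法:把提示按毒性分数分桶,每个桶抽取相同数量。这只是对上述文字描述的假设性实现,毒性分数假定已由Perspective API等工具预先算好,具体采样细节以论文为准。

```python
import random
from collections import defaultdict

def sample_uniform_by_toxicity(prompts_with_scores, n_bins=10, per_bin=100, seed=0):
    """prompts_with_scores: list of (prompt_text, toxicity_score in [0, 1]).
    Bucket prompts by toxicity score and draw the same number from each bucket,
    so high-toxicity prompts are not under-represented (a sketch, not the paper's
    exact procedure)."""
    rng = random.Random(seed)
    bins = defaultdict(list)
    for prompt, score in prompts_with_scores:
        b = min(int(score * n_bins), n_bins - 1)   # map score to one of n_bins buckets
        bins[b].append(prompt)

    sample = []
    for b in range(n_bins):
        bucket = bins.get(b, [])
        k = min(per_bin, len(bucket))              # take at most per_bin prompts per bucket
        sample.extend(rng.sample(bucket, k))
    return sample
```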

Figure 7: Comparing human evaluations and automatic evaluations (Perspective API scores) on RealToxicityPrompts. A total of 1,729 prompts were labeled for three different 175B models, both with and without "respectful" instructions. The automatic evaluations shown here are calculated over the same set of prompts as the human evaluations, and thus differ slightly from the full set of evaluations recorded in Table 14 in Appendix D.

图7:在RealToxicityPrompts数据集上的人工评估和自动评估(Perspective API得分)的对比。共对三个175B模型进行了1,729个提示的标注,包括有“尊重”的指令和没有“尊重”的指令。此处显示的自动评估是在与人工评估相同的提示集上计算的,因此与附录D中记录的全部评估结果略有不同。

Our results are in Figure 7. We find that, when instructed to produce a safe and respectful output (“respectful prompt”), InstructGPT models generate less toxic outputs than those from GPT-3 according to the Perspective API. This advantage disappears when the respectful prompt is removed (“no prompt”). Interestingly, when explicitly prompted to produce a toxic output, InstructGPT outputs are much more toxic than those from GPT-3 (see Figure 39).

These results are confirmed in our human evaluations: InstructGPT is less toxic than GPT-3 in the “respectful prompt” setting, but performs similarly in the “no prompt” setting. We provide extended results in Appendix E. To summarize: all of our models are rated as less toxic than expected given the prompt (they get a negative score on a scale from -1 to 1, where 0 is ‘about as toxic as expected’). Our SFT baseline is the least toxic out of all of our models, but also has the lowest continuity and is the least preferred in our rankings, which could indicate that the model generates very short or degenerate responses.

我们的结果如图7所示。当指示模型生成安全且尊重的输出("尊重的提示")时,根据Perspective API的评分,InstructGPT模型生成的毒性输出少于GPT-3。当移除尊重的提示("无提示")时,这一优势消失。有趣的是,当被明确要求生成有毒输出时,InstructGPT的输出比GPT-3的毒性大得多(参见图39)。

这些结果在我们的人工评估中得到了证实:在"尊重的提示"设置下,InstructGPT的毒性低于GPT-3,但在"无提示"设置下表现相近。我们在附录E中提供了更详细的结果。总结起来:我们所有的模型被评为毒性低于根据提示所预期的水平(在从-1到1的量表上得到负分,其中0表示"毒性与预期相当")。我们的SFT基线是所有模型中毒性最低的,但其连贯性也最低,而且在我们的排名中最不受偏好,这可能表明该模型生成的回答非常简短或退化。

To evaluate the model’s propensity to generate biased speech (see Appendix E), we also evaluated InstructGPT on modified versions of the Winogender (Rudinger et al., 2018) and CrowS-Pairs (Nangia et al., 2020) datasets. These datasets consist of pairs of sentences which can highlight potential bias. We calculate the relative probabilities of producing the sentences in each pair and the entropy (in bits) of the associated binary probability distributions. Perfectly unbiased models will have no preference between the sentences in each pair and will therefore have maximum entropy. By this metric, our models are not less biased than GPT-3. The PPO-ptx model shows similar bias to GPT-3, but when instructed to act respectfully it exhibits lower entropy and thus higher bias. The pattern of the bias is not clear; it appears that the instructed models are more certain of their outputs regardless of whether or not their outputs exhibit stereotypical behavior.

为了评估模型产生带偏见言论的倾向(参见附录E),我们还在修改版的Winogender(Rudinger等,2018)和CrowS-Pairs(Nangia等,2020)数据集上评估了InstructGPT。这些数据集由成对的句子组成,可以突出潜在的偏见。我们计算模型生成每对句子的相对概率,以及相应二元概率分布的熵(以比特为单位)。完全无偏见的模型对每对句子没有偏好,因此熵最大。按这一指标,我们的模型并不比GPT-3偏见更少。PPO-ptx模型表现出与GPT-3相近的偏见,但在被指示以尊重的方式作答时,它表现出更低的熵,因而偏见更高。偏见的模式并不清晰;似乎被指示的模型对其输出更加确定,无论这些输出是否表现出刻板印象。
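
按上面的文字描述,这里用一小段代码示意该偏见度量的计算方式:把一对句子的对数概率归一化成二元分布,再求其熵(单位为比特),完全无偏好时熵为1比特。这只是根据文字描述写的示意,并非官方评测代码。

```python
import math

def pair_entropy_bits(logp_a, logp_b):
    """Given a model's total log-probabilities for the two sentences of a
    Winogender / CrowS-Pairs style pair, normalize them into a binary
    distribution and return its entropy in bits (1.0 = no preference)."""
    m = max(logp_a, logp_b)          # subtract max for numerical stability
    pa = math.exp(logp_a - m)
    pb = math.exp(logp_b - m)
    p = pa / (pa + pb)               # relative probability of sentence A
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

print(pair_entropy_bits(-31.2, -31.2))  # identical scores -> 1.0 bit (maximum entropy)
print(pair_entropy_bits(-30.0, -34.0))  # strong preference -> entropy well below 1 bit
```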

我们可以通过修改RLHF微调过程来减少在公开的NLP数据集上的性能下降。We can minimize performance regressions on public NLP datasets by modifying our RLHF fine-tuning procedure.

By default, when we train a PPO model on our API distribution, it suffers from an “alignment tax”, as its performance on several public NLP datasets decreases. We want an alignment procedure that avoids an alignment tax, because it incentivizes the use of models that are unaligned but more capable on these tasks.

默认情况下,当我们在API分布上训练PPO模型时,它会付出"对齐税":其在若干公开NLP数据集上的性能会下降。我们希望对齐过程能够避免对齐税,因为对齐税会鼓励人们使用未对齐但在这些任务上能力更强的模型。

Figure 8: Examples of generalization in the 175B PPO-ptx model (InstructGPT 175B) compared to GPT-3 175B with no additional prefixing. Prompts are cherry-picked to illustrate certain behaviors, but the outputs are not cherry-picked. (1) InstructGPT can follow instructions in other languages, though it sometimes generates outputs in English. GPT-3 requires more careful prompting, similarly to in English. (2) InstructGPT can summarize and answer questions about code more reliably than GPT-3 (though its answer here isn’t quite correct). For the code QA example, GPT-3 does answer the question about 50% of the time.

图8:175B PPO-ptx模型(InstructGPT 175B)与没有附加前缀的GPT-3 175B的泛化示例。提示被挑选出来以说明某些行为,但输出并未挑选。(1)InstructGPT可以遵循其他语言的指令,尽管有时会生成英文输出。GPT-3需要更仔细的提示,与在英文中类似。(2)InstructGPT在总结和回答关于代码的问题方面比GPT-3更可靠(尽管这里的回答并不完全正确)。对于代码问答示例,GPT-3大约有50%的时间会回答问题。

In Figure 29 we show that adding pretraining updates to our PPO fine-tuning (PPO-ptx) mitigates these performance regressions on all datasets, and even surpasses GPT-3 on HellaSwag. The performance of the PPO-ptx model still lags behind GPT-3 on DROP, SQuADv2, and translation; more work is needed to study and further eliminate these performance regressions.

Mixing in pretraining updates performs better than the simpler solution of increasing the KL coefficient. In Figure 33, we show that there is a value of the pretraining mix coefficient that both reverses the performance regressions on SQuADv2 and DROP (the datasets we used for testing), and has minimal reductions in validation reward. In contrast, increasing the KL coefficient (Figure 34) leads to significant decreases in validation reward and never fully recovers on DROP and SQuAD. Changing the KL model from the PPO init to GPT-3 gives similar results.

在图29中,我们展示了在PPO微调中加入预训练更新(PPO-ptx)可以在所有数据集上减轻这些性能回退,甚至在HellaSwag上超过了GPT-3。不过,PPO-ptx模型在DROP、SQuADv2和翻译任务上的性能仍落后于GPT-3,需要更多工作来研究并进一步消除这些性能回退。

混合预训练更新的效果比简单增加KL系数的解决方案要好。在图33中,我们展示了一种预训练混合系数的值,它可以扭转SQuADv2和DROP上的性能下降(我们用于测试的数据集),并且在验证奖励上只有微小的降低。相比之下,增加KL系数(图34)会导致验证奖励显著降低,并且在DROP和SQuAD上永远无法完全恢复。将KL模型从PPO初始模型更改为GPT-3的效果类似。
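
作为参考,下面按论文方法部分的表述给出带预训练混合项的RL优化目标的一种写法(此处为示意性复述,符号以论文原文为准):β 为KL惩罚系数,γ 为预训练混合系数;上文所说"增大KL系数"即增大 β,"调节预训练混合系数"即调节 γ,而 γ=0 时即为普通的PPO模型,γ>0 时为PPO-ptx。

```latex
\mathrm{objective}(\phi)=
  \mathbb{E}_{(x,y)\sim D_{\pi_{\phi}^{\mathrm{RL}}}}
  \Big[\, r_{\theta}(x,y)\;-\;\beta\,
  \log\frac{\pi_{\phi}^{\mathrm{RL}}(y\mid x)}{\pi^{\mathrm{SFT}}(y\mid x)}\Big]
  \;+\;\gamma\,\mathbb{E}_{x\sim D_{\mathrm{pretrain}}}
  \big[\log \pi_{\phi}^{\mathrm{RL}}(x)\big]
```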

4.3、Qualitative results定性结果

InstructGPT模型展示了对RLHF微调分布之外指令的有希望的泛化能力。InstructGPT models show promising generalization to instructions outside of the RLHF fine-tuning distribution.

In particular, we find that InstructGPT shows ability to follow instructions in non-English languages, and perform summarization and question-answering for code. This is interesting because non-English languages and code form a tiny minority of our fine-tuning data, and it suggests that, in some cases, alignment methods could generalize to producing the desired behavior on inputs that humans did not directly supervise.

特别是,我们发现InstructGPT展现出遵循非英语语言指令,以及对代码进行摘要和问答的能力。这很有趣,因为非英语语言和代码在我们的微调数据中只占极小比例,这表明在某些情况下,对齐方法可以泛化到人类没有直接监督的输入上,并产生所期望的行为。

Figure 9: Simple mistakes in the 175B PPO-ptx model (InstructGPT 175B) compared to GPT-3 175B with no additional prefixing. Prompts are cherry-picked to illustrate certain behaviors, but the outputs are not cherry-picked. (1) InstructGPT can be confused by instructions that assume false premises, and simply go along with it. (2) InstructGPT can overly hedge, rather than directly answering simple questions (in this case, it’s likely that the pumpkin would completely explode). Note that these samples do not fully reflect GPT-3’s ability to answer questions, since it has not been prompted into a “question answering” mode.

图9:175B PPO-ptx模型(InstructGPT 175B)与没有附加前缀的GPT-3 175B在简单错误方面的比较。这些示例是精选出来以说明某些行为,但输出并不是经过精选的。(1)InstructGPT可能会被假设错误前提的指令所迷惑,并简单地遵循这个前提。(2)InstructGPT可能过度保守,而不是直接回答简单问题(在这种情况下,很可能是南瓜会完全爆炸)。请注意,这些样本并不能完全反映GPT-3回答问题的能力,因为它没有被引导进入“问题回答”模式。

We do not track these behaviors quantitatively, but we show some qualitative examples in Figure 8. Our 175B PPO-ptx model is able to reliably answer questions about code, and can also follow instructions in other languages; however, we notice that it often produces an output in English even when the instruction is in another language. In comparison, we find that GPT-3 can perform these tasks but requires more careful prompting, and rarely follows instructions in these domains.

我们没有定量追踪这些行为,但在图8中展示了一些定性示例。我们的175B PPO-ptx模型能够可靠地回答有关代码的问题,并且还能够遵循其他语言的指令。然而,我们注意到,即使指令是其他语言的,它经常产生英文输出。相比之下,我们发现GPT-3可以执行这些任务,但需要更仔细的引导,并且在这些领域很少遵循指令。

InstructGPT仍然会犯简单错误InstructGPT still makes simple mistakes.

In interacting with our 175B PPO-ptx model, we have noticed it can still make simple mistakes, despite its strong performance on many different language tasks. To give a few examples: (1) when given an instruction with a false premise, the model sometimes incorrectly assumes the premise is true, (2) the model can overly hedge; when given a simple question, it can sometimes say that there is no one answer to the question and give multiple possible answers, even when there is one fairly clear answer from the context, and (3) the model’s performance degrades when instructions contain multiple explicit constraints (e.g. “list 10 movies made in the 1930’s set in France”) or when constraints can be challenging for language models (e.g. writing a summary in a specified number of sentences).

在与我们的175B PPO-ptx模型交互时,我们注意到尽管它在许多不同的语言任务上表现出色,但仍会犯一些简单的错误。举几个例子:(1)当给出一个带有错误前提的指令时,模型有时会错误地假设前提成立;(2)模型可能过度回避、含糊其辞:当被问到一个简单的问题时,它有时会说这个问题没有唯一答案并给出多个可能的答案,即使从上下文看答案已经相当明确;(3)当指令包含多个明确约束(例如"列出10部20世纪30年代拍摄、以法国为背景的电影"),或当约束对语言模型本身具有挑战性时(例如按指定的句子数写摘要),模型的性能会下降。

We show some examples of these behaviors in Figure 9. We suspect that behavior (2) emerges partly because we instruct labelers to reward epistemic humility; thus, they may tend to reward outputs that hedge, and this gets picked up by our reward model. We suspect that behavior (1) occurs because there are few prompts in the training set that assume false premises, and our models don’t generalize well to these examples. We believe both these behaviors could be dramatically reduced with adversarial data collection (Dinan et al., 2019b).

我们在图9中展示了这些行为的一些示例。我们猜测行为(2)的出现部分是因为我们指示标注者奖励"认知上的谦逊";因此,他们可能倾向于奖励含糊其辞的输出,而这被我们的奖励模型学了进去。我们猜测行为(1)的出现是因为训练集中很少有假设了错误前提的提示,我们的模型在这类示例上泛化得不好。我们相信,通过对抗性数据收集(Dinan等人,2019b),这两种行为都可以显著减少。

5、Discussion讨论

5.1、Implications for alignment research对齐研究的意义

This research is part of our broader research program to align AI systems with human intentions (Christiano et al., 2017; Ziegler et al., 2019; Stiennon et al., 2020). Even though this work focuses on our current language model systems, we seek general and scalable methods that work for future AI systems (Leike et al., 2018). The systems we work with here are still fairly limited, but they are among the largest language models today and we apply them on a wide range of language tasks, including classification, summarization, question-answering, creative writing, dialogue, and others.

这项研究是我们更广泛研究计划的一部分,旨在将AI系统与人类意图对齐(Christiano et al., 2017; Ziegler et al., 2019; Stiennon et al., 2020)。尽管这项工作侧重于我们目前的语言模型系统,但我们寻求通用且可扩展、适用于未来AI系统的方法(Leike et al., 2018)。我们在这里使用的系统仍然相当有限,但它们是当今最大的语言模型之一,并且我们将它们应用于广泛的语言任务,包括分类、摘要、问答、创意写作、对话等等。

Our approach to alignment research in this work is iterative: we are improving the alignment of current AI systems instead of focusing abstractly on aligning AI systems that don't yet exist. A disadvantage of this approach is that we are not directly facing alignment problems that occur only when aligning superhuman systems (Bostrom, 2014). However, our approach does provide us with a clear empirical feedback loop of what works and what does not. We believe that this feedback loop is essential to refine our alignment techniques, and it forces us to keep pace with progress in machine learning. Moreover, the alignment technique we use here, RLHF, is an important building block in several proposals to align superhuman systems (Leike et al., 2018; Irving et al., 2018; Christiano et al., 2018). For example, RLHF was a central method in recent work on summarizing books, a task that exhibits some of the difficulties of aligning superhuman AI systems as it is difficult for humans to evaluate directly (Wu et al., 2021).

我们在这项工作中对齐研究的方法是迭代式的:我们在改进当前AI系统的对齐,而不是抽象地专注于对齐尚不存在的AI系统。这种方法的一个缺点是,我们没有直接面对只有在对齐超越人类能力的系统时才会出现的对齐问题(Bostrom, 2014)。然而,我们的方法确实为我们提供了一个清晰的经验反馈循环,告诉我们哪些方法有效、哪些无效。我们认为这个反馈循环对于完善我们的对齐技术至关重要,并且它迫使我们跟上机器学习的进展。此外,我们在这里使用的对齐技术RLHF,是多个旨在对齐超人类系统的方案中的重要组成部分(Leike et al., 2018; Irving et al., 2018; Christiano et al., 2018)。例如,RLHF是最近书籍摘要工作中的核心方法,该任务由于人类难以直接评估,体现了对齐超人类AI系统的一些困难(Wu et al., 2021)。

From this work, we can draw lessons for alignment research more generally:

从这项工作中,我们可以得出一些对齐研究的普遍教训:

相对于预训练,增加模型对齐的成本性价比更高The cost of increasing model alignment is modest relative to pretraining

The cost of increasing model alignment is modest relative to pretraining. The cost of collecting our data and the compute for training runs, including experimental runs is a fraction of what was spent to train GPT-3: training our 175B SFT model requires 4.9 petaflops/s-days and training our 175B PPO-ptx model requires 60 petaflops/s-days, compared to 3,640 petaflops/s-days for GPT-3 (Brown et al., 2020). At the same time, our results show that RLHF is very effective at making language models more helpful to users, more so than a 100x model size increase. This suggests that right now increasing investments in alignment of existing language models is more cost-effective than training larger models—at least for our customers’ natural language task distribution.

相对于预训练,增加模型对齐的成本较低。我们收集数据、进行训练运行的计算(包括实验运行)的成本,是训练GPT-3所花费的一小部分:训练我们的175B SFT模型需要4.9 petaflops/s-days,训练我们的175B PPO-ptx模型需要60 petaflops/s-days,而GPT-3的训练成本是3,640 petaflops/s-days(Brown et al., 2020)。与此同时,我们的结果表明,RLHF在使语言模型更有帮助方面非常有效,比增加100倍的模型规模更为有效。这表明,目前增加对现有语言模型的对齐投资比训练更大的模型更具成本效益,至少对于我们客户的自然语言任务分布来说。
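
把上面给出的数字做一个简单换算可以更直观(粗略估算,只计入文中列出的训练算力,不含数据收集、人工标注等其他开销):

```latex
\frac{4.9 + 60}{3640} \approx 1.8\%
\qquad \text{(SFT 与 PPO-ptx 训练算力之和占 GPT-3 预训练算力的比例, 单位: petaflops/s-days)}
```

也就是说,按文中口径,对齐相关的训练算力大约只占GPT-3预训练算力的2%左右。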

可知InstructGPT在我们不进行监督的设置中泛化了“按照指示”的能力We’ve seen some evidence that InstructGPT generalizes ‘following instructions’ to settings that we don’t supervise it in

We’ve seen some evidence that InstructGPT generalizes ‘following instructions’ to settings that we don’t supervise it in, for example on non-English language tasks and code-related tasks. This is an important property because it’s prohibitively expensive to have humans supervise models on every task they perform. More research is needed to study how well this generalization scales with increased capabilities; see Christiano et al. (2021) for recent research in this direction.

我们已经看到一些证据,表明InstructGPT能把"遵循指令"的能力泛化到我们没有对其进行监督的场景,例如非英语语言任务和与代码相关的任务。这是一个重要的特性,因为让人类在模型执行的每一项任务上都进行监督的成本高得难以承受。还需要更多研究来了解这种泛化能力如何随模型能力的提升而扩展;关于这个方向的最新研究,参见Christiano et al.(2021)。

降低对齐税—我们成功地减轻了我们的微调引入的大部分性能下降We were able to mitigate most of the performance degradations introduced by our fine-tuning

We were able to mitigate most of the performance degradations introduced by our fine-tuning. If this was not the case, these performance degradations would constitute an alignment tax—an additional cost for aligning the model. Any technique with a high tax might not see adoption. To avoid incentives for future highly capable AI systems to remain unaligned with human intent, there is a need for alignment techniques that have low alignment tax. To this end, our results are good news for RLHF as a low-tax alignment technique.

我们成功地减轻了微调所引入的大部分性能下降。如果做不到这一点,这些性能下降将构成一种对齐税,即为对齐模型付出的额外成本。任何对齐税很高的技术都可能不会被采用。为了避免给未来高能力的AI系统留下保持与人类意图不对齐的激励,我们需要对齐税较低的对齐技术。就此而言,我们的结果对于RLHF作为一种低税收对齐技术来说是个好消息。

已在现实世界中验证了对齐技术的研究We’ve validated alignment techniques from research in the real world

We’ve validated alignment techniques from research in the real world. Alignment research has historically been rather abstract, focusing on either theoretical results (Soares et al., 2015), small synthetic domains (Christiano et al., 2018; Leike et al., 2017), or training ML models on public NLP datasets (Ziegler et al., 2019; Stiennon et al., 2020). Our work provides grounding for alignment research in AI systems that are being used in production in the real world with customers. This enables an important feedback loop on the techniques’ effectiveness and limitations.

我们已经在现实世界中验证了来自研究的对齐技术。对齐研究在历史上一直相当抽象,要么关注理论结果(Soares et al., 2015),要么关注小规模的合成环境(Christiano et al., 2018; Leike et al., 2017),要么是在公共NLP数据集上训练ML模型(Ziegler et al., 2019; Stiennon et al., 2020)。我们的工作为那些在现实世界中面向客户投入生产使用的AI系统的对齐研究提供了落脚点。这为这些技术的有效性和局限性建立了重要的反馈循环。

5.2、Who are we aligning to?我们要对齐的对象是谁?

When aligning language models with human intentions, their end behavior is a function of the underlying model (and its training data), the fine-tuning data, and the alignment method used. In this section, we describe a number of factors that influence the fine-tuning data specifically, to ultimately determine what and who we’re aligning to. We then consider areas for improvement before a larger discussion of the limitations of our work in Section 5.3.

The literature often frames alignment using such terms as “human preferences” or “human values.” In this work, we have aligned to a set of labelers’ preferences that were influenced, among other things, by the instructions they were given, the context in which they received them (as a paid job), and who they received them from. Some crucial caveats apply:

当将语言模型与人类意图对齐时,它们的最终行为取决于底层模型(及其训练数据)、微调数据以及所使用的对齐方法。在本节中,我们具体描述影响微调数据的一些因素,以最终确定我们在对齐什么、对齐到谁。随后我们讨论可改进之处,然后在第5.3节中更全面地讨论我们工作的局限性。

文献常常用"人类偏好"或"人类价值观"等术语来描述对齐。在这项工作中,我们对齐到的是一组标注者的偏好,这些偏好除其他因素外,还受到他们所接受的指示、接受指示的情境(作为一份有偿工作)以及指示来源的影响。有一些关键的注意事项:

语言模型的最终行为受多个因素影响,包括底层模型本身、训练数据和对齐方法;我们主要对齐到标注者提供的偏好,但标注者的代表性有限

First, we are aligning to demonstrations and preferences provided by our training labelers, who directly produce the data that we use to fine-tune our models. We describe our labeler hiring process and demographics in Appendix B; in general, they are mostly English-speaking people living in the United States or Southeast Asia hired via Upwork or Scale AI. They disagree with each other on many examples; we found the inter-labeler agreement to be about 73%.

Second, we are aligning to our preferences, as the researchers designing this study (and thus by proxy to our broader research organization, OpenAI): we write the labeling instructions that labelers use as a guide when writing demonstrations and choosing their preferred output, and we answer their questions about edge cases in a shared chat room. More study is needed on the exact effect of different instruction sets and interface designs on the data collected from labelers and its ultimate effect on model behavior.

Third, our training data is determined by prompts sent by OpenAI customers to models on the OpenAI API Playground, and thus we are implicitly aligning to what customers think is valuable and, in some cases, what their end-users think is valuable to currently use the API for. Customers and their end users may disagree or customers may not be optimizing for end users’ well-being; for example, a customer may want a model that maximizes the amount of time a user spends on their platform, which is not necessarily what end-users want. In practice, our labelers don’t have visibility into the contexts in which a given prompt or completion will be seen.

Fourth, OpenAI’s customers are not representative of all potential or current users of language models—let alone of all individuals and groups impacted by language model use. For most of the duration of this project, users of the OpenAI API were selected off of a waitlist. The initial seeds for this waitlist were OpenAI employees, biasing the ultimate group toward our own networks.

首先,我们对齐到训练标注者提供的示范和偏好,他们直接产生了我们用于微调模型的数据。我们在附录B中描述了标注者的招聘过程和人口统计信息;总体而言,他们大多是通过Upwork或Scale AI招聘、居住在美国或东南亚、以英语为主要语言的人员。他们在许多示例上存在分歧;我们发现标注者之间的一致性约为73%。

其次,我们也在对齐到我们自己的偏好,即设计这项研究的研究人员(从而间接对齐到我们更广泛的研究机构OpenAI):我们编写标注指示,供标注者在撰写示范和选择首选输出时参考,并在共享聊天室中回答他们关于边界情况的问题。对于不同的指令集和界面设计如何影响从标注者收集到的数据,以及最终如何影响模型行为,还需要进一步研究。

第三,我们的训练数据由OpenAI的客户通过OpenAI API Playground向模型发送的提示确定,因此我们隐性地对客户认为有价值的内容进行对齐,并且在某些情况下,对客户的最终用户认为目前使用API有价值的内容进行对齐。客户及其最终用户可能存在分歧,或者客户可能不会优化最终用户的福祉;例如,客户可能希望模型最大化用户在其平台上花费的时间,而这不一定是最终用户想要的。在实践中,我们的标注者无法看到给定提示或完成将在哪些上下文中显示。

第四,OpenAI的客户并不代表所有潜在或当前的语言模型用户,更不用说所有受语言模型使用影响的个人和群体了。在该项目的大部分时间里,OpenAI API的用户是从候补名单中选择的。这个候补名单的最初种子是OpenAI的员工,从而使最终的用户群体在很大程度上偏向于我们自己的网络。

Stepping back, there are many difficulties in designing an alignment process that is fair, transparent, and has suitable accountability mechanisms in place. The goal of this paper is to demonstrate that this alignment technique can align to a specific human reference group for a specific application. We are not claiming that researchers, the labelers we hired, or our API customers are the right source of preferences. There are many stakeholders to consider: the organization training the model, the customers using the model to develop products, the end users of these products, and the broader population who may be directly or indirectly affected. It is not only a matter of making the alignment process more participatory; it is impossible to train a system that is aligned to everyone's preferences at once, or where everyone would endorse the tradeoffs.

回过头来看,设计一个公平、透明且具有适当问责机制的对齐过程存在许多困难。本文的目标是证明这种对齐技术可以对齐到特定的人类参考群体以应用于特定的场景。我们并不声称研究人员、我们雇佣的标注者或我们的API客户是正确的偏好来源。有很多利益相关者需要考虑——训练模型的组织、使用模型开发产品的客户、这些产品的最终用户,以及可能直接或间接受到影响的广大人口群体。这不仅是使对齐过程更具参与性的问题;训练一个同时符合所有人偏好或每个人都认可权衡的系统是不可能的

One path forward could be to train models that can be conditioned on the preferences of certain groups, or that can be easily fine-tuned or prompted to represent different groups. Different models can then be deployed and used by groups who endorse different values. However, these models might still end up affecting broader society and there are a lot of difficult decisions to be made relating to whose preferences to condition on, and how to ensure that all groups can be represented and can opt out of processes that may be harmful.

前进的一种途径可能是训练可以以特定群体的偏好为条件的模型,或是可以通过微调或提示轻松地代表不同群体的模型。这样,认同不同价值观的群体就可以部署并使用不同的模型。然而,这些模型仍然可能影响更广泛的社会,而且还需要做出许多困难的决策,比如以谁的偏好作为条件,以及如何确保所有群体都能得到代表,并能够退出可能对其有害的过程。

5.3、Limitations局限性:基于有限代表性的标注者偏好进行训练和调整,模型存在不完全对齐和不完全安全的问题

Methodology. The behavior of our InstructGPT models is determined in part by the human feedback obtained from our contractors. Some of the labeling tasks rely on value judgments that may be impacted by the identity of our contractors, their beliefs, cultural backgrounds, and personal history. We hired about 40 contractors, guided by their performance on a screening test meant to judge how well they could identify and respond to sensitive prompts, and their agreement rate with researchers on a labeling task with detailed instructions (see Appendix B). We kept our team of contractors small because this facilitates high-bandwidth communication with a smaller set of contractors who are doing the task full-time. However, this group is clearly not representative of the full spectrum of people who will use and be affected by our deployed models. As a simple example, our labelers are primarily English-speaking and our data consists almost entirely of English instructions.

方法论。我们的InstructGPT模型的行为部分取决于从承包商那里获得的人类反馈。一些标注任务依赖于价值判断,而这些判断可能会受到承包商的身份、信念、文化背景和个人经历的影响。我们雇佣了约40名承包商,依据是他们在一项筛选测试中的表现(该测试旨在评估他们识别和应对敏感提示的能力),以及他们在一项带有详细说明的标注任务上与研究人员的一致率(见附录B)。我们将承包商团队保持在较小规模,因为这便于与一小组全职从事该任务的承包商进行高带宽的沟通。然而,这个群体显然不能代表将会使用并受到我们已部署模型影响的所有人群。举个简单的例子,我们的标注者主要讲英语,而我们的数据几乎完全由英语指令组成。

There are also many ways in which we could improve our data collection set-up. For instance, most comparisons are only labeled by 1 contractor for cost reasons. Having examples labeled multiple times could help identify areas where our contractors disagree, and thus where a single model is unlikely to align to all of them. In cases of disagreement, aligning to the average labeler preference may not be desirable. For example, when generating text that disproportionately affects a minority group, we may want the preferences of labelers belonging to that group to be weighted more heavily.

我们还有许多可以改进数据收集设置的方式。例如,出于成本原因,大多数比较仅由1名承包商进行标注。对同一示例进行多次标注,有助于找出承包商意见不一致的领域,也就是单个模型不太可能与所有人都保持一致的领域。在意见不一致的情况下,对齐到标注者的平均偏好可能并不可取。例如,在生成会不成比例地影响某个少数群体的文本时,我们可能希望让属于该群体的标注者的偏好获得更大的权重。

Models. Our models are neither fully aligned nor fully safe; they still generate toxic or biased outputs, make up facts, and generate sexual and violent content without explicit prompting. They can also fail to generate reasonable outputs on some inputs; we show some examples of this in Figure 9.

Perhaps the greatest limitation of our models is that, in most cases, they follow the user’s instruction, even if that could lead to harm in the real world. For example, when given a prompt instructing the models to be maximally biased, InstructGPT generates more toxic outputs than equivalently-sized GPT-3 models. We discuss potential mitigations in the following sections.

模型。我们的模型既没有完全对齐,也没有完全安全;它们仍会生成有毒或有偏见的输出、捏造事实,并在没有明确提示的情况下生成色情和暴力内容。它们还可能在某些输入上无法生成合理的输出;我们在图9中展示了一些例子。

也许我们模型最大的局限性是,在大多数情况下,它们会遵循用户的指令,即使这可能导致现实世界中的伤害。例如,当给定一个提示,要求模型尽可能具有偏见时,InstructGPT生成的有害输出比同样规模的GPT-3模型更多。我们在接下来的章节中讨论了一些潜在的缓解方法。

5.4、Open questions开放问题:用对齐技术微调语言模型有利,但仍存在诸多开放问题和挑战需要解决

This work is a first step towards using alignment techniques to fine-tune language models to follow a wide range of instructions. There are many open questions to explore to further align language model behavior with what people actually want them to do.

Many methods could be tried to further decrease the models’ propensity to generate toxic, biased, or otherwise harmful outputs. For example, one could use an adversarial set-up where labelers find the worst-case behaviors of the model, which are then labeled and added to the dataset (Dinan et al., 2019b). One could also combine our method with ways of filtering the pretraining data (Ngo et al., 2021), either for training the initial pretrained models, or for the data we use for our pretraining mix approach. Similarly, one could combine our approach with methods that improve models’ truthfulness, such as WebGPT (Nakano et al., 2021).

这项工作是使用对齐技术对语言模型进行微调,以使其遵循各种指令的第一步。还有许多开放性问题需要探索,以进一步使语言模型的行为与人们实际希望它们做的事情相一致。

可以尝试许多方法来进一步降低模型生成有毒、有偏见或其他有害输出的倾向。例如,可以采用对抗性设置,由标注者找出模型的最坏行为,然后对其进行标注并加入数据集(Dinan等,2019b)。还可以将我们的方法与过滤预训练数据的方式相结合(Ngo等,2021),既可以用于训练初始的预训练模型,也可以用于我们预训练混合方法所使用的数据。类似地,还可以将我们的方法与提升模型真实性的方法相结合,例如WebGPT(Nakano等,2021)。

In this work, if the user requests a potentially harmful or dishonest response, we allow our model to generate these outputs. Training our model to be harmless despite user instructions is important, but is also difficult because whether an output is harmful depends on the context in which it’s deployed; for example, it may be beneficial to use language models to generate toxic outputs as part of a data augmentation pipeline. Our techniques can also be applied to making models refuse certain user instructions, and we plan to explore this in subsequent iterations of this research.

Getting models to do what we want is directly related to the steerability and controllability litera-ture (Dathathri et al., 2019; Krause et al., 2020). A promising future path is combining RLHF with other methods of steerability, for example using control codes (Keskar et al., 2019), or modifying the sampling procedure at inference time using a smaller model (Dathathri et al., 2019).

While we mainly focus on RLHF, there are many other algorithms that could be used to train policies on our demonstration and comparison data to get even better results. For example, one could explore expert iteration (Anthony et al., 2017; Silver et al., 2017), or simpler behavior cloning methods that use a subset of the comparison data. One could also try constrained optimization approaches (Achiam et al., 2017) that maximize the score from a reward model conditioned on generating a small number of harmful behaviors.

在这项工作中,如果用户请求潜在有害或不诚实的回应,我们允许模型生成这些输出。让模型在无视用户指令的情况下保持无害很重要,但也很困难,因为一个输出是否有害取决于它被部署的上下文;例如,作为数据增强流程的一部分,用语言模型生成有毒输出可能是有益的。我们的技术也可以用于让模型拒绝某些用户指令,我们计划在这项研究的后续迭代中探索这一点。

使模型按照我们的意愿行事与可操控性的文献密切相关(Dathathri等,2019;Krause等,2020)。一个有希望的未来路径是将RLHF与其他可操控性方法相结合,例如使用控制代码(Keskar等,2019),或者在推理时使用较小模型修改采样过程(Dathathri等,2019)。

虽然我们主要关注RLHF,但还有许多其他算法可用于在我们的示范和比较数据上训练策略,以获得更好的结果。例如,可以探索专家迭代(Anthony等,2017;Silver等,2017),或使用比较数据子集的更简单的行为克隆方法。还可以尝试约束优化方法(Achiam等,2017),即在仅产生少量有害行为的约束下最大化奖励模型给出的得分。

Comparisons are also not necessarily the most efficient way of providing an alignment signal. For example, we could have labelers edit model responses to make them better, or generate critiques of model responses in natural language. There is also a vast space of options for designing interfaces for labelers to provide feedback to language models; this is an interesting human-computer interaction problem.

Our proposal for mitigating the alignment tax, by incorporating pretraining data into RLHF fine-tuning, does not completely mitigate performance regressions, and may make certain undesirable behaviors more likely for some tasks (if these behaviors are present in the pretraining data). This is an interesting area for further research. Another modification that would likely improve our method is to filter the pretraining mix data for toxic content (Ngo et al., 2021), or augment this data with synthetic instructions.

比较也不一定是提供对齐信号的最有效方式。例如,我们可以让标注者编辑模型的回应以使其更好,或者以自然语言生成对模型回应的批评。对于设计标注者与语言模型之间的反馈接口,有很多选择的空间,这是一个有趣的人机交互问题。

我们提出的通过在RLHF微调中纳入预训练数据来减轻对齐税的方案,并不能完全消除性能回退,而且对于某些任务,可能使某些不良行为更容易出现(如果这些行为存在于预训练数据中)。这是一个值得进一步研究的有趣领域。另一个可能改进我们方法的做法是过滤预训练混合数据中的有害内容(Ngo等,2021),或者用合成指令来扩充这些数据。

As discussed in detail in Gabriel (2020), there are subtle differences between aligning to instructions, intentions, revealed preferences, ideal preferences, interests, and values. Gabriel (2020) advocate for a principle-based approach to alignment: in other words, for identifying “fair principles for alignment that receive reflective endorsement despite widespread variation in people’s moral beliefs.” In our paper we align to the inferred user intention for simplicity, but more research is required in this area. Indeed, one of the biggest open questions is how to design an alignment process that is transparent, that meaningfully represents the people impacted by the technology, and that synthesizes peoples’ values in a way that achieves broad consensus amongst many groups. We discuss some related considerations in Section 5.2.

正如Gabriel(2020)中详细讨论的那样,对齐到指令、意图、显露偏好、理想偏好、利益和价值观之间存在微妙差异。Gabriel(2020)主张采用基于原则的对齐方法:换句话说,确定"尽管人们的道德信念千差万别,仍能获得反思性认可的公平对齐原则"。在我们的论文中,为简单起见,我们对齐到推断出的用户意图,但这一领域还需要更多研究。事实上,最大的开放问题之一是:如何设计一个透明的对齐过程,使其能够有意义地代表受该技术影响的人群,并以能在众多群体之间达成广泛共识的方式综合人们的价值观。我们在第5.2节中讨论了一些相关的考虑。

5.5、Broader impacts更广泛的影响:对齐技术不能解决安全问题,需要更完善的监管和安全机制

This work is motivated by our aim to increase the positive impact of large language models by training them to do what a given set of humans want them to do. By default, language models optimize the next word prediction objective, which is only a proxy for what we want these models to do. Our results indicate that our techniques hold promise for making language models more helpful, truthful, and harmless. In the longer term, alignment failures could lead to more severe consequences, particularly if these models are deployed in safety-critical situations. We expect that as model scaling continues, greater care has to be taken to ensure that they are aligned with human intentions (Bostrom,2014).

However, making language models better at following user intentions also makes them easier to misuse. It may be easier to use these models to generate convincing misinformation, or hateful or abusive content.

本项工作的动机是通过训练大型语言模型使其做一组特定人类希望它们做的事情,从而增加它们的积极影响。默认情况下,语言模型优化下一个单词预测目标,这只是我们希望这些模型做的事情的一个代理。我们的结果表明,我们的技术有望使语言模型更有帮助、更真实和更无害。从长远来看,如果对齐失败,可能会导致更严重的后果,特别是如果这些模型在安全关键的情况下部署。我们预计,随着模型的规模扩大,必须更加谨慎地确保它们与人类意图一致(Bostrom,2014)。

然而,使语言模型更好地遵循用户意图,也会使它们更容易被滥用。例如,可能更容易用这些模型来生成以假乱真的错误信息,或仇恨性、辱骂性内容。

Alignment techniques are not a panacea for resolving safety issues associated with large language models; rather, they should be used as one tool in a broader safety ecosystem. Aside from intentional misuse, there are many domains where large language models should be deployed only with great care, or not at all. Examples include high-stakes domains such as medical diagnoses, classifying people based on protected characteristics, determining eligibility for credit, employment, or hous-ing, generating political advertisements, and law enforcement. If these models are open-sourced, it becomes challenging to limit harmful applications in these and other domains without proper regulation. On the other hand, if large language model access is restricted to a few organizations with the resources required to train them, this excludes most people from access to cutting-edge ML technology. Another option is for an organization to own the end-to-end infrastructure of model deployment, and make it accessible via an API. This allows for the implementation of safety protocols like use case restriction (only allowing the model to be used for certain applications), monitoring for misuse and revoking access to those who misuse the system, and rate limiting to prevent the generation of large-scale misinformation. However, this can come at the cost of reduced transparency and increased centralization of power because it requires the API provider to make decisions on where to draw the line on each of these questions.

Finally, as discussed in Section 5.2, the question of who these models are aligned to is extremely important, and will significantly affect whether the net impact of these models is positive or negative.

对齐技术并不是解决大型语言模型相关安全问题的灵丹妙药;它们应当作为更广泛的安全生态系统中的一种工具来使用。除了有意滥用之外,还有许多领域只有在极其谨慎的情况下才应该部署大型语言模型,甚至根本不应该部署,例如医学诊断、基于受保护特征对人进行分类、确定信用、就业或住房资格、生成政治广告以及执法等高风险领域。如果这些模型被开源,在缺乏适当监管的情况下,就很难限制它们在这些及其他领域中的有害应用。另一方面,如果大型语言模型的访问权仅限于少数拥有训练它们所需资源的组织,那么大多数人将无法接触到最前沿的机器学习技术。还有一种选择是由一个组织拥有模型部署的端到端基础设施,并通过API提供访问。这样就可以实施安全协议,例如限制用例(仅允许模型用于特定应用)、监控滥用并撤销滥用者的访问权限,以及通过速率限制防止大规模生成错误信息。然而,这可能以降低透明度和加剧权力集中为代价,因为这要求API提供者就上述每个问题在哪里划定界限做出决策。

最后,正如在第5.2节中讨论的那样,这些模型对齐的对象是非常重要的,并且将显著影响这些模型的净影响是积极的还是消极的。
