LLMs: Translation and Interpretation of 《Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge》

Overview: Released on July 28, 2024 by Meta and collaborating teams. This work proposes a meta-rewarding mechanism that has the language model judge its own judgments and improve its judging ability, thereby strengthening unsupervised self-improvement.

Background and pain points: Improving large language models has traditionally depended on labor-intensive human-annotated data. Existing self-rewarding mechanisms focus mainly on improving the model's responses while neglecting its judging ability, causing iterative training to saturate quickly.

● Models surpassing human knowledge: Large language models (LLMs) are rapidly surpassing human knowledge in many domains.

● Limitations of relying on human data: Traditionally, improving models relies on costly human data whose quality is bounded by human capability.

● Limitations of existing methods: Existing self-rewarding mechanisms concentrate on improving response generation while overlooking judging ability, which leads to rapid saturation during training.

Proposed solution: Meta-Rewarding. A meta-judge role is introduced so that the model evaluates its own judgments, thereby improving its judging ability.

● Introducing Meta-Rewarding: A new Meta-Rewarding method is proposed in which the model evaluates not only its own responses but also its own evaluations, in order to improve its judging ability.

● Self-improvement loop: Without supervision, the model improves its own abilities through self-evaluation and feedback.

● Role assignment: During training, the same model plays three roles: actor (generates responses), judge (scores responses), and meta-judge (evaluates judgments).

Core idea and steps

● Actor generates responses: For each instruction, the actor samples multiple candidate responses.

● Judge evaluates responses: The judge evaluates each response and produces scored judgments, using the LLM-as-a-Judge mechanism to assess responses and assign rewards.

● Meta-judge evaluates the evaluations: The meta-judge compares different judgments to determine which is more accurate, using the LLM-as-a-Meta-Judge mechanism to evaluate the judge's judgments.

● Preference-pair creation: The judgments and meta-judgments are used to build preference pairs for training both the acting and the judging ability of the model.

● Iterative training: The model is trained iteratively on actor response preference pairs and judge judgment preference pairs. Each iteration optimizes the model on this self-generated data, improving both acting and judging ability (see the sketch after this list).
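To make the data flow above concrete, here is a minimal Python sketch of one Meta-Rewarding iteration. It is an illustration under stated assumptions, not the authors' implementation: the callables `generate`, `judge`, `meta_judge`, and `train` are hypothetical wrappers that the caller supplies around the same underlying model (actor sampling, LLM-as-a-Judge scoring, LLM-as-a-Meta-Judge comparison, and DPO-style training, respectively).

```python
from itertools import combinations

def meta_rewarding_iteration(prompts, generate, judge, meta_judge, train,
                             n_responses=4, n_judgments=3):
    """One Meta-Rewarding iteration (illustrative sketch of the data flow).

    All four callables are assumed to wrap the SAME model:
      generate(prompt, k)                    -> list of k candidate responses (actor)
      judge(prompt, response)                -> (score, judgment_text)        (judge)
      meta_judge(prompt, response, j_a, j_b) -> 0 or 1, index of the better judgment
      train(actor_pairs, judge_pairs)        -> next-iteration model (e.g. via DPO)
    """
    actor_pairs, judge_pairs = [], []

    for prompt in prompts:
        # Actor: sample several candidate responses for the instruction.
        responses = generate(prompt, n_responses)

        # Judge: score each response several times with an LLM-as-a-Judge prompt.
        judgments = {r: [judge(prompt, r) for _ in range(n_judgments)] for r in responses}
        avg = {r: sum(score for score, _ in js) / len(js) for r, js in judgments.items()}

        # Actor preference pair: highest- vs. lowest-scoring response.
        chosen, rejected = max(responses, key=avg.get), min(responses, key=avg.get)
        if chosen != rejected:
            actor_pairs.append((prompt, chosen, rejected))

        # Meta-judge: compare judgments of the same response pairwise and turn
        # each verdict into a judgment preference pair (winner vs. loser).
        for r in responses:
            for j_a, j_b in combinations(judgments[r], 2):
                winner, loser = (j_a, j_b) if meta_judge(prompt, r, j_a, j_b) == 0 else (j_b, j_a)
                judge_pairs.append((prompt, r, winner, loser))

    # Train on both preference-pair sets to obtain the model for the next iteration.
    return train(actor_pairs, judge_pairs)
```

The point of the sketch is that a single model produces all the training signal: the judge's scores select actor preference pairs, and the meta-judge's verdicts select judgment preference pairs, with no human labels anywhere in the loop.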

Advantages

● No additional human data: The model improves entirely from self-generated data; no human annotation is needed for self-improvement.

● Better judging ability: The model improves both as an actor and as a judge; the meta-judging mechanism significantly strengthens its judging ability.

● Length-bias mitigation: A length-control mechanism is introduced to prevent the model from favoring longer responses.

● Significant performance gains: The model outperforms conventional methods on multiple benchmarks and approaches much larger models, with clear improvements in instruction following on AlpacaEval 2 and Arena-Hard.

● In summary, with the Meta-Rewarding method the model self-improves, substantially enhancing both its response generation and its judging ability without any additional human data.

Table of Contents

Translation and Interpretation of 《Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge》

ABSTRACT

1 INTRODUCTION

Figure 1: Meta-Rewarding iterative training scheme

6 CONCLUSION


Translation and Interpretation of 《Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge》

Address

Paper: https://arxiv.org/abs/2407.19594

Date

July 28, 2024

Authors

Meta FAIR

University of California, Berkeley

New York University

ABSTRACT

Large Language Models (LLMs) are rapidly surpassing human knowledge in many domains. While improving these models traditionally relies on costly human data, recent self-rewarding mechanisms (Yuan et al., 2024) have shown that LLMs can improve by judging their own responses instead of relying on human labelers. However, existing methods have primarily focused on improving model responses rather than judgment capabilities, resulting in rapid saturation during iterative training. To address this issue, we introduce a novel Meta-Rewarding step to the self-improvement process, where the model judges its own judgements and uses that feedback to refine its judgment skills. Surprisingly, this unsupervised approach improves the model's ability to judge and follow instructions, as demonstrated by a win rate improvement of Llama-3-8B-Instruct from 22.9% to 39.4% on AlpacaEval 2, and 20.6% to 29.1% on Arena-Hard. These results strongly suggest the potential for self-improving models without human supervision.


1 INTRODUCTION

Large Language Models (LLMs) are advancing significantly in their ability to follow instructions and respond to user queries (OpenAI, 2023; Touvron et al., 2023). An important phase in training these models is instruction tuning (Ouyang et al., 2022), which typically involves training LLMs on datasets curated by humans, either via supervised finetuning or preference optimization. Nevertheless, the acquisition of human-generated data is both costly and time-consuming. Furthermore, the quality of such data is inherently constrained by the limitations of human capabilities. The so-called ‘Super Alignment’ challenge (Burns et al., 2023) aims to find a solution to steering or controlling potentially super-intelligent AIs when their actions are inherently beyond human abilities to judge.

Among the potential solutions to this challenge, self-judging by the AI emerges as a particularly promising approach. Yuan et al. (2024c) introduces an iterative Self-Rewarding mechanism that enables an LLM to improve autonomously. The process involves a single model that takes on two distinct roles, as an actor and as a judge. As an actor, the model produces responses that are aimed to fulfill specific instructions. As a judge (a special kind of acting), the model evaluates these responses via LLM-as-a-Judge prompting (Zheng et al., 2024) and assigns rewards. The objective of the actor during this self-play is to maximize its reward, thereby improving its ability to follow instructions.
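As a rough illustration of the LLM-as-a-Judge step described above, the sketch below asks the model to score one of its own responses and parses the numeric reward from the judgment text. The prompt wording, the `generate_fn` callable, and the `Score:` parsing convention are assumptions for illustration, not the exact prompt or parser used by Yuan et al. (2024c) or in this paper.

```python
import re

JUDGE_PROMPT = """Review the user's instruction and the response below, then rate the
response on a 0-5 additive scale (relevance, coverage, helpfulness, clarity, quality).
Explain your reasoning, and end with a final line of the form "Score: <points>".

Instruction: {instruction}

Response: {response}
"""

def judge_score(generate_fn, instruction, response):
    """Have the model act as its own judge on a single response.

    `generate_fn(prompt) -> str` is any text-generation callable backed by the
    same model that produced `response`. Returns (score, judgment_text), with
    score = None when no "Score:" line could be parsed.
    """
    judgment = generate_fn(JUDGE_PROMPT.format(instruction=instruction, response=response))
    match = re.search(r"Score:\s*([0-5](?:\.\d+)?)", judgment)
    return (float(match.group(1)) if match else None, judgment)
```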


We hypothesize that a major limitation of this previous work is that its learning objective enhances the model’s ability as an actor to generate better responses, while overlooking improving the model’s ability as a judge. If the ability to judge does not improve then training the actor over iterations can quickly saturate – or worse could overfit the reward signal, a.k.a. reward hacking. Consequently, it is imperative to also improve the model’s capabilities as a judge in addition to its ability to act.

In this paper, we propose a novel method called Meta-Rewarding which assigns rewards to its own judgements to train the model’s ability to judge. The key idea is to introduce a third role of meta-judge, whose task is to evaluate the model’s own judgements. While the judge evaluates the actor’s responses, the meta-judge evaluates the judge’s judgments (including rewards that it assigns) using a mechanism similar to LLM-as-a-Judge, which we term LLM-as-a-Meta-Judge. The meta-judge enables us to build training data containing preference pairs of judgements, in addition to the standard preferences between actor responses derived from the standard judge. Our Meta-Rewarding method thus aims to explicitly improve both the acting and judging skills of a model – whereby these combined skills should help to enhance its instruction following ability as an actor. It is important to note that all three roles - actor, judge, and meta-judge - are performed by the same model, thereby maintaining a self-improving nature that requires no extra human data.
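The following sketch illustrates the LLM-as-a-Meta-Judge idea in the same spirit: the model is prompted to compare two of its own judgments of the same response, and the comparison is run in both orders to reduce position bias. The prompt text, the `generate_fn` callable, and the tie handling are illustrative assumptions rather than the paper's exact meta-judge prompt or aggregation scheme.

```python
import re

META_JUDGE_PROMPT = """Below are an instruction, a response, and two evaluations of that
response (Judgment A and Judgment B). Decide which judgment assesses the response more
accurately and fairly, and answer with exactly "Winner: A" or "Winner: B".

Instruction: {instruction}
Response: {response}

Judgment A:
{judgment_a}

Judgment B:
{judgment_b}
"""

def meta_judge_prefers(generate_fn, instruction, response, judgment_a, judgment_b):
    """Ask the model (acting as meta-judge) which of two judgments is better.

    Runs the comparison in both orders to reduce position bias; returns "A",
    "B", or None when the two orderings disagree or cannot be parsed.
    """
    def one_pass(first, second):
        out = generate_fn(META_JUDGE_PROMPT.format(
            instruction=instruction, response=response,
            judgment_a=first, judgment_b=second))
        found = re.search(r"Winner:\s*([AB])", out)
        return found.group(1) if found else None

    forward = one_pass(judgment_a, judgment_b)   # original order
    backward = one_pass(judgment_b, judgment_a)  # swapped order
    if forward == "A" and backward == "B":
        return "A"                               # judgment_a wins in both orders
    if forward == "B" and backward == "A":
        return "B"                               # judgment_b wins in both orders
    return None                                  # inconsistent or unparsable: treat as a tie
```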


In addition to enhancing the judging ability through Meta-Rewarding, we also address the length-bias issue in the judging process (Singhal et al., 2023). Like other reward models, the judge tends to favor long responses, which can make response length grow during iterative DPO (Yuan et al., 2024c). To counteract this, we combine the judge score with length information to determine the winning response, ensuring that a shorter response is chosen when scores are close.
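A minimal sketch of this kind of length-aware tie-breaking is shown below, assuming a simple rule: any response whose judge score is within a small margin of the best is treated as tied with the best, and the shortest of those near-ties becomes the chosen response. The margin value and the exact rule are assumptions for illustration; the paper's length-control mechanism may differ in its details.

```python
def pick_chosen_with_length_control(responses, scores, margin=0.1):
    """Pick the 'chosen' side of a preference pair with a length-aware tie-break.

    `responses` is a list of strings and `scores` a parallel list of judge scores.
    Responses scoring within `margin` of the maximum are treated as ties, and the
    shortest of them is returned, discouraging length growth during iterative DPO.
    """
    best = max(scores)
    near_best = [r for r, s in zip(responses, scores) if s >= best - margin]
    return min(near_best, key=len)

# Example: the slightly lower-scored but much shorter answer is preferred.
candidates = ["A long, padded answer that restates the question at length ...",
              "A concise, correct answer."]
print(pick_chosen_with_length_control(candidates, [4.9, 4.85]))  # -> "A concise, correct answer."
```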

In our experiments we start from Llama-3-8B-Instruct and perform multiple iterations of our Meta-Rewarding training. When evaluated on AlpacaEval 2 (Dubois et al., 2024b), we see a substantial improvement in the length-controlled (LC) win rate (from 22.9% to 39.4%), even outperforming GPT-4-0314. We also observe that our method outperforms standard Self-Rewarding training even if it is enhanced with our length-bias improvements (35.5% vs 39.4%), highlighting the importance of the meta-judge. We also see similar improvement on the Arena-Hard benchmark (Li et al., 2024), which is a benchmark targeting models’ ability to answer complex and hard questions.


Figure 1: Meta-Rewarding iterative training scheme. The language model at step t behaves as an actor to generate responses to instructions, as a judge to assign rewards to those responses, and as a meta-judge to evaluate its own judgments. The judgments are used to create preference pairs to improve its ability to act, and the meta-judgments are used to create preference pairs to improve its ability to judge. Both preference pair sets are used together to train the model for the next iteration.

6 CONCLUSION

In this work, we propose a novel mechanism for improving the judging skill of models by using a meta-judge that assigns meta-rewards to select chosen and rejected judgments for preference optimization. This addresses a major limitation of the Self-Rewarding framework (Yuan et al., 2024c), specifically the lack of training the judge. To make Meta-Rewarding training work, we additionally introduce a new length-control technique to mitigate the issue of length explosion when training with AI feedback. The effectiveness of our method is demonstrated through auto-evaluation benchmarks AlpacaEval, Arena-Hard, and MT-Bench. Remarkably, even without additional human feedback, our approach significantly improves upon Llama-3-8B-Instruct and surpasses both Self-Rewarding and SPPO (Wu et al., 2024), a strong baseline that relies heavily on human feedback. Furthermore, when we evaluate our model’s judging ability, it shows significant improvement in correlation with both human judges and strong AI judges like gpt-4-1106-preview. Overall, our findings provide strong evidence that self-improving the model without any human feedback is a promising direction for achieving super alignment.

