In-Depth Analysis of the Plackett-Luce and Bradley-Terry Models and Their Under-Specification Problem
This article explains the Plackett-Luce and Bradley-Terry models and their applications in preference learning, focusing on the under-specification problem of these models. We combine mathematical formulas with examples to explain the core concepts and discuss how reparameterization resolves the issue.
1. What Are the Plackett-Luce and Bradley-Terry Models?
1.1 The Plackett-Luce Model
The Plackett-Luce model is a classic probabilistic model for describing how a ranking or a choice is made among multiple candidates. Each candidate $y$ is assigned a real-valued score $s(y)$, and the positive weight $\exp(s(y))$ represents its relative attractiveness or preference strength.
Given an input $x$, the probability of selecting candidate $y$ is assumed to follow:
$$
P(y \mid x) = \frac{\exp(s(y))}{\sum_{y' \in Y} \exp(s(y'))}
$$
where:
- $s(y)$: the score (utility) function of candidate $y$.
- $Y$: the set of all candidates.
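As a concrete illustration, here is a minimal Python sketch (not part of the original article; the candidate names and scores are made up) that turns a set of scores into Plackett-Luce selection probabilities via a softmax:

```python
import math

def plackett_luce_probs(scores: dict) -> dict:
    """P(y | x) = exp(s(y)) / sum_{y'} exp(s(y')) for every candidate y."""
    # Subtracting the maximum score improves numerical stability
    # without changing the resulting probabilities.
    max_s = max(scores.values())
    weights = {y: math.exp(s - max_s) for y, s in scores.items()}
    total = sum(weights.values())
    return {y: w / total for y, w in weights.items()}

# Hypothetical scores for three candidate outputs.
print(plackett_luce_probs({"A": 1.5, "B": 1.0, "C": 0.2}))
# {'A': 0.532..., 'B': 0.323..., 'C': 0.145...}
```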
1.2 The Bradley-Terry Model
The Bradley-Terry model is a special case of the Plackett-Luce model, designed for pairwise preference comparisons. Given two candidates $y_w$ and $y_l$ (the preferred and the less-preferred output, respectively), their preference relation is expressed by the probability:
$$
P(y_w \succ y_l \mid x) = \frac{\exp(s(y_w))}{\exp(s(y_w)) + \exp(s(y_l))}
$$
This formula gives the probability that $y_w$ is preferred over $y_l$.
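Since the two-candidate case reduces to a logistic function of the score difference, the Bradley-Terry probability can be computed as in this small sketch (illustrative code, not from the article):

```python
import math

def bradley_terry(s_w: float, s_l: float) -> float:
    """P(y_w > y_l | x) = exp(s_w) / (exp(s_w) + exp(s_l)) = sigmoid(s_w - s_l)."""
    return 1.0 / (1.0 + math.exp(-(s_w - s_l)))

print(bradley_terry(1.5, 1.0))  # ≈ 0.622, matching the example in Section 2.2
```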
2. Understanding the Under-Specification Problem
2.1 Definition and Nature
In the Plackett-Luce and Bradley-Terry models, the under-specification problem refers to the fact that several distinct reward functions $r(x, y)$ can generate exactly the same preference distribution. The observed preferences therefore do not pin down a unique reward signal, which hurts the interpretability and stability of the learned parameters.
Specifically, if two different reward functions $r_1$ and $r_2$ induce preference distributions that satisfy
$$
P_1(y \mid x) = P_2(y \mid x)
$$
then the two reward functions are equivalent as far as the observed probabilities are concerned, even though their functional forms and parameters may be completely different.
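A one-line check under the Bradley-Terry model makes this concrete. Writing $\sigma(z) = 1/(1 + e^{-z})$ for the logistic function, the preference probability depends only on the reward difference, so adding any term $f(x)$ that depends on the prompt alone leaves it unchanged:

$$
P(y_w \succ y_l \mid x)
= \sigma\big(r(x, y_w) - r(x, y_l)\big)
= \sigma\big([r(x, y_w) + f(x)] - [r(x, y_l) + f(x)]\big).
$$

Hence $r(x, y)$ and $r(x, y) + f(x)$ induce exactly the same preference distribution; this is the equivalence class of rewards referred to in Section 3.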
2.2 Example Analysis
Consider a text generation task with the prompt:
Input: Summarize this article.
Candidate outputs:
- $y_w$: “Key points are…” (preferred output).
- $y_l$: “The main topic is…” (less-preferred output).
According to the Bradley-Terry model, the preference probability is:
$$
P(y_w \succ y_l \mid x) = \frac{\exp(s(y_w))}{\exp(s(y_w)) + \exp(s(y_l))}
$$
If $s(y_w) = 1.5$ and $s(y_l) = 1.0$, then:
$$
P(y_w \succ y_l \mid x) = \frac{e^{1.5}}{e^{1.5} + e^{1.0}} \approx 0.622
$$
This probability indicates that the first output is preferred. However, because the reward function is under-specified, reward functions of different forms can reproduce exactly the same preference distribution, as the numerical check below illustrates.
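The following short script (with a made-up constant shift) verifies the 0.622 figure and shows that shifting both scores by the same amount leaves the preference probability unchanged, which is exactly the under-specification described above:

```python
import math

def bt_prob(s_w: float, s_l: float) -> float:
    """Bradley-Terry probability in its original exp / (exp + exp) form."""
    return math.exp(s_w) / (math.exp(s_w) + math.exp(s_l))

print(round(bt_prob(1.5, 1.0), 3))    # 0.622
# Adding the same constant (here +10) to both scores gives the same probability,
# so preference data alone cannot identify the scores uniquely.
print(round(bt_prob(11.5, 11.0), 3))  # 0.622
```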
3. Reparameterization: Resolving the Under-Specification Problem
3.1 Key Idea
Direct Preference Optimization (DPO) introduces a reparameterization to remove the ambiguity caused by under-specification. It expresses the reward function as the log probability ratio between the policy model and a reference model, which makes the equivalence-class membership explicit and keeps the induced preference distribution stable and well-defined.
3.2 The Reparameterized Reward Function
The reward function is defined as:
$$
r(x, y) = \beta \log \frac{\pi(y \mid x)}{\pi_{ref}(y \mid x)}
$$
where:
- $\pi(y \mid x)$: the probability distribution of the current policy model.
- $\pi_{ref}(y \mid x)$: the probability distribution of the reference model.
- $\beta$: a scale factor controlling the preference distribution.
This formula makes the connection between the reward function and the policy explicit, giving each equivalence class of rewards a well-defined representative and thereby resolving the under-specification problem. A minimal code sketch of this implicit reward is given below.
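Here is a minimal sketch of this implicit reward, assuming we already have the total log-probability of a response under the policy and under the reference model; the function name, inputs, and numbers are illustrative, not a fixed API:

```python
def implicit_reward(policy_logprob: float, ref_logprob: float, beta: float = 0.1) -> float:
    """r(x, y) = beta * log( pi(y|x) / pi_ref(y|x) ) = beta * (log pi - log pi_ref)."""
    return beta * (policy_logprob - ref_logprob)

# Hypothetical summed token log-probabilities of one response y for prompt x.
print(implicit_reward(policy_logprob=-12.3, ref_logprob=-14.0, beta=0.1))  # ≈ 0.17
```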
3.3 Normalization Condition
To ensure that the model output is a valid probability distribution (all probabilities sum to 1), the following normalization condition must hold:
$$
\sum_y \pi_{ref}(y \mid x) \exp\!\left( \frac{1}{\beta} r(x, y) \right) = 1
$$
This condition keeps the reward function within a feasible range, which simplifies training and optimization.
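To see why the reparameterized reward satisfies this condition automatically, substitute $r(x, y) = \beta \log \frac{\pi(y \mid x)}{\pi_{ref}(y \mid x)}$ into the sum (a short check that follows directly from the definitions above):

$$
\sum_y \pi_{ref}(y \mid x)\, \exp\!\left( \frac{1}{\beta} \cdot \beta \log \frac{\pi(y \mid x)}{\pi_{ref}(y \mid x)} \right)
= \sum_y \pi_{ref}(y \mid x) \cdot \frac{\pi(y \mid x)}{\pi_{ref}(y \mid x)}
= \sum_y \pi(y \mid x) = 1.
$$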
4. Practical Examples
Example 1: Computing the Reward
Suppose the reference model assigns the probabilities:
$$
\pi_{ref}(y_w \mid x) = 0.5, \quad \pi_{ref}(y_l \mid x) = 0.5
$$
and the current policy model assigns:
$$
\pi(y_w \mid x) = 0.7, \quad \pi(y_l \mid x) = 0.3
$$
The reward for the preferred output is then:
$$
r(x, y_w) = \beta \log \frac{0.7}{0.5} \approx 0.336\,\beta
$$
This reward is proportional to the log of the probability ratio: the policy assigns the preferred output 1.4 times its reference probability, and $\log 1.4 \approx 0.336$. The short script below reproduces these numbers.
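A short script reproducing these numbers; the article leaves $\beta$ symbolic, so we simply set $\beta = 1$ here for illustration:

```python
import math

beta = 1.0  # illustrative value; the text keeps beta symbolic

r_w = beta * math.log(0.7 / 0.5)  # reward of the preferred output, ≈ 0.336
r_l = beta * math.log(0.3 / 0.5)  # reward of the less-preferred output, ≈ -0.511
print(round(r_w, 3), round(r_l, 3))

# Plugging these rewards into the Bradley-Terry model gives the implied preference:
# sigmoid(r_w - r_l) = 0.7 when beta = 1 and the reference is uniform over the pair.
print(1.0 / (1.0 + math.exp(-(r_w - r_l))))
```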
Example 2: Optimizing Preference Modeling
Suppose we model user feedback data and fine-tune the policy with the reward formula above; the preference distribution then gradually shifts toward the users' needs. In a dialogue generation task, for example, the model can iteratively adjust its preference scores so that its answers better match the intended semantics. A minimal sketch of the resulting pairwise training loss follows.
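Concretely, this reparameterization turns preference modeling into a simple classification-style objective on preference pairs (the DPO loss). Below is a minimal sketch of the per-pair loss written with plain floats; in practice the log-probabilities would come from the policy and reference language models, and the loss would be averaged over a dataset and backpropagated:

```python
import math

def dpo_pair_loss(policy_logp_w: float, policy_logp_l: float,
                  ref_logp_w: float, ref_logp_l: float,
                  beta: float = 0.1) -> float:
    """-log sigmoid( beta * [ log pi(y_w)/pi_ref(y_w) - log pi(y_l)/pi_ref(y_l) ] )."""
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical log-probabilities for one (prompt, preferred, dispreferred) triple.
loss = dpo_pair_loss(policy_logp_w=-10.0, policy_logp_l=-12.0,
                     ref_logp_w=-11.0, ref_logp_l=-11.5, beta=0.1)
print(loss)  # the loss shrinks as the policy favors y_w more than the reference does
```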
5. Conclusion and Key Takeaways
The Plackett-Luce and Bradley-Terry models are widely used for preference learning and ranking tasks, but they suffer from an under-specification problem: the reward function is not uniquely determined by the observed preferences.
The reparameterization introduced by DPO resolves this by explicitly defining the relationship between the reward function and the preference distribution, making preference modeling more stable and controllable.
Key points:
- Nature of the under-specification problem: different reward functions can generate the same preference distribution, leading to parameter ambiguity.
- Reparameterization: tying the reward function to the policy-to-reference probability ratio ensures consistency.
- Applications: optimizing dialogue generation, summarization, and translation models so that their outputs better match human preferences.
This approach not only simplifies otherwise complex reinforcement learning pipelines, but also provides a more efficient route to preference fine-tuning of large language models.
Postscript
Completed in Shanghai at 14:19 on December 27, 2024, with the assistance of the GPT-4o model.