Bradley-Terry Model: A Classic Approach to Modeling Human Preferences and Its Application in DPO

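What is the Bradley-Terry Model?

The Bradley-Terry (BT) model is a classic probabilistic model of pairwise preference between two competitors. It was originally applied to analyzing competition outcomes, for example estimating the probability that one player or team beats another. It is also widely used on ranking data, such as modeling preferences among items, services, or answers.

The core idea of the BT model is to assign each option (or competitor) a positive real-valued score that reflects how strongly it is preferred relative to the others.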


Mathematical Definition of the Bradley-Terry Model

In the BT model, the probability that option $i$ is preferred over option $j$ is defined as:

$$P(i > j) = \frac{p_i}{p_i + p_j}$$

where:

  • $p_i$ and $p_j$ are the scores of options $i$ and $j$, both positive real numbers.

To make modeling and optimization more convenient, the scores are usually reparameterized in exponential form:
$$p_i = e^{a_i}$$

The preference probability can then be rewritten as:
$$P(i > j) = \frac{e^{a_i}}{e^{a_i} + e^{a_j}} = \frac{1}{1 + e^{-(a_i - a_j)}}$$

This is the classic sigmoid function evaluated at the score difference, $P(i > j) = \sigma(a_i - a_j)$, where:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$


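Why Reparameterization Matters
  1. Numerical stability: optimizing the positive scores $p_i$ directly means enforcing $p_i > 0$ and can cause numerical problems, whereas $a_i$ can take any real value, so the optimization is unconstrained and more stable.
  2. Optimization efficiency: the reparameterized formula is convenient for parameter learning with gradient descent and similar methods.
  3. Log-space interpretation: $a_i = \log p_i$ is the log score of option $i$, which makes it easy to analyze the model as a log-linear relationship.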

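Maximum Likelihood Estimation (MLE)

To estimate the scores, we use maximum likelihood estimation (MLE): given the observed data (the outcomes of pairwise comparisons), find the parameters $a_i$ that best fit the data.

Objective, where the product runs over the observed comparisons in which $i$ was preferred over $j$:
$$\arg\max_{\{a_i\}} \prod_{i,j} P(i > j)$$

Taking the logarithm turns this into:
$$\arg\min_{\{a_i\}} -\sum_{i,j} \log P(i > j)$$

This objective is solved with an iterative optimization method, which yields a score for each option. Because $P(i > j)$ depends only on the difference $a_i - a_j$, the scores are determined only up to a common additive constant.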
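One common iterative scheme for this objective is a fixed-point (minorization-maximization) update on the positive scores. The sketch below is only an illustration under assumptions: the function name `fit_bt_mm`, the win-count matrix, and the iteration count are all made up for the example, and the update is only well behaved when every option has both won and lost at least once.

```python
import numpy as np

def fit_bt_mm(wins, n_iters=1000):
    """Fit Bradley-Terry scores with a fixed-point (MM) update.

    wins[i, j] is the number of times option i beat option j.
    """
    wins = np.asarray(wins, dtype=float)
    games = wins + wins.T            # games[i, j]: comparisons between i and j
    total_wins = wins.sum(axis=1)    # total wins of each option
    p = np.ones(len(wins))           # start from equal scores
    for _ in range(n_iters):
        # p_i <- W_i / sum_j n_ij / (p_i + p_j)
        denom = (games / (p[:, None] + p[None, :])).sum(axis=1)
        p = total_wins / denom
        p /= p.sum()                 # fix the arbitrary overall scale
    return p

# Hypothetical data: every option wins and loses at least once.
wins = np.array([[0, 3, 4],
                 [2, 0, 3],
                 [1, 2, 0]])
p = fit_bt_mm(wins)
print(p)
```

At the fixed point, each option's expected number of wins under the model equals its observed number of wins, which is the stationarity condition of the log-likelihood.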


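The Bradley-Terry Model in DPO

In large-model training, the Bradley-Terry model plays a central role wherever human preferences are turned into a training signal, most visibly when building a reward model (RM) from human preferences. In DPO (Direct Preference Optimization), for example, the model is optimized on human comparison data (such as a user preferring answer A over answer B).

Its Role in DPO
  1. Preference modeling: DPO relies directly on preference probabilities. Given two answers A and B, the BT model gives the probability that the user prefers A.
  2. Parameter optimization: through reparameterization, a score derived from the generative model is mapped directly to a preference probability, and a maximum-likelihood-style objective adjusts which outputs the model favors.
  3. Model improvement: the gradient of the BT probability guides the updates to the generative model's parameters so that its outputs better match human preferences.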

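Summary

The Bradley-Terry model is a classic method for modeling human preferences. With reparameterization and maximum likelihood estimation, it captures and predicts preference relationships efficiently. In preference-based training (for example DPO), it turns human feedback into an optimization signal, making it an important tool for building generative AI that better meets human needs.

Reference: 【手撕RLHF-DPO(1)】不是PPO训不起,而是DPO更有性价比!

English version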

Bradley-Terry Model: A Classic Approach for Modeling Human Preferences and Its Role in DPO

What is the Bradley-Terry Model?

The Bradley-Terry (BT) model is a classic probabilistic model used to predict and model pairwise preference relationships between two competing entities. Originally developed to analyze outcomes of competitions (e.g., between individuals or teams), it has since been widely applied in ranking data, such as modeling preferences for items, services, or answers.

The core idea of the BT model is to assign each option (or competitor) a positive real-valued score, which reflects its relative strength or preference.


Mathematical Definition of the Bradley-Terry Model

In the BT model, the probability that option $i$ is preferred over $j$ is defined as:

$$P(i > j) = \frac{p_i}{p_i + p_j}$$

Where:

  • $p_i$ and $p_j$ are the scores of options $i$ and $j$, respectively, and are positive real numbers.

To make optimization easier, a reparameterization is typically used, where $p_i$ is expressed in exponential form:
$$p_i = e^{a_i}$$

The probability can then be rewritten as:
$$P(i > j) = \frac{e^{a_i}}{e^{a_i} + e^{a_j}} = \frac{1}{1 + e^{-(a_i - a_j)}}$$

This is the classic sigmoid function evaluated at the score difference, $P(i > j) = \sigma(a_i - a_j)$, where:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$


Why Reparameterization is Important
  1. Numerical Stability: Optimizing the positive scores $p_i$ directly means enforcing $p_i > 0$ and may lead to numerical issues, while $a_i$ can take any real value, so the optimization is unconstrained and more stable.
  2. Optimization Efficiency: Reparameterization enables the use of gradient-based methods like stochastic gradient descent for efficient parameter learning.
  3. Logarithmic Interpretation: The parameter $a_i = \log p_i$ is the log score of option $i$, so the log-odds of a preference are linear in the parameters: $\log \frac{P(i > j)}{P(j > i)} = a_i - a_j$.
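The equivalence between the score form and the sigmoid form can be checked numerically. The following is a small sketch; the function names and the values of `a_i` and `a_j` are chosen only for illustration.

```python
import math

def bt_prob_from_scores(p_i: float, p_j: float) -> float:
    """P(i > j) written with positive scores."""
    return p_i / (p_i + p_j)

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def bt_prob_from_log_scores(a_i: float, a_j: float) -> float:
    """P(i > j) written with unconstrained log scores."""
    return sigmoid(a_i - a_j)

a_i, a_j = 1.2, -0.3  # arbitrary example values
direct = bt_prob_from_scores(math.exp(a_i), math.exp(a_j))
via_sigmoid = bt_prob_from_log_scores(a_i, a_j)
# Adding the same constant to both log scores leaves the probability unchanged.
shifted = bt_prob_from_log_scores(a_i + 5.0, a_j + 5.0)

print(abs(direct - via_sigmoid) < 1e-12)   # True
print(abs(via_sigmoid - shifted) < 1e-12)  # True
```

The second check shows why only differences between the $a_i$ matter: the model cannot distinguish a set of log scores from the same set shifted by a constant.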

Maximum Likelihood Estimation (MLE) Optimization

To estimate the scores of the options, maximum likelihood estimation (MLE) is used. Given observational data (i.e., the outcomes of pairwise comparisons), the goal is to find the parameters $a_i$ that best fit the data.

The objective, where the product runs over the observed comparisons in which $i$ was preferred over $j$:
$$\arg\max_{\{a_i\}} \prod_{i,j} P(i > j)$$

Taking the logarithm to simplify, the optimization becomes:
$$\arg\min_{\{a_i\}} -\sum_{i,j} \log P(i > j)$$

This can be solved iteratively to obtain the scores for each option. Because $P(i > j)$ depends only on the difference $a_i - a_j$, the scores are determined only up to a common additive constant.
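To make the iterative solution concrete, here is a minimal sketch of plain gradient descent on the negative log-likelihood. The gradient with respect to $a_k$ works out to the expected number of wins of option $k$ under the model minus its observed number of wins. The win-count matrix, step size, and iteration count are assumptions picked for the example, not values from the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nll(a, wins):
    """Negative log-likelihood; wins[i, j] = times option i beat option j."""
    diff = a[:, None] - a[None, :]          # diff[i, j] = a_i - a_j
    return -np.sum(wins * np.log(sigmoid(diff)))

def nll_grad(a, wins):
    """Gradient of nll: expected wins minus observed wins, per option."""
    diff = a[:, None] - a[None, :]
    games = wins + wins.T                   # comparisons between each pair
    expected_wins = (games * sigmoid(diff)).sum(axis=1)
    return expected_wins - wins.sum(axis=1)

# Hypothetical data in which every option wins and loses at least once,
# so a finite maximum-likelihood estimate exists.
wins = np.array([[0.0, 3.0, 4.0],
                 [2.0, 0.0, 3.0],
                 [1.0, 2.0, 0.0]])

a = np.zeros(3)
for _ in range(2000):
    a -= 0.1 * nll_grad(a, wins)            # fixed step size, chosen by hand

print(np.round(a, 3))
```

Starting from zeros keeps the parameters centered, because the components of the gradient always sum to zero; this is one simple way to pin down the otherwise arbitrary additive constant.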


The Role of Bradley-Terry Model in DPO

In large language model training, the Bradley-Terry model is commonly used as the likelihood for pairwise human feedback, including when a reward model (RM) is trained on such feedback. Direct Preference Optimization (DPO) uses the same model to turn human feedback (e.g., preferring answer A over answer B) into an objective that is optimized directly on the language model, without training a separate reward model.

Contribution of BT Model in DPO
  1. Preference Modeling: The DPO method relies on the BT model to compute the probability of preference, such as the probability that a user prefers answer A over answer B.
  2. Parameter Optimization: Through reparameterization, scores computed from the generative model are mapped directly to preference probabilities, which can then be optimized using maximum likelihood or similar methods.
  3. Improving Model Behavior: The gradients derived from the BT model's probability calculations guide updates to the generative model's parameters, so that the outputs better align with human preferences.
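As a rough illustration of how these three points fit together, the sketch below writes the DPO objective in its commonly stated form: the BT score of a response is taken to be $\beta$ times the log-ratio between the policy and a frozen reference model, and the loss is the BT negative log-likelihood of the preferred response winning. The function and argument names, the value of `beta`, and the example log-probabilities are assumptions; the inputs are assumed to be summed token log-probabilities computed elsewhere.

```python
import numpy as np

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x)).
    return -np.logaddexp(0.0, -x)

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Mean BT negative log-likelihood with implicit rewards.

    Each argument is an array with one log-probability per preference pair.
    """
    chosen_score = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_score = beta * (policy_rejected_logp - ref_rejected_logp)
    # BT: P(chosen > rejected) = sigmoid(chosen_score - rejected_score)
    return -np.mean(log_sigmoid(chosen_score - rejected_score))

# Hypothetical log-probabilities for two preference pairs.
ref_chosen = np.array([-10.0, -12.0])
ref_rejected = np.array([-11.0, -13.0])

# A policy identical to the reference gives P = 0.5 for every pair.
loss_at_reference = dpo_loss(ref_chosen, ref_rejected, ref_chosen, ref_rejected)

# A policy that raises the chosen answers' log-probability lowers the loss.
loss_improved = dpo_loss(ref_chosen + 2.0, ref_rejected, ref_chosen, ref_rejected)

print(loss_at_reference, loss_improved)
```

When the policy equals the reference, both scores are zero and the loss is $\log 2$, the value for a 50/50 preference; training pushes it below that by widening the score gap in favor of the preferred answers.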

Conclusion

The Bradley-Terry model is a classic approach for modeling human preferences and pairwise comparisons. Through reparameterization and maximum likelihood estimation, it efficiently captures preference relationships. In methods like DPO, it serves as a fundamental tool for converting human feedback into optimization signals, helping to train AI systems that better align with human values and preferences.

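Practical Application of the Bradley-Terry Model: Ranking Data and Preference Modeling

The Bradley-Terry (BT) model is useful not only for predicting match outcomes but also for handling ranking data, especially for modeling preferences among items, services, or answers. For example, it offers a workable solution when products need to be ordered from user feedback or when answers need to be ranked by the choices users make.


Example: A Preference Modeling Case

Suppose we have 3 candidate options $A$, $B$, and $C$, and users provided the following preference comparisons:

  • $A$ is preferred over $B$
  • $B$ is preferred over $C$
  • $A$ is preferred over $C$

We want to score these options with the Bradley-Terry model and obtain a ranking from the scores.


Python Implementation: Numerical Simulation

The following code shows how to model item preferences with the Bradley-Terry model and estimate a score for each option.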

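import numpy as np
from scipy.optimize import minimize

# Comparison data: each tuple is (winner, loser)
comparisons = [
    ('A', 'B'),
    ('B', 'C'),
    ('A', 'C'),
    ('A', 'B'),
    ('B', 'C')
]

# Options being compared
items = ['A', 'B', 'C']
n_items = len(items)

# Map options to indices
item_to_index = {item: idx for idx, item in enumerate(items)}

# win_counts[i, j] = number of times option i beat option j
win_counts = np.zeros((n_items, n_items))
for winner, loser in comparisons:
    win_counts[item_to_index[winner], item_to_index[loser]] += 1

# In this data A never loses and C never wins, so the plain likelihood has no
# finite maximizer. A small L2 penalty on the log scores (strength chosen
# arbitrarily here) keeps the estimate finite and also pins down the
# otherwise arbitrary common shift of the parameters.
L2_STRENGTH = 0.1

# Penalized negative log-likelihood of the Bradley-Terry model
def negative_log_likelihood(params):
    log_likelihood = 0.0
    for i in range(n_items):
        for j in range(n_items):
            if i != j and win_counts[i, j] > 0:
                # log P(i > j) = log sigmoid(a_i - a_j), computed stably
                log_prob = -np.logaddexp(0.0, params[j] - params[i])
                log_likelihood += win_counts[i, j] * log_prob
    # Negate because we minimize, then add the penalty
    return -log_likelihood + 0.5 * L2_STRENGTH * np.sum(params ** 2)

# Initial log scores
initial_params = np.zeros(n_items)

# Optimize the log scores, then map them back to positive scores
result = minimize(negative_log_likelihood, initial_params, method='BFGS')
scores = np.exp(result.x)

# Print the scores
print("Item scores:")
for item, score in zip(items, scores):
    print(f"{item}: {score:.3f}")

# Rank the options by score
ranked_items = sorted(zip(items, scores), key=lambda x: x[1], reverse=True)
print("\nRanking results:")
for rank, (item, score) in enumerate(ranked_items, 1):
    print(f"{rank}. {item} (Score: {score:.3f})")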

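Example Output

Running the code prints one score per option, followed by the ranking. The ordering is $A$ first, $B$ second, and $C$ third.

The numeric scores are not reproduced here. $A$ never loses and $C$ never wins in this toy data, so the unpenalized likelihood keeps improving as the gaps between the scores grow, and the printed values depend on where the optimizer stops or on any regularization that is added. Only the ordering is stable.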


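Analysis of the Model Application
  • User preference modeling: the BT model maps users' pairwise comparisons to concrete scores and uses optimization to find the scores that best match those preferences.
  • Ranking generation: sorting the scores gives the final ranking of the items or answers.
  • Applicable scenarios: recommendation systems, answer ranking in question-answering systems, competition outcome prediction, and so on.

Summary

The Bradley-Terry model is a simple and effective tool for preference modeling, especially in settings built on pairwise comparisons. With Python's numerical optimization routines, the model can be fit and a ranking produced with little code, which supports more complex ranking and recommendation tasks.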

Practical Application of the Bradley-Terry Model: Preference Modeling and Ranking Data

The Bradley-Terry (BT) model is widely used not only for predicting match outcomes but also for handling ranking data, especially for modeling preferences for items, services, or answers. For example, it is useful when ranking products based on user feedback or ordering answers based on user preferences.


Example: Preference Modeling Use Case

Suppose we have three candidates $A$, $B$, and $C$, with the following user preference comparisons:

  • $A$ is preferred over $B$
  • $B$ is preferred over $C$
  • $A$ is preferred over $C$

We aim to estimate the scores for these options using the Bradley-Terry model and generate a ranking based on their relative scores.


Python Implementation: Numerical Simulation

The following Python code demonstrates how to implement the Bradley-Terry model to model preferences and estimate the scores of items.

import numpy as np
from scipy.optimize import minimize

# Preference comparison data: each tuple is (winner, loser)
comparisons = [
    ('A', 'B'),
    ('B', 'C'),
    ('A', 'C'),
    ('A', 'B'),
    ('B', 'C')
]

# Items being compared
items = ['A', 'B', 'C']
n_items = len(items)

# Map items to indices
item_to_index = {item: idx for idx, item in enumerate(items)}

# win_counts[i, j] = number of times item i beat item j
win_counts = np.zeros((n_items, n_items))
for winner, loser in comparisons:
    win_counts[item_to_index[winner], item_to_index[loser]] += 1

# In this data A never loses and C never wins, so the plain likelihood has no
# finite maximizer. A small L2 penalty on the log scores (strength chosen
# arbitrarily here) keeps the estimate finite and also pins down the
# otherwise arbitrary common shift of the parameters.
L2_STRENGTH = 0.1

# Penalized negative log-likelihood of the BT model
def negative_log_likelihood(params):
    log_likelihood = 0.0
    for i in range(n_items):
        for j in range(n_items):
            if i != j and win_counts[i, j] > 0:
                # log P(i > j) = log sigmoid(a_i - a_j), computed stably
                log_prob = -np.logaddexp(0.0, params[j] - params[i])
                log_likelihood += win_counts[i, j] * log_prob
    # Negate because we minimize, then add the penalty
    return -log_likelihood + 0.5 * L2_STRENGTH * np.sum(params ** 2)

# Initial log scores
initial_params = np.zeros(n_items)

# Optimize the log scores, then map them back to positive scores
result = minimize(negative_log_likelihood, initial_params, method='BFGS')
scores = np.exp(result.x)

# Display the results
print("Item scores:")
for item, score in zip(items, scores):
    print(f"{item}: {score:.3f}")

# Rank the items by score
ranked_items = sorted(zip(items, scores), key=lambda x: x[1], reverse=True)
print("\nRanking results:")
for rank, (item, score) in enumerate(ranked_items, 1):
    print(f"{rank}. {item} (Score: {score:.3f})")

Example Output

Running the code prints one score per item, followed by the ranking. The ordering is $A$ first, then $B$, with $C$ last.

The numeric scores are not reproduced here. $A$ never loses and $C$ never wins in this toy data, so the unpenalized likelihood keeps improving as the gaps between the scores grow, and the printed values depend on where the optimizer stops or on any regularization that is added. Only the ordering is stable.


Analysis of the Model Application
  • Preference Modeling: The BT model maps pairwise preference comparisons into numerical scores, allowing us to quantify preferences.
  • Ranking Generation: By sorting the scores, we can derive a final ranking for the items or answers.
  • Applicable Scenarios: Common applications include recommendation systems, ranking answers in Q&A systems, competition result predictions, etc.
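For prediction-style uses such as estimating the outcome of a future comparison, fitted scores can be turned back into pairwise probabilities. This is a small sketch with made-up scores; the function name `pairwise_win_probabilities` and the score values are hypothetical.

```python
import numpy as np

def pairwise_win_probabilities(scores):
    """Return a matrix whose entry [i, j] is P(i > j) = s_i / (s_i + s_j)."""
    s = np.asarray(scores, dtype=float)
    probs = s[:, None] / (s[:, None] + s[None, :])
    np.fill_diagonal(probs, np.nan)  # an item is never compared with itself
    return probs

items = ["A", "B", "C"]
scores = [4.0, 2.0, 1.0]  # hypothetical fitted scores, not output of the code above

probs = pairwise_win_probabilities(scores)
ranking = [items[k] for k in np.argsort(scores)[::-1]]

print(ranking)                      # ['A', 'B', 'C']
print(round(float(probs[0, 1]), 3)) # 0.667, i.e. 4 / (4 + 2)
```

Only the ratios between scores matter here: multiplying every score by the same constant leaves both the ranking and the probabilities unchanged.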

Conclusion

The Bradley-Terry model is a simple yet effective tool for modeling preferences, especially in pairwise comparison scenarios. Using Python’s numerical optimization methods, we can train the model and generate rankings, providing reliable support for complex ranking and recommendation tasks.

Postscript

Completed at 12:08 on December 21, 2024, in Shanghai, with the assistance of the GPT-4o model.
