Understanding Sigmoid and -log Sigmoid Functions: Definitions, Benefits, and Applications in the Bradley-Terry Model


1. What is the Sigmoid Function?

The Sigmoid function is a widely used activation function in machine learning and deep learning. Its formula is:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Characteristics of Sigmoid:
  1. Output range: The output values lie in the interval $(0, 1)$, making it suitable for probability modeling.
  2. Monotonicity: The output increases monotonically as the input $x$ increases.
  3. Smooth transition: Sigmoid has its largest gradient near 0, and the gradient diminishes as the input moves toward extreme positive or negative values (leading to the vanishing-gradient problem).
Applications:
  • Used in binary classification to map model outputs to probabilities.
  • Acts as an activation function in neural networks to introduce non-linearity.
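
As a quick illustration, here is a minimal sketch (using SciPy's expit, a numerically stable Sigmoid) that evaluates the function at a few points to show the $(0, 1)$ range and monotonic behavior:

import numpy as np
from scipy.special import expit  # numerically stable Sigmoid

# Evaluate sigma(x) at a few sample points
xs = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
probs = expit(xs)  # equivalent to 1 / (1 + exp(-x)), but safe for large |x|

for x, p in zip(xs, probs):
    print(f"sigma({x:+.1f}) = {p:.6f}")

# The outputs stay strictly inside (0, 1) and increase with x:
# sigma(0) = 0.5, and the curve saturates toward 0 and 1 at the extremes.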

2. What is the -log Sigmoid Function?

The -log Sigmoid function is the negative logarithmic transformation of the Sigmoid function, defined as:

$$-\log \sigma(x) = -\log\left(\frac{1}{1 + e^{-x}}\right) = \log(1 + e^{-x})$$

Characteristics of -log Sigmoid:
  1. Value range:
    • When $x$ is very large (toward positive infinity), $-\log \sigma(x) \to 0$, indicating high confidence in the prediction.
    • When $x$ is very small (toward negative infinity), $-\log \sigma(x) \to \infty$, reflecting a severe penalty for an incorrect prediction.
  2. Symmetry:
    • $-\log \sigma(x)$ and $-\log \sigma(-x)$ are mirror images of each other, which makes the pair well suited for modeling two mutually exclusive outcomes.
  3. Stability:
    • Computed carefully (e.g., via log1p or logaddexp rather than a literal log(1 + exp(-x))), the function is numerically stable and well suited for use as a loss function, especially in probability-based prediction.
Applications:
  • Cross-Entropy Loss: -log Sigmoid is a key component in cross-entropy loss, which measures the difference between predicted and true probabilities.
  • Preference Modeling: It is often used to model pairwise preferences, such as in the Bradley-Terry model.
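
The stability point is easy to demonstrate. Below is a small sketch (assuming NumPy; the helper names are ours) comparing a literal transcription of $\log(1 + e^{-x})$ with a version built on np.logaddexp:

import numpy as np

def neg_log_sigmoid_naive(x):
    # Literal transcription of log(1 + e^{-x}); exp(-x) overflows for x << 0
    return np.log(1.0 + np.exp(-x))

def neg_log_sigmoid(x):
    # log(1 + e^{-x}) = logaddexp(0, -x), stable for all x
    return np.logaddexp(0.0, -x)

for x in [-1000.0, -5.0, 0.0, 5.0, 1000.0]:
    print(f"x={x:+8.1f}  naive={neg_log_sigmoid_naive(x):.6g}  stable={neg_log_sigmoid(x):.6g}")

# At x = -1000 the naive version overflows to inf (with a RuntimeWarning),
# while the logaddexp version returns roughly 1000, as the limits above predict;
# at x = +1000 both correctly approach 0.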

3. Differences and Advantages of Sigmoid and -log Sigmoid
| Feature | Sigmoid | -log Sigmoid |
| --- | --- | --- |
| Definition | Outputs values in $(0, 1)$; used for probability modeling | The negative log of Sigmoid; commonly used as a loss function |
| Value meaning | Higher values indicate higher confidence | Smaller values indicate higher confidence; larger values impose penalties |
| Gradient information | Gradients diminish at extreme values | Sensitive to score differences; stable in optimization |
| Use case | Probability mapping and activation functions | Loss function for tasks such as preference modeling |
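
The gradient row can be made concrete with a short sketch (the derivatives are written out by hand: $\sigma'(x) = \sigma(x)(1 - \sigma(x))$, and $\frac{d}{dx}[-\log \sigma(x)] = \sigma(x) - 1$):

import numpy as np
from scipy.special import expit

xs = np.array([-10.0, 0.0, 10.0])
sig = expit(xs)

grad_sigmoid = sig * (1.0 - sig)  # d(sigma)/dx: vanishes at both extremes
grad_nls = sig - 1.0              # d(-log sigma)/dx: stays near -1 for x << 0

print("x:             ", xs)
print("sigmoid grad:  ", grad_sigmoid)  # ~[4.5e-05, 0.25, 4.5e-05]
print("-log-sig grad: ", grad_nls)      # ~[-1.0, -0.5, -4.5e-05]

# A badly wrong score (x << 0) still receives a full-strength gradient from
# -log Sigmoid, whereas the raw Sigmoid gradient has all but vanished there.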

4. Application in the Bradley-Terry Model

The Bradley-Terry (BT) model is a probabilistic model used to describe pairwise preferences, such as ranking items based on comparisons. The -log Sigmoid function is used to measure the error between predictions and actual preferences.

Formula:
In the BT model, the probability that item $i$ is preferred over item $j$ is:

$$P(i > j) = \frac{e^{\beta_i}}{e^{\beta_i} + e^{\beta_j}}$$

The corresponding loss function is:

$$-\log \sigma(\beta_i - \beta_j)$$
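
This loss is exactly the negative log-likelihood of the observed preference, because dividing the numerator and denominator of the BT probability by $e^{\beta_i}$ reduces it to a Sigmoid of the score difference:

$$P(i > j) = \frac{e^{\beta_i}}{e^{\beta_i} + e^{\beta_j}} = \frac{1}{1 + e^{-(\beta_i - \beta_j)}} = \sigma(\beta_i - \beta_j)$$

so that $-\log P(i > j) = -\log \sigma(\beta_i - \beta_j)$.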

Interpretation:
  • When $\beta_i - \beta_j$ (the score difference) is large, the model is confident that $i$ is preferred, and the loss is small: for example, $\beta_i - \beta_j = 2$ gives $\sigma(2) \approx 0.881$ and a loss of about $0.127$.
  • When $\beta_i - \beta_j$ is small or negative, the loss grows (a difference of $-2$ yields a loss of about $2.127$), prompting the model to adjust the scores to better match the observed preferences.

5. Implementation in Python

Below is a Python implementation of the Bradley-Terry model using -log Sigmoid as the loss function.

import numpy as np
from scipy.optimize import minimize

# Define items (e.g., players or products)
items = ['A', 'B', 'C']
n_items = len(items)

# Pairwise comparisons (winner, loser)
comparisons = [
    ('A', 'B'),
    ('B', 'C'),
    ('A', 'C'),
    ('A', 'B'),
    ('B', 'C')
]

# Map items to indices
item_to_index = {item: idx for idx, item in enumerate(items)}

# Initialize scores
initial_scores = np.zeros(n_items)

# Define the -log Sigmoid loss function
def loss_function(scores):
    loss = 0
    for winner, loser in comparisons:
        winner_idx = item_to_index[winner]
        loser_idx = item_to_index[loser]
        # Calculate the score difference
        diff = scores[winner_idx] - scores[loser_idx]
        # Add the -log Sigmoid loss; logaddexp(0, -diff) computes
        # log(1 + e^{-diff}) without overflow when diff is very negative
        loss += np.logaddexp(0, -diff)
    return loss

# Optimize the scores with BFGS (note: BT scores are only identified up to
# an additive constant, so only score differences are meaningful)
result = minimize(loss_function, initial_scores, method='BFGS')
optimized_scores = result.x

# Print the optimized scores
print("Optimized Scores:")
for item, score in zip(items, optimized_scores):
    print(f"{item}: {score:.3f}")

# Rank the items based on scores
ranking = sorted(zip(items, optimized_scores), key=lambda x: x[1], reverse=True)
print("\nRanking:")
for rank, (item, score) in enumerate(ranking, 1):
    print(f"{rank}. {item} (Score: {score:.3f})")

6. Example Results

Running the above code may produce results like the following:

Optimized Scores:
A: 1.579
B: 0.693
C: -0.285

Ranking:
1. A (Score: 1.579)
2. B (Score: 0.693)
3. C (Score: -0.285)

Analysis:
  • Player A has the highest score, indicating the strongest preference or the highest likelihood of winning.
  • Using -log Sigmoid as the loss function ensures that the model effectively captures the relative differences between items and adjusts the scores accordingly.
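
To turn the fitted scores into interpretable predictions, pairwise win probabilities follow directly from the Sigmoid identity derived in Section 4. A short follow-up sketch (reusing optimized_scores and item_to_index from the code above):

from scipy.special import expit  # numerically stable Sigmoid

def win_probability(a, b):
    # P(a beats b) = sigma(score_a - score_b), per the BT/Sigmoid identity
    diff = optimized_scores[item_to_index[a]] - optimized_scores[item_to_index[b]]
    return expit(diff)

for a, b in [('A', 'B'), ('B', 'C'), ('A', 'C')]:
    print(f"P({a} > {b}) = {win_probability(a, b):.3f}")

# With the example scores above, P(A > B) would be about
# expit(1.579 - 0.693) = expit(0.886), i.e., roughly 0.71.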

7. Summary
  • Sigmoid Function: Maps input values to probabilities, commonly used in classification and activation functions.
  • -log Sigmoid Function: Measures confidence or penalty, often used as a loss function for tasks involving pairwise preferences or probability modeling.
  • Application in BT Model: By optimizing score differences with -log Sigmoid, the model produces reliable rankings based on observed pairwise comparisons.

This approach is not only effective for ranking tasks but can also be extended to recommendation systems, question-answering systems, and more.

Postscript

Completed in Shanghai at 13:40 on December 21, 2024, with the assistance of the GPT-4o large model.
