UCB (Upper Confidence Bound) Algorithm

In a recommender system, an item's reward rate (i.e., its click-through rate) is usually quantified as clicks / impressions. For example, with 8 clicks out of 10 impressions, the estimated CTR is 80%. But will the observed CTR still be 80% once the impression count reaches 10,000? Clearly that cannot be taken for granted. This is the confidence problem in statistics, and there are many ways to compute a confidence interval for a CTR, such as the Wilson confidence interval.
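As a minimal sketch of the idea (the function name and the 95% z-value here are my own choices, not from the original), the Wilson score interval makes the small-sample uncertainty explicit:

```python
import math

def wilson_interval(clicks, impressions, z=1.96):
    """Wilson score interval for a binomial proportion such as CTR.
    z=1.96 corresponds to a 95% confidence level."""
    if impressions == 0:
        return 0.0, 1.0  # no data: the CTR could be anything
    p = clicks / impressions
    denom = 1 + z ** 2 / impressions
    center = (p + z ** 2 / (2 * impressions)) / denom
    half_width = (z / denom) * math.sqrt(
        p * (1 - p) / impressions + z ** 2 / (4 * impressions ** 2)
    )
    return max(0.0, center - half_width), min(1.0, center + half_width)

# 8 clicks / 10 impressions: point estimate 80%, but the interval is wide
print(wilson_interval(8, 10))        # roughly (0.49, 0.94)
# same 80% at 8000 / 10000 impressions: the interval is now tight
print(wilson_interval(8000, 10000))  # roughly (0.79, 0.81)
```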

The UCB algorithm proceeds as follows: first try every item once, then in each subsequent round play the item with the largest score:

Input: N arms, number of rounds T >= N

  • step.1. For t = 1 … N, play arm t
  • step.2. For t = N+1 … T, play the arm j with the largest score(j)

In step.2, the arm to play is chosen by ranking the items; the ranking score is computed as:
$$score(j) = \frac{C_{j,t}}{T_{j,t}} + \sqrt{\frac{2 \ln t}{T_{j,t}}}$$

In this formula, the term to the left of the "+" is the item's mean reward observed so far, and the term on the right is essentially the uncertainty of that mean (a confidence radius that shrinks as the item is shown more often).

  • t: total number of trials so far
  • T(j,t): number of times item j has been recommended up to time t
  • C(j,t): number of times item j has been clicked up to time t
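As an aside not spelled out in the original, the bonus term can be derived from Hoeffding's inequality. Writing $\mu_j$ for item j's unknown true mean reward, Hoeffding bounds the chance that the empirical mean is off by more than $\epsilon$:

$$P\!\left(\frac{C_{j,t}}{T_{j,t}} \le \mu_j - \epsilon\right) \le e^{-2\, T_{j,t}\, \epsilon^2}$$

Setting this failure probability to $t^{-4}$ (the standard choice in the UCB1 analysis) and solving for $\epsilon$ gives exactly $\epsilon = \sqrt{2 \ln t / T_{j,t}}$, the bonus in the score formula above.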

This formula shows that as an item accumulates more trials, its confidence interval narrows and its reward probability becomes more certain. Items with a higher mean reward are more likely to be selected (exploit), while items with a lower mean reward are selected less often; at the same time, items that have been tried only a few times still get a chance to be shown, which provides the explore behavior.
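A minimal numeric sketch of this trade-off (the item counts below are made up for illustration):

```python
import math

def ucb_score(clicks, trials, t):
    """score(j) = C_{j,t} / T_{j,t} + sqrt(2 * ln(t) / T_{j,t})"""
    return clicks / trials + math.sqrt(2 * math.log(t) / trials)

# After t = 100 total trials:
# item A: well explored, higher mean reward, small bonus
print(ucb_score(clicks=40, trials=80, t=100))  # 0.50 + 0.34 ≈ 0.84
# item B: barely explored, lower mean reward, large bonus -> tried next
print(ucb_score(clicks=2, trials=5, t=100))    # 0.40 + 1.36 ≈ 1.76
```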

Below are Python implementations:

1. Softmax algorithm:

```python
import numpy as np

def softmax_action_selection(q_values, tau=1.0):
    """
    Softmax action selection algorithm for the multi-armed bandit problem.
    :param q_values: numpy array of shape (num_actions,) representing the estimated action values
    :param tau: float temperature parameter controlling the degree of exploration
    :return: selected action
    """
    # subtract the max before exponentiating for numerical stability
    exp_q = np.exp((q_values - np.max(q_values)) / tau)
    probabilities = exp_q / np.sum(exp_q)
    action = np.random.choice(len(q_values), p=probabilities)
    return action
```

2. Epsilon-Greedy algorithm:

```python
import numpy as np

def epsilon_greedy_action_selection(q_values, epsilon=0.1):
    """
    Epsilon-greedy action selection algorithm for the multi-armed bandit problem.
    :param q_values: numpy array of shape (num_actions,) representing the estimated action values
    :param epsilon: float parameter controlling the degree of exploration
    :return: selected action
    """
    if np.random.rand() < epsilon:
        action = np.random.choice(len(q_values))  # explore
    else:
        action = np.argmax(q_values)  # exploit
    return action
```

3. Beta Thompson sampling algorithm:

```python
import numpy as np

class BetaThompsonSampling:
    def __init__(self, num_actions):
        """
        Beta Thompson sampling algorithm for the multi-armed bandit problem.
        :param num_actions: number of actions (arms)
        """
        # Beta(1, 1) prior, i.e. uniform, for every arm
        self.alpha = np.ones(num_actions)
        self.beta = np.ones(num_actions)

    def action_selection(self):
        """
        Select an action by sampling from each arm's Beta posterior.
        :return: selected action
        """
        samples = np.random.beta(self.alpha, self.beta)
        action = np.argmax(samples)
        return action

    def update(self, action, reward):
        """
        Update the Beta distribution of the selected arm.
        :param action: selected action
        :param reward: observed reward (1 for a click, 0 otherwise)
        """
        if reward == 1:
            self.alpha[action] += 1
        else:
            self.beta[action] += 1
```

4. UCB algorithm:

```python
import numpy as np

class UCB:
    def __init__(self, num_actions, c=1.0):
        """
        Upper Confidence Bound (UCB) algorithm for the multi-armed bandit problem.
        :param num_actions: number of actions (arms)
        :param c: exploration parameter
        """
        self.num_actions = num_actions
        self.c = c
        self.N = np.zeros(num_actions)  # pull counts per arm
        self.Q = np.zeros(num_actions)  # estimated mean reward per arm

    def action_selection(self):
        """
        Select the action with the largest UCB upper confidence bound.
        :return: selected action
        """
        # +1 inside the log guards against log(0) before the first pull;
        # 1e-8 avoids division by zero and gives unpulled arms a huge
        # bonus, so every arm is tried early on
        upper_bounds = self.Q + self.c * np.sqrt(
            np.log(np.sum(self.N) + 1) / (self.N + 1e-8)
        )
        action = np.argmax(upper_bounds)
        return action

    def update(self, action, reward):
        """
        Update the estimated action value of the selected arm.
        :param action: selected action
        :param reward: observed reward
        """
        self.N[action] += 1
        # incremental mean update
        self.Q[action] += (reward - self.Q[action]) / self.N[action]
```

5. LinUCB algorithm:

```python
import numpy as np

class LinUCB:
    def __init__(self, num_actions, num_features, alpha=0.1):
        """
        Linear Upper Confidence Bound (LinUCB) algorithm for the contextual bandit problem.
        :param num_actions: number of actions (arms)
        :param num_features: number of context features
        :param alpha: exploration parameter
        """
        self.num_actions = num_actions
        self.num_features = num_features
        self.alpha = alpha
        # one ridge-regression design matrix A and response vector b per arm
        self.A = np.array([np.eye(num_features) for _ in range(num_actions)])
        self.b = np.zeros((num_actions, num_features))
        self.theta = np.zeros((num_actions, num_features))

    def action_selection(self, features):
        """
        Select the action with the largest LinUCB upper confidence bound.
        :param features: numpy array of shape (num_features,) representing the context
        :return: selected action
        """
        upper_bounds = np.zeros(self.num_actions)
        for i in range(self.num_actions):
            A_inv = np.linalg.inv(self.A[i])
            self.theta[i] = np.dot(A_inv, self.b[i])  # ridge-regression estimate
            upper_bounds[i] = np.dot(self.theta[i], features) + \
                self.alpha * np.sqrt(np.dot(features.T, np.dot(A_inv, features)))
        action = np.argmax(upper_bounds)
        return action

    def update(self, action, features, reward):
        """
        Update the estimated parameters of the selected arm.
        :param action: selected action
        :param features: numpy array of shape (num_features,) representing the context
        :param reward: observed reward
        """
        self.A[action] += np.outer(features, features)
        self.b[action] += reward * features
```
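As a usage sketch (not part of the original), here is one way to wire the `UCB` class above to a simulated Bernoulli bandit; the click-through rates are made up for illustration:

```python
import numpy as np

np.random.seed(42)
true_ctr = np.array([0.05, 0.10, 0.20])  # hidden per-arm reward probabilities (made up)

bandit = UCB(num_actions=len(true_ctr), c=1.0)
for _ in range(10000):
    arm = bandit.action_selection()
    reward = float(np.random.rand() < true_ctr[arm])  # simulated click
    bandit.update(arm, reward)

print("pull counts:", bandit.N)  # the 0.20 arm should receive most pulls
print("estimates:  ", bandit.Q)  # Q should approach true_ctr for well-pulled arms
```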
