Multi-Armed Bandits in Reinforcement Learning: the Classic Epsilon-Greedy and UCB Algorithms, with Python Implementations

I have recently been reading the Management Science paper "A Dynamic Clustering Approach to Data-Driven Assortment Personalization", which uses a multi-armed bandit model. I wanted to study the model in more depth, but I could not find a Chinese-language introduction to it anywhere, so I learned it from YouTube and am sharing a translated write-up here.

Exploration and exploitation tradeoff

A classic problem in reinforcement learning is the exploration and exploitation tradeoff. It poses a dilemma: should we spend effort exploring, so that we can estimate the rewards more accurately, or should we act on the information we already have and choose the action with the highest expected reward?
This question leads to the multi-armed bandit model.

Multi-Armed Bandit Model

Suppose there are n slot machines, each with a different payoff, and we do not know the expected payoff of any machine in advance.
Here we assume that each machine's payoff follows a normal distribution with variance 1 and an unknown mean. We need to explore each machine's payoff distribution and eventually concentrate our pulls on the machine with the best expected payoff.
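To make the setup concrete, here is a minimal sketch of this reward model (the names `mu` and `pull`, and the choice of 10 arms with means drawn uniformly from [0, 1], are my own illustrative assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n_arms = 10
mu = rng.uniform(0, 1, n_arms)   # unknown true mean payoff of each machine

def pull(i):
    """Pull machine i: the observed payoff is its true mean plus N(0, 1) noise."""
    return mu[i] + rng.normal(0, 1)

# The learner only ever observes pull(i); it never sees mu directly.
print(pull(3))
```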

The traditional solution: A/B testing

The idea of an A/B test is to allocate the same number of trial pulls to every machine, and then, based on the results of all the trials, commit to the best-performing machine for all remaining pulls.
The biggest drawback of this approach is that it separates exploration from exploitation. During the exploration phase we only collect information, and once the exploitation phase begins we stop exploring altogether, losing the chance to keep learning. We may end up stuck with a machine that only looked best by chance and never find the truly optimal one.
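For later comparison, here is a minimal sketch of this test-then-commit baseline under the same assumed reward model as above (the equal budget `trials_per_arm = 100` is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(1)
n_arms, T = 10, 50000
mu = rng.uniform(0, 1, n_arms)        # unknown true means
trials_per_arm = 100                  # identical test budget for every machine

# Test phase: pull every machine the same number of times and average the payoffs.
estimates = np.array([(mu[i] + rng.normal(0, 1, trials_per_arm)).mean()
                      for i in range(n_arms)])

# Commit phase: play only the empirically best machine for the remaining pulls.
chosen = int(np.argmax(estimates))
print("committed to arm", chosen, "| truly best arm is", int(np.argmax(mu)))
```

If the test budget is too small, `chosen` will often differ from the truly best arm, and the commit phase can never correct that mistake.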

The epsilon-greedy algorithm

Epsilon-greedy is still a greedy algorithm, but at each step it chooses, with a small probability epsilon, an action other than the currently best-looking one, so exploration never stops. Because epsilon is small and the algorithm eventually identifies the optimal action, the probability of choosing the optimal action converges to roughly 1 - epsilon.
The Python code is shown below:

import numpy as np
import matplotlib.pyplot as plt


class EpsilonGreedy:
    def __init__(self):
        self.epsilon = 0.1  # exploration probability epsilon
        self.num_arm = 10  # number of arms
        self.arms = np.random.uniform(0, 1, self.num_arm)  # true mean of each arm, a random number in [0, 1]
        self.best = np.argmax(self.arms)  # index of the truly best arm
        self.T = 50000  # total number of pulls
        self.hit = np.zeros(self.T)  # records whether each pull selected the best arm
        self.reward = np.zeros(self.num_arm)  # running average reward of each arm
        self.num = np.zeros(self.num_arm)  # number of times each arm has been pulled

    def get_reward(self, i):  # i is the index of the arm
        return self.arms[i] + np.random.normal(0, 1)  # observed reward is the arm's mean plus noise

    def update(self, i):
        self.num[i] += 1
        self.reward[i] = (self.reward[i]*(self.num[i]-1)+self.get_reward(i))/self.num[i]

    def calculate(self):
        for i in range(self.T):
            if np.random.random() > self.epsilon:
                index = np.argmax(self.reward)
            else:
                a = np.argmax(self.reward)
                index = a
                while index == a:
                    index = np.random.randint(0, self.num_arm)
            if index == self.best:
                self.hit[i] = 1  # mark the pull as 1 if it selected the truly best arm
            self.update(index)

    def plot(self):  # plot the running hit rate to check convergence
        x = np.array(range(self.T))
        y = np.cumsum(self.hit) / (x + 1)  # fraction of pulls so far that hit the best arm
        plt.plot(x, y)
        plt.xlabel('number of pulls')
        plt.ylabel('fraction of optimal pulls')
        plt.show()
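A minimal way to run the class above (my own usage sketch, not part of the original post):

```python
if __name__ == "__main__":
    agent = EpsilonGreedy()
    agent.calculate()
    print("truly best arm:", agent.best,
          "| most-pulled arm:", int(np.argmax(agent.num)))
    agent.plot()  # the running hit rate should climb towards roughly 1 - epsilon
```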
Below are Python implementations of several common bandit algorithms:

1. Softmax algorithm:

```python
import numpy as np

def softmax_action_selection(q_values, tau=1.0):
    """
    Softmax action selection algorithm for multi-armed bandit problem.
    :param q_values: numpy array of shape (num_actions,) representing the estimated action values
    :param tau: float temperature parameter controlling the degree of exploration
    :return: selected action
    """
    probabilities = np.exp(q_values / tau) / np.sum(np.exp(q_values / tau))
    action = np.random.choice(len(q_values), p=probabilities)
    return action
```

2. Epsilon-Greedy algorithm:

```python
import numpy as np

def epsilon_greedy_action_selection(q_values, epsilon=0.1):
    """
    Epsilon-greedy action selection algorithm for multi-armed bandit problem.
    :param q_values: numpy array of shape (num_actions,) representing the estimated action values
    :param epsilon: float parameter controlling the degree of exploration
    :return: selected action
    """
    if np.random.rand() < epsilon:
        action = np.random.choice(len(q_values))
    else:
        action = np.argmax(q_values)
    return action
```

3. Beta Thompson sampling algorithm:

```python
import numpy as np

class BetaThompsonSampling:
    def __init__(self, num_actions):
        """
        Beta Thompson sampling algorithm for multi-armed bandit problem.
        :param num_actions: number of actions (arms)
        """
        self.alpha = np.ones(num_actions)
        self.beta = np.ones(num_actions)

    def action_selection(self):
        """
        Select action according to the Beta distribution of each arm.
        :return: selected action
        """
        samples = np.random.beta(self.alpha, self.beta)
        action = np.argmax(samples)
        return action

    def update(self, action, reward):
        """
        Update the Beta distribution of the selected arm.
        :param action: selected action
        :param reward: observed reward
        """
        if reward == 1:
            self.alpha[action] += 1
        else:
            self.beta[action] += 1
```

4. UCB algorithm:

```python
import numpy as np

class UCB:
    def __init__(self, num_actions, c=1.0):
        """
        Upper Confidence Bound (UCB) algorithm for multi-armed bandit problem.
        :param num_actions: number of actions (arms)
        :param c: exploration parameter
        """
        self.num_actions = num_actions
        self.c = c
        self.N = np.zeros(num_actions)
        self.Q = np.zeros(num_actions)

    def action_selection(self):
        """
        Select action according to the UCB upper confidence bound.
        :return: selected action
        """
        upper_bounds = self.Q + self.c * np.sqrt(np.log(np.sum(self.N)) / (self.N + 1e-8))
        action = np.argmax(upper_bounds)
        return action

    def update(self, action, reward):
        """
        Update the estimated action value of the selected arm.
        :param action: selected action
        :param reward: observed reward
        """
        self.N[action] += 1
        self.Q[action] += (reward - self.Q[action]) / self.N[action]
```

5. LinUCB algorithm:

```python
import numpy as np

class LinUCB:
    def __init__(self, num_actions, num_features, alpha=0.1):
        """
        Linear Upper Confidence Bound (LinUCB) algorithm for multi-armed bandit problem.
        :param num_actions: number of actions (arms)
        :param num_features: number of features
        :param alpha: exploration parameter
        """
        self.num_actions = num_actions
        self.num_features = num_features
        self.alpha = alpha
        self.A = np.array([np.eye(num_features) for _ in range(num_actions)])
        self.b = np.zeros((num_actions, num_features))
        self.theta = np.zeros((num_actions, num_features))

    def action_selection(self, features):
        """
        Select action according to the LinUCB upper confidence bound.
        :param features: numpy array of shape (num_features,) representing the features of the context
        :return: selected action
        """
        upper_bounds = np.zeros(self.num_actions)
        for i in range(self.num_actions):
            A_inv = np.linalg.inv(self.A[i])
            self.theta[i] = np.dot(A_inv, self.b[i])
            upper_bounds[i] = np.dot(self.theta[i], features) + \
                self.alpha * np.sqrt(np.dot(features.T, np.dot(A_inv, features)))
        action = np.argmax(upper_bounds)
        return action

    def update(self, action, features, reward):
        """
        Update the estimated parameters of the selected arm.
        :param action: selected action
        :param features: numpy array of shape (num_features,) representing the features of the context
        :param reward: observed reward
        """
        self.A[action] += np.outer(features, features)
        self.b[action] += reward * features
```
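As a quick check of the `UCB` class above, here is a usage sketch; the Bernoulli reward setup and the initial round of one pull per arm are my own assumptions, the latter so that `log(np.sum(N))` in the confidence bound stays finite:

```python
import numpy as np

rng = np.random.default_rng(42)
true_means = rng.uniform(0, 1, 10)   # Bernoulli success probability of each arm (illustrative)
agent = UCB(num_actions=10, c=1.0)

# Pull each arm once first so that log(np.sum(N)) in the confidence bound is well defined.
for a in range(10):
    agent.update(a, float(rng.random() < true_means[a]))

for t in range(20000):
    a = agent.action_selection()
    reward = float(rng.random() < true_means[a])   # Bernoulli reward draw
    agent.update(a, reward)

print("truly best arm:", int(np.argmax(true_means)),
      "| most-pulled arm:", int(np.argmax(agent.N)))
```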
