The Gradient-Ascent Bandit Algorithm

Gradient Bandit

Continuing from the overview, we define
$$E[R_t]=\sum_b \pi_t(b)\,q(b)$$
where $\sum_b \pi_t(b)=1$, i.e. the probabilities of selecting the individual actions $b$ at time $t$ sum to 1.
We want to maximize this expectation. To that end we introduce a variable $H_t(a)$, the preference for action $a$ at time $t$, initialized to zero for all actions: $H_0(a)=0$.
To connect the preferences to the expectation above, we use the softmax operation to define the probability of selecting action $a$ at time $t$:
$$\pi_t(a)=\frac{e^{H_t(a)}}{\sum_{b=1}^{n} e^{H_t(b)}}$$
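For intuition, here is a minimal check (an illustrative snippet, not part of the original post) showing that zero initial preferences give a uniform policy, so every arm is explored equally at the start:

import numpy as np

def softmax(H):
    e = np.exp(H - np.max(H))  # subtract the max for numerical stability
    return e / e.sum()

print(softmax(np.zeros(4)))                      # -> [0.25 0.25 0.25 0.25]
print(softmax(np.array([2.0, 0.0, 0.0, 0.0])))   # -> arm 0 now gets ~71% of the probability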
The expectation $E$ is then a function of the preference vector $\vec{H}$, so at each step we only need to update $\vec{H}$. Using gradient ascent:
$$H_{t+1}(a)=H_t(a)+\alpha\,\frac{\partial E[R_t]}{\partial H_t(a)}$$
where $\alpha$ is a constant step size. The derivation below yields the final update rule:
$$H_{t+1}(A_t)=H_t(A_t)+\alpha\,(R_t-\overline{R_t})\,\big(1-\pi_t(A_t)\big),$$
$$H_{t+1}(a)=H_t(a)-\alpha\,(R_t-\overline{R_t})\,\pi_t(a),\qquad \forall a \neq A_t$$
where $A_t$ is the action selected at time $t$ and $\overline{R_t}$ is the average of all rewards $R_k$, $k\leq t$, received so far. It acts as a baseline that speeds up convergence (similar to normalization in deep learning); other sequences could be used in its place.
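Equivalently, the two cases can be written as a single vector update, which is the form the code below implements ($\mathbb{1}_{A_t}$ denotes the one-hot vector of the selected action and $\pi_t$ the vector of action probabilities):
$$H_{t+1}=H_t+\alpha\,(R_t-\overline{R_t})\,\big(\mathbb{1}_{A_t}-\pi_t\big)$$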
The derivation is as follows:
$$\begin{aligned}
\frac{\partial E[R_t]}{\partial H_t(a)} &= \frac{\partial}{\partial H_t(a)}\Big[\sum_b\pi_t(b)\,q(b)\Big] \\
&=\sum_b q(b)\,\frac{\partial \pi_t(b)}{\partial H_t(a)}\\
&=\sum_b \big(q(b)-X_t\big)\frac{\partial \pi_t(b)}{\partial H_t(a)}\\
&=\sum_b \pi_t(b)\big(q(b)-X_t\big)\frac{\partial \pi_t(b)}{\partial H_t(a)}\Big/\pi_t(b)\\
&=E\Big[\big(q(A_t)-X_t\big)\frac{\partial \pi_t(A_t)}{\partial H_t(a)}\Big/\pi_t(A_t)\Big]\\
&=E\Big[\big(R_t-\overline{R_t}\big)\frac{\partial \pi_t(A_t)}{\partial H_t(a)}\Big/\pi_t(A_t)\Big]\\
&=E\Big[\big(R_t-\overline{R_t}\big)\,\pi_t(A_t)\big(\mathbb{1}_{a=A_t}-\pi_t(a)\big)\Big/\pi_t(A_t)\Big]\\
&=E\Big[\big(R_t-\overline{R_t}\big)\big(\mathbb{1}_{a=A_t}-\pi_t(a)\big)\Big]
\end{aligned}$$

where $X_t$ does not depend on the action, and $\mathbb{1}_{a=A_t}$ equals 1 if $a=A_t$ and 0 otherwise. The third equality holds because (by differentiating the softmax function):
$$\sum_b\frac{\partial \pi_t(b)}{\partial H_t(a)}=0$$
The fifth equality uses the definition of expectation. The sixth equality uses the fact that a quantity inside an expectation can be replaced by another with the same expectation: $X_t$ by $\overline{R_t}$ and, conditioned on $A_t$, $q(A_t)$ by the sampled reward $R_t$. The second-to-last equality comes from differentiating the softmax function.
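For completeness, the softmax derivative used in the last two equalities is the standard result
$$\frac{\partial \pi_t(b)}{\partial H_t(a)}=\pi_t(b)\big(\mathbb{1}_{a=b}-\pi_t(a)\big),$$
and summing it over $b$ gives $\pi_t(a)-\pi_t(a)\sum_b\pi_t(b)=0$, which is exactly the identity behind the third equality.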

The code is as follows:

'''
A gradient bandit agent: stochastic gradient ascent on action preferences
for the n-armed bandit problem.
'''
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
n = 10                  # number of arms
q = np.random.randn(n)  # true action values q(a)
T = 1000                # number of time steps
tspan = np.arange(T)

class GradientBandit:
    def __init__(self, n, q) -> None:
        self.n = n
        self.q = q
        self.Actions = np.arange(n)
    def softmax(self, H):
        e = np.exp(H - np.max(H))  # subtract the max for numerical stability
        return e / np.sum(e)
    def play(self, T, alpha, baseline=True):
        ActionRecord = []
        Reward_avg = []
        H = np.zeros(self.n)    # action preferences, H_0(a) = 0
        avg = 0.0               # running average of rewards, used as the baseline
        for t in range(T):
            pi = self.softmax(H)
            a = np.random.choice(self.Actions, p=pi)  # sample an action from the softmax policy
            ActionRecord.append(a)
            R = self.q[a] + np.random.randn()         # noisy reward around the true value q(a)
            avg += (R - avg) / (t + 1)                # incremental mean of all rewards so far
            Reward_avg.append(avg)
            base = avg if baseline else 0.0
            # vector form of the update: H <- H + alpha * (R - baseline) * (one_hot(a) - pi)
            H = H + alpha * (R - base) * (np.eye(self.n)[a] - pi)
        return ActionRecord, Reward_avg
        
if __name__ == "__main__":
    slot_machine = GradientBandit(n, q)
    actions, reward_avg = slot_machine.play(T, 0.1)
    actions1, reward_avg1 = slot_machine.play(T, 0.1, baseline=False)
    actions2, reward_avg2 = slot_machine.play(T, 0.2)
    actions3, reward_avg3 = slot_machine.play(T, 0.2, baseline=False)
    plt.figure()
    plt.plot(tspan, reward_avg, label='alpha=0.1,baseline')
    plt.plot(tspan, reward_avg1, label='alpha=0.1,without baseline')
    plt.plot(tspan, reward_avg2, label='alpha=0.2,baseline')
    plt.plot(tspan, reward_avg3, label='alpha=0.2,without baseline')
    plt.xlabel('Steps')
    plt.ylabel("Average reward")
    plt.title("n Armed Bandit")
    plt.legend()
    print(actions2)
    plt.show()
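As a quick sanity check (an illustrative addition, not in the original post), the following lines can be appended at the end of the `if __name__ == "__main__":` block to compare the truly best arm with the arm the learned policy plays most often towards the end:

    best_arm = int(np.argmax(q))                        # arm with the highest true value
    counts = np.bincount(actions[-100:], minlength=n)   # action counts over the last 100 steps (alpha=0.1, with baseline)
    print("best arm:", best_arm, "| most played arm:", int(np.argmax(counts)))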

The results are as follows:

[Figure: average running reward over 1000 steps for alpha = 0.1 and 0.2, with and without the baseline]

As the figure shows, the algorithm performs well.
This concludes the bandit algorithms; next we move on to Markov decision processes in reinforcement learning.

