Bandits via Gradient Ascent
Continuing from the previous summary, we define
$$E[R_t]=\sum_b \pi_t(b)\,q(b)$$
where $\sum_b \pi_t(b)=1$, i.e., at time t the probabilities of selecting the individual actions b sum to 1.
We want to maximize this expectation. To do so, we introduce variables $H_t(a)$ denoting the preference for action a at time t, all initialized to zero: $H_0(a)=0$.
To connect the preferences with the expectation above, we use the softmax operation to define the probability of selecting action a at time t:
$$\pi_t(a)=\frac{e^{H_t(a)}}{\sum_{b=1}^{n} e^{H_t(b)}}$$
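As a quick illustration of the softmax policy (a minimal sketch, not part of the original post): with all preferences equal, the policy is uniform, and raising one preference shifts probability mass toward that action.

import numpy as np

def softmax(H):
    # Subtract the max for numerical stability; the result is unchanged.
    e = np.exp(H - np.max(H))
    return e / e.sum()

print(softmax(np.zeros(4)))                 # uniform: [0.25 0.25 0.25 0.25]
print(softmax(np.array([2., 0., 0., 0.])))  # most mass on action 0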
In this way, the expectation E becomes a function of the preference vector $\vec{H}$, so at each step we only need to update the value of $\vec{H}$. Using gradient ascent:
$$H_{t+1}(a)=H_t(a)+\alpha\,\frac{\partial E[R_t]}{\partial H_t(a)}$$
where $\alpha$ is a constant step size. Through the derivation below, we obtain the final update rule:
$$H_{t+1}(A_t)=H_t(A_t)+\alpha\,(R_t-\overline{R_t})\,(1-\pi_t(A_t)),$$
$$H_{t+1}(a)=H_t(a)-\alpha\,(R_t-\overline{R_t})\,\pi_t(a),\quad \forall\, a\neq A_t$$
where $A_t$ is the action selected at time t, and $\overline{R_t}$ is the average of all rewards $R_k,\ k\le t$, observed up to time t. It serves as a baseline that speeds up convergence (similar to normalization in deep learning); any other suitable sequence would also work.
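Note that the two cases above can be combined into a single vector update, which is the form the code at the end of this post implements (using a one-hot vector for the chosen action):

$$\vec{H}_{t+1}=\vec{H}_t+\alpha\,(R_t-\overline{R_t})\,\big(\mathbb{1}_{A_t}-\vec{\pi}_t\big)$$

where $\mathbb{1}_{A_t}$ is the one-hot indicator vector of $A_t$.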
The derivation is as follows:
$$
\begin{aligned}
\frac{\partial E[R_t]}{\partial H_t(a)} &= \frac{\partial}{\partial H_t(a)}\Big[\sum_b \pi_t(b)\,q(b)\Big] \\
&= \sum_b q(b)\,\frac{\partial \pi_t(b)}{\partial H_t(a)} \\
&= \sum_b \big(q(b)-X_t\big)\,\frac{\partial \pi_t(b)}{\partial H_t(a)} \\
&= \sum_b \pi_t(b)\,\big(q(b)-X_t\big)\,\frac{\partial \pi_t(b)}{\partial H_t(a)}\Big/\pi_t(b) \\
&= E\Big[\big(q(A_t)-X_t\big)\,\frac{\partial \pi_t(A_t)}{\partial H_t(a)}\Big/\pi_t(A_t)\Big] \\
&= E\Big[\big(R_t-\overline{R_t}\big)\,\frac{\partial \pi_t(A_t)}{\partial H_t(a)}\Big/\pi_t(A_t)\Big] \\
&= E\Big[\big(R_t-\overline{R_t}\big)\,\pi_t(A_t)\,\big(\mathbb{1}_{a=A_t}-\pi_t(a)\big)\Big/\pi_t(A_t)\Big] \\
&= E\Big[\big(R_t-\overline{R_t}\big)\,\big(\mathbb{1}_{a=A_t}-\pi_t(a)\big)\Big]
\end{aligned}
$$
where $X_t$ is any quantity that does not depend on the action, and the indicator $\mathbb{1}_{a=A_t}$ equals 1 if $a=A_t$ and 0 otherwise. The third equality holds because (by differentiating the softmax function):
$$\sum_b \frac{\partial \pi_t(b)}{\partial H_t(a)}=0$$
The fifth equality uses the definition of expectation. The sixth uses the fact that a quantity inside an expectation may be replaced by one with the same expectation: $q(A_t)=E[R_t\mid A_t]$, and the baseline $\overline{R_t}$ is substituted for $X_t$. The second-to-last equality follows from differentiating the softmax function.
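For reference, the softmax derivative used in these steps is

$$\frac{\partial \pi_t(b)}{\partial H_t(a)}=\pi_t(b)\,\big(\mathbb{1}_{a=b}-\pi_t(a)\big)$$

and summing it over b gives $\pi_t(a)-\pi_t(a)\sum_b\pi_t(b)=0$, the identity invoked for the third equality. Plugging the derivative into the second line also yields the closed form $\partial E[R_t]/\partial H_t(a)=\pi_t(a)\,(q(a)-E[R_t])$, which can be checked against a finite-difference estimate (a minimal sketch, not part of the original post):

import numpy as np

rng = np.random.default_rng(0)
H = rng.normal(size=5)   # arbitrary preferences
q = rng.normal(size=5)   # arbitrary action values

def softmax(H):
    e = np.exp(H - H.max())
    return e / e.sum()

def expected_reward(H):
    return softmax(H) @ q

# Analytic gradient derived above: dE/dH_a = pi(a) * (q(a) - E[R_t])
analytic = softmax(H) * (q - expected_reward(H))

# Central finite differences for comparison
eps = 1e-6
numeric = np.array([(expected_reward(H + eps * np.eye(5)[a])
                     - expected_reward(H - eps * np.eye(5)[a])) / (2 * eps)
                    for a in range(5)])
print(np.allclose(analytic, numeric))  # True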
The code is as follows:
'''
A stochastic gradient ascent algorithm for the n-armed bandit problem
(the gradient bandit algorithm).
'''
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
n = 10                  # number of arms
q = np.random.randn(n)  # true action values q(a)
T = 1000                # number of time steps
tspan = np.arange(T)

class GradientBandit:
    def __init__(self, n, q) -> None:
        self.n = n
        self.q = q
        self.Actions = np.arange(n)

    def softmax(self, H):
        # Subtracting the max improves numerical stability without changing the result.
        e = np.exp(H - np.max(H))
        return e / e.sum()

    def play(self, T, alpha, baseline=True):
        ActionRecord = []
        Reward_avg = [0]      # running average of rewards, used as the baseline
        H = np.zeros(self.n)  # initial preferences H_0(a) = 0
        for t in range(1, T):
            pi = self.softmax(H)
            a = np.random.choice(self.Actions, p=pi)  # sample action A_t ~ pi_t
            ActionRecord.append(a)
            R = self.q[a] + np.random.randn()         # noisy reward R_t
            # Incremental mean of R_1, ..., R_t.
            Reward_avg.append((Reward_avg[-1] * (t - 1) + R) / t)
            base = Reward_avg[-1] if baseline else 0
            # Vectorized update: H += alpha * (R_t - baseline) * (one_hot(A_t) - pi_t).
            H = H + alpha * (R - base) * (np.eye(self.n)[a] - pi)
        return ActionRecord, Reward_avg

if __name__ == "__main__":
    slot_machine = GradientBandit(n, q)
    actions, reward_avg = slot_machine.play(T, 0.1)
    actions1, reward_avg1 = slot_machine.play(T, 0.1, baseline=False)
    actions2, reward_avg2 = slot_machine.play(T, 0.2)
    actions3, reward_avg3 = slot_machine.play(T, 0.2, baseline=False)
    plt.figure()
    plt.plot(tspan, reward_avg, label='alpha=0.1, baseline')
    plt.plot(tspan, reward_avg1, label='alpha=0.1, without baseline')
    plt.plot(tspan, reward_avg2, label='alpha=0.2, baseline')
    plt.plot(tspan, reward_avg3, label='alpha=0.2, without baseline')
    plt.xlabel('Steps')
    plt.ylabel('Average reward')
    plt.title('n-Armed Bandit')
    plt.legend()
    plt.show()
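As a quick sanity check (my addition, a sketch that assumes the script above has just run, so q and actions2 are in scope), the most frequently chosen arm should usually coincide with the truly best arm:

from collections import Counter

best_arm = int(np.argmax(q))                          # arm with the highest true value
most_chosen = Counter(actions2).most_common(1)[0][0]  # arm picked most often
print(f"best arm: {best_arm}, most chosen: {most_chosen}")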
The results are as follows:
(Figure: average reward over steps for the four settings.)
As can be seen, the algorithm performs well.
This concludes the bandit algorithms; next up is the Markov decision process in reinforcement learning.