李宏毅 (Hung-yi Lee) Deep Reinforcement Learning Notes (Part 1)
Policy Gradient
Policy gradient goes from on-policy to off-policy; add a few constraints on top of that and you get PPO.
Review of policy gradient:
Basic components: Actor, Environment, Reward (the latter two are not under our control; the only thing we can change is the Actor).
The policy $\pi$ is a network with parameters $\theta$. Its input is the observed state of the environment, represented as a vector or matrix, and its output is the probability of every action.
The total reward of an episode:
$$R=\sum_{t=1}^{T}r_t$$
Trajectory:
$$\tau =\left\{s_1,a_1,s_2,a_2,\dots,s_T,a_T \right\}$$
$$p_\theta(\tau)=p(s_1)\,p_\theta(a_1|s_1)\,p(s_2|s_1,a_1)\,p_\theta(a_2|s_2)\cdots=p(s_1) \prod_{t=1}^{T}p_\theta(a_t|s_t)\,p(s_{t+1}|s_t,a_t)$$
$p(s_{t+1}|s_t,a_t)$ is determined by the environment and is not under our control; what we can control is the actor's $p_\theta(a_t|s_t)$.
Summing over all possible trajectories gives the expected reward:
$$\bar{R}_\theta=\sum_{\tau}R(\tau)p_\theta(\tau)=E_{\tau \sim p_\theta(\tau)}[R(\tau)]$$
The trajectory $\tau$ follows the distribution $p_\theta(\tau)$, and $R(\tau)$ is the total reward of trajectory $\tau$; the average of $R(\tau)$ under the distribution $p_\theta(\tau)$ is written as $E_{\tau \sim p_\theta(\tau)}[R(\tau)]$.
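The sum over all trajectories cannot be computed exactly, so in practice the expectation is estimated by sampling. Below is a minimal sketch of such a Monte Carlo estimate on CartPole-v1, using a uniformly random policy purely for illustration (the environment, the episode count `N`, and the random policy are my assumptions, and the same pre-0.26 gym API as the code later in these notes is assumed):

```python
import gym
import numpy as np

# Estimate E[R(tau)] ~= (1/N) * sum_n R(tau^n) by sampling N episodes.
env = gym.make('CartPole-v1')
N = 100
returns = []
for _ in range(N):
    obs = env.reset()
    done, total_reward = False, 0.0
    while not done:
        action = env.action_space.sample()       # random "policy" pi(a|s), illustration only
        obs, reward, done, _ = env.step(action)  # environment transition p(s_{t+1}|s_t,a_t)
        total_reward += reward                   # accumulate R(tau)
    returns.append(total_reward)
print('Monte Carlo estimate of the expected reward:', np.mean(returns))
```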
The goal is to maximize the expected reward. Ordinary gradient descent finds a minimum; to find a maximum, simply flip the minus sign in the parameter update into a plus sign (i.e. gradient ascent):
$$\theta=\theta-\alpha \nabla_\theta \bar{R}_\theta \;\rightarrow\; \theta=\theta+\alpha \nabla_\theta \bar{R}_\theta$$
Take the gradient of $\bar{R}_\theta$:
$$\nabla_\theta \bar{R}_\theta = \sum_{\tau}R(\tau)\nabla_\theta p_\theta(\tau) =\sum_{\tau}R(\tau)p_\theta(\tau)\frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)}$$
Using the derivative formula for the log, $\nabla_\theta \log p_\theta(\tau)=\frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)}$, this can be rewritten, and the expectation can then be approximated by sampling $N$ trajectories:
$$\begin{aligned} \nabla_\theta \bar{R}_\theta &=\sum_{\tau}R(\tau)p_\theta(\tau)\frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)}=\sum_{\tau}R(\tau)p_\theta(\tau)\nabla_\theta \log p_\theta (\tau)\\ &=E_{\tau \sim p_\theta(\tau)}[R(\tau)\nabla_\theta \log p_\theta (\tau)]\\ &\approx\frac{1}{N}\sum_{n=1}^{N}R(\tau^n)\nabla_\theta \log p_\theta (\tau^n) \end{aligned}$$
From the derivation above we need $\nabla_\theta \log p_\theta(\tau)$, where $p_\theta(\tau)$ is given below; the factors that do not depend on $\theta$ have zero gradient and simply drop out.
$$p_\theta(\tau)=p(s_1) \prod_{t=1}^{T}p_\theta(a_t|s_t)\,p(s_{t+1}|s_t,a_t)$$
$$\log p_\theta(\tau)=\log p(s_1)+\sum_{t=1}^{T}\log p_\theta(a_t|s_t)+\sum_{t=1}^{T}\log p(s_{t+1}|s_t,a_t)$$
Only the middle sum depends on $\theta$; $\log p(s_1)$ and the transition terms $\log p(s_{t+1}|s_t,a_t)$ belong to the environment.
Therefore:
$$\begin{aligned} \nabla_\theta \bar{R}_\theta=\frac{1}{N}\sum_{n=1}^{N}R(\tau^n)\nabla_\theta \log p_\theta (\tau^n) &= \frac{1}{N}\sum_{n=1}^{N}R(\tau^n)\sum_{t=1}^{T_n}\nabla_\theta \log p_\theta(a_t^n|s_t^n)\\ &=\frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n}R(\tau^n)\nabla_\theta \log p_\theta(a_t^n|s_t^n) \end{aligned}$$
So the parameter update rule is:
$$\theta\leftarrow \theta+\alpha \nabla_\theta \bar{R}_\theta$$
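In an autodiff framework this gradient is usually realized by building a surrogate loss whose gradient equals $-\nabla_\theta \bar{R}_\theta$ and then doing ordinary gradient descent on it. Below is a minimal PyTorch sketch; the function name and the assumption that `log_probs` holds $\log p_\theta(a_t|s_t)$ for one sampled trajectory are mine, for illustration only:

```python
import torch

# Suppose, for a single sampled trajectory (N = 1), we have collected:
#   log_probs : list of scalar tensors log pi_theta(a_t|s_t), still attached to the graph
#   R         : float, total reward R(tau) of that trajectory
# Then loss = -R * sum_t log pi_theta(a_t|s_t), so a gradient-descent step on `loss`
# is a gradient-ascent step on the expected reward.
def policy_gradient_loss(log_probs, R):
    return -R * torch.stack(log_probs).sum()

# usage sketch (an optimizer over the policy's parameters is assumed to exist):
# loss = policy_gradient_loss(log_probs, R)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```

The code at the end of these notes builds essentially this loss, except that it weights each step with a per-step discounted return instead of the whole-trajectory reward.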
Improvements:
- Add a Baseline:
  $R(\tau^n)$ is the total reward of the $n$-th trajectory and acts as a weight, so trajectories with a small $R(\tau^n)$ should have their probability decreased. However, $R(\tau^n)$ may always be positive (there are no negative rewards), so a baseline $b$ is subtracted:
  $$\nabla_\theta \bar{R}_\theta=\frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n}\big(R(\tau^n)-b\big)\nabla_\theta \log p_\theta(a_t^n|s_t^n)$$
  $b$ can be taken as $E[R(\tau)]$, or chosen in other ways. The point is that the weight of a trajectory can now be positive or negative: whenever $(R(\tau^n)-b)$ is negative, the probability of that trajectory is decreased.
- Assign Suitable Credit:
  Within one episode, every $\nabla_\theta \log p_\theta(a_t^n|s_t^n)$ is multiplied by the same factor $(R(\tau^n)-b)$, which is unfair to the individual actions $a_t$. A reward that better represents the contribution of $a_t$ is
  $$\sum_{t'=t}^{T_n}r_{t'}^n,$$
  the sum of all rewards from the time of $a_t$ until the end of the episode. Going one step further, multiply each reward by a discount factor $\gamma<1$, so that rewards further in the future are credited less to $a_t$:
  $$\sum_{t'=t}^{T_n}\gamma^{t'-t}r_{t'}^n$$
  The gradient expression is then updated to (a code sketch follows this list):
  $$\nabla_\theta \bar{R}_\theta=\frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n}\Big(\sum_{t'=t}^{T_n}\gamma^{t'-t}r_{t'}^n-b\Big)\nabla_\theta \log p_\theta(a_t^n|s_t^n)$$
  The baseline $b$ can be chosen in different ways and will be discussed later. The term
  $$\sum_{t'=t}^{T_n}\gamma^{t'-t}r_{t'}^n-b$$
  can be called the advantage function.
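Below is a minimal sketch of this discounted reward-to-go weight, computed for every time step of one episode. The function name `discounted_rewards_to_go` and the choice of the episode's mean return as the baseline $b$ are illustrative assumptions, not something fixed by the notes:

```python
import numpy as np

def discounted_rewards_to_go(rewards, gamma, baseline=0.0):
    """For each step t, return sum_{t'=t}^{T} gamma^(t'-t) * r_{t'} - baseline."""
    G = np.zeros(len(rewards))
    running = 0.0
    # accumulate from the back: G_t = r_t + gamma * G_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G - baseline

# usage sketch: rewards collected from one episode
rewards = [1.0, 1.0, 1.0, 0.0]
weights = discounted_rewards_to_go(rewards, gamma=0.99, baseline=np.mean(rewards))
print(weights)
```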
Below is the code for episode-based (Monte Carlo) policy gradient. Because the parameters are updated at the end of every episode, the sum over $N$ in the gradient formula above is dropped (effectively $N=1$), and the code below does not add a baseline:
import gym
import numpy as np
import matplotlib.pylab as plt
import torch.nn as nn
import torch.nn.functional as F
import torch
from torch.distributions import Categorical
class Policy(nn.Module):
    def __init__(self, s_size=4, h_size=128, a_size=2):
        super(Policy, self).__init__()
        self.affine1 = nn.Linear(s_size, h_size)
        self.dropout = nn.Dropout(p=0.6)
        self.affine2 = nn.Linear(h_size, a_size)
        self.saved_log_prob = []
        self.rewards = []

    def forward(self, x):
        x = self.affine1(x)
        x = self.dropout(x)
        x = F.relu(x)
        action_scores = self.affine2(x)
        return F.softmax(action_scores, dim=1)  # convert the scores into action probabilities


class Agent():
    def __init__(self, episode, max_steps, gamma):
        '''
        :param episode: number of episodes
        :param max_steps: maximum number of steps per episode
        :param gamma: discount factor for future rewards
        '''
        self.episode = episode
        self.max_steps = max_steps
        self.gamma = gamma

    def decide(self, observation):
        state = torch.from_numpy(observation).float().unsqueeze(0)  # add a dimension: a batch_size of 1 for the network
        pro_action = policy(state)
        m = Categorical(pro_action)  # categorical distribution over action indices with probabilities pro_action
        action = m.sample()  # sample the action that will actually be executed
        return action.item(), m.log_prob(action)  # return the action and the log-probability of that action

    def learn(self):
        collect_loss = []
        collect_reward = []
        for i in range(1, self.episode + 1):
            observation = env.reset()
            env.render()
            G = []
            log_pro_actions = []
            # collect one episode of data
            for j in range(self.max_steps):
                action, log_pro_action = self.decide(observation)  # action is 0 or 1
                next_observation, reward, done, _ = env.step(action)
                G.append(reward)  # collect the reward
                log_pro_actions.append(log_pro_action)  # collect the log-probability of the action
                if done:
                    break
                observation = next_observation
            collect_reward.append(np.sum(G))  # total reward of this episode
            # compute the discounted return G_t of every action, from the last step backwards:
            # G_t = r_t + gamma * G_{t+1}
            for k in range(len(G) - 2, -1, -1):
                G[k] = G[k] + self.gamma * G[k + 1]
            # the loss of each action is its log-probability times G_t; summing gives the total loss.
            # The minus sign turns maximizing the expected return into minimizing a loss,
            # so ordinary gradient descent can be used.
            loss = [-pro * r for pro, r in zip(log_pro_actions, G)]
            loss = torch.cat(loss).sum()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            collect_loss.append(loss.item())  # store a plain float so the computation graph can be freed
            if i % 100 == 0:
                print('iteration {:d}: reward {:.4f}'.format(i, np.mean(collect_reward[-100:])))
        return collect_reward, collect_loss
env=gym.make('CartPole-v1')
env.seed(0)
# print(env.observation_space)  # the state space has 4 dimensions
# print(env.action_space)  # the action space has 2 actions
policy=Policy()
optimizer=torch.optim.Adam(policy.parameters(),lr=1e-2)
agent=Agent(episode=1000,max_steps=100,gamma=0.5)
collect_reward,collect_loss=agent.learn()
env.close()
plt.figure()
plt.plot(collect_loss)
plt.title('loss')
plt.figure()
plt.plot(collect_reward)
plt.title('reward')
plt.show()
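As mentioned above, this code does not use a baseline. One minimal way to add one, assuming we simply take $b$ to be the mean return of the current episode (an illustrative choice, not the only one), would be to change the loss construction inside `Agent.learn` roughly as follows:

```python
# hypothetical change inside Agent.learn, after the discounted returns G have been computed:
b = np.mean(G)  # baseline b: mean return of this episode (illustrative choice)
loss = [-pro * (r - b) for pro, r in zip(log_pro_actions, G)]
loss = torch.cat(loss).sum()
```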
I am still a beginner; if anything here is wrong, corrections are sincerely appreciated.