Chapter 11 The TRPO Algorithm
11.1 Introduction
The policy-based methods introduced earlier in this book include the policy gradient algorithm and the Actor-Critic algorithm. Although these methods are simple and intuitive, they can suffer from unstable training in practice. Policy-based methods parameterize the agent's policy, design an objective function that measures how good the policy is, and maximize that objective by gradient ascent to obtain an optimal policy. Concretely, let $\theta$ denote the parameters of policy $\pi_\theta$ and define $J(\theta)=\mathbb{E}_{s_0}[V^{\pi_\theta}(s_0)]=\mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{\infty}\gamma^t r(s_t,a_t)\right]$. The goal of a policy-based method is to find $\theta^*=\arg\max_\theta J(\theta)$; the policy gradient algorithm iteratively updates the policy parameters $\theta$ along the direction $\nabla_\theta J(\theta)$. This approach has an obvious drawback, however: when the policy network is a deep model, a gradient step may easily be too large, causing the policy to become drastically worse all at once and harming training.
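The danger of an oversized step can be seen even on a one-dimensional toy objective (a stand-in for $J(\theta)$, not the RL objective itself): with a moderate learning rate, gradient ascent converges to the maximum, while a step that is too large moves the parameter further from the optimum at every update.

```python
import numpy as np

def grad_ascent(theta, lr, steps):
    """Gradient ascent on the toy objective J(theta) = -theta^2 (maximum at 0)."""
    for _ in range(steps):
        theta = theta + lr * (-2.0 * theta)   # gradient of J is -2 * theta
    return theta

near = grad_ascent(3.0, lr=0.1, steps=50)   # moderate step: converges toward 0
far = grad_ascent(3.0, lr=1.1, steps=50)    # step too large: oscillates and diverges
```

Each update multiplies $\theta$ by $(1-2\cdot lr)$, so a learning rate above 1 makes the magnitude grow instead of shrink, which is exactly the failure mode TRPO is designed to prevent.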
To address this problem, we look for a trust region when updating: within this region, a policy update comes with a certain safety guarantee on policy performance. This is the main idea of the Trust Region Policy Optimization (TRPO) algorithm.
11.2 Policy Objective
Suppose the current policy is $\pi_\theta$ with parameters $\theta$. We consider how to use the current $\theta$ to find a better parameter $\theta'$ such that $J(\theta')\ge J(\theta)$. Specifically, since the distribution of the initial state $s_0$ is independent of the policy, the optimization objective $J(\theta)$ under policy $\pi_\theta$ can be rewritten as an expectation under the new policy $\pi_{\theta'}$:
$$
\begin{aligned}
J(\theta) &= \mathbb{E}_{s_0}\left[V^{\pi_\theta}(s_0)\right] \\
&= \mathbb{E}_{\pi_{\theta'}}\left[\sum_{t=0}^{\infty}\gamma^t V^{\pi_\theta}(s_t) - \sum_{t=1}^{\infty}\gamma^t V^{\pi_\theta}(s_t)\right] \\
&= -\mathbb{E}_{\pi_{\theta'}}\left[\sum_{t=0}^{\infty}\gamma^t\left(\gamma V^{\pi_\theta}(s_{t+1}) - V^{\pi_\theta}(s_t)\right)\right]
\end{aligned}
$$
Based on the identity above, we can derive the gap between the objectives of the new and old policies:
$$
\begin{aligned}
J(\theta') - J(\theta) &= \mathbb{E}_{s_0}\left[V^{\pi_{\theta'}}(s_0)\right] - \mathbb{E}_{s_0}\left[V^{\pi_\theta}(s_0)\right] \\
&= \mathbb{E}_{\pi_{\theta'}}\left[\sum_{t=0}^{\infty}\gamma^t r(s_t,a_t)\right] + \mathbb{E}_{\pi_{\theta'}}\left[\sum_{t=0}^{\infty}\gamma^t\left(\gamma V^{\pi_\theta}(s_{t+1}) - V^{\pi_\theta}(s_t)\right)\right] \\
&= \mathbb{E}_{\pi_{\theta'}}\left[\sum_{t=0}^{\infty}\gamma^t\left[r(s_t,a_t) + \gamma V^{\pi_\theta}(s_{t+1}) - V^{\pi_\theta}(s_t)\right]\right]
\end{aligned}
$$
Defining the temporal-difference (TD) residual as the advantage function $A$, we continue:
$$
\begin{aligned}
&= \mathbb{E}_{\pi_{\theta'}}\left[\sum_{t=0}^{\infty}\gamma^t A^{\pi_\theta}(s_t,a_t)\right] \\
&= \sum_{t=0}^{\infty}\gamma^t\,\mathbb{E}_{s_t\sim P_t^{\pi_{\theta'}}}\,\mathbb{E}_{a_t\sim\pi_{\theta'}(\cdot|s_t)}\left[A^{\pi_\theta}(s_t,a_t)\right] \\
&= \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim\nu^{\pi_{\theta'}}}\,\mathbb{E}_{a\sim\pi_{\theta'}(\cdot|s)}\left[A^{\pi_\theta}(s,a)\right]
\end{aligned}
$$
The last equality uses the definition of the state visitation distribution: $\nu^\pi(s)=(1-\gamma)\sum_{t=0}^{\infty}\gamma^t P_t^\pi(s)$. So as long as we can find a new policy such that $\mathbb{E}_{s\sim\nu^{\pi_{\theta'}}}\mathbb{E}_{a\sim\pi_{\theta'}(\cdot|s)}\left[A^{\pi_\theta}(s,a)\right]\ge 0$, policy performance is guaranteed to be monotonically non-decreasing, i.e. $J(\theta')\ge J(\theta)$.
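The state visitation distribution can be checked numerically on a toy two-state Markov chain (the transition matrix `P` and initial distribution `p0` below are illustrative assumptions, not from the text): the truncated discounted sum matches the closed form $(1-\gamma)\,p_0^T(I-\gamma P)^{-1}$ and sums to 1, confirming that $\nu^\pi$ is a proper probability distribution.

```python
import numpy as np

gamma = 0.9
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])       # P[s, s'] = Pr(s' | s) under policy pi
p0 = np.array([1.0, 0.0])        # initial state distribution

# nu(s) = (1 - gamma) * sum_t gamma^t * P_t(s), truncated at a long horizon
nu = np.zeros(2)
p_t = p0.copy()
for t in range(1000):
    nu += (1 - gamma) * gamma ** t * p_t
    p_t = p_t @ P                # state distribution one step later

# closed form: sum_t (gamma P)^t = (I - gamma P)^{-1}
nu_closed = (1 - gamma) * p0 @ np.linalg.inv(np.eye(2) - gamma * P)
```

The geometric weights $(1-\gamma)\gamma^t$ themselves sum to 1, which is why the prefactor $(1-\gamma)$ is needed to make $\nu^\pi$ a distribution.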
But solving this directly is very difficult, because $\pi_{\theta'}$ is the policy we are trying to solve for, yet we would also need it to collect samples. Enumerating all possible new policies, collecting data with each, and then checking which one satisfies the condition is clearly infeasible. TRPO therefore makes an approximation in the state visitation distribution: it ignores the change in state visitation between the two policies, simply uses the state distribution of the old policy $\pi_\theta$, and defines the following surrogate objective:
$$L_\theta(\theta')=J(\theta)+\frac{1}{1-\gamma}\mathbb{E}_{s\sim\nu^{\pi_\theta}}\,\mathbb{E}_{a\sim\pi_{\theta'}(\cdot|s)}\left[A^{\pi_\theta}(s,a)\right]$$
When the new and old policies are very close, the change in state visitation distribution is very small, so this approximation is reasonable. Note that the actions are still sampled from the new policy $\pi_{\theta'}$; we can handle the action distribution with importance sampling:
$$L_\theta(\theta')=J(\theta)+\frac{1}{1-\gamma}\mathbb{E}_{s\sim\nu^{\pi_\theta}}\,\mathbb{E}_{a\sim\pi_\theta(\cdot|s)}\left[\frac{\pi_{\theta'}(a|s)}{\pi_\theta(a|s)}A^{\pi_\theta}(s,a)\right]$$
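Importance sampling lets us evaluate an expectation under the new policy while only drawing actions from the old one, by reweighting each sample with the ratio $\pi_{\theta'}(a|s)/\pi_\theta(a|s)$. A minimal numerical check for a single state with two actions (the probabilities and advantage values are toy assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

p_old = np.array([0.5, 0.5])   # old policy pi_theta over two actions
p_new = np.array([0.8, 0.2])   # new policy pi_theta'
A = np.array([1.0, -1.0])      # toy advantage of each action

# exact expectation of the advantage under the NEW policy
exact = float(p_new @ A)       # 0.8 * 1 + 0.2 * (-1) = 0.6

# importance-sampling estimate using only samples from the OLD policy
actions = rng.choice(2, size=200_000, p=p_old)
ratios = p_new[actions] / p_old[actions]
estimate = float(np.mean(ratios * A[actions]))
```

The reweighted sample mean converges to the exact new-policy expectation, which is precisely what allows TRPO to reuse data collected by $\pi_\theta$.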
In this way, we can estimate and optimize the new policy $\pi_{\theta'}$ using data already sampled from the old policy $\pi_\theta$. To ensure the new and old policies stay close enough, TRPO uses the Kullback–Leibler (KL) divergence to measure the distance between policies, giving the overall optimization problem:
$$\max_{\theta'}\; L_{\theta_k}(\theta')$$
$$\text{s.t.}\quad \mathbb{E}_{s\sim\nu^{\pi_{\theta_k}}}\left[D_{KL}\left(\pi_{\theta_k}(\cdot|s),\pi_{\theta'}(\cdot|s)\right)\right]\le\delta$$
Here the inequality constraint defines a KL ball in policy space, called the trust region. Within this region, the state distribution of the current learning policy interacting with the environment can be treated as identical to the state distribution sampled by the previous round's policy, so the one-step importance sampling above lets the current policy improve stably. The principle behind TRPO is shown in Figure 11-1.
The left panel shows that with no trust region at all, a gradient update of the policy may cause a drastic drop in performance; the right panel shows that with a trust region, every gradient update of the policy can be guaranteed to improve performance.
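The trust region is simply the set of policies whose average KL divergence from the current one is at most $\delta$. A small sketch of the membership test for single-state categorical policies (the probability vectors and $\delta$ below are illustrative assumptions):

```python
import numpy as np

def kl_categorical(p, q):
    """KL divergence D_KL(p || q) between two discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

pi_old = np.array([0.5, 0.5])
pi_close = np.array([0.55, 0.45])   # a small change to the policy
pi_far = np.array([0.95, 0.05])     # a drastic change to the policy

delta = 0.01                        # trust region radius
inside = kl_categorical(pi_old, pi_close) <= delta   # small step stays inside
outside = kl_categorical(pi_old, pi_far) <= delta    # large step falls outside
```

Only candidate policies inside the KL ball are trusted; in the full algorithm the same mean-KL test is applied per visited state via `torch.distributions.kl.kl_divergence`.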
11.3 Approximate Solution

The constrained problem above is hard to optimize directly. TRPO approximates the objective to first order and the KL constraint to second order around the current parameters $\theta_k$:
$$\max_{\theta'}\; g^T(\theta'-\theta_k)\quad \text{s.t.}\quad \frac{1}{2}(\theta'-\theta_k)^T H(\theta'-\theta_k)\le\delta$$
where $g=\nabla_{\theta'}L_{\theta_k}(\theta')\big|_{\theta'=\theta_k}$ is the gradient of the surrogate objective and $H$ is the Hessian matrix of the average KL divergence. This quadratic problem has the closed-form solution
$$\theta_{k+1}=\theta_k+\sqrt{\frac{2\delta}{g^TH^{-1}g}}H^{-1}g$$
11.4 Conjugate Gradient
In general, a policy represented by a neural network has tens of thousands of parameters or more, so computing and storing the inverse of the Hessian matrix $H$ would cost a great deal of memory and time. TRPO avoids this with the conjugate gradient method, whose core idea is to directly compute $x=H^{-1}g$, where $x$ is the parameter update direction. Let $\beta$ be the maximum step size that satisfies the KL constraint; the constraint then gives $\frac{1}{2}(\beta x)^T H(\beta x)=\delta$. Solving for $\beta$ yields $\beta=\sqrt{\frac{2\delta}{x^THx}}$. The parameter update is therefore
$$\theta_{k+1}=\theta_k+\sqrt{\frac{2\delta}{x^THx}}\,x$$
Thus, as long as we can compute $x=H^{-1}g$, we can update the parameters by this formula; the problem reduces to solving $Hx=g$. Since $H$ is a symmetric positive-definite matrix, we can use the conjugate gradient method. The procedure of the conjugate gradient method is as follows:
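Mirroring the `conjugate_gradient` method implemented later in Section 11.7, the iteration can be sketched in NumPy; note that only a Hessian-vector product is ever needed, never the full matrix $H$ (the test matrix below is an illustrative assumption):

```python
import numpy as np

def conjugate_gradient(Hvp, g, max_iter=10, tol=1e-10):
    """Solve H x = g given only a Hessian-vector product Hvp(v) = H v."""
    x = np.zeros_like(g)          # initial guess x_0 = 0
    r = g.copy()                  # residual r_0 = g - H x_0 = g
    p = r.copy()                  # initial search direction
    rdotr = r @ r
    for _ in range(max_iter):
        Hp = Hvp(p)
        alpha = rdotr / (p @ Hp)  # step size along direction p
        x += alpha * p
        r -= alpha * Hp
        new_rdotr = r @ r
        if new_rdotr < tol:       # residual small enough: converged
            break
        p = r + (new_rdotr / rdotr) * p   # next H-conjugate direction
        rdotr = new_rdotr
    return x

# sanity check on a small symmetric positive-definite system
H = np.array([[4.0, 1.0],
              [1.0, 3.0]])
g = np.array([1.0, 2.0])
x_sol = conjugate_gradient(lambda v: H @ v, g)
```

For an $n\times n$ SPD system, conjugate gradient converges in at most $n$ iterations in exact arithmetic, which is why a small fixed iteration count (10 in the later code) already gives a good approximation of $H^{-1}g$.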
11.5 Line Search
Because TRPO uses first-order and second-order Taylor approximations, the solution is not exact: the resulting $\theta'$ may not actually be better than $\theta_k$, or may violate the KL divergence constraint. TRPO therefore performs a line search at the end of each iteration to ensure it finds a parameter that satisfies the conditions. Concretely, it finds the smallest non-negative integer $i$ such that, with the update
$$\theta_{k+1}=\theta_k+\alpha^i\sqrt{\frac{2\delta}{x^THx}}\,x$$
the resulting $\theta_{k+1}$ still satisfies the original KL divergence constraint and actually improves the surrogate objective $L_{\theta_k}$. Here $\alpha\in(0,1)$ is a hyperparameter that controls how quickly the line search shrinks the step.
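The backtracking search can be sketched as follows. This is a toy one-dimensional example: `objective` and `kl_to_old` are illustrative stand-ins for the surrogate objective and the mean KL divergence that the real algorithm computes from the policy networks.

```python
import numpy as np

def line_search(theta, full_step, objective, kl_to_old, delta,
                alpha=0.5, max_backtracks=15):
    """Return the first shrunken step (full_step * alpha^i) that improves
    the objective while keeping the KL divergence to the old policy < delta."""
    old_obj = objective(theta)
    for i in range(max_backtracks):
        new_theta = theta + (alpha ** i) * full_step
        if objective(new_theta) > old_obj and kl_to_old(new_theta) < delta:
            return new_theta
    return theta  # no acceptable step found: keep the old parameters

# toy stand-ins: objective peaks at theta = 1, KL grows away from theta = 0
objective = lambda t: -float((t[0] - 1.0) ** 2)
kl_to_old = lambda t: 0.1 * float(t[0] ** 2)
theta0 = np.array([0.0])
full_step = np.array([2.0])      # the full step overshoots the optimum
new_theta = line_search(theta0, full_step, objective, kl_to_old, delta=0.5)
```

Here the full step ($i=0$) lands at $\theta=2$ and fails to improve the objective, so the search halves the step ($i=1$) and accepts $\theta=1$, which both improves the objective and stays within the KL budget.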
At this point we have covered the overall procedure of TRPO. Its full algorithm flow is as follows:
- Initialize the policy network parameters $\theta$ and the value network parameters $\omega$
- for episode $e = 1 \to E$ do:
- Sample a trajectory $\{s_1,a_1,r_1,s_2,a_2,r_2,\ldots\}$ with the current policy $\pi_\theta$
- Estimate the advantage $A(s_t,a_t)$ of each state-action pair from the collected data and the value network
- Compute the gradient $g$ of the policy objective
- Compute $x=H^{-1}g$ with the conjugate gradient method
- Find a value of $i$ by line search and update the policy network parameters $\theta_{k+1}=\theta_k+\alpha^i\sqrt{\frac{2\delta}{x^THx}}x$, where $i\in\{1,2,3,\ldots,K\}$ is the smallest integer that improves the policy and satisfies the KL constraint
- Update the value network parameters (in the same way as in Actor-Critic)
- end for
11.6 Generalized Advantage Estimation

Generalized advantage estimation (GAE) estimates the advantage as an exponentially weighted sum of TD residuals, $A_t=\sum_{l=0}^{\infty}(\gamma\lambda)^l\delta_{t+l}$, which can be computed with a single backward pass over the trajectory:

```python
def compute_advantage(gamma, lmbda, td_delta):
    td_delta = td_delta.detach().numpy()
    advantage_list = []
    advantage = 0.0
    # traverse the TD residuals backwards, accumulating
    # (gamma * lmbda)-discounted suffix sums
    for delta in td_delta[::-1]:
        advantage = gamma * lmbda * advantage + delta
        advantage_list.append(advantage)
    advantage_list.reverse()
    return torch.tensor(advantage_list, dtype=torch.float)
```
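As a sanity check, the same recursion in plain NumPy (so it runs without a trajectory or networks) produces the expected suffix sums when $\gamma=\lambda=1$:

```python
import numpy as np

def compute_advantage(gamma, lmbda, td_delta):
    # same backward recursion as the torch version, in plain NumPy
    advantage_list = []
    advantage = 0.0
    for delta in td_delta[::-1]:
        advantage = gamma * lmbda * advantage + delta
        advantage_list.append(advantage)
    advantage_list.reverse()
    return np.array(advantage_list)

# with gamma = lmbda = 1, each advantage is just the suffix sum of the
# TD residuals, so [1, 1, 1] -> [3, 2, 1]
adv = compute_advantage(1.0, 1.0, [1.0, 1.0, 1.0])
```

Setting $\lambda$ between 0 and 1 interpolates between the one-step TD residual ($\lambda=0$) and the full Monte Carlo advantage ($\lambda=1$), trading bias against variance.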
11.7 TRPO Code in Practice
In this section we run TRPO experiments in environments with both discrete and continuous actions. The first environment is CartPole, and the second is the inverted pendulum (Pendulum).
11.7.1 CartPole Environment
```python
import torch
import numpy as np
import gym
import matplotlib.pyplot as plt
import torch.nn.functional as F
import rl_utils
import copy


class PolicyNet(torch.nn.Module):
    def __init__(self, state_dim, hidden_dim, action_dim):
        super(PolicyNet, self).__init__()
        self.fc1 = torch.nn.Linear(state_dim, hidden_dim)
        self.fc2 = torch.nn.Linear(hidden_dim, action_dim)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return F.softmax(self.fc2(x), dim=1)


class ValueNet(torch.nn.Module):
    def __init__(self, state_dim, hidden_dim):
        super(ValueNet, self).__init__()
        self.fc1 = torch.nn.Linear(state_dim, hidden_dim)
        self.fc2 = torch.nn.Linear(hidden_dim, 1)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return self.fc2(x)


def compute_advantage(gamma, lmbda, td_delta):
    # generalized advantage estimation
    td_delta = td_delta.detach().numpy()
    advantage_list = []
    advantage = 0.0
    for delta in td_delta[::-1]:
        advantage = gamma * lmbda * advantage + delta
        advantage_list.append(advantage)
    advantage_list.reverse()
    return torch.tensor(advantage_list, dtype=torch.float)
```
```python
class TRPO:
    """ TRPO algorithm """
    def __init__(self, hidden_dim, state_space, action_space, lmbda,
                 kl_constraint, alpha, critic_lr, gamma, device):
        state_dim = state_space.shape[0]
        action_dim = action_space.n
        # the policy network's parameters are not updated by an optimizer
        self.actor = PolicyNet(state_dim, hidden_dim, action_dim).to(device)
        self.critic = ValueNet(state_dim, hidden_dim).to(device)
        self.critic_optimizer = torch.optim.Adam(self.critic.parameters(),
                                                 lr=critic_lr)
        self.gamma = gamma
        self.lmbda = lmbda  # GAE parameter
        self.kl_constraint = kl_constraint  # maximum allowed KL divergence
        self.alpha = alpha  # line search coefficient
        self.device = device

    def take_action(self, state):  # sample an action from the stochastic policy
        state = torch.tensor([state], dtype=torch.float).to(self.device)
        probs = self.actor(state)
        action_dist = torch.distributions.Categorical(probs)
        action = action_dist.sample()
        return action.item()

    def hessian_matrix_vector_product(self, states, old_action_dists, vector):
        # compute the product of the Hessian matrix and a vector
        new_action_dists = torch.distributions.Categorical(self.actor(states))
        # mean KL divergence between old and new policies
        kl = torch.mean(
            torch.distributions.kl.kl_divergence(old_action_dists,
                                                 new_action_dists))
        kl_grad = torch.autograd.grad(kl,
                                      self.actor.parameters(),
                                      create_graph=True)
        kl_grad_vector = torch.cat([grad.view(-1) for grad in kl_grad])
        # first take the dot product of the KL gradient with the vector
        kl_grad_vector_product = torch.dot(kl_grad_vector, vector)
        grad2 = torch.autograd.grad(kl_grad_vector_product,
                                    self.actor.parameters())
        grad2_vector = torch.cat([grad.view(-1) for grad in grad2])
        return grad2_vector

    def conjugate_gradient(self, grad, states, old_action_dists):
        # conjugate gradient method for solving Hx = g
        x = torch.zeros_like(grad)
        r = grad.clone()
        p = grad.clone()
        rdotr = torch.dot(r, r)
        for i in range(10):  # main conjugate gradient loop
            Hp = self.hessian_matrix_vector_product(states, old_action_dists,
                                                    p)
            alpha = rdotr / torch.dot(p, Hp)
            x += alpha * p
            r -= alpha * Hp
            new_rdotr = torch.dot(r, r)
            if new_rdotr < 1e-10:
                break
            beta = new_rdotr / rdotr
            p = r + beta * p
            rdotr = new_rdotr
        return x

    def compute_surrogate_obj(self, states, actions, advantage, old_log_probs,
                              actor):  # compute the surrogate policy objective
        log_probs = torch.log(actor(states).gather(1, actions))
        ratio = torch.exp(log_probs - old_log_probs)
        return torch.mean(ratio * advantage)

    def line_search(self, states, actions, advantage, old_log_probs,
                    old_action_dists, max_vec):  # line search
        # flatten the policy network's parameters into a single vector
        old_para = torch.nn.utils.convert_parameters.parameters_to_vector(
            self.actor.parameters())
        old_obj = self.compute_surrogate_obj(states, actions, advantage,
                                             old_log_probs, self.actor)
        for i in range(15):  # main line search loop
            coef = self.alpha**i  # alpha to the power i
            new_para = old_para + coef * max_vec
            new_actor = copy.deepcopy(self.actor)
            torch.nn.utils.convert_parameters.vector_to_parameters(
                new_para, new_actor.parameters())
            new_action_dists = torch.distributions.Categorical(
                new_actor(states))
            kl_div = torch.mean(
                torch.distributions.kl.kl_divergence(old_action_dists,
                                                     new_action_dists))
            new_obj = self.compute_surrogate_obj(states, actions, advantage,
                                                 old_log_probs, new_actor)
            if new_obj > old_obj and kl_div < self.kl_constraint:
                return new_para
        return old_para

    def policy_learn(self, states, actions, old_action_dists, old_log_probs,
                     advantage):  # update the policy network
        surrogate_obj = self.compute_surrogate_obj(states, actions, advantage,
                                                   old_log_probs, self.actor)
        grads = torch.autograd.grad(surrogate_obj, self.actor.parameters())
        obj_grad = torch.cat([grad.view(-1) for grad in grads]).detach()
        # solve x = H^(-1)g with the conjugate gradient method
        descent_direction = self.conjugate_gradient(obj_grad, states,
                                                    old_action_dists)
        Hd = self.hessian_matrix_vector_product(states, old_action_dists,
                                                descent_direction)
        max_coef = torch.sqrt(2 * self.kl_constraint /
                              (torch.dot(descent_direction, Hd) + 1e-8))
        new_para = self.line_search(states, actions, advantage, old_log_probs,
                                    old_action_dists,
                                    descent_direction * max_coef)
        # write the line-searched parameters back into the policy network
        torch.nn.utils.convert_parameters.vector_to_parameters(
            new_para, self.actor.parameters())

    def update(self, transition_dict):
        states = torch.tensor(transition_dict['states'],
                              dtype=torch.float).to(self.device)
        actions = torch.tensor(transition_dict['actions']).view(-1, 1).to(
            self.device)
        rewards = torch.tensor(transition_dict['rewards'],
                               dtype=torch.float).view(-1, 1).to(self.device)
        next_states = torch.tensor(transition_dict['next_states'],
                                   dtype=torch.float).to(self.device)
        dones = torch.tensor(transition_dict['dones'],
                             dtype=torch.float).view(-1, 1).to(self.device)
        td_target = rewards + self.gamma * self.critic(next_states) * (1 -
                                                                       dones)
        td_delta = td_target - self.critic(states)
        advantage = compute_advantage(self.gamma, self.lmbda,
                                      td_delta.cpu()).to(self.device)
        old_log_probs = torch.log(self.actor(states).gather(
            1, actions)).detach()
        old_action_dists = torch.distributions.Categorical(
            self.actor(states).detach())
        critic_loss = torch.mean(
            F.mse_loss(self.critic(states), td_target.detach()))
        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()  # update the value network
        # update the policy network
        self.policy_learn(states, actions, old_action_dists, old_log_probs,
                          advantage)
```
```python
num_episodes = 500
hidden_dim = 128
gamma = 0.98
lmbda = 0.95
critic_lr = 1e-2
kl_constraint = 0.0005
alpha = 0.5
device = torch.device("cuda") if torch.cuda.is_available() else torch.device(
    "cpu")

env_name = 'CartPole-v0'
env = gym.make(env_name)
env.seed(0)
torch.manual_seed(0)
agent = TRPO(hidden_dim, env.observation_space, env.action_space, lmbda,
             kl_constraint, alpha, critic_lr, gamma, device)
return_list = rl_utils.train_on_policy_agent(env, agent, num_episodes)

episodes_list = list(range(len(return_list)))  # per-episode returns
plt.plot(episodes_list, return_list)
plt.xlabel('Episodes')
plt.ylabel('Returns')
plt.title('TRPO on {}'.format(env_name))
plt.show()

mv_return = rl_utils.moving_average(return_list, 9)  # moving average
plt.plot(episodes_list, mv_return)
plt.xlabel('Episodes')
plt.ylabel('Returns')
plt.title('TRPO on {}'.format(env_name))
plt.show()
```
The biggest difference between this code and the Actor-Critic algorithm lies in how the policy parameters are updated: here the algorithm restricts the update to a trust region and uses optimization methods to find the best policy within that region.
Iteration 0: 0%| | 0/50 [00:00<?, ?it/s]/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:38: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:201.)
Iteration 0: 100%|██████████| 50/50 [00:03<00:00, 15.71it/s, episode=50, return=139.200]
Iteration 1: 100%|██████████| 50/50 [00:03<00:00, 13.08it/s, episode=100, return=150.500]
Iteration 2: 100%|██████████| 50/50 [00:04<00:00, 11.57it/s, episode=150, return=184.000]
Iteration 3: 100%|██████████| 50/50 [00:06<00:00, 7.60it/s, episode=200, return=183.600]
Iteration 4: 100%|██████████| 50/50 [00:06<00:00, 7.17it/s, episode=250, return=183.500]
Iteration 5: 100%|██████████| 50/50 [00:04<00:00, 10.91it/s, episode=300, return=193.700]
Iteration 6: 100%|██████████| 50/50 [00:04<00:00, 10.70it/s, episode=350, return=199.500]
Iteration 7: 100%|██████████| 50/50 [00:04<00:00, 10.89it/s, episode=400, return=200.000]
Iteration 8: 100%|██████████| 50/50 [00:04<00:00, 10.80it/s, episode=450, return=200.000]
Iteration 9: 100%|██████████| 50/50 [00:04<00:00, 11.09it/s, episode=500, return=200.000]
11.7.2 Inverted Pendulum Environment
The difference between the inverted pendulum environment and the cart-pole environment is that the inverted pendulum's actions are continuous, whereas the cart-pole's are discrete.
```python
import torch
import numpy as np
import gym
import matplotlib.pyplot as plt
import torch.nn.functional as F
import rl_utils
import copy


class ValueNet(torch.nn.Module):
    def __init__(self, state_dim, hidden_dim):
        super(ValueNet, self).__init__()
        self.fc1 = torch.nn.Linear(state_dim, hidden_dim)
        self.fc2 = torch.nn.Linear(hidden_dim, 1)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return self.fc2(x)


def compute_advantage(gamma, lmbda, td_delta):
    # generalized advantage estimation
    td_delta = td_delta.detach().numpy()
    advantage_list = []
    advantage = 0.0
    for delta in td_delta[::-1]:
        advantage = gamma * lmbda * advantage + delta
        advantage_list.append(advantage)
    advantage_list.reverse()
    return torch.tensor(advantage_list, dtype=torch.float)


class PolicyNetContinuous(torch.nn.Module):
    def __init__(self, state_dim, hidden_dim, action_dim):
        super(PolicyNetContinuous, self).__init__()
        self.fc1 = torch.nn.Linear(state_dim, hidden_dim)
        self.fc_mu = torch.nn.Linear(hidden_dim, action_dim)
        self.fc_std = torch.nn.Linear(hidden_dim, action_dim)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        mu = 2.0 * torch.tanh(self.fc_mu(x))
        std = F.softplus(self.fc_std(x))
        return mu, std  # mean and standard deviation of the Gaussian
```
```python
class TRPOContinuous:
    """ TRPO for continuous action spaces """
    def __init__(self, hidden_dim, state_space, action_space, lmbda,
                 kl_constraint, alpha, critic_lr, gamma, device):
        state_dim = state_space.shape[0]
        action_dim = action_space.shape[0]
        self.actor = PolicyNetContinuous(state_dim, hidden_dim,
                                         action_dim).to(device)
        self.critic = ValueNet(state_dim, hidden_dim).to(device)
        self.critic_optimizer = torch.optim.Adam(self.critic.parameters(),
                                                 lr=critic_lr)
        self.gamma = gamma
        self.lmbda = lmbda
        self.kl_constraint = kl_constraint
        self.alpha = alpha
        self.device = device

    def take_action(self, state):
        state = torch.tensor([state], dtype=torch.float).to(self.device)
        mu, std = self.actor(state)
        # the network outputs Gaussian parameters; sample an action from it
        action_dist = torch.distributions.Normal(mu, std)
        action = action_dist.sample()
        return [action.item()]

    def hessian_matrix_vector_product(self, states, old_action_dists, vector,
                                      damping=0.1):
        mu, std = self.actor(states)
        new_action_dists = torch.distributions.Normal(mu, std)
        kl = torch.mean(
            torch.distributions.kl.kl_divergence(old_action_dists,
                                                 new_action_dists))
        kl_grad = torch.autograd.grad(kl,
                                      self.actor.parameters(),
                                      create_graph=True)
        kl_grad_vector = torch.cat([grad.view(-1) for grad in kl_grad])
        kl_grad_vector_product = torch.dot(kl_grad_vector, vector)
        grad2 = torch.autograd.grad(kl_grad_vector_product,
                                    self.actor.parameters())
        # concatenate all second-order gradients into a single vector
        grad2_vector = torch.cat(
            [grad.contiguous().view(-1) for grad in grad2])
        # damping keeps the (approximate) Hessian positive definite
        return grad2_vector + damping * vector

    def conjugate_gradient(self, grad, states, old_action_dists):
        x = torch.zeros_like(grad)
        r = grad.clone()
        p = grad.clone()
        rdotr = torch.dot(r, r)
        for i in range(10):
            Hp = self.hessian_matrix_vector_product(states, old_action_dists,
                                                    p)
            alpha = rdotr / torch.dot(p, Hp)
            x += alpha * p
            r -= alpha * Hp
            new_rdotr = torch.dot(r, r)
            if new_rdotr < 1e-10:
                break
            beta = new_rdotr / rdotr
            p = r + beta * p
            rdotr = new_rdotr
        return x

    def compute_surrogate_obj(self, states, actions, advantage, old_log_probs,
                              actor):
        mu, std = actor(states)
        action_dists = torch.distributions.Normal(mu, std)
        log_probs = action_dists.log_prob(actions)
        ratio = torch.exp(log_probs - old_log_probs)
        return torch.mean(ratio * advantage)

    def line_search(self, states, actions, advantage, old_log_probs,
                    old_action_dists, max_vec):
        old_para = torch.nn.utils.convert_parameters.parameters_to_vector(
            self.actor.parameters())
        old_obj = self.compute_surrogate_obj(states, actions, advantage,
                                             old_log_probs, self.actor)
        for i in range(15):
            coef = self.alpha**i
            new_para = old_para + coef * max_vec
            new_actor = copy.deepcopy(self.actor)
            torch.nn.utils.convert_parameters.vector_to_parameters(
                new_para, new_actor.parameters())
            mu, std = new_actor(states)
            new_action_dists = torch.distributions.Normal(mu, std)
            kl_div = torch.mean(
                torch.distributions.kl.kl_divergence(old_action_dists,
                                                     new_action_dists))
            new_obj = self.compute_surrogate_obj(states, actions, advantage,
                                                 old_log_probs, new_actor)
            if new_obj > old_obj and kl_div < self.kl_constraint:
                return new_para
        return old_para

    def policy_learn(self, states, actions, old_action_dists, old_log_probs,
                     advantage):
        surrogate_obj = self.compute_surrogate_obj(states, actions, advantage,
                                                   old_log_probs, self.actor)
        grads = torch.autograd.grad(surrogate_obj, self.actor.parameters())
        obj_grad = torch.cat([grad.view(-1) for grad in grads]).detach()
        descent_direction = self.conjugate_gradient(obj_grad, states,
                                                    old_action_dists)
        Hd = self.hessian_matrix_vector_product(states, old_action_dists,
                                                descent_direction)
        max_coef = torch.sqrt(2 * self.kl_constraint /
                              (torch.dot(descent_direction, Hd) + 1e-8))
        new_para = self.line_search(states, actions, advantage, old_log_probs,
                                    old_action_dists,
                                    descent_direction * max_coef)
        torch.nn.utils.convert_parameters.vector_to_parameters(
            new_para, self.actor.parameters())

    def update(self, transition_dict):
        states = torch.tensor(transition_dict['states'],
                              dtype=torch.float).to(self.device)
        actions = torch.tensor(transition_dict['actions'],
                               dtype=torch.float).view(-1, 1).to(self.device)
        rewards = torch.tensor(transition_dict['rewards'],
                               dtype=torch.float).view(-1, 1).to(self.device)
        next_states = torch.tensor(transition_dict['next_states'],
                                   dtype=torch.float).to(self.device)
        dones = torch.tensor(transition_dict['dones'],
                             dtype=torch.float).view(-1, 1).to(self.device)
        rewards = (rewards + 8.0) / 8.0  # rescale rewards to ease training
        td_target = rewards + self.gamma * self.critic(next_states) * (1 -
                                                                       dones)
        td_delta = td_target - self.critic(states)
        advantage = compute_advantage(self.gamma, self.lmbda,
                                      td_delta.cpu()).to(self.device)
        mu, std = self.actor(states)
        old_action_dists = torch.distributions.Normal(mu.detach(),
                                                      std.detach())
        old_log_probs = old_action_dists.log_prob(actions)
        critic_loss = torch.mean(
            F.mse_loss(self.critic(states), td_target.detach()))
        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()
        self.policy_learn(states, actions, old_action_dists, old_log_probs,
                          advantage)
```
```python
num_episodes = 2000
hidden_dim = 128
gamma = 0.9
lmbda = 0.9
critic_lr = 1e-2
kl_constraint = 0.00005
alpha = 0.5
device = torch.device("cuda") if torch.cuda.is_available() else torch.device(
    "cpu")

env_name = 'Pendulum-v0'
env = gym.make(env_name)
env.seed(0)
torch.manual_seed(0)
agent = TRPOContinuous(hidden_dim, env.observation_space, env.action_space,
                       lmbda, kl_constraint, alpha, critic_lr, gamma, device)
return_list = rl_utils.train_on_policy_agent(env, agent, num_episodes)

episodes_list = list(range(len(return_list)))
plt.plot(episodes_list, return_list)
plt.xlabel('Episodes')
plt.ylabel('Returns')
plt.title('TRPO on {}'.format(env_name))
plt.show()

mv_return = rl_utils.moving_average(return_list, 9)
plt.plot(episodes_list, mv_return)
plt.xlabel('Episodes')
plt.ylabel('Returns')
plt.title('TRPO on {}'.format(env_name))
plt.show()
```
## 11.8 Summary
This chapter covered the TRPO algorithm and ran experiments in environments with both discrete and continuous actions. TRPO is an on-policy method: each round of policy training uses only the data sampled by the previous round's policy. It is one of the most representative policy-based deep reinforcement learning algorithms. Intuitively, TRPO's argument is this: since a change in the policy changes the data distribution, which strongly affects how well a deep policy network can learn, we mark out a trustworthy region for policy learning, thereby guaranteeing stable and effective policy updates.