Derivation of the Policy Gradient and Its Problems
Policy Gradient Theory
The policy gradient method is widely used in model-free reinforcement learning algorithms. Its basic idea is to update the policy's action distribution $\pi(a|s)$ by iterative gradient steps until it reaches the optimal policy $\pi^*(a|s)$, where the optimal policy satisfies

$$V^{\pi^*}(s)\ge V^{\pi}(s), \quad \forall\pi\in\Pi,\ \forall s\in\mathcal{S}$$

In practice a parameterized policy $\pi_\theta$ is used to compute the action distribution in each state, so the reinforcement learning objective can be written as a function $J(\theta)$ of the policy parameters $\theta$. The objective is usually the expected long-term reward $R(\tau)=\sum^{T-1}_{t=0}r_t$ of a sampled trajectory $\tau$, i.e. $J(\theta)=\mathbb{E}_\tau[R(\tau)|\pi_\theta]$. Differentiating the objective $J(\theta)$ with respect to the policy parameters $\theta$ gives:
$$\begin{aligned} \nabla_\theta J(\theta)&=\nabla_\theta\mathbb{E}_\tau[R(\tau)|\pi_\theta]\\ &=\nabla_\theta\int_\tau p(\tau|\pi_\theta)R(\tau)\,d\tau\\ &=\int_\tau p(\tau|\pi_\theta)\nabla_\theta\log p(\tau|\pi_\theta)R(\tau)\,d\tau\\ &=\mathbb{E}_\tau[\nabla_\theta\log p(\tau|\pi_\theta)R(\tau)] \end{aligned}$$

Since the trajectory probability factorizes as
$$p(\tau|\pi_\theta)=p(s_0)\prod_{t=0}^{T-1} p(s_{t+1}|s_t, a_t)\,\pi_\theta(a_t|s_t)$$

and the transition probabilities $p(s_{t+1}|s_t,a_t)$ do not depend on $\theta$, we have $\nabla_\theta\log p(\tau|\pi_\theta)=\sum_{t=0}^{T-1}\nabla_\theta\log\pi_\theta(a_t|s_t)$, so the expression above simplifies further:
$$\begin{aligned} \nabla_\theta J(\theta)&=\mathbb{E}_\tau[\nabla_\theta\log p(\tau|\pi_\theta)R(\tau)]\\ &=\mathbb{E}\left[\sum_{t=0}^{T-1}\nabla_\theta\log \pi_\theta(a_t|s_t)\sum_{t'=0}^{T-1}r_{t'}\right]\\ &=\mathbb{E}\left[\sum_{t=0}^{T-1}\nabla_\theta\log \pi_\theta(a_t|s_t)\left(\sum_{t'=0}^{t-1}r_{t'}+\sum^{T-1}_{t'=t}r_{t'}\right)\right]\\ &=\mathbb{E}\left[\sum_{t=0}^{T-1}\nabla_\theta\log \pi_\theta(a_t|s_t)\sum^{T-1}_{t'=t}r_{t'}\right] \end{aligned}$$

The last step holds because the rewards received before time $t$ do not depend on $a_t$, so those cross terms vanish in expectation. From another point of view, treat the reward $r_t$ at time $t$ as a random variable; the gradient of its expectation is:
$$\begin{aligned} \nabla_\theta\mathbb{E}_\tau[r_t|\pi_\theta]&=\nabla_\theta\int p(s_0,a_0,\cdots, s_t, a_t|\pi_\theta)\,r_t\,d(s_{0:t},a_{0:t})\\ &=\int p(s_0,a_0,\cdots, s_t, a_t|\pi_\theta)\sum^{t}_{t'=0}\nabla_\theta\log\pi_\theta(a_{t'}|s_{t'})\,r_t\,d(s_{0:t},a_{0:t})\\ &=\mathbb{E}_\tau\left[\sum^{t}_{t'=0}\nabla_\theta\log\pi_\theta(a_{t'}|s_{t'})\,r_t\right] \end{aligned}$$

Substituting this into the gradient of the objective gives:
$$\begin{aligned} \nabla_\theta\mathbb{E}_\tau[R(\tau)|\pi_\theta]&=\nabla_\theta\mathbb{E}_\tau\left[\sum^{T-1}_{t=0}r_t\,\Big|\,\pi_\theta\right]=\sum^{T-1}_{t=0}\nabla_\theta\mathbb{E}_\tau[r_t|\pi_\theta]\\ &=\mathbb{E}_\tau\left[\sum^{T-1}_{t=0}\left(\sum^{t}_{t'=0}\nabla_\theta\log\pi_\theta(a_{t'}|s_{t'})\,r_t\right)\right]\\ &=\mathbb{E}_\tau\left[\sum^{T-1}_{t=0}\nabla_\theta\log\pi_\theta(a_t|s_t)\sum^{T-1}_{t'=t}r_{t'}\right] \end{aligned}$$

where the last equality follows from swapping the order of the double summation (for each $t$, summing over all $t'\le t$ is the same as, for each $t'$, summing the rewards at all $t\ge t'$) and then renaming the indices. This agrees with the result above.
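To make the final expression concrete, here is a minimal Python sketch (not part of the original derivation) of the Monte Carlo estimate $\frac{1}{N}\sum_i\sum_t\nabla_\theta\log\pi_\theta(a_t|s_t)\sum_{t'\ge t}r_{t'}$ for a tabular softmax policy. The tabular parameterization and the hand-written toy trajectories are assumptions made purely for illustration.

```python
import numpy as np

# Sketch: reward-to-go policy gradient estimate
#   grad J ≈ (1/N) Σ_i Σ_t ∇_θ log π_θ(a_t|s_t) * Σ_{t'>=t} r_{t'}
# for a tabular softmax policy π_θ(a|s) = softmax(θ[s, :])[a].
# Trajectories are assumed to be lists of (state, action, reward) tuples.

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def grad_log_pi(theta, s, a):
    """∇_θ log π_θ(a|s) for the tabular softmax policy (zero outside row s)."""
    g = np.zeros_like(theta)
    pi_s = softmax(theta[s])
    g[s] = -pi_s
    g[s, a] += 1.0
    return g

def reward_to_go(rewards):
    """Σ_{t'=t}^{T-1} r_{t'} for every t, via a reversed cumulative sum."""
    return np.cumsum(rewards[::-1])[::-1]

def policy_gradient_estimate(theta, trajectories):
    grad = np.zeros_like(theta)
    for traj in trajectories:
        states, actions, rewards = zip(*traj)
        rtg = reward_to_go(np.array(rewards, dtype=float))
        for t, (s, a) in enumerate(zip(states, actions)):
            grad += grad_log_pi(theta, s, a) * rtg[t]
    return grad / len(trajectories)

# Toy usage: 2 states, 2 actions, two hand-written trajectories.
theta = np.zeros((2, 2))
trajs = [[(0, 1, 1.0), (1, 0, 0.0), (0, 1, 1.0)],
         [(0, 0, 0.0), (1, 1, 1.0)]]
print(policy_gradient_estimate(theta, trajs))
```

In practice this same estimator is usually combined with the baseline subtraction discussed below.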
Problems
Once the gradient of the objective has been obtained, the policy parameters can be updated iteratively: $\theta\leftarrow\theta+\alpha \nabla_\theta J(\theta)$ (a small numerical sketch of this update follows the list). The main problems here are:
- The choice of the step size $\alpha$ matters a great deal: too large a step may degrade the performance of the updated policy and prevent it from ever improving to the optimal policy, so the step size must be chosen so that each iteration increases the objective value;
- The gradient of the objective derived above is usually estimated from samples, which introduces high variance and makes training unstable;
- Sample efficiency is usually also a concern.
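As an illustration of the update rule $\theta\leftarrow\theta+\alpha\nabla_\theta J(\theta)$ and of the step-size sensitivity noted above, the sketch below runs vanilla policy-gradient ascent on a two-armed bandit; the bandit, batch size, and $\alpha$ are all illustrative assumptions rather than anything prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch: θ ← θ + α ∇_θ J(θ) with the one-step Monte Carlo gradient
# (1/N) Σ ∇_θ log π_θ(a) r on an assumed two-armed bandit.
true_means = np.array([1.0, 0.0])   # expected reward of each action (assumption)
theta = np.zeros(2)                 # softmax policy parameters
alpha = 0.1                         # step size; much larger values make updates noisy and can overshoot
batch_size = 64

for it in range(200):
    pi = np.exp(theta - theta.max())
    pi /= pi.sum()
    actions = rng.choice(2, size=batch_size, p=pi)
    rewards = true_means[actions] + rng.normal(0.0, 1.0, size=batch_size)
    grad = np.zeros(2)
    for a, r in zip(actions, rewards):
        g = -pi.copy()              # ∇_θ log π_θ(a) = onehot(a) - π for a softmax policy
        g[a] += 1.0
        grad += g * r
    grad /= batch_size
    theta += alpha * grad           # θ ← θ + α ∇_θ J(θ)

pi = np.exp(theta - theta.max())
pi /= pi.sum()
print("final action probabilities:", pi)
```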
Evaluating the gradient of the objective from the expression above requires integrating over sampled trajectories $\tau$, which is intractable, so it is usually estimated with Monte Carlo (MC) sampling. This estimate is unbiased but often has very high variance. To reduce the variance while keeping the estimate unbiased, the gradient of the objective can be rewritten as:
$$\mathbb{E}_\tau\left[\sum^{T-1}_{t=0}\nabla_\theta\log\pi_\theta(a_t|s_t)\left(\sum^{T-1}_{t'=t}r_{t'}-b(s_t)\right)\right]$$

where $b(s_t)$ is a baseline function whose value depends only on $s_t$. For this expression to remain an unbiased estimate of the gradient, the baseline term must satisfy $\mathbb{E}_\tau\left[\sum^{T-1}_{t=0}\nabla_\theta\log\pi_\theta(a_t|s_t)\,b(s_t)\right]=0$, which is proven as follows:
$$\begin{aligned} &\mathbb{E}_\tau\left[\sum^{T-1}_{t=0}\nabla_\theta\log\pi_\theta(a_t|s_t)\,b(s_t)\right]\\ &=\sum^{T-1}_{t=0}\mathbb{E}_\tau[\nabla_\theta\log\pi_\theta(a_t|s_t)\,b(s_t)]\\ &=\sum^{T-1}_{t=0}\mathbb{E}_{s_{0:t},\, a_{0:t-1}}\left[\mathbb{E}_{s_{t+1:T-1},\,a_{t:T-1}}[\nabla_\theta\log\pi_\theta(a_t|s_t)\,b(s_t)]\right]\\ &=\sum^{T-1}_{t=0}\mathbb{E}_{s_{0:t},\, a_{0:t-1}}\left[b(s_t)\,\mathbb{E}_{s_{t+1:T-1},\,a_{t:T-1}}[\nabla_\theta\log\pi_\theta(a_t|s_t)]\right]\\ &=\sum^{T-1}_{t=0}\mathbb{E}_{s_{0:t},\, a_{0:t-1}}\left[b(s_t)\int_{a_t}\nabla_\theta\pi_\theta(a_t|s_t)\,da_t\right]\\ &=\sum^{T-1}_{t=0}\mathbb{E}_{s_{0:t},\, a_{0:t-1}}\left[b(s_t)\cdot 0\right]\\ &=0 \end{aligned}$$
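The key step of the proof is that $\int_{a_t}\nabla_\theta\pi_\theta(a_t|s_t)\,da_t=\nabla_\theta\int_{a_t}\pi_\theta(a_t|s_t)\,da_t=\nabla_\theta 1=0$, because $\pi_\theta(\cdot|s_t)$ is normalized. The small numerical check below (an illustrative assumption: a single state with a 4-action softmax policy and an arbitrary baseline value) verifies that the baseline term has exactly zero expectation, while individual Monte Carlo estimates of it merely fluctuate around zero; this is why a well-chosen baseline can reduce variance without introducing bias.

```python
import numpy as np

rng = np.random.default_rng(1)

# Check: E_{a~π_θ(·|s)}[∇_θ log π_θ(a|s) * b(s)] = b(s) * Σ_a ∇_θ π_θ(a|s) = 0.
# The softmax logits and the baseline value below are illustrative assumptions.
theta_s = rng.normal(size=4)          # logits of π_θ(·|s) for one state s
pi = np.exp(theta_s - theta_s.max())
pi /= pi.sum()
b = 3.7                               # arbitrary baseline value b(s)

# Exact expectation over actions: Σ_a π(a|s) * ∇_θ log π(a|s) * b(s)
expectation = np.zeros_like(theta_s)
for a in range(4):
    grad_log = -pi.copy()             # ∇_θ log π_θ(a|s) for softmax logits
    grad_log[a] += 1.0
    expectation += pi[a] * grad_log * b
print(expectation)                    # exactly ≈ [0, 0, 0, 0]

# The same term estimated from samples is zero only in expectation;
# individual Monte Carlo estimates fluctuate around zero.
samples = rng.choice(4, size=10000, p=pi)
mc = np.zeros_like(theta_s)
for a in samples:
    grad_log = -pi.copy()
    grad_log[a] += 1.0
    mc += grad_log * b
print(mc / len(samples))              # close to zero
```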