Reference: https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html
Intro to Policy Optimization
This section walks through the mathematical derivation of the policy gradient.
A simple derivation of the policy gradient
Parameterized policy: $\pi_\theta$. Objective: maximize the expected return $J(\pi_\theta) = E_{\tau\sim\pi_\theta}[R(\tau)]$.
Gradient ascent:

$$\theta_{k+1} = \theta_k + \alpha \nabla_\theta J(\pi_\theta)\big|_{\theta_k}$$
- $\nabla_\theta J(\pi_\theta)$ is called the policy gradient, and algorithms that optimize the policy this way are called policy gradient algorithms (including Vanilla Policy Gradient and TRPO). A minimal sketch of the update rule follows below.
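As a concrete illustration, here is a minimal PyTorch sketch of this update rule. Everything here is illustrative (the 4-dim state, 2 actions, and the placeholder objective standing in for a sample-based estimate of $J(\pi_\theta)$ are assumptions, not part of the source):

```python
import torch

# Illustrative policy network: maps 4-dim states to logits over 2 actions.
policy = torch.nn.Linear(4, 2)
alpha = 1e-2  # step size

# Placeholder objective: in practice this would be a sample-based
# estimate of J(pi_theta); here it is just some scalar that depends
# on the parameters so that backprop has a graph to traverse.
states = torch.randn(8, 4)
J = policy(states).logsumexp(dim=-1).mean()
J.backward()  # fills p.grad with grad_theta of the objective

with torch.no_grad():
    for p in policy.parameters():
        p += alpha * p.grad  # gradient *ascent* step on J
        p.grad.zero_()
```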
The derivation proceeds in the following steps:
1. The probability of a trajectory $\tau = (s_0, a_0, \ldots, s_{T+1})$:

$$P(\tau|\theta) = \rho_0(s_0) \prod_{t=0}^{T} P(s_{t+1}|s_t, a_t)\, \pi_\theta(a_t|s_t)$$
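For intuition, a small numerical sketch of this factorization in a toy tabular MDP where the transition model is known (all numbers here are made up for illustration):

```python
import numpy as np

# Toy tabular MDP with known dynamics (all quantities are made up).
n_states, n_actions = 3, 2
rng = np.random.default_rng(0)
rho0 = np.array([1.0, 0.0, 0.0])                             # rho_0(s)
P = rng.dirichlet(np.ones(n_states), (n_states, n_actions))  # P(s'|s,a)
pi = rng.dirichlet(np.ones(n_actions), n_states)             # pi_theta(a|s)

def traj_prob(states, actions):
    """P(tau|theta) = rho0(s0) * prod_t pi(a_t|s_t) * P(s_{t+1}|s_t,a_t)."""
    prob = rho0[states[0]]
    for t, a in enumerate(actions):
        prob *= pi[states[t], a] * P[states[t], a, states[t + 1]]
    return prob

# Example trajectory (s0=0, a0=0, s1=1, a1=1, s2=2):
print(traj_prob(states=[0, 1, 2], actions=[0, 1]))
```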
2. The log-derivative trick:

$$\nabla_\theta P(\tau|\theta) = P(\tau|\theta)\, \nabla_\theta \log P(\tau|\theta)$$
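This identity (just the chain rule applied to $\log$) is easy to check numerically with autograd; a sketch using a softmax distribution as an illustrative stand-in for $P(\tau|\theta)$:

```python
import torch

# theta parameterizes a softmax distribution; p stands in for P(tau|theta).
theta = torch.randn(3, requires_grad=True)
p = torch.softmax(theta, dim=0)[0]

grad_p, = torch.autograd.grad(p, theta, retain_graph=True)
grad_logp, = torch.autograd.grad(torch.log(p), theta)

# grad_theta P == P * grad_theta log P
print(torch.allclose(grad_p, p.detach() * grad_logp))  # True
```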
3. The log-probability of a trajectory:

$$\begin{aligned} \log P(\tau|\theta) &= \log\Big(\rho_0(s_0)\prod_{t=0}^{T} P(s_{t+1}|s_t,a_t)\,\pi_\theta(a_t|s_t)\Big)\\ &= \log\rho_0(s_0) + \sum_{t=0}^{T}\big(\log P(s_{t+1}|s_t,a_t) + \log\pi_\theta(a_t|s_t)\big) \end{aligned}$$
4. The gradient of the log-probability: $\rho_0(s_0)$ and $P(s_{t+1}|s_t,a_t)$ do not depend on $\theta$, so their gradients with respect to $\theta$ vanish:

$$\begin{aligned} \nabla_\theta \log P(\tau|\theta) &= \nabla_\theta \log\rho_0(s_0) + \nabla_\theta \sum_{t=0}^{T}\big(\log P(s_{t+1}|s_t,a_t) + \log\pi_\theta(a_t|s_t)\big)\\ &= \sum_{t=0}^{T} \nabla_\theta \log\pi_\theta(a_t|s_t) \end{aligned}$$
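This is the step that makes model-free policy gradients possible: the environment terms drop out, so the gradient can be computed from the policy alone. A PyTorch sketch with an illustrative trajectory (shapes and data are assumptions):

```python
import torch

policy = torch.nn.Linear(4, 2)       # state -> action logits
states = torch.randn(5, 4)           # s_0 .. s_4 of one trajectory
actions = torch.randint(0, 2, (5,))  # a_0 .. a_4

logp = torch.log_softmax(policy(states), dim=-1)  # log pi_theta(.|s_t)
logp_acts = logp.gather(1, actions.unsqueeze(1)).squeeze(1)

# log rho0(s0) and log P(s_{t+1}|s_t,a_t) are constants w.r.t. theta,
# so adding them to the sum below would leave the gradient unchanged.
grad, = torch.autograd.grad(logp_acts.sum(), policy.weight)
print(grad.shape)  # torch.Size([2, 4])
```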
5. Putting it all together:

$$\begin{aligned} \nabla_\theta J(\pi_\theta) &= \nabla_\theta E_{\tau\sim\pi_\theta}[R(\tau)]\\ &= \nabla_\theta \int_\tau P(\tau|\theta)\, R(\tau)\\ &= \int_\tau \nabla_\theta P(\tau|\theta)\, R(\tau)\\ &= \int_\tau P(\tau|\theta)\, \nabla_\theta \log P(\tau|\theta)\, R(\tau)\\ &= E_{\tau\sim\pi_\theta}\big[\nabla_\theta \log P(\tau|\theta)\, R(\tau)\big] \end{aligned}$$

Substituting the result of step 4 into the last line gives the policy gradient:
$$\Rightarrow \nabla_\theta J(\pi_\theta) = E_{\tau\sim\pi_\theta}\Big[\sum_{t=0}^{T} \nabla_\theta \log\pi_\theta(a_t|s_t)\, R(\tau)\Big]$$
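In practice this expectation is estimated with a sample mean over collected trajectories, which is exactly the "simplest policy gradient" loss described in Spinning Up. A minimal PyTorch sketch with fake data (the batch shapes and environment interface are illustrative assumptions):

```python
import torch

policy = torch.nn.Linear(4, 2)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

# Fake batch: (s_t, a_t) pairs gathered from several trajectories,
# where ret[i] = R(tau) of the trajectory that step i belongs to.
obs = torch.randn(32, 4)
acts = torch.randint(0, 2, (32,))
ret = torch.randn(32)

logp = torch.log_softmax(policy(obs), dim=-1)
logp_acts = logp.gather(1, acts.unsqueeze(1)).squeeze(1)  # log pi_theta(a_t|s_t)

# Minimizing this "pseudo-loss" makes the optimizer ascend the
# sample estimate of grad_theta J(pi_theta) derived above.
loss = -(logp_acts * ret).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```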