Policy Optimization

Reference: https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html

Intro to Policy Optimization

This part focuses on deriving the mathematical formula for the policy gradient.

A Simple Derivation of the Policy Gradient

Policy: $\pi_{\theta}$ with parameters $\theta$. Objective: maximize the expected return $J(\pi_{\theta}) = E_{\tau\sim\pi_{\theta}}[R(\tau)]$.

Gradient ascent:
$$\theta_{k+1} = \theta_{k} + \alpha \nabla_{\theta} J(\pi_{\theta})\big|_{\theta_{k}}$$

  • $\nabla_{\theta}J(\pi_{\theta})$ is called the policy gradient, and algorithms that optimize the policy this way are called policy gradient algorithms (including Vanilla Policy Gradient and TRPO); a minimal update sketch is given below.
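As a concrete illustration of the update rule, here is a minimal NumPy sketch. It assumes a hypothetical `estimate_policy_gradient` callable (not part of the original text) that returns a sample-based estimate of $\nabla_{\theta}J(\pi_{\theta})$:

```python
import numpy as np

def gradient_ascent_step(theta, estimate_policy_gradient, alpha=0.01):
    """One gradient-ascent update: theta_{k+1} = theta_k + alpha * grad J(pi_theta).

    `estimate_policy_gradient` is a hypothetical callable that returns a
    sample-based estimate of the policy gradient at `theta` (e.g., computed
    from collected trajectories).
    """
    grad = estimate_policy_gradient(theta)  # assumed to have the same shape as theta
    return theta + alpha * grad             # ascent, because we maximize J(pi_theta)
```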

Concrete steps of the derivation:

1. Probability of a trajectory $\tau = (s_0, a_0, \ldots, s_{T+1})$:
$$P(\tau|\theta) = \rho_{0}(s_{0}) \prod_{t=0}^{T} P(s_{t+1}|s_{t},a_{t})\,\pi_{\theta}(a_t|s_t)$$
2. The log-derivative trick, which follows from the chain rule ($\nabla_{\theta}\log P = \nabla_{\theta}P / P$):
$$\nabla_{\theta} P(\tau|\theta) = P(\tau|\theta)\,\nabla_{\theta}\log P(\tau|\theta)$$
3. Log-probability of a trajectory:
$$\begin{aligned} \log P(\tau|\theta) &= \log\left(\rho_{0}(s_0)\prod_{t=0}^{T}P(s_{t+1}|s_{t},a_{t})\,\pi_{\theta}(a_t|s_t)\right)\\ &= \log\rho_{0}(s_0) + \sum_{t=0}^{T}\Big(\log P(s_{t+1}|s_{t},a_{t}) + \log\pi_{\theta}(a_t|s_t)\Big) \end{aligned}$$
4. Gradient of the log-probability

  • $\rho_{0}(s_0)$ and $P(s_{t+1}|s_{t},a_{t})$ do not depend on $\pi_{\theta}$, so their gradients with respect to $\theta$ are zero:
    $$\begin{aligned} \nabla_{\theta}\log P(\tau|\theta) &= \nabla_{\theta}\log\left(\rho_{0}(s_0)\prod_{t=0}^{T}P(s_{t+1}|s_{t},a_{t})\,\pi_{\theta}(a_t|s_t)\right)\\ &= \nabla_{\theta}\log\rho_{0}(s_0) + \nabla_{\theta}\sum_{t=0}^{T}\Big(\log P(s_{t+1}|s_{t},a_{t}) + \log\pi_{\theta}(a_t|s_t)\Big)\\ &= \sum_{t=0}^{T}\nabla_{\theta}\log\pi_{\theta}(a_t|s_t) \end{aligned}$$

5. Putting it all together:
$$\begin{aligned} \nabla_{\theta}J(\pi_{\theta}) &= \nabla_{\theta}E_{\tau\sim\pi_{\theta}}[R(\tau)]\\ &= \nabla_{\theta}\int_{\tau}P(\tau|\theta)R(\tau)\\ &= \int_{\tau}\nabla_{\theta}P(\tau|\theta)R(\tau)\\ &= \int_{\tau}P(\tau|\theta)\,\nabla_{\theta}\log P(\tau|\theta)\,R(\tau)\\ &= E_{\tau\sim\pi_{\theta}}\big[\nabla_{\theta}\log P(\tau|\theta)\,R(\tau)\big] \end{aligned}$$
Substituting the grad-log-prob result from step 4:
$$\Rightarrow \nabla_{\theta}J(\pi_{\theta}) = E_{\tau\sim\pi_{\theta}}\left[\sum_{t=0}^{T}\nabla_{\theta}\log\pi_{\theta}(a_t|s_t)\,R(\tau)\right]$$
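This expectation can be estimated from sampled trajectories, which is what the "simplest policy gradient" in the Spinning Up reference does. Below is a minimal PyTorch sketch under that assumption; `logits_net` is a hypothetical `torch.nn.Module` mapping states to action logits, and the batch tensors are assumed to come from rollouts under the current policy:

```python
import torch
from torch.distributions import Categorical

def policy_gradient_loss(logits_net, obs, acts, returns):
    """Surrogate loss whose gradient matches the sample estimate of
    E[ sum_t grad log pi_theta(a_t|s_t) * R(tau) ].

    obs:     (N, obs_dim) states s_t collected under pi_theta
    acts:    (N,)         actions a_t taken in those states
    returns: (N,)         R(tau) of the trajectory each step belongs to
    """
    logp = Categorical(logits=logits_net(obs)).log_prob(acts)  # log pi_theta(a_t|s_t)
    # Negative sign: minimizing this loss performs gradient ascent on J(pi_theta).
    return -(logp * returns).mean()
```

Calling `.backward()` on this loss and taking an optimizer step reproduces the update $\theta_{k+1} = \theta_{k} + \alpha\,\hat{g}$ with $\hat{g}$ the sample estimate of the policy gradient.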
