Policy-based PPO

PPO

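PPO builds on policy gradient (PG) by introducing importance sampling; the loss function is then derived by working backward from the resulting gradient.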

Features

1. Policy $\pi_{\theta'}$

PPO introduces a second policy $\pi_{\theta'}$ to do the sampling, and uses the samples collected by $\pi_{\theta'}$ to train $\pi_{\theta}$.

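2. Importance sampling

$$\mathbb{E}_{x \sim p}[f(x)] = \mathbb{E}_{x \sim q}\left[f(x)\frac{p(x)}{q(x)}\right]$$
Note: importance sampling itself has nothing to do with the on-policy/off-policy distinction. It is a general technique for estimating an expectation under one distribution from samples drawn from another.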
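The identity above can be checked numerically. The following is a minimal sketch, assuming a target distribution p = N(1, 1), a proposal q = N(0, 2) and f(x) = x; all three are illustrative choices, and the helper names are hypothetical.

```python
import math
import random

def normal_pdf(x, mean, std):
    """Density of the normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def importance_sampling_estimate(f, p_pdf, q_pdf, q_sampler, n_samples, rng):
    """Estimate E_{x~p}[f(x)] using samples drawn from q only."""
    total = 0.0
    for _ in range(n_samples):
        x = q_sampler(rng)
        weight = p_pdf(x) / q_pdf(x)  # importance weight p(x) / q(x)
        total += f(x) * weight
    return total / n_samples

rng = random.Random(0)
# Assumed target p = N(1, 1) and proposal q = N(0, 2); the true E_p[x] is 1.
estimate = importance_sampling_estimate(
    f=lambda x: x,
    p_pdf=lambda x: normal_pdf(x, 1.0, 1.0),
    q_pdf=lambda x: normal_pdf(x, 0.0, 2.0),
    q_sampler=lambda r: r.gauss(0.0, 2.0),
    n_samples=100_000,
    rng=rng,
)
print(estimate)  # should land close to 1.0, the mean of p
```

The proposal is deliberately wider than the target here. When q puts very little mass where p is large, the weights p(x)/q(x) become highly variable and the estimate degrades, which is why PPO keeps the two policies close.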

3. Two networks

Critic network: outputs the value $v$
Actor network: outputs the action
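To make the division of labor concrete, here is a minimal sketch of the two roles. It assumes linear models and a discrete action space purely for illustration; the class names `LinearActor` and `LinearCritic` are hypothetical, and real implementations would normally use neural networks.

```python
import math
import random

def softmax(logits):
    """Turn a list of scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

class LinearActor:
    """Hypothetical actor: maps a state to a distribution over actions."""
    def __init__(self, n_features, n_actions, rng):
        self.weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_features)]
                        for _ in range(n_actions)]

    def action_probs(self, state):
        logits = [sum(w * s for w, s in zip(row, state)) for row in self.weights]
        return softmax(logits)

    def sample_action(self, state, rng):
        probs = self.action_probs(state)
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]

class LinearCritic:
    """Hypothetical critic: maps a state to a scalar value estimate v."""
    def __init__(self, n_features, rng):
        self.weights = [rng.uniform(-0.1, 0.1) for _ in range(n_features)]

    def value(self, state):
        return sum(w * s for w, s in zip(self.weights, state))

rng = random.Random(0)
state = [0.5, -1.0, 0.25, 2.0]            # an arbitrary 4-feature state
actor = LinearActor(n_features=4, n_actions=3, rng=rng)
critic = LinearCritic(n_features=4, rng=rng)
probs = actor.action_probs(state)          # distribution over the 3 actions
action = actor.sample_action(state, rng)   # the action actually taken
value = critic.value(state)                # the critic's estimate v
```

The critic's value estimate is what the advantage function is typically built from, while the actor's action probabilities are what the importance ratio compares between the two policies.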

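4. Constraint term

KL divergence, a measure of how far apart two probability distributions are: it is zero when they are identical and grows as they diverge.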
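As a small illustration of how this measure behaves, the sketch below computes the KL divergence between discrete distributions given as probability lists. The function name is hypothetical, and it assumes `q` is nonzero wherever `p` is nonzero.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as probability lists.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

uniform = [0.5, 0.5]
same = kl_divergence(uniform, [0.5, 0.5])      # identical distributions
near = kl_divergence(uniform, [0.6, 0.4])      # a small shift
far = kl_divergence(uniform, [0.9, 0.1])       # a large shift
reverse = kl_divergence([0.9, 0.1], uniform)   # arguments swapped

print(same, near, far, reverse)
```

Identical distributions give zero and the value grows with the size of the shift. Swapping the arguments changes the result, so KL divergence is not a symmetric distance, and the argument order has to be fixed when it is used as a penalty.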

Update formula

PPO

$$\begin{aligned} J^{\theta'}_{PPO}(\theta) &= J^{\theta'}(\theta) - \beta\,\mathrm{KL}(\theta,\theta') \\ J^{\theta'}(\theta) &= \mathbb{E}_{(s_t,a_t)\sim\pi_{\theta'}}\left[A^{\theta'}(s_t,a_t)\frac{p_\theta(a_t|s_t)}{p_{\theta'}(a_t|s_t)}\right] \end{aligned}$$
The advantage function is also indexed by $\theta'$ because it is $\pi_{\theta'}$ that interacts with the environment, so the expected sum of rewards is computed under that policy as well.
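One common way to arrive at $J^{\theta'}(\theta)$ is to apply importance sampling to the policy gradient and then read off the function whose gradient it is. The sketch below relies on two assumptions that the text above does not state: the state distributions under $\theta$ and $\theta'$ are close enough that their ratio can be dropped, and $A^{\theta}$ is approximated by $A^{\theta'}$. The symbol $\bar{R}_\theta$ (expected return under $\pi_\theta$) is introduced here only for the derivation.

```latex
\begin{aligned}
\nabla \bar{R}_\theta
&= \mathbb{E}_{(s_t,a_t)\sim\pi_\theta}
   \left[A^{\theta}(s_t,a_t)\,\nabla\log p_\theta(a_t|s_t)\right] \\
&= \mathbb{E}_{(s_t,a_t)\sim\pi_{\theta'}}
   \left[\frac{p_\theta(s_t,a_t)}{p_{\theta'}(s_t,a_t)}\,
         A^{\theta}(s_t,a_t)\,\nabla\log p_\theta(a_t|s_t)\right] \\
&\approx \mathbb{E}_{(s_t,a_t)\sim\pi_{\theta'}}
   \left[\frac{p_\theta(a_t|s_t)}{p_{\theta'}(a_t|s_t)}\,
         A^{\theta'}(s_t,a_t)\,\nabla\log p_\theta(a_t|s_t)\right] \\
&= \mathbb{E}_{(s_t,a_t)\sim\pi_{\theta'}}
   \left[A^{\theta'}(s_t,a_t)\,
         \frac{\nabla p_\theta(a_t|s_t)}{p_{\theta'}(a_t|s_t)}\right]
 = \nabla J^{\theta'}(\theta)
\end{aligned}
```

The last step uses $\nabla p_\theta = p_\theta\,\nabla\log p_\theta$. Since neither the sampling distribution $\pi_{\theta'}$ nor $A^{\theta'}$ depends on $\theta$, the gradient can be pulled outside the expectation, which gives the objective with the probability ratio inside.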

Tricks

1. Penalty

Here $\theta_k$, the parameters from the previous iteration, plays the role of $\theta'$, and the expectation is approximated by a sum over the sampled state-action pairs.

$$\begin{aligned} J^{\theta_k}_{PPO}(\theta) &= J^{\theta_k}(\theta) - \beta\,\mathrm{KL}(\theta,\theta_k) \\ J^{\theta_k}(\theta) &\approx \sum_{(s_t,a_t)} A^{\theta_k}(s_t,a_t)\frac{p_\theta(a_t|s_t)}{p_{\theta_k}(a_t|s_t)} \end{aligned}$$
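The penalized objective can be evaluated on a batch of samples as in the minimal sketch below. Several details are assumptions rather than something the text specifies: the KL term is estimated as the mean KL between the two action distributions at the sampled states, with the old policy as the first argument; both terms are averaged over the batch rather than summed, which only rescales the objective; and the function names are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; assumes q > 0 where p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def ppo_penalty_objective(new_probs, old_probs, actions, advantages, beta):
    """Batch estimate of the surrogate objective minus beta times the KL term.

    new_probs[i] / old_probs[i]: action distributions at the i-th sampled
    state under the current and the previous policy parameters.
    actions[i]: index of the action that was actually taken.
    advantages[i]: advantage estimate for that state-action pair.
    """
    surrogate = 0.0
    kl_total = 0.0
    for p_new, p_old, a, adv in zip(new_probs, old_probs, actions, advantages):
        ratio = p_new[a] / p_old[a]                # importance ratio
        surrogate += ratio * adv
        kl_total += kl_divergence(p_old, p_new)    # assumed argument order
    n = len(actions)
    return surrogate / n - beta * kl_total / n

old_probs = [[0.5, 0.5], [0.2, 0.8]]
actions = [0, 1]
advantages = [1.0, -0.5]

# With an unchanged policy every ratio is 1 and the KL term vanishes.
unchanged = ppo_penalty_objective(old_probs, old_probs, actions, advantages, beta=1.0)

# A shifted policy: the same samples are scored with and without the penalty.
shifted_probs = [[0.7, 0.3], [0.4, 0.6]]
no_penalty = ppo_penalty_objective(shifted_probs, old_probs, actions, advantages, beta=0.0)
with_penalty = ppo_penalty_objective(shifted_probs, old_probs, actions, advantages, beta=1.0)
print(unchanged, no_penalty, with_penalty)
```

A larger `beta` lowers the objective for any policy that has moved away from the previous one, so maximizing it trades improvement in the surrogate against staying close to the policy that collected the data.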

2. Clipping

$$J^{\theta_k}_{PPO2}(\theta) \approx \sum_{(s_t,a_t)} \min\left(\frac{p_\theta(a_t|s_t)}{p_{\theta_k}(a_t|s_t)} A^{\theta_k}(s_t,a_t),\ \mathrm{clip}\left(\frac{p_\theta(a_t|s_t)}{p_{\theta_k}(a_t|s_t)}, 1-\varepsilon, 1+\varepsilon\right) A^{\theta_k}(s_t,a_t)\right)$$
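The effect of the min and clip is easiest to see on individual samples. The following is a minimal sketch of the clipped objective as written above, summed over the batch; the function names are hypothetical and the value 0.2 for epsilon is only an illustrative choice.

```python
def clip(x, low, high):
    """Restrict x to the interval [low, high]."""
    return max(low, min(high, x))

def ppo_clip_objective(new_action_probs, old_action_probs, advantages, epsilon=0.2):
    """Sum over samples of min(ratio * A, clip(ratio, 1-eps, 1+eps) * A).

    new_action_probs[i] / old_action_probs[i]: probability of the action that
    was taken, under the current and the previous policy parameters.
    """
    total = 0.0
    for p_new, p_old, adv in zip(new_action_probs, old_action_probs, advantages):
        ratio = p_new / p_old
        unclipped = ratio * adv
        clipped = clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * adv
        total += min(unclipped, clipped)
    return total

# Positive advantage, ratio 1.5: the gain is capped at (1 + eps) * A.
capped_gain = ppo_clip_objective([0.6], [0.4], [2.0])
# Negative advantage, ratio 0.5: the term is held at (1 - eps) * A.
capped_loss = ppo_clip_objective([0.2], [0.4], [-1.0])
# Negative advantage, ratio 1.5: min keeps the unclipped, more negative term.
uncapped = ppo_clip_objective([0.6], [0.4], [-1.0])
print(capped_gain, capped_loss, uncapped)
```

In the first two cases the sample contributes a value that no longer depends on the ratio, so there is no incentive to push the policy further in that direction. In the third case the policy has moved toward a bad action, and the unclipped term is kept so that the move is still fully penalized.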
