PPO
PPO builds on policy gradient (PG) by introducing importance sampling; the loss function is then derived from the resulting objective.
Key features
1. A second policy $\pi_{\theta'}$
PPO introduces another policy $\pi_{\theta'}$ to do the sampling, and the samples collected by $\pi_{\theta'}$ are used to train $\pi_{\theta}$.
2. Importance sampling
$$\mathbb{E}_{x \sim p} [f(x)] = \mathbb{E}_{x \sim q} \left[ f(x) \frac{p(x)}{q(x)} \right]$$
Note: importance sampling itself is unrelated to the on-policy vs. off-policy distinction.
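To make the identity concrete, here is a small numerical sketch (not from the original notes): the distributions p = N(0, 1), q = N(1, 2) and the function f(x) = x² are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return x ** 2

def normal_pdf(x, mu, sigma):
    # Density of N(mu, sigma^2), evaluated pointwise.
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Reference: sample directly from p = N(0, 1).
xs_p = rng.normal(0.0, 1.0, size=100_000)
direct = f(xs_p).mean()

# Importance sampling: sample from q = N(1, 2), weight each sample by p(x)/q(x).
xs_q = rng.normal(1.0, 2.0, size=100_000)
weights = normal_pdf(xs_q, 0.0, 1.0) / normal_pdf(xs_q, 1.0, 2.0)
importance = (f(xs_q) * weights).mean()

print(direct, importance)  # both estimates should be close to E_p[x^2] = 1
```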
3. Two networks
Critic network: outputs the value $v$
Actor network: outputs the action
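A minimal PyTorch sketch of what the two networks might look like, assuming a discrete action space; the class names, layer sizes, and the obs_dim / n_actions parameters are illustrative, not prescribed by the notes.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network: maps an observation to a distribution over actions."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

class Critic(nn.Module):
    """Value network: maps an observation to the scalar value v(s)."""
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs):
        return self.net(obs).squeeze(-1)
```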
4. Constraint term
KL divergence, a measure of how similar two probability distributions are.
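As a quick illustration (with made-up probabilities), the KL divergence between two discrete action distributions can be computed straight from its definition; which policy goes first in the KL is a convention choice.

```python
import torch

# Action probabilities at the same state under the old and new policies (made up).
p_old = torch.tensor([0.7, 0.2, 0.1])   # pi_{theta'}
p_new = torch.tensor([0.6, 0.3, 0.1])   # pi_theta

# KL(p_old || p_new) = sum_a p_old(a) * log(p_old(a) / p_new(a))
kl = (p_old * (p_old / p_new).log()).sum()
print(kl.item())  # small value -> the two policies are still close
```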
Update formulas
PPO
$$J^{\theta'}_{PPO}(\theta) = J^{\theta'}(\theta) - \beta \, KL(\theta, \theta')$$
$$J^{\theta'}(\theta) = \mathbb{E}_{(s_t, a_t) \sim \pi_{\theta'}} \left[ A^{\theta'}(s_t, a_t) \frac{p_\theta(a_t \mid s_t)}{p_{\theta'}(a_t \mid s_t)} \right]$$
The advantage function also uses $\theta'$ because it is $\pi_{\theta'}$ that interacts with the environment, and the expected cumulative reward is likewise estimated under that policy.
Tricks
1. Penalty
$$J^{\theta_k}_{PPO}(\theta) = J^{\theta_k}(\theta) - \beta \, KL(\theta, \theta_k)$$
$$J^{\theta_k}(\theta) \approx \sum_{(s_t, a_t)} A^{\theta_k}(s_t, a_t) \frac{p_\theta(a_t \mid s_t)}{p_{\theta_k}(a_t \mid s_t)}$$
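A rough sketch of how this penalty objective might be turned into a loss, assuming the log-probabilities of the sampled actions under both policies and the advantages are already available; the sample-mean KL estimate and the default beta are illustrative choices, not the only options.

```python
import torch

def ppo_penalty_loss(logp_new, logp_old, advantages, beta=0.01):
    """Sketch of the penalty objective as a loss to minimize.

    logp_new:   log p_theta(a_t | s_t) for the sampled actions (requires grad)
    logp_old:   log p_theta_k(a_t | s_t) from the sampling policy (no grad)
    advantages: A^theta_k(s_t, a_t), precomputed
    beta:       weight of the KL penalty (illustrative default)
    """
    ratio = (logp_new - logp_old).exp()          # p_theta / p_theta_k
    surrogate = (ratio * advantages).mean()      # importance-weighted advantage
    approx_kl = (logp_old - logp_new).mean()     # sample-based KL(theta_k || theta) estimate
    return -(surrogate - beta * approx_kl)       # negate J to get a loss
```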
2. Clipping
$$J^{\theta_k}_{PPO2}(\theta) \approx \sum_{(s_t, a_t)} \min \left( \frac{p_\theta(a_t \mid s_t)}{p_{\theta_k}(a_t \mid s_t)} A^{\theta_k}(s_t, a_t),\ \operatorname{clip}\!\left( \frac{p_\theta(a_t \mid s_t)}{p_{\theta_k}(a_t \mid s_t)},\ 1-\varepsilon,\ 1+\varepsilon \right) A^{\theta_k}(s_t, a_t) \right)$$
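A corresponding sketch of the clipped objective (PPO2) under the same assumptions as the penalty version; `eps` plays the role of ε.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Sketch of the clipped (PPO2) objective as a loss to minimize."""
    ratio = (logp_new - logp_old).exp()                              # p_theta / p_theta_k
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Element-wise min keeps the more pessimistic of the two terms.
    return -torch.min(unclipped, clipped).mean()
```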