Proximal Policy Optimization
What is off-policy?
If the agent being trained and the agent interacting with the environment are not the same agent, the method is off-policy; otherwise it is on-policy.
Drawbacks of on-policy
Policy Gradient is an on-policy algorithm. Its gradient of the expected reward is:
$$\nabla_\theta \overline{R_\theta} = E_{\tau \sim p_\theta(\tau)}\left[R(\tau)\, \nabla_\theta \log p_\theta(\tau)\right]$$
As the formula shows, the gradient of the reward depends on the network parameters θ. Once a batch of data has been collected and θ has been updated, the distribution $p_\theta(\tau)$ changes, so fresh data must be collected under the new policy before the next update. Most of the time is therefore spent collecting data, and samples cannot be reused.
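For concreteness, here is a minimal PyTorch-style sketch of one such on-policy update (the `policy`, `states`, `actions`, and `returns` names are placeholders introduced for illustration, not from the original text). The log-probabilities are taken under the same θ that generated the trajectories, which is exactly why the batch has to be thrown away after each update.

```python
import torch

def on_policy_update(policy, optimizer, states, actions, returns):
    """One REINFORCE-style gradient step; the batch must come from the current policy."""
    dist = torch.distributions.Categorical(logits=policy(states))
    log_probs = dist.log_prob(actions)        # log p_theta(a_t | s_t)
    loss = -(returns * log_probs).mean()      # its gradient matches the formula above
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # after this step, (states, actions, returns) are no longer samples of p_theta(tau)
```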
Importance Sampling
Q: If the integral cannot be computed directly, how do we evaluate $E_{x \sim p}[f(x)] = \int f(x)p(x)\,dx$?
A: By sampling: draw random samples of x from p and estimate the expectation with the formula below.
$$E_{x \sim p}[f(x)] \approx \frac{1}{N} \sum_{i=1}^{N} f(x^i)$$
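A tiny NumPy sketch of this Monte Carlo estimate (the concrete choices of f and p are illustrative assumptions):

```python
import numpy as np

f = lambda x: x ** 2                       # any f(x)
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=10_000)      # x^i ~ p, here p = N(0, 1)
print(f(x).mean())                         # (1/N) * sum_i f(x^i), close to E_p[x^2] = 1
```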
Q: What if we cannot sample from p, but we can sample from some other distribution q?
A: Use the identity below. (This is exactly the off-policy idea: the agent whose parameters are updated (p) and the agent that collects the data (q) are separated.)
$$\int f(x)p(x)\,dx = \int f(x)\frac{p(x)}{q(x)}\,q(x)\,dx = E_{x \sim q}\left[f(x)\frac{p(x)}{q(x)}\right]$$
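Extending the sketch above, the same expectation can be estimated from samples of q by weighting each sample with the importance ratio p(x)/q(x) (the concrete p and q are again illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm

f = lambda x: x ** 2
p, q = norm(0.0, 1.0), norm(0.5, 1.5)      # target p, sampling distribution q
x = q.rvs(size=10_000, random_state=1)     # x^i ~ q, not p
weights = p.pdf(x) / q.pdf(x)              # importance ratios p(x)/q(x)
print((weights * f(x)).mean())             # E_q[f(x) p(x)/q(x)], still close to E_p[x^2] = 1
```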
A caveat
In theory q(x) can be any distribution, but in practice p and q need to be reasonably close. The reason becomes clear when looking at the variance: it is easy to show that
$$Var_{x \sim p}[f(x)] \neq Var_{x \sim q}\left[f(x)\frac{p(x)}{q(x)}\right]$$
and the further apart p and q are, the larger the gap between the two variances becomes.
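A quick numerical illustration (the specific distributions are assumptions chosen for the demo): when q is close to p the weighted estimator is well behaved, but when q sits far from p a few rare samples carry enormous ratios p(x)/q(x), so the variance of the weighted samples is far larger and the finite-sample estimate itself becomes unreliable.

```python
import numpy as np
from scipy.stats import norm

f = lambda x: x ** 2
p = norm(0.0, 1.0)                             # target distribution

def weighted_samples(q, n=100_000, seed=2):
    x = q.rvs(size=n, random_state=seed)
    return (p.pdf(x) / q.pdf(x)) * f(x)        # f(x) * p(x)/q(x), with x ~ q

print(np.var(weighted_samples(norm(0.1, 1.0))))   # q close to p: variance stays modest
print(np.var(weighted_samples(norm(3.0, 1.0))))   # q far from p: variance is far larger
```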
on-policy -> off-policy
$$\nabla_\theta \overline{R_\theta} = E_{\tau \sim p_\theta(\tau)}\left[R(\tau)\, \nabla_\theta \log p_\theta(\tau)\right] = E_{\tau \sim p_{\theta'}(\tau)}\left[\frac{p_\theta(\tau)}{p_{\theta'}(\tau)}\, R(\tau)\, \nabla_\theta \log p_\theta(\tau)\right]$$
The agent that interacts with the environment uses parameters θ′ and is only responsible for collecting data.
Switching to the advantage function (since the advantage evaluates how good the actions of the interacting agent are, its superscript changes from θ to θ′):
$$\begin{aligned}\text{gradient for update} &= E_{(s_t,a_t) \sim \pi_\theta}\left[A^\theta(s_t,a_t)\, \nabla_\theta \log p_\theta(a_t^n|s_t^n)\right] \\ &= E_{(s_t,a_t) \sim \pi_{\theta'}}\left[\frac{P_\theta(s_t,a_t)}{P_{\theta'}(s_t,a_t)}\, A^{\theta'}(s_t,a_t)\, \nabla_\theta \log p_\theta(a_t^n|s_t^n)\right] \\ &= E_{(s_t,a_t) \sim \pi_{\theta'}}\left[\frac{P_\theta(a_t|s_t)}{P_{\theta'}(a_t|s_t)}\, \frac{P_\theta(s_t)}{P_{\theta'}(s_t)}\, A^{\theta'}(s_t,a_t)\, \nabla_\theta \log p_\theta(a_t^n|s_t^n)\right]\end{aligned}$$
Assume that the probability of seeing $s_t$ when the model is θ is about the same as when the model is θ′; the ratio $\frac{P_\theta(s_t)}{P_{\theta'}(s_t)}$ can then be dropped, which gives:
$$\text{gradient for update} = E_{(s_t,a_t) \sim \pi_{\theta'}}\left[\frac{P_\theta(a_t|s_t)}{P_{\theta'}(a_t|s_t)}\, A^{\theta'}(s_t,a_t)\, \nabla_\theta \log p_\theta(a_t^n|s_t^n)\right]$$
Using the identity:
$$\nabla_\theta p_\theta(\tau) = p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)$$
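Applying the same identity to $P_\theta(a_t|s_t)$, the gradient of the importance ratio is

$$\nabla_\theta \frac{P_\theta(a_t|s_t)}{P_{\theta'}(a_t|s_t)} = \frac{P_\theta(a_t|s_t)}{P_{\theta'}(a_t|s_t)}\, \nabla_\theta \log P_\theta(a_t|s_t)$$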
Working backwards from the gradient formula, the objective function being optimized is:
$$J^{\theta'}(\theta) = E_{(s_t,a_t) \sim \pi_{\theta'}}\left[\frac{P_\theta(a_t|s_t)}{P_{\theta'}(a_t|s_t)}\, A^{\theta'}(s_t,a_t)\right]$$
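A minimal PyTorch-style sketch of this surrogate objective (the names `new_log_probs`, `old_log_probs`, and `advantages` are placeholders; in practice the last two are stored when the data is collected with θ′):

```python
import torch

def surrogate_objective(new_log_probs, old_log_probs, advantages):
    """J^{theta'}(theta) estimated on a batch collected by the old policy theta'."""
    ratio = torch.exp(new_log_probs - old_log_probs)   # P_theta(a_t|s_t) / P_theta'(a_t|s_t)
    return (ratio * advantages).mean()                 # maximize this (minimize its negative)
```

Because `old_log_probs` is a constant with respect to θ, differentiating this expression reproduces the gradient derived above.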
TRPO (Trust Region Policy Optimization), the predecessor of PPO
$$J_{TRPO}^{\theta'}(\theta) = J^{\theta'}(\theta), \qquad KL(\theta, \theta') < \delta$$
PPO (Proximal Policy Optimization)
Importance sampling turns on-policy into off-policy, but if the two distributions $p_\theta(a_t|s_t)$ and $p_{\theta'}(a_t|s_t)$ differ too much, the importance-sampling result becomes poor. Keeping them from drifting too far apart is exactly what Proximal Policy Optimization (PPO) does. Note that in PPO, θ′ is $\theta_{old}$, i.e. the behavior policy is also θ, which is why PPO is regarded as an on-policy algorithm.
To keep θ and θ′ similar, a KL divergence term is added as a constraint (similar to regularization):
$$J_{PPO}^{\theta'}(\theta) = J^{\theta'}(\theta) - \beta\, KL(\theta, \theta')$$
Q: Why use the KL divergence rather than simply measuring the distance between θ and θ′?
A: Two parameter vectors being close does not guarantee that the resulting behavior is similar: a tiny change in parameters can produce very different actions. What matters is similarity at the level of actions, so the KL here measures the distance between the action distributions of the two policies, not between their parameters.
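A sketch of what this looks like in code (assuming a discrete action space and hypothetical `policy_new` / `policy_old` networks that output action logits): the KL term compares the two policies' action distributions at the visited states, not their parameter vectors.

```python
import torch
from torch.distributions import Categorical, kl_divergence

def ppo_penalty_objective(policy_new, policy_old, states, actions, advantages, beta):
    dist_new = Categorical(logits=policy_new(states))       # pi_theta(.|s_t)
    dist_old = Categorical(logits=policy_old(states))       # pi_theta'(.|s_t)
    ratio = torch.exp(dist_new.log_prob(actions) - dist_old.log_prob(actions))
    surrogate = (ratio * advantages).mean()                  # J^{theta'}(theta)
    kl = kl_divergence(dist_old, dist_new).mean()            # KL between action distributions
    return surrogate - beta * kl                             # J_PPO = J - beta * KL
```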
PPO Algorithm
PPO-Penalty
$$J_{PPO}^{\theta^k}(\theta) = J^{\theta^k}(\theta) - \beta\, KL(\theta, \theta^k)$$
PPO-Clip
$$J_{PPO}^{\theta^k}(\theta) = \sum_{(s_t,a_t)} \min\!\left(\frac{P_\theta(a_t|s_t)}{P_{\theta^k}(a_t|s_t)}\, A^{\theta^k}(s_t,a_t),\ \operatorname{clip}\!\left(\frac{P_\theta(a_t|s_t)}{P_{\theta^k}(a_t|s_t)},\, 1-\epsilon,\, 1+\epsilon\right) A^{\theta^k}(s_t,a_t)\right)$$
Intuition:
- The factor multiplying the advantage $A$ measures how similar θ and $\theta^k$ are.
- If $A > 0$, the action is good, so we want the numerator $P_\theta(a_t|s_t)$ to grow and the action to become more likely, but the ratio can never exceed $1+\epsilon$.
- If $A < 0$, the action is bad, so we want the numerator to shrink, but the ratio will never fall below $1-\epsilon$.
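A minimal PyTorch-style sketch of the clipped objective (tensor names are placeholders; a full PPO implementation would typically add value-function and entropy terms, which this note does not cover):

```python
import torch

def ppo_clip_objective(new_log_probs, old_log_probs, advantages, eps=0.2):
    """min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A), averaged over the batch."""
    ratio = torch.exp(new_log_probs - old_log_probs)           # P_theta / P_theta^k
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return torch.min(unclipped, clipped).mean()                # maximize; use -objective as the loss
```

The element-wise `min` is what removes the incentive to push the ratio outside $[1-\epsilon,\, 1+\epsilon]$, matching the two bullet points above.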