Preface
Proximal Policy Optimization (PPO) is an improved variant of Policy Gradient. It relies on importance sampling to turn the on-policy training procedure of Policy Gradient into an off-policy one.
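I. Gradient descent
1 Gradient
Put simply, the gradient of a multivariate function at a point is the vector associated with its largest directional derivative. The gradient is a vector: the directional derivative of the function at that point attains its maximum along the gradient's direction, so the function changes fastest along that direction, and the maximal rate of change equals the norm of the gradient.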
The gradient of $f\left( x,y \right)$ is $\left( f_x,f_y \right)$. Evaluating it at a specific point $(X,Y)$ gives $\left( f_X,f_Y \right)$, which is a vector. For a function of a single variable, the gradient at a given coordinate is a single number.
For example, consider
$$z = f\left( x,y \right) = x^2-y^2$$
(Figure: graph of the function.)
Its gradient is $\left( 2x,-2y \right)$.
(Figure: the gradient plotted in the coordinate plane.)
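As a quick check of the example above, the following sketch compares the hand-derived gradient $(2x, -2y)$ with a central finite-difference approximation. The function names, the evaluation point (1.5, -0.5), and the step size `h` are illustrative choices, not part of the original text.

```python
def f(x, y):
    """The example surface z = x^2 - y^2."""
    return x * x - y * y


def analytic_grad(x, y):
    """Gradient (f_x, f_y) = (2x, -2y), derived by hand."""
    return (2.0 * x, -2.0 * y)


def numeric_grad(func, x, y, h=1e-6):
    """Central finite-difference approximation of the gradient (h is an assumed step size)."""
    df_dx = (func(x + h, y) - func(x - h, y)) / (2.0 * h)
    df_dy = (func(x, y + h) - func(x, y - h)) / (2.0 * h)
    return (df_dx, df_dy)


point = (1.5, -0.5)                 # arbitrary evaluation point
exact = analytic_grad(*point)       # (3.0, 1.0)
approx = numeric_grad(f, *point)    # should agree with `exact` up to rounding
```

At this point the gradient is a concrete vector, which matches the remark that evaluating $(f_x, f_y)$ at a given point yields a vector of numbers.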
2 Gradient descent
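II. Proximal Policy Optimization (PPO)
1. Policy gradient
Policy Gradient is a policy-based method: it samples states, actions, and rewards directly, and then maximizes the expected reward.
When enough episodes are sampled, the expected reward can be approximated by the average reward over N episodes: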
$$\bar{R}_\theta =\sum_{\tau}{R\left( \tau \right) P\left( \tau |\theta \right)} \approx \frac{1}{N}\sum_{n=1}^N{R\left( \tau ^n \right)}$$
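The approximation above can be illustrated on a toy problem whose exact expectation is known. In this sketch, which is only an assumed example (the environment, `p_good`, and `horizon` are hypothetical), the policy picks a rewarding action with probability 0.7 at each of 5 steps, so the expected return is 0.7 * 5 = 3.5, and the N-episode average should land close to that value.

```python
import random


def sample_return(rng, p_good=0.7, horizon=5):
    """Return R(tau) of one episode: reward 1 each time the 'good' action is picked."""
    return sum(1.0 for _ in range(horizon) if rng.random() < p_good)


def average_return(n_episodes, rng):
    """Monte Carlo estimate (1/N) * sum_n R(tau^n) of the expected return."""
    return sum(sample_return(rng) for _ in range(n_episodes)) / n_episodes


rng = random.Random(0)                      # fixed seed for reproducibility
mc_estimate = average_return(20_000, rng)   # exact expectation is 3.5
```

With fewer episodes the estimate fluctuates more, which is why the text asks for sufficient sampling.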
Policy gradient: design a network whose input is the state and whose output is the probability of each action, and train it iteratively with the policy gradient. Define one episode (trajectory) as:
$$\tau =s_1,a_1,r_1,\cdots ,s_T,a_T,r_T$$
where the return of an episode is
$$R\left( \tau \right) =\sum_{t=1}^T{r_t}$$
Taking the gradient of the average return $\bar{R}_{\theta}$ with respect to $\theta$:
$$\begin{aligned}
\nabla \bar{R}_{\theta} &=\sum_{\tau}{R\left( \tau \right) \nabla P\left( \tau |\theta \right)} \\
&=\sum_{\tau}{R\left( \tau \right) P\left( \tau |\theta \right) \cdot \frac{\nabla P\left( \tau |\theta \right)}{P\left( \tau |\theta \right)}} \\
&=\sum_{\tau}{R\left( \tau \right) P\left( \tau |\theta \right) \cdot \nabla \log P\left( \tau |\theta \right)} \\
&\approx \frac{1}{N}\sum_{n=1}^N{R\left( \tau ^n \right) \cdot \nabla \log P\left( \tau ^n|\theta \right)}
\end{aligned}$$
where $P\left( \tau ^n|\theta \right)$ factorizes as
$$\begin{aligned}
P\left( \tau ^n|\theta \right) &=P\left( s_1 \right) P\left( a_1|s_1,\theta \right) P\left( r_1,s_2|s_1,a_1 \right) P\left( a_2|s_2,\theta \right) \cdots P\left( a_T|s_T,\theta \right) P\left( r_T,s_{T+1}|s_T,a_T \right) \\
&=P\left( s_1 \right) \prod_{t=1}^T{P\left( a_t|s_t,\theta \right) P\left( r_t,s_{t+1}|s_t,a_t \right)}
\end{aligned}$$
Substituting this into the previous expression,
$$\begin{aligned}
\nabla \log p\left( \tau ^n|\theta \right) &=\nabla \log \left( p\left( s_1 \right) \prod_{t=1}^T{p\left( a_t|s_t,\theta \right) p\left( r_t,s_{t+1}|s_t,a_t \right)} \right) \\
&=\nabla \log p\left( s_1 \right) +\sum_{t=1}^T{\nabla \log p\left( a_t|s_t,\theta \right)}+\sum_{t=1}^T{\nabla \log p\left( r_t,s_{t+1}|s_t,a_t \right)} \\
&=\sum_{t=1}^T{\nabla \log p\left( a_t|s_t,\theta \right)}
\end{aligned}$$
The first and third terms vanish because the initial-state distribution and the environment dynamics do not depend on $\theta$.
Therefore
$$\nabla \bar{R}_{\theta}\approx \frac{1}{N}\sum_{n=1}^N{\sum_{t=1}^{T_n}{R\left( \tau ^n \right) \nabla \log p\left( a_t|s_t,\theta \right)}}$$
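The estimator above can be sketched for a small tabular softmax policy. This is a minimal illustration under stated assumptions, not the network described in the text: `theta[s]` holds one logit per action for state `s`, each episode is a list of hypothetical `(state, action, reward)` tuples, and the gradient of $\log p(a|s,\theta)$ with respect to the logits is the one-hot vector of the action minus the softmax probabilities.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


def grad_log_prob(logits, action):
    """Gradient of log softmax(logits)[action] w.r.t. the logits: onehot(action) - probs."""
    probs = softmax(logits)
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]


def policy_gradient(theta, episodes):
    """(1/N) * sum_n sum_t R(tau^n) * grad log p(a_t | s_t, theta)."""
    grad = [[0.0] * len(row) for row in theta]
    for episode in episodes:
        total_return = sum(r for _, _, r in episode)   # R(tau^n)
        for s, a, _ in episode:
            for k, g in enumerate(grad_log_prob(theta[s], a)):
                grad[s][k] += total_return * g
    n = len(episodes)
    return [[g / n for g in row] for row in grad]


theta = [[0.0, 0.0], [0.0, 0.0]]              # two states, two actions, uniform policy
episodes = [
    [(0, 1, 1.0), (1, 0, 0.0)],               # return 1.0
    [(0, 0, 0.0), (1, 1, 0.0)],               # return 0.0
]
grad = policy_gradient(theta, episodes)       # [[-0.25, 0.25], [0.25, -0.25]]
```

A gradient-ascent step along `grad` raises the probability of the actions taken in the rewarded episode, while the zero-return episode contributes nothing.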
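In practice this amounts to minimizing the cross-entropy between the actions sampled over N episodes and the action probabilities output by the network, weighted by R:
$$-\sum_{n=1}^N{R\left( \tau ^n \right) \cdot a_i\log p_i}$$
where $a_i$ is the one-hot encoding of the sampled action and $p_i$ is the probability the network assigns to action $i$.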
The cross-entropy is defined as
$$H_p\left( q \right) =\sum_x{q\left( x \right) \log _2\left( \frac{1}{p\left( x \right)} \right)}=-\sum_x{q\left( x \right) \log _2p\left( x \right)}$$
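A short sketch of the return-weighted cross-entropy for a single sampled action follows. It uses the base-2 logarithm from the formula above (training code usually uses the natural logarithm, which only rescales the loss), and the function names and numbers are illustrative assumptions.

```python
import math


def cross_entropy(q, p):
    """H_p(q) = -sum_x q(x) * log2 p(x), in bits; terms with q(x) = 0 are skipped."""
    return -sum(qx * math.log2(px) for qx, px in zip(q, p) if qx > 0.0)


def return_weighted_loss(probs, action, episode_return):
    """Cross-entropy between the one-hot sampled action and the policy output, scaled by R."""
    one_hot = [1.0 if k == action else 0.0 for k in range(len(probs))]
    return episode_return * cross_entropy(one_hot, probs)


# The policy gave action 0 probability 0.25, and the episode return was 3.0:
# -log2(0.25) = 2 bits, so the loss is 3.0 * 2 = 6.0.
loss = return_weighted_loss([0.25, 0.75], action=0, episode_return=3.0)
```

Because the target is one-hot, only the log-probability of the sampled action survives, which is why minimizing this loss follows the policy gradient direction.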
2. Actor-Critic model
This improvement replaces the whole-episode return, for each state-action pair, with the discounted reward accumulated from that step onward:
$$R\left( \tau ^n \right) \rightarrow \sum_{t=t'}^{T_n}{\gamma ^{t-t'}r_{t}^{n}}$$
The gradient then becomes
$$\nabla \bar{R}_{\theta}\approx \frac{1}{N}\sum_{n=1}^N{\sum_{t'=1}^{T_n}{\left( \sum_{t=t'}^{T_n}{\gamma ^{t-t'}r_{t}^{n}} \right) \nabla \log p\left( a_{t'}|s_{t'},\theta \right)}}$$
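The discounted sum that replaces $R(\tau^n)$ can be computed for every step in one backward pass over the rewards. The sketch below is one straightforward way to do it; the function name `rewards_to_go` and the example numbers are assumptions for illustration.

```python
def rewards_to_go(rewards, gamma=0.99):
    """Return G[t'] = sum over t >= t' of gamma**(t - t') * rewards[t], for every t'."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so that each step reuses the discounted sum of the later steps.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


example = rewards_to_go([1.0, 0.0, 2.0], gamma=0.5)   # [1.5, 1.0, 2.0]
```

Each action is then credited only with the rewards that came after it, discounted by how far in the future they arrived.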
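An overestimation problem remains at this point, because some state-action pairs are never sampled while others are sampled repeatedly. After the softmax output, the gap between the selection probabilities is amplified further, which calls for the next improvement.
(1) Baseline: a constant hyperparameter
(2) Critic network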
Actor-Critic model:
$$\nabla \bar{R}_{\theta}\approx \frac{1}{N}\sum_{n=1}^N{\sum_{t=1}^{T_n}{A^{\theta}\left( a_t|s_t \right) \nabla \log p\left( a_t|s_t,\theta \right)}}$$
The value of the advantage function represents the additional gain brought by taking action a.
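One common way to obtain such an advantage, sketched here as an assumption rather than as the exact method of the text, is to subtract a baseline from the discounted return: a constant for option (1), or a critic's value estimate $V(s_t)$ for option (2). The helper name `advantages` and all numbers are illustrative.

```python
def advantages(returns, baselines):
    """A_t = G_t - b_t: how much better the sampled outcome was than the baseline."""
    if len(returns) != len(baselines):
        raise ValueError("returns and baselines must have the same length")
    return [g - b for g, b in zip(returns, baselines)]


discounted_returns = [1.5, 1.0, 2.0]

# Option (1): a constant baseline, here the hypothetical hyperparameter b = 1.0.
adv_constant = advantages(discounted_returns, [1.0, 1.0, 1.0])   # [0.5, 0.0, 1.0]

# Option (2): per-state value estimates, as a critic network would provide.
critic_values = [1.25, 1.0, 2.5]                                 # made-up V(s_t) values
adv_critic = advantages(discounted_returns, critic_values)       # [0.25, 0.0, -0.5]
```

A negative advantage lowers the probability of the action, so actions are no longer reinforced merely because they happened to be sampled.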
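3. Importance sampling
The gradient $\nabla \bar{R}_{\theta}$ above attributes the sampled trajectories to the parameters θ. Once the parameters are updated, however, the action probabilities change, that is, the distribution over state-action pairs changes, and the earlier samples can no longer be used. In other words, the same policy cannot both generate the samples and be updated on them repeatedly. One remedy is to go off-policy: the samples collected by one policy can then be reused for several gradient-ascent steps. The technique that makes this possible is importance sampling.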
Let X be a continuous random variable with probability density p(x). The expectation of a function f(x) is
$$E_{x\sim p}\left[ f\left( x \right) \right] =\int{f\left( x \right) p\left( x \right) dx}$$
Given another probability density q(x), the same expectation can also be written as
$$\begin{aligned}
E_{x\sim p}\left[ f\left( x \right) \right] &=\int{f\left( x \right) \cdot p\left( x \right) dx} \\
&=\int{f\left( x \right) \cdot \frac{p\left( x \right)}{q\left( x \right)}q\left( x \right) dx} \\
&=E_{x\sim q}\left[ f\left( x \right) \frac{p\left( x \right)}{q\left( x \right)} \right]
\end{aligned}$$
Here p(x)/q(x) is called the importance weight. By analogy, f(x) plays the role of the advantage $A(a_t|s_t)$, and p(x)/q(x) is the ratio of the probabilities that the two policies assign to the current action in the current state.
With enough samples, the sample averages on both sides can be treated as equal:
$$E_{x\sim p}\left[ f\left( x \right) \right] =E_{x\sim q}\left[ f\left( x \right) \frac{p\left( x \right)}{q\left( x \right)} \right]$$
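The identity can be checked numerically. The sketch below makes several assumptions for illustration only: p is a standard normal, q is a normal with mean 1 and standard deviation 1.5, and f(x) = x², so the target value $E_{x\sim p}[x^2]$ is exactly 1. Samples are drawn only from q and reweighted by p(x)/q(x).

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def importance_estimate(f, n_samples, rng):
    """Estimate E_{x~p}[f(x)] from samples of q, weighting each by p(x)/q(x)."""
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(1.0, 1.5)                                      # draw from q
        weight = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 1.0, 1.5)   # importance weight
        total += f(x) * weight
    return total / n_samples


rng = random.Random(0)
is_estimate = importance_estimate(lambda x: x * x, 100_000, rng)     # should be close to 1
```

The estimate is only reliable when q covers the region where p has mass. If the two densities differ too much, a few very large weights dominate and the variance grows, which is the motivation for limiting the ratio in PPO.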
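4. PPO
Importance sampling turns the on-policy procedure into an off-policy one: p(x) is updated using samples drawn sufficiently from q(x). This update can be repeated N times per round of sampling rather than just once, which greatly reduces the time the original online PG algorithm spends sampling state-action-reward tuples while preserving training quality.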
$$\nabla \bar{R}_{\theta}\approx \frac{1}{N}\sum_{n=1}^N{\sum_{t=1}^{T_n}{\frac{p_{\theta}\left( a_t|s_t \right)}{p_{\theta '}\left( a_t|s_t \right)}A^{\theta}\left( a_t|s_t \right) \nabla \log p\left( a_t|s_t,\theta \right)}}$$
where the ratio $\frac{p_{\theta}\left( a_t|s_t \right)}{p_{\theta '}\left( a_t|s_t \right)}$ is sometimes corrected, for example by clipping it to the interval $(1-\epsilon ,1+\epsilon )$.
Let
$$r_t\left( \theta \right) =\frac{\pi _{\theta}\left( a_t|s_t \right)}{\pi _{\theta _{old}}\left( a_t|s_t \right)}$$
The clipped objective is then
$$L^{clip}\left( \theta \right) =\hat{E}_t\left[ \min \left( r_t\left( \theta \right) \hat{A}_t,\ clip\left( r_t\left( \theta \right) ,1-\epsilon ,1+\epsilon \right) \hat{A}_t \right) \right]$$
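Here ε is a hyperparameter, usually set to 0.2. $L^{clip}$ is then optimized with a stochastic gradient method such as SGD (ascent on $L^{clip}$, or equivalently descent on $-L^{clip}$).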
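The clipped objective can be sketched directly from its definition. This minimal version works on plain lists of probability ratios and advantage estimates; a real implementation would compute both from a policy network and differentiate through the ratio. The function name and sample values are assumptions.

```python
def clipped_surrogate(ratios, advantages, epsilon=0.2):
    """Sample mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    total = 0.0
    for r, a in zip(ratios, advantages):
        clipped_r = max(1.0 - epsilon, min(r, 1.0 + epsilon))
        total += min(r * a, clipped_r * a)
    return total / len(ratios)


# Sample 1: ratio 1.5 with A > 0 is capped at 1.2, contributing 1.2.
# Sample 2: ratio 0.5 with A < 0 is held at 0.8, contributing -0.8.
# Sample 3: ratio 1.0 is inside the interval, contributing 2.0 unchanged.
objective = clipped_surrogate([1.5, 0.5, 1.0], [1.0, -1.0, 2.0])   # about 0.8
```

Taking the minimum makes the objective a pessimistic bound: once the ratio leaves the interval in the direction that would increase the objective, the clipped term is constant and the sample stops contributing gradient.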
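$\hat{A}_t$ is the advantage function. $\hat{A}_t>0$ indicates that the policy did better than expected, so it should be optimized further in that direction; when …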
Summary
Controlling the ratio $r_t\left( \theta \right)$ between the new and old policies prevents an update from degrading what the agent has learned.