Policy-based RL
Idea
An update method based on MC (Monte Carlo) sampling:
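Concretely, the MC update has the standard policy-gradient form, with the sampled return $G_t$ acting as an unbiased estimate of $Q^{\pi_\theta}(s_t, a_t)$:

$$
\Delta \theta = \alpha \, G_t \, \nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right)
$$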
Properties
Unbiased but noisy: because the estimate comes from random sampling, the returns of good and bad trajectories can differ widely, so the estimator has high variance.
Reducing the noise
use temporal causality
Handle rewards along the time axis (REINFORCE)
The gradient update above then becomes the form below: the action taken at time $t$ can only influence rewards from time $t$ onward, so weighting each term by the reward-to-go removes the unnecessary correlation with earlier rewards:
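In its standard REINFORCE form, each log-probability term is weighted by the reward-to-go $G_t^i$ rather than the full return:

$$
\nabla_{\theta} J(\theta) \approx \frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{T-1} G_{t}^{i} \,\nabla_{\theta} \log \pi_{\theta}\left(a_{t}^{i} \mid s_{t}^{i}\right), \qquad G_{t}^{i}=\sum_{t^{\prime}=t}^{T-1} R\left(s_{t^{\prime}}^{i}, a_{t^{\prime}}^{i}\right)
$$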
include a baseline
Next, transform the expression above by subtracting a baseline $b$ from the weight: taking $b$ to be an expected return centers the weights, so that a few extreme values no longer dominate the update:
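With the reward-to-go notation above, the baseline-subtracted estimator reads:

$$
\nabla_{\theta} J(\theta) \approx \frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{T-1}\left(G_{t}^{i}-b\right) \nabla_{\theta} \log \pi_{\theta}\left(a_{t}^{i} \mid s_{t}^{i}\right)
$$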
For example, $b$ can be taken as:
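A common concrete choice, matching "taking $b$ to be the expectation" above, is the average return over the sampled trajectories:

$$
b=\mathbb{E}\left[R(\tau)\right] \approx \frac{1}{m} \sum_{i=1}^{m} R\left(\tau_{i}\right)
$$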
Method
MC policy gradient
(Sampling) First assume a Markov decision process; we sample a trajectory from this Markov chain as follows:
$$
\tau=\left(s_{0}, a_{0}, r_{1}, \ldots, s_{T-1}, a_{T-1}, r_{T}, s_{T}\right) \sim\left(\pi_{\theta}, P\left(s_{t+1} \mid s_{t}, a_{t}\right)\right)
$$
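As a minimal sketch of this sampling step (assuming a gym-style environment with `reset()`/`step()` and a `policy` function returning action probabilities; both names are illustrative assumptions, not from the notes):

```python
import numpy as np

def sample_trajectory(env, policy, T):
    """Sample tau = (s_0, a_0, r_1, ..., s_T) by rolling out pi_theta.

    Assumes a gym-style env API and that policy(s) returns a probability
    vector over discrete actions (illustrative assumptions).
    """
    s, _ = env.reset()
    states, actions, rewards = [s], [], []
    for t in range(T):
        probs = policy(s)                          # pi_theta(. | s_t)
        a = np.random.choice(len(probs), p=probs)  # sample a_t
        s, r, terminated, truncated, _ = env.step(a)
        actions.append(a)
        rewards.append(r)                          # r_{t+1}
        states.append(s)
        if terminated or truncated:
            break
    return states, actions, rewards
```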
(The function to optimize)
$$
J(\theta)=\mathbb{E}_{\pi_{\theta}}\left[\sum_{t=0}^{T-1} R\left(s_{t}, a_{t}\right)\right]=\sum_{\tau} P(\tau ; \theta) R(\tau)
$$
(where $R(\tau)=\sum_{t=0}^{T-1} R\left(s_{t}, a_{t}\right)$ is the return of a trajectory and $P(\tau ; \theta)=\mu\left(s_{0}\right) \prod_{t=0}^{T-1} \pi_{\theta}\left(a_{t} \mid s_{t}\right) p\left(s_{t+1} \mid s_{t}, a_{t}\right)$ is its probability, with $\mu$ the initial-state distribution)
(The optimization target)
$$
\theta^{*}=\underset{\theta}{\arg \max }\, J(\theta)=\underset{\theta}{\arg \max } \sum_{\tau} P(\tau ; \theta) R(\tau)
$$
(The gradient used for optimization)
$$
\nabla_{\theta} J(\theta)=\sum_{\tau} P(\tau ; \theta) R(\tau) \nabla_{\theta} \log P(\tau ; \theta)
$$
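This uses the log-derivative (likelihood-ratio) trick, $\nabla_{\theta} P(\tau ; \theta)=P(\tau ; \theta) \nabla_{\theta} \log P(\tau ; \theta)$, applied inside $\nabla_{\theta} \sum_{\tau} P(\tau ; \theta) R(\tau)$.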
(Approximate the gradient by Monte Carlo sampling with $m$ trajectories)
$$
\nabla_{\theta} J(\theta) \approx \frac{1}{m} \sum_{i=1}^{m} R\left(\tau_{i}\right) \nabla_{\theta} \log P\left(\tau_{i} ; \theta\right)
$$
(Decompose the trajectory log-probability: the initial-state distribution $\mu$ and the dynamics $p$ do not depend on $\theta$, so their terms vanish under $\nabla_{\theta}$)
$$
\nabla_{\theta} \log P(\tau ; \theta)=\sum_{t=0}^{T-1} \nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right)
$$
(The final approximate gradient, amazing!!!)
$$
\nabla_{\theta} J(\theta) \approx \frac{1}{m} \sum_{i=1}^{m} R\left(\tau_{i}\right) \sum_{t=0}^{T-1} \nabla_{\theta} \log \pi_{\theta}\left(a_{t}^{i} \mid s_{t}^{i}\right)
$$
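A minimal sketch of this estimator, assuming a linear softmax policy over feature vectors and trajectories collected with the `sample_trajectory` helper sketched above (the policy class and helper names are illustrative assumptions):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_gradient(theta, trajectories):
    """MC estimate of grad J(theta) for a linear softmax policy.

    theta: (n_features, n_actions) weight matrix (illustrative policy class).
    trajectories: list of (states, actions, rewards) tuples, where each
    state is a feature vector; assumed collected by sample_trajectory above.
    """
    grad = np.zeros_like(theta)
    for states, actions, rewards in trajectories:
        R_tau = sum(rewards)                   # R(tau_i): total return
        for s, a in zip(states, actions):      # pairs (s_t, a_t), t < T
            probs = softmax(theta.T @ s)       # pi_theta(. | s_t)
            # grad of log pi_theta(a_t | s_t) for a softmax policy:
            # outer(s, one_hot(a) - probs)
            dlog = np.outer(s, -probs)
            dlog[:, a] += s
            grad += R_tau * dlog
    return grad / len(trajectories)            # average over m trajectories

# Gradient-ascent update: theta <- theta + alpha * reinforce_gradient(...)
```

Replacing $R(\tau_i)$ with the reward-to-go $G_t^i$ and subtracting a baseline, as described in the Idea section, lowers the variance without changing the expectation.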
As the MC-approximated gradient above shows, we only need sampled trajectories and $\nabla_{\theta} \log \pi_{\theta}$; no model of the dynamics $p$ is required, so the method is model-free.