Policy Gradient
Let a trajectory be

$$\tau = \{s_1, a_1, s_2, a_2, \dots, s_T, a_T\}$$
Let the actor's parameters be $\theta$. Given $\theta$, the probability that a particular trajectory $\tau$ occurs is

$$p_\theta(\tau) = p(s_1)\prod_{t=1}^T p_\theta(a_t|s_t)\,p(s_{t+1}|s_t,a_t)$$
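As a quick illustration of this factorization, here is a tiny numeric sketch: the log-probability of a trajectory splits into the initial-state term, the policy terms, and the dynamics terms. All probabilities below are made-up numbers, not from the text.

```python
import numpy as np

# log p_theta(tau) = log p(s_1) + sum_t log p_theta(a_t|s_t) + sum_t log p(s_{t+1}|s_t,a_t)
log_p_s1 = np.log(0.5)                # log p(s_1)           (illustrative value)
log_pi   = np.log([0.7, 0.4, 0.9])    # log p_theta(a_t|s_t) for t = 1..3
log_dyn  = np.log([0.6, 0.8, 0.5])    # log p(s_{t+1}|s_t,a_t)

log_p_tau = log_p_s1 + log_pi.sum() + log_dyn.sum()
print(log_p_tau)                      # log-probability of the whole trajectory
```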
Let the total reward of trajectory $\tau$ be

$$R(\tau) = \sum_{t=0}^T r_t$$
Then, under parameters $\theta$, the expected total reward $R$ is

$$\mathbb{E}(R) = \sum_\tau p_\theta(\tau)R(\tau) = \mathbb{E}_{\tau\sim p_\theta}(R(\tau))$$
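The right-hand side suggests estimating $\mathbb{E}(R)$ by sampling: run the policy $N$ times and average the returns. A minimal sketch, where `sample_return` is a hypothetical stand-in for an actual environment rollout:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_return():
    """Stand-in for rolling out pi_theta in the environment and summing rewards.
    Here we just draw a random number for illustration."""
    return rng.normal(loc=1.0, scale=2.0)

N = 10_000
returns = np.array([sample_return() for _ in range(N)])
print(returns.mean())  # Monte Carlo estimate of E_{tau ~ p_theta}[R(tau)]
```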
Its gradient is

$$
\begin{aligned}
\nabla\mathbb{E}_{\tau\sim p_\theta}(R(\tau))
&=\sum_\tau R(\tau)\nabla p_\theta(\tau)\\
&=\sum_\tau R(\tau)\,p_\theta(\tau)\nabla \log p_\theta(\tau)\\
&=\mathbb{E}_{\tau\sim p_\theta(\tau)}\big(R(\tau)\nabla \log p_\theta(\tau)\big)\\
&\approx \frac{1}{N}\sum_{n=1}^N R(\tau^n)\nabla \log p_\theta(\tau^n),\quad N\text{ is the number of sampled trajectories}\\
&=\frac{1}{N}\sum_{n=1}^N R(\tau^n)\nabla \log \Big[p(s_1^n)\prod_{t=1}^T p_\theta(a_t^n|s_t^n)\,p(s_{t+1}^n|s_t^n,a_t^n)\Big]\\
&=\frac{1}{N}\sum_{n=1}^N R(\tau^n)\sum_{t=1}^T \nabla \log p_\theta(a_t^n|s_t^n)\\
&=\frac{1}{N}\sum_{n=1}^N\sum_{t=1}^T R(\tau^n)\,\nabla \log p_\theta(a_t^n|s_t^n)
\end{aligned}
$$

The initial-state term $p(s_1^n)$ and the dynamics terms $p(s_{t+1}^n|s_t^n,a_t^n)$ do not depend on $\theta$, so their gradients vanish; that is why only the policy terms survive in the last two lines.
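The last line is the estimator implemented in REINFORCE-style code. Below is a minimal PyTorch sketch for a single sampled trajectory; the network shape, `obs_dim`, `n_actions`, and the fake rollout data are illustrative assumptions, not part of the text.

```python
import torch
import torch.nn as nn

obs_dim, n_actions = 4, 2
policy = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))

def surrogate_loss(states, actions, trajectory_return):
    """-sum_t R(tau) * log p_theta(a_t|s_t) for one trajectory.
    Its gradient is minus the estimator above, so a gradient-descent step
    on this loss is a gradient-ascent step on the expected return."""
    logits = policy(states)                                   # [T, n_actions]
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    return -(trajectory_return * log_probs).sum()

# Fake rollout data for one trajectory of T = 5 steps (placeholders).
states = torch.randn(5, obs_dim)
actions = torch.randint(0, n_actions, (5,))
R_tau = torch.tensor(3.0)                                     # total reward R(tau)

loss = surrogate_loss(states, actions, R_tau)                 # average over N trajectories in practice
loss.backward()                                               # .grad now holds minus the gradient estimator
```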
Tip 1: Baseline
We want the probability of actions that lead to $R(\tau^n) > 0$ to increase, and the probability of actions that lead to $R(\tau^n) < 0$ to decrease; the sign of the weight in the expression above does exactly this. However, $R(\tau^n)$ may always be positive (e.g. when every reward is non-negative), in which case the probability of every sampled action is pushed up, only by different amounts. The remedy is to subtract a baseline:
$$\nabla\mathbb{E}_{\tau\sim p_\theta}(R(\tau)) = \frac{1}{N}\sum_{n=1}^N\sum_{t=1}^T\big[R(\tau^n)-b\big]\,\nabla \log p_\theta(a_t^n|s_t^n)$$
A common choice is $b = \frac{1}{N}\sum_n R(\tau^n)$, i.e. the average return.
When $R(\tau^n) > b$, the probability of the corresponding actions is increased; when $R(\tau^n) < b$, it is decreased. The baseline $b$ can be updated continually during training.
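A minimal sketch of this baseline choice, using made-up returns for $N = 4$ trajectories:

```python
import numpy as np

returns = np.array([10.0, 12.0, 7.0, 15.0])   # R(tau^n) for N = 4 sampled trajectories
b = returns.mean()                            # b = (1/N) * sum_n R(tau^n)
weights = returns - b                         # [R(tau^n) - b] multiplies grad log p_theta(a_t^n|s_t^n)
print(weights)                                # [-1.  1. -4.  4.]: above-average trajectories get positive weight
```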
Tip 2: Assign suitable credit
The whole-trajectory return $R(\tau^n)$ is a poor measure of how good an individual action is, because it only reflects the quality of the entire sequence of actions. A better weight for the action $a_t$ is the discounted sum of rewards collected from step $t$ onward:
$$\nabla_\theta \mathbb{E}_{\tau\sim p_\theta}(R(\tau)) = \frac{1}{N}\sum_{n=1}^N\sum_{t=1}^T\Big[\sum_{t'=t}^T \gamma^{\,t'-t}\,r_{t'}^n - b\Big]\,\nabla \log p_\theta(a_t^n|s_t^n)$$
Here the baseline $b$ can be the state-value function $V(s)$.
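A small sketch of the per-step credit $\sum_{t'=t}^{T}\gamma^{t'-t}r_{t'}$, computed with a backward scan; the rewards and $\gamma$ below are placeholders:

```python
import numpy as np

def rewards_to_go(rewards, gamma=0.99):
    """For every step t, compute sum_{t'=t}^{T} gamma^(t'-t) * r_{t'}
    (the per-step credit of Tip 2) by scanning the rewards backwards."""
    out = np.zeros_like(rewards, dtype=float)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

r = np.array([0.0, 0.0, 1.0, 0.0, 1.0])        # placeholder rewards from one rollout
print(rewards_to_go(r, gamma=0.9))
# Each entry would then have a baseline (e.g. V(s_t)) subtracted before
# multiplying grad log p_theta(a_t|s_t).
```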
References:
EasyRL (https://datawhalechina.github.io/easy-rl)