Setting up the functions
Chaining the states s and actions a together yields a trajectory τ:
$$\text{Trajectory}\ \tau = \{s_1, a_1, s_2, a_2, \dots, s_t, a_t\}$$
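For intuition, a trajectory is simply what gets recorded while the actor interacts with the environment until the episode ends. The sketch below is only an illustration: it assumes a classic Gym-style `env` whose `step` returns `(obs, reward, done, info)` and a hypothetical `sample_action(state)` that samples from the current policy $p_\theta(a_t|s_t)$.

```python
def collect_trajectory(env, sample_action):
    """Run one episode and return the trajectory {s_1, a_1, ..., s_T, a_T} plus the rewards."""
    states, actions, rewards = [], [], []
    state = env.reset()
    done = False
    while not done:
        action = sample_action(state)                    # a_t sampled from p_theta(a_t | s_t)
        next_state, reward, done, _ = env.step(action)   # classic Gym step API assumed
        states.append(state)
        actions.append(action)
        rewards.append(reward)                           # r_t, later summed into R(tau)
        state = next_state
    return states, actions, rewards
```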
The probability of a given trajectory depends on the network parameters θ. Specifically:
$$\text{Probability}\ p_\theta(\tau) = p(s_1)\,p_\theta(a_1|s_1)\,p(s_2|a_1,s_1)\,p_\theta(a_2|s_2)\,p(s_3|a_2,s_2)\cdots = p(s_1)\prod_{t=1}^{T}p_\theta(a_t|s_t)\,p(s_{t+1}|a_t,s_t)$$
Note that not all of these probabilities depend on θ: the policy terms $p_\theta(a_t|s_t)$ do, while the initial-state distribution $p(s_1)$ and the environment transitions $p(s_{t+1}|a_t,s_t)$ do not.
Each time the actor takes an action $a_t$, a corresponding reward $r_t$ is also produced; summing these gives the reward of the whole trajectory:
$$\text{Reward}\ R_\theta(\tau) = \sum_{t=1}^{T} r_t$$
The ultimate goal is to maximize R, so we need a quantity that measures how large R is. Within each episode θ is fixed and R is a function of the random variable τ, so we can take its expectation:
$$\overline{R_\theta} = E_{\tau\sim p_\theta(\tau)}\big[R_\theta(\tau)\big] = \sum_{\tau=1}^{n} R_\theta(\tau)\,p_\theta(\tau)$$
Gradient descent / ascent
We want the maximum of R, so we use gradient ascent. First compute the gradient. (Tip: all the data collected for computing the gradient come from the same network, so here $R_\theta(\tau)$ can be treated as depending only on τ, not on θ.) The key step in the derivation below is the log-derivative identity $\nabla_\theta p_\theta(\tau) = p_\theta(\tau)\,\nabla_\theta\log p_\theta(\tau)$:
$$\begin{aligned}\nabla_\theta\overline{R_\theta} &= \nabla_\theta\sum_{\tau=1}^{n} R_\theta(\tau)\,p_\theta(\tau) = \nabla_\theta\sum_{\tau=1}^{n} R(\tau)\,p_\theta(\tau) = \sum_{\tau=1}^{n} R(\tau)\,\nabla_\theta p_\theta(\tau)\\ &= \sum_{\tau=1}^{n} R(\tau)\,p_\theta(\tau)\,\nabla_\theta\log p_\theta(\tau) = E_{\tau\sim p_\theta(\tau)}\big[R(\tau)\,\nabla_\theta\log p_\theta(\tau)\big]\end{aligned}$$
Since the distribution of τ is not known in advance, all we can do is play the game, collect data, and randomly sample N trajectories τ from it, averaging over them to estimate the expectation. The expression above then becomes:
$$\nabla_\theta\overline{R_\theta} = E_{\tau\sim p_\theta(\tau)}\big[R(\tau)\,\nabla_\theta\log p_\theta(\tau)\big] \approx \frac{1}{N}\sum_{n=1}^{N} R(\tau^n)\,\nabla_\theta\log p_\theta(\tau^n)$$
where
$$\begin{aligned}\log p_\theta(\tau^n) &= \log\Big(p(s_1)\prod_{t=1}^{T_n}p_\theta(a_t^n|s_t^n)\,p(s_{t+1}^n|a_t^n,s_t^n)\Big)\\ &= \log p(s_1) + \sum_{t=1}^{T_n}\log p(s_{t+1}^n|a_t^n,s_t^n) + \sum_{t=1}^{T_n}\log p_\theta(a_t^n|s_t^n)\end{aligned}$$
The first two terms do not depend on θ, therefore
$$\nabla_\theta\overline{R_\theta} \approx \frac{1}{N}\sum_{n=1}^{N} R(\tau^n)\,\nabla_\theta\log p_\theta(\tau^n) = \frac{1}{N}\sum_{n=1}^{N} R(\tau^n)\sum_{t=1}^{T_n}\nabla_\theta\log p_\theta(a_t^n|s_t^n) = \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n} R(\tau^n)\,\nabla_\theta\log p_\theta(a_t^n|s_t^n)$$
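With an automatic-differentiation framework, this estimate is usually implemented by building a surrogate loss whose gradient is the negative of the expression above, so that an ordinary gradient-descent optimizer effectively performs gradient ascent on $\overline{R_\theta}$. A minimal PyTorch-style sketch, assuming `log_probs[n]` is the list of $\log p_\theta(a_t^n|s_t^n)$ tensors for trajectory $\tau^n$ and `returns[n]` is $R(\tau^n)$ (both hypothetical names):

```python
import torch

def reinforce_loss(log_probs, returns):
    """Surrogate loss: the negative of the sampled policy-gradient objective,
    so that loss.backward() produces the gradient estimate derived above."""
    N = len(log_probs)
    total = torch.zeros(())
    for traj_log_probs, R in zip(log_probs, returns):    # n = 1..N trajectories
        for log_p in traj_log_probs:                     # t = 1..T_n steps
            total = total - R * log_p                    # -R(tau^n) * log p_theta(a_t^n | s_t^n)
    return total / N
```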
Tricks
Add a baseline or normalization
$$\nabla_\theta\overline{R_\theta} \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n}\big(R(\tau^n)-b\big)\,\nabla_\theta\log p_\theta(a_t^n|s_t^n)$$
where b is the expectation of $R(\tau^n)$.
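In practice b is typically estimated from the same batch of sampled trajectories, e.g. as the empirical mean of the returns; a minimal sketch with hypothetical names:

```python
import numpy as np

def subtract_baseline(episode_returns):
    """episode_returns: one R(tau^n) per sampled trajectory."""
    returns = np.asarray(episode_returns, dtype=np.float64)
    b = returns.mean()            # b is estimated as the empirical mean of R(tau^n)
    return returns - b            # weights (R(tau^n) - b) for the log-prob terms
```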
Advantage Function
Let us introduce a definition:
$$R(\tau^n)-b \triangleq A^\theta(s_t,a_t)$$
The meaning of the advantage function is how much better taking action $a_t$ in state $s_t$ is than taking other actions (a relative value).
Moreover, A is generally estimated by a network, and this network is called the critic.
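The source does not specify the critic's architecture; a common choice, sketched here only as an assumption, is a small state-value network $V_\phi(s)$, with the advantage estimated as the return from step t onward minus $V_\phi(s_t)$:

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Small state-value network V_phi(s) used as a learned baseline."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, states):
        return self.net(states).squeeze(-1)

def estimate_advantage(critic, states, returns_to_go):
    """A(s_t, a_t) is approximated as (return from step t onward) - V_phi(s_t)."""
    with torch.no_grad():
        values = critic(states)
    return returns_to_go - values
```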
Going a step further, $R(\tau^n)$ can in fact be normalized:
$$\nabla_\theta\overline{R_\theta} \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n}\frac{R(\tau^n)-\mu}{\sigma}\,\nabla_\theta\log p_\theta(a_t^n|s_t^n)$$
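A minimal sketch of this whitening step, assuming `episode_returns` holds the sampled $R(\tau^n)$ values; the small `eps` is an implementation detail added here to avoid division by zero:

```python
import numpy as np

def normalize_returns(episode_returns, eps=1e-8):
    """Whiten R(tau^n): subtract the batch mean mu and divide by the batch std sigma."""
    returns = np.asarray(episode_returns, dtype=np.float64)
    mu, sigma = returns.mean(), returns.std()
    return (returns - mu) / (sigma + eps)   # eps guards against sigma = 0
```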
Assign suitable credit
The expression above multiplies every action taken within a trajectory by the same $R(\tau^n)$, yet some of those actions are good and some are bad. And since an action generally only influences what happens after it, we instead use the sum of the reward obtained at this step and all subsequent rewards as the effect attributed to this action. Concretely:
$$\nabla_\theta\overline{R_\theta} \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n}\frac{\sum_{t'=t}^{T_n}\gamma^{t'-t}r_{t'}-\mu}{\sigma}\,\nabla_\theta\log p_\theta(a_t^n|s_t^n)$$
Intuition: (the reward produced at $a_t$) + (the discounted rewards produced by later actions) is used as the measure of how good $a_t$ is. The actions within one episode are no longer all weighted by the same value.
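The inner sum $\sum_{t'=t}^{T_n}\gamma^{t'-t}r_{t'}$ (the discounted reward-to-go) can be computed for every step of an episode in a single backward pass over its reward list; a small sketch:

```python
def rewards_to_go(rewards, gamma=0.99):
    """G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every step t of one episode."""
    G = 0.0
    out = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G      # running discounted sum, built from the last step backwards
        out[t] = G
    return out
```

These per-step values then replace the single $R(\tau^n)$ that previously weighted every step, and can be normalized with μ and σ exactly as before.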
Code execution flow
$$\nabla_\theta\overline{R_\theta} \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n}\frac{\sum_{t'=t}^{T_n}\gamma^{t'-t}r_{t'}-\mu}{\sigma}\,\nabla_\theta\log p_\theta(a_t^n|s_t^n)$$
One θ is used to run N episodes; each episode contributes $T_n$ terms $\nabla_\theta\log p_\theta(a_t^n|s_t^n)$.
In the n-th episode, corresponding to $\tau^n$, a total of $T_n$ steps are taken, and each step yields one pair:
$$\nabla_\theta\log p_\theta(a_t^n|s_t^n) \quad\text{and}\quad \frac{\sum_{t'=t}^{T_n}\gamma^{t'-t}r_{t'}-\mu}{\sigma}$$
At each step, multiply the two quantities; summing the products over all steps gives:
$$\sum_{t=1}^{T_n}\frac{\sum_{t'=t}^{T_n}\gamma^{t'-t}r_{t'}-\mu}{\sigma}\,\nabla_\theta\log p_\theta(a_t^n|s_t^n)$$
This result is stored in the buffer.
After N episodes have been run (typically N = batch_size), all the data in the buffer are summed, giving:
$$\sum_{n=1}^{N}\sum_{t=1}^{T_n}\frac{\sum_{t'=t}^{T_n}\gamma^{t'-t}r_{t'}-\mu}{\sigma}\,\nabla_\theta\log p_\theta(a_t^n|s_t^n)$$
This is the gradient used for the update. After θ has been updated with it, the whole process above is repeated.
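Putting the pieces together, one update of θ might look like the sketch below. It is only an illustration under the same assumptions as the earlier snippets (a classic Gym-style `env`, a PyTorch `policy` network that outputs action logits, an `optimizer` over its parameters, and the `rewards_to_go` helper above); it is not the author's implementation.

```python
import numpy as np
import torch

def train_step(env, policy, optimizer, batch_size=16, gamma=0.99):
    """One policy-gradient update: run batch_size episodes, then apply the summed gradient."""
    log_prob_buffer, weight_buffer = [], []
    for _ in range(batch_size):                          # N = batch_size episodes with the same theta
        state, done, log_probs, rewards = env.reset(), False, [], []
        while not done:
            logits = policy(torch.as_tensor(state, dtype=torch.float32))
            dist = torch.distributions.Categorical(logits=logits)
            action = dist.sample()
            log_probs.append(dist.log_prob(action))      # log p_theta(a_t | s_t)
            state, reward, done, _ = env.step(action.item())
            rewards.append(reward)
        log_prob_buffer.append(log_probs)
        weight_buffer.append(rewards_to_go(rewards, gamma))

    # Normalize every per-step weight with the batch mean and standard deviation.
    flat = np.concatenate(weight_buffer)
    mu, sigma = flat.mean(), flat.std() + 1e-8

    loss = 0.0
    for log_probs, weights in zip(log_prob_buffer, weight_buffer):
        for log_p, w in zip(log_probs, weights):
            loss = loss - ((w - mu) / sigma) * log_p     # minus sign: ascent via a descent optimizer
    loss = loss / batch_size

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```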