Policy-based
Treat a neural network as an Actor. Its input is an observation, represented as a vector or a matrix; its output is a probability for each action, much like a classifier that outputs a probability for each class.
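A minimal sketch of such an Actor, assuming a linear softmax policy over a small discrete action space (the observation dimension, the three example actions, and the `Actor` class itself are illustrative, not part of the original text):

```python
import numpy as np

def softmax(z):
    z = z - z.max()                     # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

class Actor:
    """Maps an observation vector to one probability per discrete action."""
    def __init__(self, obs_dim, n_actions, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W = 0.01 * self.rng.standard_normal((n_actions, obs_dim))   # parameters theta

    def action_probs(self, obs):
        return softmax(self.W @ obs)    # like class probabilities in a classifier

    def sample_action(self, obs):
        p = self.action_probs(obs)
        return self.rng.choice(len(p), p=p)

# e.g. a 4-dimensional observation and three actions {left, right, fire}
actor = Actor(obs_dim=4, n_actions=3)
print(actor.action_probs(np.array([0.1, -0.2, 0.3, 0.0])))   # three probabilities summing to 1
```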
Consider an episode
$$\tau=\{s_{1},a_{1},r_{1},s_{2},a_{2},r_{2},\dots,s_{T},a_{T},r_{T}\}.$$
For an Actor with parameters $\theta$, the probability of producing this episode is:
$$p(\tau|\theta)=p(s_{1})\,p(a_{1}|s_{1},\theta)\,p(r_{1},s_{2}|s_{1},a_{1})\,p(a_{2}|s_{2},\theta)\,p(r_{2},s_{3}|s_{2},a_{2})\cdots\\
=p(s_{1})\prod_{t=1}^{T}p(a_{t}|s_{t},\theta)\,p(r_{t},s_{t+1}|s_{t},a_{t})\qquad(1)$$
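As a sketch of how Eq. (1) splits in the log domain, the code below sums the Actor terms $\log p(a_t|s_t,\theta)$ and the environment terms separately; the environment log-probabilities are stand-in numbers, since in practice they are unknown but also independent of $\theta$:

```python
import numpy as np

def episode_log_prob(log_p_s1, policy_probs, actions, env_logps):
    """log p(tau|theta) = log p(s_1) + sum_t [ log p(a_t|s_t,theta) + log p(r_t,s_{t+1}|s_t,a_t) ].

    policy_probs[t] : the Actor's distribution p(.|s_t, theta) at step t
    actions[t]      : the action a_t actually taken at step t
    env_logps[t]    : log p(r_t, s_{t+1} | s_t, a_t), fixed by the environment
    Only the policy term depends on theta.
    """
    policy_term = sum(np.log(p[a]) for p, a in zip(policy_probs, actions))
    return log_p_s1 + policy_term + sum(env_logps)

# toy episode with T = 2 steps and three actions
probs = [np.array([0.2, 0.5, 0.3]), np.array([0.6, 0.3, 0.1])]
print(episode_log_prob(0.0, probs, actions=[1, 0], env_logps=[-0.1, -0.2]))
```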
Here $p(s_{1})$ and $p(r_{t},s_{t+1}|s_{t},a_{t})$ are determined by the environment, not by the Actor, while $p(a_{t}|s_{t},\theta)$ is the probability the Actor assigns to action $a_{t}$ given observation $s_{t}$. The total reward collected along this $\tau$ is
$$R(\tau)=\sum_{t=1}^{T}r_{t}.$$
Using the Actor to play the game $N$ times, i.e. drawing $N$ samples of $\tau$ from the distribution $p(\tau|\theta)$, gives $N$ episodes $\{\tau^{1},\tau^{2},\dots,\tau^{N}\}$. The expected reward is:
$$\bar{R}_{\theta}=\sum_{\tau}R(\tau)\,p(\tau|\theta).$$
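Since summing over all possible $\tau$ is intractable, $\bar{R}_{\theta}$ is estimated in practice by averaging the returns of the sampled episodes; a sketch with a toy `run_episode` stand-in for actually playing the game:

```python
import numpy as np

rng = np.random.default_rng(0)

def run_episode():
    """Stand-in for playing one episode with the current Actor;
    returns the list of per-step rewards [r_1, ..., r_T]."""
    return list(rng.normal(loc=1.0, scale=0.5, size=10))      # toy rewards, T = 10

def estimate_expected_reward(n_episodes=1000):
    # Monte Carlo estimate: R_bar ~= (1/N) * sum_n R(tau^n)
    returns = [sum(run_episode()) for _ in range(n_episodes)]
    return float(np.mean(returns))

print(estimate_expected_reward())    # close to 10, the true expected return of the toy episodes
```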
Our optimization objective is to maximize the expected reward:
$$\theta^{*}=\arg\max_{\theta}\bar{R}_{\theta}.$$
Taking the gradient:
$$\begin{aligned}
\nabla\bar{R}_{\theta}&=\sum_{\tau}R(\tau)\,\nabla p(\tau|\theta)
=\sum_{\tau}R(\tau)\,p(\tau|\theta)\frac{\nabla p(\tau|\theta)}{p(\tau|\theta)}
=\sum_{\tau}R(\tau)\,p(\tau|\theta)\,\nabla\log p(\tau|\theta)\\
&\approx\frac{1}{N}\sum_{n=1}^{N}R(\tau^{n})\,\nabla\log p(\tau^{n}|\theta)\\
&=\frac{1}{N}\sum_{n=1}^{N}R(\tau^{n})\,\nabla\log\Big[p(s_{1}^{n})\prod_{t=1}^{T_{n}}p(a_{t}^{n}|s_{t}^{n},\theta)\,p(r_{t}^{n},s_{t+1}^{n}|s_{t}^{n},a_{t}^{n})\Big]\\
&=\frac{1}{N}\sum_{n=1}^{N}R(\tau^{n})\sum_{t=1}^{T_{n}}\nabla\log p(a_{t}^{n}|s_{t}^{n},\theta)\qquad\#\ \text{drop the terms that do not depend on }\theta\\
&=\frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_{n}}R(\tau^{n})\,\nabla\log p(a_{t}^{n}|s_{t}^{n},\theta)
\end{aligned}$$
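A sketch of the final sampled-gradient expression, again assuming a linear softmax policy $p(a|s,\theta)=\mathrm{softmax}(Ws)$ so that $\nabla\log p(a|s,\theta)$ has a closed form; with a neural-network Actor the same quantity would come from automatic differentiation:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_pi(W, s, a):
    """For p(a|s, W) = softmax(W s):  d/dW log p(a|s, W) = (onehot(a) - p(.|s)) s^T."""
    p = softmax(W @ s)
    onehot = np.zeros_like(p)
    onehot[a] = 1.0
    return np.outer(onehot - p, s)

def policy_gradient(W, episodes):
    """grad R_bar ~= (1/N) sum_n sum_t R(tau^n) * grad log p(a_t^n | s_t^n, W).
    Each episode is a tuple (states, actions, rewards)."""
    grad = np.zeros_like(W)
    for states, actions, rewards in episodes:
        R = sum(rewards)                       # R(tau^n): total reward of episode n
        for s, a in zip(states, actions):
            grad += R * grad_log_pi(W, s, a)
    return grad / len(episodes)
```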
The parameters are then updated by gradient ascent:
$$\theta\leftarrow\theta+\eta\,\nabla\bar{R}_{\theta}.$$
The optimization of the Actor parameters $\theta$ can also be viewed as a classification problem: decompose each $\tau$ into its $(s,a)$ pairs, and treat every $(s,a)$ pair as one training example.
As in classification, we maximize the log-likelihood of the target action (equivalently, minimize the cross-entropy); for example, with three candidate actions:
$$\max\sum_{i=1}^{3}\hat{y}_{i}\log y_{i}.$$
For a training example $(s,a=\mathrm{left})$, the target $\hat{y}$ is one-hot, so this objective reduces to:
$$\log p(a=\mathrm{left}\mid s).$$
For the $N$ sampled episodes $\tau$, the corresponding gradient is:
$$\frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_{n}}\nabla\log p(a_{t}^{n}|s_{t}^{n},\theta).$$
Each training example is then weighted by the total reward $R(\tau^{n})$ of the episode it came from, so examples from high-reward episodes carry more weight. With this weighting, the gradient of the classification objective matches the policy gradient derived above:
$$\frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_{n}}R(\tau^{n})\,\nabla\log p(a_{t}^{n}|s_{t}^{n},\theta).$$
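A sketch of this classification view, assuming the same linear softmax policy as above: `log_likelihood` is the plain classification objective in which every $(s,a)$ pair counts equally, and `weighted_log_likelihood` weights each pair by the return of its episode, giving the surrogate objective whose gradient is the expression above:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def log_likelihood(W, episodes):
    """(1/N) * sum_n sum_t log p(a_t^n | s_t^n, W): every (s, a) pair counts equally,
    as in ordinary classification with one-hot targets."""
    return sum(sum(np.log(softmax(W @ s)[a]) for s, a in zip(states, actions))
               for states, actions, _ in episodes) / len(episodes)

def weighted_log_likelihood(W, episodes):
    """(1/N) * sum_n R(tau^n) * sum_t log p(a_t^n | s_t^n, W):
    pairs from high-reward episodes contribute more, matching the policy gradient."""
    return sum(sum(rewards) * sum(np.log(softmax(W @ s)[a]) for s, a in zip(states, actions))
               for states, actions, rewards in episodes) / len(episodes)
```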
$R(\tau)$ is usually positive, so to avoid penalizing actions simply because they were never sampled, a baseline constant $b$ is subtracted, ensuring the Actor can still produce all kinds of actions:
$$\frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_{n}}\big(R(\tau^{n})-b\big)\,\nabla\log p(a_{t}^{n}|s_{t}^{n},\theta).$$
If every action sampled at the start receives positive feedback, the probabilities of those sampled actions grow while the probabilities of all other actions shrink, so subsequent updates become biased toward whatever happened to be sampled in the previous round. Subtracting the baseline constant makes the Actor suppress actions with small rewards, so that only rewards larger than $b$ lead to reinforcement, removing this unfairness.
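One simple choice for the baseline (an assumption here; the text only says $b$ is a constant) is the average return of the $N$ sampled episodes, so that only episodes with above-average reward are reinforced:

```python
import numpy as np

def weights_with_baseline(episode_rewards):
    """episode_rewards[n] is the list of per-step rewards of episode n.
    Returns R(tau^n) - b for every episode, with b = the mean return."""
    returns = np.array([sum(r) for r in episode_rewards], dtype=float)
    b = returns.mean()                    # baseline: mean return over the N episodes
    return returns - b                    # positive only for above-average episodes

print(weights_with_baseline([[1, 2], [0, 1], [5, 3]]))   # -> [-1. -3.  4.]
```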