Model-Free Reinforcement Learning: Policy-Based Methods

Policy-based

A neural network serves as the Actor. Its input is an observation, represented as a vector or a matrix; its output is a probability for each action, analogous to the per-class probabilities in a classification problem, as shown in the figure below:
(figure: the actor network maps an observation to a probability over actions)
Consider an episode $\tau=\{s_{1},a_{1},r_{1},s_{2},a_{2},r_{2},\dots,s_{T},a_{T},r_{T}\}$. For an Actor with parameters $\theta$, the probability of generating this episode is:
$$p(\tau|\theta)=p(s_{1})\,p(a_{1}|s_{1},\theta)\,p(r_{1},s_{2}|s_{1},a_{1})\,p(a_{2}|s_{2},\theta)\,p(r_{2},s_{3}|s_{2},a_{2})\cdots
=p(s_{1})\prod_{t=1}^{T}p(a_{t}|s_{t},\theta)\,p(r_{t},s_{t+1}|s_{t},a_{t}) \qquad (1)$$
Here $p(s_{1})$ and $p(r_{t},s_{t+1}|s_{t},a_{t})$ are determined by the environment, not by the actor; $p(a_{t}|s_{t},\theta)$ is the probability the actor assigns to action $a_{t}$ given observation $s_{t}$. The total reward of this $\tau$ is $R(\tau)=\sum_{t=1}^{T}r_{t}$.
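As a concrete illustration, the decomposition of $R(\tau)$ and the actor-dependent part of $\log p(\tau|\theta)$ can be sketched in NumPy. The 2-state, 2-action tabular policy and the episode below are hypothetical numbers, not from the text:

```python
import numpy as np

# Hypothetical actor: a fixed table p(a | s) over 2 states x 2 actions.
policy = np.array([[0.8, 0.2],    # p(a | s=0)
                   [0.3, 0.7]])   # p(a | s=1)

# One made-up episode as (s_t, a_t, r_t) triples.
episode = [(0, 0, 1.0), (1, 1, 0.5), (0, 1, 0.0)]

# Total reward R(tau) = sum_t r_t
R = sum(r for _, _, r in episode)

# Actor-controlled part of log p(tau|theta): sum_t log p(a_t | s_t, theta).
# p(s_1) and the transition terms p(r_t, s_{t+1} | s_t, a_t) are fixed by
# the environment, so they are omitted here.
log_p_actor = sum(np.log(policy[s, a]) for s, a, _ in episode)

print(R)            # 1.5
print(log_p_actor)  # log 0.8 + log 0.7 + log 0.2 ≈ -2.19
```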
Playing the game $N$ times with the actor, i.e., sampling $\tau$ $N$ times from the distribution $p(\tau|\theta)$, yields $N$ episodes $\{\tau^{1},\tau^{2},\dots,\tau^{N}\}$. The expected reward is:
$$\bar{R}_{\theta}=\sum_{\tau}R(\tau)\,p(\tau|\theta)$$
Our optimization objective is to maximize the expected reward:
$$\theta^{*}=\arg\max_{\theta}\bar{R}_{\theta}$$
Computing the gradient:
$$\begin{aligned}
\nabla\bar{R}_{\theta} &= \sum_{\tau}R(\tau)\nabla p(\tau|\theta)
= \sum_{\tau}R(\tau)\,p(\tau|\theta)\,\frac{\nabla p(\tau|\theta)}{p(\tau|\theta)} \\
&= \sum_{\tau}R(\tau)\,p(\tau|\theta)\,\nabla \log p(\tau|\theta)
\approx \frac{1}{N}\sum_{n=1}^{N} R(\tau^{n})\,\nabla \log p(\tau^{n}|\theta) \\
&= \frac{1}{N}\sum_{n=1}^{N} R(\tau^{n})\,\nabla \log\!\Big[\, p(s_{1}^{n})\prod_{t=1}^{T_{n}}p(a_{t}^{n}|s_{t}^{n},\theta)\,p(r_{t}^{n},s_{t+1}^{n}|s_{t}^{n},a_{t}^{n})\Big] \\
&= \frac{1}{N}\sum_{n=1}^{N} R(\tau^{n}) \sum_{t=1}^{T_{n}}\nabla\log p(a_{t}^{n}|s_{t}^{n},\theta)
\qquad \text{(the terms not involving } \theta \text{ drop out)} \\
&= \frac{1}{N}\sum_{n=1}^{N} \sum_{t=1}^{T_{n}}R(\tau^{n})\,\nabla\log p(a_{t}^{n}|s_{t}^{n},\theta)
\end{aligned}$$
Update the parameters by gradient ascent: $\theta \leftarrow \theta+\eta\nabla\bar{R}_{\theta}$
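Putting the sampled-gradient estimator and the ascent update together, the loop can be sketched as a minimal REINFORCE implementation for a linear-softmax actor. The toy task here is an assumption for illustration (one-hot states, reward 1 when the action index matches the state index); all names and hyperparameters are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical toy task: D=2 one-hot states, A=2 actions,
# reward 1 when the action index matches the state index.
D, A = 2, 2
theta = np.zeros((A, D))                        # actor parameters

def run_episode(theta, T=5):
    """Sample one episode; return its (s, a) pairs and total reward R(tau)."""
    traj, R = [], 0.0
    for _ in range(T):
        s = np.eye(D)[rng.integers(D)]          # random one-hot state
        p = softmax(theta @ s)                  # p(a | s, theta)
        a = int(rng.choice(A, p=p))
        R += 1.0 if a == int(s.argmax()) else 0.0
        traj.append((s, a))
    return traj, R

def reinforce_grad(theta, episodes):
    """(1/N) * sum_n sum_t R(tau^n) * grad_theta log p(a_t^n | s_t^n, theta)."""
    g = np.zeros_like(theta)
    for traj, R in episodes:
        for s, a in traj:
            p = softmax(theta @ s)
            glogp = -np.outer(p, s)             # grad of log-softmax: (e_a - p) s^T
            glogp[a] += s
            g += R * glogp
    return g / len(episodes)

eta, N = 0.1, 20
for _ in range(200):
    episodes = [run_episode(theta) for _ in range(N)]
    theta = theta + eta * reinforce_grad(theta, episodes)   # gradient ascent
```

After training, the actor should assign high probability to the matching action in each state.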

The optimization of the actor's parameters $\theta$ can also be viewed from a classification perspective.
Decompose each $\tau$ into multiple $(s,a)$ pairs; each $(s,a)$ pair is one training example.
(figures: each $(s,a)$ pair from an episode treated as a classification training example)
The classifier is trained by maximizing $\sum_{i=1}^{3}\hat{y}_{i}\log y_{i}$ (equivalently, minimizing the cross-entropy).
For one training example $(s, a=\mathrm{left})$, the corresponding log-likelihood term is $\log p(a=\mathrm{left}\,|\,s)$.
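This correspondence can be checked numerically: for a softmax output layer with a one-hot target, the gradient of $\log p(a|s)$ with respect to the logits equals the target minus the predicted probabilities, which is exactly the classification cross-entropy gradient. The 3-action logits below are made-up values:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical logits for 3 actions (e.g. left, right, fire)
z = np.array([0.5, -1.0, 0.2])
a = 0  # target action: a = left

# Analytic gradient of log p(a|s) w.r.t. the logits:
# one-hot target minus predicted probabilities.
grad = np.eye(3)[a] - softmax(z)

# Finite-difference check of the same gradient
eps = 1e-6
num = np.array([
    (np.log(softmax(z + eps * np.eye(3)[i])[a]) -
     np.log(softmax(z - eps * np.eye(3)[i])[a])) / (2 * eps)
    for i in range(3)])
assert np.allclose(grad, num, atol=1e-5)
```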
For $N$ episodes $\tau$, the corresponding gradient is:
$$\frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_{n}}\nabla \log p(a^{n}_{t}|s^{n}_{t},\theta)$$
Each training example must then be weighted by $R(\tau)$, so that examples from high-reward episodes carry more weight. After this weighting, the gradient matches the reward-weighted policy gradient derived above:
$$\frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_{n}}R(\tau^{n})\,\nabla \log p(a^{n}_{t}|s^{n}_{t},\theta)$$
$R(\tau)$ is usually positive, so every sampled action gets reinforced while actions that were never sampled lose probability mass. To prevent this, subtract a baseline constant $b$, so that the actor can still explore all actions:
$$\frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_{n}}\bigl(R(\tau^{n})-b\bigr)\,\nabla \log p(a^{n}_{t}|s^{n}_{t},\theta)$$
If every action sampled in one round happens to receive positive reinforcement, the probabilities of those actions all increase while those of the unsampled actions decrease, biasing the next round of sampling toward the previous round's samples. Subtracting the baseline $b$ lets the actor suppress actions whose reward falls below $b$, so that only rewards above the baseline increase an action's probability, removing this unfairness.
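A sketch of the baseline's effect, using hypothetical episode returns and the common choice $b$ = mean return (the specific numbers are made up):

```python
import numpy as np

# Hypothetical returns of N=5 sampled episodes: all positive,
# so without a baseline every sampled action would be reinforced.
R = np.array([10.0, 8.0, 6.0, 2.0, 1.0])

b = R.mean()                 # one common baseline choice: b = mean of R(tau)
weights = R - b              # per-episode weight on grad log p(a_t | s_t, theta)

print(weights)               # [ 4.6  2.6  0.6 -3.4 -4.4]
# Below-average episodes now get negative weight: their actions are suppressed
# rather than reinforced, while above-average episodes are still pushed up.
```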
