# Model-Free Reinforcement Learning: Policy-Based Methods

## Policy-based

The probability of a trajectory $\tau$ under policy parameters $\theta$ factorizes into an initial-state term, policy terms, and environment-dynamics terms:

$$
p(\tau|\theta)=p(s_{1})p(a_{1}|s_{1},\theta)p(r_{1},s_{2}|s_{1},a_{1})p(a_{2}|s_{2},\theta)p(r_{2},s_{3}|s_{2},a_{2})\cdots\\
=p(s_{1})\prod_{t=1}^{T_{n}}p(a_{t}|s_{t},\theta)p(r_{t},s_{t+1}|s_{t},a_{t})\tag{1}
$$
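The factorization in (1) can be checked numerically. Below is a minimal sketch in a hypothetical 2-state, 2-action tabular setting (all probabilities are made up for illustration); the key point is that $\log p(\tau|\theta)$ splits into a policy part that depends on $\theta$ and a dynamics part that does not.

```python
import numpy as np

# Hypothetical tabular MDP (numbers are illustrative only).
p_s1 = np.array([0.6, 0.4])                 # initial-state distribution p(s1)
policy = np.array([[0.7, 0.3],              # p(a|s, theta), rows = states
                   [0.2, 0.8]])
dynamics = np.array([[[0.9, 0.1],           # p(s'|s, a), indexed [s][a][s']
                      [0.5, 0.5]],
                     [[0.3, 0.7],
                      [0.6, 0.4]]])

# A sampled trajectory (s1, a1, s2, a2, s3).
states = [0, 1, 0]
actions = [1, 0]

# p(tau|theta) = p(s1) * prod_t p(a_t|s_t,theta) p(s_{t+1}|s_t,a_t)
p_tau = p_s1[states[0]]
for t in range(len(actions)):
    p_tau *= policy[states[t], actions[t]]
    p_tau *= dynamics[states[t]][actions[t]][states[t + 1]]

# log p(tau|theta) splits into a policy part (depends on theta)
# and a dynamics part (does not) -- the fact the policy gradient exploits.
log_policy_part = sum(np.log(policy[s, a]) for s, a in zip(states, actions))
log_dynamics_part = np.log(p_s1[states[0]]) + sum(
    np.log(dynamics[states[t]][actions[t]][states[t + 1]])
    for t in range(len(actions)))
```
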

The objective is the expected return over trajectories, and we seek the parameters that maximize it:

$$\bar{R}_{\theta}=\sum_{\tau}R(\tau)p(\tau|\theta)$$

$$\theta^{*}=\arg\max_{\theta}\bar{R}_{\theta}$$

Applying the log-derivative trick $\nabla p = p\,\nabla\log p$ and approximating the expectation with $N$ sampled trajectories:

$$
\nabla\bar{R}_{\theta}=\sum_{\tau}R(\tau)\nabla p(\tau|\theta)=\sum_{\tau}R(\tau)p(\tau|\theta)\frac{\nabla p(\tau|\theta)}{p(\tau|\theta)}\\
=\sum_{\tau}R(\tau)p(\tau|\theta)\nabla\log p(\tau|\theta)\approx\frac{1}{N}\sum_{n=1}^{N}R(\tau^{n})\nabla\log p(\tau^{n}|\theta)\\
=\frac{1}{N}\sum_{n=1}^{N}R(\tau^{n})\nabla\log\left[p(s_{1}^{n})\prod_{t=1}^{T_{n}}p(a_{t}^{n}|s_{t}^{n},\theta)p(r_{t}^{n},s_{t+1}^{n}|s_{t}^{n},a_{t}^{n})\right]\\
=\frac{1}{N}\sum_{n=1}^{N}R(\tau^{n})\sum_{t=1}^{T_{n}}\nabla\log p(a_{t}^{n}|s_{t}^{n},\theta)\quad\text{(terms not involving }\theta\text{ drop out)}\\
=\frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_{n}}R(\tau^{n})\nabla\log p(a_{t}^{n}|s_{t}^{n},\theta)
$$
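The core quantity in this estimator is the per-step score $\nabla\log p(a_{t}|s_{t},\theta)$. As a sanity-check sketch (toy single-state, 3-action softmax policy; names are illustrative), for softmax logits $\theta$ the score has the closed form $\text{onehot}(a)-\text{softmax}(\theta)$, which can be verified against finite differences:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def log_prob(theta, a):
    return np.log(softmax(theta)[a])

theta = np.array([0.5, -1.0, 2.0])
a = 2                        # the sampled action

# Analytic score: grad_theta log p(a|theta) = one_hot(a) - softmax(theta)
analytic = -softmax(theta)
analytic[a] += 1.0

# Central finite differences as a check
eps = 1e-6
numeric = np.zeros_like(theta)
for i in range(len(theta)):
    d = np.zeros_like(theta)
    d[i] = eps
    numeric[i] = (log_prob(theta + d, a) - log_prob(theta - d, a)) / (2 * eps)
```

The REINFORCE estimate is then just the average of these score vectors, each scaled by its trajectory's return $R(\tau^{n})$.
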

The optimization of the actor parameters $\theta$ can be viewed from a classification perspective. Plain maximum likelihood would maximize

$$\frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_{n}}\nabla\log p(a^{n}_{t}|s^{n}_{t},\theta)$$

whereas the policy-gradient objective weights each log-likelihood term by the return of its trajectory:

$$\frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_{n}}R(\tau^{n})\nabla\log p(a^{n}_{t}|s^{n}_{t},\theta)$$
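In other words, the objective is an ordinary cross-entropy loss against the sampled actions, except each $(s_{t}^{n}, a_{t}^{n})$ pair is weighted by $R(\tau^{n})$. A minimal numpy sketch (the batch of logits, actions, and returns below is made up for illustration):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([[0.3, -0.2, 0.1],   # one row of logits per (s_t^n)
                   [1.0,  0.5, -0.4],
                   [-0.1, 0.2, 0.6]])
actions = np.array([0, 2, 1])          # sampled a_t^n play the role of labels
returns = np.array([2.0, 2.0, -1.0])   # R(tau^n) for each sample's trajectory

probs = softmax(logits)
log_likelihood = np.log(probs[np.arange(len(actions)), actions])

# Plain classification (maximum likelihood): mean log p(a|s)
mle_objective = log_likelihood.mean()

# Policy gradient: the same log-likelihood, weighted by returns
pg_objective = (returns * log_likelihood).mean()
```

Samples from high-return trajectories are pushed toward higher probability, while a negative weight actively pushes the action's probability down.
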
$R(\tau)$ is usually positive, so every sampled action has its probability pushed up, and actions that happen not to be sampled only lose probability mass relative to the others. Subtracting a baseline constant $b$ (for example, the average return) makes the weights signed, which keeps the model exploring a variety of actions:

$$\frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_{n}}\left(R(\tau^{n})-b\right)\nabla\log p(a^{n}_{t}|s^{n}_{t},\theta)$$
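Subtracting a constant baseline does not bias the gradient, because $\mathbb{E}_{a}[\nabla\log p(a|\theta)]=\sum_{a}p(a)\nabla\log p(a|\theta)=\nabla\sum_{a}p(a)=0$. The sketch below checks this exactly for a toy softmax policy (all numbers are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.array([0.2, -0.7, 1.5])
p = softmax(theta)

# Row a = grad_theta log p(a|theta) = one_hot(a) - p for softmax logits
scores = np.eye(3) - p

# Expected score is exactly zero: sum_a p(a) * score(a) = p - p = 0
expected_score = p @ scores

# Hence for any rewards R and constant baseline b, the plain and
# baseline-subtracted estimators have the same expectation:
R = np.array([3.0, 1.0, 2.0])
b = R.mean()
g_plain = (p * R) @ scores
g_base = (p * (R - b)) @ scores
```

Only the variance changes; a well-chosen $b$ reduces it.
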
