[RL 9] Trust Region Policy Optimization (ICML, 2015)


1 Introduction

  1. policy optimization categories
    1. policy iteration (GPI)
    2. PG (e.g. TRPO)
    3. derivative-free optimization methods

2 Preliminaries

  1. Consider an infinite-horizon discounted MDP
    1. rather than an average-reward formulation
  2. objective $\eta(\tilde{\pi})$ of a new policy, expressed via advantages of the old policy (Kakade & Langford, 2002)
    $$\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{s_{0}, a_{0}, \cdots \sim \tilde{\pi}}\left[\sum_{t=0}^{\infty} \gamma^{t} A_{\pi}(s_{t}, a_{t})\right] = \eta(\pi) + \sum_{s} \rho_{\tilde{\pi}}(s) \sum_{a} \tilde{\pi}(a \mid s) A_{\pi}(s, a)$$
    1. improving the expectation on the right-hand side leads to policy improvement
    2. but $\rho_{\tilde{\pi}}(s)$ is hard to estimate, since it depends on the new policy $\tilde{\pi}$
    3. so a local approximation is used instead
  3. local approximation $L_{\pi}(\tilde{\pi})$ of the objective (Kakade & Langford, 2002)
    $$L_{\pi}(\tilde{\pi}) = \eta(\pi) + \sum_{s} \rho_{\pi}(s) \sum_{a} \tilde{\pi}(a \mid s) A_{\pi}(s, a)$$
    1. matches $\eta(\tilde{\pi})$ to first order around the current policy
    2. Kakade & Langford derived a lower bound on $\eta$ when improving $L_{\pi}$
    3. but only for a limited form of policy improvement (mixture policies, i.e. conservative policy iteration); a tabular sketch of $L_{\pi}$ follows this list
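A minimal tabular sketch of the local approximation above, assuming a small discrete MDP and illustrative arrays `rho_pi`, `pi_new`, `adv_pi` (these names are not from the paper); it only shows that $L_{\pi}$ differs from $\eta$ in which state distribution weights the advantages:

```python
import numpy as np

def local_approximation(eta_pi, rho_pi, pi_new, adv_pi):
    """L_pi(pi_new) = eta(pi) + sum_s rho_pi(s) sum_a pi_new(a|s) A_pi(s, a).

    eta_pi: scalar, performance of the current policy pi
    rho_pi: (S,) discounted state-visitation frequencies of the CURRENT policy
    pi_new: (S, A) action probabilities of the candidate policy
    adv_pi: (S, A) advantages A_pi(s, a) of the current policy
    """
    # The only difference from eta(pi_new) is that states are weighted by
    # rho_pi instead of the hard-to-estimate rho_{pi_new}.
    return eta_pi + np.sum(rho_pi[:, None] * pi_new * adv_pi)
```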

3 Monotonic Improvement Guarantee for General Stochastic Policies

  1. Theorem 1: lower bound for general stochastic policies
    $$\eta(\tilde{\pi}) \geq L_{\pi}(\tilde{\pi}) - C D_{\mathrm{KL}}^{\max}(\pi, \tilde{\pi}), \quad \text{where } C = \frac{4 \epsilon \gamma}{(1-\gamma)^{2}}, \; \epsilon = \max_{s, a}\left|A_{\pi}(s, a)\right|$$
    • the KL term can be seen as a penalty
    • $C$ is a large constant (for $\gamma$ close to 1), which leads to very small step sizes if used directly
  2. Algorithm 1: approximate policy iteration with a monotonic improvement guarantee (a sketch of the penalized objective follows this list)
    • the advantage estimates must be accurate for the guarantee to hold
    • can be extended to continuous state and action spaces
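A short sketch of the quantity Algorithm 1 maximizes at each iteration, assuming a tabular advantage estimate; the function and argument names are illustrative:

```python
import numpy as np

def penalized_surrogate(L_new, kl_max, adv_pi, gamma):
    """Theorem 1 lower bound: M(pi_new) = L_pi(pi_new) - C * D_KL^max(pi, pi_new).

    L_new:  scalar, the local approximation L_pi(pi_new)
    kl_max: scalar, max_s D_KL(pi(.|s) || pi_new(.|s))
    adv_pi: (S, A) advantage table of the current policy
    """
    eps = np.max(np.abs(adv_pi))                 # epsilon = max_{s,a} |A_pi(s, a)|
    C = 4.0 * eps * gamma / (1.0 - gamma) ** 2   # penalty coefficient from Theorem 1
    return L_new - C * kl_max                    # eta(pi_new) is guaranteed >= this value
```

Because $C$ scales as $1/(1-\gamma)^{2}$, maximizing this penalized objective directly yields tiny steps, which motivates the trust-region constraint in the next section.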

4 Optimization of Parameterized Policies

  1. Approximations
    1. trust region constraint
      • the step size in Algorithm 1 could be very small
      • so change the KL penalty into a constraint
    2. average KL
      • the max operation requires traversing the whole state space
      • so use the average KL instead (a sampled estimate is sketched after this list)
        $$\bar{D}_{\mathrm{KL}}^{\rho_{old}}(\theta_{1}, \theta_{2}) := \mathbb{E}_{s \sim \rho_{old}}\left[D_{\mathrm{KL}}\left(\pi_{\theta_{1}}(\cdot \mid s) \,\|\, \pi_{\theta_{2}}(\cdot \mid s)\right)\right]$$
  2. optimization problem up to now
    $$\begin{aligned} &\underset{\theta}{\text{maximize}} \; \sum_{s} \rho_{\theta_{old}}(s) \sum_{a} \pi_{\theta}(a \mid s) A_{\theta_{old}}(s, a) \\ &\text{subject to } \bar{D}_{\mathrm{KL}}^{\rho_{old}}(\theta_{old}, \theta) \leq \delta \end{aligned}$$
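A minimal sketch of estimating the average-KL constraint by Monte Carlo for categorical policies, using states sampled while following the old policy (array names are illustrative, not from the paper):

```python
import numpy as np

def average_kl(pi_old_probs, pi_new_probs):
    """Sample estimate of D_bar_KL^{rho_old}(theta_old, theta).

    pi_old_probs, pi_new_probs: (N, A) action distributions of the old and
    new policies at N states sampled under the old policy, so averaging the
    per-state KL over rows approximates the expectation over s ~ rho_old.
    """
    per_state_kl = np.sum(
        pi_old_probs * (np.log(pi_old_probs) - np.log(pi_new_probs)), axis=1)
    return per_state_kl.mean()
```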

5 Sample-Based Estimation of the Objective and Constraint

$$\begin{aligned} &\underset{\theta}{\text{maximize}} \; \mathbb{E}_{s \sim \rho_{\theta_{old}},\, a \sim q}\left[\frac{\pi_{\theta}(a \mid s)}{q(a \mid s)} Q_{\theta_{old}}(s, a)\right] \\ &\text{subject to } \mathbb{E}_{s \sim \rho_{\theta_{old}}}\left[D_{\mathrm{KL}}\left(\pi_{\theta_{old}}(\cdot \mid s) \,\|\, \pi_{\theta}(\cdot \mid s)\right)\right] \leq \delta \end{aligned}$$

  1. $s \sim \rho_{\theta_{old}}$
    • by the definition of expectation (rewriting the sum over states weighted by $\rho_{\theta_{old}}$ as an expectation)
  2. $a \sim q$
    • by importance sampling
  3. $Q_{\theta_{old}}(s, a)$ instead of the advantage
    • by the fact that
      1. $A(s, a) = Q(s, a) - V(s)$
      2. $\sum_{a} \pi_{\theta}(a \mid s) V_{\theta_{old}}(s) = V_{\theta_{old}}(s)$ does not depend on $\theta$, so replacing $A$ with $Q$ only shifts the objective by a constant
    • so Monte Carlo returns can be used to estimate $Q$ (presumably because there is no equally simple Monte Carlo estimator for the advantage)
    • estimation methods
      1. single path (a minimal sketch follows this list)
      2. vine
        • lower-variance estimates
        • but requires many more simulator rollouts
        • limited to systems where the state can be restored to arbitrary configurations
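A minimal single-path sketch of the importance-sampled objective, assuming $q = \pi_{\theta_{old}}$ (as in the single-path scheme) and that empirical discounted returns stand in for $Q_{\theta_{old}}$; all names are illustrative:

```python
import numpy as np

def single_path_objective(logp_new, logp_old, q_estimates):
    """Monte Carlo estimate of E_{s ~ rho_old, a ~ q}[(pi_theta / q) * Q_old].

    logp_new, logp_old: (N,) log-probabilities of the actions actually taken,
        under the new and old policy parameters (single path => q = pi_old)
    q_estimates: (N,) empirical discounted returns used as Q_old(s_t, a_t)
    """
    ratios = np.exp(logp_new - logp_old)  # importance weights pi_theta / pi_old
    return np.mean(ratios * q_estimates)
```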

6 Practical Algorithm

  1. Estimate the Q-values from the collected trajectories
  2. Calculate the sample-based objective and its constraint
  3. Approximately solve the constrained problem with the conjugate gradient algorithm followed by a line search (a CG sketch appears after the references below)
    • references
      1. paper Appendix C
      2. post with theory and implementation: https://www.telesens.co/2018/06/09/efficiently-computing-the-fisher-vector-product-in-trpo/
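A sketch of the conjugate gradient routine the practical algorithm relies on: it approximately solves $F x = g$ for the step direction using only Fisher-vector products, without ever forming the Fisher matrix $F$. The `fvp` callback (e.g. computed with the double-backprop trick described in the post above) is assumed, not provided here:

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g given a function fvp(v) ~= F @ v."""
    x = np.zeros_like(g)
    r = g.copy()                 # residual r = g - F x (x starts at 0)
    p = r.copy()                 # search direction
    r_dot_r = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = r_dot_r / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        new_r_dot_r = r @ r
        if new_r_dot_r < tol:
            break
        p = r + (new_r_dot_r / r_dot_r) * p
        r_dot_r = new_r_dot_r
    return x
```

TRPO then rescales the returned direction to satisfy the KL constraint and backtracks with a line search on the surrogate objective.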

8 Experiments

  1. TRPO is related to prior methods (e.g. the natural policy gradient) but makes several changes, most notably using a fixed KL divergence constraint rather than a fixed penalty coefficient.
    • These results provide empirical evidence that constraining the KL divergence is a more robust way to choose step sizes and make fast, consistent progress, compared to using a fixed penalty.
  2. can obtain high-quality locomotion controllers from scratch, which is considered to be a hard problem.
  3. the proposed method is scalable and has strong theoretical foundations.