[RL 10] Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO (ICLR, 2020)

1 Introduction

  1. References on the brittleness of deep RL (prior work)
  2. Motivation
    • How do the multitude of mechanisms used in deep RL training algorithms impact agent behavior and, in turn, final performance?

3 Ablation study on code-level optimizations

  • Code-level optimizations (nine in total)
  • Results on the first four
    • Reward normalization, Adam learning-rate annealing, and network initialization each significantly impact performance (a minimal sketch follows this list)
  • PS
    • The learning rate needs to be tuned before experiments start
    • These optimizations can be applied to any policy gradient method
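
As a rough illustration of how three of these optimizations look in practice, here is a minimal PyTorch sketch; the class names, network sizes, and the 1,000-update horizon are assumptions for illustration, not the paper's reference implementation.

```python
# Minimal sketch of three code-level optimizations (hypothetical names,
# not the paper's reference code).
import numpy as np
import torch
import torch.nn as nn

class RewardScaler:
    """Scale rewards by a running estimate of the std of discounted returns."""
    def __init__(self, gamma=0.99, eps=1e-8):
        self.gamma, self.eps = gamma, eps
        self.ret = 0.0                      # running discounted return
        self.count, self.mean, self.m2 = 0, 0.0, 0.0

    def __call__(self, reward):
        self.ret = self.gamma * self.ret + reward
        # Welford update of the variance of the running return
        self.count += 1
        delta = self.ret - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (self.ret - self.mean)
        std = np.sqrt(self.m2 / max(self.count - 1, 1)) + self.eps
        return reward / std

def orthogonal_init(module, gain=np.sqrt(2)):
    """Orthogonal weight init with zero biases for policy/value networks."""
    if isinstance(module, nn.Linear):
        nn.init.orthogonal_(module.weight, gain=gain)
        nn.init.zeros_(module.bias)

policy = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 2))
policy.apply(orthogonal_init)

# Anneal Adam's learning rate linearly to zero over training.
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: max(1.0 - step / 1_000, 0.0))  # 1_000 updates assumed
```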

4 Algorithmic effect of code-level optimizations

  1. Algorithm core
    Enforcing a trust region is a core algorithmic property of many policy gradient methods.
  2. Trust region in TRPO and PPO
    1. TRPO
      constrains the KL divergence between successive policies
    2. PPO
      attempts to enforce a trust region via a clipped surrogate objective
      1. The trust region that is actually enforced depends heavily on the method with which the clipped PPO objective is optimized, rather than on the objective itself.
      2. Since optimization starts from the previous policy (all probability ratios equal 1), the first gradient step is identical to a step on the unclipped objective; clipping imposes no constraint on it, so the step size is determined solely by the steepness of the surrogate landscape.
      3. As a result, the policy can end up arbitrarily far outside the trust region (see the sketch after this item).
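
To make point 2 concrete, here is a tiny, hypothetical PyTorch sketch (not the paper's code) showing that the clipped objective's gradient vanishes only once the ratio has already left the clip interval, so nothing bounds the first optimization step:

```python
# Sketch: the clipped PPO objective has zero gradient only AFTER the ratio
# has left the clip interval, so the first step (ratio = 1) is unconstrained.
import torch

def clipped_objective(ratio, advantage, eps=0.2):
    """Per-sample PPO-Clip surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    return torch.minimum(ratio * advantage,
                         torch.clamp(ratio, 1 - eps, 1 + eps) * advantage)

advantage = torch.tensor(1.0)
for r in [1.0, 1.1, 1.3]:           # ratio drifting past the clip boundary
    ratio = torch.tensor(r, requires_grad=True)
    clipped_objective(ratio, advantage).backward()
    print(f"ratio={r:.1f}  grad={ratio.grad.item():.1f}")
# ratio=1.0  grad=1.0   <- first step: unclipped, size set by the landscape
# ratio=1.1  grad=1.0   <- still inside the clip region
# ratio=1.3  grad=0.0   <- gradient vanishes only once already outside
```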
  3. Results
    1. TRPO
      • precisely enforces its KL trust region
    2. PPO
      1. Both PPO and PPO-M fail to maintain a ratio-based trust region
      2. Both PPO and PPO-M nevertheless constrain the mean KL well
      3. The KL trust region that is enforced differs between PPO and PPO-M
        • While the PPO-M KL trends upward as the number of iterations increases, the PPO KL peaks halfway through training before trending down again (a KL check is sketched below).
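
For reference, a minimal sketch of how such a mean-KL check between successive diagonal-Gaussian policies could be computed; the names and shapes are assumptions:

```python
# Sketch: mean KL between successive diagonal-Gaussian policies, the
# quantity monitored above. Hypothetical example, not the paper's code.
import torch
from torch.distributions import Normal, kl_divergence

def mean_kl(mu_old, std_old, mu_new, std_new):
    """Mean KL(pi_old || pi_new) over a batch of visited states."""
    kl = kl_divergence(Normal(mu_old, std_old), Normal(mu_new, std_new))
    return kl.sum(dim=-1).mean()   # sum over action dims, average over states

# Example: action means/stds predicted on the same batch of 64 states
mu_old, mu_new = torch.zeros(64, 6), 0.05 * torch.randn(64, 6)
std = torch.ones(64, 6)
print(mean_kl(mu_old, std, mu_new, std))   # should stay below the target KL
```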

5 TRPO vs PPO

  1. PPO-M performs comparably to TRPO, and PPO performs comparably to TRPO+
  2. Code-level optimizations contribute most of PPO's improvement over TRPO
  3. PPO's clipping mechanism does not, on its own, enforce a trust region or account for the gains