Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO (ICLR, 2020)
1 Introduction
- references on the brittleness of deep RL
- motivation
- how do the multitude of mechanisms used in deep RL training algorithms impact agent behavior and thus performance?
3 Ablation study on code-level optimizations
- Code-level optimizations (nine in total)
- Results on the first four
- reward normalization, Adam learning-rate annealing, and network initialization each significantly impact performance (a minimal sketch of these three follows this list)
- PS
- the learning rate needs to be tuned before experiments start
- these optimizations can be implemented for any policy gradient method
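A minimal PyTorch-style sketch of three of these optimizations: reward scaling by a running estimate of the std of the discounted return, linear annealing of the Adam learning rate, and orthogonal network initialization. Names and hyperparameters (RewardScaler, total_updates, gain=sqrt(2)) are illustrative assumptions, not taken from the paper's released code.

```python
import numpy as np
import torch
import torch.nn as nn


class RewardScaler:
    """Scale rewards by a running estimate of the std of the discounted return."""

    def __init__(self, gamma=0.99, eps=1e-8):
        self.gamma, self.eps = gamma, eps
        self.ret = 0.0                                  # running discounted return
        self.count, self.mean, self.m2 = 0, 0.0, 0.0   # Welford accumulators

    def __call__(self, reward):
        self.ret = self.gamma * self.ret + reward
        self.count += 1
        delta = self.ret - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (self.ret - self.mean)
        std = np.sqrt(self.m2 / max(self.count - 1, 1)) + self.eps
        return reward / std


def orthogonal_init(module, gain=np.sqrt(2)):
    """Orthogonal weights and zero biases for every Linear layer."""
    if isinstance(module, nn.Linear):
        nn.init.orthogonal_(module.weight, gain=gain)
        nn.init.zeros_(module.bias)


# Illustrative policy network and optimizer setup.
policy = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 2))
policy.apply(orthogonal_init)

optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)
total_updates = 1000  # assumed training length
# Linearly anneal the Adam learning rate to zero over training.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda update: 1.0 - update / total_updates)
```

As the notes say, none of this is specific to PPO; the same pieces can be dropped into any policy gradient method.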
4 Algorithmic effect of code-level optimizations
- Algorithm core
- Enforcing a trust region is a core algorithmic property of these policy gradient methods
- Trust Region in TRPO and PPO
- TRPO
- constrains the KL divergence between successive policies
- PPO
- enforces a trust region with a different mechanism, the clipped surrogate objective
- the trust region actually enforced depends heavily on the method with which the clipped PPO objective is optimized, rather than on the objective itself (see the sketch after this section's list)
- whenever the clipped objective has a non-zero gradient, there is no trust-region constraint on the update, so the size of the step taken is determined solely by the steepness of the surrogate landscape
- as a result, the policy can end up arbitrarily far outside the trust region
- Results
- TRPO
- precisely enforces this KL trust region
- PPO
- both PPO and PPO-M fail to maintain a ratio-based trust region
- both PPO and PPO-M constrain the KL well
- KL trust region enforced differs between PPO and PPO-M
- while PPO-M KL trends up as the number of iterations increases, PPO KL peaks halfway through training before trending down again.
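To make the contrast concrete, a minimal PyTorch-style sketch of PPO's clipped surrogate loss plus the diagnostics (a sample-based KL between successive policies and the maximum probability ratio) used above to check whether a trust region actually holds. Function names and the KL estimator are illustrative assumptions, not the paper's implementation.

```python
import torch


def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective. Once the ratio leaves [1 - eps, 1 + eps]
    its gradient is zero, so nothing actively pulls the policy back inside."""
    ratio = torch.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()


def trust_region_stats(log_probs_new, log_probs_old):
    """Diagnostics for whether a trust region holds after an update:
    a sample-based estimate of KL(old || new) and the largest probability ratio."""
    ratio = torch.exp(log_probs_new - log_probs_old)
    mean_kl = (log_probs_old - log_probs_new).mean()
    return mean_kl.item(), ratio.max().item()
```

A TRPO-style update would only be accepted if the mean KL stays below a fixed threshold (on the order of 0.01); the clipped objective performs no such check, which is why the trust region it ends up enforcing depends on how the objective is optimized rather than on the objective itself.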
5 TRPO vs PPO
- PPO-M/PPO perform roughly on par with TRPO/TRPO+, respectively
- Code-level optimizations contribute most to improvement over TRPO
- PPO's clipping alone neither enforces the trust region nor explains the gains over TRPO