Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO (ICLR, 2020)
1 Introduction
- references on the brittleness of deep RL
- motivation
- how do the multitude of mechanisms used in deep RL training algorithms impact agent behavior and thus performance?
3 Ablation study on code-level optimizations
- Code-level optimizations (nine in total)
- Results on the first four
- reward normalization, Adam learning-rate annealing, and network initialization each significantly impact performance (a minimal sketch of these three follows this list)
- PS
- the learning rate needs to be tuned before experiments start
- these optimizations can be implemented for any policy gradient method
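A minimal PyTorch-style sketch of three of these optimizations: reward scaling by a running estimate of the std of the discounted return, linear annealing of the Adam learning rate, and orthogonal network initialization. Names and hyperparameters (RewardScaler, total_updates, gain=sqrt(2)) are illustrative assumptions, not taken from the paper's released code.

```python
import numpy as np
import torch
import torch.nn as nn


class RewardScaler:
    """Scale rewards by a running estimate of the std of the discounted return."""

    def __init__(self, gamma=0.99, eps=1e-8):
        self.gamma, self.eps = gamma, eps
        self.ret = 0.0                                  # running discounted return
        self.count, self.mean, self.m2 = 0, 0.0, 0.0   # Welford accumulators

    def __call__(self, reward):
        self.ret = self.gamma * self.ret + reward
        self.count += 1
        delta = self.ret - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (self.ret - self.mean)
        std = np.sqrt(self.m2 / max(self.count - 1, 1)) + self.eps
        return reward / std


def orthogonal_init(module, gain=np.sqrt(2)):
    """Orthogonal weights and zero biases for every Linear layer."""
    if isinstance(module, nn.Linear):
        nn.init.orthogonal_(module.weight, gain=gain)
        nn.init.zeros_(module.bias)


# Illustrative policy network and optimizer setup.
policy = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 2))
policy.apply(orthogonal_init)

optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)
total_updates = 1000  # assumed training length
# Linearly anneal the Adam learning rate to zero over training.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda update: 1.0 - update / total_updates)
```

As the notes say, none of this is specific to PPO; the same pieces can be dropped into any policy gradient method.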
4 Algorithmic effect of code-level optimizations
- Algorithm core
- Enforcing a trust region is a core algorithmic property of these policy gradient methods
- Trust Region in TRPO and PPO
- TRPO
- constrains the KL divergence between successive policies
- PPO
- enforces a trust region with a different mechanism, the clipped surrogate objective
- the trust region actually enforced depends heavily on the method with which the clipped PPO objective is optimized, rather than on the objective itself (see the sketch after this section's list)
- whenever the clipped objective has a non-zero gradient, there is no trust-region constraint on the update, so the size of the step taken is determined solely by the steepness of the surrogate landscape
- as a result, the policy can end up arbitrarily far outside the trust region
- Results
- TRPO
- precisely enforces this KL trust region
- PPO
- both PPO and PPO-M fail to maintain a ratio-based trust region
- both PPO and PPO-M constrain the KL well
- KL trust region enforced differs between PPO and PPO-M
- while PPO-M KL trends up as the number of iterations increases, PPO KL peaks halfway through training before trending down again.
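To make the contrast concrete, a minimal PyTorch-style sketch of PPO's clipped surrogate loss plus the diagnostics (a sample-based KL between successive policies and the maximum probability ratio) used above to check whether a trust region actually holds. Function names and the KL estimator are illustrative assumptions, not the paper's implementation.

```python
import torch


def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective. Once the ratio leaves [1 - eps, 1 + eps]
    its gradient is zero, so nothing actively pulls the policy back inside."""
    ratio = torch.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()


def trust_region_stats(log_probs_new, log_probs_old):
    """Diagnostics for whether a trust region holds after an update:
    a sample-based estimate of KL(old || new) and the largest probability ratio."""
    ratio = torch.exp(log_probs_new - log_probs_old)
    mean_kl = (log_probs_old - log_probs_new).mean()
    return mean_kl.item(), ratio.max().item()
```

A TRPO-style update would only be accepted if the mean KL stays below a fixed threshold (on the order of 0.01); the clipped objective performs no such check, which is why the trust region it ends up enforcing depends on how the objective is optimized rather than on the objective itself.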
5 TRPO vs PPO
- PPO-M/PPO perform roughly on par with TRPO/TRPO+, respectively
- Code-level optimizations contribute most to improvement over TRPO
- PPO's clipping alone neither enforces the trust region nor explains the gains over TRPO