What are the problems and challenges?
There are many sources of instability and variance that can make deep policy gradient methods such as DDPG and TRPO difficult to reproduce.
What’s the proposed solution?
Common Hyper-Parameters
The authors investigate two policy gradient algorithms: DDPG and TRPO. Four performance measures commonly used in the literature are considered: Maximum Average Return, Maximum Return, Standard Deviation of Returns, and Average Return. They also highlight the difficulty of properly fine-tuning hyper-parameter settings, which leads to large variations in reported results across works as different hyper-parameters are used.
Policy Network Architecture: They use two MuJoCo physics simulator tasks from OpenAI Gym (Hopper-v1 and Half-Cheetah-v1) as their experimental tasks. The network architecture performs quite differently across environments, and the architecture they found to be best, (400, 300), is not the one used to report baseline results in [1, 2]. Additionally, DDPG is quite unstable regardless of the network architecture.
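To make the (400, 300) architecture concrete, here is a minimal numpy sketch of such an actor network: two hidden layers of 400 and 300 units with a tanh output squashing actions to [-1, 1]. The initialization scheme and activation choices are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def init_actor(obs_dim, act_dim, hidden=(400, 300), seed=0):
    """Initialize weights for a (400, 300) actor MLP.

    Layer sizes match the architecture discussed above; the scaled-normal
    initialization is a simple illustrative choice.
    """
    rng = np.random.default_rng(seed)
    sizes = [obs_dim, *hidden, act_dim]
    return [(rng.standard_normal((m, n)) / np.sqrt(m), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def actor_forward(params, obs):
    """ReLU hidden layers, tanh output keeping actions in [-1, 1]."""
    h = obs
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)
    W, b = params[-1]
    return np.tanh(h @ W + b)
```

For example, Hopper-v1 has an 11-dimensional observation space and a 3-dimensional action space, so the network would be 11 → 400 → 300 → 3.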
Batch Size: For DDPG, a mini-batch size of 128 performs best; it is intuitive that a larger batch size would perform better in this time frame, as more samples are seen.
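The mini-batch in DDPG is drawn uniformly from a replay buffer; the following is a minimal illustrative ring-buffer sketch (not the authors' implementation) with the 128 batch size noted above as the default.

```python
import numpy as np

class ReplayBuffer:
    """Minimal fixed-capacity replay buffer for (s, a, r, s_next, done)
    transitions; batch_size=128 mirrors the best-performing DDPG setting."""
    def __init__(self, capacity=1_000_000):
        self.storage, self.capacity, self.pos = [], capacity, 0

    def add(self, transition):
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            # Overwrite the oldest transition once full.
            self.storage[self.pos] = transition
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size=128):
        idx = np.random.randint(len(self.storage), size=batch_size)
        return [np.array(x) for x in zip(*(self.storage[i] for i in idx))]
```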
TRPO
Regularization Coefficient: The regularization coefficient (RC) (or conjugate gradient damping factor) makes little difference for Half-Cheetah-v1, though it has a more significant effect on Hopper.
Generalized Advantage Estimation: They compare λ = 1.0 and λ = 0.97. They find that over longer training runs, the lower GAE λ does in fact improve results in Half-Cheetah, and mildly in Hopper.
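As a reference for what λ controls, here is a standard GAE computation sketch: advantages are discounted sums of TD residuals, and λ = 1.0 reduces to plain Monte-Carlo advantages while lower λ trades variance for bias.

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.97):
    """Generalized Advantage Estimation over one trajectory.

    `values` has one extra entry for the bootstrap value of the final
    state. lam=1.0 gives Monte-Carlo advantages; lam=0.97 is the lower
    setting compared above.
    """
    deltas = rewards + gamma * values[1:] - values[:-1]  # TD residuals
    adv = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv
```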
Step Size: The step size (SS), effectively the learning rate of TRPO, is the KL-divergence bound for the conjugate gradient steps. The default value of 0.01 generally works best for both Hopper and Half-Cheetah.
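The quantity that TRPO constrains to this bound is the KL divergence between the old and updated policies. For the diagonal-Gaussian policies typically used in continuous control, it can be computed in closed form; the following sketch shows that computation (the exact averaging over states in TRPO is omitted here).

```python
import numpy as np

def diag_gauss_kl(mu0, std0, mu1, std1):
    """KL(p0 || p1) between diagonal Gaussian policies, summed over
    action dimensions. TRPO requires the (state-averaged) KL of each
    update to stay within the step-size bound, 0.01 by default."""
    var0, var1 = std0**2, std1**2
    return np.sum(np.log(std1 / std0)
                  + (var0 + (mu0 - mu1)**2) / (2.0 * var1)
                  - 0.5)
```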
DDPG
Reward Scale: The rewards for some tasks were rescaled by a constant factor to improve the stability of DDPG.
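Reward rescaling is usually applied as a thin environment wrapper. A hypothetical minimal sketch (the class name, the wrapped-environment interface, and the example scale of 0.1 are illustrative assumptions, not the paper's exact values):

```python
class RewardScaleWrapper:
    """Multiply every reward from the wrapped environment by `scale`.

    Illustrative sketch; assumes a Gym-style step() returning
    (obs, reward, done, info)."""
    def __init__(self, env, scale=0.1):
        self.env, self.scale = env, scale

    def reset(self):
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return obs, reward * self.scale, done, info
```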
Actor-Critic Learning Rates: They use learning rates of 0.001 and 0.0001 for the critic and actor, respectively. Interestingly, they find that the actor and critic learning rates for DDPG have less of an effect on the Hopper environment than on the Half-Cheetah environment. This suggests that, with other parameters fixed, DDPG is not only sensitive to the learning rates but is also subject to other sources of variation and randomness in the algorithm.
General Variance
Their results show that for both DDPG and TRPO, two different averages over 5 experiment runs do not necessarily produce the same result; in fact, the obtained results have high variance. This emphasizes the need to average many runs together when reporting results, using a different random seed for each. In this way, future works should attempt to negate the effect of random seeds and environment stochasticity when reporting their results.
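Aggregating over seeds as recommended above can be sketched in a few lines: report the mean and standard deviation of final returns across runs rather than a single run's number (the helper name is illustrative).

```python
import numpy as np

def summarize_runs(returns_per_seed):
    """Aggregate evaluation returns across runs with different random
    seeds, reporting mean and sample standard deviation as the text
    recommends instead of a single run's result."""
    arr = np.asarray(returns_per_seed, dtype=float)
    return arr.mean(), arr.std(ddof=1)
```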
Conclusion
Their analysis shows that, due to under-reporting of hyper-parameters, different works often report different baseline results and performance measures for both TRPO and DDPG. This leads to unfair comparisons of baselines in continuous control environments. Their experiments, together with the provided hyper-parameter presets, can help researchers fine-tune these algorithms.
source code: https://github.com/Breakend/ReproducibilityInContinuousPolicyGradientMethods
References:
[1] Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E Turner, and Sergey Levine. Q-prop: Sample-efficient policy gradient with an off-policy critic. arXiv preprint arXiv:1611.02247, 2016.
[2] Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E Turner, Bernhard Schölkopf, and Sergey Levine. Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. arXiv preprint arXiv:1706.00387, 2017.