PR17.10.2: Reproducibility of Benchmarked Deep Reinforcement Learning Tasks for Continuous Control

What are the problems and challenges?

There are many sources of possible instability and variance that can lead to difficulties with reproducing deep policy gradient methods such as DDPG and TRPO.

What’s the proposed solution?

Common Hyper-Parameters

The authors investigate two policy gradient algorithms: DDPG and TRPO. Four performance measures commonly used in the literature are considered: Maximum Average Return, Maximum Return, Standard Deviation of Returns, and Average Return. They also highlight that properly fine-tuning hyper-parameter settings can be difficult, which leads to large variations in reported results across a wide range of works as different hyper-parameters are used.
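
As a rough illustration of how these measures can be computed from logged training curves, the sketch below assumes each run records one average return per training iteration; the exact definitions in the paper may differ slightly.

```python
import numpy as np

def summarize_returns(avg_returns_per_run):
    """One plausible reading of the four measures, given an array of shape
    (n_runs, n_iterations) holding the average return logged per iteration."""
    curves = np.asarray(avg_returns_per_run, dtype=np.float64)
    mean_curve = curves.mean(axis=0)  # per-iteration average across runs
    return {
        "average_return": mean_curve[-1],             # final averaged return
        "max_average_return": mean_curve.max(),       # peak of the averaged curve
        "max_return": curves.max(),                   # best return seen in any run
        "std_of_returns": curves.std(axis=0).mean(),  # typical spread across runs
    }
```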

Policy Network Architecture: They use two MuJoCo physics simulator tasks from OpenAI Gym (Hopper-v1 and HalfCheetah-v1) as their experimental tasks. The network architecture performs quite differently across environments, and the architecture they found to perform best, (400, 300), is not the one used to report baseline results in [1, 2]. Additionally, DDPG is quite unstable regardless of the network architecture.
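
A minimal sketch of a (400, 300) actor network in PyTorch is shown below; the activation functions and output squashing are assumptions based on common DDPG implementations, not necessarily the exact baseline code.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy with (400, 300) hidden layers (a sketch;
    ReLU hidden activations and a tanh output are assumed here)."""
    def __init__(self, obs_dim, act_dim, act_limit=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, act_dim), nn.Tanh(),  # actions squashed to [-1, 1]
        )
        self.act_limit = act_limit

    def forward(self, obs):
        return self.act_limit * self.net(obs)
```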

Batch Size: For DDPG, a mini-batch size of 128 performs best; it is intuitive that a larger batch size performs better in this time frame, since more samples are seen.
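
A sketch of how a size-128 minibatch might be drawn from a replay buffer; the buffer layout used here (a list of transition tuples) is an assumption.

```python
import numpy as np

def sample_minibatch(replay_buffer, batch_size=128):
    """Sample a random minibatch of (s, a, r, s', done) transitions from a
    list-like replay buffer; 128 was the best-performing size reported."""
    idx = np.random.randint(0, len(replay_buffer), size=batch_size)
    batch = [replay_buffer[i] for i in idx]
    states, actions, rewards, next_states, dones = map(np.asarray, zip(*batch))
    return states, actions, rewards, next_states, dones
```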

TRPO

Regularization Coefficient: The regularization coefficient (or conjugate gradient damping factor) makes little difference for Half-Cheetah, though it seems to have a more significant effect on Hopper.
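
For reference, conjugate gradient damping adds a multiple of the identity to the Fisher matrix so the linear solve is better conditioned; a sketch is below, with an illustrative (assumed) damping value.

```python
def damped_fisher_vector_product(fvp, v, damping=0.1):
    """CG damping: return (F + damping * I) v instead of F v, where `fvp`
    is a function computing the Fisher-vector product F v.
    The damping value here is illustrative, not the baselines' setting."""
    return fvp(v) + damping * v
```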

Generalized Advantage Estimation: They investigate λ = 1.0 and λ = 0.97. They find that over longer training runs, the lower GAE λ does in fact improve results on Half-Cheetah and, mildly, on Hopper.
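
A minimal sketch of GAE for a single trajectory, showing where λ enters; the discount γ used below is an assumed value, not one stated above.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.97):
    """Generalized Advantage Estimation for one trajectory.
    `values` holds V(s_0..s_T), one entry more than `rewards`.
    lam=1.0 gives the high-variance Monte-Carlo estimate;
    lam=0.97 trades a little bias for lower variance."""
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages
```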

Step Size: The step size (effectively the learning rate of TRPO) is the KL-divergence bound for the conjugate gradient step. The default value of 0.01 generally works best for both Hopper and Half-Cheetah.
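
Putting the TRPO settings above together, a hypothetical preset might look like the following; only the step size and GAE λ come from the text, the rest are common defaults and should be treated as assumptions.

```python
# Hypothetical TRPO preset; values not quoted in the text are assumptions.
TRPO_PRESET = {
    "step_size": 0.01,   # KL-divergence bound; the default worked best here
    "cg_damping": 0.1,   # regularization coefficient (assumed value)
    "gae_lambda": 0.97,  # lower lambda helped on longer runs
    "discount": 0.99,    # gamma (assumed)
}
```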

DDPG

Reward Scale: The rewards for some tasks are rescaled by a constant factor to improve the stability of DDPG.
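
A sketch of reward rescaling as a Gym wrapper; the scale factor is task-dependent, and the value below is only illustrative.

```python
import gym

class RewardScale(gym.RewardWrapper):
    """Multiply environment rewards by a constant factor before they reach
    DDPG (a sketch; the scale value here is assumed, not the paper's)."""
    def __init__(self, env, scale=0.1):
        super().__init__(env)
        self.scale = scale

    def reward(self, reward):
        return self.scale * reward
```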

Actor-Critic Learning Rates: They use 0.001 and 0.0001 for the critic and actor, respectively. Interestingly, the actor and critic learning rates have less of an effect on the Hopper environment than on the Half-Cheetah environment. This suggests that, keeping other parameters fixed, DDPG is not only susceptible to the learning rates; there are also other sources of variation and randomness in the DDPG algorithm.
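
The quoted learning rates translate directly into the optimizer setup; a sketch assuming PyTorch Adam optimizers (which the baselines may or may not use) is below.

```python
import torch

def make_ddpg_optimizers(actor, critic, actor_lr=1e-4, critic_lr=1e-3):
    """Build optimizers with the learning rates quoted above
    (critic 0.001, actor 0.0001); `actor` and `critic` are nn.Modules."""
    actor_opt = torch.optim.Adam(actor.parameters(), lr=actor_lr)
    critic_opt = torch.optim.Adam(critic.parameters(), lr=critic_lr)
    return actor_opt, critic_opt
```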

General Variance

Their results show that for both DDPG and TRPO, two different averages across 5 experiment runs do not necessarily produce the same result; in fact, there is high variance in the obtained results. This emphasizes the need to average many runs together, with a different random seed for each, when reporting results. Future work should attempt to negate the effect of random seeds and environment stochasticity when reporting results.
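
A sketch of the kind of multi-seed evaluation this suggests, where `run_experiment` is a hypothetical callable mapping a seed to a final average return.

```python
import numpy as np

def evaluate_over_seeds(run_experiment, seeds):
    """Run the same configuration under several random seeds and report the
    mean and standard deviation of the final returns."""
    returns = np.array([run_experiment(seed) for seed in seeds])
    return returns.mean(), returns.std()

# e.g. evaluate_over_seeds(run_experiment, seeds=[0, 1, 2, 3, 4])
```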

Conclusion

Their analysis shows that, due to under-reporting of hyper-parameters, different works often report different baseline results and performance measures for both TRPO and DDPG. This leads to unfair comparisons of baselines in continuous control environments. Their experiments can help researchers fine-tune these algorithms using the provided hyper-parameter presets.

Source code: https://github.com/Breakend/ReproducibilityInContinuousPolicyGradientMethods
References:
[1] Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E Turner, and Sergey Levine. Q-prop: Sample-efficient policy gradient with an off-policy critic. arXiv preprint arXiv:1611.02247, 2016.

[2] Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E Turner, Bernhard Schölkopf, and Sergey Levine. Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. arXiv preprint arXiv:1706.00387, 2017.
