[ICML 2015] Massively Parallel Methods for Deep Reinforcement Learning

http://arxiv.org/abs/1507.04296

Introduction

Existing work on distributed deep learning has focused exclusively on supervised and unsupervised learning. In this paper we develop a new architecture for the reinforcement learning paradigm, which we call Gorila (General Reinforcement Learning Architecture), aiming to solve a single-agent problem more efficiently by exploiting parallel computation.

Distributed Architecture

[Figure: the Gorila distributed architecture]
The architecture consists of four components:

  • Actor - Generates experience by acting in its own instance of the environment.
  • Replay Memory
    • First, a local replay memory stores each actor’s experience locally on that actor’s machine.
    • Second, a global replay memory aggregates the experience into a distributed database.
  • Learner
    • Generates gradients and sends them to the parameter server.
    • Periodically receives the parameters of the Q-network from the parameter server.
  • Parameter server - Applies the gradients from the learners to modify the parameters, using an asynchronous stochastic gradient descent algorithm.

Bundled mode. The simplest overall instantiation of Gorila, in which there is a one-to-one correspondence between actors, replay memory, and learners. The only communication between bundles is via parameters: the learners communicate their gradients to the parameter server, and the Q-networks in the actors and learners are periodically synchronized with the parameter server.
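To make the bundled data flow concrete, here is a minimal single-process sketch of one bundle (an actor, a replay shard, and a learner) talking to a parameter server. The class and method names, the toy linear Q-function, and the placeholder environment are illustrative inventions, not details from the paper.

```python
# Minimal sketch of one Gorila bundle in bundled mode. All names and the
# toy environment/model are illustrative, not from the paper.
import random
import numpy as np

class ParameterServer:
    """Holds the central parameters theta+ and applies gradients (async SGD)."""
    def __init__(self, dim, lr=0.01):
        self.theta = np.zeros(dim)
        self.lr = lr

    def apply_gradient(self, grad):
        self.theta -= self.lr * grad          # one asynchronous SGD step

    def get_parameters(self):
        return self.theta.copy()

class ReplayMemory:
    """Local replay shard; conceptually part of a global, distributed memory."""
    def __init__(self, capacity=10_000):
        self.buffer, self.capacity = [], capacity

    def add(self, transition):
        self.buffer.append(transition)
        if len(self.buffer) > self.capacity:
            self.buffer.pop(0)

    def sample(self, k):
        return random.sample(self.buffer, min(k, len(self.buffer)))

class Actor:
    """Generates experience with a local copy of the Q-network parameters."""
    def __init__(self, server, memory, dim):
        self.server, self.memory, self.dim = server, memory, dim

    def act_step(self):
        theta = self.server.get_parameters()  # sync before acting
        s = np.random.randn(self.dim)         # placeholder environment state
        a, r = 0, float(theta @ s)            # placeholder action and reward
        s_next = np.random.randn(self.dim)
        self.memory.add((s, a, r, s_next))

class Learner:
    """Samples experience, computes gradients, ships them to the server."""
    def __init__(self, server, memory, dim):
        self.server, self.memory, self.dim = server, memory, dim

    def learn_step(self):
        theta = self.server.get_parameters()  # sync local Q-network copy
        batch = self.memory.sample(32)
        if not batch:
            return
        grad = np.zeros(self.dim)
        for s, a, r, s_next in batch:         # placeholder squared-error gradient
            td_error = r - theta @ s          # (stands in for the DQN gradient)
            grad += -2.0 * td_error * s
        self.server.apply_gradient(grad / len(batch))

# One bundle: the actor, its replay shard and its learner share one machine;
# the only cross-bundle communication is through the parameter server.
server, memory = ParameterServer(dim=4), ReplayMemory()
actor, learner = Actor(server, memory, 4), Learner(server, memory, 4)
for _ in range(100):
    actor.act_step()
    learner.learn_step()
```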

Gorila DQN. The DQN algorithm is extended to the distributed implementation in Gorila as follows. The parameter server maintains the current parameters $\theta^+$, and the actors and learners contain replicas of the current Q-network $Q(s, a; \theta)$ that are synchronized from the parameter server before every acting step. The learner additionally maintains the target Q-network $Q(s, a; \theta^-)$, whose parameters are updated from the parameter server's $\theta^+$ after every $N$ gradient updates in the central parameter server.
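A compact sketch of that schedule, using a linear Q-function and synthetic transitions: only the bookkeeping of the two parameter sets (current $\theta^+$ versus frozen $\theta^-$) mirrors the description above, while $\gamma$, the learning rate, the data, and the value of $N$ used here are illustrative (the experiments described below use 60K).

```python
# Learner-side update with a target network; linear Q-function and synthetic
# transitions, so gamma, the learning rate and N below are illustrative only.
import numpy as np

gamma, N, lr = 0.99, 10_000, 1e-3
theta_plus = np.random.randn(4, 2) * 0.1   # current parameters (from the server)
theta_minus = theta_plus.copy()            # frozen target-network parameters

def q_values(theta, s):
    return s @ theta                       # Q(s, .; theta) for a linear model

def dqn_gradient(theta, theta_tgt, s, a, r, s_next):
    # The bootstrap target uses the frozen parameters theta^-.
    y = r + gamma * np.max(q_values(theta_tgt, s_next))
    td_error = y - q_values(theta, s)[a]
    grad = np.zeros_like(theta)
    grad[:, a] = -2.0 * td_error * s       # d/dtheta of (y - Q(s, a; theta))^2
    return grad

for t in range(1, 30_001):
    s, s_next = np.random.randn(4), np.random.randn(4)
    a, r = np.random.randint(2), np.random.randn()
    g = dqn_gradient(theta_plus, theta_minus, s, a, r, s_next)
    theta_plus -= lr * g                   # in Gorila this step runs on the server
    if t % N == 0:                         # refresh the target every N updates
        theta_minus = theta_plus.copy()
```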

Stability. While the DQN training algorithm was designed to ensure stability of training neural networks with reinforcement learning, training using a large cluster of machines running multiple other tasks poses additional challenges. The Gorila DQN implementation uses additional safeguards to ensure stability in the presence of disappearing nodes, slowdowns in network traffic, and slowdowns of individual machines. One such safeguard is a parameter that determines the maximum time delay between the local parameters and the parameters in the parameter server. All gradients older than the threshold are discarded by the parameter server. Additionally, each actor/learner keeps a running average and standard deviation of the absolute DQN loss for the data it sees and discards gradients with absolute loss higher than the mean plus several standard deviations. Finally, we used the AdaGrad update rule.
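The two gradient-rejection safeguards lend themselves to a short sketch. The paper does not give implementation details, so the version-gap reading of the staleness threshold, the warm-up period, and all constants below are assumptions made for illustration.

```python
# Sketches of the two safeguards described above; thresholds, the warm-up
# period and the version-gap staleness test are illustrative assumptions.
import random

class StalenessFilter:
    """Parameter-server side: reject gradients computed from parameters that
    are too far behind the current ones (the paper states a maximum time
    delay; a parameter-version gap is used here as a stand-in)."""
    def __init__(self, max_lag=10):
        self.max_lag, self.version = max_lag, 0

    def record_update(self):
        self.version += 1                  # call after each applied gradient

    def accept(self, grad_version):
        return (self.version - grad_version) <= self.max_lag

class LossOutlierFilter:
    """Actor/learner side: track a running mean/std of the absolute DQN loss
    and discard gradients whose loss exceeds mean + k * std."""
    def __init__(self, k=4.0, min_samples=50):
        self.k, self.min_samples = k, min_samples
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def accept(self, abs_loss):
        ok = True                          # accept everything during warm-up
        if self.n >= self.min_samples:
            std = (self.m2 / self.n) ** 0.5
            ok = abs_loss <= self.mean + self.k * std
        # Welford's online update of the running mean and variance.
        self.n += 1
        delta = abs_loss - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (abs_loss - self.mean)
        return ok

# Tiny demonstration with synthetic versions and losses.
staleness = StalenessFilter(max_lag=10)
for _ in range(100):
    staleness.record_update()
print(staleness.accept(95), staleness.accept(50))    # True False

random.seed(0)
outliers = LossOutlierFilter()
for _ in range(200):
    outliers.accept(abs(random.gauss(1.0, 0.1)))
print(outliers.accept(1.05), outliers.accept(25.0))  # True False
```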

Experiments

We evaluated Gorila by conducting experiments on 49 Atari 2600 games using the Arcade Learning Environment.

In all experiments, Gorila DQN used $N_{param} = 31$ and $N_{learn} = N_{act} = 100$, in the bundled mode. The replay memory size was $D = 1$ million frames, and $\epsilon$-greedy was used as the behaviour policy, with $\epsilon$ annealed from 1 to 0.1 over the first one million global updates. Each learner syncs the parameters $\theta^-$ of its target network after every 60K parameter updates performed in the parameter server.
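Gathered in one place, these settings might look like the config below. The key names and the gloss of $N_{param}$ as the number of parameter-server processes are my own; the values are the ones quoted above.

```python
# Experiment settings from the text, collected as a plain dict; key names
# (and the parameter-server gloss of N_param) are illustrative.
gorila_dqn_config = {
    "mode": "bundled",
    "N_param": 31,                       # parameter server processes (assumed reading)
    "N_learn": 100,                      # learners
    "N_act": 100,                        # actors
    "replay_memory_size": 1_000_000,     # D, in frames
    "behaviour_policy": "epsilon-greedy",
    "epsilon_initial": 1.0,
    "epsilon_final": 0.1,
    "epsilon_anneal_steps": 1_000_000,   # global updates
    "target_sync_every": 60_000,         # server updates between theta^- syncs
}
```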

[Figure: experimental results on the 49 Atari games]
