【A2C】Asynchronous Methods for Deep Reinforcement Learning

Asynchronous Methods for Deep Reinforcement Learning


Submitted on 4 February 2016

https://arxiv.org/abs/1602.01783

Abstract 


        We propose a conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent for optimization of deep neural network controllers. We present asynchronous variants of four standard reinforcement learning algorithms and show that parallel actor-learners have a stabilizing effect on training allowing all four methods to successfully train neural network controllers. The best performing method, an asynchronous variant of actor-critic, surpasses the current state-of-the-art on the Atari domain while training for half the time on a single multi-core CPU instead of a GPU. Furthermore, we show that asynchronous actor-critic succeeds on a wide variety of continuous motor control problems as well as on a new task of navigating random 3D mazes using a visual input.

1 Introduction 


        Deep neural networks provide rich representations that can enable reinforcement learning (RL) algorithms to perform effectively. However, it was previously thought that the combination of simple online RL algorithms with deep neural networks was fundamentally unstable. Instead, a variety of solutions have been proposed to stabilize the algorithm (Riedmiller, 2005; Mnih et al., 2013; 2015; Van Hasselt et al., 2015; Schulman et al., 2015a). These approaches share a common idea: the sequence of observed data encountered by an online RL agent is non-stationary, and online RL updates are strongly correlated. By storing the agent’s data in an experience replay memory, the data can be batched (Riedmiller, 2005; Schulman et al., 2015a) or randomly sampled (Mnih et al., 2013; 2015; Van Hasselt et al., 2015) from different time-steps. Aggregating over memory in this way reduces non-stationarity and decorrelates updates, but at the same time limits the methods to off-policy reinforcement learning algorithms.

        Deep RL algorithms based on experience replay have achieved unprecedented success in challenging domains such as Atari 2600. However, experience replay has several drawbacks: it uses more memory and computation per real interaction; and it requires off-policy learning algorithms that can update from data generated by an older policy.

        In this paper we provide a very different paradigm for deep reinforcement learning. Instead of experience replay, we asynchronously execute multiple agents in parallel, on multiple instances of the environment. This parallelism also decorrelates the agents’ data into a more stationary process, since at any given time-step the parallel agents will be experiencing a variety of different states. This simple idea enables a much larger spectrum of fundamental on-policy RL algorithms, such as Sarsa, n-step methods, and actor-critic methods, as well as off-policy RL algorithms such as Q-learning, to be applied robustly and effectively using deep neural networks.

        Our parallel reinforcement learning paradigm also offers practical benefits. Whereas previous approaches to deep reinforcement learning rely heavily on specialized hardware such as GPUs (Mnih et al., 2015; Van Hasselt et al., 2015; Schaul et al., 2015) or massively distributed architectures (Nair et al., 2015), our experiments run on a single machine with a standard multi-core CPU. When applied to a variety of Atari 2600 domains, on many games asynchronous reinforcement learning achieves better results, in far less time than previous GPU-based algorithms, using far less resource than massively distributed approaches. The best of the proposed methods, asynchronous advantage actor-critic (A3C), also mastered a variety of continuous motor control tasks as well as learned general strategies for exploring 3D mazes purely from visual inputs. We believe that the success of A3C on both 2D and 3D games, discrete and continuous action spaces, as well as its ability to train feedforward and recurrent agents makes it the most general and successful reinforcement learning agent to date.

2 Related Work 


        The General Reinforcement Learning Architecture (Gorila) of (Nair et al., 2015) performs asynchronous training of reinforcement learning agents in a distributed setting. In Gorila, each process contains an actor that acts in its own copy of the environment, a separate replay memory, and a learner that samples data from the replay memory and computes gradients of the DQN loss (Mnih et al., 2015) with respect to the policy parameters. The gradients are asynchronously sent to a central parameter server which updates a central copy of the model. The updated policy parameters are sent to the actor-learners at fixed intervals. By using 100 separate actor-learner processes and 30 parameter server instances, a total of 130 machines, Gorila was able to significantly outperform DQN over 49 Atari games. On many games Gorila reached the score achieved by DQN over 20 times faster than DQN. We also note that a similar way of parallelizing DQN was proposed by (Chavez et al., 2015).

        In earlier work, (Li & Schuurmans, 2011) applied the Map Reduce framework to parallelizing batch reinforcement learning methods with linear function approximation. Parallelism was used to speed up large matrix operations but not to parallelize the collection of experience or stabilize learning. (Grounds & Kudenko, 2008) proposed a parallel version of the Sarsa algorithm that uses multiple separate actor-learners to accelerate training. Each actor-learner learns separately and periodically sends updates to weights that have changed significantly to the other learners using peer-to-peer communication.

        (Tsitsiklis, 1994) studied convergence properties of Q-learning in the asynchronous optimization setting. These results show that Q-learning is still guaranteed to converge when some of the information is outdated as long as outdated information is always eventually discarded and several other technical assumptions are satisfied. Even earlier, (Bertsekas, 1982) studied the related problem of distributed dynamic programming.

Another related area of work is in evolutionary methods, which are often straightforward to parallelize by distributing fitness evaluations over multiple machines or threads (Tomassini, 1999). Such parallel evolutionary approaches have recently been applied to some visual reinforcement learning tasks. In one example, (Koutník et al., 2014) evolved convolutional neural network controllers for the TORCS driving simulator by performing fitness evaluations on 8 CPU cores in parallel.

3 Reinforcement Learning Background

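        As a brief recap of the standard setting that Section 4 relies on (the notation below follows common usage rather than quoting the paper): at each step t the agent observes a state s_t, selects an action a_t under a policy π, and receives a reward r_t; the goal is to maximize the expected discounted return.

```latex
% Brief recap of the standard discounted RL setting (common notation, not quoted from the paper).
R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}                              % discounted return, \gamma \in (0, 1]
Q^{\pi}(s, a) = \mathbb{E}\!\left[ R_t \mid s_t = s,\ a_t = a \right]   % action-value function
V^{\pi}(s)    = \mathbb{E}\!\left[ R_t \mid s_t = s \right]             % state-value function
y = r + \gamma \max_{a'} Q(s', a'; \theta^{-})                          % one-step Q-learning target
\nabla_{\theta} \log \pi(a_t \mid s_t; \theta)\, A(s_t, a_t),
\quad A(s_t, a_t) = Q(s_t, a_t) - V(s_t)                                % advantage actor-critic update direction
```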

 

 

4 Asynchronous RL Framework


        We now present multi-threaded asynchronous variants of one-step Sarsa, one-step Q-learning, n-step Q-learning, and advantage actor-critic. The aim in designing these methods was to find RL algorithms that can train deep neural network policies reliably and without large resource requirements. While the underlying RL methods are quite different, with actor-critic being an on-policy policy search method and Q-learning being an off-policy value-based method, we use two main ideas to make all four algorithms practical given our design goal.

        First, we use asynchronous actor-learners, similarly to the Gorila framework (Nair et al., 2015), but instead of using separate machines and a parameter server, we use multiple CPU threads on a single machine. Keeping the learners on a single machine removes the communication costs of sending gradients and parameters and enables us to use Hogwild! (Recht et al., 2011) style updates for training.
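        The sketch below is a minimal illustration (not the paper's code) of what Hogwild!-style training looks like in practice: several workers share one set of parameters in memory and apply their gradient updates without any locking. A dummy regression loss stands in for the RL losses discussed in this section, and the network size, learning rate, and worker count are arbitrary choices.

```python
import torch
import torch.multiprocessing as mp
import torch.nn as nn
import torch.optim as optim


def worker(shared_model, n_updates=100):
    # Each worker computes gradients on its own (here: random) data and applies
    # them directly to the shared parameters without any locking, Hogwild! style.
    optimizer = optim.SGD(shared_model.parameters(), lr=1e-2)
    for _ in range(n_updates):
        x = torch.randn(32, 4)   # stand-in for locally gathered experience
        y = torch.randn(32, 2)
        loss = nn.functional.mse_loss(shared_model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()         # in-place, lock-free update of the shared parameters


if __name__ == "__main__":
    model = nn.Linear(4, 2)
    model.share_memory()         # put the parameters in shared memory
    workers = [mp.Process(target=worker, args=(model,)) for _ in range(4)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()
```

Races between workers are tolerated rather than prevented; giving up that coordination is precisely what removes the communication and locking overhead.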

        Second, we make the observation that multiple actor-learners running in parallel are likely to be exploring different parts of the environment. Moreover, one can explicitly use different exploration policies in each actor-learner to maximize this diversity. By running different exploration policies in different threads, the overall changes being made to the parameters by multiple actor-learners applying online updates in parallel are likely to be less correlated in time than a single agent applying online updates. Hence, we do not use a replay memory and rely on parallel actors employing different exploration policies to perform the stabilizing role undertaken by experience replay in the DQN training algorithm.
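        As a concrete illustration of this idea (a sketch only, with illustrative epsilon values rather than the paper's tuned ones), each actor-learner thread can draw its own final exploration rate at start-up and follow its own annealing schedule, so that no two threads behave identically:

```python
import numpy as np


def make_exploration_policy(seed, anneal_steps=1_000_000):
    # Give each actor-learner thread its own epsilon-greedy schedule. Drawing the
    # final epsilon from a small set (values here are illustrative) keeps the
    # threads' behaviour, and hence their parameter updates, decorrelated.
    rng = np.random.default_rng(seed)
    final_eps = rng.choice([0.5, 0.1, 0.01])

    def act(q_values, step):
        # Anneal epsilon linearly from 1.0 down to this thread's final value.
        frac = min(step / anneal_steps, 1.0)
        eps = 1.0 + frac * (final_eps - 1.0)
        if rng.random() < eps:
            return int(rng.integers(len(q_values)))   # explore
        return int(np.argmax(q_values))               # exploit

    return act


# Example: four threads, four different exploration policies
policies = [make_exploration_policy(seed=i) for i in range(4)]
```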

        In addition to stabilizing learning, using multiple parallel actor-learners has multiple practical benefits. First, we obtain a reduction in training time that is roughly linear in the number of parallel actor-learners. Second, since we no longer rely on experience replay for stabilizing learning we are able to use on-policy reinforcement learning methods such as Sarsa and actor-critic to train neural networks in a stable way. We now describe our variants of one-step Q-learning, one-step Sarsa, n-step Q-learning and advantage actor-critic.
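        To make this description concrete, the following is a sketch (not the authors' pseudocode) of one actor-learner thread for the advantage actor-critic variant: each thread keeps a local copy of the network, rolls out up to t_max steps in its own environment instance, forms bootstrapped n-step returns, and applies the resulting gradients to the shared parameters without locking. The network architecture, hyperparameters, and the CartPole environment are illustrative choices, and the classic Gym reset/step API is assumed.

```python
import threading
import gym
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim


class ACNet(nn.Module):
    """Small shared-body policy/value network (architecture is illustrative)."""
    def __init__(self, n_obs, n_act):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(n_obs, 128), nn.ReLU())
        self.pi = nn.Linear(128, n_act)   # action logits
        self.v = nn.Linear(128, 1)        # state value

    def forward(self, x):
        h = self.body(x)
        return self.pi(h), self.v(h).squeeze(-1)


def actor_learner(shared_model, optimizer, updates=2000, t_max=5, gamma=0.99):
    # One actor-learner thread: its own environment copy and local network,
    # with gradients applied (lock-free) to the shared parameters.
    env = gym.make('CartPole-v1')         # classic Gym API assumed
    local_model = ACNet(env.observation_space.shape[0], env.action_space.n)
    obs, done = env.reset(), False
    for _ in range(updates):
        local_model.load_state_dict(shared_model.state_dict())  # sync with shared params
        if done:
            obs, done = env.reset(), False
        log_probs, values, rewards, entropies = [], [], [], []
        for _ in range(t_max):            # roll out at most t_max steps
            logits, value = local_model(torch.as_tensor(obs, dtype=torch.float32))
            probs = F.softmax(logits, dim=-1)
            action = torch.multinomial(probs, 1).item()
            obs, reward, done, _ = env.step(action)
            log_probs.append(torch.log(probs[action]))
            entropies.append(-(probs * torch.log(probs)).sum())
            values.append(value)
            rewards.append(reward)
            if done:
                break
        # n-step returns, bootstrapped with V(s) unless the rollout hit a terminal state
        R = torch.zeros(())
        if not done:
            with torch.no_grad():
                _, R = local_model(torch.as_tensor(obs, dtype=torch.float32))
        policy_loss = value_loss = 0.0
        for t in reversed(range(len(rewards))):
            R = rewards[t] + gamma * R
            advantage = R - values[t]
            value_loss = value_loss + advantage.pow(2)
            policy_loss = policy_loss - log_probs[t] * advantage.detach() - 0.01 * entropies[t]
        # Compute gradients on the local copy, then hand them to the shared model (no locks)
        local_model.zero_grad()
        (policy_loss + 0.5 * value_loss).backward()
        for lp, sp in zip(local_model.parameters(), shared_model.parameters()):
            sp.grad = lp.grad
        optimizer.step()


if __name__ == "__main__":
    probe = gym.make('CartPole-v1')
    shared = ACNet(probe.observation_space.shape[0], probe.action_space.n)
    opt = optim.Adam(shared.parameters(), lr=1e-4)
    threads = [threading.Thread(target=actor_learner, args=(shared, opt)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

With Python threads the GIL limits the actual speedup, so this mainly demonstrates the structure of a worker; process-based workers with parameters in shared memory, as in the Hogwild! sketch above, are a common alternative in practice.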

### Answer 1:

A2C (Advantage Actor-Critic) is a deep reinforcement learning algorithm that combines the actor-critic architecture with advantage estimation; it can be applied to both discrete and continuous action spaces. The following is a simple Python example of the A2C algorithm:

```python
import gym
import torch
import torch.nn as nn
import torch.optim as optim


# Actor-Critic network with a shared hidden layer
class ActorCritic(nn.Module):
    def __init__(self, input_size, output_size):
        super().__init__()
        self.common = nn.Sequential(
            nn.Linear(input_size, 128),
            nn.ReLU()
        )
        self.actor = nn.Linear(128, output_size)
        self.critic = nn.Linear(128, 1)

    def forward(self, x):
        x = self.common(x)
        policy = nn.functional.softmax(self.actor(x), dim=-1)
        value = self.critic(x)
        return policy, value


# One A2C update: roll out up to num_steps, then update from n-step returns
def a2c_update(env, obs, model, optimizer, gamma=0.99, num_steps=5):
    rewards, values, log_probs, entropies = [], [], [], []
    done = False
    for _ in range(num_steps):
        policy, value = model(torch.as_tensor(obs, dtype=torch.float32))
        action = torch.multinomial(policy, 1).item()
        log_probs.append(torch.log(policy[action]))
        entropies.append(-(policy * torch.log(policy)).sum())
        values.append(value.squeeze())
        obs, reward, done, _ = env.step(action)   # classic Gym API
        rewards.append(reward)
        if done:
            break

    # Bootstrap the return from the critic unless the episode terminated
    R = torch.zeros(())
    if not done:
        _, next_value = model(torch.as_tensor(obs, dtype=torch.float32))
        R = next_value.squeeze().detach()

    policy_loss, value_loss, entropy_loss = 0.0, 0.0, 0.0
    for t in reversed(range(len(rewards))):
        R = rewards[t] + gamma * R                 # n-step return
        advantage = R - values[t]
        value_loss = value_loss + advantage.pow(2)
        policy_loss = policy_loss - log_probs[t] * advantage.detach()
        entropy_loss = entropy_loss - entropies[t]

    loss = policy_loss + 0.5 * value_loss + 0.01 * entropy_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return obs, done, sum(rewards)


# Train A2C on CartPole-v1
env = gym.make('CartPole-v1')
model = ActorCritic(env.observation_space.shape[0], env.action_space.n)
optimizer = optim.Adam(model.parameters(), lr=0.001)

for episode in range(1000):
    obs, done, episode_reward = env.reset(), False, 0.0
    while not done:
        obs, done, chunk_reward = a2c_update(env, obs, model, optimizer)
        episode_reward += chunk_reward
    print(f"Episode {episode}: reward {episode_reward}")
```

This is a simple A2C implementation. The Actor-Critic network uses a shared hidden layer; its input is the state and its outputs are the action policy and the state value. During training, the network first selects actions and computes state values, then uses this information to compute advantages and the policy gradient, and finally updates the parameters with the Adam optimizer. The total reward of each episode is recorded and can be used to assess the performance of the algorithm.

### Answer 2:

Deep reinforcement learning is a machine learning approach that combines deep learning with reinforcement learning. A2C (Advantage Actor-Critic) is one of its algorithms and can be implemented in Python.

A2C is a policy-gradient-based reinforcement learning algorithm. Its core idea is to improve the agent's policy so as to maximize the cumulative reward obtained in the environment. One strength of A2C is that it can make full use of computational resources by running multiple workers in parallel, which speeds up training.

To implement A2C in Python, we first define a neural network model that estimates the agent's action policy. This can be a deep neural network that takes the environment state as input and outputs a probability distribution over actions. Using the basic machinery of reinforcement learning, the agent then interacts with the environment to sample experience trajectories. From these trajectories we estimate the expected return of the agent's actions and update the network parameters with a policy-gradient method to improve the policy. A2C uses an Actor-Critic structure: the Actor executes actions and the Critic estimates the expected return and provides the signal for policy improvement.

In practice, one can use a deep learning framework such as TensorFlow or PyTorch: define a model class, build the network with the framework's API, and then write the A2C training loop that updates the network parameters while interacting with the environment.

In short, implementing A2C requires defining the network model, building the training loop, updating the parameters with policy gradients, and combining the basic principles of reinforcement learning with agent-environment interaction. Python provides a flexible and efficient environment and tooling for a task of this complexity.

### Answer 3:

In deep reinforcement learning, A2C stands for Advantage Actor-Critic, an algorithm that optimizes a policy using deep neural networks. It combines the Actor-Critic method with the notion of an advantage function, using advantage estimates to guide the agent's learning.

In A2C the agent is modelled as a combination of an Actor (policy network) and a Critic (value network). The Actor produces the action policy, while the Critic evaluates the current policy by estimating the state-action value function, or equivalently the advantage function. The two networks work together and are updated continuously through interaction with the environment, so that the policy keeps improving.

Concretely, A2C uses gradient-based optimization: the Actor's parameters are updated in the direction that increases the return predicted by the Critic, and the policy-gradient update lets the policy adapt to changes in the environment. Because A2C is an on-policy method, it does not rely on an experience replay buffer; instead, each update uses fresh rollouts collected under the current policy, often from several parallel workers.

When implementing A2C in Python, one can use a deep learning framework such as PyTorch or TensorFlow to build the Actor and Critic networks, define the loss functions and the optimizer, and design an interaction loop that repeatedly samples from the environment, updates the network parameters, and evaluates and improves the policy.

In summary, A2C is a deep reinforcement learning algorithm that, guided by the Actor-Critic structure and the advantage function, continually optimizes the agent's policy while interacting with the environment. Implementing it in Python requires a deep learning framework, a network architecture, loss functions and an optimizer, and an interaction loop for parameter updates and policy improvement.
