【Reinforcement Learning】14 —— A3C (Asynchronous Advantage Actor Critic)

The A3C algorithm (introduced in "Asynchronous Methods for Deep Reinforcement Learning") was proposed by the Google DeepMind team in 2016. A3C is a very effective deep reinforcement learning algorithm and has achieved strong results on complex tasks such as Go and StarCraft. Let's start from the name itself and unpack the algorithm.
[Figure: Diagram of the A3C high-level architecture]

A3C stands for Asynchronous Advantage Actor Critic:

  • Asynchronous: the algorithm runs a set of environments in parallel. Unlike DQN, where a single agent represented by a single neural network interacts with a single environment, A3C uses multiple copies of the agent to learn more efficiently. There is one global network and several worker agents, each with its own set of network parameters. Each worker interacts with its own copy of the environment while the other workers interact with theirs (parallel training). Beyond simply getting more work done, this beats a single agent because each worker's experience is independent of the others', so the overall experience available for training is more diverse.

  • Advantage: the policy-gradient update uses an advantage function.

  • Actor Critic: because this is an actor-critic method, the policy is updated with the help of a learned state-value function:
$$\begin{gathered}\nabla_{\theta'}\log\pi(a_t\mid s_t;\theta')\,A(s_t,a_t;\theta_v)\\ A(s_t,a_t;\theta_v)=\sum_{i=0}^{k-1}\gamma^i r_{t+i}+\gamma^k V(s_{t+k};\theta_v)-V(s_t;\theta_v)\end{gathered}$$

    • The update can bootstrap with a $k$-step return (a minimal sketch follows this list).
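
As a minimal sketch of this $k$-step bootstrapped advantage (my own illustration, not code from the paper; rewards holds $r_t,\dots,r_{t+k-1}$, and v_t, v_tk are the critic's estimates of $V(s_t)$ and $V(s_{t+k})$):

def k_step_advantage(rewards, v_t, v_tk, gamma=0.99):
    # A(s_t, a_t) = sum_{i=0}^{k-1} gamma^i * r_{t+i} + gamma^k * V(s_{t+k}) - V(s_t)
    k = len(rewards)
    k_step_return = sum(gamma ** i * r for i, r in enumerate(rewards)) + gamma ** k * v_tk
    return k_step_return - v_t

# example: k = 3, all rewards equal to 1, V(s_t) = 2.5, V(s_{t+3}) = 2.0
print(k_step_advantage([1.0, 1.0, 1.0], v_t=2.5, v_tk=2.0, gamma=0.99))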

The figure below illustrates parallel training with 16 environments.
[Figure: A3C illustration, from "Actor-Critic Methods: A3C and A2C"]

  • 16 parallel environments.
  • $\theta$ denotes the parameters of the policy (the actor) and $\theta_v$ the parameters of the value function (the critic); their gradients are updated separately, with $\alpha$ and $\alpha_v$ as the corresponding learning rates.
  • To encourage exploration, the algorithm adds an entropy bonus to the policy update as a regularization term (it is folded into $d\theta$); a minimal sketch of the term follows this list.
  • Updates use Hogwild!, a lock-free parallel scheme in which multiple threads may update the shared parameters at the same time. Such concurrent updates can conflict, but the authors argue that in practice this does little harm.
  • When computing the policy's advantage, the algorithm uses forward-view $n$-step returns rather than the backward view. The two views differ in how multi-step rewards are accumulated: the backward view requires eligibility traces; see Sutton and Barto's book for details.
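
A minimal sketch of the entropy bonus (my own illustration, not code from the paper; logits are the actor's outputs for a batch of states, and the weight beta is a small constant, around 0.01 in the paper):

import torch
import torch.nn.functional as F

def entropy_bonus(logits):
    # H(pi(.|s)) = -sum_a pi(a|s) * log pi(a|s), averaged over the batch
    probs = F.softmax(logits, dim=-1)
    log_probs = F.log_softmax(logits, dim=-1)   # numerically stable log pi
    return -(probs * log_probs).sum(dim=-1).mean()

# the bonus is added to the policy objective (subtracted from the policy loss), e.g.
# actor_loss = -(log_prob * advantage).mean() - beta * entropy_bonus(logits)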

A3C pseudocode
[Figure: A3C pseudocode]
A3C implementation
[Figure: from "Simple Reinforcement Learning with Tensorflow Part 8: Asynchronous Actor-Critic Agents (A3C)"]

  1. Each worker copies the current parameters from the global network.
  2. Each worker interacts with its own environment.
  3. Each worker computes its own gradients.
  4. Each worker sends its gradients back to the global network.
  5. The global network applies the received gradients to update its parameters (a minimal sketch of steps 3 to 5 follows this list).
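
A minimal PyTorch sketch of steps 3 to 5 (my own illustration, assuming local_net and global_net share the same architecture and global_opt was built over global_net.parameters(); the Worker.update method in the full code below does the same thing for the actor and the critic separately):

import torch
import torch.nn as nn

def push_and_pull(loss, local_net, global_net, global_opt):
    # step 3: compute gradients on the worker's local copy
    global_opt.zero_grad()
    local_net.zero_grad()
    loss.backward()
    # step 4: hand the local gradients over to the corresponding global parameters
    for lp, gp in zip(local_net.parameters(), global_net.parameters()):
        gp._grad = lp.grad
    # step 5: the shared optimizer updates the global parameters ...
    global_opt.step()
    # ... and the worker pulls back the fresh weights (step 1 of the next round)
    local_net.load_state_dict(global_net.state_dict())

# toy usage: two identical one-layer nets stand in for the local and global networks
global_net = nn.Linear(4, 2)
local_net = nn.Linear(4, 2)
local_net.load_state_dict(global_net.state_dict())
opt = torch.optim.Adam(global_net.parameters(), lr=1e-3)
loss = local_net(torch.randn(3, 4)).pow(2).mean()   # placeholder loss
push_and_pull(loss, local_net, global_net, opt)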

TensorFlow version

Code structure:

  • The AC_Network class contains all the TensorFlow ops that build the network itself.
  • The Worker class holds a copy of AC_Network, an environment instance, and all the logic for interacting with the environment and updating the global network.
  • High-level code that creates the Worker instances and runs them in parallel (a rough skeleton is sketched below).
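
A rough structural skeleton of that layout (placeholder bodies only; the class names AC_Network and Worker come from the tutorial, while the simplified signatures and the threading-based launcher are my own assumptions):

import threading

class AC_Network:
    # all the ops that build the network itself (policy head + value head)
    def __init__(self, scope, trainer=None):
        self.scope = scope        # e.g. 'global' or 'worker_0'
        self.trainer = trainer    # optimizer used to apply a worker's gradients to the global net

class Worker:
    # a local AC_Network copy, an environment, and the interact/update logic
    def __init__(self, env, name, trainer):
        self.env = env
        self.name = name
        self.local_net = AC_Network(name, trainer)

    def work(self):
        pass  # roll out episodes, compute gradients, push them to the global net, pull fresh weights

# high-level launcher: build the global network, then run one thread per worker
def launch(envs, trainer):
    master_network = AC_Network('global')
    workers = [Worker(env, 'worker_%d' % i, trainer) for i, env in enumerate(envs)]
    threads = [threading.Thread(target=w.work) for w in workers]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return master_network, workers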

PyTorch version

The PyTorch implementation below draws on MorvanZhou's (莫烦Python) pytorch-A3C: https://github.com/MorvanZhou/pytorch-A3C/tree/master
as well as on https://github.com/cyoon1729/Policy-Gradient-Methods/blob/master/a3c/a3c.py

  1. The ideal result from MorvanZhou's (莫烦Python) tutorial is shown below.
    [Figure: expected reward curve from MorvanZhou's tutorial]
    In my own runs, however, I saw the behavior in the next two figures: the reward drops rapidly after reaching its peak. My guess is that a worker learns a poor policy and syncs it to the global network, so an initially good policy keeps getting worse.
    [Figures: two runs in which the reward collapses shortly after peaking]

The other version of the code converges faster and runs well; it likewise uses a cross-entropy term as the regularizer. [Figure: reward curve for this version]

w0 | episode: 978 391.0
w5 | episode: 979 396.0
w3 | episode: 980 399.0
w7 | episode: 981 500.0
w4 | episode: 982 383.0
w1 | episode: 983 500.0
w6 | episode: 984 500.0
w2 | episode: 985 500.0
w0 | episode: 986 500.0
w5 | episode: 987 500.0
w3 | episode: 988 500.0
w4 | episode: 989 500.0
w7 | episode: 990 500.0
w1 | episode: 991 500.0
w6 | episode: 992 500.0
w2 | episode: 993 500.0
w0 | episode: 994 500.0
w5 | episode: 995 500.0
w3 | episode: 996 500.0
w7 | episode: 997 500.0
w4 | episode: 998 500.0
w1 | episode: 999 500.0
w6 | episode: 1000 500.0

Code:

import torch
import torch.nn.functional as F
import torch.multiprocessing as mp
import gymnasium as gym
import numpy as np
import util
from queue import Empty

GLOBAL_MAX_EPISODE = 1000
GAMMA = 0.98

class SharedAdam(torch.optim.Adam):
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.99), eps=1e-8,
                 weight_decay=0):
        super(SharedAdam, self).__init__(params, lr=lr, betas=betas, eps=eps, weight_decay=weight_decay)
        # State initialization
        for group in self.param_groups:
            for p in group['params']:
                state = self.state[p]
                state['step'] = 0
                state['exp_avg'] = torch.zeros_like(p.data)
                state['exp_avg_sq'] = torch.zeros_like(p.data)

                # share in memory
                state['exp_avg'].share_memory_()
                state['exp_avg_sq'].share_memory_()

class PolicyNet(torch.nn.Module):
    def __init__(self, state_dim, hidden_dim, action_dim):
        super(PolicyNet, self).__init__()
        self.fc1 = torch.nn.Linear(state_dim, hidden_dim)
        self.fc2 = torch.nn.Linear(hidden_dim, action_dim)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return self.fc2(x)

# Input: a state; output: the estimated value of that state.
class ValueNet(torch.nn.Module):
    def __init__(self, state_dim, hidden_dim):
        super(ValueNet, self).__init__()
        self.fc1 = torch.nn.Linear(state_dim, hidden_dim)
        self.fc2 = torch.nn.Linear(hidden_dim, 1)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return self.fc2(x)

class A3Cagent:
    def __init__(self, state_dim, hidden_dim, action_dim, actor_lr, critic_lr, env, numOfCPU):
        # global network
        self.global_actor = PolicyNet(state_dim, hidden_dim, action_dim)
        self.global_critic = ValueNet(state_dim, hidden_dim)
        # share the global parameters in multiprocessing
        self.global_actor.share_memory()
        self.global_critic.share_memory()
        self.global_critic_optimizer = SharedAdam(self.global_critic.parameters(), lr=critic_lr, betas=(0.92, 0.999))
        self.global_actor_optimizer = SharedAdam(self.global_actor.parameters(), lr=actor_lr, betas=(0.92, 0.999))
        # self.global_critic_optimizer = torch.optim.Adam(self.global_critic.parameters(), lr=critic_lr)
        # self.global_actor_optimizer = torch.optim.Adam(self.global_actor.parameters(), lr=actor_lr)
        self.env = env
        self.global_episode = mp.Value('i', 0)
        self.global_episode_reward = mp.Value('d', 0.)
        self.res_queue = mp.Queue()
        # worker
        self.workers = [Worker(i, self.global_actor, self.global_critic, self.global_critic_optimizer,
                          self.global_actor_optimizer,  self.env, state_dim, hidden_dim, action_dim,
                          self.global_episode, self.global_episode_reward, self.res_queue) for i in range(numOfCPU)]

    def train(self):
        [w.start() for w in self.workers]
        res = []
        while True:
            r = self.res_queue.get()
            if r is not None:
                res.append(r)
            else:
                break
        [w.join() for w in self.workers]


class Worker(mp.Process):
    def __init__(self, name, global_actor, global_critic, global_critic_optimizer, global_actor_optimizer,
                 env, state_dim, hidden_dim, action_dim, global_episode, global_episode_reward, res_queue):
        super(Worker, self).__init__()
        self.id = name
        self.name = 'w%02i' % name
        self.env = env
        # self.env.seed(name)
        self.global_episode = global_episode
        self.global_episode_reward = global_episode_reward
        self.res_queue = res_queue
        self.global_actor = global_actor
        self.global_critic = global_critic
        self.global_critic_optimizer = global_critic_optimizer
        self.global_actor_optimizer = global_actor_optimizer
        self.local_actor = PolicyNet(state_dim, hidden_dim, action_dim)
        self.local_critic = ValueNet(state_dim, hidden_dim)

    # sample an action at random from the policy's output distribution
    def take_action(self, state):
        # state = torch.tensor(np.array([state]), dtype=torch.float)
        state = torch.FloatTensor(state)
        logits = self.local_actor(state)
        dist = F.softmax(logits, dim=0)
        probs = torch.distributions.Categorical(dist)
        return probs.sample().detach().item()

    def update(self, transition_dict):
        states = torch.tensor(np.array(transition_dict['states']), dtype=torch.float)
        actions = torch.tensor(transition_dict['actions']).view(-1, 1)
        rewards = torch.tensor(transition_dict['rewards'], dtype=torch.float).view(-1, 1)
        # next_states = torch.tensor(np.array(transition_dict['next_states']), dtype=torch.float)
        # terminateds = torch.tensor(transition_dict['terminateds'], dtype=torch.float).view(-1, 1)
        # truncateds = torch.tensor(transition_dict['truncateds'], dtype=torch.float).view(-1, 1)

        # compute value targets: discounted return G_t = r_t + GAMMA * G_{t+1} for every step of the episode
        returns = []
        R = 0.0
        for r in reversed(rewards.view(-1).tolist()):
            R = r + GAMMA * R
            returns.insert(0, R)
        value_targets = torch.tensor(returns, dtype=torch.float).view(-1, 1)
        critic_loss = F.mse_loss(self.local_critic(states), value_targets)

        # compute policy loss with entropy bonus
        logits = self.local_actor(states)
        dists = F.softmax(logits, dim=1)
        # assert not torch.any(torch.isnan(dists))
        probs = torch.distributions.Categorical(dists)
        # exploration regularizer: cross-entropy between a uniform distribution and pi(.|s)
        # (dist.mean() equals 1/num_actions, so each term is -(1/A) * sum_a log pi(a|s));
        # like an entropy bonus, it pushes the policy toward uniform and encourages exploration
        entropy = []
        for dist in dists:
            entropy.append(-torch.sum(dist.mean() * torch.log(dist + 1e-8)))
        entropy = torch.stack(entropy).sum()

        advantage = value_targets - self.local_critic(states)
        actor_loss = -probs.log_prob(actions.view(actions.size(0))).view(-1, 1) * advantage.detach()
        actor_loss = actor_loss.mean() - entropy * 0.001

        self.global_critic_optimizer.zero_grad()
        self.local_critic.zero_grad()   # clear any gradients left over from the previous update
        critic_loss.backward()
        # propagate local gradients to global parameters
        for local_params, global_params in zip(self.local_critic.parameters(), self.global_critic.parameters()):
            global_params._grad = local_params._grad
        self.global_critic_optimizer.step()

        self.global_actor_optimizer.zero_grad()
        self.local_actor.zero_grad()    # clear any gradients left over from the previous update
        actor_loss.backward()
        # propagate local gradients to global parameters
        for local_params, global_params in zip(self.local_actor.parameters(), self.global_actor.parameters()):
            global_params._grad = local_params._grad
        self.global_actor_optimizer.step()

    def sync_with_global(self):
        self.local_critic.load_state_dict(self.global_critic.state_dict())
        self.local_actor.load_state_dict(self.global_actor.state_dict())

    def run(self):
        while self.global_episode.value < GLOBAL_MAX_EPISODE:
            state, _ = self.env.reset(seed=self.id)
            # state = self.env.reset()
            terminated = False
            truncated = False
            episodeReward = 0
            transition_dict = {
                'states': [],
                'actions': [],
                'next_states': [],
                'rewards': [],
                'terminateds': [],
                'truncateds': []
            }
            while self.global_episode.value < GLOBAL_MAX_EPISODE:
                action = self.take_action(state)
                # next_state, reward, terminated, info = self.env.step(action)
                next_state, reward, terminated, truncated, info = self.env.step(action)
                transition_dict['states'].append(state)
                transition_dict['actions'].append(action)
                transition_dict['next_states'].append(next_state)
                transition_dict['rewards'].append(reward)
                transition_dict['terminateds'].append(terminated)
                transition_dict['truncateds'].append(truncated)
                episodeReward += reward
                if terminated or truncated:
                    with self.global_episode.get_lock():
                        self.global_episode.value += 1
                    print(self.name + " | episode: " + str(self.global_episode.value) + " " + str(episodeReward))
                    self.res_queue.put(episodeReward)
                    self.update(transition_dict)
                    self.sync_with_global()
                    break
                state = next_state


def test01():
    env = gym.make("CartPole-v1")
    agent = A3Cagent(state_dim=env.observation_space.shape[0],
                     hidden_dim=256,
                     action_dim=2,
                     actor_lr=1e-3,
                     critic_lr=1e-3,
                     env=env,
                     numOfCPU=8)
    [w.start() for w in agent.workers]
    returnLists1 = []
    while True:
        try:
            r = agent.res_queue.get(timeout=3)  # wait up to 3 seconds for a result
            if r is not None:
                returnLists1.append(r)
            else:
                break
        except Empty:  # get() raises queue.Empty when nothing arrives within the timeout
            print("No data in queue, breaking...")
            break
    [w.join() for w in agent.workers]
    ReturnList = []
    ReturnList.append(util.smooth([returnLists1], sm=50))
    labelList = ['A3C']
    util.PlotReward(len(returnLists1), ReturnList, labelList, 'CartPole-v1')
    np.save(r"D:\LearningRL\Hands-on-RL\A3C_CartPole\ReturnData\A3C_v2_core8_4.npy", returnLists1)  # raw string keeps the Windows backslashes literal
    env.close()


if __name__ == "__main__":
    # the main-module guard is required by torch.multiprocessing when the spawn start method is used (e.g. on Windows)
    test01()

[Figure: smoothed return curve for A3C on CartPole-v1]
A3C runs very fast. In practice I compared the two Adam optimizers (SharedAdam and the regular torch.optim.Adam that is commented out above), and the difference did not seem significant.

References

[1] 伯禹AI (Boyu AI)
[2] David Silver's RL teaching page: https://www.davidsilver.uk/teaching/
[3] Hands-on Reinforcement Learning (动手学强化学习)
[4] Reinforcement Learning
[5] Asynchronous Methods for Deep Reinforcement Learning
[6] Chapter 9: Actor-Critic Algorithms (第9章 演员-评论员算法)
[7] Simple Reinforcement Learning with Tensorflow Part 8: Asynchronous Actor-Critic Agents (A3C)
[8] Actor-Critic Methods: A3C and A2C
[9] https://mofanpy.com/tutorials/machine-learning/reinforcement-learning/intro-A3C
