Deep Reinforcement Learning (7): Multi-Agent Reinforcement Learning with IPPO and MADDPG


Multi-agent settings are more complex than the single-agent case, because each agent interacts not only with the environment but also, directly or indirectly, with all the other agents. Multi-agent reinforcement learning (MARL) can be divided into the following categories:

  • Centralized reinforcement learning
    A single global learning unit carries out the learning task: it takes the overall state of the whole multi-agent system as input and outputs an action for each agent.
  • Independent reinforcement learning
    Each agent is an independent learner that considers only its own observations, policy, and interests.
  • Social reinforcement learning
    Independent reinforcement learning combined with social/economic models: it imitates how individuals interact in human society and uses methods from sociology and management science to regulate the relationships between agents.
  • Swarm reinforcement learning
    The centralized-training, decentralized-execution (CTDE) paradigm, which combines the advantages of centralized and independent learning. During training, the agents learn centrally with access to global information; during execution, each agent selects actions using only its own observations and local information.

7.1 The IPPO Algorithm

Each agent is trained independently with the single-agent PPO algorithm.
Algorithm outline

  • For $N$ agents, initialize each agent's own policy and value function
  • for training iterations $k = 0, 1, 2, \cdots$ do
    • All agents interact with the environment and each collects its own trajectory
    • For each agent, estimate the advantage function with GAE based on its current value function (see the formulas after this list)
    • For each agent, update its policy by maximizing its PPO-clip objective
    • For each agent, optimize its value function with a mean-squared-error loss
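
For reference, the GAE advantage estimate and the clipped surrogate objective used in these steps are the standard formulas (written per agent, with value function $V$, GAE parameter $\lambda$, and clipping parameter $\epsilon$):

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t),\qquad \hat A_t = \sum_{l\ge 0}(\gamma\lambda)^l\,\delta_{t+l}$$

$$L^{\mathrm{CLIP}}(\theta)=\mathbb E_t\Big[\min\big(r_t(\theta)\hat A_t,\ \mathrm{clip}(r_t(\theta),1-\epsilon,1+\epsilon)\hat A_t\big)\Big],\qquad r_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t\mid s_t)}$$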

The Combat environment

Combat is a two-team battle simulation played on a two-dimensional grid world. Each agent's action set is: move one cell in one of the four directions, attack another enemy agent within the surrounding 3x3 area, or do nothing. Each agent starts with 3 health points; an agent that is hit while inside an enemy's attack range loses 1 health point, and it dies when its health reaches 0. The team with the last surviving agents wins. Each agent's attack has a cooldown of one turn.
In the game we control all agents of one team against the other team, whose agents follow a fixed policy: attack the nearest enemy within range, and move toward the enemies if none is in range.

Code implementation

Importing the Combat environment

git clone https://github.com/boyu-ai/ma-gym.git
import torch
import torch.nn.functional as F
import numpy as np
from tqdm import tqdm
import matplotlib.pyplot as plt
import sys
sys.path.append("../ma-gym")
from ma_gym.envs.combat.combat import Combat
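
As an optional sanity check of the Combat interface described above (illustrative only; it relies on the same calls used by the training loop below: reset() returns one observation per controlled agent, step() takes a list of actions and returns per-agent rewards and done flags, and info carries the 'win' flag), one short random rollout looks like this:

# Illustrative only: roll out one episode with random actions to inspect the interface
env = Combat(grid_shape=(15, 15), n_agents=2, n_opponents=2)
obs = env.reset()                     # list: one observation per controlled agent
done = [False, False]
while not all(done):
    actions = [space.sample() for space in env.action_space]  # random action per agent
    obs, rewards, done, info = env.step(actions)              # per-agent rewards / done flags
print(info)                           # contains 'win', used for reward shaping below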

The PPO algorithm

# PPO algorithm
class PolicyNet(torch.nn.Module):
    def __init__(self, state_dim, hidden_dim, action_dim):
        super(PolicyNet, self).__init__()
        self.fc1 = torch.nn.Linear(state_dim, hidden_dim)
        self.fc2 = torch.nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = torch.nn.Linear(hidden_dim, action_dim)

    def forward(self, x):
        x = F.relu(self.fc2(F.relu(self.fc1(x))))
        return F.softmax(self.fc3(x), dim=1)

class ValueNet(torch.nn.Module):
    def __init__(self, state_dim, hidden_dim):
        super(ValueNet, self).__init__()
        self.fc1 = torch.nn.Linear(state_dim, hidden_dim)
        self.fc2 = torch.nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = torch.nn.Linear(hidden_dim, 1)

    def forward(self, x):
        x = F.relu(self.fc2(F.relu(self.fc1(x))))
        return self.fc3(x)


def compute_advantage(gamma, lmbda, td_delta):
    td_delta = td_delta.detach().numpy()
    advantage_list = []
    advantage = 0.0
    for delta in td_delta[::-1]:
        advantage = gamma * lmbda * advantage + delta
        advantage_list.append(advantage)
    advantage_list.reverse()
    return torch.tensor(advantage_list, dtype=torch.float)


# PPO with the clipped surrogate objective
class PPO:
    def __init__(self, state_dim, hidden_dim, action_dim,
                 actor_lr, critic_lr, lmbda, eps, gamma, device):
        self.actor = PolicyNet(state_dim, hidden_dim, action_dim).to(device)
        self.critic = ValueNet(state_dim, hidden_dim).to(device)
        self.actor_optimizer = torch.optim.Adam(self.actor.parameters(), actor_lr)
        self.critic_optimizer = torch.optim.Adam(self.critic.parameters(), critic_lr)
        self.gamma = gamma
        self.lmbda = lmbda
        self.eps = eps  # clipping range parameter of PPO
        self.device = device

    def take_action(self, state):
        state = torch.tensor([state], dtype=torch.float).to(self.device)
        probs = self.actor(state)
        action_dist = torch.distributions.Categorical(probs)
        action = action_dist.sample()
        return action.item()

    def update(self, transition_dict):
        states = torch.tensor(transition_dict['states'], dtype=torch.float).to(self.device)
        actions = torch.tensor(transition_dict['actions']).view(-1, 1).to(self.device)
        rewards = torch.tensor(transition_dict['rewards'], dtype=torch.float).view(-1, 1).to(self.device)
        next_states = torch.tensor(transition_dict['next_states'], dtype=torch.float).to(self.device)
        dones = torch.tensor(transition_dict['dones'], dtype=torch.float).view(-1, 1).to(self.device)
        td_target = rewards + self.gamma * self.critic(next_states) * (1 - dones)
        td_delta = td_target - self.critic(states)
        advantage = compute_advantage(self.gamma, self.lmbda, td_delta.cpu()).to(self.device)
        # old_log_probs is detached; a single clipped update step is taken per trajectory
        old_log_probs = torch.log(self.actor(states).gather(1, actions)).detach()
        log_probs = torch.log(self.actor(states).gather(1, actions))
        ratio = torch.exp(log_probs - old_log_probs)
        surr1 = ratio * advantage
        surr2 = torch.clamp(ratio, 1 - self.eps, 1 + self.eps) * advantage  # clipping
        action_loss = torch.mean(-torch.min(surr1, surr2))  # PPO loss
        critic_loss = torch.mean(F.mse_loss(self.critic(states), td_target.detach()))
        self.actor_optimizer.zero_grad()
        self.critic_optimizer.zero_grad()
        action_loss.backward()
        critic_loss.backward()
        self.actor_optimizer.step()
        self.critic_optimizer.step()

Hyperparameters and environment setup

actor_lr = 3e-4
critic_lr = 1e-3
epochs = 10
episode_per_epoch = 1000
hidden_dim = 64
gamma = 0.99
lmbda = 0.97
eps = 0.2
team_size = 2  # number of agents per team
grid_size = (15, 15)  # size of the 2-D grid world
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

# create the environment
env = Combat(grid_shape=grid_size, n_agents=team_size, n_opponents=team_size)
state_dim = env.observation_space[0].shape[0]
action_dim = env.action_space[0].n

Parameter sharing means that all agents use the same set of policy parameters. The prerequisite is that the agents are homogeneous, i.e., their state and action spaces are identical and they optimize exactly the same objective.

  • Agents that do not share a policy
# create the agents (no parameter sharing)
agent1 = PPO(
    state_dim, hidden_dim, action_dim,
    actor_lr, critic_lr, lmbda, eps, gamma, device
)
agent2 = PPO(
    state_dim, hidden_dim, action_dim,
    actor_lr, critic_lr, lmbda, eps, gamma, device
)
  • Agents that share a single policy
# create the agent (parameter sharing)
agent = PPO(
    state_dim, hidden_dim, action_dim,
    actor_lr, critic_lr, lmbda, eps, gamma, device
)

Training

win_list = []

for e in range(epochs):
    with tqdm(total=episode_per_epoch, desc='Epoch %d' % e) as pbar:
        for episode in range(episode_per_epoch):
            # Replay buffer for agent1
            buffer_agent1 = {
                'states': [],
                'actions': [],
                'next_states': [],
                'rewards': [],
                'dones': []
            }
            # Replay buffer for agent2
            buffer_agent2 = {
                'states': [],
                'actions': [],
                'next_states': [],
                'rewards': [],
                'dones': []
            }
            # reset the environment
            s = env.reset()
            terminal = False
            while not terminal:
                # select actions (without parameter sharing)
                a1 = agent1.take_action(s[0])
                a2 = agent2.take_action(s[1])
                # select actions (with parameter sharing)
                # a1 = agent.take_action(s[0])
                # a2 = agent.take_action(s[1])
                next_s, r, done, info = env.step([a1, a2])

                buffer_agent1['states'].append(s[0])
                buffer_agent1['actions'].append(a1)
                buffer_agent1['next_states'].append(next_s[0])
                # add a bonus of 100 to the reward on a win, otherwise apply a 0.1 penalty
                buffer_agent1['rewards'].append(
                    r[0] + 100 if info['win'] else r[0] - 0.1)
                buffer_agent1['dones'].append(False)

                buffer_agent2['states'].append(s[1])
                buffer_agent2['actions'].append(a2)
                buffer_agent2['next_states'].append(next_s[1])
                buffer_agent2['rewards'].append(
                    r[1] + 100 if info['win'] else r[1] - 0.1)
                buffer_agent2['dones'].append(False)

                s = next_s  # move to the next state
                terminal = all(done)
            # update the policies (without parameter sharing)
            agent1.update(buffer_agent1)
            agent2.update(buffer_agent2)
            # update the policies (with parameter sharing)
            # agent.update(buffer_agent1)
            # agent.update(buffer_agent2)
            win_list.append(1 if info['win'] else 0)
            if (episode + 1) % 100 == 0:
                pbar.set_postfix({
                    'episode': '%d' % (episode_per_epoch * e + episode + 1),
                    'win rate': '%.3f' % np.mean(win_list[-100:])
                })
            pbar.update(1)
win_array = np.array(win_list)
# average over every 100 episodes
win_array = np.mean(win_array.reshape(-1, 100), axis=1)

episode_list = np.arange(win_array.shape[0]) * 100
plt.plot(episode_list, win_array)
plt.xlabel('Episodes')
plt.ylabel('win rate')
plt.title('IPPO on Combat')
plt.show()


7.2 The MADDPG Algorithm

Multi-Agent Deep Deterministic Policy Gradient (MADDPG)
Consider a game with $N$ agents, where the policy parameters are $\theta=\{\theta_1,\cdots,\theta_N\}$ and $\pi=\{\pi_1,\cdots,\pi_N\}$ denotes the set of all agents' policies. The policy gradient of the expected return of agent $i$ is
$$\nabla_{\theta_i}J(\theta_i)=\mathbb E_{s\sim p^{\mu},\,a\sim\pi_i}\left[\nabla_{\theta_i}\log\pi_i(a_i\mid s_i)\,Q^\pi_i(\mathbf x,a_1,\cdots,a_N)\right]$$
where $Q^\pi_i(\mathbf x,a_1,\cdots,a_N)$ is the centralized action-value function and $\mathbf x=(s_1,\cdots,s_N)$. For $N$ continuous deterministic policies $\mu_{\theta_i}$, the DDPG-style gradient becomes
$$\nabla_{\theta_i}J(\mu_i)=\mathbb E_{\mathbf x\sim\mathcal D}\left[\nabla_{\theta_i}\mu_i(s_i)\,\nabla_{a_i}Q^\mu_i(\mathbf x,a_1,\cdots,a_N)\big|_{a_i=\mu_i(s_i)}\right]$$
Here $\mathcal D$ in $\mathbf x\sim\mathcal D$ is the experience replay buffer, which stores tuples $(\mathbf x,\mathbf x^\prime,a_1,\cdots,a_N,r_1,\cdots,r_N)$. In MADDPG, the loss of the centralized action-value function is
$$\mathcal L(\omega_i)=\mathbb E_{\mathbf x,a,r,\mathbf x^\prime}\left[\big(Q^\mu_i(\mathbf x,a_1,\cdots,a_N)-y\big)^2\right],\qquad y=r_i+\gamma\,Q^{\mu^\prime}_i(\mathbf x^\prime,a^\prime_1,\cdots,a^\prime_N)\big|_{a^\prime_j=\mu^\prime_j(o_j)}$$
Algorithm outline

  • for episode $e=1\to M$ do
    • Initialize a random process $\mathcal N$ for action exploration
    • Get the initial observations $\mathbf x$ of all agents
    • for $t=1\to T$ do
      • For each agent $i$, select an action $a_i=\mu_{\theta_i}(o_i)+\mathcal N_t$ with its current policy
      • Execute the joint action $a=(a_1,\cdots,a_N)$, obtain the rewards $r$ and the new observations $\mathbf x^\prime$
      • Store $(\mathbf x,a,r,\mathbf x^\prime)$ in the replay buffer $\mathcal D$
      • $\mathbf x\leftarrow\mathbf x^\prime$
      • for agent $i=1\to N$ do
        • Sample a random minibatch $(\mathbf x^j,a^j,r^j,\mathbf x^{\prime j})$ from $\mathcal D$
        • Train the centralized critic network
        • Train the agent's own actor network
        • Update the target actor and target critic networks (soft update, see below)
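
The target-network update in the last step is the usual soft (Polyak) update with a small coefficient $\tau$, which is what the soft_update method in the code below implements:

$$\theta_i^\prime \leftarrow \tau\,\theta_i + (1-\tau)\,\theta_i^\prime,\qquad \omega_i^\prime \leftarrow \tau\,\omega_i + (1-\tau)\,\omega_i^\prime$$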

The MPE environment

The Multi-Agent Particle Environment (MPE) scenario used here consists of one red adversarial agent (adversary), $N$ blue normal agents, and $N$ landmarks (usually $N=2$), one of which is the target landmark (green). The normal agents know which landmark is the target, but the adversary does not. The normal agents cooperate: if any one of them gets close enough to the target landmark, every normal agent receives the same reward. The adversary is also rewarded for being close enough to the target landmark, but it has to guess which landmark is the target. The normal agents therefore need to cooperate and spread out over different landmarks in order to deceive the adversary.


Gumbel-Softmax approximate sampling

Each agent's action space in MPE is discrete, while DDPG requires the agent's action to be differentiable with respect to its policy parameters, so the Gumbel-Softmax trick is introduced to obtain approximate, differentiable samples from a discrete distribution.

Suppose the random variable $Z$ follows a discrete distribution $\mathcal K=(a_1,\cdots,a_k)$, where $a_i\in[0,1]$ denotes $P(Z=i)$ and $\sum^k_{i=1}a_i=1$. Introduce the reparameterization noise $g_i$, sampled from Gumbel(0, 1):
$$g_i=-\log(-\log u),\qquad u\sim\mathrm{Uniform}(0,1)$$
The Gumbel-Softmax sample can then be written as
$$y_i=\frac{\exp\big((\log a_i+g_i)/\tau\big)}{\sum^k_{j=1}\exp\big((\log a_j+g_j)/\tau\big)},\qquad \forall i=1,\cdots,k$$
A discrete value is computed via $z=\arg\max_i y_i$, which is approximately equivalent to drawing $z\sim\mathcal K$ directly, and the sampled $y$ naturally carries a gradient with respect to $a$. The temperature $\tau>0$ controls how closely the Gumbel-Softmax distribution approximates the discrete one: the smaller $\tau$, the closer the samples are to $\text{onehot}(\arg\max_i(\log a_i+g_i))$; the larger $\tau$, the closer they are to a uniform distribution.
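
As a small, self-contained illustration (not part of the MADDPG code below), the snippet samples from the Gumbel-Softmax relaxation of a 4-way categorical at two temperatures; the smaller temperature pushes the sample toward a one-hot vector:

import torch

probs = torch.tensor([0.1, 0.2, 0.3, 0.4])          # the discrete distribution K
for tau in (1.0, 0.1):
    u = torch.rand(4)
    g = -torch.log(-torch.log(u + 1e-20) + 1e-20)   # Gumbel(0, 1) noise
    y = torch.softmax((torch.log(probs) + g) / tau, dim=0)
    print(tau, y)   # smaller tau -> y concentrates near a one-hot vector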

Code implementation

Importing the MPE environment

git clone https://github.com/boyu-ai/multiagent-particle-envs.git --quiet
pip install -e multiagent-particle-envs
# due to version compatibility issues with multiagent-particle-envs, pin gym to a working version
pip install --upgrade gym==0.10.5 -q
import os
import time
import torch
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
import random
import collections
import gym
import sys
sys.path.append("../multiagent-particle-envs")  # path where the cloned repo is stored
from multiagent.environment import MultiAgentEnv
import multiagent.scenarios as scenarios
import rl_utils  # helper module with ReplayBuffer and moving_average (a minimal sketch is given before the training loop)

Creating the environment

def make_env(name):
    scenario = scenarios.load(f'{name}.py').Scenario()
    world = scenario.make_world()
    return MultiAgentEnv(world, scenario.reset_world, scenario.reward, scenario.observation)

env_id = "simple_adversary"
env = make_env(env_id)

state_dims = [state_space.shape[0] for state_space in env.observation_space]
action_dims = [action_space.n for action_space in env.action_space]
critic_input_dim = sum(state_dims) + sum(action_dims)

Define the utility functions, including the Gumbel-Softmax sampling functions that allow DDPG to work with discrete action spaces.

# convert logits into the one-hot form of the best (argmax) action
def onehot_from_logits(logits, eps=0.01):
    argmax_acs = (logits == logits.max(1, keepdim=True)[0]).float()
    # generate random actions and convert them to one-hot form
    rand_acs = torch.autograd.Variable(
        torch.eye(logits.shape[1])[[
            np.random.choice(range(logits.shape[1]), size=logits.shape[0])
        ]], requires_grad=False
    ).to(logits.device)
    # choose between the greedy and the random action via epsilon-greedy
    return torch.stack([
        argmax_acs[i] if r > eps else rand_acs[i]
        for i, r in enumerate(torch.rand(logits.shape[0]))
    ])

# sample noise from the Gumbel(0, 1) distribution
def sample_gumbel(shape, eps=1e-20, tens_type=torch.FloatTensor):
    U = torch.autograd.Variable(tens_type(*shape).uniform_(), requires_grad=False)
    return -torch.log(-torch.log(U + eps) + eps)

# sample from the Gumbel-Softmax distribution
def gumbel_softmax_sample(logits, temperature):
    y = logits + sample_gumbel(logits.shape, tens_type=type(logits.data)).to(logits.device)
    return F.softmax(y / temperature, dim=1)

# sample from the Gumbel-Softmax distribution and discretize the sample
def gumbel_softmax(logits, temperature=1.0):
    y = gumbel_softmax_sample(logits, temperature)
    y_hard = onehot_from_logits(y)
    # straight-through: the forward pass returns the one-hot y_hard, the backward pass uses the gradient of the soft y
    y = (y_hard.to(logits.device) - y).detach() + y
    return y
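
A quick check of the straight-through trick used in gumbel_softmax above (illustrative only; it assumes the functions just defined and a CPU tensor): the forward value is one-hot, yet gradients still reach the logits through the soft sample.

logits = torch.randn(3, 5, requires_grad=True)
y = gumbel_softmax(logits)               # each row is a one-hot vector
loss = (y * torch.randn(3, 5)).sum()     # any downstream scalar objective
loss.backward()
print(y)                                 # hard one-hot samples
print(logits.grad)                       # non-zero: gradients flow through the soft relaxation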

Implement single-agent DDPG, containing the actor network, the critic network, and a function for computing actions.

class ThreeLayerFC(torch.nn.Module):
    def __init__(self, num_in, num_out, hidden_dim):
        super().__init__()
        self.fc1 = torch.nn.Linear(num_in, hidden_dim)
        self.fc2 = torch.nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = torch.nn.Linear(hidden_dim, num_out)

    def forward(self, x):
        x = F.relu(self.fc2(F.relu(self.fc1(x))))
        return self.fc3(x)

class DDPG:
    def __init__(self, state_dim, action_dim, critic_input_dim, hidden_dim,
                 actor_lr, critic_lr, device):
        self.actor = ThreeLayerFC(state_dim, action_dim, hidden_dim).to(device)
        self.target_actor = ThreeLayerFC(state_dim, action_dim, hidden_dim).to(device)
        self.critic = ThreeLayerFC(critic_input_dim, 1, hidden_dim).to(device)
        self.target_critic = ThreeLayerFC(critic_input_dim, 1, hidden_dim).to(device)
        self.target_critic.load_state_dict(self.critic.state_dict())
        self.target_actor.load_state_dict(self.actor.state_dict())
        self.actor_optimizer = torch.optim.Adam(self.actor.parameters(), actor_lr)
        self.critic_optimizer = torch.optim.Adam(self.critic.parameters(), critic_lr)

    def take_action(self, state, explore=False):
        action = self.actor(state)
        if explore:
            action = gumbel_softmax(action)
        else:
            action = onehot_from_logits(action)
        return action.detach().cpu().numpy()[0]

    def soft_update(self, net, target_net, tau):
        for param_target, param in zip(target_net.parameters(), net.parameters()):
            param_target.data.copy_(param_target.data * (1.0 - tau) + param.data * tau)

The MADDPG algorithm

class MADDPG:
    def __init__(self, env, device, actor_lr, critic_lr, hidden_dim,
                 state_dims, action_dims, critic_input_dim, gamma, tau):
        self.agents = [DDPG(
            state_dims[i],
            action_dims[i],
            critic_input_dim,
            hidden_dim,
            actor_lr,
            critic_lr,
            device
        ) for i in range(len(env.agents))]
        self.gamma = gamma
        self.tau = tau
        self.critic_criterion = torch.nn.MSELoss()
        self.device = device

    @property
    def policies(self):
        return [agt.actor for agt in self.agents]

    @property
    def target_policies(self):
        return [agt.target_actor for agt in self.agents]

    def take_action(self, states, explore):
        # split the joint state into per-agent states and let each agent act on its own state
        states = [
            torch.tensor([states[i]], dtype=torch.float, device=self.device)
            for i in range(len(env.agents))
        ]
        return [
            agent.take_action(state, explore)
            for agent, state in zip(self.agents, states)
        ]

    def update(self, sample, agent_id):
        current_agent = self.agents[agent_id]
        obs, acts, rewards, next_obs, done = sample

        '''update the critic network'''
        current_agent.critic_optimizer.zero_grad()
        # compute the Q-target
        all_target_act = [
            onehot_from_logits(pi(next_obs_))
            for pi, next_obs_ in zip(self.target_policies, next_obs)
        ]
        # concatenate the inputs of the target_critic network
        target_critic_input = torch.cat((*next_obs, *all_target_act), dim=1)
        target_critic_value = rewards[agent_id].view(-1, 1)\
                              + self.gamma * (1 - done[agent_id].view(-1, 1)) * current_agent.target_critic(target_critic_input)
        # compute the current Q-value
        critic_input = torch.cat((*obs, *acts), dim=1)
        critic_value = current_agent.critic(critic_input)
        # update the critic network with an MSE loss
        critic_loss = self.critic_criterion(critic_value, target_critic_value.detach())
        critic_loss.backward()
        current_agent.critic_optimizer.step()

        '''update the actor network'''
        current_agent.actor_optimizer.zero_grad()
        logits = current_agent.actor(obs[agent_id])
        act = gumbel_softmax(logits)
        all_actor_acts = []
        for i, (pi, obs_) in enumerate(zip(self.policies, obs)):
            if i == agent_id:
                all_actor_acts.append(act)
            else:
                all_actor_acts.append(onehot_from_logits(pi(obs_)))
        vf_input = torch.cat((*obs, *all_actor_acts), dim=1)
        actor_loss = -current_agent.critic(vf_input).mean()
        actor_loss += (logits ** 2).mean() * 1e-3  # small regularizer on the logits
        actor_loss.backward()
        current_agent.actor_optimizer.step()

    # soft-update all target networks
    def update_all_target(self):
        for agt in self.agents:
            agt.soft_update(agt.actor, agt.target_actor, self.tau)
            agt.soft_update(agt.critic, agt.target_critic, self.tau)

Define a function for evaluating the learned policies

def evaluate(env_id, maddpg, n_episode=10, episode_length=25):
    env = make_env(env_id)
    returns = np.zeros(len(env.agents))
    for _ in range(n_episode):
        obs = env.reset()
        for t_i in range(episode_length):
            actions = maddpg.take_action(obs, explore=False)
            obs, rew, done, info = env.step(actions)
            rew = np.array(rew)
            returns += rew / n_episode
    return returns.tolist()

Training
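
The loop below relies on rl_utils.ReplayBuffer and rl_utils.moving_average, helper utilities from the Hands-on-RL codebase that are not shown in this post. Here is a minimal sketch that is compatible with how they are called below (the actual rl_utils implementation may differ):

import collections
import random
import numpy as np

class ReplayBuffer:
    """Minimal FIFO experience replay buffer matching the calls used below."""
    def __init__(self, capacity):
        self.buffer = collections.deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def size(self):
        return len(self.buffer)

    def sample(self, batch_size):
        transitions = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*transitions)
        return states, actions, rewards, next_states, dones

def moving_average(a, window_size):
    """Same-length smoothing so the curve can be plotted against the episode axis."""
    kernel = np.ones(window_size) / window_size
    return np.convolve(np.asarray(a, dtype=float), kernel, mode='same')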

num_episodes = 5000
episode_length = 25
buffer_size = 100000
hidden_dim = 128
actor_lr = 1e-3
critic_lr = 1e-3
gamma = 0.99
tau = 0.005
batch_size = 256
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
update_interval = 50
minimal_size = 4000
epsilon = 0.3

maddpg = MADDPG(env, device, actor_lr, critic_lr, hidden_dim, state_dims,
                action_dims, critic_input_dim, gamma, tau)
replay_buffer = rl_utils.ReplayBuffer(buffer_size)
return_list = []
total_step = 0
for episode in range(num_episodes):
    state = env.reset()
    for step in range(episode_length):
        actions = maddpg.take_action(state, explore=True)
        next_state, reward, done, _ = env.step(actions)
        replay_buffer.add(state, actions, reward, next_state, done)
        state = next_state
        total_step += 1
        # once the buffer holds at least minimal_size transitions, update the networks every update_interval steps
        if replay_buffer.size() >= minimal_size and total_step % update_interval == 0:
            sample = replay_buffer.sample(batch_size)

            # regroup the sampled batch by agent
            def stack_array(x):
                rearranged = [[sub_x[i] for sub_x in x] for i in range(len(x[0]))]
                return [
                    torch.FloatTensor(np.vstack(ra)).to(device)
                    for ra in rearranged
                ]

            sample = [stack_array(x) for x in sample]
            # update every agent's critic and actor networks
            for agent_id in range(len(env.agents)):
                maddpg.update(sample, agent_id)
            # soft-update the target networks
            maddpg.update_all_target()
    if (episode + 1) % 100 == 0:
        ep_returns = evaluate(env_id, maddpg, n_episode=100)
        return_list.append(ep_returns)
        print(f'Episode: {episode + 1}, {ep_returns}')

return_array = np.array(return_list)
for i, agent_name in enumerate(["adversary", "agent0", "agent1"]):
    plt.figure()
    plt.plot(
        np.arange(return_array.shape[0]) * 100,
        rl_utils.moving_average(return_array[:, i], 9)
    )
    plt.xlabel("Episode")
    plt.ylabel("Returns")
    plt.title(agent_name)
    plt.show()
Episode: 100, [-41.79770129924304, -6.3168697492582515, -6.3168697492582515]
Episode: 200, [-35.746634777012446, -2.223242348588429, -2.223242348588429]
Episode: 300, [-27.09008911270123, 4.13448241750085, 4.13448241750085]
Episode: 400, [-17.31921815462635, -12.26606023952409, -12.26606023952409]
Episode: 500, [-15.46087910500068, -6.349905966329104, -6.349905966329104]
Episode: 600, [-16.14443559368269, -3.0710776084592317, -3.0710776084592317]
Episode: 700, [-11.876055591617778, -5.904505591539993, -5.904505591539993]
Episode: 800, [-13.078474442139006, 4.495683109038817, 4.495683109038817]
Episode: 900, [-11.611163279573697, 3.93312276675548, 3.93312276675548]
Episode: 1000, [-11.486472126270582, 3.8046557249097206, 3.8046557249097206]
Episode: 1100, [-12.25079886118112, 6.378494530471136, 6.378494530471136]
Episode: 1200, [-10.257372543398363, 4.9894133046940725, 4.9894133046940725]
Episode: 1300, [-12.253466411078032, 7.5246822061406, 7.5246822061406]
Episode: 1400, [-11.211580279418538, 7.265312802084386, 7.265312802084386]
Episode: 1500, [-10.018828262498543, 6.807178644578339, 6.807178644578339]
Episode: 1600, [-9.910202894862806, 7.219979769962865, 7.219979769962865]
Episode: 1700, [-10.892095077410836, 7.563795471782733, 7.563795471782733]
Episode: 1800, [-10.639810730684314, 7.547593259255415, 7.547593259255415]
Episode: 1900, [-11.090809954616025, 7.8110478125958345, 7.8110478125958345]
Episode: 2000, [-9.360928161662294, 6.865737778390727, 6.865737778390727]
Episode: 2100, [-9.175077219714188, 6.825378315992431, 6.825378315992431]
Episode: 2200, [-8.668886375792239, 6.089487964131182, 6.089487964131182]
Episode: 2300, [-9.130070216868365, 6.391617506457572, 6.391617506457572]
Episode: 2400, [-10.399632648786255, 6.921985492024565, 6.921985492024565]
Episode: 2500, [-7.774121493125542, 5.450734621378697, 5.450734621378697]
Episode: 2600, [-8.737397538287832, 6.360115265517296, 6.360115265517296]
Episode: 2700, [-8.853587586610892, 6.205794741595939, 6.205794741595939]
Episode: 2800, [-7.765678006096937, 5.157541385382278, 5.157541385382278]
Episode: 2900, [-8.634643994364698, 6.300397829826716, 6.300397829826716]
Episode: 3000, [-8.749613090492417, 5.622372436994646, 5.622372436994646]
Episode: 3100, [-8.734745271210954, 6.277589113737612, 6.277589113737612]
Episode: 3200, [-8.225847316887608, 5.5379100961317524, 5.5379100961317524]
Episode: 3300, [-6.585527235470495, 4.716530421711814, 4.716530421711814]
Episode: 3400, [-9.299004363720465, 5.889285355348535, 5.889285355348535]
Episode: 3500, [-9.387208441243274, 5.596915497971028, 5.596915497971028]
Episode: 3600, [-8.895741261649446, 6.1194895186211715, 6.1194895186211715]
Episode: 3700, [-9.019982922011769, 5.298595280177206, 5.298595280177206]
Episode: 3800, [-8.586416198816009, 5.203399130661042, 5.203399130661042]
Episode: 3900, [-9.710989754236136, 5.996024970342459, 5.996024970342459]
Episode: 4000, [-9.036048166193453, 5.787041784355292, 5.787041784355292]
Episode: 4100, [-9.89072164409785, 5.789933911998946, 5.789933911998946]
Episode: 4200, [-9.934713788312312, 5.8068017294058105, 5.8068017294058105]
Episode: 4300, [-8.932486266144968, 5.362292334482804, 5.362292334482804]
Episode: 4400, [-8.582713557919002, 5.014241243228755, 5.014241243228755]
Episode: 4500, [-10.866978184979779, 6.364532836227362, 6.364532836227362]
Episode: 4600, [-8.584039710726367, 5.459594052206162, 5.459594052206162]
Episode: 4700, [-9.160827446809247, 5.1421444464918125, 5.1421444464918125]
Episode: 4800, [-8.429160173009647, 5.1680956738547295, 5.1680956738547295]
Episode: 4900, [-9.085638445447698, 5.83764664982803, 5.83764664982803]
Episode: 5000, [-10.111440830401705, 6.1385650332803605, 6.1385650332803605]


MAPPO (Multi-Agent Proximal Policy Optimization) and MADDPG (Multi-Agent Deep Deterministic Policy Gradient) are both multi-agent reinforcement learning algorithms, but they differ in several respects.

MAPPO extends Proximal Policy Optimization (PPO) to cooperative multi-agent decision making. It improves training by introducing a shared, centralized value function alongside the policy functions: one centralized value function estimates the value for all agents, while each agent keeps its own policy function. This helps the agents cooperate and avoid conflict and competition.

MADDPG extends Deep Deterministic Policy Gradient (DDPG) to multi-agent decision making. Each agent has its own actor and critic networks and updates its own policy and value function based on its observations and actions; experience replay and target networks are used to improve the stability and effectiveness of training.

In summary, the main differences are:
1. MAPPO uses a centralized value function to estimate the value for all agents, whereas in MADDPG each agent has its own critic network to estimate its own value.
2. MAPPO introduces shared value and policy functions during training, whereas in MADDPG every agent has its own actor and critic networks.
3. MAPPO emphasizes cooperation among agents and avoiding conflict and competition, whereas MADDPG focuses more on each agent's individual decision making and learning.