The multi-agent setting is more complex than the single-agent one, because each agent interacts not only with the environment but also, directly or indirectly, with the other agents. Multi-agent reinforcement learning can be divided into the following categories:
- Centralized RL: a single global learning unit carries out the learning task, taking the joint state of the whole multi-agent system as input and outputting an action for each agent.
- Independent RL: each agent is an independent learner that only considers its own observations and its own interests.
- Social RL: independent RL combined with social/economic models; it simulates the way individuals interact in human society and uses methods from sociology and management science to regulate the relationships between agents.
- Group (swarm) RL: the centralized-training decentralized-execution (CTDE) paradigm, which combines the advantages of centralized and independent learning. During training, the agents learn centrally using global information; during execution, each agent selects its actions using only its own observations and local information.
7.1 The IPPO Algorithm
In IPPO, each agent is trained independently with the single-agent PPO algorithm.
Algorithm flow
- For $N$ agents, initialize each agent's own policy and value function
- for training rounds $k = 0, 1, 2, \cdots$ do
  - All agents interact with the environment and each collects its own trajectory
  - For each agent, compute the advantage estimates with GAE based on its current value function
  - For each agent, update its policy by maximizing its PPO-clip objective
  - For each agent, optimize its value function with a mean-squared-error loss
The Combat environment
Combat is a two-team battle simulation played on a two-dimensional grid world. Each agent's action set is: move one cell in one of the four directions, attack another enemy agent within the surrounding 3x3 area, or do nothing. Each agent starts with 3 hit points; whenever it is hit inside an enemy's attack range it loses 1 hit point, and it dies when its hit points drop to 0. The team that still has surviving agents at the end wins. Each agent's attack has a one-turn cooldown.
In the game we control all agents of one team against the other team, whose agents follow a fixed rule: attack the nearest enemy within attack range, and move toward the enemies if none is in range.
Code implementation
Import the Combat environment
git clone https://github.com/boyu-ai/ma-gym.git
import torch
import torch.nn.functional as F
import numpy as np
from tqdm import tqdm
import matplotlib.pyplot as plt
import sys
sys.path.append("../ma-gym")
from ma_gym.envs.combat.combat import Combat
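Before building the agents, a quick random-policy rollout can verify the observation and action interfaces. The snippet below is an illustrative sanity check only (the temporary name env_check and the random policy are not part of the training code); the grid and team sizes mirror the settings used later.
# illustrative sanity check: a short random-policy rollout in Combat
env_check = Combat(grid_shape=(15, 15), n_agents=2, n_opponents=2)
obs = env_check.reset()  # one observation per controlled agent
print(len(obs), len(obs[0]))  # number of agents, observation dimension
done = [False, False]
while not all(done):
    actions = [space.sample() for space in env_check.action_space]  # one random action per agent
    obs, rewards, done, info = env_check.step(actions)
print(rewards, info)  # per-agent rewards and the info dict (used later for the 'win' flag)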
The PPO algorithm
# PPO algorithm
class PolicyNet(torch.nn.Module):
def __init__(self, state_dim, hidden_dim, action_dim):
super(PolicyNet, self).__init__()
self.fc1 = torch.nn.Linear(state_dim, hidden_dim)
self.fc2 = torch.nn.Linear(hidden_dim, hidden_dim)
self.fc3 = torch.nn.Linear(hidden_dim, action_dim)
def forward(self, x):
x = F.relu(self.fc2(F.relu(self.fc1(x))))
return F.softmax(self.fc3(x), dim=1)
class ValueNet(torch.nn.Module):
def __init__(self, state_dim, hidden_dim):
super(ValueNet, self).__init__()
self.fc1 = torch.nn.Linear(state_dim, hidden_dim)
self.fc2 = torch.nn.Linear(hidden_dim, hidden_dim)
self.fc3 = torch.nn.Linear(hidden_dim, 1)
def forward(self, x):
x = F.relu(self.fc2(F.relu(self.fc1(x))))
return self.fc3(x)
def compute_advantage(gamma, lmbda, td_delta):
td_delta = td_delta.detach().numpy()
advantage_list = []
advantage = 0.0
for delta in td_delta[::-1]:
advantage = gamma * lmbda * advantage + delta
advantage_list.append(advantage)
advantage_list.reverse()
return torch.tensor(advantage_list, dtype=torch.float)
# PPO with the clipped surrogate objective (PPO-clip)
class PPO:
def __init__(self, state_dim, hidden_dim, action_dim,
actor_lr, critic_lr, lmbda, eps, gamma, device):
self.actor = PolicyNet(state_dim, hidden_dim, action_dim).to(device)
self.critic = ValueNet(state_dim, hidden_dim).to(device)
self.actor_optimizer = torch.optim.Adam(self.actor.parameters(), actor_lr)
self.critic_optimizer = torch.optim.Adam(self.critic.parameters(), critic_lr)
self.gamma = gamma
self.lmbda = lmbda
        self.eps = eps  # clipping range parameter in PPO
self.device = device
def take_action(self, state):
state = torch.tensor([state], dtype=torch.float).to(self.device)
probs = self.actor(state)
        action_dist = torch.distributions.Categorical(probs)
        action = action_dist.sample()
return action.item()
def update(self, transition_dict):
states = torch.tensor(transition_dict['states'], dtype=torch.float).to(self.device)
actions = torch.tensor(transition_dict['actions']).view(-1, 1).to(self.device)
rewards = torch.tensor(transition_dict['rewards'], dtype=torch.float).view(-1, 1).to(self.device)
next_states = torch.tensor(transition_dict['next_states'], dtype=torch.float).to(self.device)
dones = torch.tensor(transition_dict['dones'], dtype=torch.float).view(-1, 1).to(self.device)
td_target = rewards + self.gamma * self.critic(next_states) * (1 - dones)
td_delta = td_target - self.critic(states)
advantage = compute_advantage(self.gamma, self.lmbda, td_delta.cpu()).to(self.device)
        # note: with a single gradient step per batch (no inner epoch loop), old_log_probs
        # equals log_probs here, so the ratio below starts at 1
        old_log_probs = torch.log(self.actor(states).gather(1, actions)).detach()
        log_probs = torch.log(self.actor(states).gather(1, actions))
ratio = torch.exp(log_probs - old_log_probs)
surr1 = ratio * advantage
        surr2 = torch.clamp(ratio, 1 - self.eps, 1 + self.eps) * advantage  # clipped surrogate
        action_loss = torch.mean(-torch.min(surr1, surr2))  # PPO policy loss
critic_loss = torch.mean(F.mse_loss(self.critic(states), td_target.detach()))
self.actor_optimizer.zero_grad()
self.critic_optimizer.zero_grad()
action_loss.backward()
critic_loss.backward()
self.actor_optimizer.step()
self.critic_optimizer.step()
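As a quick check of the GAE recursion in compute_advantage above, the toy example below (illustrative numbers only) compares the recursive result with the explicit sums $A_t=\sum_{l\ge0}(\gamma\lambda)^l\delta_{t+l}$.
# toy check of compute_advantage (illustrative numbers only)
gamma_, lmbda_ = 0.9, 0.8
deltas = torch.tensor([[1.0], [0.5], [-0.2]])  # delta_0, delta_1, delta_2 as a column, like td_delta above
gl = gamma_ * lmbda_
print(compute_advantage(gamma_, lmbda_, deltas).view(-1))
print(torch.tensor([1.0 + gl * 0.5 + gl ** 2 * (-0.2), 0.5 + gl * (-0.2), -0.2]))  # same values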
Parameters and environment setup
actor_lr = 3e-4
critic_lr = 1e-3
epochs = 10
episode_per_epoch = 1000
hidden_dim = 64
gamma = 0.99
lmbda = 0.97
eps = 0.2
team_size = 2  # number of agents in each team
grid_size = (15, 15)  # size of the 2D grid world
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
# create the environment
env = Combat(grid_shape=grid_size, n_agents=team_size, n_opponents=team_size)
state_dim = env.observation_space[0].shape[0]
action_dim = env.action_space[0].n
Parameter sharing means that all agents use the same set of policy parameters. The prerequisite is that the agents are homogeneous, i.e. their state spaces and action spaces are identical and their optimization objectives are exactly the same.
- The agents do not share a policy
# create the agents (no parameter sharing)
agent1 = PPO(
state_dim, hidden_dim, action_dim,
actor_lr, critic_lr, lmbda, eps, gamma, device
)
agent2 = PPO(
state_dim, hidden_dim, action_dim,
actor_lr, critic_lr, lmbda, eps, gamma, device
)
- The agents share the same policy
# create a single agent (parameter sharing)
agent = PPO(
state_dim, hidden_dim, action_dim,
actor_lr, critic_lr, lmbda, eps, gamma, device
)
Training
win_list = []
for e in range(epochs):
with tqdm(total=episode_per_epoch, desc='Epoch %d' % e) as pbar:
for episode in range(episode_per_epoch):
# Replay buffer for agent1
buffer_agent1 = {
'states': [],
'actions': [],
'next_states': [],
'rewards': [],
'dones': []
}
# Replay buffer for agent2
buffer_agent2 = {
'states': [],
'actions': [],
'next_states': [],
'rewards': [],
'dones': []
}
            # reset the environment
s = env.reset()
terminal = False
while not terminal:
                # take actions (no parameter sharing)
a1 = agent1.take_action(s[0])
a2 = agent2.take_action(s[1])
                # take actions (parameter sharing)
# a1 = agent.take_action(s[0])
# a2 = agent.take_action(s[1])
next_s, r, done, info = env.step([a1, a2])
buffer_agent1['states'].append(s[0])
buffer_agent1['actions'].append(a1)
buffer_agent1['next_states'].append(next_s[0])
                # reward shaping: +100 on a win, otherwise an extra -0.1 penalty per step
buffer_agent1['rewards'].append(
r[0] + 100 if info['win'] else r[0] - 0.1)
buffer_agent1['dones'].append(False)
buffer_agent2['states'].append(s[1])
buffer_agent2['actions'].append(a2)
buffer_agent2['next_states'].append(next_s[1])
buffer_agent2['rewards'].append(
r[1] + 100 if info['win'] else r[1] - 0.1)
buffer_agent2['dones'].append(False)
                s = next_s  # move to the next state
terminal = all(done)
            # update the policies (no parameter sharing)
agent1.update(buffer_agent1)
agent2.update(buffer_agent2)
            # update the shared policy (parameter sharing)
# agent.update(buffer_agent1)
# agent.update(buffer_agent2)
win_list.append(1 if info['win'] else 0)
if (episode + 1) % 100 == 0:
pbar.set_postfix({
'episode': '%d' % (episode_per_epoch * e + episode + 1),
                    'winner prob': '%.3f' % np.mean(win_list[-100:])
})
pbar.update(1)
win_array = np.array(win_list)
# average the win rate over every 100 episodes
win_array = np.mean(win_array.reshape(-1, 100), axis=1)
episode_list = np.arange(win_array.shape[0]) * 100
plt.plot(episode_list, win_array)
plt.xlabel('Episodes')
plt.ylabel('win rate')
plt.title('IPPO on Combat')
plt.show()
7.2 The MADDPG Algorithm
Multi-Agent Deep Deterministic Policy Gradient (MADDPG).
Consider a game with $N$ agents whose policy parameters are $\theta=\{\theta_1,\cdots,\theta_N\}$, and let $\pi=\{\pi_1,\cdots,\pi_N\}$ denote the set of all agents' policies. The policy gradient of the expected return of agent $i$ is
$$\nabla_{\theta_i}J(\theta_i)=\mathbb E_{s\sim p^{\mu},\,a_i\sim\pi_i}\left[\nabla_{\theta_i}\log\pi_i(a_i\mid o_i)\,Q^\pi_i(\mathbf x,a_1,\cdots,a_N)\right]$$
where $Q^\pi_i(\mathbf x,a_1,\cdots,a_N)$ is the centralized action-value function and $\mathbf x=(o_1,\cdots,o_N)$ collects the observations of all agents. For $N$ continuous deterministic policies $\mu_{\theta_i}$, the corresponding DDPG-style gradient is
$$\nabla_{\theta_i}J(\mu_i)=\mathbb E_{\mathbf x\sim\mathcal D}\left[\nabla_{\theta_i}\mu_i(o_i)\,\nabla_{a_i}Q^\mu_i(\mathbf x,a_1,\cdots,a_N)\big|_{a_i=\mu_i(o_i)}\right]$$
Here $\mathcal D$ in $\mathbf x\sim\mathcal D$ denotes the experience replay buffer, which stores tuples $(\mathbf x,\mathbf x^\prime,a_1,\cdots,a_N,r_1,\cdots,r_N)$. In MADDPG, the loss function of the centralized action-value function is
$$\mathcal L(\omega_i)=\mathbb E_{\mathbf x,a,r,\mathbf x^\prime}\left[\left(Q^\mu_i(\mathbf x,a_1,\cdots,a_N)-y\right)^2\right],\qquad y=r_i+\gamma\,Q^{\mu^\prime}_i(\mathbf x^\prime,a^\prime_1,\cdots,a^\prime_N)\big|_{a^\prime_j=\mu^\prime_j(o_j)}$$
where $\mu^\prime=\{\mu^\prime_1,\cdots,\mu^\prime_N\}$ denotes the target policies with delayed parameters.
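The target $y$ above maps directly onto the critic update implemented later in MADDPG.update. The following is a minimal standalone sketch under assumed toy shapes, with plain linear layers standing in for the actual critic, target critic and target actors; it is meant only to make the formula concrete.
import torch

# minimal sketch of the centralized critic loss (illustrative shapes and stand-in networks)
N, obs_dim, act_dim, batch, gamma, i = 3, 8, 5, 4, 0.95, 0
obs      = [torch.randn(batch, obs_dim) for _ in range(N)]   # o_1, ..., o_N
next_obs = [torch.randn(batch, obs_dim) for _ in range(N)]
acts     = [torch.randn(batch, act_dim) for _ in range(N)]   # a_1, ..., a_N (one-hot in practice)
rewards  = [torch.randn(batch, 1) for _ in range(N)]
critic        = torch.nn.Linear(N * (obs_dim + act_dim), 1)  # stands in for Q_i^mu(x, a_1, ..., a_N)
target_critic = torch.nn.Linear(N * (obs_dim + act_dim), 1)  # stands in for Q_i^{mu'}
target_actors = [torch.nn.Linear(obs_dim, act_dim) for _ in range(N)]  # stand in for mu'_j
# y = r_i + gamma * Q_i^{mu'}(x', a'_1, ..., a'_N) with a'_j = mu'_j(o'_j)
next_acts = [mu(o) for mu, o in zip(target_actors, next_obs)]
y = rewards[i] + gamma * target_critic(torch.cat((*next_obs, *next_acts), dim=1))
# L(w_i) = E[(Q_i^mu(x, a_1, ..., a_N) - y)^2]
critic_loss = torch.mean((critic(torch.cat((*obs, *acts), dim=1)) - y.detach()) ** 2)
print(critic_loss.item())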
Algorithm flow
- for episode $e = 1 \to M$ do:
  - Initialize a random process $\mathcal N$ for action exploration
  - Obtain the initial observations $\mathbf x$ of all agents
  - for $t = 1 \to T$ do:
    - For each agent $i$, select an action $a_i=\mu_{\theta_i}(o_i)+\mathcal N_t$ with the current policy
    - Execute the joint action $a=(a_1,\cdots,a_N)$ and obtain the rewards $r$ and the new observations $\mathbf x^\prime$
    - Store $(\mathbf x,a,r,\mathbf x^\prime)$ in the replay buffer $\mathcal D$
    - $\mathbf x\leftarrow\mathbf x^\prime$
    - for agent $i = 1 \to N$ do:
      - Sample a random minibatch $(\mathbf x^j,a^j,r^j,\mathbf x^{\prime j})$ from $\mathcal D$
      - Train the critic network in a centralized way
      - Train the agent's own actor network
    - Update the target actor and target critic networks
The MPE environment
The scenario used here, simple_adversary from the Multi-Agent Particle Environment (MPE), consists of 1 red adversary agent, $N$ blue normal agents, and $N$ landmarks (usually $N=2$), one of which is the target landmark (shown in green). The normal agents know which landmark is the target, but the adversary does not. The normal agents cooperate: as long as any one of them is close enough to the target landmark, every normal agent receives the same reward. The adversary is also rewarded for being close to the target landmark, but it has to guess which landmark is the target. The normal agents therefore need to cooperate and spread out over the different landmarks in order to deceive the adversary.
Gumbel-Softmax approximate sampling
Since each agent's action space in the MPE environment is discrete, while the DDPG algorithm requires the agents' actions to be differentiable with respect to their policy parameters, the Gumbel-Softmax trick is introduced to obtain differentiable approximate samples from a discrete distribution.
Assume a random variable $Z$ follows a discrete distribution $\mathcal K=(a_1,\cdots,a_k)$, where $a_i\in[0,1]$ denotes $P(Z=i)$ and $\sum^k_{i=1}a_i=1$. Introduce a reparameterization factor $g_i$, a noise sample drawn from the Gumbel(0, 1) distribution:
$$g_i=-\log(-\log u),\quad u\sim\mathrm{Uniform}(0,1)$$
The Gumbel-Softmax sample can then be written as
$$y_i=\frac{e^{(\log a_i+g_i)/\tau}}{\sum^k_{j=1}e^{(\log a_j+g_j)/\tau}},\quad\forall i=1,\cdots,k$$
The discrete value $z=\arg\max_i y_i$ is approximately equivalent to a discrete sample $z\sim\mathcal K$, and the sampled vector $y$ naturally carries gradients with respect to $a$. The temperature parameter $\tau>0$ controls how closely the Gumbel-Softmax distribution approximates the discrete one: the smaller $\tau$ is, the closer the generated distribution is to $\text{onehot}(\arg\max_i(\log a_i+g_i))$; the larger $\tau$ is, the closer it is to the uniform distribution.
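The approximation can be checked numerically. The short sketch below is an illustrative example (not part of the original code): it implements the formula above directly and compares the empirical frequencies of $z=\arg\max_i y_i$ with the target probabilities $a$.
import torch

# illustrative numeric check of Gumbel-Softmax sampling
torch.manual_seed(0)
a = torch.tensor([0.2, 0.5, 0.3])  # target categorical probabilities
tau, n_samples = 0.5, 10000  # temperature and number of samples
u = torch.rand(n_samples, 3)  # u ~ Uniform(0, 1)
g = -torch.log(-torch.log(u))  # g ~ Gumbel(0, 1)
y = torch.softmax((torch.log(a) + g) / tau, dim=1)  # Gumbel-Softmax samples
z = y.argmax(dim=1)  # discretized samples
# the empirical frequencies should be close to a = [0.2, 0.5, 0.3]
print(torch.bincount(z, minlength=3).float() / n_samples)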
Code implementation
Import the MPE environment
git clone https://github.com/boyu-ai/multiagent-particle-envs.git --quiet
pip install -e multiagent-particle-envs
# due to version issues with multiagent-particle-envs, gym needs to be downgraded to a compatible version
pip install --upgrade gym==0.10.5 -q
import os
import time
import torch
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
import random
import collections
import gym
import sys
sys.path.append("../multiagent-particle-envs")  # path where the cloned package is stored
import rl_utils  # assumed available: the utility module from the Hands-on RL codebase (provides ReplayBuffer and moving_average)
from multiagent.environment import MultiAgentEnv
import multiagent.scenarios as scenarios
Create the environment
def make_env(name):
scenario = scenarios.load(f'{name}.py').Scenario()
world = scenario.make_world()
return MultiAgentEnv(world, scenario.reset_world, scenario.reward, scenario.observation)
env_id = "simple_adversary"
env = make_env(env_id)
state_dims = [state_space.shape[0] for state_space in env.observation_space]
action_dims = [action_space.n for action_space in env.action_space]
critic_input_dim = sum(state_dims) + sum(action_dims)
Define utility functions, including the Gumbel-Softmax sampling functions that make DDPG applicable to discrete action spaces.
# one-hot encoding of the greedy (argmax) action
def onehot_from_logits(logits, eps=0.01):
argmax_acs = (logits == logits.max(1, keepdim=True)[0]).float()
    # generate random actions and convert them to one-hot form
rand_acs = torch.autograd.Variable(
torch.eye(logits.shape[1])[[
np.random.choice(range(logits.shape[1]), size=logits.shape[0])
]], requires_grad=False
).to(logits.device)
    # choose between the greedy and the random action with epsilon-greedy
return torch.stack([
argmax_acs[i] if r > eps else rand_acs[i]
for i, r in enumerate(torch.rand(logits.shape[0]))
])
# sample noise from the Gumbel(0, 1) distribution
def sample_gumbel(shape, eps=1e-20, tens_type=torch.FloatTensor):
U = torch.autograd.Variable(tens_type(*shape).uniform_(), requires_grad=False)
return -torch.log(-torch.log(U + eps) + eps)
# sample from the Gumbel-Softmax distribution
def gumbel_softmax_sample(logits, temperature):
y = logits + sample_gumbel(logits.shape, tens_type=type(logits.data)).to(logits.device)
return F.softmax(y / temperature, dim=1)
# sample from the Gumbel-Softmax distribution and discretize
def gumbel_softmax(logits, temperature=1.0):
y = gumbel_softmax_sample(logits, temperature)
y_hard = onehot_from_logits(y)
    y = (y_hard.to(logits.device) - y).detach() + y  # straight-through trick: forward pass uses the one-hot y_hard, gradients flow through y
return y
Implement single-agent DDPG, including the actor and critic networks and the action-selection function
class ThreeLayerFC(torch.nn.Module):
def __init__(self, num_in, num_out, hidden_dim):
super().__init__()
self.fc1 = torch.nn.Linear(num_in, hidden_dim)
self.fc2 = torch.nn.Linear(hidden_dim, hidden_dim)
self.fc3 = torch.nn.Linear(hidden_dim, num_out)
def forward(self, x):
x = F.relu(self.fc2(F.relu(self.fc1(x))))
return self.fc3(x)
class DDPG:
def __init__(self, state_dim, action_dim, critic_input_dim, hidden_dim,
actor_lr, critic_lr, device):
self.actor = ThreeLayerFC(state_dim, action_dim, hidden_dim).to(device)
self.target_actor = ThreeLayerFC(state_dim, action_dim, hidden_dim).to(device)
self.critic = ThreeLayerFC(critic_input_dim, 1, hidden_dim).to(device)
self.target_critic = ThreeLayerFC(critic_input_dim, 1, hidden_dim).to(device)
self.target_critic.load_state_dict(self.critic.state_dict())
self.target_actor.load_state_dict(self.actor.state_dict())
self.actor_optimizer = torch.optim.Adam(self.actor.parameters(), actor_lr)
self.critic_optimizer = torch.optim.Adam(self.critic.parameters(), critic_lr)
def take_action(self, state, explore=False):
action = self.actor(state)
if explore:
action = gumbel_softmax(action)
else:
action = onehot_from_logits(action)
return action.detach().cpu().numpy()[0]
def soft_update(self, net, target_net, tau):
for param_target, param in zip(target_net.parameters(), net.parameters()):
param_target.data.copy_(param_target.data * (1.0 - tau) + param.data * tau)
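The soft_update method above implements Polyak averaging of the target-network parameters with rate tau. The following small check is illustrative only (the dimensions are hypothetical) and simply confirms the blending rule.
# illustrative check of soft_update (Polyak averaging), with hypothetical dimensions
agent = DDPG(state_dim=4, action_dim=2, critic_input_dim=12, hidden_dim=8,
             actor_lr=1e-3, critic_lr=1e-3, device=torch.device("cpu"))
agent.actor.fc1.weight.data.add_(1.0)  # perturb the online actor
old_target = agent.target_actor.fc1.weight.data.clone()
agent.soft_update(agent.actor, agent.target_actor, tau=0.1)
# target <- 0.9 * target + 0.1 * online
print(torch.allclose(agent.target_actor.fc1.weight.data,
                     0.9 * old_target + 0.1 * agent.actor.fc1.weight.data))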
The MADDPG algorithm
class MADDPG:
def __init__(self, env, device, actor_lr, critic_lr, hidden_dim,
state_dims, action_dims, critic_input_dim, gamma, tau):
self.agents = [DDPG(
state_dims[i],
action_dims[i],
critic_input_dim,
hidden_dim,
actor_lr,
critic_lr,
device
) for i in range(len(env.agents))]
self.gamma = gamma
self.tau = tau
self.critic_criterion = torch.nn.MSELoss()
self.device = device
@property
def policies(self):
return [agt.actor for agt in self.agents]
@property
def target_policies(self):
return [agt.target_actor for agt in self.agents]
def take_action(self, states, explore):
        # split the joint observation so that each agent acts on its own observation
states = [
torch.tensor([states[i]], dtype=torch.float, device=self.device)
            for i in range(len(self.agents))
]
return [
agent.take_action(state, explore)
for agent, state in zip(self.agents, states)
]
def update(self, sample, agent_id):
current_agent = self.agents[agent_id]
obs, acts, rewards, next_obs, done = sample
        '''update the critic network'''
        current_agent.critic_optimizer.zero_grad()
        # compute the TD target
all_target_act = [
onehot_from_logits(pi(next_obs_))
for pi, next_obs_ in zip(self.target_policies, next_obs)
]
        # concatenate the inputs of the target critic network
target_critic_input = torch.cat((*next_obs, *all_target_act), dim=1)
target_critic_value = rewards[agent_id].view(-1, 1)\
+ self.gamma * (1 - done[agent_id].view(-1, 1)) * current_agent.target_critic(target_critic_input)
        # compute the current Q value
critic_input = torch.cat((*obs, *acts), dim=1)
critic_value = current_agent.critic(critic_input)
        # update the critic with the MSE loss
critic_loss = self.critic_criterion(critic_value, target_critic_value.detach())
critic_loss.backward()
        current_agent.critic_optimizer.step()
        '''update the actor network'''
        current_agent.actor_optimizer.zero_grad()
logits = current_agent.actor(obs[agent_id])
act = gumbel_softmax(logits)
all_actor_acts = []
for i, (pi, obs_) in enumerate(zip(self.policies, obs)):
if i == agent_id:
all_actor_acts.append(act)
else:
all_actor_acts.append(onehot_from_logits(pi(obs_)))
vf_input = torch.cat((*obs, *all_actor_acts), dim=1)
actor_loss = -current_agent.critic(vf_input).mean()
        actor_loss += (logits ** 2).mean() * 1e-3  # small regularization term on the logits
actor_loss.backward()
        current_agent.actor_optimizer.step()
    # soft-update the target networks
def update_all_target(self):
for agt in self.agents:
agt.soft_update(agt.actor, agt.target_actor, self.tau)
agt.soft_update(agt.critic, agt.target_critic, self.tau)
Define the policy evaluation routine
def evaluate(env_id, maddpg, n_episode=10, episode_length=25):
env = make_env(env_id)
returns = np.zeros(len(env.agents))
for _ in range(n_episode):
obs = env.reset()
for t_i in range(episode_length):
actions = maddpg.take_action(obs, explore=False)
obs, rew, done, info = env.step(actions)
rew = np.array(rew)
returns += rew / n_episode
return returns.tolist()
Training
num_episodes = 5000
episode_length = 25
buffer_size = 100000
hidden_dim = 128
actor_lr = 1e-3
critic_lr = 1e-3
gamma = 0.99
tau = 0.005
batch_size = 256
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
update_interval = 50
minimal_size = 4000
epsilon = 0.3
maddpg = MADDPG(env, device, actor_lr, critic_lr, hidden_dim, state_dims,
action_dims, critic_input_dim, gamma, tau)
replay_buffer = rl_utils.ReplayBuffer(buffer_size)
return_list = []
total_step = 0
for episode in range(num_episodes):
state = env.reset()
for step in range(episode_length):
actions = maddpg.take_action(state, explore=True)
next_state, reward, done, _ = env.step(actions)
replay_buffer.add(state, actions, reward, next_state, done)
state = next_state
total_step += 1
        # once the buffer holds at least minimal_size samples, update every update_interval environment steps
if replay_buffer.size() >= minimal_size and total_step % update_interval == 0:
sample = replay_buffer.sample(batch_size)
            # rearrange the sampled batch, grouping data by agent
def stack_array(x):
rearranged = [[sub_x[i] for sub_x in x] for i in range(len(x[0]))]
return [
torch.FloatTensor(np.vstack(ra)).to(device)
for ra in rearranged
]
sample = [stack_array(x) for x in sample]
            # update each agent's critic and actor networks
for agent_id in range(len(env.agents)):
maddpg.update(sample, agent_id)
            # update the target networks
maddpg.update_all_target()
if (episode + 1) % 100 == 0:
ep_returns = evaluate(env_id, maddpg, n_episode=100)
return_list.append(ep_returns)
print(f'Episode: {episode + 1}, {ep_returns}')
return_array = np.array(return_list)
for i, agent_name in enumerate(["adversary", "agent0", "agent1"]):
plt.figure()
plt.plot(
np.arange(return_array.shape[0]) * 100,
rl_utils.moving_average(return_array[:, i], 9)
)
plt.xlabel("Episode")
plt.ylabel("Returns")
plt.title(agent_name)
Episode: 100, [-41.79770129924304, -6.3168697492582515, -6.3168697492582515]
Episode: 200, [-35.746634777012446, -2.223242348588429, -2.223242348588429]
Episode: 300, [-27.09008911270123, 4.13448241750085, 4.13448241750085]
Episode: 400, [-17.31921815462635, -12.26606023952409, -12.26606023952409]
Episode: 500, [-15.46087910500068, -6.349905966329104, -6.349905966329104]
Episode: 600, [-16.14443559368269, -3.0710776084592317, -3.0710776084592317]
Episode: 700, [-11.876055591617778, -5.904505591539993, -5.904505591539993]
Episode: 800, [-13.078474442139006, 4.495683109038817, 4.495683109038817]
Episode: 900, [-11.611163279573697, 3.93312276675548, 3.93312276675548]
Episode: 1000, [-11.486472126270582, 3.8046557249097206, 3.8046557249097206]
Episode: 1100, [-12.25079886118112, 6.378494530471136, 6.378494530471136]
Episode: 1200, [-10.257372543398363, 4.9894133046940725, 4.9894133046940725]
Episode: 1300, [-12.253466411078032, 7.5246822061406, 7.5246822061406]
Episode: 1400, [-11.211580279418538, 7.265312802084386, 7.265312802084386]
Episode: 1500, [-10.018828262498543, 6.807178644578339, 6.807178644578339]
Episode: 1600, [-9.910202894862806, 7.219979769962865, 7.219979769962865]
Episode: 1700, [-10.892095077410836, 7.563795471782733, 7.563795471782733]
Episode: 1800, [-10.639810730684314, 7.547593259255415, 7.547593259255415]
Episode: 1900, [-11.090809954616025, 7.8110478125958345, 7.8110478125958345]
Episode: 2000, [-9.360928161662294, 6.865737778390727, 6.865737778390727]
Episode: 2100, [-9.175077219714188, 6.825378315992431, 6.825378315992431]
Episode: 2200, [-8.668886375792239, 6.089487964131182, 6.089487964131182]
Episode: 2300, [-9.130070216868365, 6.391617506457572, 6.391617506457572]
Episode: 2400, [-10.399632648786255, 6.921985492024565, 6.921985492024565]
Episode: 2500, [-7.774121493125542, 5.450734621378697, 5.450734621378697]
Episode: 2600, [-8.737397538287832, 6.360115265517296, 6.360115265517296]
Episode: 2700, [-8.853587586610892, 6.205794741595939, 6.205794741595939]
Episode: 2800, [-7.765678006096937, 5.157541385382278, 5.157541385382278]
Episode: 2900, [-8.634643994364698, 6.300397829826716, 6.300397829826716]
Episode: 3000, [-8.749613090492417, 5.622372436994646, 5.622372436994646]
Episode: 3100, [-8.734745271210954, 6.277589113737612, 6.277589113737612]
Episode: 3200, [-8.225847316887608, 5.5379100961317524, 5.5379100961317524]
Episode: 3300, [-6.585527235470495, 4.716530421711814, 4.716530421711814]
Episode: 3400, [-9.299004363720465, 5.889285355348535, 5.889285355348535]
Episode: 3500, [-9.387208441243274, 5.596915497971028, 5.596915497971028]
Episode: 3600, [-8.895741261649446, 6.1194895186211715, 6.1194895186211715]
Episode: 3700, [-9.019982922011769, 5.298595280177206, 5.298595280177206]
Episode: 3800, [-8.586416198816009, 5.203399130661042, 5.203399130661042]
Episode: 3900, [-9.710989754236136, 5.996024970342459, 5.996024970342459]
Episode: 4000, [-9.036048166193453, 5.787041784355292, 5.787041784355292]
Episode: 4100, [-9.89072164409785, 5.789933911998946, 5.789933911998946]
Episode: 4200, [-9.934713788312312, 5.8068017294058105, 5.8068017294058105]
Episode: 4300, [-8.932486266144968, 5.362292334482804, 5.362292334482804]
Episode: 4400, [-8.582713557919002, 5.014241243228755, 5.014241243228755]
Episode: 4500, [-10.866978184979779, 6.364532836227362, 6.364532836227362]
Episode: 4600, [-8.584039710726367, 5.459594052206162, 5.459594052206162]
Episode: 4700, [-9.160827446809247, 5.1421444464918125, 5.1421444464918125]
Episode: 4800, [-8.429160173009647, 5.1680956738547295, 5.1680956738547295]
Episode: 4900, [-9.085638445447698, 5.83764664982803, 5.83764664982803]
Episode: 5000, [-10.111440830401705, 6.1385650332803605, 6.1385650332803605]