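16.9 Training the Model
This section implements the training process for the reinforcement learning agents. The agents use the Deep Deterministic Policy Gradient (DDPG) algorithm to solve a multi-agent cooperation problem. The project trains several agents: the predator agents are trained with DDPG, while the prey agents choose actions at random. The goal is for the predators to learn to capture the prey and maximize their cumulative reward. During training, several metrics are monitored and recorded so that the algorithm can be analyzed and improved later.
16.9.1 Initializing the Environment
A PredatorPreyEnv environment simulates the training scenario. It contains multiple predators and prey, along with parameters such as the field of view and the history length. The code is as follows.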
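from itertools import count

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch

# PredatorPreyEnv, DDPGAgent and RandomAgent are the project classes
# implemented in the earlier parts of this chapter.

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Enable interactive plotting when running inside a notebook
is_ipython = 'inline' in matplotlib.get_backend()
if is_ipython:
    from IPython import display
plt.ion()

# Hyperparameters
TAU = 0.001            # soft-update coefficient for the target networks
EPS_START = 0.9        # initial exploration rate
EPS_END = 0.05         # final exploration rate
EPS_DECAY = 500        # decay constant of the exploration rate (in steps)
ACTOR_LR = 1e-4
CRITIC_LR = 1e-2
GAMMA = 0.99           # discount factor
NUM_EPISODES = 1000
NUM_PRED = 5
NUM_PREY = 1
BATCH_SIZE = 8
VISION = 7             # do not change
SIZE = 13              # side length of the grid
LOAD = False           # set to True to resume from saved checkpoints
HISTORY = 4            # do not change
COMMUNICATION_BIT = 10

env = PredatorPreyEnv(  # Use PredatorPreyEnv2 for immortal prey
    render_mode='none',
    size=SIZE,
    predator=NUM_PRED,
    prey=NUM_PREY,
    episode_length=100,
    img_mode=True,
    vision=VISION,
    history_length=HISTORY,
    communication_bits=COMMUNICATION_BIT,
    success_reward=10,
    living_reward=-0.5,
    error_reward=0,
    cooperate=1
)
state, _ = env.reset()
n_actions = env.single_action_space.n
n_observations = 4  # considering single agent observation space

# Bookkeeping for the training loop
episode_durations = []
steps_done = 0
losses = []
steps = []
rewards = []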
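The EPS_START, EPS_END, and EPS_DECAY values above parameterize the exponential exploration schedule that the training loop evaluates after every environment step. As a minimal sketch, the schedule can be inspected on its own; the helper name epsilon_at is hypothetical and not part of the project.

```python
import math

EPS_START = 0.9
EPS_END = 0.05
EPS_DECAY = 500


def epsilon_at(step: int) -> float:
    """Exploration rate after `step` environment steps (hypothetical helper)."""
    return EPS_END + (EPS_START - EPS_END) * math.exp(-step / EPS_DECAY)


if __name__ == "__main__":
    # Print the exploration rate at a few points of the schedule.
    for step in (0, 500, 2000, 10000):
        print(step, round(epsilon_at(step), 4))
```

Epsilon starts at EPS_START and approaches EPS_END; every EPS_DECAY steps the remaining gap to EPS_END shrinks by a factor of e. With episodes of at most 100 steps, most of the decay therefore happens within the first few dozen episodes.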
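16.9.2 Creating the Agents
Next, create the predator and prey agents. The DDPGAgent class builds NUM_PRED predator agents (predators), and the RandomAgent class builds NUM_PREY random prey agents (preys). The code is as follows.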
predators = [DDPGAgent((HISTORY*4, VISION, VISION),
                       n_actions,
                       num_agents=NUM_PRED,
                       idd=i,
                       gamma=GAMMA,
                       tau=TAU,
                       actor_lr=ACTOR_LR,
                       critic_lr=CRITIC_LR,
                       communication_bits=COMMUNICATION_BIT,
                       batch_size=BATCH_SIZE,
                       buffer_size=int(1e6)
                       ) for i in range(NUM_PRED)]
preys = [RandomAgent(VISION*VISION*3,
                     n_actions,
                     fix_pos=False) for i in range(NUM_PREY)]
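In the code above, RandomAgent is a random agent: at every time step it picks an action at random, independent of the environment state and without any learning algorithm. Here it plays the prey and models an entity that cannot learn.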
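To make the interface concrete, here is a minimal sketch of an agent that could be called the same way RandomAgent is called in this section. It is not the project's class: the name RandomAgentSketch, the meaning of fix_pos, and the convention that action 0 means "stay" are all assumptions made for illustration.

```python
import random
from typing import Any, Optional, Tuple


class RandomAgentSketch:
    """Hypothetical stand-in that mirrors how RandomAgent is called here."""

    def __init__(self, input_dims: int, n_actions: int, fix_pos: bool = False):
        self.input_dims = input_dims  # kept for interface parity; unused
        self.n_actions = n_actions
        self.fix_pos = fix_pos  # assumption: True pins the prey in place

    def select_action(self, state: Any, epsilon: float) -> Tuple[int, Optional[int]]:
        # State and epsilon are ignored: the choice is uniformly random.
        if self.fix_pos:
            return 0, None  # assumption: action 0 means "stay"
        return random.randrange(self.n_actions), None

    def update_model(self) -> None:
        # Nothing to learn, so this is a no-op.
        return None

    def update_target_model(self) -> None:
        # No target network either.
        return None
```

Keeping the same method names as the learning agent lets a training loop treat predators and prey uniformly, even though the prey never learns.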
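16.9.3 The Training Loop
Training iterates over NUM_EPISODES episodes. Each step of an episode performs the following operations:
- Select actions: each predator agent selects an action based on the current state, and each prey agent selects an action at random.
- Execute actions: apply the predator and prey actions to the environment and obtain the next state, the rewards, and the done flag.
- Store experience: save the transition in each agent's experience replay buffer for later training.
- Update the models: update each predator agent's actor and critic models.
- Decay exploration: gradually lower the exploration rate (epsilon) according to the number of training steps so far.
- Monitor and record: track the loss, reward, and episode duration for later analysis and visualization.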
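The "store experience" step relies on a replay buffer. The project's own buffer is not shown in this section, so the following is only a sketch of a common fixed-capacity design. The class name ReplayBufferSketch and the sample method are assumptions; the push arguments follow the order used by the training loop (state, action, message, reward, next state, done).

```python
import random
from collections import deque
from typing import Any, Deque, List, Tuple

Transition = Tuple[Any, Any, Any, Any, Any, Any]


class ReplayBufferSketch:
    """Hypothetical fixed-capacity buffer; the project's class may differ."""

    def __init__(self, capacity: int):
        # deque(maxlen=...) drops the oldest transition once the buffer is full.
        self.memory: Deque[Transition] = deque(maxlen=capacity)

    def push(self, state, action, message, reward, next_state, done) -> None:
        self.memory.append((state, action, message, reward, next_state, done))

    def sample(self, batch_size: int) -> List[Transition]:
        # Uniform sampling without replacement decorrelates consecutive steps.
        return random.sample(list(self.memory), batch_size)

    def __len__(self) -> int:
        return len(self.memory)


if __name__ == "__main__":
    buffer = ReplayBufferSketch(capacity=3)
    for t in range(5):
        buffer.push(t, 0, 0, -0.5, t + 1, False)
    print(len(buffer), [tr[0] for tr in buffer.memory])
```

A model update is typically skipped until the buffer holds at least one full batch, which may be why update_model() in the loop can return None.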
The full training loop is implemented as follows.
fin_losses = []
episode_durations = []
fin_rewards = []
# Window sizes of the moving averages used when plotting
MAV_loss_window = 5
MAV_episode_duration_window = 5
MAV_rewards_window = 5
epsilon_predator = EPS_START
epsilon_prey = EPS_START
# Load saved models and optimizer states
if LOAD:
    for i in range(NUM_PRED):
        predators[i].load_model(f"./models2/actor_model_{i}.dict", f"./models2/critic_model{i}.dict")
        predators[i].load_optim(f"./models2/actor_optimizer_{i}.dict", f"./models2/critic_optimizer_{i}.dict")
for i_episode in range(NUM_EPISODES):
    # Initialize the environment and get its state
    state, info = env.reset()
    losses = []
    episode_reward = 0
    for i_step in count():
        # Select and perform an action
        predator_actions = []
        prey_actions = []
        pred_communication = [0 for _ in range(NUM_PRED)]
        prey_communication = [0 for _ in range(NUM_PREY)]
        if state is None:
            # No observation yet: step once with default actions
            pred_actions = [0 for _ in range(NUM_PRED)]
            prey_actions = [0 for _ in range(NUM_PREY)]
            next_state, reward, done, info = env.step(pred_actions,
                                                      prey_actions,
                                                      pred_communication=pred_communication,
                                                      prey_communication=prey_communication)
            steps_done += 1
            # Move to the next state
            state = next_state
            if done:
                episode_durations.append(i_step + 1)
                break
            continue
        # CHOOSE ACTION
        for i in range(NUM_PRED):
            # Stack the observation history of predator i along the channel axis
            state_i = [torch.tensor(s["predator"][i], dtype=torch.float32, device=device) for s in state]
            state_i = torch.cat(state_i, dim=0).unsqueeze(0)  # add a batch dimension
            action_i, action_choice, msg_i = predators[i].select_action(state_i, epsilon_predator)
            predator_actions.append(action_choice)
            pred_communication[i] = msg_i
        for i in range(NUM_PREY):
            state_i = [torch.tensor(s["prey"][i], dtype=torch.float32, device=device) for s in state]
            action_i, _ = preys[i].select_action(state_i, epsilon_prey)
            prey_actions.append(action_i)
        # TAKE ACTION IN ENVIRONMENT
        next_state, reward, done, info = env.step(predator_actions,
                                                  prey_actions,
                                                  pred_communication=pred_communication)
        # Accumulate the discounted mean predator reward of this step
        episode_reward += sum(reward["predator"]) / len(reward["predator"]) * GAMMA ** i_step
        # Store the transition in memory
        total_state = []
        total_next_state = []
        reward_total = torch.tensor(reward["predator"], dtype=torch.float32, device=device)
        action_total = torch.tensor(predator_actions, dtype=torch.float32, device=device)
        pred_msg_total = torch.tensor(pred_communication, dtype=torch.float32, device=device)
        done_total = torch.tensor(done, dtype=torch.float32, device=device)
        for i in range(NUM_PRED):
            state_i = [torch.tensor(s["predator"][i], dtype=torch.float32, device=device) for s in state]
            state_i = torch.cat(state_i, dim=0)
            next_state_i = [torch.tensor(s["predator"][i], dtype=torch.float32, device=device) for s in next_state]
            next_state_i = torch.cat(next_state_i, dim=0)
            total_state.append(state_i)
            total_next_state.append(next_state_i)
        # Joint state of all predators, shared by every agent's buffer
        total_state = torch.cat(total_state, dim=0)
        total_next_state = torch.cat(total_next_state, dim=0)
        for i in range(NUM_PRED):
            predators[i].replay_buffer.push(total_state, action_total, pred_msg_total, reward_total, total_next_state, done_total)
        # Update the actor-critic networks
        for i in range(NUM_PRED):
            loss = predators[i].update_model()
            if loss is not None:
                losses.append(loss)
            predators[i].update_target_model()
        for i in range(NUM_PREY):
            preys[i].update_model()
            preys[i].update_target_model()
        # Decay epsilon
        epsilon_predator = EPS_END + (EPS_START - EPS_END) * \
            np.exp(-1. * steps_done / EPS_DECAY)
        epsilon_prey = EPS_END + (EPS_START - EPS_END) * \
            np.exp(-1. * steps_done / EPS_DECAY)
        steps_done += 1
        state = next_state
        if done:
            episode_durations.append(i_step + 1)
            break
    # Mean loss of the episode (0.0 if no update happened)
    LOSS = sum(losses) / len(losses) if losses else 0.0
    if i_episode > 3:
        fin_losses.append(LOSS)
    fin_rewards.append(episode_reward)
    print(f"Episode {i_episode} finished after {i_step+1} steps with Loss {LOSS:.3f}, REWARD {episode_reward:.3f}", end='\r')
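Each call to update_target_model() in the loop is the point where DDPG usually blends the online network into its target network using the TAU coefficient. The source does not show that method, so the following is only a sketch of the standard soft (Polyak) update. It uses plain Python lists so that it runs without PyTorch, and soft_update is a hypothetical name.

```python
from typing import List

TAU = 0.001


def soft_update(target: List[float], online: List[float], tau: float = TAU) -> List[float]:
    """Return tau * online + (1 - tau) * target, element by element."""
    if len(target) != len(online):
        raise ValueError("parameter lists must have the same length")
    return [tau * w + (1.0 - tau) * w_t for w_t, w in zip(target, online)]


if __name__ == "__main__":
    target_weights = [0.0, 0.0]
    online_weights = [1.0, -2.0]
    # Repeated small updates move the target slowly toward the online weights.
    for _ in range(1000):
        target_weights = soft_update(target_weights, online_weights)
    print(target_weights)
```

With TAU = 0.001 the target closes roughly 63% of the gap to fixed online weights after 1000 updates, which illustrates how slowly the target networks track the learned ones.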
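16.9.4 Saving the Model
Every 10 episodes, the trained model weights and the optimizer states are saved to the specified directory. This code runs inside the episode loop. The code is as follows.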
    # Save a checkpoint every 10 episodes
    if i_episode % 10 == 0:
        for i in range(NUM_PRED):
            torch.save(predators[i].actor_model.state_dict(), f"./models2/actor_model_{i}.dict")
            torch.save(predators[i].critic_model.state_dict(), f"./models2/critic_model{i}.dict")
            torch.save(predators[i].actor_optimizer.state_dict(), f"./models2/actor_optimizer_{i}.dict")
            torch.save(predators[i].critic_optimizer.state_dict(), f"./models2/critic_optimizer_{i}.dict")
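The same four file names appear in both the loading and the saving code, so one option is to build them in a single place. The sketch below is only an illustration: checkpoint_paths and should_save are hypothetical helpers, and the file names are copied from this section (including critic_model{i} without an underscore, so that existing checkpoints still match).

```python
from pathlib import Path
from typing import Dict


def checkpoint_paths(agent_index: int, root: str = "./models2") -> Dict[str, Path]:
    """Build the four checkpoint paths for one predator (hypothetical helper)."""
    base = Path(root)
    return {
        "actor_model": base / f"actor_model_{agent_index}.dict",
        # No underscore before the index: matches the name used in this section.
        "critic_model": base / f"critic_model{agent_index}.dict",
        "actor_optimizer": base / f"actor_optimizer_{agent_index}.dict",
        "critic_optimizer": base / f"critic_optimizer_{agent_index}.dict",
    }


def should_save(i_episode: int, every: int = 10) -> bool:
    """True on episodes 0, 10, 20, ... (the same condition as the loop)."""
    return i_episode % every == 0
```

Note that torch.save does not create missing directories, so creating the target directory once before training, for example with Path("./models2").mkdir(parents=True, exist_ok=True), avoids a failure on the first save.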
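16.9.5 Visualizing the Training Results
The matplotlib library plots how the episode duration, loss, and reward change during training. Each panel shows a short moving average ("Variation") and a moving average over a window three times as long ("Trend"). The code is as follows.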
fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(16, 4))
ax[0].plot(pd.Series(episode_durations, dtype=float).rolling(window=MAV_episode_duration_window).mean(), label="Variation")
ax[0].plot(pd.Series(episode_durations, dtype=float).rolling(window=MAV_episode_duration_window * 3).mean(), label="Trend")
ax[0].set_xlabel('Episode')
ax[0].set_ylabel('Duration')
ax[0].set_title('EPISODE DURATIONS')
ax[1].plot(pd.Series(fin_losses, dtype=float).rolling(window=MAV_loss_window).mean(), label="Variation")
ax[1].plot(pd.Series(fin_losses, dtype=float).rolling(window=MAV_loss_window * 3).mean(), label="Trend")
ax[1].set_xlabel('Episode')
ax[1].set_ylabel('Loss')
ax[1].set_title('LOSS')
ax[2].plot(pd.Series(fin_rewards, dtype=float).rolling(window=MAV_rewards_window).mean(), label="Variation")
ax[2].plot(pd.Series(fin_rewards, dtype=float).rolling(window=MAV_rewards_window * 3).mean(), label="Trend")
ax[2].set_xlabel('Episode')
ax[2].set_ylabel('Reward')
ax[2].set_title('AVG REWARD')
fig.subplots_adjust(wspace=0.4)
for axis in ax.flat:
    axis.legend()
plt.savefig('figure.png', dpi=400)
plt.show()
Running the code displays several plots of the training process; Figure 16-2 shows only two of them.
Figure 16-2 Visualization of the training process
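The "Variation" and "Trend" curves in the plotting code are trailing moving averages with windows of 5 and 15 episodes. The pure-Python sketch below reproduces what pandas' rolling(window=w).mean() computes with its default settings, including the NaN values at the first w - 1 positions, which is why both curves start a few episodes late. The helper rolling_mean is hypothetical and not part of the project.

```python
import math
from typing import List, Sequence


def rolling_mean(values: Sequence[float], window: int) -> List[float]:
    """Trailing moving average; positions without a full window are NaN."""
    if window < 1:
        raise ValueError("window must be at least 1")
    out: List[float] = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            # Drop the value that just left the window.
            running -= values[i - window]
        out.append(running / window if i >= window - 1 else math.nan)
    return out


if __name__ == "__main__":
    durations = [100, 100, 80, 60, 40, 20]
    print(rolling_mean(durations, 3))
```

A longer window smooths more but also delays the curve, so the "Trend" line reacts later than the "Variation" line to a real change in performance.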