(16-6) Multi-Agent Predator-Prey Game in Practice: Training the Model

16.9  Training the Model

Next, we implement the training procedure for the reinforcement learning agents. The predator agents are trained with the Deep Deterministic Policy Gradient (DDPG) algorithm to solve a multi-agent cooperation problem, while the prey agents simply choose random actions. The goal of training is for the predators to learn to capture the prey in the environment and to maximize their cumulative reward. During training, several metrics are monitored and recorded so that the training algorithm can be analyzed and improved later.

16.9.1  Environment Initialization

First, create a PredatorPreyEnv environment to simulate the training scenario. It contains several predators and prey, plus parameters such as the field of vision and the observation history length. The implementation code is shown below.

from itertools import count

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch

# PredatorPreyEnv, DDPGAgent and RandomAgent are the classes implemented in the
# previous sections of this project.

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Enable interactive plotting when running inside a notebook
is_ipython = 'inline' in matplotlib.get_backend()
if is_ipython:
    from IPython import display
plt.ion()

# Hyperparameters
TAU = 0.001          # soft-update rate for the target networks
EPS_START = 0.9      # initial exploration rate
EPS_END = 0.05       # final exploration rate
EPS_DECAY = 500      # decay constant (in environment steps)
ACTOR_LR = 1e-4
CRITIC_LR = 1e-2
GAMMA = 0.99         # discount factor
NUM_EPISODES = 1000
NUM_PRED = 5         # number of predator agents
NUM_PREY = 1         # number of prey agents
BATCH_SIZE = 8
VISION = 7           # do not change
SIZE = 13            # grid size
LOAD = False         # set True to resume from saved checkpoints
HISTORY = 4          # do not change
COMMUNICATION_BIT = 10
env = PredatorPreyEnv(  # Use PredatorPreyEnv2 for immortal prey
    render_mode='none',
    size=SIZE,
    predator=NUM_PRED,
    prey=NUM_PREY,
    episode_length=100,
    img_mode=True,
    vision=VISION,
    history_length=HISTORY,
    communication_bits=COMMUNICATION_BIT,
    success_reward=10,
    living_reward=-0.5,
    error_reward=0,
    cooperate=1
)
state, _ = env.reset()
n_actions = env.single_action_space.n
n_observations = 4  # considering the single-agent observation space
episode_durations = []
steps_done = 0

losses = []
steps = []
rewards = []
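Before building the agents, it can help to sanity-check the shapes that the training loop below relies on. The snippet is purely illustrative and rests on an assumption about the observation layout (state is a history of frames, each a dict with "predator" and "prey" entries, and each per-predator frame has 4 channels); this is how the training loop in Section 16.9.3 indexes it.

# Illustrative sanity check (assumes the observation layout used by the training
# loop in Section 16.9.3; the loop also tolerates state being None).
print("Discrete actions per agent:", n_actions)
if state is not None:
    print("Frames in the observation history:", len(state))      # expected: HISTORY
    frame0 = np.asarray(state[0]["predator"][0])
    print("Single predator frame shape:", frame0.shape)           # expected: (4, VISION, VISION)
    # Stacking the HISTORY frames gives (HISTORY*4, VISION, VISION), which matches
    # the input shape passed to DDPGAgent in the next subsection.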

16.9.2  Creating the Agents

Next, create the predator and prey agents separately: the class DDPGAgent is used to build the predator agents (predators), and the class RandomAgent is used to build the prey agents (preys). The implementation code is shown below.

# One DDPG agent per predator; each receives a stacked observation of shape
# (HISTORY*4, VISION, VISION)
predators = [DDPGAgent((HISTORY*4, VISION, VISION),
                       n_actions,
                       num_agents=NUM_PRED,
                       idd=i,
                       gamma=GAMMA,
                       tau=TAU,
                       actor_lr=ACTOR_LR,
                       critic_lr=CRITIC_LR,
                       communication_bits=COMMUNICATION_BIT,
                       batch_size=BATCH_SIZE,
                       buffer_size=int(1e6)
                       ) for i in range(NUM_PRED)]

# The prey are random agents: they ignore the state and pick random actions
preys = [RandomAgent(VISION*VISION*3,
                     n_actions,
                     fix_pos=False) for i in range(NUM_PREY)]

In the code above, RandomAgent is a random agent: at every time step it picks an action at random, without relying on the environment state or on any learning algorithm. Here it plays the role of the prey, simulating an entity with no learning ability that simply acts randomly.
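As a reminder of the interface the training loop relies on, the following is a minimal sketch of what such a random agent could look like. It is not the project's RandomAgent from the earlier section, only an illustration; the constructor arguments, the second return value of select_action and the no-op update methods are assumptions inferred from how the class is called below.

import random

class RandomAgentSketch:
    """Illustrative stand-in for the RandomAgent used above (hypothetical)."""

    def __init__(self, n_observations, n_actions, fix_pos=False):
        self.n_actions = n_actions
        self.fix_pos = fix_pos  # assumed flag: keep the prey at a fixed position

    def select_action(self, state, epsilon):
        # Ignores both the state and epsilon: always a uniformly random action.
        # The second return value is a placeholder, mirroring the
        # "action_i, _ = preys[i].select_action(...)" unpacking in the loop.
        return random.randrange(self.n_actions), None

    def update_model(self):
        # Nothing to learn, so the update methods are no-ops (matching how the
        # training loop calls them on the prey agents).
        return None

    def update_target_model(self):
        pass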

16.9.3  Training Loop

Training iterates over NUM_EPISODES episodes; within each episode, every step proceeds as follows:

  1. Select actions: each predator agent chooses an action based on the current state, while the prey agents choose actions at random.
  2. Execute actions: the predator and prey actions are applied to the environment, which returns the next state, the rewards and the done flag.
  3. Store experience: each agent's transition is pushed into the experience replay buffer for later training.
  4. Update the models: the predator agents update their actor and critic networks, together with the corresponding target networks.
  5. Decay exploration: the exploration rate (epsilon) is gradually reduced according to the number of steps taken so far; a small worked example of this schedule follows the list.
  6. Monitor and record: losses, rewards and episode durations are logged for later analysis and visualization.
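To make step 5 concrete, the snippet below (an illustration only, using the constants defined in Section 16.9.1) evaluates the exponential schedule epsilon = EPS_END + (EPS_START - EPS_END) * exp(-steps_done / EPS_DECAY) at a few step counts:

import numpy as np

EPS_START, EPS_END, EPS_DECAY = 0.9, 0.05, 500

for steps_done in (0, 250, 500, 1000, 2000, 5000):
    eps = EPS_END + (EPS_START - EPS_END) * np.exp(-1. * steps_done / EPS_DECAY)
    print(f"steps_done={steps_done:5d}  epsilon={eps:.3f}")
# Roughly: 0.900, 0.566, 0.363, 0.165, 0.066, 0.050 -> exploration fades towards EPS_END

Because the schedule is driven by the global step counter steps_done rather than by episodes, exploration is largely exhausted after a few thousand environment steps.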

The full implementation is shown below.

fin_losses = []
episode_durations = []
fin_rewards = []
MAV_loss_window = 5
MAV_episode_duration_window = 5
MAV_rewards_window = 5
epsilon_predator = EPS_START
epsilon_prey = EPS_START

# Optionally resume training from saved checkpoints
if LOAD:
    for i in range(NUM_PRED):
        predators[i].load_model(f"./models2/actor_model_{i}.dict", f"./models2/critic_model{i}.dict")
        predators[i].load_optim(f"./models2/actor_optimizer_{i}.dict", f"./models2/critic_optimizer_{i}.dict")

for i_episode in range(NUM_EPISODES):
    # Initialize the environment and get its state
    state, info = env.reset()
    losses = []
    episode_reward = 0
    for i_step in count():
        # Select and perform an action
        predator_actions = []
        prey_actions = []
        pred_communication = [0 for _ in range(NUM_PRED)]
        prey_communication = [0 for _ in range(NUM_PREY)]
        if state is None:
            # No observation available yet: step with default actions
            pred_actions = [0 for _ in range(NUM_PRED)]
            prey_actions = [0 for _ in range(NUM_PREY)]
            next_state, reward, done, info = env.step(pred_actions,
                                                      prey_actions,
                                                      pred_communication=pred_communication,
                                                      prey_communication=prey_communication)
            steps_done += 1
            # Move to the next state
            state = next_state
            if done:
                episode_durations.append(i_step + 1)
                break
            continue

        # CHOOSE ACTION
        for i in range(NUM_PRED):
            # Stack the HISTORY most recent frames of predator i into one tensor
            state_i = [torch.tensor(s["predator"][i], dtype=torch.float32, device=device) for s in state]
            state_i = torch.cat(state_i, dim=0).unsqueeze(0)  # shape: (1, HISTORY*4, VISION, VISION)
            action_i, action_choice, msg_i = predators[i].select_action(state_i, epsilon_predator)
            predator_actions.append(action_choice)
            pred_communication[i] = msg_i
        for i in range(NUM_PREY):
            state_i = [torch.tensor(s["prey"][i], dtype=torch.float32, device=device) for s in state]
            action_i, _ = preys[i].select_action(state_i, epsilon_prey)
            prey_actions.append(action_i)

        # TAKE ACTION IN ENVIRONMENT
        next_state, reward, done, info = env.step(predator_actions,
                                                  prey_actions,
                                                  pred_communication=pred_communication)
        # Discounted mean predator reward accumulated over the episode
        episode_reward += sum(reward["predator"])/len(reward["predator"])*GAMMA**i_step

        # Store the transition in memory
        total_state = []
        total_next_state = []
        reward_total = torch.tensor(reward["predator"], dtype=torch.float32, device=device)
        action_total = torch.tensor(predator_actions, dtype=torch.float32, device=device)
        pred_msg_total = torch.tensor(pred_communication, dtype=torch.float32, device=device)
        done_total = torch.tensor(done, dtype=torch.float32, device=device)

        for i in range(NUM_PRED):
            state_i = [torch.tensor(s["predator"][i], dtype=torch.float32, device=device) for s in state]
            state_i = torch.cat(state_i, dim=0)

            next_state_i = [torch.tensor(s["predator"][i], dtype=torch.float32, device=device) for s in next_state]
            next_state_i = torch.cat(next_state_i, dim=0)
            total_state.append(state_i)
            total_next_state.append(next_state_i)
        # Joint (all-predator) observations, shared by every agent's replay buffer
        total_state = torch.cat(total_state, dim=0)
        total_next_state = torch.cat(total_next_state, dim=0)

        for i in range(NUM_PRED):
            predators[i].replay_buffer.push(total_state, action_total, pred_msg_total, reward_total, total_next_state, done_total)

        # Update the Actor-Critic networks
        for i in range(NUM_PRED):
            loss = predators[i].update_model()
            if loss is not None:
                losses.append(loss)
            predators[i].update_target_model()
        for i in range(NUM_PREY):
            preys[i].update_model()       # no-ops for the random prey
            preys[i].update_target_model()

        # Decay epsilon
        epsilon_predator = EPS_END + (EPS_START - EPS_END) * \
            np.exp(-1. * steps_done / EPS_DECAY)
        epsilon_prey = EPS_END + (EPS_START - EPS_END) * \
            np.exp(-1. * steps_done / EPS_DECAY)
        steps_done += 1

        state = next_state
        if done:
            episode_durations.append(i_step + 1)
            break
    LOSS = sum(losses)/(len(losses)+1)  # +1 avoids division by zero when no update was made
    if i_episode > 3:                   # skip the first few warm-up episodes
        fin_losses.append(LOSS)
        fin_rewards.append(episode_reward)
    print(f"Episode {i_episode} finished after {i_step+1} steps with Loss {LOSS:.3f}, REWARD {episode_reward:.3f}", end='\r')

16.9.4  Saving the Models

Every few episodes (here, every 10), the trained model weights and optimizer states are saved to the specified directory. This code is still inside the episode loop; the implementation is shown below.

    # Checkpoint every 10 episodes
    if i_episode % 10 == 0:
        for i in range(NUM_PRED):
            torch.save(predators[i].actor_model.state_dict(), f"./models2/actor_model_{i}.dict")
            torch.save(predators[i].critic_model.state_dict(), f"./models2/critic_model{i}.dict")
            torch.save(predators[i].actor_optimizer.state_dict(), f"./models2/actor_optimizer_{i}.dict")
            torch.save(predators[i].critic_optimizer.state_dict(), f"./models2/critic_optimizer_{i}.dict")
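One practical detail not shown in the original listing: torch.save does not create missing directories, so the ./models2/ directory must exist before the first checkpoint is written. A small guard placed before the training loop (a suggestion, not part of the original code) avoids this:

import os

# Create the checkpoint directory used by torch.save / load_model above,
# if it does not exist yet.
os.makedirs("./models2", exist_ok=True)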

16.9.5  Visualizing the Training Results

The matplotlib library is used to plot how the episode duration, the loss and the reward evolve during training. This plotting code runs inside the same checkpoint block, so the figures are refreshed every 10 episodes. The implementation code is shown below.

        fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(16, 4))
        ax[0].plot(pd.Series(episode_durations, dtype=float).rolling(window=MAV_episode_duration_window).mean(), label="Variation")
        ax[0].plot(pd.Series(episode_durations, dtype=float).rolling(window=MAV_episode_duration_window * 3).mean(), label="Trend")
        ax[0].set_xlabel('Episode')
        ax[0].set_ylabel('Duration')
        ax[0].set_title('EPISODE DURATIONS')
        ax[1].plot(pd.Series(fin_losses, dtype=float).rolling(window=MAV_loss_window).mean(), label="Variation")
        ax[1].plot(pd.Series(fin_losses, dtype=float).rolling(window=MAV_loss_window * 3).mean(), label="Trend")
        ax[1].set_xlabel('Episode')
        ax[1].set_ylabel('Loss')
        ax[1].set_title('LOSS')
        ax[2].plot(pd.Series(fin_rewards, dtype=float).rolling(window=MAV_rewards_window).mean(), label="Variation")
        ax[2].plot(pd.Series(fin_rewards, dtype=float).rolling(window=MAV_rewards_window * 3).mean(), label="Trend")
        ax[2].set_xlabel('Episode')
        ax[2].set_ylabel('Reward')
        ax[2].set_title('AVG REWARD')
        fig.subplots_adjust(wspace=0.4)
        [plot.legend() for plot in ax.flat]
        plt.savefig('figure.png', dpi=400)
        plt.show()

When the script runs, it displays several visualizations of the training process; Figure 16-2 shows two of them.

Figure 16-2  Visualizations of the training process

This project, Multi-Agent Predator-Prey Game Based on Reinforcement Learning, is now complete. The full series:

(16-1) Multi-Agent Reinforcement Learning in Practice: The Predator-Prey Game (1)

(16-2) Multi-Agent Reinforcement Learning in Practice: The Predator-Prey Game (2)

(16-3) Multi-Agent Reinforcement Learning in Practice: The Predator-Prey Game (3)

(16-4) Multi-Agent Predator-Prey Game in Practice: Implementing the Random Agent

(16-5) Multi-Agent Predator-Prey Game in Practice: Implementing DDPG

(16-6) Multi-Agent Predator-Prey Game in Practice: Training the Model
