【OpenAI】Project Notes on MADDPG and the Multiagent-Envs Environment

A project I did quite a while ago.

Building on OpenAI's maddpg code and its multi-agent particle environment, I modified both to run a pursuit-evasion game with multi-agent reinforcement learning and to compare the learned policies against a traditional (analytical) strategy. The changes are recorded below.

1. Changes to the Multiagent-Envs environment

  1. The project mainly uses the simple_tag environment.
  2. It is set up with 1 good_agent and 1 adversary.
  3. The agents' motion window spans -30 to +30.
  4. The adversary's size is 0.25 and the good_agent's size is 0.2 (matching the make_world code in section 1.3).
  5. Both masses are set to 1.0, the damping is 0.5, and the capture distance is 0.5 (a minimal sketch collecting these settings follows this list).
  6. Landmarks and communication are not needed, so all code involving landmarks and communication channels is removed.
  7. The dynamics of the pursuer and the evader are replaced with the model from the paper "A Dynamics Perspective of Pursuit-Evasion Capturing and Escaping When The Pursuer Runs Faster Than The Agile Evader".
  8. The pursuit and evasion strategies proposed in that paper are added, so they can be compared against the strategies learned by the neural networks.
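
A minimal sketch of where the settings above end up (the helper name apply_settings is illustrative only; attribute names follow the multiagent-particle-envs conventions, and the real changes are spread over core.py, environment.py and simple_tag.py as described below):

    # sketch only, not project code: collects the parameter choices listed above
    def apply_settings(world):
        world.damping = 0.5                       # damping coefficient
        for agent in world.agents:
            agent.size = 0.25 if agent.adversary else 0.2
            agent.initial_mass = 1.0
        capture_distance = 0.5                    # "captured" below this distance
        window_bounds = (-30, +30)                # set_bounds(-30, +30, -30, +30) in render
        return capture_distance, window_bounds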

1.1 Changes to core.py

Besides deleting some unused classes, the main changes are to the agents' motion model.

  1. Comment out code that is not needed

    # update state of the world
    def step(self): # called from the step function in environment.py
        # set actions for scripted agents 
        for agent in self.scripted_agents:
            agent.action = agent.action_callback(agent, self)
        # gather forces applied to entities
        p_force = [None] * len(self.entities)
        # apply agent physical controls
        p_force = self.apply_action_force(p_force)
        # apply environment forces
        # p_force = self.apply_environment_force(p_force)
        self.integrate_state(p_force)
        # print('p_force:', p_force)
    
  2. Change to the applied force: noise is not needed

    # gather agent action forces
    def apply_action_force(self, p_force):
        # set applied forces
        for i,agent in enumerate(self.agents):
            if agent.movable:
                noise = np.random.randn(*agent.action.u.shape) * agent.u_noise if agent.u_noise else 0.0
                p_force[i] = agent.action.u # + noise  # action.u has already been scaled by accel in environment.py line 190
                # the line below would multiply by accel a second time on top of action.u * accel
                # p_force[i] = (agent.mass * agent.accel if agent.accel is not None else agent.mass) * agent.action.u #+ noise
                # print(f"force{i}", p_force[i])
                # print(f"force", p_force)
        return p_force
    
  3. Change to the velocity update
    The motion model is modified to follow the paper's equations (the figures are omitted here). The model is a damped point mass, m * dv/dt = f(t) - c * v; the f(t) term, i.e. f_d(t) for the pursuer or f_e(t) for the evader, is the pursuit or evasion strategy, while the remaining terms are the motion model implemented in core.py. It is mainly reflected in this line of code (a standalone numerical sketch of this update appears after this list):

    entity.state.p_vel += (p_force[i] / entity.mass -entity.state.p_vel * self.damping / entity.mass) * self.dt
    

    Since the mass is set to 1.0, the complete function below, which does not divide the damping term by the mass, computes the same update:

    # integrate physical state
    def integrate_state(self, p_force):
        for i,entity in enumerate(self.entities):
            if not entity.movable: continue # only movable entities are integrated
            # entity.state.p_vel = entity.state.p_vel * (1 - self.damping)  # old update: velocity * (1 - damping)
            # print('v',entity.state.p_vel)
            if (p_force[i] is not None):
                entity.state.p_vel += (p_force[i] / entity.mass -entity.state.p_vel * self.damping) * self.dt 
                # entity.state.p_vel += (p_force[i] / entity.mass ) * self.dt # Newton's second law: v = v0 + f/m * t
                # print(f'p_force{i}',p_force[i])
                # print('ss',p_force[i] / entity.mass -entity.state.p_vel * self.damping)
            # instantaneous acceleration
            # a_tmp= p_force[i] / entity.mass - entity.state.p_vel * self.damping
            # print('a%d'%i,a_tmp)
            # print('p_force',p_force)
            entity.state.p_pos += entity.state.p_vel * self.dt
    
  4. Add a border-wall class

    # border (wall) entity
    class Border(Entity):
        def __init__(self):
            super(Border, self).__init__()
            self.pos = None
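
A standalone numerical sketch (not project code) of the modified update in integrate_state: with the project's values m = 1.0 and c = 0.5, and dt = 0.1 assumed as the environment default, a constant force drives the velocity towards f / c.

    import numpy as np

    # explicit Euler step of  m * dv/dt = f(t) - c * v(t)
    mass, damping, dt = 1.0, 0.5, 0.1
    vel = np.zeros(2)
    pos = np.zeros(2)
    force = np.array([4.0, 0.0])          # constant force along +x

    for _ in range(200):
        vel += (force / mass - vel * damping / mass) * dt
        pos += vel * dt

    print(vel)   # approaches f / c = [8, 0]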
    

1.2 Changes to environment.py

Besides deleting some unused classes, the main changes are to the environment parameters.

  1. In the step function, add code (around line 107) that makes done_n return True when the distance between the two agents drops below the capture range

    # when the distance between the two agents is below the capture range, done_n == True
    delta_pos = np.abs(self.agents[0].state.p_pos - self.agents[1].state.p_pos)
    dist = np.sqrt(np.sum(np.square(delta_pos)))
    # print('dist',dist)
    capture_distance=self.agents[0].size + self.agents[1].size
    capture_distance=0.5 # overridden with the fixed capture distance
    if dist< capture_distance:
        done_n.append(True)
    # if any(done_n) == True:
    #     print("done_n:", done_n)
    
  2. In the discrete action-space case, i.e.:

    self.discrete_action_space = True
    

    the computation of agent.action.u in the _set_action function (around line 167) has to be changed so that it matches the paper's model (the full chain from discrete action to force is sketched after this list):

    if self.discrete_action_space:
        # agent.action.u[0] += action[0][1] - action[0][2]
        # agent.action.u[1] += action[0][3] - action[0][4]
        # chh: changed; the paper's model does not accumulate the action
        agent.action.u[0] = action[0][1] - action[0][2]
        agent.action.u[1] = action[0][3] - action[0][4]
    
    
  3. Change the size of the display window (around line 250)
    In the render function, change:

    self.viewers[i] = rendering.Viewer(800,800) # enlarge the display window
    

    The display window is set to 800 × 800.

  4. Change the size of the motion window (around line 293)
    In the render function, change:

     self.viewers[i].set_bounds(-30,+30,-30,+30)
    

    The agents' motion window becomes -30 to +30: the window is centred on the origin and shows coordinates from -30 to 30.
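
A minimal sketch of the full chain from a discrete 5-dimensional policy output to the force used in core.py, based on _set_action in environment.py (the scaling by agent.accel is the "environment.py line 190" mentioned in section 1.1); the helper name action_to_force is illustrative only:

    import numpy as np

    def action_to_force(action_5d, accel):
        # action_5d = [noop, +x, -x, +y, -y]
        u = np.zeros(2)
        u[0] = action_5d[1] - action_5d[2]   # modified rule: assignment instead of +=
        u[1] = action_5d[3] - action_5d[4]
        return u * accel                     # action.u is then scaled by agent.accel in _set_action

    # pursuer (accel = 4) pushing right:
    print(action_to_force(np.array([0.0, 1.0, 0.0, 0.0, 0.0]), accel=4))   # [4. 0.]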

1.3 Changes to simple_tag.py

  1. In make_world in the Scenario class, change the numbers of good_agents and adversaries, and change their acceleration, size, and initial mass.

    def make_world(self):
        world = World()
        # set any world properties first
        num_good_agents = 1
        num_adversaries = 1
        num_agents = num_adversaries + num_good_agents

        # add agents
        world.agents = [Agent() for i in range(num_agents)]
        for i, agent in enumerate(world.agents): # per-agent settings
            agent.name = 'agent %d' % i
            agent.collide = True
            agent.adversary = True if i < num_adversaries else False
            agent.size = 0.25 if agent.adversary else 0.2
            # agent.accel = 3 if agent.adversary else 5 # with these accelerations, after the good agent escapes, both agents end up parked at some position
            # acceleration, see environment.py line 188
            agent.accel = 4 if agent.adversary else 2.4
            # agent.max_speed = 1.0 if agent.adversary else 1.0
            agent.initial_mass = 1.0 if agent.adversary else 1.0 # chh 10.19: the mass difference matters a lot; the red ball always catches the small one

        # make initial conditions
        self.reset_world(world)
        return world
    
  2. Add border walls
    In make_world in the Scenario class, add:

            # add borders
            world.borders = [Border() for i in range(num_borders)]
            for i, border in enumerate(world.borders):
                border.name = 'border %d' % i
                border.collide = True
                border.movable = False
                border.size = 0.15  # border size
                border.boundary = True
                # border.shape controls the wall thickness
                border.shape = [[-0.05, -0.05], [0.05, -0.05],
                                [0.05, 0.05], [-0.05, 0.05]]
    
  3. In reset_world in the Scenario class, change the agents' colours, initial positions, and initial velocities

    def reset_world(self, world):
        # random properties for agents
        for i, agent in enumerate(world.agents): # agent colours
            agent.color = np.array([0.35, 0.85, 0.35]) if not agent.adversary else np.array([0.85, 0.35, 0.35])
            # agent.state.p_pos = np.asarray([0.0, 0.0]) if agent.adversary else np.random.uniform(-0.5, +0.5, world.dim_p)   # initial position; [x, y] = [0, 0] is the centre of the view
            agent.state.p_pos = np.asarray([0.0, -4.0]) if agent.adversary else np.asarray([0.0, 0.0])   # initial position; [x, y] = [0, 0] is the centre of the view
            # agent.state.p_pos = np.random.uniform(-10, +10, world.dim_p)
            agent.state.p_vel = np.zeros(world.dim_p)  # initial velocity
    
  4. In the collision-check function (around line 58), change the collision distance to 0.5

    def is_collision(self, agent1, agent2):
        delta_pos = agent1.state.p_pos - agent2.state.p_pos
        dist = np.sqrt(np.sum(np.square(delta_pos)))
        dist_min = agent1.size + agent2.size
        # return True if dist < dist_min else False
        return True if dist < 0.5 else False
    
  5. In the observation function, change the values returned as obs_n (around line 139); the resulting observation layout is sketched after this list

    # returns a 1x8 vector
    def observation(self, agent, world):

        other_pos = []
        other_vel = []
        for other in world.agents:
            if other is agent: continue

            other_pos.append(other.state.p_pos - agent.state.p_pos) # relative position
            # other_pos.append(other.state.p_pos ) # chh: changed for the default (hand-written) strategy, absolute position
            other_vel.append(other.state.p_vel) # absolute velocity
            # other_vel.append(other.state.p_vel - agent.state.p_vel) # relative velocity

            # if not other.adversary: # i.e. if other.adversary == False:
                # other_vel.append(other.state.p_vel)
        #         print('agent.state.p_vel:',agent.state.p_vel)  # 1*2
        #         print('other_vel:',other_vel) # 1*2
        #         print('*'*30)
        #     print('other_vel2 :',other_vel ) # 1*2
        # print('agent.state.p_vel:',agent.state.p_vel)  # 1*2
        # print("other_pos:",other_pos)  # 1*2
        # print("entity_pos:",entity_pos) # empty
        # print("agent.state.p_pos:",agent.state.p_pos) # 1*2
        
        return np.concatenate([agent.state.p_pos] + other_pos + [agent.state.p_vel] + other_vel)
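
The resulting 8-dimensional observation per agent, with the relative-position / absolute-velocity variant shown above, is laid out as follows:

    # obs[0:2]  own absolute position
    # obs[2:4]  other agent's position relative to self (other - self)
    # obs[4:6]  own absolute velocity
    # obs[6:8]  other agent's absolute velocity

If the commented-out absolute-position line is enabled instead, obs[2:4] holds the other agent's absolute position; the hand-written strategies in section 2.2 index the observation that way.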
    

2. Changes to the maddpg code

The MADDPG algorithm itself is largely unchanged; the main additions are saving data to .mat files and implementing the pursuit and evasion strategies from the paper (so they can be compared with the neural networks).

2.1 The neural network

The network is built in the mlp_model function: three fully connected layers in the discrete-action setting. In the continuous-action setting the same three-layer MLP did not train successfully.
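
For reference, the mlp_model in the original openai/maddpg train.py is roughly this three-layer MLP:

    import tensorflow as tf
    import tensorflow.contrib.layers as layers

    def mlp_model(input, num_outputs, scope, reuse=False, num_units=64, rnn_cell=None):
        # maps an observation (or observation-action pair) to num_outputs values
        with tf.variable_scope(scope, reuse=reuse):
            out = input
            out = layers.fully_connected(out, num_outputs=num_units, activation_fn=tf.nn.relu)
            out = layers.fully_connected(out, num_outputs=num_units, activation_fn=tf.nn.relu)
            out = layers.fully_connected(out, num_outputs=num_outputs, activation_fn=None)
            return out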

2.2 Adding the paper's pursuit and evasion strategies in the train function, by overwriting the actions produced by the networks

The strategies are added inside the function's while True loop, immediately after the line

action_n = [agent.action(obs) for agent, obs in zip(trainers, obs_n)]

(around line 171). Note that the index arithmetic below treats obs_n[i][0:2] as an agent's own absolute position and obs_n[i][2:4] as the other agent's absolute position, i.e. it assumes the absolute-position variant of the observation function from section 1.3. A standalone check of the rotation trick used for the evader follows this code block.

# pursuer (adversary)
delta_pos_a = [obs_n[0][2]-obs_n[0][0], obs_n[0][3]-obs_n[0][1]]
distance_a = np.sqrt(np.sum(np.square(delta_pos_a)))
d_t = delta_pos_a / distance_a # the unitary relative-positional vector
## ----------------
# discrete case
# action_n[0][1] = d_t[0]
# action_n[0][3] = d_t[1]
# action_n[0][0], action_n[0][2], action_n[0][4] = 0, 0, 0
## -----------------
# continuous case
action_n[0][0] = d_t[0]
action_n[0][1] = d_t[1]
## ----------------

# # # print('delta_pos_a',distance_a)
# # print('d_t',d_t)
# # print('adv',obs_n[0][0:2])
# # print('eva',obs_n[0][2:4])
#
# evader
delta_pos_b = [obs_n[1][0]-obs_n[1][2], obs_n[1][1]-obs_n[1][3]]
distance_b = np.sqrt(np.sum(np.square(delta_pos_b)))
d_t = np.array(delta_pos_b / distance_b)
if distance_b > 2.4: # the two agents are farther apart than the set threshold
    fe_t = d_t
else:
    R_l = np.array([[0, -1], [1, 0]])  # rotates a vector 90 degrees counter-clockwise
    R_r = np.array([[0, 1], [-1, 0]])  # rotates a vector 90 degrees clockwise
    x_ = np.array([obs_n[1][4], obs_n[1][5]]) # the evader's own velocity
    # print(x_)
    neiji = np.dot(d_t, x_) # inner product
    # print('d_t',d_t)
    # print('x_',x_)
    # print('neiji',neiji)
    mo_x_ = np.sqrt(np.sum(np.square(x_)))  # norm of the velocity
    mo_d_t = np.sqrt(np.sum(np.square(d_t)))  # norm of d(t)
    # print(mo_x_)
    cos = neiji / (mo_x_*mo_d_t)
    # print('cos,', cos) # cos ranges from -1 to 1
    x__ = np.append(x_, np.array([0])) # pad the velocity to a 1x3 vector so the cross product can be used to judge the direction of the velocity relative to d(t)
    d_t_ = np.append(d_t, np.array([0])) # pad d(t) to a 1x3 vector
    # print('d_t_',d_t_)
    chaji1 = np.cross(x__, d_t_)
    chaji2 = np.cross(d_t_, x__)
    # print('chaji1',chaji1)
    # print('chaji2',chaji2)
    # note: chaji1.any() > 0 is True whenever the cross product has a non-zero
    # component, i.e. whenever the velocity and d(t) are not parallel
    if chaji1.any() > 0 or mo_x_ == 0 or cos == 1:
        fe_t = R_l * d_t
        fe_t = np.array([fe_t[0][1], fe_t[1][0]]) # equals R_l @ d_t (see the check after this block)
        # print('dt',d_t)
        # print('fet',fe_t)
    else:
        fe_t = R_r * d_t
        fe_t = np.array([fe_t[0][1], fe_t[1][0]]) # equals R_r @ d_t
## ===================
# discrete case
# action_n[1][1] = fe_t[0]
# action_n[1][3] = fe_t[1]
# action_n[1][0], action_n[1][2], action_n[1][4] = 0, 0, 0
# continuous case
action_n[1][0] = fe_t[0]
action_n[1][1] = fe_t[1]

2.3 Saving data: positions, velocities, actions, step counts, and rewards

step = [i for i in range(arglist.max_episode_len+1)]  # line 161
step_episode.append(rew_n[0]) # store the adversary's reward, line 258
action_save.append(action_n) # store the action, line 282

Note: the saved positions are all absolute positions.
If the observation function in simple_tag.py returns relative positions and velocities, they have to be converted back before saving (around line 260):

# when the observation uses relative position / relative velocity -------------------------
# import copy
# obs_n_ = copy.copy(obs_n)
# p_x = obs_n_[0][0]+obs_n_[0][2]
# p_y = obs_n_[0][1]+obs_n_[0][3]
# obs_n_[0][2] = p_x
# obs_n_[0][3] = p_y
# position_.append(obs_n_[0][0:4])

# v_x = obs_n_[0][4]+obs_n_[0][6]
# v_y = obs_n_[0][5]+obs_n_[0][7]
# obs_n_[0][6] = v_x
# obs_n_[0][7] = v_y
# volocity.append(obs_n_[0][4:8])
#----------------------------

When the observation uses absolute positions and velocities, the values can be saved directly:

# when the observation uses absolute position / absolute velocity
position_.append(obs_n[0][0:4])
volocity.append(obs_n[0][4:8])

In the if done or terminal: branch (around line 288), add the data-saving code:

if done or terminal:
    if done:
        print('*'*20)
        print('done:',episode_step)
    # #
    # sio.savemat(file_folder_name + '/network_vs_network-a15_13.7.mat',{'step': step, 'position': position_, 'volocity': volocity, 'action_save': action_save})
    print('save !!!')
    # break # exit after saving
    episode_reward.append(step_episode) # append this episode's 400 per-step rewards to the list

In the if len(episode_rewards) > arglist.num_episodes: branch (around line 386), add the code that saves all the rewards:

sio.savemat(file_folder_name + '/rewards.mat', {'episode_reward': episode_reward})
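
A minimal sketch of reading the saved file back for plotting (assuming every episode has the same number of steps, so that loadmat returns episode_reward as a 2-D array):

    import scipy.io as sio
    import matplotlib.pyplot as plt

    data = sio.loadmat(file_folder_name + '/rewards.mat')
    episode_reward = data['episode_reward']      # one row of per-step adversary rewards per episode
    plt.plot(episode_reward.sum(axis=1))         # total adversary reward per episode
    plt.xlabel('episode')
    plt.ylabel('return')
    plt.show()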

Since this project was done quite a while ago, I found while tidying things up today that I no longer know where I put the full code. All that is left is the environment code I modified from OpenAI's particle environment while serving as a TA for a Machine Learning course, so I will just put up that GitHub repository link.
