【OpenAI】Project Notes on MADDPG and the Multiagent-Envs Environment

A project I did quite a while ago.

Building on OpenAI's maddpg code and its multi-agent particle environment, I modified both to run a pursuit-evasion game with multi-agent reinforcement learning and to compare the learned policies against a traditional (analytical) strategy. The changes are recorded below.

1. Changes to the Multiagent-Envs environment

  1. The project mainly uses the simple_tag environment.
  2. It is set up with 1 good_agent and 1 adversary.
  3. The agents' motion window spans -30 to +30.
  4. The adversary's size is 0.25 and the good_agent's size is 0.2 (matching the make_world code in section 1.3).
  5. Both masses are set to 1.0, the damping is 0.5, and the capture distance is 0.5 (a minimal sketch collecting these settings follows this list).
  6. Landmarks and communication are not needed, so all code involving landmarks and communication channels is removed.
  7. The dynamics of the pursuer and the evader are replaced with the model from the paper "A Dynamics Perspective of Pursuit-Evasion Capturing and Escaping When The Pursuer Runs Faster Than The Agile Evader".
  8. The pursuit and evasion strategies proposed in that paper are added, so they can be compared against the strategies learned by the neural networks.
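
A minimal sketch of where the settings above end up (the helper name apply_settings is illustrative only; attribute names follow the multiagent-particle-envs conventions, and the real changes are spread over core.py, environment.py and simple_tag.py as described below):

    # sketch only, not project code: collects the parameter choices listed above
    def apply_settings(world):
        world.damping = 0.5                       # damping coefficient
        for agent in world.agents:
            agent.size = 0.25 if agent.adversary else 0.2
            agent.initial_mass = 1.0
        capture_distance = 0.5                    # "captured" below this distance
        window_bounds = (-30, +30)                # set_bounds(-30, +30, -30, +30) in render
        return capture_distance, window_bounds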

1.1 Changes to core.py

Besides deleting some unused classes, the main changes are to the agents' motion model.

  1. Comment out code that is not needed

    # update state of the world
    def step(self): # called from the step function in environment.py
        # set actions for scripted agents 
        for agent in self.scripted_agents:
            agent.action = agent.action_callback(agent, self)
        # gather forces applied to entities
        p_force = [None] * len(self.entities)
        # apply agent physical controls
        p_force = self.apply_action_force(p_force)
        # apply environment forces
        # p_force = self.apply_environment_force(p_force)
        self.integrate_state(p_force)
        # print('p_force:', p_force)
    
  2. Change to the applied force: noise is not needed

    # gather agent action forces
    def apply_action_force(self, p_force):
        # set applied forces
        for i,agent in enumerate(self.agents):
            if agent.movable:
                noise = np.random.randn(*agent.action.u.shape) * agent.u_noise if agent.u_noise else 0.0
                p_force[i] = agent.action.u # + noise  # action.u has already been scaled by accel in environment.py line 190
                # the line below would multiply by accel a second time on top of action.u * accel
                # p_force[i] = (agent.mass * agent.accel if agent.accel is not None else agent.mass) * agent.action.u #+ noise
                # print(f"force{i}", p_force[i])
                # print(f"force", p_force)
        return p_force
    
  3. Change to the velocity update
    The motion model is modified to follow the paper's equations (the figures are omitted here). The model is a damped point mass, m * dv/dt = f(t) - c * v; the f(t) term, i.e. f_d(t) for the pursuer or f_e(t) for the evader, is the pursuit or evasion strategy, while the remaining terms are the motion model implemented in core.py. It is mainly reflected in this line of code (a standalone numerical sketch of this update appears after this list):

    entity.state.p_vel += (p_force[i] / entity.mass -entity.state.p_vel * self.damping / entity.mass) * self.dt
    

    Since the mass is set to 1.0, the complete function below, which does not divide the damping term by the mass, computes the same update:

    # integrate physical state
    def integrate_state(self, p_force):
        for i,entity in enumerate(self.entities):
            if not entity.movable: continue # only movable entities are integrated
            # entity.state.p_vel = entity.state.p_vel * (1 - self.damping)  # old update: velocity * (1 - damping)
            # print('v',entity.state.p_vel)
            if (p_force[i] is not None):
                entity.state.p_vel += (p_force[i] / entity.mass -entity.state.p_vel * self.damping) * self.dt 
                # entity.state.p_vel += (p_force[i] / entity.mass ) * self.dt # Newton's second law: v = v0 + f/m * t
                # print(f'p_force{i}',p_force[i])
                # print('ss',p_force[i] / entity.mass -entity.state.p_vel * self.damping)
            # instantaneous acceleration
            # a_tmp= p_force[i] / entity.mass - entity.state.p_vel * self.damping
            # print('a%d'%i,a_tmp)
            # print('p_force',p_force)
            entity.state.p_pos += entity.state.p_vel * self.dt
    
  4. Add a border-wall class

    # border (wall) entity
    class Border(Entity):
        def __init__(self):
            super(Border, self).__init__()
            self.pos = None
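
A standalone numerical sketch (not project code) of the modified update in integrate_state: with the project's values m = 1.0 and c = 0.5, and dt = 0.1 assumed as the environment default, a constant force drives the velocity towards f / c.

    import numpy as np

    # explicit Euler step of  m * dv/dt = f(t) - c * v(t)
    mass, damping, dt = 1.0, 0.5, 0.1
    vel = np.zeros(2)
    pos = np.zeros(2)
    force = np.array([4.0, 0.0])          # constant force along +x

    for _ in range(200):
        vel += (force / mass - vel * damping / mass) * dt
        pos += vel * dt

    print(vel)   # approaches f / c = [8, 0]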
    

1.2 Changes to environment.py

Besides deleting some unused classes, the main changes are to the environment parameters.

  1. In the step function, add code (around line 107) that makes done_n return True when the distance between the two agents drops below the capture range

    # when the distance between the two agents is below the capture range, done_n == True
    delta_pos = np.abs(self.agents[0].state.p_pos - self.agents[1].state.p_pos)
    dist = np.sqrt(np.sum(np.square(delta_pos)))
    # print('dist',dist)
    capture_distance=self.agents[0].size + self.agents[1].size
    capture_distance=0.5 # overridden with the fixed capture distance
    if dist< capture_distance:
        done_n.append(True)
    # if any(done_n) == True:
    #     print("done_n:", done_n)
    
  2. In the discrete action-space case, i.e.:

    self.discrete_action_space = True
    

    the computation of agent.action.u in the _set_action function (around line 167) has to be changed so that it matches the paper's model (the full chain from discrete action to force is sketched after this list):

    if self.discrete_action_space:
        # agent.action.u[0] += action[0][1] - action[0][2]
        # agent.action.u[1] += action[0][3] - action[0][4]
        # chh: changed; the paper's model does not accumulate the action
        agent.action.u[0] = action[0][1] - action[0][2]
        agent.action.u[1] = action[0][3] - action[0][4]
    
    
  3. Change the size of the display window (around line 250)
    In the render function, change:

    self.viewers[i] = rendering.Viewer(800,800) # enlarge the display window
    

    The display window is set to 800 × 800.

  4. Change the size of the motion window (around line 293)
    In the render function, change:

     self.viewers[i].set_bounds(-30,+30,-30,+30)
    

    The agents' motion window becomes -30 to +30: the window is centred on the origin and shows coordinates from -30 to 30.
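
A minimal sketch of the full chain from a discrete 5-dimensional policy output to the force used in core.py, based on _set_action in environment.py (the scaling by agent.accel is the "environment.py line 190" mentioned in section 1.1); the helper name action_to_force is illustrative only:

    import numpy as np

    def action_to_force(action_5d, accel):
        # action_5d = [noop, +x, -x, +y, -y]
        u = np.zeros(2)
        u[0] = action_5d[1] - action_5d[2]   # modified rule: assignment instead of +=
        u[1] = action_5d[3] - action_5d[4]
        return u * accel                     # action.u is then scaled by agent.accel in _set_action

    # pursuer (accel = 4) pushing right:
    print(action_to_force(np.array([0.0, 1.0, 0.0, 0.0, 0.0]), accel=4))   # [4. 0.]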

1.3 Changes to simple_tag.py

  1. In make_world in the Scenario class, change the numbers of good_agents and adversaries, and change their acceleration, size, and initial mass.

    def make_world(self):
        world = World()
        # set any world properties first
        num_good_agents = 1
        num_adversaries = 1
        num_agents = num_adversaries + num_good_agents

        # add agents
        world.agents = [Agent() for i in range(num_agents)]
        for i, agent in enumerate(world.agents): # per-agent settings
            agent.name = 'agent %d' % i
            agent.collide = True
            agent.adversary = True if i < num_adversaries else False
            agent.size = 0.25 if agent.adversary else 0.2
            # agent.accel = 3 if agent.adversary else 5 # with these accelerations, after the good agent escapes, both agents end up parked at some position
            # acceleration, see environment.py line 188
            agent.accel = 4 if agent.adversary else 2.4
            # agent.max_speed = 1.0 if agent.adversary else 1.0
            agent.initial_mass = 1.0 if agent.adversary else 1.0 # chh 10.19: the mass difference matters a lot; the red ball always catches the small one

        # make initial conditions
        self.reset_world(world)
        return world
    
  2. Add border walls
    In make_world in the Scenario class, add:

            # add borders
            world.borders = [Border() for i in range(num_borders)]
            for i, border in enumerate(world.borders):
                border.name = 'border %d' % i
                border.collide = True
                border.movable = False
                border.size = 0.15  # border size
                border.boundary = True
                # border.shape controls the wall thickness
                border.shape = [[-0.05, -0.05], [0.05, -0.05],
                                [0.05, 0.05], [-0.05, 0.05]]
    
  3. In reset_world in the Scenario class, change the agents' colours, initial positions, and initial velocities

    def reset_world(self, world):
        # random properties for agents
        for i, agent in enumerate(world.agents): # agent colours
            agent.color = np.array([0.35, 0.85, 0.35]) if not agent.adversary else np.array([0.85, 0.35, 0.35])
            # agent.state.p_pos = np.asarray([0.0, 0.0]) if agent.adversary else np.random.uniform(-0.5, +0.5, world.dim_p)   # initial position; [x, y] = [0, 0] is the centre of the view
            agent.state.p_pos = np.asarray([0.0, -4.0]) if agent.adversary else np.asarray([0.0, 0.0])   # initial position; [x, y] = [0, 0] is the centre of the view
            # agent.state.p_pos = np.random.uniform(-10, +10, world.dim_p)
            agent.state.p_vel = np.zeros(world.dim_p)  # initial velocity
    
  4. In the collision-check function (around line 58), change the collision distance to 0.5

    def is_collision(self, agent1, agent2):
        delta_pos = agent1.state.p_pos - agent2.state.p_pos
        dist = np.sqrt(np.sum(np.square(delta_pos)))
        dist_min = agent1.size + agent2.size
        # return True if dist < dist_min else False
        return True if dist < 0.5 else False
    
  5. In the observation function, change the values returned as obs_n (around line 139); the resulting observation layout is sketched after this list

    # returns a 1x8 vector
    def observation(self, agent, world):

        other_pos = []
        other_vel = []
        for other in world.agents:
            if other is agent: continue

            other_pos.append(other.state.p_pos - agent.state.p_pos) # relative position
            # other_pos.append(other.state.p_pos ) # chh: changed for the default (hand-written) strategy, absolute position
            other_vel.append(other.state.p_vel) # absolute velocity
            # other_vel.append(other.state.p_vel - agent.state.p_vel) # relative velocity

            # if not other.adversary: # i.e. if other.adversary == False:
                # other_vel.append(other.state.p_vel)
        #         print('agent.state.p_vel:',agent.state.p_vel)  # 1*2
        #         print('other_vel:',other_vel) # 1*2
        #         print('*'*30)
        #     print('other_vel2 :',other_vel ) # 1*2
        # print('agent.state.p_vel:',agent.state.p_vel)  # 1*2
        # print("other_pos:",other_pos)  # 1*2
        # print("entity_pos:",entity_pos) # empty
        # print("agent.state.p_pos:",agent.state.p_pos) # 1*2
        
        return np.concatenate([agent.state.p_pos] + other_pos + [agent.state.p_vel] + other_vel)
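
The resulting 8-dimensional observation per agent, with the relative-position / absolute-velocity variant shown above, is laid out as follows:

    # obs[0:2]  own absolute position
    # obs[2:4]  other agent's position relative to self (other - self)
    # obs[4:6]  own absolute velocity
    # obs[6:8]  other agent's absolute velocity

If the commented-out absolute-position line is enabled instead, obs[2:4] holds the other agent's absolute position; the hand-written strategies in section 2.2 index the observation that way.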
    

2. Changes to the maddpg code

The MADDPG algorithm itself is largely unchanged; the main additions are saving data to .mat files and implementing the pursuit and evasion strategies from the paper (so they can be compared with the neural networks).

2.1 The neural network

The network is built in the mlp_model function: three fully connected layers in the discrete-action setting. In the continuous-action setting the same three-layer MLP did not train successfully.
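
For reference, the mlp_model in the original openai/maddpg train.py is roughly this three-layer MLP:

    import tensorflow as tf
    import tensorflow.contrib.layers as layers

    def mlp_model(input, num_outputs, scope, reuse=False, num_units=64, rnn_cell=None):
        # maps an observation (or observation-action pair) to num_outputs values
        with tf.variable_scope(scope, reuse=reuse):
            out = input
            out = layers.fully_connected(out, num_outputs=num_units, activation_fn=tf.nn.relu)
            out = layers.fully_connected(out, num_outputs=num_units, activation_fn=tf.nn.relu)
            out = layers.fully_connected(out, num_outputs=num_outputs, activation_fn=None)
            return out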

2.2 Adding the paper's pursuit and evasion strategies in the train function, by overwriting the actions produced by the networks

The strategies are added inside the function's while True loop, immediately after the line

action_n = [agent.action(obs) for agent, obs in zip(trainers, obs_n)]

(around line 171). Note that the index arithmetic below treats obs_n[i][0:2] as an agent's own absolute position and obs_n[i][2:4] as the other agent's absolute position, i.e. it assumes the absolute-position variant of the observation function from section 1.3. A standalone check of the rotation trick used for the evader follows this code block.

# pursuer (adversary)
delta_pos_a = [obs_n[0][2]-obs_n[0][0], obs_n[0][3]-obs_n[0][1]]
distance_a = np.sqrt(np.sum(np.square(delta_pos_a)))
d_t = delta_pos_a / distance_a # the unitary relative-positional vector
## ----------------
# discrete case
# action_n[0][1] = d_t[0]
# action_n[0][3] = d_t[1]
# action_n[0][0], action_n[0][2], action_n[0][4] = 0, 0, 0
## -----------------
# continuous case
action_n[0][0] = d_t[0]
action_n[0][1] = d_t[1]
## ----------------

# # # print('delta_pos_a',distance_a)
# # print('d_t',d_t)
# # print('adv',obs_n[0][0:2])
# # print('eva',obs_n[0][2:4])
#
# evader
delta_pos_b = [obs_n[1][0]-obs_n[1][2], obs_n[1][1]-obs_n[1][3]]
distance_b = np.sqrt(np.sum(np.square(delta_pos_b)))
d_t = np.array(delta_pos_b / distance_b)
if distance_b > 2.4: # the two agents are farther apart than the set threshold
    fe_t = d_t
else:
    R_l = np.array([[0, -1], [1, 0]])  # rotates a vector 90 degrees counter-clockwise
    R_r = np.array([[0, 1], [-1, 0]])  # rotates a vector 90 degrees clockwise
    x_ = np.array([obs_n[1][4], obs_n[1][5]]) # the evader's own velocity
    # print(x_)
    neiji = np.dot(d_t, x_) # inner product
    # print('d_t',d_t)
    # print('x_',x_)
    # print('neiji',neiji)
    mo_x_ = np.sqrt(np.sum(np.square(x_)))  # norm of the velocity
    mo_d_t = np.sqrt(np.sum(np.square(d_t)))  # norm of d(t)
    # print(mo_x_)
    cos = neiji / (mo_x_*mo_d_t)
    # print('cos,', cos) # cos ranges from -1 to 1
    x__ = np.append(x_, np.array([0])) # pad the velocity to a 1x3 vector so the cross product can be used to judge the direction of the velocity relative to d(t)
    d_t_ = np.append(d_t, np.array([0])) # pad d(t) to a 1x3 vector
    # print('d_t_',d_t_)
    chaji1 = np.cross(x__, d_t_)
    chaji2 = np.cross(d_t_, x__)
    # print('chaji1',chaji1)
    # print('chaji2',chaji2)
    # note: chaji1.any() > 0 is True whenever the cross product has a non-zero
    # component, i.e. whenever the velocity and d(t) are not parallel
    if chaji1.any() > 0 or mo_x_ == 0 or cos == 1:
        fe_t = R_l * d_t
        fe_t = np.array([fe_t[0][1], fe_t[1][0]]) # equals R_l @ d_t (see the check after this block)
        # print('dt',d_t)
        # print('fet',fe_t)
    else:
        fe_t = R_r * d_t
        fe_t = np.array([fe_t[0][1], fe_t[1][0]]) # equals R_r @ d_t
## ===================
# discrete case
# action_n[1][1] = fe_t[0]
# action_n[1][3] = fe_t[1]
# action_n[1][0], action_n[1][2], action_n[1][4] = 0, 0, 0
# continuous case
action_n[1][0] = fe_t[0]
action_n[1][1] = fe_t[1]

2.3 Saving data: positions, velocities, actions, step counts, and rewards

step = [i for i in range(arglist.max_episode_len+1)]  # line 161
step_episode.append(rew_n[0]) # store the adversary's reward, line 258
action_save.append(action_n) # store the action, line 282

Note: the saved positions are all absolute positions.
If the observation function in simple_tag.py returns relative positions and velocities, they have to be converted back before saving (around line 260):

# when the observation uses relative position / relative velocity -------------------------
# import copy
# obs_n_ = copy.copy(obs_n)
# p_x = obs_n_[0][0]+obs_n_[0][2]
# p_y = obs_n_[0][1]+obs_n_[0][3]
# obs_n_[0][2] = p_x
# obs_n_[0][3] = p_y
# position_.append(obs_n_[0][0:4])

# v_x = obs_n_[0][4]+obs_n_[0][6]
# v_y = obs_n_[0][5]+obs_n_[0][7]
# obs_n_[0][6] = v_x
# obs_n_[0][7] = v_y
# volocity.append(obs_n_[0][4:8])
#----------------------------

When the observation uses absolute positions and velocities, the values can be saved directly:

# when the observation uses absolute position / absolute velocity
position_.append(obs_n[0][0:4])
volocity.append(obs_n[0][4:8])

In the if done or terminal: branch (around line 288), add the data-saving code:

if done or terminal:
    if done:
        print('*'*20)
        print('done:',episode_step)
    # #
    # sio.savemat(file_folder_name + '/network_vs_network-a15_13.7.mat',{'step': step, 'position': position_, 'volocity': volocity, 'action_save': action_save})
    print('save !!!')
    # break # exit after saving
    episode_reward.append(step_episode) # append this episode's 400 per-step rewards to the list

In the if len(episode_rewards) > arglist.num_episodes: branch (around line 386), add the code that saves all the rewards:

sio.savemat(file_folder_name + '/rewards.mat', {'episode_reward': episode_reward})
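
A minimal sketch of reading the saved file back for plotting (assuming every episode has the same number of steps, so that loadmat returns episode_reward as a 2-D array):

    import scipy.io as sio
    import matplotlib.pyplot as plt

    data = sio.loadmat(file_folder_name + '/rewards.mat')
    episode_reward = data['episode_reward']      # one row of per-step adversary rewards per episode
    plt.plot(episode_reward.sum(axis=1))         # total adversary reward per episode
    plt.xlabel('episode')
    plt.ylabel('return')
    plt.show()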

Since this project was done quite a while ago, I found while tidying things up today that I no longer know where I put the full code. All that is left is the environment code I modified from OpenAI's particle environment while serving as a TA for a Machine Learning course, so I will just put up that GitHub repository link.
