Craft and Solve Multi-Agent Problems Using RLlib and Tensorforce

Multi-agent systems are everywhere, from a flying flock of birds and a wolf pack hunting deer to people driving cars and trading stocks. These real-world cognition problems involve multiple intelligent agents interacting with each other. But what's driving us to study them? Curiosity, perhaps?

If only we could mimic complex group behaviours using Artificial Intelligence.

And that's where Multi-Agent Reinforcement Learning (MARL) comes in. While in a single-agent Reinforcement Learning (RL) setup the state of the environment changes due to the action of one agent, in MARL the state transition occurs based on the joint action of multiple agents. Formally, it is an extension of the classic Markov Decision Process (MDP) framework to include multiple agents and is represented using a Stochastic Game (SG). Typically, a MARL-based multi-agent system is designed with the following characteristics in mind:

  1. Autonomy: Agents must be at least partially independent and self-aware.

  2. Local View: Decision making is done subject to only local observability of the system.

  3. Decentralization: No agent can control the whole system.

This design is in contrast to a more traditional monolithic problem-solving approach which can be severely limited in terms of scalability and complex decision making.

Okay, enough theory. Let’s get our hands dirty with some code.

A Toy Multi-Agent Problem

Consider five farmers. Each of them wants to withdraw water from a stream to irrigate their fields (grow more corn!). However, the problem is that if an upstream farmer withdraws too much water, there will be water scarcity for the farmers situated downstream. Let's now try to frame a MARL problem that nudges the farmers to cooperate rather than be greedy.

Observation Space

Each agent observes the amount of water flowing in the stream in a particular month. For simplicity, we randomly choose the water flowing in the stream in a given month to be between 200 and 800 volume units. (Note that here we allow agents to have global observability because the observation is quite simple.)

Action Space

Each agent chooses a proportion of water to withdraw from the stream. This proportion is between 0 and 1. Note that action selection happens at the start of every episode. You can think of it as an announcement made by every farmer at the start of the episode. (A bit unrealistic, but hey, this is a toy example!)

This month, 380 metric litres of water will flow in the stream. Hmm, I think 110 metric litres is enough for my corn field. I declare that I will withdraw 30% of the water from the stream this month.

What if the actual water in the stream is less than the water demanded (because upstream agents already withdrew it)? In that case, the farmer withdraws all that is left in the stream and a penalty is imposed on the system (described in the next section).
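Concretely, the observation and action spaces described above translate into the following gym space definitions, which are the same ones used inside the RLlib environment later (shown here only as a preview):

import gym

# Water flowing in the stream this month, in volume units
observation_space = gym.spaces.Box(low=200, high=800, shape=(1,))

# Fraction of the observed water each farmer declares they will withdraw
action_space = gym.spaces.Box(low=0, high=1, shape=(1,))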

Reward

More water means better irrigation, but water beyond a limit can damage crops. Moreover, there is a minimum water requirement for every farmer. Let's first put a simple bound on the water withdrawn w for every agent, 0 < w < 200. We then define the reward from crop yield as a quadratic function of the water withdrawn, which is positive for w between 0 and 200.

R(w) = -w² + 200w

Further, we define a penalty proportional to the amount of water deficit for every agent.

water deficit = water demanded - water withdrawn

Penalty = 100 * (water deficit)

To promote cooperation, we give a global reward to all the agents, equal to the sum of the individual rewards and penalties across every agent.
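As a sanity check, here is a minimal sketch of the per-agent reward logic described above. The function name and structure are only illustrative; the actual environment implementation follows in the next section.

def agent_reward(water_demanded, water_available):
    # Withdraw whatever is actually left in the stream
    water_withdrawn = min(water_demanded, water_available)
    water_deficit = water_demanded - water_withdrawn

    # Quadratic crop-yield reward, positive for 0 < w < 200
    crop_reward = -water_withdrawn**2 + 200 * water_withdrawn

    # Penalty proportional to the unmet demand
    penalty = 100 * water_deficit
    return crop_reward - penalty

# e.g. a farmer demanding 150 units when only 100 remain:
# agent_reward(150, 100) == -(100**2) + 200*100 - 100*50 == 5000

In the environment below, these per-agent terms are accumulated sequentially from upstream to downstream, and the total is handed to every agent as the shared global reward.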

Implementation using RLlib

First, we need to implement a custom environment. In RLlib, this is done by inheriting from the MultiAgentEnv class. A brief outline of the custom environment design is as follows:

  1. The custom environment must define a reset and a step function. Add other helper functions if you want to.

  2. The reset function returns a dictionary of observations with keys being agent ids. Agent ids could be anything that is unique for the agents (duh!) but must be consistent across all functions in the environment definition.

  3. The step function takes as input a dictionary of actions with keys being agent ids (same as above). The function must return dictionaries of observations, rewards, dones (booleans indicating whether each agent's episode is terminated) and any additional info. Again, the keys for all these dictionaries are agent ids.

  4. The dones dictionary has an additional key __all__ which must be True only when all agents have completed the episode.

  5. Lastly, a powerful but rather confusing design choice. Not all agents need to be present in the game at every time step. This necessitates that:

  • The action_dict passed to the step function always contains actions for the observations returned in the previous timestep.

  • The observations returned at any timestep need not be for the same agents for which actions were received.

  • The keys of the observations, rewards, dones and info dictionaries must be the same.

This design choice allows decoupling agent actions and rewards, which is useful in many multi-agent scenarios.
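To make that concrete, here is a hypothetical illustration (the values and agent ids are made up) of what a step call could return when only agents 0 and 2 are expected to act in the next timestep:

import numpy as np

# Hypothetical return values from step() when only agents 0 and 2
# are expected to act in the next timestep:
obs  = {0: np.array([350.0]), 2: np.array([350.0])}
rew  = {0: 1200.0, 2: 950.0}
done = {0: False, 2: False, "__all__": False}
info = {0: {}, 2: {}}
# RLlib will then query actions only for agents 0 and 2, and the next
# action_dict passed to step() will contain exactly these keys.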

And now, coming back to our game of happy farmers, the custom environment can be defined along these lines:

import gym
import numpy as np
from ray.rllib.env.multi_agent_env import MultiAgentEnv


class IrrigationEnv(MultiAgentEnv):
    def __init__(self, return_agent_actions=False, part=False):
        self.num_agents = 5
        # Each agent observes the total water flowing in the stream this month
        self.observation_space = gym.spaces.Box(low=200, high=800, shape=(1,))
        # Each agent declares the fraction of water it will withdraw
        self.action_space = gym.spaces.Box(low=0, high=1, shape=(1,))

    def reset(self):
        obs = {}
        self.water = np.random.uniform(200, 800)
        for i in range(self.num_agents):
            obs[i] = np.array([self.water])
        return obs

    def cal_rewards(self, action_dict):
        self.curr_water = self.water
        reward = 0
        # Agents withdraw sequentially, from upstream (0) to downstream (4)
        for i in range(self.num_agents):
            water_demanded = self.water * action_dict[i][0]
            if self.curr_water == 0:
                # No water is left in the stream
                reward -= water_demanded * 100  # Penalty
            elif self.curr_water - water_demanded < 0:
                # Water in stream is less than water demanded, withdraw all that is left
                water_needed = water_demanded - self.curr_water
                water_withdrawn = self.curr_water
                self.curr_water = 0
                reward += -water_withdrawn**2 + 200 * water_withdrawn
                reward -= water_needed * 100  # Penalty
            else:
                # Water in stream is more than water demanded, withdraw water demanded
                self.curr_water -= water_demanded
                water_withdrawn = water_demanded
                reward += -water_withdrawn**2 + 200 * water_withdrawn

        return reward

    def step(self, action_dict):
        obs, rew, done, info = {}, {}, {}, {}

        # Global reward shared by all agents
        reward = self.cal_rewards(action_dict)

        for i in range(self.num_agents):
            obs[i], rew[i], done[i], info[i] = np.array([self.curr_water]), reward, True, {}

        # Single-step episodes: everyone is done after one joint action
        done["__all__"] = True
        return obs, rew, done, info
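Before training, you can sanity-check the environment with random joint actions, for example (a minimal sketch):

env = IrrigationEnv()
obs = env.reset()
# Sample a random proportion for every farmer and take one joint step
actions = {i: env.action_space.sample() for i in range(env.num_agents)}
obs, rew, done, info = env.step(actions)
print(rew)   # the same global reward for every agent
print(done)  # {0: True, ..., 4: True, '__all__': True}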

The Train Driver

RLlib needs some information before starting heavy-duty training. This includes:

  1. Registering the custom environment

from ray.tune.registry import register_env

def env_creator(_):
    return IrrigationEnv()

single_env = IrrigationEnv()
env_name = "IrrigationEnv"
register_env(env_name, env_creator)

2. Defining a mapping between agents and policies

obs_space = single_env.observation_space
act_space = single_env.action_space
num_agents = single_env.num_agents

def gen_policy():
    return (None, obs_space, act_space, {})

policy_graphs = {}
for i in range(num_agents):
    policy_graphs['agent-' + str(i)] = gen_policy()

def policy_mapping_fn(agent_id):
    return 'agent-' + str(agent_id)

3. Hyperparameters and training configuration details (for my humble training setup)

config = {
    "log_level": "WARN",
    "num_workers": 3,
    "num_cpus_for_driver": 1,
    "num_cpus_per_worker": 1,
    "lr": 5e-3,
    "model": {"fcnet_hiddens": [8, 8]},
    "multiagent": {
        "policies": policy_graphs,
        "policy_mapping_fn": policy_mapping_fn,
    },
    "env": "IrrigationEnv",
}

4. Lastly, the training driver code

import ray
from ray import tune

exp_name = 'more_corns_yey'
exp_dict = {
    'name': exp_name,
    'run_or_experiment': 'PG',
    "stop": {
        "training_iteration": 100
    },
    'checkpoint_freq': 20,
    "config": config,
}

ray.init()
tune.run(**exp_dict)

And that's it, the mighty Policy Gradient (PG) algorithm will optimize your system towards socially optimal and cooperative behaviour. You can use Proximal Policy Optimization (PPO) instead of PG to get some improvement in the results. More algorithm choices and hyperparameter details are available in the RLlib docs.
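For example, switching to PPO is just a matter of changing the run_or_experiment entry (a small sketch, assuming the rest of exp_dict stays as defined above; PPO also exposes its own hyperparameters you may want to tune):

exp_dict['run_or_experiment'] = 'PPO'  # use the PPO trainer instead of vanilla PG
tune.run(**exp_dict)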

Implementation using Tensorforce

Unlike RLlib, Tensorforce doesn't natively support multi-agent RL. Why do we want to try it then? Well, from my personal experience, if you wish to implement complex network architectures for the policy function, or need a very efficient training pipeline over multiple clusters, RLlib truly shines. But it seems like overkill when you just want a simple multi-agent system that has somewhat tricky inter-agent interactions. And that's where Tensorforce can be very handy.

Just like RLlib's tune.run, Tensorforce has a similar API. But we aren't going to discuss that. Instead, we will focus on the act-and-observe workflow. It is super flexible and gives you the freedom to decouple agent actions, environment step execution, and internal model updates. Achieving similar autonomy in RLlib is relatively hard. Okay, now back to code snippets.

Firstly, we need an environment. Here, we don't need any particular format. The only requirement is that we should be able to get initial observations when reset is called, and get new observations, rewards, terminal states and any additional info on every environment step. So, for simplicity's sake, let's just reuse the previously defined environment.

Secondly, we need agents. We can create them with the required specifications using the Agent.from_spec method. Let us create 5 of those. (Note that the state_space and action_space specs below mirror the observation and action spaces from the environment definition.)

from tensorforce.agents import Agent

env = IrrigationEnv()
num_agents = env.num_agents

state_space = dict(type='float', shape=(1,))
action_space = dict(type='float', shape=(1,), min_value=0., max_value=1.)

config = dict(
    states=state_space,
    actions=action_space,
    network=[
        dict(type='dense', size=8),
        dict(type='dense', size=8),
    ]
)

agent_list = []
for i in range(num_agents):
    agent_list.append(Agent.from_spec(spec='ppo_agent', kwargs=config))

The agent configuration can be provided in multiple ways. This is a minimal example, but the reader can refer to the Tensorforce docs for more details.

And lastly, we need the training code. I must confess that I have written it in a very messy way, so here is the basic outline of the workflow.

  1. Create a batch of environments

env_batch = []
for i in range(batch_size):
    env_batch.append(IrrigationEnv())

2. Loop over training iterations and in every iteration, reset the batch of environments to obtain initial observations.

for _ in range(training_iterations):
    for b in range(batch_size):
        obs = env_batch[b].reset()

3. Loop over the agent_ids for which observations are returned and call Agent.act on the batch of observations.

    for agent_id in obs:
        actions = agent_list[agent_id].act(states=obs_batch[agent_id])

4. Loop over all the environments in the batch and apply the actions to each environment. We get new observations, rewards, terminal states and additional info.

    for b in range(batch_size):
        new_obs, rew, dones, info = env_batch[b].step(action_batch[b])

5. Lastly, for every agent for which we called Agent.act, call Agent.model.observe with rewards and terminal states to internalize experience trajectories.

    for agent_id in new_obs:
        agent_list[agent_id].model.observe(reward=rew_batch[agent_id], terminal=done_batch[agent_id])

The rest is just some code gymnastics to conveniently access values for every agent and every element in the batch. The complete training code is:

env_batch = []
for i in range(batch_size):
    env_batch.append(IrrigationEnv())

for _ in range(training_iterations):
    obs_batch = {i: [] for i in range(num_agents)}
    rew_batch = {i: [] for i in range(num_agents)}
    done_batch = {i: [] for i in range(num_agents)}
    action_batch = {b: {} for b in range(batch_size)}

    # Reset every environment in the batch and collect initial observations per agent
    for b in range(batch_size):
        obs = env_batch[b].reset()
        for agent_id in range(num_agents):
            obs_batch[agent_id].append(obs[agent_id])

    # Each agent acts on its batch of observations
    for agent_id in obs:
        actions = agent_list[agent_id].act(states=obs_batch[agent_id])
        for b in range(batch_size):
            action_batch[b][agent_id] = actions[b]

    # Apply the joint actions to every environment in the batch
    for b in range(batch_size):
        new_obs, rew, dones, info = env_batch[b].step(action_batch[b])
        for agent_id in obs:
            rew_batch[agent_id].append(rew[agent_id])
            done_batch[agent_id].append(dones[agent_id])

    # Let every agent internalize its batch of rewards and terminal flags
    for agent_id in new_obs:
        agent_list[agent_id].model.observe(reward=rew_batch[agent_id], terminal=done_batch[agent_id])

    print(np.mean(rew_batch[0]))

Now, I must say that this isn't very scalable code. Certainly, some efficiency improvements can be brought in by parallelising the for loops and using threading. However, what I wanted to showcase here is the ease with which one can quickly prototype agent interactions in multi-agent systems. More complex mechanisms such as inter-agent teaching, advising, or multi-agent communication are also straightforward to implement using this workflow.
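As a rough illustration of that threading idea (not part of the original code, and Python's GIL limits the gains for pure-Python environments), the environment stepping loop could be parallelised with the standard concurrent.futures module:

from concurrent.futures import ThreadPoolExecutor

def step_env(b):
    # Step one environment in the batch with its collected joint action
    return env_batch[b].step(action_batch[b])

with ThreadPoolExecutor(max_workers=batch_size) as pool:
    # Each element is a (new_obs, rew, dones, info) tuple for one environment
    results = list(pool.map(step_env, range(batch_size)))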

Conclusion and the mysteries of the future

Tensorforce and RLlib are both remarkable libraries for training RL agents at the moment. However, they suffer from hard-to-read and oftentimes chaotic documentation. There is also a severe lack of examples of advanced use-cases. Moreover, help is limited on the internet as the community is still not very large. I have thus decided to write a series of blogs highlighting how interesting multi-agent systems can be crafted using these libraries. I hope this will be especially useful to those who are stuck for days trying to put into code what is in their minds. Let me know if you have any suggestions, comments or fun ideas! I am also open to collaboration.

Oh and all the code is available in the repository here. Cheerio!

Translated from: https://medium.com/@vermashresth/craft-and-solve-multi-agent-problems-using-rllib-and-tensorforce-a3bd1bb6f556
