RLlib Part 3: Environments

Ray Environments

RLlib works with several different types of environments, including Farama-Foundation Gymnasium, user-defined, multi-agent, and batched environments.

Not all environments work with all algorithms. Check the algorithm overview for more information.

Configuring Environments

You can pass either a string name or a Python class to specify an environment. By default, strings are interpreted as gym environment names. Custom env classes passed directly to the algorithm must take a single env_config parameter in their constructor:

import gymnasium as gym
import ray
from ray.rllib.algorithms import ppo

class MyEnv(gym.Env):
    def __init__(self, env_config):
        self.action_space = <gym.Space>
        self.observation_space = <gym.Space>
    def reset(self, *, seed=None, options=None):
        return <obs>, <info>
    def step(self, action):
        return <obs>, <reward: float>, <terminated: bool>, <truncated: bool>, <info: dict>

ray.init()
algo = ppo.PPO(env=MyEnv, config={
    "env_config": {},  # config to pass to env class
})

while True:
    print(algo.train())

You can also register a custom env creator function with a string name. This function must take a single env_config (dict) parameter and return an env instance:

from ray.tune.registry import register_env

def env_creator(env_config):
    return MyEnv(...)  # return an env instance

register_env("my_env", env_creator)
algo = ppo.PPO(env="my_env")

For a complete runnable code example using the custom env API, see custom_env.py.

The gymnasium registry is not compatible with Ray. Instead, always use the registration flow documented above to ensure Ray workers can access the environment.

In the example above, note that the env_creator function takes an env_config object. This is a dict containing the options passed in through your algorithm. You can also access env_config.worker_index and env_config.vector_index to get the worker id and the env id within the worker (if num_envs_per_worker > 0). This can be useful if you want to train over an ensemble of different environments, for example:

class MultiEnv(gym.Env):
    def __init__(self, env_config):
        # pick actual env based on worker and env indexes
        self.env = gym.make(
            choose_env_for(env_config.worker_index, env_config.vector_index))
        self.action_space = self.env.action_space
        self.observation_space = self.env.observation_space
    def reset(self, *, seed=None, options=None):
        return self.env.reset(seed=seed, options=options)
    def step(self, action):
        return self.env.step(action)

register_env("multienv", lambda config: MultiEnv(config))

When using logging inside an environment, the logging configuration needs to be done inside the environment, which runs inside the Ray workers. Any configuration done outside of the environment, e.g., before starting Ray, is ignored.
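For example, a minimal sketch of setting up Python's standard logging from inside the env constructor (only the logging-related parts are shown; the logger settings are arbitrary choices, not RLlib requirements):

import logging

import gymnasium as gym


class LoggingEnv(gym.Env):
    def __init__(self, env_config):
        # This constructor runs inside each Ray worker process, so configuring
        # logging here takes effect where the env actually runs. Doing this in
        # the driver script before ray.init() would not affect the workers.
        logging.basicConfig(level=logging.INFO)
        self.logger = logging.getLogger(__name__)
        # env_config is an EnvContext when created by RLlib, so worker_index
        # is available (see the section above).
        self.logger.info("Env created on worker %s", env_config.worker_index)
        ...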

Gymnasium

RLlib uses Gymnasium as its environment interface for single-agent training. For more information on how to implement a custom Gymnasium environment, see the gymnasium.Env class definition. You may find the SimpleCorridor example useful as a reference.
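For reference, here is a minimal corridor-style env written against the gymnasium.Env interface (a sketch only; the class name and the corridor_length config key are illustrative, not the actual SimpleCorridor code):

import gymnasium as gym
from gymnasium.spaces import Box, Discrete
import numpy as np


class CorridorEnv(gym.Env):
    """Walk right along a 1D corridor; reaching the end gives +1 reward."""

    def __init__(self, env_config):
        self.end_pos = env_config.get("corridor_length", 10)
        self.cur_pos = 0
        self.action_space = Discrete(2)  # 0 = move left, 1 = move right
        self.observation_space = Box(
            0.0, float(self.end_pos), shape=(1,), dtype=np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.cur_pos = 0
        return np.array([self.cur_pos], dtype=np.float32), {}

    def step(self, action):
        if action == 0 and self.cur_pos > 0:
            self.cur_pos -= 1
        elif action == 1:
            self.cur_pos += 1
        terminated = self.cur_pos >= self.end_pos
        reward = 1.0 if terminated else -0.1
        return np.array([self.cur_pos], dtype=np.float32), reward, terminated, False, {}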

Performance

There are two ways to scale experience collection with Gym environments:

  1. Vectorization within a single process: Though many envs can achieve high frame rates per core, their throughput is limited in practice by policy evaluation between steps. For example, even small TensorFlow models incur a couple of milliseconds of latency to evaluate. This can be worked around by creating multiple envs per process and batching policy evaluations across these envs.
    You can configure {"num_envs_per_worker": M} to have RLlib create M concurrent environments per worker. RLlib auto-vectorizes Gym environments via its VectorEnv API.
  2. Distribution across multiple processes: You can also have RLlib create multiple processes (Ray actors) for experience collection. In most algorithms this can be controlled by setting the {"num_workers": N} config.
    [Figure: RLlib policy evaluation throughput scaling from 1 to 128 CPUs]
    You can also combine vectorization and distributed execution, as shown in the figure above, which plots RLlib policy evaluation throughput from 1 to 128 CPUs: PongNoFrameskip-v4 on GPU scales from 2.4k to 200k actions/s, and Pendulum-v1 on CPU from 15k to 1.5M actions/s. One machine was used for 1-16 workers, and a Ray cluster of four machines for 32-128 workers. Each worker was configured with num_envs_per_worker=64. A minimal config sketch combining both settings follows after this list.
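For example, a minimal sketch combining both settings in the config-dict style used on this page (assuming an env registered under the name "my_env" as shown earlier):

from ray.rllib.algorithms import ppo

# 4 rollout workers (separate Ray actors), each running 8 environments and
# batching policy evaluation across them -> 32 envs collecting experience.
algo = ppo.PPO(env="my_env", config={
    "num_workers": 4,
    "num_envs_per_worker": 8,
})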

Expensive Environments

Some environments may be very resource-intensive to create. RLlib will create num_workers + 1 copies of the environment, since one copy is needed for the driver process. To avoid paying the extra overhead of the driver copy, which is needed to access the env's action and observation spaces, you can defer environment initialization until reset() is called.
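A minimal sketch of this deferred-initialization pattern (load_heavy_simulator and the values it returns are hypothetical placeholders for whatever expensive setup your env performs):

import gymnasium as gym
from gymnasium.spaces import Box, Discrete
import numpy as np


class ExpensiveEnv(gym.Env):
    def __init__(self, env_config):
        # Only define the spaces here; the driver copy never steps the env,
        # so the heavy simulator is never loaded for it.
        self.action_space = Discrete(2)
        self.observation_space = Box(-1.0, 1.0, shape=(4,), dtype=np.float32)
        self.sim = None  # created lazily on first reset()

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        if self.sim is None:
            self.sim = load_heavy_simulator()  # hypothetical expensive setup
        return self.sim.reset(), {}

    def step(self, action):
        obs, reward, done = self.sim.step(action)
        return obs, reward, done, False, {}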

Multi-Agent and Hierarchical Environments

In a multi-agent environment, there is more than one "agent" acting simultaneously, in a turn-based fashion, or in a combination of these two.

For instance, in a traffic simulation, there may be multiple "car" and "traffic light" agents acting simultaneously in the environment, whereas in a board game you may have two or more agents acting in a turn-based fashion.

The mental model for multi-agent in RLlib is as follows:
(1) Your environment (a subclass of MultiAgentEnv) returns dictionaries mapping agent IDs (e.g. strings; the env can choose these arbitrarily) to individual agents' observations, rewards, and done flags.
(2) You define (some of) the policies that are available up front (you can also add new policies on the fly throughout training).
(3) You define a function that maps an env-produced agent ID to any available policy ID, which is then used to compute actions for that particular agent.

This is summarized by the figure below:
[Figure: overview of the multi-agent setup in RLlib]

When implementing your own MultiAgentEnv, note that you should only return those agent IDs in an observation dict for which you expect to receive actions in the next call to step().

This API allows you to implement any type of multi-agent environment, from turn-based games over envs in which all agents always act simultaneously, to anything in between.

Here is an example of an env in which all agents always step simultaneously:

# Env, in which all agents (whose IDs are entirely determined by the env
# itself via the returned multi-agent obs/reward/dones-dicts) step
# simultaneously.
env = MultiAgentTrafficEnv(num_cars=2, num_traffic_lights=1)

# Observations are a dict mapping agent names to their obs. Only those
# agents' names that require actions in the next call to `step()` should
# be present in the returned observation dict (here: all, as we always step
# simultaneously).
print(env.reset())
# ... {
# ...   "car_1": [[...]],
# ...   "car_2": [[...]],
# ...   "traffic_light_1": [[...]],
# ... }

# In the following call to `step`, actions should be provided for each
# agent that returned an observation before:
new_obs, rewards, dones, infos = env.step(
    actions={"car_1": ..., "car_2": ..., "traffic_light_1": ...})

# Similarly, new_obs, rewards, dones, etc. also become dicts.
print(rewards)
# ... {"car_1": 3, "car_2": -1, "traffic_light_1": 0}

# Individual agents can early exit; The entire episode is done when
# dones["__all__"] = True.
print(dones)
# ... {"car_2": True, "__all__": False}

And another example, in which agents act one after the other (turn-based game):

# Env, in which two agents step in sequence (turn-based game).
# The env is in charge of the produced agent ID. Our env here produces
# agent IDs: "player1" and "player2".
env = TicTacToe()

# Observations are a dict mapping agent names to their obs. Only those
# agents' names that require actions in the next call to `step()` should
# be present in the returned observation dict (here: one agent at a time).
print(env.reset())
# ... {
# ...   "player1": [[...]],
# ... }

# In the following call to `step`, only those agents' actions should be
# provided that were present in the returned obs dict:
new_obs, rewards, dones, infos = env.step(actions={"player1": ...})

# Similarly, new_obs, rewards, dones, etc. also become dicts.
# Note that only in the `rewards` dict, any agent may be listed (even those that have
# not(!) acted in the `step()` call). Rewards for individual agents will be added
# up to the point where a new action for that agent is needed. This way, you may
# implement a turn-based 2-player game, in which player-2's reward is published
# in the `rewards` dict immediately after player-1 has acted.
print(rewards)
# ... {"player1": 0, "player2": 0}

# Individual agents can early exit; The entire episode is done when
# dones["__all__"] = True.
print(dones)
# ... {"player1": False, "__all__": False}

# In the next step, it's player2's turn. Therefore, `new_obs` only contains
# this agent's ID:
print(new_obs)
# ... {
# ...   "player2": [[...]]
# ... }

If all the agents will be using the same algorithm class to train their policies, then you can set up multi-agent training as follows:

algo = pg.PGAgent(env="my_multiagent_env", config={
    "multiagent": {
        "policies": {
            # Use the PolicySpec namedtuple to specify an individual policy:
            "car1": PolicySpec(
                policy_class=None,  # infer automatically from Algorithm
                observation_space=None,  # infer automatically from env
                action_space=None,  # infer automatically from env
                config={"gamma": 0.85},  # use main config plus <- this override here
                ),  # alternatively, simply do: `PolicySpec(config={"gamma": 0.85})`

            # Deprecated way: Tuple specifying class, obs-/action-spaces,
            # config-overrides for each policy as a tuple.
            # If class is None -> Uses Algorithm's default policy class.
            "car2": (None, car_obs_space, car_act_space, {"gamma": 0.99}),

            # New way: Use PolicySpec() with keywords: `policy_class`,
            # `observation_space`, `action_space`, `config`.
            "traffic_light": PolicySpec(
                observation_space=tl_obs_space,  # special obs space for lights?
                action_space=tl_act_space,  # special action space for lights?
                ),
        },
        "policy_mapping_fn":
            lambda agent_id, episode, worker, **kwargs:
                "traffic_light"  # Traffic lights are always controlled by this policy
                if agent_id.startswith("traffic_light_")
                else random.choice(["car1", "car2"])  # Randomly choose from car policies
    },
})

while True:
    print(algo.train())

To exclude certain policies in your multiagent.policies dictionary from being trained, you can use the multiagent.policies_to_train setting. For example, you may want to have one or more random (non-learning) policies interact with your learning ones:

# Example for a mapping function that maps agent IDs "player1" and "player2" to either
# "random_policy" or "learning_policy", making sure that in each episode, both policies
# are always playing each other.
def policy_mapping_fn(agent_id, episode, worker, **kwargs):
    agent_idx = int(agent_id[-1])  # 0 (player1) or 1 (player2)
    # agent_id = "player[1|2]" -> policy depends on episode ID
    # This way, we make sure that both policies sometimes play player1
    # (start player) and sometimes player2 (player to move 2nd).
    return "learning_policy" if episode.episode_id % 2 == agent_idx else "random_policy"

algo = pg.PGAgent(env="two_player_game", config={
    "multiagent": {
        "policies": {
            "learning_policy": PolicySpec(),  # <- use default class & infer obs-/act-spaces from env.
            "random_policy": PolicySpec(policy_class=RandomPolicy),  # infer obs-/act-spaces from env.
        },
        # Example for a mapping function that maps agent IDs "player1" and "player2" to either
        # "random_policy" or "learning_policy", making sure that in each episode, both policies
        # are always playing each other.
        "policy_mapping_fn": policy_mapping_fn,
        # Specify a (fixed) list (or set) of policy IDs that should be updated.
        "policies_to_train": ["learning_policy"],

        # Alternatively, you can provide a callable that returns True or False, when provided
        # with a policy ID and an (optional) SampleBatch:

        # "policies_to_train": lambda pid, batch: ... (<- return True or False)

        # This allows you to more flexibly update (or not) policies, based on
        # who they played with in the episode (or other information that can be
        # found in the given batch, e.g. rewards).
    },
})

RLlib creates each of the distinct policies defined above (e.g., the three policies in the traffic example) and routes agent decisions to their bound policy using the given policy_mapping_fn. When an agent first appears in the env, policy_mapping_fn is called to determine which policy it is bound to. RLlib reports separate training statistics for each policy in the return from train(), along with the combined reward.

Here is a simple example training script in which you can vary the number of agents and policies in the environment. For how to use multiple training methods at once (here DQN and PPO), see the two-trainer example. Metrics are reported for each policy separately, for example:

 Result for PPO_multi_cartpole_0:
   episode_len_mean: 34.025862068965516
   episode_reward_max: 159.0
   episode_reward_mean: 86.06896551724138
   info:
     policy_0:
       cur_lr: 4.999999873689376e-05
       entropy: 0.6833480000495911
       kl: 0.010264254175126553
       policy_loss: -11.95590591430664
       total_loss: 197.7039794921875
       vf_explained_var: 0.0010995268821716309
       vf_loss: 209.6578826904297
     policy_1:
       cur_lr: 4.999999873689376e-05
       entropy: 0.6827034950256348
       kl: 0.01119876280426979
       policy_loss: -8.787769317626953
       total_loss: 88.26161193847656
       vf_explained_var: 0.0005457401275634766
       vf_loss: 97.0471420288086
   policy_reward_mean:
     policy_0: 21.194444444444443
     policy_1: 21.798387096774192

To scale to hundreds of agents (if these agents are using the same policy), MultiAgentEnv batches policy evaluations across multiple agents internally. Your MultiAgentEnv is also auto-vectorized (just like a plain single-agent env, e.g. gym.Env) by setting num_envs_per_worker > 1.

PettingZoo Multi-Agent Environments

PettingZoo is a repository of over 50 diverse multi-agent environments. Its API is not directly compatible with RLlib, but it can be converted into an RLlib MultiAgentEnv as shown in this example:

from ray.tune.registry import register_env
# import the pettingzoo environment
from pettingzoo.butterfly import prison_v3
# import rllib pettingzoo interface
from ray.rllib.env import PettingZooEnv
# define how to make the environment. This way takes an optional environment config, num_floors
env_creator = lambda config: prison_v3.env(num_floors=config.get("num_floors", 4))
# register that way to make the environment under an rllib name
register_env('prison', lambda config: PettingZooEnv(env_creator(config)))
# now you can use `prison` as an environment
# you can pass arguments to the environment creator with the env_config option in the config
config['env_config'] = {"num_floors": 5}
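Assuming the registration above has been executed, the registered name can then be passed to an algorithm in the same config-dict style as the earlier examples (a minimal sketch):

from ray.rllib.algorithms import ppo

algo = ppo.PPO(env="prison", config={
    # forwarded to `env_creator` above as `config`
    "env_config": {"num_floors": 5},
})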

Rock-Paper-Scissors Example

The rock_paper_scissors_multiagent.py example demonstrates several policies competing against each other: heuristic policies that repeat the same move or beat the opponent's last move, and learned LSTM and feedforward policies.

TensorBoard output of running the rock-paper-scissors example, in which a learned policy faces off against heuristics that either always choose the same move or beat the opponent's last move. The performance of the heuristic policies against the learned policy is compared with LSTM enabled (blue) and with a plain feedforward policy (red). While the feedforward policy can easily beat the same-move heuristic by simply avoiding the last move taken, it takes the LSTM policy to differentiate between, and consistently beat, both heuristics.

Variable-Sharing Between Policies

With ModelV2, you can put layers in global variables and straightforwardly share those layer objects between models, instead of using variable scopes.

RLlib will create each policy's model in a separate tf.variable_scope. However, variables can still be shared across policies by explicitly entering a globally shared variable scope with tf.VariableScope(reuse=tf.AUTO_REUSE):

with tf.variable_scope(
        tf.VariableScope(tf.AUTO_REUSE, "name_of_global_shared_scope"),
        reuse=tf.AUTO_REUSE,
        auxiliary_name_scope=False):
    <create the shared layers here>

Implementing a Centralized Critic

Here are two ways to implement a centralized critic compatible with the multi-agent API:

Strategy 1: Sharing experiences in the trajectory preprocessor:

The most general way of implementing a centralized critic involves defining the postprocess_fn method of a custom policy. postprocess_fn is called by Policy.postprocess_trajectory, which has full access to the policies and observations of concurrent agents via the other_agent_batches and episode arguments. The critic predictions can then be added to the postprocessed trajectory. Here's an example:

import numpy as np

def postprocess_fn(policy, sample_batch, other_agent_batches, episode):
    agents = ["agent_1", "agent_2", "agent_3"]  # simple example of 3 agents
    global_obs_batch = np.stack(
        [other_agent_batches[agent_id][1]["obs"] for agent_id in agents],
        axis=1)
    # add the global obs and global critic value
    sample_batch["global_obs"] = global_obs_batch
    sample_batch["central_vf"] = self.sess.run(
        self.critic_network, feed_dict={"obs": global_obs_batch})
    return sample_batch

Strategy 2: Sharing observations via an observation function:
Alternatively, you can use an observation function to share observations between agents. In this strategy, each observation includes all global state, and policies use a custom model to ignore state they aren't supposed to "see" when computing actions. The advantage of this approach is that it's very simple and you don't have to change the algorithm at all -- just use the observation function (i.e., like an env wrapper) and a custom model. However, it is a bit less principled in that you have to change the agent observation spaces to include training-time only information. You can find a runnable example of this strategy at examples/centralized_critic_2.py.

Grouping Agents

It is common to have groups of agents in multi-agent RL. RLlib treats an agent group as a single agent with a Tuple action and observation space. The grouped agent can then either be assigned to a single policy for centralized execution, or to specialized multi-agent policies that implement centralized training but decentralized execution. You can use the MultiAgentEnv.with_agent_groups() method to define these groups:
For environments with multiple groups, or mixtures of agent groups and individual agents, you can use grouping in conjunction with the policy mapping API described in the previous sections.

    def with_agent_groups(
        self,
        groups: Dict[str, List[AgentID]],
        obs_space: gym.Space = None,
        act_space: gym.Space = None) -> "MultiAgentEnv":
        """Convenience method for grouping together agents in this env.

        An agent group is a list of agent IDs that are mapped to a single
        logical agent. All agents of the group must act at the same time in the
        environment. The grouped agent exposes Tuple action and observation
        spaces that are the concatenated action and obs spaces of the
        individual agents.

        The rewards of all the agents in a group are summed. The individual
        agent rewards are available under the "individual_rewards" key of the
        group info return.

        Agent grouping is required to leverage algorithms such as Q-Mix.

        Args:
            groups: Mapping from group id to a list of the agent ids
                of group members. If an agent id is not present in any group
                value, it will be left ungrouped. The group id becomes a new agent ID
                in the final environment.
            obs_space: Optional observation space for the grouped
                env. Must be a tuple space. If not provided, will infer this to be a
                Tuple of n individual agents spaces (n=num agents in a group).
            act_space: Optional action space for the grouped env.
                Must be a tuple space. If not provided, will infer this to be a Tuple
                of n individual agents spaces (n=num agents in a group).

        .. testcode::
            :skipif: True

            from ray.rllib.env.multi_agent_env import MultiAgentEnv
            class MyMultiAgentEnv(MultiAgentEnv):
                # define your env here
                ...
            env = MyMultiAgentEnv(...)
            grouped_env = env.with_agent_groups({
              "group1": ["agent1", "agent2", "agent3"],
              "group2": ["agent4", "agent5"],
            })

        """

        from ray.rllib.env.wrappers.group_agents_wrapper import \
            GroupAgentsWrapper
        return GroupAgentsWrapper(self, groups, obs_space, act_space)
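For instance, a short usage sketch with explicit Tuple spaces (obs_space and act_space here stand for the per-agent observation and action spaces of your env; they are not defined by the snippet above):

from gymnasium.spaces import Tuple

# "agent_1" and "agent_2" now appear to RLlib as one logical agent "group_1"
# with concatenated (Tuple) observation and action spaces; their rewards are
# summed, as described in the docstring above.
grouped_env = env.with_agent_groups(
    groups={"group_1": ["agent_1", "agent_2"]},
    obs_space=Tuple([obs_space, obs_space]),
    act_space=Tuple([act_space, act_space]),
)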

Hierarchical Environments

Hierarchical training can sometimes be implemented as a special case of multi-agent RL. For example, consider a three-level hierarchy of policies, where a top-level policy issues high-level actions that are executed at finer timescales by mid-level and low-level policies. The following timeline shows one step of the top-level policy, which corresponds to two mid-level actions and five low-level actions:

top_level ---------------------------------------------------------------> top_level --->
mid_level_0 -------------------------------> mid_level_0 ----------------> mid_level_1 ->
low_level_0 -> low_level_0 -> low_level_0 -> low_level_1 -> low_level_1 -> low_level_2 ->

This can be implemented as a multi-agent environment with three types of agents. Each higher-level action creates a new lower-level agent instance with a new ID (e.g., low_level_0, low_level_1, low_level_2 in the example above). These lower-level agents pop in and out of existence at the start of higher-level steps and terminate when their higher-level action ends. Their experiences are aggregated per policy, so from RLlib's perspective it is simply optimizing three different types of policies. The configuration might look something like this:

"multiagent": {
    "policies": {
        "top_level": (custom_policy or None, ...),
        "mid_level": (custom_policy or None, ...),
        "low_level": (custom_policy or None, ...),
    },
    "policy_mapping_fn":
        lambda agent_id:
            "low_level" if agent_id.startswith("low_level_") else
            "mid_level" if agent_id.startswith("mid_level_") else "top_level"
    "policies_to_train": ["top_level"],
},

In this setup, the appropriate rewards for training lower-level agents must be provided by the multi-agent env implementation. The environment class is also responsible for routing between agents, e.g., conveying goals from higher-level agents to lower-level agents as part of the lower-level agent observation.
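A rough sketch of how the env side of this might look (greatly simplified to two levels; the goal handling and spawning logic are illustrative only), assuming the dones-based multi-agent API shown earlier:

from ray.rllib.env.multi_agent_env import MultiAgentEnv


class HierarchicalEnv(MultiAgentEnv):
    """Simplified two-level sketch (top-level + dynamically spawned low-level agents)."""

    def reset(self, *, seed=None, options=None):
        self.num_low_level_agents = 0
        self.current_goal = None
        # Only the top-level agent acts first.
        return {"top_level": self._top_level_obs()}

    def step(self, action_dict):
        obs, rewards, dones = {}, {}, {"__all__": False}
        if "top_level" in action_dict:
            # A high-level action becomes the goal for a freshly spawned
            # low-level agent with a new, unique ID.
            self.current_goal = action_dict["top_level"]
            low_id = f"low_level_{self.num_low_level_agents}"
            self.num_low_level_agents += 1
            obs[low_id] = self._low_level_obs(self.current_goal)
        else:
            # A low-level agent acted: advance the simulation, reward it for
            # progress toward its goal, and either keep it alive or mark it
            # done and hand control (and an obs) back to the top-level agent.
            ...
        return obs, rewards, dones, {}

    # _top_level_obs() and _low_level_obs() are hypothetical helpers that
    # build the respective observations (e.g., embedding the current goal).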
