PettingZoo Multi-Agent Game Environments: Custom Environment Example, Annotated Code Walkthrough

Original code for this article
PettingZoo on GitHub

Custom environment (note: this is an AEC environment, not a parallel environment):
Example_Custom_Environment.py

'''
This is a carefully commented version of the PettingZoo rock paper scissors environment.
https://pettingzoo.farama.org/content/environment_creation/
'''
import functools

import gymnasium
import numpy as np
from gymnasium.spaces import Discrete

from pettingzoo import AECEnv
from pettingzoo.utils import agent_selector, wrappers

ROCK = 0
PAPER = 1
SCISSORS = 2
NONE = 3
MOVES = ["ROCK", "PAPER", "SCISSORS", "None"]
NUM_ITERS = 5  # maximum number of time steps per episode (the original example uses 100)
REWARD_MAP = {
    (ROCK, ROCK): (0, 0),
    (ROCK, PAPER): (-1, 1),
    (ROCK, SCISSORS): (1, -1),
    (PAPER, ROCK): (1, -1),
    (PAPER, PAPER): (0, 0),
    (PAPER, SCISSORS): (-1, 1),
    (SCISSORS, ROCK): (-1, 1),
    (SCISSORS, PAPER): (1, -1),
    (SCISSORS, SCISSORS): (0, 0),
}
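# Example lookup (illustrative): if player_0 plays ROCK and player_1 plays SCISSORS,
# REWARD_MAP[(ROCK, SCISSORS)] == (1, -1), i.e. player_0 gets +1 and player_1 gets -1.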


def env(render_mode=None):
    """
    The env function often wraps the environment in wrappers by default.
    You can find full documentation for these methods
    elsewhere in the developer documentation.
    """
    internal_render_mode = render_mode if render_mode != "ansi" else "human"
    # If render_mode is "ansi", internal_render_mode becomes "human";
    # if render_mode is "human", it stays "human";
    # any other value is passed through unchanged.

    env = raw_env(render_mode=internal_render_mode)
    # This wrapper is only for environments which print results to the terminal

    if render_mode == "ansi":
        env = wrappers.CaptureStdoutWrapper(env)
    # this wrapper helps error handling for discrete action spaces

    env = wrappers.AssertOutOfBoundsWrapper(env)
    # Provides a wide variety of helpful user errors
    # Strongly recommended

    env = wrappers.OrderEnforcingWrapper(env)
    return env
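
# For reference, the resulting wrapper chain (derived from the code above; illustrative):
#   render_mode == "ansi":  OrderEnforcingWrapper(AssertOutOfBoundsWrapper(CaptureStdoutWrapper(raw_env)))
#   any other render_mode:  OrderEnforcingWrapper(AssertOutOfBoundsWrapper(raw_env))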


class raw_env(AECEnv):
    """
    The metadata holds environment constants. From gymnasium, we inherit the "render_modes"
    metadata, which specifies which modes can be put into the render() method.
    At least human mode should be supported.
    The "name" metadata allows the environment to be pretty printed.
    """

    metadata = {"render_modes": ["human"], "name": "rps_v2"}

    def __init__(self, render_mode=None):
        """
        The init method takes in environment arguments and
        should define the following attributes:
        - possible_agents
        - action_spaces
        - observation_spaces
        These attributes should not be changed after initialization.
        """
        self.possible_agents = ["player_" + str(r) for r in range(2)]
        # possible_agents lists the names of the agents that can appear in the environment.

        self.agent_name_mapping = dict(
            zip(self.possible_agents, list(range(len(self.possible_agents))))
        )
        # zip() pairs the two lists so that dict() can build the mapping.
        # self.agent_name_mapping is {'player_0': 0, 'player_1': 1}

        # gymnasium spaces are defined and documented here: https://gymnasium.farama.org/api/spaces/
        self._action_spaces = {agent: Discrete(3) for agent in self.possible_agents}
        # self._action_spaces is {'player_0': Discrete(3), 'player_1': Discrete(3)}
        self._observation_spaces = {
            agent: Discrete(4) for agent in self.possible_agents
        }
        self.render_mode = render_mode

    # this cache ensures that the same space object is returned for the same agent
    # allows action space seeding to work as expected
    @functools.lru_cache(maxsize=None)
    def observation_space(self, agent):
        # gymnasium spaces are defined and documented here: https://gymnasium.farama.org/api/spaces/
        return Discrete(4)


    @functools.lru_cache(maxsize=None)
    def action_space(self, agent):
        return Discrete(3)
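
    # Note (illustrative): because of functools.lru_cache, repeated calls with the same
    # agent return the very same space object, e.g.
    #   env.action_space("player_0") is env.action_space("player_0")  -> True
    # which is what makes seeding that space behave consistently across calls.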

    def render(self):
        """
        Renders the environment. In human mode, it can print to terminal, open
        up a graphical window, or open up some other display that a human can see and understand.
        """
        if self.render_mode is None:
            gymnasium.logger.warn(
                "You are calling render method without specifying any render mode."
            )
            return

        if len(self.agents) == 2:
            string = "Current state: Agent1: {} , Agent2: {}".format(
                MOVES[self.state[self.agents[0]]], MOVES[self.state[self.agents[1]]]
            )
        else:
            string = "Game over"
        print(string)
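        # Example output for one time step (illustrative):
        #   Current state: Agent1: ROCK , Agent2: PAPER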

    def observe(self, agent):
        """
        Observe should return the observation of the specified agent. This function
        should return a sane observation (though not necessarily the most up to date possible)
        at any time after reset() is called.
        """

        # the observation of one agent is the previous move of the other agent
        return np.array(self.observations[agent])

    def close(self):
        """
        Close should release any graphical displays, subprocesses, network connections
        or any other environment data which should not be kept around after the
        user is no longer using the environment.
        """
        pass

    def reset(self, seed=None, return_info=False, options=None):
        """
        Reset needs to initialize the following attributes
        - agents
        - rewards
        - _cumulative_rewards
        - terminations
        - truncations
        - infos
        - agent_selection
        And must set up the environment so that render(), step(), and observe()
        can be called without issues.
        Here it sets up the state dictionary which is used by step() and the
        observations dictionary which is used by step() and observe().
        """
        self.agents = self.possible_agents[:]
        # The [:] makes a copy instead of a reference, so modifying self.agents
        # does not affect self.possible_agents.
        self.rewards = {agent: 0 for agent in self.agents}
        self._cumulative_rewards = {agent: 0 for agent in self.agents}
        self.terminations = {agent: False for agent in self.agents}
        self.truncations = {agent: False for agent in self.agents}
        self.infos = {agent: {} for agent in self.agents}
        self.state = {agent: NONE for agent in self.agents}
        self.observations = {agent: NONE for agent in self.agents}
        self.num_moves = 0  # time-step counter
        """
        Our agent_selector utility allows easy cyclic stepping through the agents list.
        我们的agent_selector实用程序允许轻松地循环遍历代理列表。
        """
        self._agent_selector = agent_selector(self.agents)
        self.agent_selection = self._agent_selector.next()
        # The selector cycles through the agent list; self.agent_selection holds the
        # "current" agent, i.e. the one expected to act next (inferred from how it is
        # used later in step()).
        # The first call to next() selects agent 0 ("player_0"); the selector's own
        # reset() would return the same agent, so next() effectively plays the role
        # of a reset here. (See the short agent_selector sketch after the full listing.)

    def step(self, action):
        """
        step(action) takes in an action for the current agent (specified by
        agent_selection) and needs to update
        - rewards
        - _cumulative_rewards (accumulating the rewards)
        - terminations
        - truncations
        - infos
        - agent_selection (to the next agent)
        And any internal state used by observe() or render()
        """
        if (
            self.terminations[self.agent_selection]
            or self.truncations[self.agent_selection]
        ):
            # handles stepping an agent which is already dead
            # accepts a None action for the one agent, and moves the agent_selection to
            # the next dead agent, or if there are no more dead agents, to the next live agent
            self._was_dead_step(action)  # defined in the parent class AECEnv
            # It removes the terminated/truncated agent from the bookkeeping
            # dictionaries and from the agent list; the relevant lines in the parent are:
            '''
            del self.terminations[agent]
            del self.truncations[agent]
            del self.rewards[agent]
            del self._cumulative_rewards[agent]
            del self.infos[agent]
            self.agents.remove(agent)
            '''

            # Since the agent is already done, one could simply force action = None here,
            # but the original code is kept as-is because this environment has no explicit
            # game-over flag. In effect, _was_dead_step(action) is what ends the episode
            # for an agent: terminations is never set to True anywhere in this file, so
            # episodes only end through truncation.
            return


        agent = self.agent_selection  # the current agent, not the next one
        # This local variable is just shorthand for self.agent_selection.

        # the agent which stepped last had its _cumulative_rewards accounted for
        # (because it was returned by last()), so the _cumulative_rewards for this
        # agent should start again at 0
        # In other words: last() (defined in the parent class AECEnv) has already
        # reported this agent's accumulated reward to the caller, so the running
        # total is reset to 0 before the agent acts again.
        self._cumulative_rewards[agent] = 0

        # stores action of current agent
        # 存储当前智能体的动作
        self.state[self.agent_selection] = action

        # collect reward if it is the last agent to act
        if self._agent_selector.is_last():
            # is_last() is True once the current agent is the last one in the cycle,
            # i.e. both players have chosen a move; this marks the end of one full
            # time step, so the rewards for the step can now be assigned.
            # rewards for all agents are placed in the .rewards dictionary
            self.rewards[self.agents[0]], self.rewards[self.agents[1]] = REWARD_MAP[
                (self.state[self.agents[0]], self.state[self.agents[1]])
            ]

            self.num_moves += 1  # num_moves counts completed time steps
            # The truncations dictionary must be updated for all players.
            self.truncations = {
                agent: self.num_moves >= NUM_ITERS for agent in self.agents
            }
            # Truncation triggers once the number of completed time steps reaches
            # NUM_ITERS, so truncation is what enforces the maximum episode length.

            # observe the current state
            for i in self.agents:
                self.observations[i] = self.state[
                    self.agents[1 - self.agent_name_mapping[i]]
                ]
        else:  # only one player has acted, so this time step is not finished yet
            # necessary so that observe() returns a reasonable observation at all times.
            self.state[self.agents[1 - self.agent_name_mapping[agent]]] = NONE
            # no rewards are allocated until both players give an action
            self._clear_rewards()

        # selects the next agent.
        self.agent_selection = self._agent_selector.next()
        # Adds .rewards to ._cumulative_rewards
        self._accumulate_rewards()

        if self.render_mode == "human":
            self.render()
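
The comments in reset() point forward to a short agent_selector sketch; here it is. This is an illustrative, standalone snippet (not part of the PettingZoo example) that only assumes the agent_selector utility imported above, and shows how it cycles through the agent list:

# agent_selector sketch (illustrative, standalone)
from pettingzoo.utils import agent_selector

selector = agent_selector(["player_0", "player_1"])
agent = selector.reset()             # returns the first agent: "player_0"
print(agent, selector.is_last())     # player_0 False
agent = selector.next()              # moves on to "player_1"
print(agent, selector.is_last())     # player_1 True -> both players have acted, one time step is complete
agent = selector.next()              # wraps around to "player_0" again
print(agent)                         # player_0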

Using the custom environment (this script must sit in the same folder as Example_Custom_Environment.py):
t_ece.py

# Test Example_Custom_Environment

import Example_Custom_Environment as ece

env = ece.env(render_mode='human')
env.reset()
for agent in env.agent_iter(1000):  # the argument is an upper bound on the number of agent steps
    observation, reward, termination, truncation, info = env.last()
    if termination or truncation:
        action = None  # a finished agent must be stepped with a None action
    else:
        # random action; when using an RL algorithm, replace this with the policy's output
        action = env.action_space(agent).sample()
    env.step(action)
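
If you want to check the final outcome, one possible extension of t_ece.py is sketched below. It is illustrative only (the file name t_ece_rewards.py and the totals dictionary are made up for this sketch); it simply accumulates the reward that env.last() reports for each agent:

# t_ece_rewards.py - tally the total reward per player (illustrative sketch)
import Example_Custom_Environment as ece

env = ece.env(render_mode='human')
env.reset()
totals = {agent: 0 for agent in env.possible_agents}  # running score per player

for agent in env.agent_iter(1000):
    observation, reward, termination, truncation, info = env.last()
    totals[agent] += reward  # reward accumulated since this agent last acted
    if termination or truncation:
        action = None
    else:
        action = env.action_space(agent).sample()
    env.step(action)

env.close()
print(totals)  # e.g. {'player_0': 1, 'player_1': -1}; with the zero-sum REWARD_MAP the totals sum to 0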

Running t_ece.py produces output like the following:
(screenshot of the terminal output)
