The Reinforcement Learning Environment Gymnasium: 1. INTRODUCTION: 1.1 Basic Usage

Gymnasium is a project that provides an API for single-agent reinforcement learning environments, with implementations of environments such as CartPole and Pendulum. Its core is the Env class, which represents a Markov decision process. Users initialize an environment via the make function and interact with it through the reset, step and render functions. The article explains how to modify environments with Wrappers such as TimeLimit and ClipAction to suit different needs.

OpenAI's Gym has now become Gymnasium. This article is mainly a translation of the Gymnasium documentation: mostly machine translation with a small amount of manual editing. There are surely places where the translation is wrong; correct them from context.

Documentation: https://gymnasium.farama.org/content/basic_usage/

Contents

1 INTRODUCTION

1.1 Basic Usage

1.1.1 Initializing Environments

1.1.2 Interacting with the Environment

Explaining the code

1.1.3 Action and observation spaces

1.1.4 Modifying the environment



The main text begins here.

1 INTRODUCTION

1.1 Basic Usage

Gymnasium is a project that provides an API for all single-agent reinforcement learning environments, and includes implementations of common environments: cartpole, pendulum, mountain-car, mujoco, atari, and more.
The API contains four key functions: make, reset, step and render, which this basic usage guide will introduce you to. At the core of Gymnasium is Env, a high-level Python class representing a Markov decision process (MDP) from reinforcement learning theory (this is not a perfect reconstruction, as it is missing several components of MDPs). Within Gymnasium, environments (MDPs) are implemented as Env classes, along with Wrappers that can change the results passed to the user.

1.1.1 Initializing Environments

Initializing environments is very easy in Gymnasium and can be done via the make function:
import gymnasium as gym
env = gym.make('CartPole-v1')
This will return an Env for users to interact with. To see all environments you can create, use gymnasium.envs.registry.keys(). make includes a number of additional parameters for adding wrappers, specifying keywords for the environment, and more.
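
As a quick sketch of those extra parameters (the max_episode_steps and render_mode values below are illustrative, not required):
import gymnasium as gym

# List the ids of every registered environment
print(sorted(gym.envs.registry.keys()))

# Extra keyword arguments: cap the episode length and pick a render mode
env = gym.make("CartPole-v1", max_episode_steps=200, render_mode="rgb_array")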

1.1.2 Interacting with the Environment

The classic “agent-environment loop” pictured below is a simplified representation of the reinforcement learning process that Gymnasium implements.

This loop is implemented using the following Gymnasium code:
import gymnasium as gym
env = gym.make("LunarLander-v2", render_mode="human")
observation, info = env.reset()

for _ in range(1000):
    action = env.action_space.sample()  # agent policy that uses the observation and info
    observation, reward, terminated, truncated, info = env.step(action)

    if terminated or truncated:
        observation, info = env.reset()

env.close()

Explaining the code

First, an environment is created using make with an additional keyword "render_mode" that specifies how the environment should be visualised. See render for details on the default meaning of different render modes. In this example, we use the "LunarLander" environment where the agent controls a spaceship that needs to land safely.

After initializing the environment, we reset the environment to get the first observation of the environment. To initialize the environment with a particular random seed or options (see the environment documentation for possible values), use the seed or options parameters with reset.
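
A minimal sketch of a seeded reset (the seed value 42 is arbitrary):
import gymnasium as gym

env = gym.make("LunarLander-v2")
# Passing a seed to reset makes the initial observation reproducible
observation, info = env.reset(seed=42)
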
Next, the agent performs an action in the environment with step. This can be imagined as moving a robot or pressing a button on a game's controller, causing a change within the environment. As a result, the agent receives a new observation from the updated environment along with a reward for taking the action. Such a reward could, for instance, be positive for destroying an enemy or negative for moving into lava. One such action-observation exchange is referred to as a timestep.
However, after some timesteps, the environment may end; this is called the terminal state. For instance, the robot may have crashed, or the agent may have succeeded in completing a task, and the environment will need to stop, as the agent cannot continue. In Gymnasium, whether the environment has terminated is returned by step. Similarly, we may also want the environment to end after a fixed number of timesteps; in this case, the environment issues a truncated signal. If either terminated or truncated is true, then reset should be called next to restart the environment.
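
A minimal sketch of running one episode until either signal fires (CartPole-v1 is just an example environment):
import gymnasium as gym

env = gym.make("CartPole-v1")
observation, info = env.reset(seed=0)
episode_return = 0.0

while True:
    observation, reward, terminated, truncated, info = env.step(env.action_space.sample())
    episode_return += reward
    # terminated: a terminal state was reached (e.g. the pole fell over)
    # truncated: a cutoff such as the 500-step time limit ended the episode early
    if terminated or truncated:
        break

env.close()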

1.1.3 Action and observation spaces

Every environment specifies the format of valid actions and observations with the env.action_space and env.observation_space attributes. This is helpful for knowing both the expected input and output of the environment, as all valid actions and observations should be contained within the respective space.

In the example, we sampled random actions via env.action_space.sample() instead of using an agent policy that maps observations to actions, which is what users will ultimately want to do. See one of the agent tutorials for an example of creating and training an agent policy.

Every environment should have the attributes action_space and observation_space, both of which should be instances of classes that inherit from Space. Gymnasium has support for a majority of possible spaces users might need:

  • Box: describes an n-dimensional continuous space. It’s a bounded space where we can define the upper and lower limits which describe the valid values our observations can take.

  • Discrete: describes a discrete space where {0, 1, …, n-1} are the possible values our observation or action can take. Values can be shifted to {a, a+1, …, a+n-1} using an optional argument.

  • Dict: represents a dictionary of simple spaces.

  • Tuple: represents a tuple of simple spaces.

  • MultiBinary: creates an n-shape binary space. Argument n can be a number or a list of numbers.

  • MultiDiscrete: consists of a series of Discrete action spaces with a different number of actions in each element.

For example usage of spaces, see their documentation along with utility functions. There are a couple of more niche spaces: Graph, Sequence and Text.
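
A short sketch of constructing and sampling from two of these spaces (the bounds and sizes are arbitrary):
import numpy as np
from gymnasium.spaces import Box, Discrete

# A 3-dimensional continuous space with every value bounded in [-1, 1]
box = Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32)
assert box.contains(box.sample())

# A discrete space over {0, 1, 2, 3}
disc = Discrete(4)
assert disc.contains(disc.sample())

# The optional start argument shifts the values to {10, 11, 12, 13}
shifted = Discrete(4, start=10)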

1.1.4 Modifying the environment

Wrappers are a convenient way to modify an existing environment without having to alter the underlying code directly. Using wrappers allows you to avoid a lot of boilerplate code and makes your environment more modular. Wrappers can also be chained to combine their effects. Most environments that are generated via gymnasium.make will already be wrapped by default using TimeLimit, OrderEnforcing and PassiveEnvChecker.
In order to wrap an environment, you must first initialize a base environment. Then you can pass this environment along with (possibly optional) parameters to the wrapper's constructor:
>>> import gymnasium as gym
>>> from gymnasium.wrappers import FlattenObservation
>>> env = gym.make("CarRacing-v2")
>>> env.observation_space.shape
(96, 96, 3)
>>> wrapped_env = FlattenObservation(env)
>>> wrapped_env.observation_space.shape
(27648,)

Gymnasium already provides many commonly used wrappers for you. Some examples (a short chaining sketch follows this list):

  • TimeLimit: Issue a truncated signal if a maximum number of timesteps has been exceeded (or the base environment has issued a truncated signal).

  • ClipAction: Clip the action such that it lies in the action space (of type Box).

  • RescaleAction: Rescale actions to lie in a specified interval.

  • TimeAwareObservation: Add information about the index of the timestep to the observation. In some cases this is helpful to ensure that transitions are Markov.

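A minimal sketch of chaining two of the wrappers above (Pendulum-v1 and the [-1, 1] interval are arbitrary choices):
import gymnasium as gym
from gymnasium.wrappers import ClipAction, RescaleAction

env = gym.make("Pendulum-v1")  # base action space is Box(-2.0, 2.0, (1,))
# Agent actions are clipped to [-1, 1] first, then rescaled onto [-2, 2]
env = ClipAction(RescaleAction(env, min_action=-1.0, max_action=1.0))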

For a full list of implemented wrappers in gymnasium, see wrappers.

If you have a wrapped environment, and you want to get the unwrapped environment underneath all the layers of wrappers (so that you can manually call a function or change some underlying aspect of the environment), you can use the .unwrapped attribute. If the environment is already a base environment, the .unwrapped attribute will just return itself.
>>> wrapped_env
<FlattenObservation<TimeLimit<OrderEnforcing<PassiveEnvChecker<CarRacing<CarRacing-v2>>>>>>
>>> wrapped_env.unwrapped
<gymnasium.envs.box2d.car_racing.CarRacing object at 0x7f04efcb8850>
