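Reinforcement Learning Notes Based on the "Mushroom Book" (13): Chapter 3 Code: Updates and Annotations for racetrack.py and Its Related Code (gym version >= 0.26) (Part 2)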

Article link: https://blog.csdn.net/xzs1210652636/article/details/145865653

Abstract

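This series works through the difficult points of the Mushroom Book (EasyRL) in detail. For the full material, please read EasyRL itself.
In the directory that contains MonteCarlo.ipynb, create a folder named envs, then download racetrack.py and track.txt and place them in that envs folder.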
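Before opening the notebook, it can help to confirm that both files landed in the right place. The following is a minimal sketch, assuming the notebook's directory is passed in (or is the working directory); the helper name `missing_env_files` is hypothetical and is not part of EasyRL.

```python
from pathlib import Path

# The two files the article says belong in the envs folder.
REQUIRED_FILES = ("racetrack.py", "track.txt")


def missing_env_files(notebook_dir="."):
    """Return the required files that are absent from <notebook_dir>/envs.

    Hypothetical helper for illustration only.
    """
    envs_dir = Path(notebook_dir) / "envs"
    return [name for name in REQUIRED_FILES if not (envs_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_env_files()
    if missing:
        print("Missing from envs/:", ", ".join(missing))
    else:
        print("envs/ contains both racetrack.py and track.txt")
```

An empty list means the layout matches the description above; racetrack.py locates track.txt relative to its own location, so the two files need to stay in the same folder.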
#!/usr/bin/env python

import time
import random
import numpy as np
import os
import matplotlib.pyplot as plt
import matplotlib.patheffects as pe
from IPython.display import clear_output
from gym.spaces import Discrete,Box
from gym import Env
from matplotlib import colors

class RacetrackEnv(Env) :
    """
    Class representing a race-track environment inspired by exercise 5.12 in Sutton & Barto 2018 (p.111).
    Please do not make changes to this class - it will be overwritten with a clean version when it comes to marking.

    The dynamics of this environment are detailed in this coursework exercise's jupyter notebook, although I have
    included rather verbose comments here for those of you who are interested in how the environment has been
    implemented (though this should not impact your solution code).
    """
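    """
    Purpose:
    - Defines a class named RacetrackEnv that inherits from the Env class of the Gym library (gym.Env).
    Why inherit:
    - Subclassing Env means RacetrackEnv is expected to provide the standard Gym environment
      interface, such as reset(), step() and render(), so that it can plug into reinforcement
      learning algorithms and libraries (for example Stable Baselines or OpenAI Baselines).
    """
    '''
    Purpose:
    - Defines a dictionary that maps each action index (0 to 8) to a two-dimensional velocity change (d_y, d_x).
    Meaning:
    - Each action changes the velocity of the car (the agent).
    - For example, action 0 accelerates vertically while braking horizontally (d_y=+1, d_x=-1);
      action 4 leaves the current velocity unchanged.
    '''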
    ACTIONS_DICT = {
        0 : (1, -1),  # Acc Vert., Brake Horiz.
        1 : (1, 0),   # Acc Vert., Hold Horiz.
        2 : (1, 1),   # Acc Vert., Acc Horiz.
        3 : (0, -1),  # Hold Vert., Brake Horiz.
        4 : (0, 0),   # Hold Vert., Hold Horiz.
        5 : (0, 1),   # Hold Vert., Acc Horiz.
        6 : (-1, -1), # Brake Vert., Brake Horiz.
        7 : (-1, 0),  # Brake Vert., Hold Horiz.
        8 : (-1, 1)   # Brake Vert., Acc Horiz.
    }

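    '''
    Purpose:
    - Defines the type of each grid cell in the map.
    Explanation:
    - 0 is track (drivable area).
    - 1 is wall (an obstacle that cannot be crossed).
    - 2 is the start area.
    - 3 is the goal area.
    '''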
    CELL_TYPES_DICT = {
        0 : "track",
        1 : "wall",
        2 : "start",
        3 : "goal"
    }
    '''
    Purpose:
    - Defines the environment metadata, telling Gym which render modes are supported
      (here only 'human') and the render frame rate (4 frames per second).
    '''
    metadata = {'render_modes': ['human'], "render_fps": 4,}

    def __init__(self,render_mode = 'human') :
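        """
        Purpose:
        - The constructor initialises an environment instance.
        Parameter render_mode:
        - The default 'human' means the environment is rendered in a form a person can watch.
        """
        '''
        Purpose:
        - np.loadtxt loads the map data from "track.txt" as integers.
        - os.path.dirname(__file__) is the directory of this script; appending "/track.txt" gives the path to the map file.
        - np.flip(..., axis=0) reverses the row order of the loaded 2-D array (a vertical flip).
        '''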
        # Load racetrack map from file.
        self.track = np.flip(np.loadtxt(os.path.dirname(__file__)+"/track.txt", dtype = int), 
                             axis = 0)

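        '''
        Purpose:
        - Walks over the whole map (the 2-D array self.track), finds every cell whose
          type is "start", and stores its coordinates (y, x) in the list initial_states.
        Example:
        - Start cells have the value 2. If they sit on the last line of track.txt, the vertical
          flip applied when loading moves that line to row 0, so the recorded coordinates would
          look like [(0, 0), (0, 1), (0, 2), (0, 3)] (the exact positions depend on the file contents).
        '''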
        # Discover start grid squares.
        self.initial_states = []
        for y in range(self.track.shape[0]) :
            for x in range(self.track.shape[1]) :
                if (self.CELL_TYPES_DICT[self.track[y, x]] == "start") :
                    self.initial_states.append((y, x))
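        '''
        Observation space:
        - Box defines a continuous space of shape (4,): a state consists of four floats,
          (y_pos, x_pos, y_velocity, x_velocity).
        - low=-high, high=high gives every dimension a very wide range,
          enough to contain every reachable state.
        Action space:
        - Discrete(9) defines a discrete action space with indices 0 to 8,
          matching the 9 actions in ACTIONS_DICT.
        '''
        '''
        np.finfo(np.float32).max
        - Returns the largest finite value representable as float32, about 3.4028235e+38.
        - Such an extreme value is commonly used as a stand-in for an "unbounded" limit when defining a numeric space.
        np.array([...])
        - Packs four copies of that maximum into a NumPy array.
        - The result is a one-dimensional array high of shape (4,), for example:
        high = [3.4028235e+38, 3.4028235e+38, 3.4028235e+38, 3.4028235e+38]
        '''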
        high= np.array([np.finfo(np.float32).max, np.finfo(np.float32).max, 
                        np.finfo(np.float32).max, np.finfo(np.float32).max])
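        '''
        Effect:
        - observation_space is a continuous space in which every state is a 4-dimensional vector
          whose components may range from -3.4028235e+38 to 3.4028235e+38.
          In practice the state values are far smaller than these extremes; the bounds are simply
          wide enough to hold any value that can occur.
        '''
        self.observation_space = Box(low=-high, high=high, shape=(4,), dtype=np.float32)
        '''The action space has 9 distinct actions, indexed 0 to 8.'''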
        self.action_space = Discrete(9)
        self.is_reset = False

    def step(self, action : int) :
        """
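        Purpose:
        - Executes the given action and returns the next state, the reward, the done flag and an info dictionary.
        Parameter:
        - action is an integer (0 to 8) identifying the action to take.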

        Takes a given action in the environment's current state, and returns a next state,
        reward, and whether the next state is done or not.

        Arguments:
            action {int} -- The action to take in the environment's current state. Should be an integer in the range [0-8].

        Raises:
            RuntimeError: Raised when the environment needs resetting.\n
            TypeError: Raised when an action of an invalid type is given.\n
            ValueError: Raised when an action outside the range [0-8] is given.\n

        Returns:
            A tuple of:\n
                {np.ndarray} -- The next state, an array of [y_pos, x_pos, y_velocity, x_velocity].\n
                {int} -- The reward earned by taking the given action in the current environment state.\n
                {bool} -- Whether the environment's next state is done or not.\n
                {dict} -- An info dictionary, always empty in this implementation.\n

        """
        # Check whether a reset is needed.
        '''
        Purpose:
        - If reset() has not been called yet (is_reset is False), raise an error
          saying that reset() must be called first to initialise the environment.
        '''
        if (not self.is_reset) :
            raise RuntimeError(".step() has been called when .reset() is needed.\n" +
                               "You need to call .reset() before using .step() for the first time, and after an episode ends.\n" +
                               ".reset() initialises the environment at the start of an episode, then returns an initial state.")

        # Check that action is the correct type (either a python integer or a numpy integer).
        '''
        Purpose:
        - Ensure that the supplied action is an integer and lies in the legal range [0, 8].
        '''
        if (not (isinstance(action, int) or isinstance(action, np.integer))) :
            raise TypeError("action should be an integer.\n" +
                            "action value {} of type {} was supplied.".format(action, type(action)))
        # Check that action is an allowed value.
        if (action < 0 or action > 8) :
            raise ValueError("action must be an integer in the range [0-8] corresponding to one of the legal actions.\n" +
                             "action value {} was supplied.".format(action))

        # Update Velocity.
        # With probability 0.8, update velocity components as intended.
        '''
        Purpose:
        - Choose the velocity change for this step:
          - with probability 0.8, use the (d_y, d_x) that ACTIONS_DICT assigns to the action;
          - with probability 0.2, leave the velocity unchanged, i.e. (0, 0), which adds some randomness.
        Example:
        - If action is 2, ACTIONS_DICT[2] = (1, 1);
          with probability 0.8, (d_y, d_x) = (1, 1); otherwise it is (0, 0).
        '''
        if (np.random.uniform() < 0.8) :
            (d_y, d_x) = self.ACTIONS_DICT[action]
        # With probability 0.2, do not change velocity components.
        else :
            (d_y, d_x) = (0, 0)
        '''
        Purpose:
        - Add the change (d_y, d_x) to the current velocity to obtain the new velocity.
        '''
        self.velocity = (self.velocity[0] + d_y, self.velocity[1] + d_x)
        # Keep velocity within bounds [-10, 10].
        '''
        Purpose:
        - Clamp each velocity component to the range -10 to 10.
        Note:
        - self.velocity is a tuple, and tuples do not support item assignment,
          so a new clamped tuple is built with max/min.
        '''
        self.velocity = (max(-10, min(10, self.velocity[0])),
                         max(-10, min(10, self.velocity[1])))

        # Update Position.
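        '''
        Purpose:
        - Compute the new position from the current position and the updated velocity.
        '''
        new_position = (self.position[0] + self.velocity[0], self.position[1] + self.velocity[1])
        '''
        Purpose:
        - Initialise the immediate reward to 0 and the termination flag to False.
        '''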
        reward = 0
        done = False

        # If position is out-of-bounds, return to start and set velocity components to zero.
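        '''Out-of-bounds handling:
        Purpose:
        - If the new position lies outside the map, move the agent back to a random start cell
          (chosen from initial_states), set the velocity to 0, and apply a penalty (reward minus 10).
        '''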
        if (new_position[0] < 0 or new_position[1] < 0 or new_position[0] >= self.track.shape[0] or new_position[1] >= self.track.shape[1]) :
            self.position = random.choice(self.initial_states)
            self.velocity = (0, 0)
            reward -= 10
        # If position is in a wall grid-square, return to start and set velocity components to zero.
        elif (self.CELL_TYPES_DICT[self.track[new_position]] == "wall") :
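            '''Wall handling:
            Purpose:
            - If the new position is a wall (obstacle) cell, likewise return to a start cell,
              zero the velocity, and apply the same penalty.
            '''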
            self.position = random.choice(self.initial_states)
            self.velocity = (0, 0)
            reward -= 10
        # If position is in a track grid-square or a start-square, update position.
        elif (self.CELL_TYPES_DICT[self.track[new_position]] in ["track", "start"]) :
            '''Normal driving:
            Purpose:
            - If the new position is a track or start cell, set the position to new_position.
            '''
            self.position = new_position
        # If position is in a goal grid-square, end episode.
        elif (self.CELL_TYPES_DICT[self.track[new_position]] == "goal") :
            '''Goal handling:
            Purpose:
            - If the new position is a goal cell, update the position, add a reward of +10,
              and set done to True to mark the end of the episode.
            '''
            self.position = new_position
            reward += 10
            done = True
        # If this gets reached, then the student has touched something they shouldn't have. Naughty!
        else :
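            '''Any other case:
            Purpose:
            - If the new position matches none of the conditions above, raise an error.
              This guards against unexpected cell values in the map.
            '''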
            raise RuntimeError("You've met with a terrible fate, haven't you?\nDon't modify things you shouldn't!")

        # Penalise every timestep.
        '''
        Purpose:
        - Apply a time penalty of -1 on every step, which encourages the agent to reach the goal quickly.
        '''
        reward -= 1

        # Require a reset if the current state is done.
        '''Purpose:
        - If the episode has ended (done is True), set is_reset to False so that
          reset() must be called before the next interaction.
        '''
        if (done) :
            self.is_reset = False

        # Return next state, reward, and whether the episode has ended.
        return np.array([self.position[0], self.position[1], self.velocity[0], self.velocity[1]]), reward, done,{}


    def reset(self,seed=None) :
        """
        Resets the environment, ready for a new episode to begin, then returns an initial state.
        The initial state will be a starting grid square randomly chosen using a uniform distribution,
        with both components of the velocity being zero.

        Returns:
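            {np.ndarray} -- an initial state, an array of [y_pos, x_pos, y_velocity, x_velocity].
        Purpose:
        - Resets the environment state so that a new episode can begin.
        - The seed argument is accepted but not used by this implementation.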
        """

        # Pick random starting grid-square.
        '''Pick one start cell at random from self.initial_states as the agent's starting position.'''
        self.position = random.choice(self.initial_states)

        # Set both velocity components to zero.
        '''Reset the velocity to (0, 0).'''
        self.velocity = (0, 0)
        '''Set is_reset to True: the environment is initialised and step() may now be called.'''
        self.is_reset = True
        '''Return the initial state
        as a NumPy array [y_position, x_position, y_velocity, x_velocity].'''
        return np.array([self.position[0], self.position[1], self.velocity[0], self.velocity[1]])


    def render(self, render_mode = 'human') :
        """
        Renders a pretty matplotlib plot representing the current state of the environment.
        Calling this method on subsequent timesteps will update the plot.
        This is VERY VERY SLOW and will slow down training a lot. Only use for debugging/testing.

        Arguments:
            render_mode {str} -- The rendering mode; 'human' is the only mode listed in metadata.

        """
        # Turn interactive render_mode on.
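        '''Turn on matplotlib's interactive mode so the figure window can be updated dynamically.'''
        plt.ion()
        '''plt.figure(num="env_render") creates, or activates if it already exists, the figure identified as "env_render".'''
        fig = plt.figure(num = "env_render")
        '''plt.gca() returns the current axes.'''
        ax = plt.gca()
        '''ax.clear() clears the contents of the current axes.'''
        ax.clear()
        '''clear_output(wait=True), from IPython.display, clears the previous output so frames do not pile up.'''
        clear_output(wait = True)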

        # Prepare the environment plot and mark the car's position.
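        '''Create env_plot as a copy of the map data.'''
        env_plot = np.copy(self.track)
        '''Mark the agent's current cell (self.position) in env_plot with the value 4,
        which the colour map below uses for the car.'''
        env_plot[self.position] = 4
        '''np.flip reverses the rows again for display: row 0 of self.track is the bottom of
        the track, while imshow draws row 0 at the top, so the flip puts it back at the bottom.'''
        env_plot = np.flip(env_plot, axis = 0)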

        # Plot the gridworld.
        '''
        Define the colour map cmap, one colour per cell value:
            "white"  (0) is track,
            "black"  (1) is wall,
            "green"  (2) is the start area,
            "red"    (3) is the goal area,
            "yellow" (4) is the agent's current position.
        '''
        cmap = colors.ListedColormap(["white", "black", "green", "red", "yellow"])
        bounds = list(range(6))
        '''BoundaryNorm maps the integer cell values onto the colour bins.'''
        norm = colors.BoundaryNorm(bounds, cmap.N)
        '''ax.imshow() draws env_plot as an image on the axes; zorder=0 puts it on the bottom layer.'''
        ax.imshow(env_plot, cmap = cmap, norm = norm, zorder = 0)

        # Plot the velocity.
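        '''Purpose:
        If the agent's velocity is not (0, 0), draw an arrow showing its direction and magnitude.
        - The arrow starts at the current position (note how x and y are mapped to plot coordinates);
        - path_effects adds an outline to the arrow so that it stands out.
        Example:
        - Suppose self.position = (5, 7) on the map and the velocity is (2, -1):
          the arrow starts at (x=7, y=track_height-1-5) and points to (x+velocity[1], y-velocity[0]).'''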
        if (not self.velocity == (0, 0)) :
            ax.arrow(self.position[1], self.track.shape[0] - 1 - self.position[0], self.velocity[1], -self.velocity[0],
                     path_effects=[pe.Stroke(linewidth=1, foreground='black')], color = "yellow", width = 0.1, length_includes_head = True, zorder = 2)

        # Set up axes.
        '''
        Purpose:
        - Add grid lines so that the border of every cell is clearly visible.
        - Place the x and y ticks on the cell borders, one per cell, but hide the tick labels.
        '''
        ax.grid(which = 'major', axis = 'both', linestyle = '-', color = 'k', linewidth = 2, zorder = 1)
        ax.set_xticks(np.arange(-0.5, self.track.shape[1] , 1))
        ax.set_xticklabels([])
        ax.set_yticks(np.arange(-0.5, self.track.shape[0], 1))
        ax.set_yticklabels([])

        # Draw everything.
        #fig.canvas.draw()
        #fig.canvas.flush_events()
        plt.show()
        # Pause briefly so that each rendered frame stays visible.
        time.sleep(0.1)

    def get_actions(self) :
        """
        Returns the available actions in the current state - will always be a list
        of integers in the range [0-8].
        Purpose:
        - Returns the list of actions available in the environment, which is always the integers 0 to 8.
        Explanation:
        - [*self.ACTIONS_DICT] unpacks the keys of ACTIONS_DICT, giving [0, 1, 2, ..., 8].
        """
        return [*self.ACTIONS_DICT]
    
    
if __name__ == "__main__":
    num_steps = 1000000
    env = RacetrackEnv()
    state = env.reset()
    print(state)
    for _ in range(num_steps) :

        next_state, reward, done,_ = env.step(random.choice(env.get_actions()))
        print(next_state)
        env.render()

        if (done) :
            _ = env.reset()