OpenAI Gym中FrozenLake环境（场景）源码分析（3）

蓝天居士

已于 2023-07-14 16:59:10 修改

阅读量360

点赞数

分类专栏：强化学习 OpenAI Gym 文章标签： OpenAI Gym 强化学习 Q-learning

于 2023-07-14 16:42:46 首次发布

本文链接：https://blog.csdn.net/phmatthaus/article/details/131726211

版权

强化学习同时被 2 个专栏收录

11 篇文章 1 订阅

订阅专栏

OpenAI Gym

7 篇文章 0 订阅

订阅专栏

本文详细介绍了如何使用Python进行OpenAIGym中的FrozenLake环境源码调试，通过设置断点和单步执行，解释了关键步骤，包括环境创建、动作选择等。同时，文章提到了Q学习算法的实现，包括学习率、探索率等参数的设定，并展示了训练过程中的奖励和Q表更新。

摘要由CSDN通过智能技术生成

接前一篇文章：OpenAI Gym中FrozenLake环境（场景）源码分析（2）

前一篇文章讲到通过python进行调试：

$ python -m pdb frozen_lake2.py 
> /home/penghao/OpenAI-Gym/sample_code/frozen_lake2.py(1)<module>()
-> import numpy as np
(Pdb)

程序自己主动停在了第1行，等待调试。

为了便于理解，再次贴出frozen_lake2.py的源码：

import numpy as np
import gym
import random
import time
from IPython.display import clear_output
 
env = gym.make("FrozenLake-v1")
 
observation_space = env.observation_space
print("The observation space: {}".format(observation_space))
observation_space_size = env.observation_space.n
print(observation_space_size)
 
action_space = env.action_space
print("The action space: {}".format(action_space))
action_space_size = env.action_space.n
print(action_space_size)
 
q_table = np.zeros((observation_space_size, action_space_size))
# q_table = np.zeros([observation_space_size, action_space_size])
print(q_table)
 
"""
num_episodes = 10000
max_steps_per_episode = 100
learning_rate = 0.1
discount_rate = 0.99
exploration_rate = 1
max_exploration_rate = 1
min_exploration_rate = 0.01
exploration_decay_rate = 0.01
"""
 
total_episodes = 15000        # Total episodes 训练次数
learning_rate = 0.8           # Learning rate 学习率
max_steps = 99                # Max steps per episode 一次训练中最多决策次数
gamma = 0.95                  # Discounting rate 折扣率，对未来收益的折扣
 
# Exploration parameters
epsilon = 1.0                 # Exploration rate 探索率，就是选择动作时，随机选择动作的概率
max_epsilon = 1.0             # Exploration probability at start 初始探索率
min_epsilon = 0.01            # Minimum exploration probability 最低探索率
decay_rate = 0.001            # Exponential decay rate for exploration prob 探索率消减的指数
 
# List of rewards
rewards = []
 
# For life or until learning is stopped
for episode in range(total_episodes):
    # Reset the environment
    state = env.reset()
    state = state[0] #本来没这条代码，但是我看这个是二元组，为了后面估计Q值可以跑，我就改成这个了，我看着是不影响的
    step = 0
    done = False
    total_rewards = 0
 
    for step in range(max_steps):
        # Choose an action a in the current world state (s)
        ## First we randomize a number
        exp_exp_tradeoff = random.uniform(0, 1)
        
        ## If this number > greater than epsilon --> exploitation (taking the biggest Q value for this state)
        if exp_exp_tradeoff > epsilon:
            action = np.argmax(q_table[state,:])
 
        # Else doing a random choice --> exploration
        else:
            action = env.action_space.sample()
 
        # Take the action (a) and observe the outcome state(s') and reward (r)
        new_state, reward, done, truncated, info = env.step(action) # 这个也是，刚开始报错，来后我查了新的库这个函数输出五个数，网上说最后那个加‘_’就行
        #new_state, reward, done, info, _ = env.step(action) # 这个也是，刚开始报错，来后我查了新的库这个函数输出五个数，网上说最后那个加‘_’就行
 
        # Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
        # qtable[new_state,:] : all the actions we can take from new state
        q_table[state, action] = q_table[state, action] + learning_rate * (reward + gamma * np.max(q_table[new_state, :]) - q_table[state, action])
        
        total_rewards += reward
        
        # Our new state is state
        state = new_state
        
        # If done (if we're dead) : finish episode
        if done == True: 
            break
        
        #if truncated == True:
            #break
        
    # Reduce epsilon (because we need less and less exploration) 随着智能体对环境熟悉程度增加，可以减少对环境的探索
    epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode) 
    rewards.append(total_rewards)
 
print ("Score over time: " +  str(sum(rewards)/total_episodes))
print(q_table)

前文提到框架中包括4个关键步骤：

env = gym.make("FrozenLake-v1")
env.reset()
env.action_space.sample()
new_state, reward, done, truncated, info = env.step(action)

下面一个一个进行单步调试，并进行代码解析。

先看步骤1：env = gym.make("FrozenLake-v1")。

这句代码在frozen_lake2.py文件中的第7行，因此将断点设置在文件的第10行，命令及结果如下：

 python -m pdb frozen_lake2.py 
> /home/penghao/OpenAI-Gym/sample_code/frozen_lake2.py(1)<module>()
-> import numpy as np
(Pdb) b 7
Breakpoint 1 at /home/penghao/OpenAI-Gym/sample_code/frozen_lake2.py:7
(Pdb)

之后输入c，使程序继续运行（调到这个断点）。如下所示：

$ python -m pdb frozen_lake2.py 
> /home/penghao/OpenAI-Gym/sample_code/frozen_lake2.py(1)<module>()
-> import numpy as np
(Pdb) b 7
Breakpoint 1 at /home/penghao/OpenAI-Gym/sample_code/frozen_lake2.py:7
(Pdb) c
> /home/penghao/OpenAI-Gym/sample_code/frozen_lake2.py(7)<module>()
-> env = gym.make("FrozenLake-v1")
(Pdb)

可以看到，程序已经停在了之前设置的断点的位置。

输入s，细点执行，也就是通常所说的Step In，即进入到函数或方法中。如下所示：

-> env = gym.make("FrozenLake-v1")
(Pdb) s
--Call--
> /home/penghao/.local/lib/python3.11/site-packages/gym/envs/registration.py(502)make()
-> def make(
(Pdb)

可以看到，程序已经进入到了make函数中。最为关键的是，其指示出了make函数所在的位置，是在gym/envs/registration.py文件中。make方法代码如下：

def make(
    id: Union[str, EnvSpec],
    max_episode_steps: Optional[int] = None,
    autoreset: bool = False,
    apply_api_compatibility: Optional[bool] = None,
    disable_env_checker: Optional[bool] = None,
    **kwargs,
) -> Env:
    """Create an environment according to the given ID.

    To find all available environments use `gym.envs.registry.keys()` for all valid ids.

    Args:
        id: Name of the environment. Optionally, a module to import can be included, eg. 'module:Env-v0'
        max_episode_steps: Maximum length of an episode (TimeLimit wrapper).
        autoreset: Whether to automatically reset the environment after each episode (AutoResetWrapper).
        apply_api_compatibility: Whether to wrap the environment with the `StepAPICompatibility` wrapper that
            converts the environment step from a done bool to return termination and truncation bools.
            By default, the argument is None to which the environment specification `apply_api_compatibility` is used
            which defaults to False. Otherwise, the value of `apply_api_compatibility` is used.
            If `True`, the wrapper is applied otherwise, the wrapper is not applied.
        disable_env_checker: If to run the env checker, None will default to the environment specification `disable_env_checker`
            (which is by default False, running the environment checker),
            otherwise will run according to this parameter (`True` = not run, `False` = run)
        kwargs: Additional arguments to pass to the environment constructor.

    Returns:
        An instance of the environment.

    Raises:
        Error: If the ``id`` doesn't exist then an error is raised
    """
    if isinstance(id, EnvSpec):
        spec_ = id
    else:
        module, id = (None, id) if ":" not in id else id.split(":")
        if module is not None:
            try:
                importlib.import_module(module)
            except ModuleNotFoundError as e:
                raise ModuleNotFoundError(
                    f"{e}. Environment registration via importing a module failed. "
                    f"Check whether '{module}' contains env registration and can be imported."
                )
        spec_ = registry.get(id)

        ns, name, version = parse_env_id(id)
        latest_version = find_highest_version(ns, name)
        if (
            version is not None
            and latest_version is not None
            and latest_version > version
        ):
            logger.warn(
                f"The environment {id} is out of date. You should consider "
                f"upgrading to version `v{latest_version}`."
            )
        if version is None and latest_version is not None:
            version = latest_version
            new_env_id = get_env_id(ns, name, version)
            spec_ = registry.get(new_env_id)
            logger.warn(
                f"Using the latest versioned environment `{new_env_id}` "
                f"instead of the unversioned environment `{id}`."
            )

        if spec_ is None:
            _check_version_exists(ns, name, version)
            raise error.Error(f"No registered env with id: {id}")

    _kwargs = spec_.kwargs.copy()
    _kwargs.update(kwargs)

    if spec_.entry_point is None:
        raise error.Error(f"{spec_.id} registered but entry_point is not specified")
    elif callable(spec_.entry_point):
        env_creator = spec_.entry_point
    else:
        # Assume it's a string
        env_creator = load(spec_.entry_point)

    mode = _kwargs.get("render_mode")
    apply_human_rendering = False
    apply_render_collection = False

    # If we have access to metadata we check that "render_mode" is valid and see if the HumanRendering wrapper needs to be applied
    if mode is not None and hasattr(env_creator, "metadata"):
        assert isinstance(
            env_creator.metadata, dict
        ), f"Expect the environment creator ({env_creator}) metadata to be dict, actual type: {type(env_creator.metadata)}"

        if "render_modes" in env_creator.metadata:
            render_modes = env_creator.metadata["render_modes"]
            if not isinstance(render_modes, Sequence):
                logger.warn(
                    f"Expects the environment metadata render_modes to be a Sequence (tuple or list), actual type: {type(render_modes)}"
                )

            # Apply the `HumanRendering` wrapper, if the mode=="human" but "human" not in render_modes
            if (
                mode == "human"
                and "human" not in render_modes
                and ("rgb_array" in render_modes or "rgb_array_list" in render_modes)
            ):
                logger.warn(
                    "You are trying to use 'human' rendering for an environment that doesn't natively support it. "
                    "The HumanRendering wrapper is being applied to your environment."
                )
                apply_human_rendering = True
                if "rgb_array" in render_modes:
                    _kwargs["render_mode"] = "rgb_array"
                else:
                    _kwargs["render_mode"] = "rgb_array_list"
            elif (
                mode not in render_modes
                and mode.endswith("_list")
                and mode[: -len("_list")] in render_modes
            ):
                _kwargs["render_mode"] = mode[: -len("_list")]
                apply_render_collection = True
            elif mode not in render_modes:
                logger.warn(
                    f"The environment is being initialised with mode ({mode}) that is not in the possible render_modes ({render_modes})."
                )
        else:
            logger.warn(
                f"The environment creator metadata doesn't include `render_modes`, contains: {list(env_creator.metadata.keys())}"
            )

    if apply_api_compatibility is True or (
        apply_api_compatibility is None and spec_.apply_api_compatibility is True
    ):
        # If we use the compatibility layer, we treat the render mode explicitly and don't pass it to the env creator
        render_mode = _kwargs.pop("render_mode", None)
    else:
        render_mode = None

    try:
        env = env_creator(**_kwargs)
    except TypeError as e:
        if (
            str(e).find("got an unexpected keyword argument 'render_mode'") >= 0
            and apply_human_rendering
        ):
            raise error.Error(
                f"You passed render_mode='human' although {id} doesn't implement human-rendering natively. "
                "Gym tried to apply the HumanRendering wrapper but it looks like your environment is using the old "
                "rendering API, which is not supported by the HumanRendering wrapper."
            )
        else:
            raise e

    # Copies the environment creation specification and kwargs to add to the environment specification details
    spec_ = copy.deepcopy(spec_)
    spec_.kwargs = _kwargs
    env.unwrapped.spec = spec_

    # Add step API wrapper
    if apply_api_compatibility is True or (
        apply_api_compatibility is None and spec_.apply_api_compatibility is True
    ):
        env = EnvCompatibility(env, render_mode)

    # Run the environment checker as the lowest level wrapper
    if disable_env_checker is False or (
        disable_env_checker is None and spec_.disable_env_checker is False
    ):
        env = PassiveEnvChecker(env)

    # Add the order enforcing wrapper
    if spec_.order_enforce:
        env = OrderEnforcing(env)

    # Add the time limit wrapper
    if max_episode_steps is not None:
        env = TimeLimit(env, max_episode_steps)
    elif spec_.max_episode_steps is not None:
        env = TimeLimit(env, spec_.max_episode_steps)

    # Add the autoreset wrapper
    if autoreset:
        env = AutoResetWrapper(env)

    # Add human rendering wrapper
    if apply_human_rendering:
        env = HumanRendering(env)
    elif apply_render_collection:
        env = RenderCollection(env)

    return env

这个比较长（200行左右），我们只需要在后文中关注我们实际用到的，到时候再回过头来看此处的代码。

余下3个步骤的分析请看后续文章。