强化学习实战2：动手写迷宫环境

最新推荐文章于 2024-08-05 10:32:23 发布

在地球迷路的怪兽

最新推荐文章于 2024-08-05 10:32:23 发布

阅读量334

点赞数 4

分类专栏：人工智能学习文章标签：人工智能

本文链接：https://blog.csdn.net/qq_53324833/article/details/140318434

版权

人工智能学习专栏收录该内容

34 篇文章 1 订阅

订阅专栏

迷宫环境介绍与创建

迷宫环境图示如下：

在这里插入图片描述

如图所示，其为一个三乘三的网格世界，我们要让 agent 从 S0 采取策略出发，然后走到 S8，图中红线部分表示障碍不能逾越，其中 S1 和 S4 之间有一个障碍，S3 和 S4 之间有一个障碍，S3 和 S6 之间有一个障碍，S5 和 S8 之间有一个障碍。

这就是我们的迷宫环境。

代码如下：

import matplotlib.pyplot as plt

# 创建一个新的图形对象，并设置其大小为 5x5 英寸
fig = plt.figure(figsize=(5, 5))

# 获取当前图形对象的轴对象
ax = plt.gca()

# 设置坐标轴的范围
ax.set_xlim(0, 3)
ax.set_ylim(0, 3)

# 绘制红色的方格边界，表示迷宫的结构
plt.plot([2, 3], [1, 1], color='red', linewidth=2)
plt.plot([0, 1], [1, 1], color='red', linewidth=2)
plt.plot([1, 1], [1, 2], color='red', linewidth=2)
plt.plot([1, 2], [2, 2], color='red', linewidth=2)

# 在指定位置添加文字标签，表示每个状态（S0-S8）、起点和终点
plt.text(0.5, 2.5, 'S0', size=14, ha='center')
plt.text(1.5, 2.5, 'S1', size=14, ha='center')
plt.text(2.5, 2.5, 'S2', size=14, ha='center')
plt.text(0.5, 1.5, 'S3', size=14, ha='center')
plt.text(1.5, 1.5, 'S4', size=14, ha='center')
plt.text(2.5, 1.5, 'S5', size=14, ha='center')
plt.text(0.5, 0.5, 'S6', size=14, ha='center')
plt.text(1.5, 0.5, 'S7', size=14, ha='center')
plt.text(2.5, 0.5, 'S8', size=14, ha='center')
plt.text(0.5, 2.3, 'Start', ha='center')
plt.text(2.5, 0.3, 'Goal', ha='center')

# 设置坐标轴的显示参数，使得坐标轴不显示
plt.tick_params(axis='both', which='both',
                bottom=False, top=False,
                right=False, left=False,
                labelbottom=False, labelleft=False)

# 在起点位置绘制一个绿色的圆形表示当前位置
line, = ax.plot([0.5], [2.5], marker='o', color='g', markersize=60)

# 显示图形
plt.show()

运行结果如下：

在这里插入图片描述

agent 的动作策略设计

如下图所示：

在这里插入图片描述

不难看出，state 总共有九个，也就是离散（discrete）且有限（finite）的三乘三网格世界。

而动作空间共有上下左右四个选择，用一个数组来表示。

而 Π 就是我们的策略了，可以看见其有一个下标 θ，这是参数的意思，在神经网络参与的算法中这个策略参数由神经网络确定，而在本节中尚未提及神经网络，因此使用一个 table ，也就是一个状态-行为矩阵来确定策略。

下面开始来刻画环境需要的边界和障碍信息：

# 刻画环境：边界border 和 障碍barrier
theta_0 = np.array([
    [np.nan, 1, 1, np.nan],  # 表示S0时的策略，即agent不能往上、不能往左走，但可以往右和下走
    [np.nan, 1, np.nan, 1],  # S1
    [np.nan, np.nan, 1, 1],  # S2
    [1, np.nan, np.nan, np.nan],  # S3
    [np.nan, 1, 1, np.nan],  # S4
    [1, np.nan, np.nan, 1],  # S5
    [np.nan, 1, np.nan, np.nan],  # S6
    [1, 1, np.nan, 1],  # S7
    # S8 已经是终点了，因此不再需要上下左右到处走了
])

然后需要将上述的参数矩阵给转换成策略 Π 的值，也就是采取各个 action 的概率值：

# 将 theta_0 转换为 策略 Π，而 Π 其实就是概率值嘛
def cvt_theta_0_to_pi(theta):
    m, n = theta.shape
    pi = np.zeros((m, n))
    for r in range(m):
        # pi[r, :]这是一个赋值操作的左侧表达式，它使用了NumPy的索引机制。
        # 在这里，pi 应该是一个二维数组（矩阵），r 表示行索引，: 表示选择该行的所有列。
        # 因此，这一部分指定了 pi 矩阵中的第 r 行的所有列。
        # np.nansum() 这是一个NumPy函数调用。np.nansum() 用于计算数组中元素的总和，忽略 NaN 值。
        # theta[r, :] 提供了一个一维数组作为函数的参数，表示对 theta 矩阵中第 r 行的所有元素进行求和。
        # 因此下面这一行代码是一个按元素的除法操作。
        # 它将 theta 矩阵中第 r 行的每个元素分别除以该行所有元素的和（忽略 NaN 值）。
        # 这样做可以将该行的元素归一化为一个概率分布，确保它们的总和为 1。
        pi[r, :] = theta[r, :] / np.nansum(theta[r, :])
    return np.nan_to_num(pi)

pi = cvt_theta_0_to_pi(theta_0)
# 先打印一下看是什么样子的结果
print(pi)

运行结果如下：

在这里插入图片描述

可以看见此时的 state-action 矩阵就已经构建好了。

下面构建动作空间列表以及状态转移函数：

# 动作空间：0表示向上，1表示向右，2表示向下，3表示向左
actions = list(range(4))

# 状态转移函数
def step(state, action):
    if action == 0:
        state -= 3
    elif action == 1:
        state += 1
    elif action == 2:
        state += 3
    elif action == 3:
        state -= 1
    return state

策略测试

那么接下来我们就可以进行测试了，使用上述构建的环境和配置，可以进行路径的搜索了：

state = 0
action_history = []
state_history = [state]
while True:
    """
    对下面这一行代码进行的解释：
    np.random.choice：
        这是 NumPy 库中的一个函数，用于从给定的一维数组或类似序列中随机选择元素。
        在这里，我们将从 actions 数组中随机选择一个元素。
    actions：
        这是一个包含可选动作的一维数组或列表。np.random.choice 将从这个数组中进行选择。
    p=pi[state, :]：
        这是 np.random.choice 函数的一个参数，用于指定每个元素被选择的概率。
        在这里，pi 是一个二维数组（矩阵），state 是当前状态的索引，: 表示选择该行的所有列。
        因此，pi[state, :] 提供了一个概率分布，用于指定在选择时每个动作的相对概率。
    赋值操作：
    action = ... 将 np.random.choice 的结果赋值给变量 action。
    这意味着 action 将是从 actions 数组中随机选择的一个元素，选择的概率由 pi[state, :] 给出。
    """
    action = np.random.choice(actions, p=pi[state, :])
    state = step(state, action)
    if state == 8:
    	state_history.append(8)
        break
    action_history.append(action)
    state_history.append(state)

print(len(state_history))
print(state_history)

运行结果如下：

在这里插入图片描述

可以看到在这一次运行中，agent 走了 38 步才到达 S8，所走过的路径如上图所示。

动画可视化搜索过程

直接上代码：

from matplotlib import animation
from IPython.display import HTML


def init():
    line.set_data([], [])
    return (line, )


def animate(i):
    state = state_history[i]
    x = (state % 3) + 0.5
    y = 2.5 - int(state / 3)
    line.set_data(x, y)


anim = animation.FuncAnimation(fig, animate, init_func=init, frames=len(state_history), interval=200, repeat=False)
anim.save('maze_0.mp4')
# 视频观测有时候不太友好，我们还可以使用 IPython 提供的 HTML 的交互式工具
# 由于 PyCharm 不支持显示 IPython 的交互式输出，因此我们这里将 IPython 的输出转换为 HTML 文件再打开
with open('animation.html', 'w') as f:
    f.write(anim.to_jshtml())

运行结果如下：

在这里插入图片描述

比起直接观看视频，这种交互式的过程更能帮助我们从细节上一点一点看清楚 agent 的运行轨迹。

总结

上面的代码是为了讲清楚各部分的功能和实现细节，全部代码合在一起就是下面这样：

import matplotlib.pyplot as plt
import numpy as np

# 创建一个新的图形对象，并设置其大小为 5x5 英寸
fig = plt.figure(figsize=(5, 5))

# 获取当前图形对象的轴对象
ax = plt.gca()

# 设置坐标轴的范围
ax.set_xlim(0, 3)
ax.set_ylim(0, 3)

# 绘制红色的方格边界，表示迷宫的结构
plt.plot([2, 3], [1, 1], color='red', linewidth=2)
plt.plot([0, 1], [1, 1], color='red', linewidth=2)
plt.plot([1, 1], [1, 2], color='red', linewidth=2)
plt.plot([1, 2], [2, 2], color='red', linewidth=2)

# 在指定位置添加文字标签，表示每个状态（S0-S8）、起点和终点
plt.text(0.5, 2.5, 'S0', size=14, ha='center')
plt.text(1.5, 2.5, 'S1', size=14, ha='center')
plt.text(2.5, 2.5, 'S2', size=14, ha='center')
plt.text(0.5, 1.5, 'S3', size=14, ha='center')
plt.text(1.5, 1.5, 'S4', size=14, ha='center')
plt.text(2.5, 1.5, 'S5', size=14, ha='center')
plt.text(0.5, 0.5, 'S6', size=14, ha='center')
plt.text(1.5, 0.5, 'S7', size=14, ha='center')
plt.text(2.5, 0.5, 'S8', size=14, ha='center')
plt.text(0.5, 2.3, 'Start', ha='center')
plt.text(2.5, 0.3, 'Goal', ha='center')

# 设置坐标轴的显示参数，使得坐标轴不显示
plt.tick_params(axis='both', which='both',
                bottom=False, top=False,
                right=False, left=False,
                labelbottom=False, labelleft=False)

# 在起点位置绘制一个绿色的圆形表示当前位置
line, = ax.plot([0.5], [2.5], marker='o', color='g', markersize=60)

# 显示图形
# plt.show()

# 刻画环境：边界border 和 障碍barrier
theta_0 = np.array([
    [np.nan, 1, 1, np.nan],  # 表示S0时的策略，即agent不能往上、不能往左走，但可以往右和下走
    [np.nan, 1, np.nan, 1],
    [np.nan, np.nan, 1, 1],
    [1, np.nan, np.nan, np.nan],
    [np.nan, 1, 1, np.nan],
    [1, np.nan, np.nan, 1],
    [np.nan, 1, np.nan, np.nan],
    [1, 1, np.nan, 1],
    # S8 已经是终点了，因此不再需要上下左右到处走了
])


# 将 theta_0 转换为 策略 Π，而 Π 其实就是概率值嘛
def cvt_theta_0_to_pi(theta):
    m, n = theta.shape
    pi = np.zeros((m, n))
    for r in range(m):
        # pi[r, :]这是一个赋值操作的左侧表达式，它使用了NumPy的索引机制。
        # 在这里，pi 应该是一个二维数组（矩阵），r 表示行索引，: 表示选择该行的所有列。
        # 因此，这一部分指定了 pi 矩阵中的第 r 行的所有列。
        # np.nansum() 这是一个NumPy函数调用。np.nansum() 用于计算数组中元素的总和，忽略 NaN 值。
        # theta[r, :] 提供了一个一维数组作为函数的参数，表示对 theta 矩阵中第 r 行的所有元素进行求和。
        # 因此下面这一行代码是一个按元素的除法操作。
        # 它将 theta 矩阵中第 r 行的每个元素分别除以该行所有元素的和（忽略 NaN 值）。
        # 这样做可以将该行的元素归一化为一个概率分布，确保它们的总和为 1。
        pi[r, :] = theta[r, :] / np.nansum(theta[r, :])
    return np.nan_to_num(pi)


pi = cvt_theta_0_to_pi(theta_0)

# 动作空间
actions = list(range(4))
print(actions)


# 状态转移函数
def step(state, action):
    if action == 0:
        state -= 3
    elif action == 1:
        state += 1
    elif action == 2:
        state += 3
    elif action == 3:
        state -= 1
    return state


state = 0
action_history = []
state_history = [state]
while True:
    """
    对下面这一行代码进行的解释：
    np.random.choice：
        这是 NumPy 库中的一个函数，用于从给定的一维数组或类似序列中随机选择元素。
        在这里，我们将从 actions 数组中随机选择一个元素。
    actions：
        这是一个包含可选动作的一维数组或列表。np.random.choice 将从这个数组中进行选择。
    p=pi[state, :]：
        这是 np.random.choice 函数的一个参数，用于指定每个元素被选择的概率。
        在这里，pi 是一个二维数组（矩阵），state 是当前状态的索引，: 表示选择该行的所有列。
        因此，pi[state, :] 提供了一个概率分布，用于指定在选择时每个动作的相对概率。
    赋值操作：
    action = ... 将 np.random.choice 的结果赋值给变量 action。
    这意味着 action 将是从 actions 数组中随机选择的一个元素，选择的概率由 pi[state, :] 给出。
    """
    action = np.random.choice(actions, p=pi[state, :])
    state = step(state, action)
    if state == 8:
        state_history.append(8)
        break
    action_history.append(action)
    state_history.append(state)

print(len(state_history))
print(state_history)

from matplotlib import animation
from IPython.display import HTML


def init():
    line.set_data([], [])
    return (line, )


def animate(i):
    state = state_history[i]
    x = (state % 3) + 0.5
    y = 2.5 - int(state / 3)
    line.set_data(x, y)


anim = animation.FuncAnimation(fig, animate, init_func=init, frames=len(state_history), interval=200, repeat=False)
anim.save('maze_0.mp4')
# 视频观测有时候不太友好，我们还可以使用 IPython 提供的 HTML 的交互式工具
# 由于 PyCharm 不支持显示 IPython 的交互式输出，因此我们这里将 IPython 的输出转换为 HTML 文件再打开
with open('animation.html', 'w') as f:
    f.write(anim.to_jshtml())

封装迷宫环境 API

上面的代码虽然已经完成了环境的编写了，但是过于混乱，因此这一节我们仿照 Gym 的格式将其封装一下。

主要就是两个部分，一个是环境，还有一个是 Agent。

对于环境其最重要的就是 step 函数，step 函数接收一个 action 然后返回一个新的状态以及 reward 等各种信息。

对于 Agent 而言其最重要的是一个动作策略的选取。

基本结构如下：

import gym
import numpy as np
import matplotlib.pyplot as plt


class MazeEnv(gym.Env):
    def __init__(self):
        pass

    def reset(self):
        pass

    def step(self, action):
        pass


class Agent:
    def __init__(self):
        pass

    def choose_action(self, state):
        pass

封装如下：

import gym
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import animation
from IPython.display import HTML


# MazeEnv 类维护着状态，以及 step 函数的返回
class MazeEnv(gym.Env):
    def __init__(self):
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        if action == 0:
            self.state -= 3
        elif action == 1:
            self.state += 1
        elif action == 2:
            self.state += 3
        elif action == 3:
            self.state -= 1
        done = False
        if self.state == 8:
            done = True
        # 1 表示 reward, done 表示是否结束
        # {} 是一个空的字典字面量，表示返回的第四个值是一个空字典。
        return self.state, 1, done, {}


# Agent 类基于当前环境中的状态选择动作形成策略
class Agent:
    def __init__(self):
        # action space
        self.actions = list(range(4))
        # 刻画环境：边界 border 和 障碍 barrier
        self.theta_0 = np.array([
            [np.nan, 1, 1, np.nan],  # 表示S0时的策略，即agent不能往上、不能往左走，但可以往右和下走
            [np.nan, 1, np.nan, 1],
            [np.nan, np.nan, 1, 1],
            [1, np.nan, np.nan, np.nan],
            [np.nan, 1, 1, np.nan],
            [1, np.nan, np.nan, 1],
            [np.nan, 1, np.nan, np.nan],
            [1, 1, np.nan, 1],
            # S8 已经是终点了，因此不再需要上下左右到处走了
        ])
        # 策略 Π
        self.pi = self._cvt_theta_0_to_pi(self.theta_0)

    # 将 theta_0 转换为 策略 Π，而 Π 其实就是概率值嘛
    def _cvt_theta_0_to_pi(self, theta):
        m, n = theta.shape
        pi = np.zeros((m, n))
        for r in range(m):
            pi[r, :] = theta[r, :] / np.nansum(theta[r, :])
        return np.nan_to_num(pi)

    def choose_action(self, state):
        action = np.random.choice(self.actions, p=self.pi[state, :])
        return action


# MazeAnimation 类封装了可视化展示代码的 API
class MazeAnimation:
    def __init__(self):
        # 创建一个新的图形对象，并设置其大小为 5x5 英寸
        self.fig = plt.figure(figsize=(5, 5))
        # 获取当前图形对象的轴对象
        self.ax = plt.gca()
        # 设置坐标轴的范围
        self.ax.set_xlim(0, 3)
        self.ax.set_ylim(0, 3)
        # 绘制红色的方格边界，表示迷宫的结构
        plt.plot([2, 3], [1, 1], color='red', linewidth=2)
        plt.plot([0, 1], [1, 1], color='red', linewidth=2)
        plt.plot([1, 1], [1, 2], color='red', linewidth=2)
        plt.plot([1, 2], [2, 2], color='red', linewidth=2)
        # 在指定位置添加文字标签，表示每个状态（S0-S8）、起点和终点
        plt.text(0.5, 2.5, 'S0', size=14, ha='center')
        plt.text(1.5, 2.5, 'S1', size=14, ha='center')
        plt.text(2.5, 2.5, 'S2', size=14, ha='center')
        plt.text(0.5, 1.5, 'S3', size=14, ha='center')
        plt.text(1.5, 1.5, 'S4', size=14, ha='center')
        plt.text(2.5, 1.5, 'S5', size=14, ha='center')
        plt.text(0.5, 0.5, 'S6', size=14, ha='center')
        plt.text(1.5, 0.5, 'S7', size=14, ha='center')
        plt.text(2.5, 0.5, 'S8', size=14, ha='center')
        plt.text(0.5, 2.3, 'Start', ha='center')
        plt.text(2.5, 0.3, 'Goal', ha='center')
        # 设置坐标轴的显示参数，使得坐标轴不显示
        plt.tick_params(axis='both', which='both',
                        bottom=False, top=False,
                        right=False, left=False,
                        labelbottom=False, labelleft=False)

        # 在起点位置绘制一个绿色的圆形表示当前位置
        self.line, = self.ax.plot([0.5], [2.5], marker='o', color='g', markersize=60)

    # state_history 是状态历史数组
    def save_as_mp4_html(self, state_history):
        def init():
            self.line.set_data([], [])
            return (self.line,)

        def animate(i):
            state = state_history[i]
            x = (state % 3) + 0.5
            y = 2.5 - int(state / 3)
            self.line.set_data(x, y)

        anim = animation.FuncAnimation(self.fig, animate, init_func=init, frames=len(state_history), interval=200,
                                       repeat=False)
        anim.save('maze.mp4')
        with open('maze_animation.html', 'w') as f:
            f.write(anim.to_jshtml())


# -------------------------------测试代码---------------------------------

# 创建迷宫环境
env = MazeEnv()
# 重置迷宫环境到初始状态
state = env.reset()
# 创建 agent
agent = Agent()
# 结束标志 done
done = False
# 动作历史
action_history = []
# 状态历史，含一开始位于 S0 的状态
state_history = [state]
# 一个循环就是一个 episode
while not done:
    action = agent.choose_action(state)
    state, reward, done, _ = env.step(action)
    state_history.append(state)
    action_history.append(action)

# 打印状态历史长度
print(len(state_history))
# 可视化展示
# 创建封装好的可视化对象
show = MazeAnimation()
# 调用该对象中的可视化展示函数
show.save_as_mp4_html(state_history)

运行结果和之前一样，不再赘述。

在地球迷路的怪兽

关注

4
点赞
踩
6

收藏

觉得还不错? 一键收藏
打赏
0
评论
强化学习实战2：动手写迷宫环境

创建一个新的图形对象，并设置其大小为 5x5 英寸# 获取当前图形对象的轴对象# 设置坐标轴的范围# 绘制红色的方格边界，表示迷宫的结构# 在指定位置添加文字标签，表示每个状态（S0-S8）、起点和终点# 设置坐标轴的显示参数，使得坐标轴不显示# 在起点位置绘制一个绿色的圆形表示当前位置# 显示图形# 刻画环境：边界border 和障碍barrier[np.nan, 1, 1, np.nan], # 表示S0时的策略，即agent不能往上、不能往左走，但可以往右和下走。
复制链接

扫一扫