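Reinforcement Learning on a Maze Task (continuously updated) - Lesson 1: Building the Maze Environment (a reproduction of the project from Yutaro Ogawa's 《边做边学深度强化学习》)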

Article link: https://blog.csdn.net/weixin_45871797/article/details/127616871

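This post covers reinforcement learning on a maze task. It first introduces the definition and representation of a policy, then describes the agent's movement rules in the maze and how the move probabilities are computed, and explains the NumPy functions involved, such as np.nansum and np.nanmean. Finally, it defines a function that keeps moving the agent and uses matplotlib to render the path.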


Reinforcement Learning on a Maze Task

Lesson 1: Building the Maze Environment

import numpy as np
import matplotlib.pyplot as plt



# Initial state of the maze
# Create the figure, set its size, and keep handles to the figure and axes
fig = plt.figure(figsize=(5,5))
ax = plt.gca()
# Draw the red walls
plt.plot([1,1],[0,1],color='red',linewidth =2)
plt.plot([1,2],[2,2],color='red',linewidth =2)
plt.plot([2,2],[2,1],color='red',linewidth =2)
plt.plot([2,3],[1,1],color='red',linewidth =2)
# Label the states S0 to S8, plus the start and the goal
plt.text(0.5,2.5,'S0',size=14,ha='center')
plt.text(1.5,2.5,'S1',size=14,ha='center')
plt.text(2.5,2.5,'S2',size=14,ha='center')
plt.text(0.5,1.5,'S3',size=14,ha='center')
plt.text(1.5,1.5,'S4',size=14,ha='center')
plt.text(2.5,1.5,'S5',size=14,ha='center')
plt.text(0.5,0.5,'S6',size=14,ha='center')
plt.text(1.5,0.5,'S7',size=14,ha='center')
plt.text(2.5,0.5,'S8',size=14,ha='center')
plt.text(0.5,2.3,'start',ha='center')
plt.text(2.5,0.3,'goal',ha='center')
# Set the plotting range
ax.set_xlim(0,3)
ax.set_ylim(0,3)
plt.tick_params(axis='both',which='both',bottom=False,top=False,labelbottom=False,right=False,left=False,labelleft=False)
# Draw the current position S0 as a green circle
line,=ax.plot([0.5],[2.5],marker='o',color='g',markersize=60)
plt.show()

Result of the first step
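
The nine plt.text calls for S0 to S8 follow a regular pattern, so the label positions can also be computed. The following is a minimal sketch, not part of the original code; the helper name state_label_positions and its parameters are assumptions made for illustration.

```python
def state_label_positions(n_rows=3, n_cols=3):
    """Return (x, y, label) for the centre of every cell, row by row from the top."""
    positions = []
    for s in range(n_rows * n_cols):
        row, col = divmod(s, n_cols)  # row 0 is the top row of the maze
        x = col + 0.5
        y = n_rows - row - 0.5        # the y axis points up, so flip the row index
        positions.append((x, y, f"S{s}"))
    return positions


labels = state_label_positions()
print(labels[0], labels[8])
```

Each tuple could then be passed to plt.text(x, y, label, size=14, ha='center') in a loop instead of writing the nine calls by hand.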

1.2

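In reinforcement learning, the rule that defines how an agent behaves is called the policy. The policy is written π_θ(s, a): the probability of taking action a in state s, following the policy π determined by the parameters θ.

In this example the agent can move up, down, left, or right, but cannot pass through the red walls. We represent the policy as a table in which the rows are the states s, the columns are the actions a, and each entry is the probability of taking that action.
If the policy π is a function, then θ is the parameter of that function; for a neural network, θ corresponds to the network's weights.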


# Set theta_0, the initial value of the parameter theta, which determines the initial policy
#
# Columns: [up, right, down, left]
theta_0=np.array([[np.nan,1,1,np.nan], #S0
                  [np.nan,1,np.nan,1],#s1
                  [np.nan, np.nan,1,1],#s2
                  [1, 1, 1, np.nan],#s3
                  [np.nan, np.nan,1,1],#s4
                  [ 1, np.nan,np.nan, np.nan],#s5
                  [ 1, np.nan,np.nan, np.nan],#s6
                  [1, 1, np.nan, np.nan],#s7
                  # s8 is the goal, so it has no policy
                 ])
print(theta_0)

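In this code, np.nan is a missing-value marker meaning that a wall blocks the move. The four columns of each row say whether the agent can move up, right, down, or left from that state. For example, at S7 the figure shows that the agent can only move up or right.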
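
To check a row of theta_0 against the figure, the non-nan columns can be listed by name. This is a small sketch under the column order [up, right, down, left] stated above; the helper open_directions is a hypothetical name, not part of the original code.

```python
import numpy as np

DIRECTIONS = ['up', 'right', 'down', 'left']

theta_0 = np.array([[np.nan, 1, 1, np.nan],       # S0
                    [np.nan, 1, np.nan, 1],       # S1
                    [np.nan, np.nan, 1, 1],       # S2
                    [1, 1, 1, np.nan],            # S3
                    [np.nan, np.nan, 1, 1],       # S4
                    [1, np.nan, np.nan, np.nan],  # S5
                    [1, np.nan, np.nan, np.nan],  # S6
                    [1, 1, np.nan, np.nan]])      # S7


def open_directions(theta, s):
    """Return the names of the directions that are not blocked in state s."""
    return [d for d, value in zip(DIRECTIONS, theta[s]) if not np.isnan(value)]


print(open_directions(theta_0, 7))
print(open_directions(theta_0, 0))
```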
————————————————————————————————————————————

The next block of code converts the parameters theta into the policy pi:


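def simple_convert_into_pi_from_theta(theta):
    """Convert theta into a policy pi by simple row-wise normalization."""
    [m,n]=theta.shape  # m states (rows), n actions (columns)
    pi = np.zeros((m,n))
    for i in range(0,m):
        # Divide each entry by the row sum, ignoring nan entries
        pi[i,:]=theta[i,:] / np.nansum(theta[i,:])
    pi = np.nan_to_num(pi)  # Replace the remaining nan entries with 0
    return pi

# Compute the initial policy pi_0
pi_0=simple_convert_into_pi_from_theta(theta_0)
print(pi_0)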

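This code normalizes each row: every entry is divided by the sum of its row (ignoring nan), which gives a rough probability of moving in each direction from that state. Because every allowed entry of theta_0 is 1, the allowed directions in a state are equally likely.
Output (theta_0, followed by the converted pi_0):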

[[nan  1.  1. nan]
 [nan  1. nan  1.]
 [nan nan  1.  1.]
 [ 1.  1.  1. nan]
 [nan nan  1.  1.]
 [ 1. nan nan nan]
 [ 1. nan nan nan]
 [ 1.  1. nan nan]]
[[0.         0.5        0.5        0.        ]
 [0.         0.5        0.         0.5       ]
 [0.         0.         0.5        0.5       ]
 [0.33333333 0.33333333 0.33333333 0.        ]
 [0.         0.         0.5        0.5       ]
 [1.         0.         0.         0.        ]
 [1.         0.         0.         0.        ]
 [0.5        0.5        0.         0.        ]]
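
The same row-wise normalization can be written without the explicit loop. The following is a sketch of an equivalent vectorized form, assuming every row has at least one non-nan entry; the function name convert_theta_to_pi_vectorized and the three-row sample are illustrative, not from the original code.

```python
import numpy as np


def convert_theta_to_pi_vectorized(theta):
    """Row-wise normalization of theta, ignoring nan, without a Python loop."""
    # keepdims=True keeps the sums as a column so they broadcast across each row
    row_sums = np.nansum(theta, axis=1, keepdims=True)
    pi = theta / row_sums          # nan entries stay nan here
    return np.nan_to_num(pi)       # blocked moves get probability 0


# Three rows taken from theta_0 (S0, S3, S5) as a small sample
theta_sample = np.array([[np.nan, 1, 1, np.nan],
                         [1, 1, 1, np.nan],
                         [1, np.nan, np.nan, np.nan]])
pi_sample = convert_theta_to_pi_vectorized(theta_sample)
print(pi_sample)
```

Each row of the result sums to 1, which is the property a policy table needs.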

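This uses the np.nansum function. The background is as follows:
when you take the sum or the mean of a NumPy array that contains nan, the result is itself nan, as the following code shows: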

>>> arr = np.array([1, 2, 3, 4, np.nan])
>>> arr.sum()
nan
>>> arr.mean()
nan

>>> np.nansum(arr) # np.nansum() treats nan as 0
10.0
>>> np.nanmean(arr) # np.nanmean() skips nan entirely: (1 + 2 + 3 + 4) / 4 = 2.5, not (1 + 2 + 3 + 4 + 0) / 5 = 2
2.5

How do you ignore nan when computing a sum or a mean?
Use np.nansum() and np.nanmean().
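
Applied to a single row of theta_0, the difference looks like this. This is a minimal sketch using the S0 row from above; the variable names are chosen for illustration.

```python
import numpy as np

row_s0 = np.array([np.nan, 1.0, 1.0, np.nan])  # the S0 row of theta_0

plain_sum = row_s0.sum()                         # nan propagates through the sum
nan_aware_sum = np.nansum(row_s0)                # nan entries are treated as 0
nan_aware_mean = np.nanmean(row_s0)              # nan entries are skipped
n_open = np.count_nonzero(~np.isnan(row_s0))     # number of non-nan entries

print(plain_sum, nan_aware_sum, nan_aware_mean, n_open)
```

Because every allowed entry is 1, the nan-aware sum equals the number of open directions, which is why dividing by it yields equal probabilities.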

————————————————————————————————————————————


# Define a function that returns the state s after one step
def get_next_s(pi,s):
    direction = ['up','right','down','left']
    next_direction=np.random.choice(direction,p=pi[s,:])
    # Choose a direction according to the probabilities pi[s,:]

    if next_direction =='up':
        s_next = s - 3
    elif next_direction == 'down':
        s_next = s + 3
    elif  next_direction == 'left':
        s_next = s - 1
    elif next_direction == 'right':
        s_next = s + 1

    return s_next

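This function defines how the agent's position changes: moving down adds 3 to the state index, moving up subtracts 3, moving right adds 1, and moving left subtracts 1.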
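
The same index arithmetic can be expressed as a lookup table instead of an if/elif chain. This is a sketch only; MOVE_OFFSETS and apply_move are hypothetical names, and the helper does not check for walls, so it relies on the policy giving blocked moves probability 0.

```python
# Offset added to the state index for each move on the 3x3 grid
MOVE_OFFSETS = {'up': -3, 'right': +1, 'down': +3, 'left': -1}


def apply_move(s, direction):
    """Return the state index reached from s by moving in the given direction."""
    return s + MOVE_OFFSETS[direction]


# Walk S0 -> S3 -> S4 -> S7 -> S8
path = [0]
for move in ['down', 'right', 'down', 'right']:
    path.append(apply_move(path[-1], move))
print(path)
```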
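How np.random.choice() works:

np.random.choice(a, size) draws size elements at random from the array a (with replacement by default); the optional argument p gives the probability of drawing each element.

Given the matrix pi and the state index s, get_next_s(pi, s) samples next_direction from row s of pi and then derives the next state index s_next from it.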
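
The effect of the p argument can be seen by sampling many times from one row of the policy. The following is a small sketch using the S0 row of pi_0; the seed and the sample size of 1000 are arbitrary choices made for illustration.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed so the run is repeatable

directions = ['up', 'right', 'down', 'left']
p_s0 = [0.0, 0.5, 0.5, 0.0]  # the S0 row of pi_0

# Draw 1000 directions; entries with probability 0 are never selected
samples = np.random.choice(directions, size=1000, p=p_s0)
values, counts = np.unique(samples, return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))
```

Only 'right' and 'down' appear, each in roughly half of the draws.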

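Final step:

Keep calling get_next_s to move the agent until it reaches the goal.
Next we define a function goal_maze that keeps moving the agent: its while loop runs until the agent reaches the goal, and the states visited along the way are stored in the list state_history, which is returned at the end so the path can be inspected.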

# Define a function that keeps moving the agent until it reaches the goal
def goal_maze(pi):
    s = 0
    state_history = [0]

    while True:  # Loop indefinitely: compute next_s from the current s and pi until the goal is reached
        next_s = get_next_s(pi,s)
        state_history.append(next_s)
        if next_s ==8:
            break
        else:
            s = next_s
           # print(state_history)
    return state_history

# Run goal_maze: move the agent according to the policy π_θ0(s, a) and store the trajectory in state_history
# Move toward the goal
state_history = goal_maze(pi_0)
print(pi_0)

print(state_history)

print('Number of steps needed to solve the maze: '+ str(len(state_history)-1))
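
Because the policy is random, the number of steps differs from run to run. The sketch below repeats the episode to get a feel for the spread; it is self-contained and restates the maze setup. The names run_episode and OFFSETS, the seed, and the count of 100 episodes are assumptions for illustration.

```python
import numpy as np

theta_0 = np.array([[np.nan, 1, 1, np.nan],       # S0
                    [np.nan, 1, np.nan, 1],       # S1
                    [np.nan, np.nan, 1, 1],       # S2
                    [1, 1, 1, np.nan],            # S3
                    [np.nan, np.nan, 1, 1],       # S4
                    [1, np.nan, np.nan, np.nan],  # S5
                    [1, np.nan, np.nan, np.nan],  # S6
                    [1, 1, np.nan, np.nan]])      # S7

# Row-wise normalization, ignoring nan; blocked moves get probability 0
pi_0 = np.nan_to_num(theta_0 / np.nansum(theta_0, axis=1, keepdims=True))

# Key order matches the column order [up, right, down, left]
OFFSETS = {'up': -3, 'right': 1, 'down': 3, 'left': -1}


def run_episode(pi, start=0, goal=8):
    """Move randomly under pi from start until the goal state is reached."""
    s = start
    history = [s]
    while s != goal:
        direction = np.random.choice(list(OFFSETS), p=pi[s, :])
        s += OFFSETS[direction]
        history.append(s)
    return history


np.random.seed(0)  # arbitrary seed so the run is repeatable
step_counts = [len(run_episode(pi_0)) - 1 for _ in range(100)]
print(min(step_counts), np.mean(step_counts), max(step_counts))
```

The shortest possible route (S0, S3, S4, S7, S8) takes 4 steps, so no episode can be shorter than that.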

Finally, we use matplotlib to animate the path and save it as test1.gif, which shows the agent's route through the maze:


from matplotlib import animation
from IPython.display import HTML
from matplotlib.animation import FuncAnimation
def init():
    # Initialize the background image
    line.set_data([],[])
    return (line,)

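def animate(i):
    # Draw the agent's position for frame i
    state = state_history[i]
    x = (state % 3) + 0.5
    y = 2.5 - (state // 3)
    line.set_data([x], [y])  # Pass sequences; recent matplotlib versions reject scalars here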

    return (line,)
anim = animation.FuncAnimation(fig,func=animate,init_func=init,frames=len(state_history),interval=200,blit=True)

#HTML(anim.to_jshtml())
anim.save('test1.gif', writer='pillow', fps=100)
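
The coordinate mapping inside animate can be checked on its own, without drawing anything. This is a minimal sketch; example_history is a hypothetical trajectory (the shortest route), not output from the code above.

```python
import numpy as np

# Hypothetical trajectory: the shortest route S0 -> S3 -> S4 -> S7 -> S8
example_history = np.array([0, 3, 4, 7, 8])

# Same mapping as in animate: column gives x, row (counted from the top) gives y
xs = example_history % 3 + 0.5
ys = 2.5 - example_history // 3

print(list(zip(xs.tolist(), ys.tolist())))
```

Passing xs and ys to ax.plot would draw the whole route as a static line, which can be a quick alternative to the animation.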