Reinforcement Learning: Solving a Maze Pathfinding Problem with Value Iteration

1. Problem Description

As shown in the maze below, white cells are passable and black cells are obstacles. The task is to find a path that starts from the lower-left corner, avoids the obstacles, and reaches the exit.

[Figure: the maze — white cells are passable, black cells are obstacles]

2. Problem Modeling

To solve this problem with value iteration, we encode the maze as an MDP: we need to determine the state space $S$, the action space $A$, and the reward function $r$. The maze is encoded as follows:

[Figure: the encoded maze — cells numbered 0 through 24, starting from 0 at the lower-left corner and increasing left to right, bottom to top]

Clearly, each cell becomes one state of the MDP, so the state space is $S=\{0,1,2,\dots,24\}$. For an agent moving through the maze, the action space is $A=\{\text{'north'},\text{'south'},\text{'east'},\text{'west'}\}$, the four directions it can move in, and the agent moves exactly one cell per step. The reward is defined as follows:
$$
r(s_t,a_t)=\begin{cases}
-1, & \text{if the next state } s_{t+1}(s_t,a_t) \text{ is a passable white cell}\\
-100, & \text{if the next state } s_{t+1}(s_t,a_t) \text{ is a black obstacle cell}\\
100, & \text{if the next state is the exit, i.e. } s_{t+1}(s_t,a_t)=14
\end{cases}
$$
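To make this reward rule concrete, here is a minimal standalone sketch (illustration only, not part of the `migong_game` class built below; `OBSTACLES`, `EXIT`, and `reward_of` are hypothetical names). It uses the obstacle cells 2, 3, 4, 10, 11, 18, 23 and the exit cell 14 from the encoded maze; the reward depends only on the cell the move lands in.

```python
# Standalone sketch of the reward rule above (illustration only).
OBSTACLES = {2, 3, 4, 10, 11, 18, 23}   # black cells in the encoded maze
EXIT = 14                               # the goal cell

def reward_of(next_state):
    """Reward received when the agent moves into `next_state`."""
    if next_state == EXIT:
        return 100
    if next_state in OBSTACLES:
        return -100
    return -1                           # ordinary white cell

# e.g. stepping east from 13 into 14 earns +100, stepping north from 13 into 18 costs -100
print(reward_of(14), reward_of(18), reward_of(12))   # -> 100 -100 -1
```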
In the implementation, we wrap this environment in a class `migong_game()` and build the encoding above in its `__init__` method.

Note that `P` in the code is the deterministic transition table, i.e. $s_{t+1}=P(s_t,a_t)$.

    def __init__(self,epsiodes=100,gamma=1.0,epslion=1e-5):
        self.epsiodes = epsiodes
        self.gamma = gamma
        self.epslion = epslion
        self.states = np.arange(0,25)
        actions = {'north': 0, 'south': 1, 'east': 2, 'west': 3}
        self.actions = actions
        self.terminate_states = [2,3,4,10,11,18,23,14]
        # Define the reward table R(s, a)
        reward = -1*np.ones((25,4))
        # cell 2 is an obstacle
        reward[1][actions['east']] = -100
        reward[7][actions['south']] = -100
        reward[3][actions['west']] = -100
        # cell 3 is an obstacle
        reward[2][actions['east']] = -100
        reward[8][actions['south']] = -100
        reward[4][actions['west']] = -100
        # cell 4 is an obstacle
        reward[3][actions['east']] = -100
        reward[9][actions['south']] = -100
        # cell 10 is an obstacle
        reward[5][actions['north']] = -100
        reward[11][actions['west']] = -100
        reward[15][actions['south']] = -100
        # cell 11 is an obstacle
        reward[6][actions['north']] = -100
        reward[12][actions['west']] = -100
        reward[16][actions['south']] = -100
        reward[10][actions['east']] = -100
        # cell 18 is an obstacle
        reward[17][actions['east']] = -100
        reward[19][actions['west']] = -100
        reward[13][actions['north']] = -100
        reward[23][actions['south']] = -100
        # cell 23 is an obstacle
        reward[22][actions['east']] = -100
        reward[24][actions['west']] = -100
        reward[18][actions['north']] = -100
        # moving into cell 14 (the exit) is rewarded
        reward[13][actions['east']] = 100
        reward[9][actions['north']] = 100
        reward[19][actions['south']] = 100
        self.reward = reward
        # Define the deterministic state-transition table: s' = P(s, a)
        location = np.array([[20,21,22,23,24],
                             [15,16,17,18,19],
                             [10,11,12,13,14],
                             [5,6,7,8,9],
                             [0,1,2,3,4]])
        P = -1*np.ones((25,4))
        for ix in [1,2,3]:
            for iy in [1,2,3]:
               loc = ix + 5*iy
               P[loc][actions['north']] = ix + 5*(iy + 1)
               P[loc][actions['south']] = ix + 5*(iy - 1)
               P[loc][actions['east']] = ix + 1 + 5*iy
               P[loc][actions['west']] = ix - 1+5*iy
        # the four edges (corners handled separately below)
        for ix in [1,2,3]:
            iy = 4
            loc = ix + 5*iy
            P[loc][actions['south']] = ix + 5 * (iy - 1)
            P[loc][actions['east']] = ix + 1 + 5 * iy
            P[loc][actions['west']] = ix - 1 + 5 * iy
        for ix in [1,2,3]:
            iy = 0
            loc = ix + 5*iy
            P[loc][actions['north']] = ix + 5 * (iy + 1)
            P[loc][actions['east']] = ix + 1 + 5 * iy
            P[loc][actions['west']] = ix - 1 + 5 * iy
        for iy in [1,2,3]:
            ix = 0
            loc = ix + 5*iy
            P[loc][actions['north']] = ix + 5 * (iy + 1)
            P[loc][actions['south']] = ix + 5 * (iy - 1)
            P[loc][actions['east']] = ix + 1 + 5 * iy
        for iy in [1,2,3]:
            ix = 4
            loc = ix + 5*iy
            P[loc][actions['north']] = ix + 5 * (iy + 1)
            P[loc][actions['south']] = ix + 5 * (iy - 1)
            P[loc][actions['west']] = ix - 1 + 5 * iy
        # the four corner cells
        P[0][actions['east']] = 1
        P[0][actions['north']] = 5
        P[20][actions['south']] = 15
        P[20][actions['east']] = 21
        P[24][actions['west']] = 23
        P[24][actions['south']] = 19
        P[4][actions['north']] = 9
        P[4][actions['west']] = 3
        # moves that were not filled in (they would leave the grid) stay in place
        for i in range(P.shape[0]):
            for j in range(P.shape[1]):
                if P[i][j] == -1:
                    P[i][j] = i
        self.t = P
        # the initial state is 0
        self.state = 0
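A few spot checks of the transition table `self.t` (a hypothetical usage snippet, assuming the full `migong_game` class from the appendix is available) illustrate the convention: interior moves shift the state index by ±1 (east/west) or ±5 (north/south), and moves that would leave the grid keep the agent in place.

```python
env = migong_game()

# interior cell 12: one step east lands in cell 13
assert env.t[12][env.actions['east']] == 13
# corner cell 0: north reaches cell 5, but south would leave the grid, so the agent stays at 0
assert env.t[0][env.actions['north']] == 5
assert env.t[0][env.actions['south']] == 0
```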

3. Reset Function

This method resets the current state back to state 0, the starting cell.

    def reset(self):
        # self.state = int(random.random()*len(self.states))
        self.state = 0
        return self.state

4. Step Function

Given the current state $s_t$ and an action $a_t$, this method returns the next state $s_{t+1}$, the immediate reward $r_t$, and a flag `is_done` indicating whether the episode has ended.

    def step(self,action):
        state = self.state
        if state == 14:
            return state,100,True,{}
        obstacle = [2,3,4,10,11,18,23]
        if state in obstacle:
            return state,-100,True,{}
        next_state = self.t[state][self.actions[action]]
        if (next_state in obstacle)or(next_state == 14):
            is_done = True  # the episode ends if the agent steps into an obstacle or reaches the exit
        else:
            is_done = False
        self.state = next_state
        r = self.reward[state][self.actions[action]]
        return next_state,r,is_done,{}
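A hypothetical usage example (assuming a freshly constructed environment): a step north from the start cell 0 is an ordinary move, while a step east from cell 1 runs into obstacle 2 and ends the episode.

```python
env = migong_game()
env.reset()               # the agent starts in cell 0
s, r, done, _ = env.step('north')
print(s, r, done)         # 5.0 -1.0 False : an ordinary move into a white cell

env.state = 1             # place the agent next to obstacle 2 (for illustration only)
s, r, done, _ = env.step('east')
print(s, r, done)         # 2.0 -100.0 True : hitting an obstacle ends the episode
```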

5. Transition Probability Matrix

The `prob` method builds a matrix of the following form:

$$P_{ss'}^{a}=\Pr[s_{t+1}=s'\mid s_t=s,\ a_t=a]$$

    def prob(self):
        P = np.zeros((len(self.states),len(self.states),len(self.actions)))
        for s in range(len(self.states)):
            for a in self.actions:
                self.state = s
                next_state, r, is_done, _ = self.step(a)
                if is_done == False:
                    P[s,int(next_state),self.actions[a]] = 1
        self.P = P
        return P
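The resulting tensor is one-hot along the $s'$ axis for moves that land in an ordinary white cell, and all zeros for moves that end the episode, so terminal transitions contribute nothing to the Bellman backup. A couple of spot checks (assuming `migong = migong_game()` as in the main function below):

```python
P = migong.prob()
# moving north from cell 0 deterministically reaches cell 5
assert P[0, 5, migong.actions['north']] == 1
# moving east from cell 13 reaches the exit 14; the episode ends, so the whole row is zero
assert P[13, :, migong.actions['east']].sum() == 0
```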

6. Value Iteration

The optimal policy is obtained by repeatedly applying the following update to the value function:

$$
v_{k+1}(s)\leftarrow\max_{a}\Big\{R_{s}^{a}+\gamma\sum_{s'}P_{ss'}^{a}\,v_k(s')\Big\},\qquad
\pi(\cdot\mid s)\leftarrow\arg\max_{a}\Big\{R_{s}^{a}+\gamma\sum_{s'}P_{ss'}^{a}\,v_k(s')\Big\},\qquad\forall s\in S
$$

The iteration stops once $\lVert\mathbf v_{k+1}-\mathbf v_k\rVert<\varepsilon$.

    # value iteration
    def iterate_value(self):
        v1 = np.zeros(len(self.states))
        v = np.zeros(len(self.states))
        pi = np.zeros(len(self.states))
        v_iter = np.zeros(self.epsiodes)
        for epsiode in range(self.epsiodes):
            for state in self.states:
                action_array = np.zeros(len(self.actions))
                for action in self.actions:
                    temp_sum = 0
                    for next_state in self.states:
                        temp_sum += v[next_state]*self.P[state,next_state,self.actions[action]]
                    action_array[self.actions[action]] = self.reward[state,self.actions[action]] + self.gamma*temp_sum
                v1[state] = np.max(action_array)
                pi[state] = np.argmax(action_array)
            if np.linalg.norm(v - v1) <= self.epslion:
                v_iter[epsiode] = np.linalg.norm(v - v1)
                break
            else:
                v = v1  # note: v now aliases v1 (no copy), so later sweeps update the values in place
        self.pi = pi
        return pi,v_iter,epsiode
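For reference, the same Bellman backup can also be written in a vectorized form. The sketch below is not the author's implementation (the function name `value_iteration_vectorized` is hypothetical); it assumes the reward table `R` (25×4) and transition tensor `P` (25×25×4) built above, and it keeps an explicit copy of the previous value vector for the convergence test.

```python
import numpy as np

def value_iteration_vectorized(R, P, gamma=0.8, eps=1e-5, max_sweeps=100):
    """Synchronous value iteration: R has shape (25, 4), P has shape (25, 25, 4)."""
    v = np.zeros(R.shape[0])
    for _ in range(max_sweeps):
        # Q(s, a) = R(s, a) + gamma * sum_{s'} P(s, s', a) * v(s')
        q = R + gamma * np.einsum('spa,p->sa', P, v)
        v_new = q.max(axis=1)
        if np.linalg.norm(v_new - v) < eps:
            v = v_new
            break
        v = v_new
    return v, q.argmax(axis=1)   # value function and greedy policy (0-3 action indices)

# e.g. value_iteration_vectorized(migong.reward, migong.P) returns the value vector
# and a greedy policy in the same 0-3 action encoding as `pi` above.
```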

7. Trajectory Display

Following the learned policy from the start state gives the trajectory:

    def get_trad(self):
        road_key = {0:'north',1:'south',2:'east',3:'west'}
        # reset to the initial position
        self.reset()
        state = self.state
        road = [self.state]
        terminate_state = [2,3,4,10,11,18,23,14]
        while True:
            action_index = self.pi[int(state)]
            if road_key[action_index] == 'north':
                state = state + 5
            elif road_key[action_index] == 'south':
                state = state - 5
            elif road_key[action_index] == 'east':
                state = state + 1
            elif road_key[action_index] == 'west':
                state = state - 1
            road.append(state)
            if state in terminate_state:
                break
        return road

8. Main Function

The problem is solved by running the following driver code:

if __name__ == "__main__":
    migong = migong_game(gamma=0.8)
    # reset the environment
    migong.reset()
    # build the transition probability tensor P
    migong.prob()
    # run value iteration
    pi,v_iter,epsiode = migong.iterate_value()
    road = migong.get_trad()
    print("pi:")
    print(pi)
    print("最大迭代次数:")
    print(epsiode)
    print('Trajectory:')
    print(road)

9. Results

pi:
[0. 0. 0. 0. 0. 2. 2. 0. 0. 0. 0. 0. 2. 2. 0. 0. 0. 1. 1. 1. 0. 0. 1. 0.
 1.]
Number of iterations:
1
Trajectory:
[0, 5, 6, 7, 12, 13, 14]

The trajectory is visualized below, and it is indeed a shortest path:

[Figure: the trajectory 0 → 5 → 6 → 7 → 12 → 13 → 14 drawn on the maze]
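One possible way to reproduce this kind of plot (a hedged matplotlib sketch, not the code used for the original figure) is to convert each state index back to grid coordinates with `row = s // 5`, `col = s % 5` and draw the path over the obstacle cells:

```python
import numpy as np
import matplotlib.pyplot as plt

road = [0, 5, 6, 7, 12, 13, 14]           # trajectory found above
obstacles = [2, 3, 4, 10, 11, 18, 23]

grid = np.ones((5, 5, 3))                 # all-white maze
for s in obstacles:
    grid[s // 5, s % 5] = 0.0             # paint obstacle cells black

plt.imshow(grid, origin='lower')          # row 0 (cells 0-4) at the bottom, matching the encoding
plt.plot([s % 5 for s in road], [s // 5 for s in road], '-o', color='red')
plt.title('Trajectory: ' + ' -> '.join(map(str, road)))
plt.show()
```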

10. Appendix (Full Code)

import numpy as np
import random
class migong_game():
    def __init__(self,epsiodes=100,gamma=1.0,epslion=1e-5):
        self.epsiodes = epsiodes
        self.gamma = gamma
        self.epslion = epslion
        self.states = np.arange(0,25)
        actions = {'north': 0, 'south': 1, 'east': 2, 'west': 3}
        self.actions = actions
        self.terminate_states = [2,3,4,10,11,18,23,14]
        # Define the reward table R(s, a)
        reward = -1*np.ones((25,4))
        # cell 2 is an obstacle
        reward[1][actions['east']] = -100
        reward[7][actions['south']] = -100
        reward[3][actions['west']] = -100
        # cell 3 is an obstacle
        reward[2][actions['east']] = -100
        reward[8][actions['south']] = -100
        reward[4][actions['west']] = -100
        # cell 4 is an obstacle
        reward[3][actions['east']] = -100
        reward[9][actions['south']] = -100
        # cell 10 is an obstacle
        reward[5][actions['north']] = -100
        reward[11][actions['west']] = -100
        reward[15][actions['south']] = -100
        # cell 11 is an obstacle
        reward[6][actions['north']] = -100
        reward[12][actions['west']] = -100
        reward[16][actions['south']] = -100
        reward[10][actions['east']] = -100
        # cell 18 is an obstacle
        reward[17][actions['east']] = -100
        reward[19][actions['west']] = -100
        reward[13][actions['north']] = -100
        reward[23][actions['south']] = -100
        # cell 23 is an obstacle
        reward[22][actions['east']] = -100
        reward[24][actions['west']] = -100
        reward[18][actions['north']] = -100
        # moving into cell 14 (the exit) is rewarded
        reward[13][actions['east']] = 100
        reward[9][actions['north']] = 100
        reward[19][actions['south']] = 100
        self.reward = reward
        # Define the deterministic state-transition table: s' = P(s, a)
        location = np.array([[20,21,22,23,24],
                             [15,16,17,18,19],
                             [10,11,12,13,14],
                             [5,6,7,8,9],
                             [0,1,2,3,4]])
        P = -1*np.ones((25,4))
        for ix in [1,2,3]:
            for iy in [1,2,3]:
               loc = ix + 5*iy
               P[loc][actions['north']] = ix + 5*(iy + 1)
               P[loc][actions['south']] = ix + 5*(iy - 1)
               P[loc][actions['east']] = ix + 1 + 5*iy
               P[loc][actions['west']] = ix - 1+5*iy
        # the four edges (corners handled separately below)
        for ix in [1,2,3]:
            iy = 4
            loc = ix + 5*iy
            P[loc][actions['south']] = ix + 5 * (iy - 1)
            P[loc][actions['east']] = ix + 1 + 5 * iy
            P[loc][actions['west']] = ix - 1 + 5 * iy
        for ix in [1,2,3]:
            iy = 0
            loc = ix + 5*iy
            P[loc][actions['north']] = ix + 5 * (iy + 1)
            P[loc][actions['east']] = ix + 1 + 5 * iy
            P[loc][actions['west']] = ix - 1 + 5 * iy
        for iy in [1,2,3]:
            ix = 0
            loc = ix + 5*iy
            P[loc][actions['north']] = ix + 5 * (iy + 1)
            P[loc][actions['south']] = ix + 5 * (iy - 1)
            P[loc][actions['east']] = ix + 1 + 5 * iy
        for iy in [1,2,3]:
            ix = 4
            loc = ix + 5*iy
            P[loc][actions['north']] = ix + 5 * (iy + 1)
            P[loc][actions['south']] = ix + 5 * (iy - 1)
            P[loc][actions['west']] = ix - 1 + 5 * iy
        # the four corner cells
        P[0][actions['east']] = 1
        P[0][actions['north']] = 5
        P[20][actions['south']] = 15
        P[20][actions['east']] = 21
        P[24][actions['west']] = 23
        P[24][actions['south']] = 19
        P[4][actions['north']] = 9
        P[4][actions['west']] = 3
        # moves that were not filled in (they would leave the grid) stay in place
        for i in range(P.shape[0]):
            for j in range(P.shape[1]):
                if P[i][j] == -1:
                    P[i][j] = i
        self.t = P
        # the initial state is 0
        self.state = 0

    # reset: put the agent back at the start
    def reset(self):
        # self.state = int(random.random()*len(self.states))
        self.state = 0
        return self.state

    # step: apply an action and return next state, reward, and done flag
    def step(self,action):
        state = self.state
        if state == 14:
            return state,100,True,{}
        obstacle = [2,3,4,10,11,18,23]
        if state in obstacle:
            return state,-100,True,{}
        next_state = self.t[state][self.actions[action]]
        if (next_state in obstacle)or(next_state == 14):
            is_done = True  # the episode ends if the agent steps into an obstacle or reaches the exit
        else:
            is_done = False
        self.state = next_state
        r = self.reward[state][self.actions[action]]
        return next_state,r,is_done,{}

    def prob(self):
        P = np.zeros((len(self.states),len(self.states),len(self.actions)))
        for s in range(len(self.states)):
            for a in self.actions:
                self.state = s
                next_state, r, is_done, _ = self.step(a)
                if is_done == False:
                    P[s,int(next_state),self.actions[a]] = 1
        self.P = P
        return P

    # value iteration
    def iterate_value(self):
        v1 = np.zeros(len(self.states))
        v = np.zeros(len(self.states))
        pi = np.zeros(len(self.states))
        v_iter = np.zeros(self.epsiodes)
        for epsiode in range(self.epsiodes):
            for state in self.states:
                action_array = np.zeros(len(self.actions))
                for action in self.actions:
                    temp_sum = 0
                    for next_state in self.states:
                        temp_sum += v[next_state]*self.P[state,next_state,self.actions[action]]
                    action_array[self.actions[action]] = self.reward[state,self.actions[action]] + self.gamma*temp_sum
                v1[state] = np.max(action_array)
                pi[state] = np.argmax(action_array)
            if np.linalg.norm(v - v1) <= self.epslion:
                v_iter[epsiode] = np.linalg.norm(v - v1)
                break
            else:
                v = v1  # note: v now aliases v1 (no copy), so later sweeps update the values in place
        self.pi = pi
        return pi,v_iter,epsiode

    # extract the trajectory under the learned policy
    def get_trad(self):
        road_key = {0:'north',1:'south',2:'east',3:'west'}
        # reset to the initial position
        self.reset()
        state = self.state
        road = [self.state]
        terminate_state = [2,3,4,10,11,18,23,14]
        while True:
            action_index = self.pi[int(state)]
            if road_key[action_index] == 'north':
                state = state + 5
            elif road_key[action_index] == 'south':
                state = state - 5
            elif road_key[action_index] == 'east':
                state = state + 1
            elif road_key[action_index] == 'west':
                state = state - 1
            road.append(state)
            if state in terminate_state:
                break
        return road

if __name__ == "__main__":
    migong = migong_game(gamma=0.8)
    # reset the environment
    migong.reset()
    # build the transition probability tensor P
    migong.prob()
    # run value iteration
    pi,v_iter,epsiode = migong.iterate_value()
    road = migong.get_trad()
    print("pi:")
    print(pi)
    print("最大迭代次数:")
    print(epsiode)
    print('Trajectory:')
    print(road)
