[Reinforcement Learning] Markov Decision Process (MDP): finding the optimal path through a maze with traps

This article applies a Markov decision process (MDP) to a maze containing traps. Besides the start and goal cells, the maze contains traps; touching a trap or a wall ends the episode immediately. Each step consumes energy, and the objective is to find the shortest optimal path. The code section shows how to apply an MDP algorithm to this problem. However, MDPs are difficult to apply to mazes with infinite state spaces or dynamically changing layouts.

Trap maze:

An MDP is used to solve a maze with traps. Besides the start and the goal, the maze contains traps; touching a trap cell or the outer wall ends the episode immediately, and the agent cannot continue. Walking through the maze also consumes energy, so we want an MDP algorithm that finds the optimal (shortest) path.
In addition, the maze interferes with your nerves: when you decide to move in one direction, there is a small chance you actually go a different way.
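The slip dynamics above give the Bellman expectation equation that the code below evaluates; a sketch, where f(s, a') denotes the cell reached by moving from s in direction a' (staying put at a boundary), and the intended action is taken with probability 0.85 while each of the other three directions occurs with probability 0.05 (the p1/p2 values used in the code):

```latex
V^{\pi}(s) = \sum_{a'} p\big(a' \mid \pi(s)\big)\,\Big[\, R\big(f(s,a')\big) + \gamma\, V^{\pi}\big(f(s,a')\big) \,\Big],
\qquad
p(a' \mid a) = \begin{cases} 0.85 & a' = a \\ 0.05 & a' \neq a \end{cases}
```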

MAZE:
[[-100,-100, -100, -100,-100, -100 ],
[-100,-1, -1, -1, -1, -100 ],
[-100,-1, -100, -100, -100, -100 ],
[-100,-1, -100, -1, -1, -100 ],
[-100,-1, -1, -1, 1, -100 ],
[-100,-100, -100, -100,-100, -100]]
Walls and traps have reward R = -100, the goal has R = 1, and every other cell has R = -1: each step consumes 1 unit of energy.
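For concreteness, the reward grid above can be built as a NumPy array (a sketch; the name `MAZE` follows the listing above):

```python
import numpy as np

# Reward grid from the article: -100 for walls/traps, -1 for open
# cells (each step costs 1 energy), +1 for the goal.
MAZE = np.array([
    [-100, -100, -100, -100, -100, -100],
    [-100,   -1,   -1,   -1,   -1, -100],
    [-100,   -1, -100, -100, -100, -100],
    [-100,   -1, -100,   -1,   -1, -100],
    [-100,   -1,   -1,   -1,    1, -100],
    [-100, -100, -100, -100, -100, -100],
])

goal = np.argwhere(MAZE > 0)  # the single cell with positive reward
print(MAZE.shape, goal)
```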

Goal:
Find the optimal path to the destination
[[0. 1. 1. 1. 1. 0.]
[3. 1. 2. 2. 2. 2.]
[3. 1. 2. 1. 1. 0.]
[3. 1. 1. 1. 1. 2.]
[3. 3. 3. 3. 1. 2.]
[0. 0. 0. 0. 0. 0.]]

0: go up
1: go down
2: go left
3: go right
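The action ids in the policy table above can be rendered as arrows for readability; a small sketch using NumPy fancy indexing (the variable names `policy` and `arrows` are mine):

```python
import numpy as np

# Policy table from the article: one action id per cell
# (0: up, 1: down, 2: left, 3: right).
policy = np.array([
    [0., 1., 1., 1., 1., 0.],
    [3., 1., 2., 2., 2., 2.],
    [3., 1., 2., 1., 1., 0.],
    [3., 1., 1., 1., 1., 2.],
    [3., 3., 3., 3., 1., 2.],
    [0., 0., 0., 0., 0., 0.],
])

# Index a lookup table of arrow glyphs with the integer action ids.
arrows = np.array(['^', 'v', '<', '>'])
print('\n'.join(' '.join(row) for row in arrows[policy.astype(int)]))
```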

See the code implementation below.

Limits :

  1. An MDP over a finite state space is easy to solve, but an infinite-space maze is hard: there are infinitely many states.
  2. For dynamic mazes, or multi-task mazes (e.g. ones that must pass through a required waypoint), an MDP algorithm is not easy to implement.

Code implementation

import numpy as np

class maze_mdp:
    def __init__(self, maze, gamma):
        self.maze = maze
        self.row = maze.shape[0]
        self.col = maze.shape[1]
        self.r_s = np.array([x for y in maze for x in y])  # flattened rewards
        self.end = np.where(self.r_s > 0)[0][0]            # goal state index
        self.trap = np.where(self.r_s < -1)[0]             # wall/trap indices
        self.v_s = np.zeros(len(self.r_s))                 # state values
        self.maze_number = np.arange(0, len(self.r_s)).reshape(maze.shape)
        # Slip model: the intended direction (0 up, 1 down, 2 left, 3 right)
        # is taken with probability 0.85; each other direction with 0.05.
        self.p1 = 0.85
        self.p2 = 0.05
        self.action = np.zeros(len(self.r_s))              # current policy
        self.gamma = gamma

    def go_up(self, x):
        # Stay in place when already in the top row.
        if 0 <= x - self.col:
            return x - self.col
        else:
            return x

    def go_left(self, x):
        # Stay in place when already in the leftmost column.
        if x in self.maze_number[:, 0]:
            return x
        else:
            return x - 1

    def go_right(self, x):
        # Stay in place when already in the rightmost column.
        if x in self.maze_number[:, -1]:
            return x
        else:
            return x + 1

    def go_down(self, x):
        # Stay in place when already in the bottom row.
        if x + self.col < self.row * self.col:
            return x + self.col
        else:
            return x

    def go(self,x):
        return self.go_up(x),self.go_down(x),self.go_left(x),self.go_right(x)

    def update_value_pai(self):
        # One sweep of policy evaluation under the current policy:
        # V(s) <- sum_{a'} p(a' | pi(s)) * (R(s') + gamma * V_old(s'))
        old_v_s = self.v_s.copy()
        for i in range(len(self.r_s)):
            if i == self.end or i in self.trap:
                self.v_s[i] = 0
                continue
            up, down, left, right = self.go(i)
            # Intended move gets probability p1; each slip direction gets p2.
            moves = {0: (up, (down, left, right)),
                     1: (down, (up, left, right)),
                     2: (left, (up, down, right)),
                     3: (right, (up, down, left))}
            target, slips = moves[int(self.action[i])]
            self.v_s[i] = self.p1 * (self.r_s[target] + self.gamma * old_v_s[target])
            for s in slips:
                self.v_s[i] += self.p2 * (self.r_s[s] + self.gamma * old_v_s[s])
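The class above only shows the policy-evaluation sweep. A self-contained sketch of the full solution loop, here written as value iteration with the same 0.85/0.05 slip model (the helper names `step` and `value_iteration` are mine, not from the article), might look like:

```python
import numpy as np

# Reward grid from the article: -100 walls/traps, -1 open cells, +1 goal.
MAZE = np.array([
    [-100, -100, -100, -100, -100, -100],
    [-100,   -1,   -1,   -1,   -1, -100],
    [-100,   -1, -100, -100, -100, -100],
    [-100,   -1, -100,   -1,   -1, -100],
    [-100,   -1,   -1,   -1,    1, -100],
    [-100, -100, -100, -100, -100, -100],
])
ROWS, COLS = MAZE.shape
R = MAZE.ravel()
GOAL = int(np.where(R > 0)[0][0])
TRAPS = set(np.where(R < -1)[0])

def step(s, a):
    """Deterministic move from flat state s in direction a
    (0 up, 1 down, 2 left, 3 right); stay put at the grid boundary."""
    r, c = divmod(s, COLS)
    dr, dc = [(-1, 0), (1, 0), (0, -1), (0, 1)][a]
    nr, nc = r + dr, c + dc
    if 0 <= nr < ROWS and 0 <= nc < COLS:
        return nr * COLS + nc
    return s

def value_iteration(gamma=0.9, sweeps=200):
    V = np.zeros(R.size)
    policy = np.zeros(R.size, dtype=int)
    for _ in range(sweeps):
        Q = np.zeros((R.size, 4))
        for s in range(R.size):
            if s == GOAL or s in TRAPS:
                continue  # terminal states keep value 0
            for a in range(4):        # intended action
                for a2 in range(4):   # actual action after a possible slip
                    p = 0.85 if a2 == a else 0.05
                    s2 = step(s, a2)
                    Q[s, a] += p * (R[s2] + gamma * V[s2])
        V = Q.max(axis=1)
        policy = Q.argmax(axis=1)
    return V.reshape(MAZE.shape), policy.reshape(MAZE.shape)

values, policy = value_iteration()
print(policy)
```

Terminal cells have all-zero action values, so their argmax defaults to 0; as in the article's table, only the actions of reachable open cells are meaningful.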