[Reinforcement Learning] Markov Decision Process (MDP): finding the optimal path through a maze with traps

This article applies a Markov decision process (MDP) to a maze containing traps. Besides the start and goal cells, the maze contains traps; touching a trap or a wall ends the episode immediately. Each step consumes energy, and the objective is to find the shortest optimal path. The code section shows how to apply an MDP algorithm to this problem. However, MDPs are difficult to apply to mazes with infinite state spaces or dynamically changing layouts.

Trap maze:

An MDP is used to solve a maze with traps. Besides the start and the goal, the maze contains traps; touching a trap cell or the outer wall ends the episode immediately, and the agent cannot continue. Walking through the maze also consumes energy, so we want an MDP algorithm that finds the optimal (shortest) path.
In addition, the maze interferes with your nerves: when you decide to move in one direction, there is a small chance you actually go a different way.
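The slip dynamics above give the Bellman expectation equation that the code below evaluates; a sketch, where f(s, a') denotes the cell reached by moving from s in direction a' (staying put at a boundary), and the intended action is taken with probability 0.85 while each of the other three directions occurs with probability 0.05 (the p1/p2 values used in the code):

```latex
V^{\pi}(s) = \sum_{a'} p\big(a' \mid \pi(s)\big)\,\Big[\, R\big(f(s,a')\big) + \gamma\, V^{\pi}\big(f(s,a')\big) \,\Big],
\qquad
p(a' \mid a) = \begin{cases} 0.85 & a' = a \\ 0.05 & a' \neq a \end{cases}
```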

MAZE:
[[-100,-100, -100, -100,-100, -100 ],
[-100,-1, -1, -1, -1, -100 ],
[-100,-1, -100, -100, -100, -100 ],
[-100,-1, -100, -1, -1, -100 ],
[-100,-1, -1, -1, 1, -100 ],
[-100,-100, -100, -100,-100, -100]]
Walls and traps have reward R = -100, the goal has R = 1, and every other cell has R = -1: each step consumes 1 unit of energy.
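For concreteness, the reward grid above can be built as a NumPy array (a sketch; the name `MAZE` follows the listing above):

```python
import numpy as np

# Reward grid from the article: -100 for walls/traps, -1 for open
# cells (each step costs 1 energy), +1 for the goal.
MAZE = np.array([
    [-100, -100, -100, -100, -100, -100],
    [-100,   -1,   -1,   -1,   -1, -100],
    [-100,   -1, -100, -100, -100, -100],
    [-100,   -1, -100,   -1,   -1, -100],
    [-100,   -1,   -1,   -1,    1, -100],
    [-100, -100, -100, -100, -100, -100],
])

goal = np.argwhere(MAZE > 0)  # the single cell with positive reward
print(MAZE.shape, goal)
```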

Goal:
Find the optimal path to the destination
[[0. 1. 1. 1. 1. 0.]
[3. 1. 2. 2. 2. 2.]
[3. 1. 2. 1. 1. 0.]
[3. 1. 1. 1. 1. 2.]
[3. 3. 3. 3. 1. 2.]
[0. 0. 0. 0. 0. 0.]]

0: go up
1: go down
2: go left
3: go right
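The action ids in the policy table above can be rendered as arrows for readability; a small sketch using NumPy fancy indexing (the variable names `policy` and `arrows` are mine):

```python
import numpy as np

# Policy table from the article: one action id per cell
# (0: up, 1: down, 2: left, 3: right).
policy = np.array([
    [0., 1., 1., 1., 1., 0.],
    [3., 1., 2., 2., 2., 2.],
    [3., 1., 2., 1., 1., 0.],
    [3., 1., 1., 1., 1., 2.],
    [3., 3., 3., 3., 1., 2.],
    [0., 0., 0., 0., 0., 0.],
])

# Index a lookup table of arrow glyphs with the integer action ids.
arrows = np.array(['^', 'v', '<', '>'])
print('\n'.join(' '.join(row) for row in arrows[policy.astype(int)]))
```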

See the code implementation below.

Limits :

  1. An MDP over a finite state space is easy to solve, but an infinite-space maze is hard: there are infinitely many states.
  2. For dynamic mazes, or multi-task mazes (e.g. ones that must pass through a required waypoint), an MDP algorithm is not easy to implement.

Code implementation

import numpy as np

class maze_mdp:
    def __init__(self, maze, gamma):
        self.maze = maze
        self.row = maze.shape[0]
        self.col = maze.shape[1]
        self.r_s = np.array([x for y in maze for x in y])  # flattened rewards
        self.end = np.where(self.r_s > 0)[0][0]            # goal state index
        self.trap = np.where(self.r_s < -1)[0]             # wall/trap indices
        self.v_s = np.zeros(len(self.r_s))                 # state values
        self.maze_number = np.arange(0, len(self.r_s)).reshape(maze.shape)
        # Slip model: the intended direction (0 up, 1 down, 2 left, 3 right)
        # is taken with probability 0.85; each other direction with 0.05.
        self.p1 = 0.85
        self.p2 = 0.05
        self.action = np.zeros(len(self.r_s))              # current policy
        self.gamma = gamma

    def go_up(self, x):
        # Stay in place when already in the top row.
        if 0 <= x - self.col:
            return x - self.col
        else:
            return x

    def go_left(self, x):
        # Stay in place when already in the leftmost column.
        if x in self.maze_number[:, 0]:
            return x
        else:
            return x - 1

    def go_right(self, x):
        # Stay in place when already in the rightmost column.
        if x in self.maze_number[:, -1]:
            return x
        else:
            return x + 1

    def go_down(self, x):
        # Stay in place when already in the bottom row.
        if x + self.col < self.row * self.col:
            return x + self.col
        else:
            return x

    def go(self,x):
        return self.go_up(x),self.go_down(x),self.go_left(x),self.go_right(x)

    def update_value_pai(self):
        # One sweep of policy evaluation under the current policy:
        # V(s) <- sum_{a'} p(a' | pi(s)) * (R(s') + gamma * V_old(s'))
        old_v_s = self.v_s.copy()
        for i in range(len(self.r_s)):
            if i == self.end or i in self.trap:
                self.v_s[i] = 0
                continue
            up, down, left, right = self.go(i)
            # Intended move gets probability p1; each slip direction gets p2.
            moves = {0: (up, (down, left, right)),
                     1: (down, (up, left, right)),
                     2: (left, (up, down, right)),
                     3: (right, (up, down, left))}
            target, slips = moves[int(self.action[i])]
            self.v_s[i] = self.p1 * (self.r_s[target] + self.gamma * old_v_s[target])
            for s in slips:
                self.v_s[i] += self.p2 * (self.r_s[s] + self.gamma * old_v_s[s])
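The class above only shows the policy-evaluation sweep. A self-contained sketch of the full solution loop, here written as value iteration with the same 0.85/0.05 slip model (the helper names `step` and `value_iteration` are mine, not from the article), might look like:

```python
import numpy as np

# Reward grid from the article: -100 walls/traps, -1 open cells, +1 goal.
MAZE = np.array([
    [-100, -100, -100, -100, -100, -100],
    [-100,   -1,   -1,   -1,   -1, -100],
    [-100,   -1, -100, -100, -100, -100],
    [-100,   -1, -100,   -1,   -1, -100],
    [-100,   -1,   -1,   -1,    1, -100],
    [-100, -100, -100, -100, -100, -100],
])
ROWS, COLS = MAZE.shape
R = MAZE.ravel()
GOAL = int(np.where(R > 0)[0][0])
TRAPS = set(np.where(R < -1)[0])

def step(s, a):
    """Deterministic move from flat state s in direction a
    (0 up, 1 down, 2 left, 3 right); stay put at the grid boundary."""
    r, c = divmod(s, COLS)
    dr, dc = [(-1, 0), (1, 0), (0, -1), (0, 1)][a]
    nr, nc = r + dr, c + dc
    if 0 <= nr < ROWS and 0 <= nc < COLS:
        return nr * COLS + nc
    return s

def value_iteration(gamma=0.9, sweeps=200):
    V = np.zeros(R.size)
    policy = np.zeros(R.size, dtype=int)
    for _ in range(sweeps):
        Q = np.zeros((R.size, 4))
        for s in range(R.size):
            if s == GOAL or s in TRAPS:
                continue  # terminal states keep value 0
            for a in range(4):        # intended action
                for a2 in range(4):   # actual action after a possible slip
                    p = 0.85 if a2 == a else 0.05
                    s2 = step(s, a2)
                    Q[s, a] += p * (R[s2] + gamma * V[s2])
        V = Q.max(axis=1)
        policy = Q.argmax(axis=1)
    return V.reshape(MAZE.shape), policy.reshape(MAZE.shape)

values, policy = value_iteration()
print(policy)
```

Terminal cells have all-zero action values, so their argmax defaults to 0; as in the article's table, only the actions of reachable open cells are meaningful.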