Robot Maze Path Planning Based on the Q-learning Algorithm (MATLAB Implementation)

Latest recommended article published on 2024-11-09 20:38:27

宇哥预测优化代码学习

Tags: algorithms, robotics, matlab

Article link: https://blog.csdn.net/w56782888/article/details/140779539

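💥💥💞💞 Welcome to this blog ❤️❤️💥💥

🏆 Blogger's strengths: 🌞🌞🌞 The posts on this blog aim for rigorous thinking and clear logic, for the reader's convenience.

⛳️ Motto: On a journey of a hundred li, ninety li is only halfway.

📋📋📋 The contents of this post are as follows: 🎁🎁🎁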

Contents

💥1 Overview

1. Introduction to the Q-learning algorithm

2. Application to maze path planning

3. Challenges and optimization

📚2 Results

🎉3 References

🌈4 MATLAB code and data

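💥1 Overview

Robot maze path planning with Q-learning is an interesting research topic. Q-learning is a reinforcement learning algorithm that learns an optimal action policy by balancing exploration and exploitation. In maze path planning, the robot must find a shortest path from the start to the goal in an unknown environment, and Q-learning can be used to achieve this.

First, build a model of the maze environment, including the start, the goal, and obstacles such as walls. Then train the robot with Q-learning so that it learns an optimal policy while repeatedly exploring the maze. During training the robot selects an action based on its current state and updates the Q-value using the reward it receives, eventually producing a converged Q-table.

The implementation involves a few key steps:

1. **State representation:** Treat each position in the maze as a state; the robot's position in the maze is its state.
2. **Action selection:** The robot chooses its next action from the current state and the Q-table. An ε-greedy policy is commonly used: explore by picking a random action with probability ε, and otherwise choose the action with the largest Q-value.
3. **Q-value update:** After the robot executes an action, update the Q-value from the new state it reached and the reward function.
4. **Training loop:** The robot keeps acting in the maze and updating Q-values until a convergence criterion is met.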
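
As a rough illustration of the action-selection step, the MATLAB sketch below picks an action ε-greedily for a single state. The variable names and the values of `epsilon` and `qValues` are assumptions made for this example; they do not come from the original .m files.

```matlab
% Minimal epsilon-greedy sketch (hypothetical names and values)
epsilon = 0.1;                   % assumed exploration probability
qValues = [0.2 0.7 0.7 -1.0];    % assumed Q-values of the 4 actions in one state

if rand < epsilon
    % explore: pick any of the 4 actions uniformly at random
    action = randi(4);
else
    % exploit: pick a best action, breaking ties uniformly at random
    bestActions = find(qValues == max(qValues));
    action = bestActions(randi(numel(bestActions)));
end
fprintf('selected action: %d\n', action);
```

Breaking ties at random matters early in training, when most Q-values are still equal and a fixed tie-break would always push the robot in the same direction.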

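Once the Q-values have converged, the robot has learned a policy for finding the optimal path through the maze. This policy can be used for actual maze navigation, letting the robot avoid obstacles and reach its destination quickly.

This post demonstrates the reinforcement learning (Q-learning) algorithm on a maze in which a robot must reach its destination by moving left, right, up, and down. At every step, the outcome of the robot's action is used to teach and re-teach it whether the action was a good one, and this repeats until the robot reaches the destination. The process then restarts, which verifies what has been learned and lets the unnecessary moves made on the first pass be forgotten. It is a good tutorial example for situations where learning must happen on the go, that is, without training examples. The approach can be used in games to learn and improve an AI's ability to compete against human players, and in several other settings.

On a small maze convergence is fast, while on a large maze it may take some time. The code can be modified to make Q-learning more efficient and so speed up convergence.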

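There are four .m files:
QLearning_Maze_Walk.m - demonstrates the Q-learning algorithm working on a selected maze
Random_Maze_Walk.m - demonstrates a random-selection walk, for comparison
Read_Maze.m - reads the supplied maze and converts it to a numeric representation for processing
Textscanu.m - reads the raw maze text file

Two maze files are included:
maze-9-9.txt
maze-61-21.txt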
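
To give a feel for what a numeric maze representation might look like, here is a small sketch. The text format of maze-9-9.txt is not shown in this post, so the character codes (`#`, `.`, `S`, `G`) and the variable `textRows` are purely hypothetical. Only the numeric conventions (0 for a wall, a positive value for free space, 100 for the goal) are taken from the code excerpt later in this post.

```matlab
% Hypothetical text maze: '#' = wall, '.' = free, 'S' = start, 'G' = goal.
% The real format of maze-9-9.txt is not shown in this post.
textRows = {'#####';
            '#S..#';
            '#.#.#';
            '#..G#';
            '#####'};
chars = char(textRows);                 % 5x5 character matrix

maze2D = zeros(size(chars));            % 0 = wall
maze2D(chars == '.') = 1;               % assumed code for a free cell (any value > 0)
maze2D(chars == 'S') = 1;               % the start cell is also free
maze2D(chars == 'G') = 100;             % 100 marks the goal

[row, col] = find(chars == 'S');        % robot start position
disp(maze2D);
```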

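Q-learning-based robot maze path planning applies reinforcement learning to autonomous navigation and path planning. It lets a robot learn an optimal path on its own in an unknown environment, without a detailed map or explicit guidance supplied by a human in advance. The basics and key points of this topic are outlined below:

1. Introduction to the Q-learning algorithm

Q-learning is a model-free, off-policy reinforcement learning algorithm used to find an optimal policy in sequential decision problems. It works by learning an action-value function known as the Q-table (or Q-function), which records the expected return of taking a given action in a given state. By continually exploring and exploiting what it already knows, Q-learning gradually refines this table and guides the agent (here, the robot) toward optimal decisions.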
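
For a grid maze, the Q-table can be stored as a 3-D array indexed by row, column, and action, which is how the code excerpt later in this post allocates `Q`. The sketch below uses a hypothetical 3x3 grid with made-up values to show how the greedy action for one state is read from such a table. The action numbering (1 = up, 2 = left, 3 = right, 4 = down) follows the excerpt.

```matlab
% Hypothetical 3x3 grid with 4 actions per cell (1=up, 2=left, 3=right, 4=down)
Q = zeros(3, 3, 4);          % Q(row, col, action), all estimates start at 0
Q(2, 2, 3) = 0.5;            % assumed value: moving right from cell (2,2) looks good
Q(2, 2, 1) = -1;             % assumed value: moving up from cell (2,2) hits a wall

% Look up the greedy action for the state (2,2)
[bestValue, bestAction] = max(squeeze(Q(2, 2, :)));
fprintf('best action = %d, value = %.2f\n', bestAction, bestValue);
```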

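2. Application to maze path planning

Environment modeling: First, model the maze as a finite state space in which each state is a position of the robot in the maze and each action is a move the robot can make (up, down, left, right). The reward scheme gives a positive reward when the robot approaches the exit and a negative or zero reward when it hits a wall or revisits a position.
Q-table initialization: Initialize a Q-value for every state-action pair, usually 0 or a small random number, reflecting that nothing is yet known about the value of any action.
Exploration and exploitation: An ε-greedy policy balances exploration (trying different paths to discover possibly better solutions) and exploitation (choosing the best action according to the current Q-table): with probability ε choose a random action, and with probability 1-ε choose the action the current Q-table rates highest.
Q-value update: Each time the robot executes an action and reaches a new state, update the Q-table with the following formula:
$$Q(s,a) \leftarrow Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right]$$
where $s$ is the current state, $a$ the action taken, $r$ the reward received, $s'$ the new state, $\alpha$ the learning rate, and $\gamma$ the discount factor.
Goal: After many iterations the Q-table gradually converges, and the robot learns to choose, in each state, the sequence of actions that leads it to the goal as quickly as possible, giving optimal path planning from start to goal.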
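
The steps above can be sketched end to end on a tiny example. The 4x4 maze layout, `epsilon`, `numEpisodes`, and all variable names below are assumptions made for illustration. The reward values (+1 at the goal, -1 for a bump, 0 otherwise) and `alpha = 0.8`, `gamma = 0.5` follow the code excerpt later in this post, which rewards only reaching the goal rather than approaching it. This is a simplified sketch, not the original QLearning_Maze_Walk.m.

```matlab
% Tiny tabular Q-learning sketch on a hypothetical 4x4 maze (0 = wall, 1 = free)
maze  = [1 1 1 1;
         1 0 0 1;
         1 1 0 1;
         0 1 1 1];
start = [1 1];                    % assumed start cell (row, col)
goal  = [4 4];                    % assumed goal cell (row, col)
moves = [-1 0; 0 -1; 0 1; 1 0];   % 1 = up, 2 = left, 3 = right, 4 = down

alpha = 0.8; gamma = 0.5;         % learning rate and discount factor
epsilon = 0.1;                    % assumed exploration probability
numEpisodes = 200;                % assumed training budget
Q = zeros(size(maze,1), size(maze,2), 4);

for episode = 1:numEpisodes
    pos = start;
    while ~isequal(pos, goal)
        q = squeeze(Q(pos(1), pos(2), :));
        if rand < epsilon
            action = randi(4);                    % explore
        else
            best = find(q == max(q));             % exploit, random tie-break
            action = best(randi(numel(best)));
        end
        next = pos + moves(action, :);
        % leaving the grid or entering a wall counts as a bump: stay in place
        inside = all(next >= 1) && all(next <= size(maze));
        if ~inside || maze(next(1), next(2)) == 0
            next = pos; reward = -1;
        elseif isequal(next, goal)
            reward = 1;
        else
            reward = 0;
        end
        % Q(s,a) <- Q(s,a) + alpha*(r + gamma*max_a' Q(s',a') - Q(s,a))
        target = reward + gamma * max(Q(next(1), next(2), :));
        Q(pos(1), pos(2), action) = Q(pos(1), pos(2), action) ...
            + alpha * (target - Q(pos(1), pos(2), action));
        pos = next;
    end
end

% Follow the greedy policy from the start to read out the learned path
path = start; pos = start;
while ~isequal(pos, goal) && size(path,1) <= numel(maze)
    [~, action] = max(squeeze(Q(pos(1), pos(2), :)));
    pos = pos + moves(action, :);
    path(end+1, :) = pos; %#ok<AGROW>
end
disp(path);
```

With a discount factor below 1 and a reward only at the goal, the learned Q-values shrink with distance from the goal, so following the largest Q-value from the start leads toward the goal.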

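3. Challenges and optimization

Large state space: For complex mazes the state space can be very large, which calls for efficient Q-table representation and storage and possibly function approximation (for example, a deep Q-network, DQN).
Avoiding local optima: Tuning the learning parameters and the exploration strategy, for example adjusting ε dynamically or using other exploration mechanisms, improves the chance of finding the global optimum.
Real-time performance and efficiency: Practical applications must account for limited computing resources and real-time requirements, so the algorithm should be optimized to reduce computation and speed up convergence.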
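
One common way to adjust ε dynamically is to let it decay over the episodes, so the robot explores heavily at first and relies more on the Q-table later. The schedule below is only a sketch; the constants `epsilonStart`, `epsilonMin`, and `decayRate` are assumed values, not settings from the original code.

```matlab
% Exponentially decaying exploration rate (all constants are assumed values)
epsilonStart = 1.0;     % explore almost always at the beginning
epsilonMin   = 0.05;    % keep a small amount of exploration at the end
decayRate    = 0.97;    % multiplicative decay applied once per episode
numEpisodes  = 100;

epsilon = zeros(numEpisodes, 1);
for episode = 1:numEpisodes
    epsilon(episode) = max(epsilonMin, epsilonStart * decayRate^(episode - 1));
end
fprintf('epsilon: first = %.2f, last = %.2f\n', epsilon(1), epsilon(end));
```

Inside a training loop, `epsilon(episode)` would replace a fixed exploration probability in the ε-greedy choice.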

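Research on Q-learning-based robot maze path planning not only helps in understanding how reinforcement learning applies to decision-making in complex environments, it also provides a useful theoretical and technical foundation for autonomous navigation, robotics, and the wider field of artificial intelligence.

📚2 Results

Partial code: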

% maze2D and tempMaze2D are shared with the helper functions, which access
% them through "global", so they must be declared global here as well
global maze2D tempMaze2D;

DISPLAY_FLAG = 1; % 1 means display maze and 0 means no display
NUM_ITERATIONS = 100; % change this value to set max iterations
% initialize the robot orientation
currentDirection = 1; % robot is facing up

% row and col are initialized with the position of the starting point of
% the robot when the maze is read below
fileName = 'maze-9-9.txt';
[maze2D,row,col] = Read_Maze(fileName);
imagesc(maze2D) % show the maze

% make some copies of maze to use later for display
orgMaze2D = maze2D;
orgMaze2D(row,col) = 50;
[goalX,goalY] = find(orgMaze2D == 100);
tempMaze2D = orgMaze2D;

% record robot's starting location for use later
startX = row;
startY = col;

% build a state-action matrix with one entry per maze cell and action
% we have four actions for each state.
Q = zeros(size(maze2D,1),size(maze2D,2),4);

% only used for priority visiting for larger maze
%visitFlag = zeros(size(maze2D,1),size(maze2D,2));

% status codes for goal and bump
GOAL = 3;
BUMP = 2;

% learning settings
alpha = 0.8; % learning rate
gamma = 0.5; % discount factor

% number of actions needed in each iteration
iterationCount = zeros(NUM_ITERATIONS,1);

for i=1:NUM_ITERATIONS
    tempMaze2D(goalX,goalY) = 100;
    row = startX; col = startY;
    status = -1;
    countActions = 0;
    currentDirection = 1;

    % only used for priority visiting for larger maze
    % visitFlag = zeros(size(maze2D,1),size(maze2D,2));
    % visitFlag(row,col) = 1;

    while status ~= GOAL
        % record the current position of the robot for use later
        prvRow = row; prvCol = col;

        % select the action (i.e. direction) with the maximum Q-value;
        % if several actions share that value, pick one of them at random
        qValues = squeeze(Q(row,col,:));
        bestActions = find(qValues == max(qValues));
        action = bestActions(randi(numel(bestActions)));

        % based on the selected action correct the orientation of the
        % robot to conform to rules of simulator
        while currentDirection ~= action
            currentDirection = TurnLeft(currentDirection);
            % count the actions required to reach the goal
            countActions = countActions + 1;
        end

        % do the selected action i.e. MoveAhead
        [row,col,status] = MoveAhead(row,col,currentDirection);

        % count the actions required to reach the goal
        countActions = countActions + 1;

        % get the reward: +1 on reaching the goal, -1 on bumping into a
        % wall, 0 otherwise
        if status == BUMP
            rewardVal = -1;
        elseif status == GOAL
            rewardVal = 1;
        else
            rewardVal = 0;
        end

        % enable this piece of code if testing larger maze
        % if visitFlag(row,col) == 0
        %     rewardVal = rewardVal + 0.2;
        %     visitFlag(row,col) = 1;
        % else
        %     rewardVal = rewardVal - 0.2;
        % end

        % update information for robot in Q for later use
        Q(prvRow,prvCol,action) = Q(prvRow,prvCol,action) + alpha*(rewardVal+gamma*max(Q(row,col,:)) - Q(prvRow,prvCol,action));

        % display the maze after some steps
        if rem(countActions,1) == 0 && DISPLAY_FLAG == 1
            X = [row col];
            Y = [goalX goalY];
            dist = norm(X-Y,1);
            s = sprintf('Manhattan Distance = %f',dist);
            imagesc(tempMaze2D);%,colorbar;
            title(s);
            drawnow
        end
    end

    iterationCount(i,1) = countActions;

    % display the final maze
    imagesc(tempMaze2D);%,colorbar;
    disp(countActions);
    %bar(iterationCount);
    drawnow
end

figure,bar(iterationCount)
disp('----- Mean Result -----')
meanA = mean(iterationCount);
disp(meanA);
%save Q_Learn_9-9.mat;

%-------------------------------%
%     1
%   2   3
%     4
% Current Direction
% 1 - means robot facing up
% 2 - means robot facing left
% 3 - means robot facing right
% 4 - means robot facing down
%------------------------------%
% based on the current direction and convention rotate the robot left
function currentDirection = TurnLeft(currentDirection)
    if currentDirection == 1
        currentDirection = 2;
    elseif currentDirection == 2
        currentDirection = 4;
    elseif currentDirection == 4
        currentDirection = 3;
    elseif currentDirection == 3
        currentDirection = 1;
    end
end

% based on the current direction and convention rotate the robot right
function currentDirection = TurnRight(currentDirection)
    if currentDirection == 1
        currentDirection = 3;
    elseif currentDirection == 3
        currentDirection = 4;
    elseif currentDirection == 4
        currentDirection = 2;
    elseif currentDirection == 2
        currentDirection = 1;
    end
end

% return the information just in front of the robot (local)
function [val,valid] = LookAhead(row,col,currentDirection)
    global maze2D;
    val = 0;   % default so that val is always assigned, even at the boundary
    valid = 0;
    if currentDirection == 1
        if row-1 >= 1 && row-1 <= size(maze2D,1)
            val = maze2D(row-1,col);
            valid = 1;
        end
    elseif currentDirection == 2
        if col-1 >= 1 && col-1 <= size(maze2D,2)
            val = maze2D(row,col-1);
            valid = 1;
        end
    elseif currentDirection == 3
        if col+1 >= 1 && col+1 <= size(maze2D,2)
            val = maze2D(row,col+1);
            valid = 1;
        end
    elseif currentDirection == 4
        if row+1 >= 1 && row+1 <= size(maze2D,1)
            val = maze2D(row+1,col);
            valid = 1;
        end
    end
end

% status = 1 then move ahead successful
% status = 2 then bump into wall or boundary
% status = 3 then goal achieved
% Move the robot to the next location if no bump
function [row,col,status] = MoveAhead(row,col,currentDirection)
    global tempMaze2D;

    % based on the current direction check whether next location is space or
    % bump and get information of use below
    [val,valid] = LookAhead(row,col,currentDirection);
    % check if next location for moving is space
    % otherwise set the status
    % this checks the collision with boundary of maze
    if valid == 1
        % now check if the next location is space or bump
        % this is for walls inside the maze
        if val > 0
            oldRow = row; oldCol = col;
            if currentDirection == 1
                row = row - 1;
            elseif currentDirection == 2
                col = col - 1;
            elseif currentDirection == 3
                col = col + 1;
            elseif currentDirection == 4
                row = row + 1;
            end
            status = 1;

            if val == 100
                % goal achieved
                status = 3;
                disp(status);
            end

            % update the current position of the robot in maze for display
            tempMaze2D(oldRow,oldCol) = 50;
            tempMaze2D(row,col) = 60;
        elseif val == 0
            % bump into wall
            % NOTE: the source listing is cut off at this point; the rest of
            % this function is an assumed completion based on the status
            % codes documented above
            status = 2;
        end
    else
        % bump into the boundary of the maze
        status = 2;
    end
end

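🎉3 References

Some of the content in this post is drawn from the web; sources are noted or cited as references. Omissions are hard to avoid, so if anything is inappropriate, please get in touch at any time to have it removed.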

[1]王子强,武继刚.基于RDC-Q学习算法的移动机器人路径规划[J].计算机工程, 2014, 40(006):211-214.DOI:10.3969/j.issn.1000-3428.2014.06.045.

[2]张燕,王志祥,董美琪,等.基于改进Q-learning算法的移动机器人路径规划方法:CN202310368455.8[P].CN116380102A[2024-04-15].

[3]刘志荣,姜树海,袁雯雯,等.基于深度Q学习的移动机器人路径规划[J].测控技术, 2019, 38(7):5.DOI:10.19708/j.ckjs.2018.00.002.

[4]段建民,陈强龙.利用先验知识的Q-Learning路径规划算法研究[J].电光与控制, 2019, v.26;No.255(09):33-37.DOI:10.3969/j.issn.1671-637X.2019.09.007.