Dynamic Path Planning of Unknown Environment Based on Deep Reinforcement Learning


The Pygame module in Python is used to establish dynamic environments.
Taking lidar signals and the local target position as inputs, convolutional neural networks (CNNs) are used to generalize the environment state, while Q-learning strengthens the agent's ability to avoid dynamic obstacles and plan locally. The results show that, after training in different dynamic environments and testing in new ones, the agent can successfully reach the local target position in unknown dynamic environments.


Previous work
In [11, 12], deep reinforcement learning has been applied to autonomous navigation based on the inputs of visual information, which has achieved remarkable success. In [11], however, the environment is only a static maze with varying start and target positions; there are no dynamic obstacles.


Our work
A DDQN is trained using laser information.
In terms of recent deep reinforcement learning models, the original training mode results in a large number of samples in the pool that are merely moving states in the free zone, and the lack of trial-and-error punishment samples and target reward samples ultimately leads to non-convergence of the algorithm.
In other words, if you plug other people's approach in directly, the lack of failure and success samples means the agent just wanders around and the algorithm never converges.
So, we constrain the starting position and target position by randomly setting the target position in an area not occupied by obstacles, which expands the state-space distribution of the sample pool.
My guess is that this means placing the target point in a non-obstacle region within a certain range of the start point, so that the sampled distribution is better.
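A minimal sketch of how I read this constraint (my own interpretation, not the paper's code): rejection-sample the target from the free cells that lie within some radius of the start, so that positive-reward samples actually appear in the pool.

```python
import random

def sample_target(start, free_cells, radius):
    """Pick a target uniformly from the free cells within `radius` of `start`.

    start: (x, y) of the starting position
    free_cells: iterable of (x, y) cells not occupied by obstacles
    radius: maximum allowed start-to-target distance (a hypothetical knob)
    """
    candidates = [c for c in free_cells
                  if (c[0] - start[0]) ** 2 + (c[1] - start[1]) ** 2 <= radius ** 2]
    return random.choice(candidates)  # assumes at least one free cell is in range
```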
To evaluate our algorithm, we use TensorFlow to build the DDQN simulation training framework and demonstrate the method in the real world. In simulation, the agent is trained in low-level and medium-level dynamic environments. Finally, real-world experiments are carried out.

For the details of "Deep Reinforcement Learning with DDQN Algorithm", see the paper.

Local Path Planning with DDQN Algorithm

The sensor range is regarded as the observation window.
[Figure: the sensor's observation window]
The accessible point nearest to the global path at the edge of the observation window is considered as the local target point of the local path planning.
The network receives lidar dot matrix information and local target point coordinates as the inputs and outputs the direction of movement.
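A minimal sketch of how such a local target could be selected (my own interpretation, assuming a 2 m circular window, a known global path, and an `occupied` predicate; none of this is from the paper's code):

```python
import numpy as np

def pick_local_target(robot_pos, global_path, occupied, radius=2.0, n_angles=360):
    """Return the free point on the window edge that is closest to the global path.

    robot_pos: (x, y) of the robot
    global_path: sequence of (x, y) waypoints of the global plan
    occupied: callable (x, y) -> True if that point is blocked (incl. inflation)
    """
    best, best_dist = None, float("inf")
    for k in range(n_angles):
        theta = 2.0 * np.pi * k / n_angles
        edge = (robot_pos[0] + radius * np.cos(theta),
                robot_pos[1] + radius * np.sin(theta))
        if occupied(*edge):          # skip edge points inside obstacles
            continue
        d = min(np.hypot(edge[0] - px, edge[1] - py) for px, py in global_path)
        if d < best_dist:            # keep the edge point nearest to the global path
            best, best_dist = edge, d
    return best                      # local target handed to the DDQN planner
```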
As expected: this is really just a local planner — it takes something like a (square) costmap as input and outputs the current cmd_msg.
Normally a local planner works on the costmap rather than the raw laser scan: https://blog.csdn.net/weixin_42048023/article/details/83987653
But the paper says it takes the lidar information and the local target (the goal point of the local map) as input and outputs the movement direction; if the lidar it receives is restricted to the costmap region, then the input/output is essentially the same as that of a traditional costmap-based planner.

We set the angle resolution to be 1 degree and the range limit to be 2 meters, so each observation consists of 360 points indicating the distance to obstacles within a two-meter circle around the robot.
In other words, these are the resolution settings; the costmap covers just 2 m, and so on.

The input is then the 360-degree laser information (angle and distance for each beam) plus 40 copies of the target point, an 800-dimensional vector in total; the output is a one-hot vector over 8 movement directions: forward, backward, left, right, and the four diagonals.
Let's look at how the reward function is designed.
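A minimal sketch of how I read that 800-dimensional input (my interpretation of the dimensions, not the paper's code): 360 (angle, distance) pairs give 720 values, and 40 copies of the 2-D local target give the remaining 80.

```python
import numpy as np

def build_state(laser_ranges, local_target):
    """laser_ranges: 360 distances (m), one per degree, clipped to 2 m.
    local_target: (x, y) of the local target in the robot frame."""
    angles = np.deg2rad(np.arange(360))                      # 1-degree resolution
    ranges = np.clip(np.asarray(laser_ranges), 0.0, 2.0)     # 2 m range limit
    laser_part = np.stack([angles, ranges], axis=1).ravel()  # 720 values
    target_part = np.tile(np.asarray(local_target, dtype=float), 40)  # 80 values
    state = np.concatenate([laser_part, target_part])        # 800-dim input vector
    assert state.shape == (800,)
    return state

# The network output is a one-hot choice over 8 moves:
ACTIONS = ["up", "down", "left", "right",
           "up-left", "up-right", "down-left", "down-right"]
```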
[Figure: definition of the reward function]
Here p is the position after the move, g is the local target point, and o denotes the obstacles, including their inflation.
That is, the reward is +1 if the target point is reached, -1 if an obstacle is hit, and -0.01 otherwise.
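A minimal sketch of that reward (the +1 / -1 / -0.01 values are from the notes; the distance thresholds `goal_tol` and `collision_tol` are my own assumptions, since the exact conditions are in the paper's figure):

```python
import numpy as np

def reward(p, g, obstacles, goal_tol=0.1, collision_tol=0.1):
    """p: position after the move, g: local target point,
    obstacles: iterable of (x, y) obstacle points, inflation already included."""
    if np.hypot(p[0] - g[0], p[1] - g[1]) < goal_tol:
        return 1.0                       # reached the local target
    if any(np.hypot(p[0] - ox, p[1] - oy) < collision_tol for ox, oy in obstacles):
        return -1.0                      # collided with an obstacle (or its inflation)
    return -0.01                         # small penalty for every other step
```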
Since a single laser reading can be anywhere from 0 to 200 cm, using a Q-table here would be hopeless — the state space is far too large to enumerate.
DDQN, on the other hand, can store or approximate essentially all of these states, so it still does well even when the environment changes.
To ensure that the deep reinforcement learning training converges normally, the pool should be large enough to store the state-action pair of every time step and keep the neural network's training samples independent and identically distributed (i.i.d.).
This is the part I don't quite understand — why?
Also, when training, do we sample once, then act randomly, then sample again and look at the result — is that how it goes? If so, training can't happen within a single path-planning run, but rather in the way described above. But that's because neural network training needs it this way (is DDQN also trained like this? asking urgently, waiting online).
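For what it's worth, this is roughly why a replay pool helps (a minimal sketch of an experience pool, using the pool size of 40000 and the [s, a, r, s1, d] tuple mentioned later; the rest is my own illustration): drawing random minibatches from a large FIFO buffer breaks the temporal correlation between consecutive steps, which is what "keeping the training samples approximately i.i.d." means in practice.

```python
import random
from collections import deque

class ReplayPool:
    """FIFO experience pool: old samples are evicted once capacity is reached."""
    def __init__(self, capacity=40000):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s1, done):
        self.buffer.append((s, a, r, s1, done))   # one transition per time step

    def sample(self, batch_size=32):
        # Uniform random minibatch: consecutive, highly correlated steps are
        # unlikely to end up in the same batch.
        return random.sample(self.buffer, batch_size)
```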
Besides, the environment punishments and rewards should reach a certain proportion. If the sample space is too sparse — that is, if the main states are random movements in free space — it is difficult to achieve a stable training effect.
To address the instability of DDQN during training and the reward sparsity of the state space, the start point is randomly placed within a circle of radius L centered on the target point. The initial value of L is small, which increases the probability that the agent reaches the target point from the start during random exploration and guarantees positive rewards within this range.
As the neural network is updated and the probability of acting greedily increases, the value of L is gradually enlarged. The local space the agent searches expands as follows.
[Formula: expansion rule for the search radius L]
Here n is the current iteration, $i_{min}$ is the minimum value of L, $i_{max}$ is the maximum value of L, m is the speed of the space expansion, and N1 and N2 are hyperparameter thresholds on the iteration count.
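The formula itself is only in the figure, but from the symbol descriptions a plausible reconstruction (my guess at the piecewise schedule, not copied from the paper) is:

```latex
L(n) =
\begin{cases}
i_{min}, & n < N_1 \\
i_{min} + m\,(n - N_1), & N_1 \le n \le N_2 \\
i_{max}, & n > N_2
\end{cases}
```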
Each episode terminates after a fixed number of moving steps, instead of directly terminating the current training episode when encountering obstacles or reaching the target point.
The original training mode results in a large number of samples in the pool that are merely moving states in the free zone, and the lack of trial-and-error punishment samples and target reward samples ultimately leads to non-convergence of the algorithm.
With the original approach, since most of the data is just movement in free space, there are very few -1 samples; the bulk of the samples is free wandering, and training does not converge.
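A minimal sketch of that episode structure (my reading; `respawn` is a hypothetical helper, since the notes don't say how the agent continues after a collision or after reaching the target): the loop runs for a fixed number of steps, and a goal/collision event does not end it, so +1 and -1 samples keep flowing into the pool.

```python
MAX_STEPS = 200   # assumed episode length, not given in the notes

def run_episode(env, agent, pool):
    """env/agent/pool are assumed interfaces: env.reset/step/respawn,
    agent.act (epsilon-greedy), pool.add (see the ReplayPool sketch above)."""
    s = env.reset()
    for _ in range(MAX_STEPS):
        a = agent.act(s)
        s1, r, done = env.step(a)          # done flags a goal or collision event
        pool.add(s, a, r, s1, done)
        s = env.respawn() if done else s1  # keep the episode running to a fixed length
```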

[Figure: network architecture]
See the paper for the specific CNN details.

Local Path Planning Simulation with Pygame Module

The part I was most looking forward to — the environment! Can I grab it for free?
The Pygame module is used to build the dynamic environment.
That seems to be all they say about it.
Then, exploration is completely random at first; a sample is [s, a, r, s1, d], where d indicates whether the current episode has ended.
The pool size is 40000.
Once enough samples have accumulated in the pool, the network is trained with samples randomly selected from it.
In the first 5000 time steps of random exploration, the network parameters are not updated, but the samples in the pool are accumulated. After the sample size reaches 5000, the network is trained every four time-step movements.
So at first there is no training (parameters are not updated); once the sample size reaches 5000, the network is trained once every four movement steps.
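A minimal sketch of that schedule (the warm-up of 5000 steps and the train-every-four-steps rule are from the notes; `agent.train_on` is an assumed interface):

```python
WARMUP_STEPS = 5000   # collect-only phase: no parameter updates
TRAIN_EVERY = 4       # afterwards, train once every four movement steps

def maybe_train(step, pool, agent, batch_size=32):
    if step < WARMUP_STEPS:
        return                           # still just filling the pool
    if step % TRAIN_EVERY == 0:
        batch = pool.sample(batch_size)  # random minibatch from the pool
        agent.train_on(batch)            # update the Q-estimation network
```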

Sexy blue text, teaching online — this is exactly how a neural network should be trained:

Randomly draw 32 samples to update the Q-estimation network, and train the Q-target network with a low learning rate so that it approaches the Q-estimation network; this keeps the Q-target network learning stably. When the pool is full, the oldest samples are evicted in FIFO order.
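A minimal sketch of that update (my own illustration, assuming Keras-style networks with predict / train_on_batch / get_weights / set_weights; GAMMA and TAU are assumed values, and the "low learning rate" for the target network is rendered here as a soft update):

```python
import numpy as np

GAMMA, TAU = 0.99, 0.01   # assumed discount factor and target-update rate

def ddqn_update(online_net, target_net, batch):
    """batch: 32 transitions (s, a, r, s1, done) drawn at random from the pool."""
    s, a, r, s1, done = (np.array(x) for x in zip(*batch))
    # Double DQN: the online net picks the next action, the target net evaluates it.
    next_a = np.argmax(online_net.predict(s1, verbose=0), axis=1)
    next_q = target_net.predict(s1, verbose=0)[np.arange(len(a)), next_a]
    y = online_net.predict(s, verbose=0)
    y[np.arange(len(a)), a] = r + GAMMA * (1.0 - done) * next_q
    online_net.train_on_batch(s, y)      # update the Q-estimation network
    # Pull the target network slowly toward the online network.
    target_net.set_weights([TAU * w + (1.0 - TAU) * tw
                            for w, tw in zip(online_net.get_weights(),
                                             target_net.get_weights())])
```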


Now for the bragging part.
After rapid convergence,
we store the network parameters and test them on an experimental environment map.
Figure 6 is a new environment map which was never seen during training.
There are three freely moving obstacles with constant speed, and their moving directions are denoted by the blue arrows.

In the figure, state A demonstrates that the local path is blocked by dynamic obstacle No. 2; the agent then waits for the obstacle to move downwards. When the obstacle is out of the path, the agent moves towards the upper right. As shown in state B, the agent successfully reaches the third local target. In state C, obstacle No. 3 moves towards the agent. The agent perceives obstacle No. 3 before collision and gets to the sixth local target point by moving toward the bottom right. In state D, the agent reaches the end point and completes the path planning without any accidental collision with the obstacles.
The whole process demonstrates that, after being trained by DDQN, the agent is able to perceive moving obstacles in advance from lidar data in an unknown dynamic environment. The Q-target network provides an intelligent planning method that makes each step move towards a higher cumulative reward.

Data Availability
The datasets and codes generated and analyzed during the current study are available in the Github repository [https://github.com/raykking].
