[Reinforcement Learning - Exploration] Trying RND

Giving RND a try.
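For reference, here is a minimal sketch of the RND mechanism I'm experimenting with, assuming PyTorch; the layer sizes and names are illustrative, not my exact networks. The idea: a fixed, randomly initialized target network and a trained predictor, with the predictor's error serving as the intrinsic reward.

```python
import torch
import torch.nn as nn

class RND(nn.Module):
    """Random Network Distillation: the intrinsic reward is the predictor's
    error against a fixed, randomly initialized target network."""
    def __init__(self, obs_dim: int, feat_dim: int = 64):
        super().__init__()
        # Fixed random target network -- never trained.
        self.target = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
        for p in self.target.parameters():
            p.requires_grad_(False)
        # Predictor network, trained (elsewhere) to match the target's output
        # on visited states, by minimizing this same error.
        self.predictor = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))

    def intrinsic_reward(self, obs: torch.Tensor) -> torch.Tensor:
        # High prediction error = rarely visited state = high novelty bonus.
        with torch.no_grad():
            target_feat = self.target(obs)
        pred_feat = self.predictor(obs)
        return (pred_feat - target_feat).pow(2).mean(dim=-1)
```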

It doesn't seem to work well at all: the agent stays stuck in a local optimum and won't budge, as if it were terrified of dying. (It does die occasionally, but not often; mostly it just wriggles back and forth between the middle and upper parts of the screen.)

There are two main problems. First, I didn't do reward normalization (a sketch of the fix follows the next paragraph).

Second, once the agent dies, the return collapses. So I should do what the paper does: don't treat the intrinsic problem as episodic, just restart on death without truncating the return. And because gamma is so high, even though each wriggling step earns only a tiny reward, the agent can still rack up a decent return just by staying alive long enough.
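On the first problem: the RND paper normalizes intrinsic rewards by dividing them by a running estimate of the standard deviation of the discounted intrinsic return, keeping their scale stable across training. A minimal NumPy sketch of that scheme; the class, helper names, and `GAMMA_INT` value are my own, not from the paper's code:

```python
import numpy as np

class RunningMeanStd:
    """Streaming mean/variance (parallel Welford-style update)."""
    def __init__(self, eps: float = 1e-8):
        self.mean, self.var, self.count = 0.0, 1.0, eps

    def update(self, x: np.ndarray):
        b_mean, b_var, b_n = x.mean(), x.var(), x.size
        delta = b_mean - self.mean
        tot = self.count + b_n
        self.mean += delta * b_n / tot
        m2 = self.var * self.count + b_var * b_n \
             + delta**2 * self.count * b_n / tot
        self.var, self.count = m2 / tot, tot

ret_rms = RunningMeanStd()
running_ret = 0.0   # discounted sum of intrinsic rewards seen so far
GAMMA_INT = 0.99    # intrinsic discount factor (assumed value)

def normalize_intrinsic(r_int: np.ndarray) -> np.ndarray:
    """Divide raw intrinsic rewards by the running std of the
    discounted intrinsic return, as in the RND paper."""
    global running_ret
    rets = np.empty_like(r_int, dtype=np.float64)
    for i, r in enumerate(r_int):
        running_ret = running_ret * GAMMA_INT + r
        rets[i] = running_ret
    ret_rms.update(rets)
    return r_int / (np.sqrt(ret_rms.var) + 1e-8)
```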

At the final step of training, the episode had already run 2690 steps and the agent still hadn't left the first room; it was just wriggling back and forth.

So it seems intrinsic reward can also trap the agent in weird local minima?

Actually, this comes from chaos and combinatorial explosion: the monsters move back and forth while the agent also wriggles back and forth, and the asynchrony between the two (a phase offset of 1 s? 2 s? 3 s?) creates an enormous number of possible state combinations. That way, irregular wriggling alone keeps producing non-zero intrinsic rewards.

Noise scheduling also needs to be considered here. Let me improve it and try again.
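One plausible reading of "noise scheduling" here is annealing the weight on the intrinsic reward over training, so the novelty bonus dominates early and fades later. A tiny sketch under that assumption; the `intrinsic_coef` name and the start/end values are placeholders, not tuned numbers:

```python
def intrinsic_coef(step: int, total_steps: int,
                   start: float = 1.0, end: float = 0.1) -> float:
    """Linearly anneal the intrinsic-reward coefficient over training."""
    frac = min(step / total_steps, 1.0)
    return start + frac * (end - start)

# r_total = r_ext + intrinsic_coef(step, total_steps) * r_int_normalized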


On the episodic question, the original paper puts it this way:

In preliminary experiments that used only intrinsic rewards, treating the problem as non-episodic resulted in better exploration. In that setting the return is not truncated at “game over”. We argue that this is a natural way to do exploration in simulated environments, since the agent’s intrinsic return should be related to all the novel states that it could find in the future, regardless of whether they all occur in one episode or are spread over several. It is also argued in (Burda et al., 2018) that using episodic intrinsic rewards can leak information about the task to the agent.

We also argue that this is closer to how humans explore games. For example let’s say Alice is playing a videogame and is attempting a tricky maneuver to reach a suspected secret room. Because the maneuver is tricky the chance of a game over is high, but the payoff to Alice’s curiosity will be high if she succeeds. If Alice is modelled as an episodic reinforcement learning agent, then her future return will be exactly zero if she gets a game over, which might make her overly risk averse. The real cost of a game over to Alice is the opportunity cost incurred by having to play through the game from the beginning (which is presumably less interesting to Alice having played the game for some time).

And in the run just now, the agents are indeed overly risk averse.
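So the fix for the second problem is to keep the intrinsic value bootstrap alive across "game over". A sketch of how that might look inside a GAE computation (NumPy; the function and variable names are my own, and the discounts in the usage comment follow the paper's reported settings of 0.99 intrinsic and 0.999 extrinsic):

```python
import numpy as np

def gae(rewards, values, last_value, dones,
        gamma: float, lam: float = 0.95, episodic: bool = True):
    """Generalized advantage estimation. With episodic=False the bootstrap
    is never cut at 'game over', matching the paper's non-episodic
    treatment of the intrinsic return."""
    T = len(rewards)
    adv = np.zeros(T)
    next_value, next_adv = last_value, 0.0
    for t in reversed(range(T)):
        # Intrinsic stream: pretend death is just a teleport back to start.
        nonterminal = 1.0 - dones[t] if episodic else 1.0
        delta = rewards[t] + gamma * nonterminal * next_value - values[t]
        next_adv = delta + gamma * lam * nonterminal * next_adv
        adv[t] = next_adv
        next_value = values[t]
    return adv

# Two value heads, two advantage streams:
# adv_ext = gae(r_ext, v_ext, last_v_ext, dones, gamma=0.999, episodic=True)
# adv_int = gae(r_int, v_int, last_v_int, dones, gamma=0.99,  episodic=False)
```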
