[Reinforcement Learning - Exploration] Trying RND

Giving RND a try.
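For reference, here is a minimal sketch of the RND mechanism I'm experimenting with, assuming PyTorch; the layer sizes and names are illustrative, not my exact networks. The idea: a fixed, randomly initialized target network and a trained predictor, with the predictor's error serving as the intrinsic reward.

```python
import torch
import torch.nn as nn

class RND(nn.Module):
    """Random Network Distillation: the intrinsic reward is the predictor's
    error against a fixed, randomly initialized target network."""
    def __init__(self, obs_dim: int, feat_dim: int = 64):
        super().__init__()
        # Fixed random target network -- never trained.
        self.target = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
        for p in self.target.parameters():
            p.requires_grad_(False)
        # Predictor network, trained (elsewhere) to match the target's output
        # on visited states, by minimizing this same error.
        self.predictor = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))

    def intrinsic_reward(self, obs: torch.Tensor) -> torch.Tensor:
        # High prediction error = rarely visited state = high novelty bonus.
        with torch.no_grad():
            target_feat = self.target(obs)
        pred_feat = self.predictor(obs)
        return (pred_feat - target_feat).pow(2).mean(dim=-1)
```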

It doesn't seem to work well at all: the agent stays stuck in a local optimum and won't budge, as if it were terrified of dying. (It does die occasionally, but not often; mostly it just wriggles back and forth between the middle and upper parts of the screen.)

There are two main problems. First, I didn't do reward normalization (a sketch of the fix follows the next paragraph).

Second, once the agent dies, the return collapses. So I should do what the paper does: don't treat the intrinsic problem as episodic, just restart on death without truncating the return. And because gamma is so high, even though each wriggling step earns only a tiny reward, the agent can still rack up a decent return just by staying alive long enough.
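On the first problem: the RND paper normalizes intrinsic rewards by dividing them by a running estimate of the standard deviation of the discounted intrinsic return, keeping their scale stable across training. A minimal NumPy sketch of that scheme; the class, helper names, and `GAMMA_INT` value are my own, not from the paper's code:

```python
import numpy as np

class RunningMeanStd:
    """Streaming mean/variance (parallel Welford-style update)."""
    def __init__(self, eps: float = 1e-8):
        self.mean, self.var, self.count = 0.0, 1.0, eps

    def update(self, x: np.ndarray):
        b_mean, b_var, b_n = x.mean(), x.var(), x.size
        delta = b_mean - self.mean
        tot = self.count + b_n
        self.mean += delta * b_n / tot
        m2 = self.var * self.count + b_var * b_n \
             + delta**2 * self.count * b_n / tot
        self.var, self.count = m2 / tot, tot

ret_rms = RunningMeanStd()
running_ret = 0.0   # discounted sum of intrinsic rewards seen so far
GAMMA_INT = 0.99    # intrinsic discount factor (assumed value)

def normalize_intrinsic(r_int: np.ndarray) -> np.ndarray:
    """Divide raw intrinsic rewards by the running std of the
    discounted intrinsic return, as in the RND paper."""
    global running_ret
    rets = np.empty_like(r_int, dtype=np.float64)
    for i, r in enumerate(r_int):
        running_ret = running_ret * GAMMA_INT + r
        rets[i] = running_ret
    ret_rms.update(rets)
    return r_int / (np.sqrt(ret_rms.var) + 1e-8)
```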

At the final step of training, the episode had already run 2690 steps and the agent still hadn't left the first room; it was just wriggling back and forth.

So it seems intrinsic reward can also trap the agent in weird local minima?

Actually, this comes from chaos and combinatorial explosion: the monsters move back and forth while the agent also wriggles back and forth, and the asynchrony between the two (a phase offset of 1 s? 2 s? 3 s?) creates an enormous number of possible state combinations. That way, irregular wriggling alone keeps producing non-zero intrinsic rewards.

Noise scheduling also needs to be considered here. Let me improve it and try again.
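One plausible reading of "noise scheduling" here is annealing the weight on the intrinsic reward over training, so the novelty bonus dominates early and fades later. A tiny sketch under that assumption; the `intrinsic_coef` name and the start/end values are placeholders, not tuned numbers:

```python
def intrinsic_coef(step: int, total_steps: int,
                   start: float = 1.0, end: float = 0.1) -> float:
    """Linearly anneal the intrinsic-reward coefficient over training."""
    frac = min(step / total_steps, 1.0)
    return start + frac * (end - start)

# r_total = r_ext + intrinsic_coef(step, total_steps) * r_int_normalized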


On the episodic question, the original paper puts it this way:

In preliminary experiments that used only intrinsic rewards, treating the problem as non-episodic resulted in better exploration. In that setting the return is not truncated at “game over”. We argue that this is a natural way to do exploration in simulated environments, since the agent’s intrinsic return should be related to all the novel states that it could find in the future, regardless of whether they all occur in one episode or are spread over several. It is also argued in (Burda et al., 2018) that using episodic intrinsic rewards can leak information about the task to the agent.

We also argue that this is closer to how humans explore games. For example let’s say Alice is playing a videogame and is attempting a tricky maneuver to reach a suspected secret room. Because the maneuver is tricky the chance of a game over is high, but the payoff to Alice’s curiosity will be high if she succeeds. If Alice is modelled as an episodic reinforcement learning agent, then her future return will be exactly zero if she gets a game over, which might make her overly risk averse. The real cost of a game over to Alice is the opportunity cost incurred by having to play through the game from the beginning (which is presumably less interesting to Alice having played the game for some time).

And in the run just now, the agents are indeed overly risk averse.
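So the fix for the second problem is to keep the intrinsic value bootstrap alive across "game over". A sketch of how that might look inside a GAE computation (NumPy; the function and variable names are my own, and the discounts in the usage comment follow the paper's reported settings of 0.99 intrinsic and 0.999 extrinsic):

```python
import numpy as np

def gae(rewards, values, last_value, dones,
        gamma: float, lam: float = 0.95, episodic: bool = True):
    """Generalized advantage estimation. With episodic=False the bootstrap
    is never cut at 'game over', matching the paper's non-episodic
    treatment of the intrinsic return."""
    T = len(rewards)
    adv = np.zeros(T)
    next_value, next_adv = last_value, 0.0
    for t in reversed(range(T)):
        # Intrinsic stream: pretend death is just a teleport back to start.
        nonterminal = 1.0 - dones[t] if episodic else 1.0
        delta = rewards[t] + gamma * nonterminal * next_value - values[t]
        next_adv = delta + gamma * lam * nonterminal * next_adv
        adv[t] = next_adv
        next_value = values[t]
    return adv

# Two value heads, two advantage streams:
# adv_ext = gae(r_ext, v_ext, last_v_ext, dones, gamma=0.999, episodic=True)
# adv_int = gae(r_int, v_int, last_v_int, dones, gamma=0.99,  episodic=False)
```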
