Key quotes on exploration in reinforcement learning

DIRECTED EXPLORATION FOR REINFORCEMENT LEARNING

Overall, the approach is similar to Go-Explore.

These uncertainty-based methods use a reward bonus approach, where they compute a measure of uncertainty and transform that into a bonus that is then added into the reward function. Unfortunately this reward bonus approach has some drawbacks. The main drawback is that reward bonuses may take many, many updates before they propagate and change agent behavior.
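A minimal sketch of that reward-bonus pattern (names such as `bonus_scale` and `state_uncertainty` are illustrative choices, not from the paper):

```python
import numpy as np

def augmented_reward(extrinsic_reward, state_uncertainty, bonus_scale=0.1):
    """Generic reward-bonus pattern: turn an uncertainty measure into a bonus
    and add it to the environment reward. For a count-based example,
    state_uncertainty could be 1.0 / visit_count(s)."""
    bonus = bonus_scale * np.sqrt(state_uncertainty)
    return extrinsic_reward + bonus
```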

This is due to two main factors: the first is that function approximation itself needs many updates before converging; the second is that the reward bonuses are non-stationary and change as the agent explores, meaning the function approximator needs to update and converge to a new set of values every time the uncertainties change.

This makes it necessary to ensure that uncertainties do not change too quickly, in order to give enough time for the function approximation to catch up and propagate the older changes before needing to catch up to the newer changes.

So this is what RND's 0.25 predictor-update mask (the proportion of experience used to train the predictor) is actually for.
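Roughly how that mask works, assuming the standard RND setup (fixed random target network, trained predictor, prediction error as the bonus); this is my reading of RND, not code from this paper:

```python
import torch

def rnd_predictor_loss(predictor, target, obs, keep_prob=0.25):
    """Prediction error against the fixed random target network is the intrinsic
    bonus; only a random fraction (keep_prob) of each batch gets gradient, so the
    bonus changes slowly enough for the value function to keep up."""
    error = (predictor(obs) - target(obs).detach()).pow(2).mean(dim=-1)  # per-sample error
    mask = (torch.rand(obs.shape[0]) < keep_prob).float()                # keep ~25% of samples
    return (error * mask).sum() / mask.sum().clamp(min=1.0)
```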

If the reward bonuses change too quickly, or are too noisy, then it becomes possible for the function approximator to prematurely stop propagation of older changes and start trying to match the newer changes, resulting in missed exploration opportunities or even converging to a suboptimal mixture of old and new uncertainties.

Non-stationarity has already been a difficult problem for RL in learning a Q-value function, which the DQN algorithm is able to tackle by slowing down the propagation of changes through the use of a target network [Mnih et al., 2013]. These two factors together result in slow adaptation of reward bonuses and lead to less efficient exploration.
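The target-network trick referenced here, as a minimal sketch (the network sizes and update interval are placeholders):

```python
import copy
import torch

q_net = torch.nn.Linear(4, 2)        # stand-in Q-network: 4-dim state, 2 actions
target_net = copy.deepcopy(q_net)    # slowly-updated copy used for TD targets

def td_target(reward, next_state, done, gamma=0.99):
    # Bootstrapping from the frozen copy keeps the regression target stable.
    with torch.no_grad():
        next_q = target_net(next_state).max(dim=-1).values
    return reward + gamma * (1.0 - done) * next_q

def maybe_sync_target(step, every_n_steps=1000):
    # Hard update: periodically copy online weights into the target network;
    # this is what slows down the propagation of value changes.
    if step % every_n_steps == 0:
        target_net.load_state_dict(q_net.state_dict())
```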

Use a goal-conditioned policy instead, because: "This results in an algorithm that is completely stationary, because the goal-conditioned policy is independent of the uncertainty."
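Sketch of that stationarity argument (function names are hypothetical, not the paper's API):

```python
import numpy as np

def select_exploration_goal(candidate_states, uncertainty):
    # Goal selection is the only place the (changing) uncertainty enters:
    # here, simply pick the most uncertain known state as the next goal.
    return candidate_states[int(np.argmax(uncertainty))]

def act(goal_policy, state, goal):
    # The policy is conditioned only on (state, goal), so its learning problem
    # ("reach g") stays stationary even while the uncertainties keep changing.
    return goal_policy(state, goal)
```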

Full algorithm:
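The algorithm figure itself is not copied into these notes; the loop below is only a guess assembled from the notes above (Go-Explore-style: pick an uncertain goal, reach it with the goal-conditioned policy, then explore from there), not the paper's actual pseudocode:

```python
def directed_exploration_episode(env, goal_policy, known_states, uncertainty_of,
                                 reached, explore_steps=50):
    """Hypothetical outer loop pieced together from the notes above, not the
    paper's pseudocode. known_states, uncertainty_of and reached are placeholder
    hooks; env follows the classic gym step/reset interface."""
    # Goal selection: the only step that depends on the changing uncertainties.
    goal = max(known_states, key=uncertainty_of)

    # Phase 1: drive to the uncertain region with the goal-conditioned policy.
    state, done = env.reset(), False
    while not done and not reached(state, goal):
        state, _, done, _ = env.step(goal_policy(state, goal))

    # Phase 2: explore from the goal so new uncertainty estimates can be gathered.
    for _ in range(explore_steps):
        if done:
            break
        state, _, done, _ = env.step(env.action_space.sample())
```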

