Uncertainty-Aware Reinforcement Learning

Model-based Reinforcement Learning (RL) owes most of its appeal to sample efficiency. It's undemanding about the amount of real-world data it needs as input, though there is a cap on what we should expect the model to achieve.


The model is unlikely to turn out a perfect representation of the environment. While interacting with the real world through the trained agent, we might meet states and rewards different from the ones seen during training. For model-based RL to work, we need to overcome this problem. It's vital: it's what will let our agent know what it's doing.


First, what of model-free RL? Model-free RL always uses the ground-truth transitions of the environment when training and testing the agent, unless we introduce offsets ourselves, such as simulation-to-real transfer, in which case we can't blame the algorithm. Uncertainty is, therefore, not a big worry here. For something like a Q-function Q(s, a), which optimizes over actions, we could attempt to integrate uncertainty awareness into action selection. But since it works well anyway, there's no harm, for now, in closing our eyes and pretending we didn't see that.


Contents

1. Source of uncertainty in model-based RL
2. The benefit of uncertainty awareness
3. Building uncertainty-aware models
   - What might seem to work
   - What does work
4. Conclusion


Source of Uncertainty

Model uncertainty results from the distribution mismatch between the data the model sees during testing and that used to train the model. We test the agent on a distribution different from that seen during training.


What difference would uncertainty awareness make, exactly?

At the start of training, the model p(sₜ₊₁ | sₜ, aₜ) has been exposed to only a small amount of real-world data. We hope the function doesn't over-fit to this small quantity, because we need it to stay expressive enough to capture the transitions at later time steps, by which point enough real data will have accumulated to learn a precise model.

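Concretely, a model like p(sₜ₊₁ | sₜ, aₜ) is often represented as a network that outputs a distribution over the next state and is fit by maximum likelihood on the transitions collected so far. Here is a minimal sketch in PyTorch; the class name, layer sizes, and the stand-in training batch are illustrative assumptions, not the article's code.

```python
import torch
import torch.nn as nn

class GaussianDynamicsModel(nn.Module):
    """Predicts a Gaussian distribution over the next state: p(s_{t+1} | s_t, a_t)."""

    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
        )
        self.mean_head = nn.Linear(hidden_dim, state_dim)
        self.log_std_head = nn.Linear(hidden_dim, state_dim)

    def forward(self, state, action):
        h = self.net(torch.cat([state, action], dim=-1))
        mean = self.mean_head(h)
        log_std = self.log_std_head(h).clamp(-5.0, 2.0)  # keep the predicted variance in a sane range
        return torch.distributions.Normal(mean, log_std.exp())

# Training step: maximise the likelihood of observed real-world transitions.
model = GaussianDynamicsModel(state_dim=4, action_dim=2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

s, a, s_next = torch.randn(32, 4), torch.randn(32, 2), torch.randn(32, 4)  # stand-in batch
loss = -model(s, a).log_prob(s_next).sum(dim=-1).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```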

This is challenging to achieve in model-based RL. Why? The simple goal of RL is maximizing the future cumulative reward. The planner, while aiming for this, tries to follow plans for which the model predicts high reward. So if the model overestimates the reward it will get for a particular action sequence, the planner will be glad to follow that gleaming but erroneous estimate. Selecting such actions in the real world then results in funny behaviour. In short, the planner is motivated to exploit the positive mistakes of the model.


(We can think of the planner as the method we use to select optimal actions given the world states).

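To make the planner's incentive concrete, here is a minimal random-shooting sketch (an illustration, not the article's planner): it rolls candidate action sequences through the learned model and keeps whichever one the model claims is best, so an overestimated reward is exactly what it will chase. `model` and `reward_model` are stand-in callables, and the action dimension is hard-coded for brevity.

```python
import numpy as np

def plan_random_shooting(model, reward_model, state, horizon=10, num_candidates=500, rng=None):
    """Pick the action sequence with the highest *predicted* return under the learned model.

    `model(state, action)` -> predicted next state; `reward_model(state, action)` -> predicted
    reward. Both are learned, so their optimistic errors are exactly what this loop chases.
    """
    rng = rng or np.random.default_rng()
    best_return, best_actions = -np.inf, None
    for _ in range(num_candidates):
        actions = rng.uniform(-1.0, 1.0, size=(horizon, 2))  # sample a candidate action sequence
        s, total = state, 0.0
        for a in actions:
            total += reward_model(s, a)   # predicted (possibly overestimated) reward
            s = model(s, a)               # predicted next state
        if total > best_return:
            best_return, best_actions = total, actions
    return best_actions[0]  # execute only the first action, then re-plan (MPC style)
```

Executing only the first action and then re-planning limits, but does not remove, the damage an optimistic model can do.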

And can this get worse? In high-dimensional spaces, where the input is an image for instance, the model will make many more mistakes owing to latent variables. It's common in model-based RL to alleviate the distribution mismatch by using on-policy data collection: transitions observed in the real world are added to the training data and used to re-plan and correct the model's deviations. In this case, though, the mistakes will be too numerous for the on-policy fix to catch up with the erring model. With so many errors, the policy might change every time we re-plan, and as a result the model may never converge.

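In outline, that on-policy correction is the usual model-based RL loop; a hedged sketch follows, where `env`, `train_model`, and `plan` are simplified stand-ins (the `env.step` interface here is deliberately reduced to three return values).

```python
def model_based_rl_loop(env, train_model, plan, num_iterations=50, episode_length=200):
    """Alternate between fitting the model and collecting fresh on-policy data with the planner."""
    dataset = []  # (state, action, next_state, reward) transitions from the real environment
    model = None
    for _ in range(num_iterations):
        model = train_model(dataset)           # fit the dynamics model to all data seen so far
        state = env.reset()
        for _ in range(episode_length):
            action = plan(model, state)        # plan under the current (imperfect) model
            next_state, reward, done = env.step(action)
            dataset.append((state, action, next_state, reward))  # on-policy data for the next fit
            state = next_state
            if done:
                break
    return model
```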

We could collect data for every mistake the model might make, but wouldn't it be better if we could detect where the model is likely to go wrong, so the model-based planner can avoid actions likely to result in severe outcomes?


Estimating Uncertainty

First, let’s phrase what we know as a simple story.


A loving couple gets the blessing of a baby and a robot — not necessarily at the same time. The robot’s goal is, as a babysitter, to keep baby Juliet happy. While it’s motivated with rewards to achieve this, it’s also desirable that the robot avoids anything damaging, or that might injure the baby.


Now the baby grows fond of crying while pointing at bugs — because good babies can do that — and the robot’s optimal-reward plan becomes squashing the bug and letting baby Juliet watch it feed the vermin to the cat.


For a change, though, say the robot encounters the baby crying while pointing at something scary on the television: an unfamiliar state, seemingly close to baby Juliet's cry-pointing behaviour. Unsure of the dynamics, the robot might decide that its best plan is to squash the TV and feed it to the cat. We are not sure if that will make the baby happy, but it's sure to cause damage.


However, if the model, being unconfident, had shied away from that action, the robot would have been better off: it would have left the TV untouched and avoided the damage, at the expense of a sad Juliet.


An uncertainty-aware model would let the agent know where it has a high chance of an undesired outcome, and so where it needs to be more careful. And if the model is unconfident about what an action will lead to, it's probably good to take that into account when deciding how to reach its goal.


If our robot is confident that pickles calm baby Juliet while posing no risk, then it might consider running to the kitchen and letting her chew on one, because then, it will achieve its goal of keeping her happy.


A model that can produce accurate estimates of its own uncertainty gives the model-based planner the ability to avoid actions with a non-negligible chance of resulting in undesired outcomes. Gradually, the model will learn to make better estimates. Uncertainty awareness will also tell the model which states it needs to explore more.

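One simple way to wire such an estimate into planning, sketched below under the assumption that some per-step uncertainty estimate is already available (an ensemble's disagreement, a predictive variance, and so on): subtract an uncertainty penalty from the predicted return of each candidate action sequence. `uncertainty_fn` and the `penalty` weight are illustrative assumptions, not the article's method.

```python
def risk_aware_score(model, reward_model, uncertainty_fn, state, actions, penalty=1.0):
    """Score an action sequence by predicted return minus an uncertainty penalty.

    `uncertainty_fn(state, action)` is any per-step uncertainty estimate, e.g. the variance
    of an ensemble's predictions; `penalty` trades off predicted reward against caution.
    """
    s, total = state, 0.0
    for a in actions:
        total += reward_model(s, a) - penalty * uncertainty_fn(s, a)
        s = model(s, a)
    return total
```

Plugged into a planner like the random-shooting sketch above, this makes it prefer action sequences the model is both optimistic and confident about.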

What seems like a solution?

Using Entropy


We know entropy as a measure of randomness, or the degree of spread in a probability distribution of a random variable.


[Figure: Entropy]
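For a discrete predicted distribution, entropy is H(p) = −Σᵢ pᵢ log pᵢ: small when the probability mass is concentrated on a few outcomes, large when it is spread out. A minimal sketch of computing it for a model's predicted next-state probabilities, with made-up distributions for illustration:

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy H(p) = -sum_i p_i * log(p_i) of a discrete distribution, in nats."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(p + eps)))

confident = [0.97, 0.01, 0.01, 0.01]  # the model is almost sure which state comes next
spread    = [0.25, 0.25, 0.25, 0.25]  # the model has no idea

print(entropy(confident))  # ~0.17 nats: low spread
print(entropy(spread))     # ~1.39 nats: maximal spread over four outcomes
```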