Solving sparse-reward tasks with Curiosity

We just released the new version of ML-Agents toolkit (v0.4), and one of the new features we are excited to share with everyone is the ability to train agents with an additional curiosity-based intrinsic reward.

Since there is a lot to unpack in this feature, I wanted to write an additional blog post on it. In essence, there is now an easy way to encourage agents to explore the environment more effectively when the rewards are infrequent and sparsely distributed. These agents can do this using a reward they give themselves based on how surprised they are about the outcome of their actions. In this post, I will explain how this new system works, and then show how we can use it to help our agent solve a task that would otherwise be much more difficult for a vanilla Reinforcement Learning (RL) algorithm to solve.

Curiosity-driven exploration

When it comes to Reinforcement Learning, the primary learning signal comes in the form of the reward: a scalar value provided to the agent after every decision it makes. This reward is typically provided by the environment itself and specified by the creator of the environment. These rewards often correspond to things like +1.0 for reaching the goal, -1.0 for dying, etc. We can think of this kind of reward as being extrinsic because it comes from outside the agent. If there are extrinsic rewards, then that means there must be intrinsic ones too. Rather than being provided by the environment, intrinsic rewards are generated by the agent itself based on some criteria. Of course, not just any intrinsic reward would do. We want intrinsic rewards which ultimately serve some purpose, such as changing the agent’s behavior so that it will get even greater extrinsic rewards in the future, or so that it will explore the world more than it might have otherwise. In humans and other mammals, the pursuit of these intrinsic rewards is often referred to as intrinsic motivation and is tied closely to our feelings of agency.

Researchers in the field of Reinforcement Learning have put a lot of thought into developing systems that provide intrinsic rewards to agents, endowing them with motivation similar to what we find in nature’s agents. One popular approach is to give the agent a sense of curiosity and to reward it based on how surprised it is by the world around it. If you think about how a young baby learns about the world, it isn’t pursuing any specific goal, but rather playing and exploring for the novelty of the experience. You could say that the child is curious. The idea behind curiosity-driven exploration is to instill this kind of motivation into our agents. If the agent is rewarded for reaching states that surprise it, then it will learn strategies to explore the environment and find more and more surprising states. Along the way, the agent will hopefully also discover the extrinsic reward, such as a distant goal position in a maze, or a sparse resource on a landscape.

We chose to implement one specific such approach from a recent paper released last year by Deepak Pathak and his colleagues at Berkeley. It is called Curiosity-driven Exploration by Self-supervised Prediction, and you can read the paper here if you are interested in the full details. In the paper, the authors formulate the idea of curiosity in a clever and generalizable way. They propose to train two separate neural-networks: a forward and an inverse model. The inverse model is trained to take the current and next observation received by the agent, encode them both using a single encoder, and use the result to predict the action that was taken between the occurrence of the two observations. The forward model is then trained to take the encoded current observation and action and predict the encoded next observation. The difference between the predicted and real encodings is then used as the intrinsic reward, and fed to the agent. Bigger difference means bigger surprise, which in turn means bigger intrinsic reward.

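To make the two-model setup concrete, here is a minimal sketch of the idea in PyTorch. This is only an illustration, not the ML-Agents or the paper’s implementation: the layer sizes, the `strength` scaling factor, and the `Encoder`, `InverseModel`, `ForwardModel`, and `curiosity_step` names are assumptions made for the example, and it assumes discrete, one-hot encoded actions.

```python
# Illustrative sketch of a forward/inverse curiosity module (assumed sizes and
# names -- not the actual ML-Agents implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

OBS_SIZE, NUM_ACTIONS, ENC_SIZE = 64, 4, 32  # assumed dimensions for the example


class Encoder(nn.Module):
    """Shared encoder mapping a raw observation to a compact feature vector."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_SIZE, 128), nn.ReLU(),
                                 nn.Linear(128, ENC_SIZE))

    def forward(self, obs):
        return self.net(obs)


class InverseModel(nn.Module):
    """Predicts which action was taken between two encoded observations."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(2 * ENC_SIZE, NUM_ACTIONS)

    def forward(self, enc_obs, enc_next_obs):
        return self.net(torch.cat([enc_obs, enc_next_obs], dim=-1))


class ForwardModel(nn.Module):
    """Predicts the encoding of the next observation from the current encoding and action."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(ENC_SIZE + NUM_ACTIONS, ENC_SIZE)

    def forward(self, enc_obs, action_onehot):
        return self.net(torch.cat([enc_obs, action_onehot], dim=-1))


encoder, inverse_model, forward_model = Encoder(), InverseModel(), ForwardModel()


def curiosity_step(obs, action_onehot, next_obs, extrinsic_reward, strength=0.01):
    """Returns the combined reward and the curiosity training loss for one batch."""
    enc_obs, enc_next = encoder(obs), encoder(next_obs)

    # Inverse model: its loss shapes the encoder so it keeps only the features
    # of the observation that the agent's actions can actually influence.
    inverse_loss = F.cross_entropy(inverse_model(enc_obs, enc_next),
                                   action_onehot.argmax(dim=-1))

    # Forward model: its prediction error in encoding space is the "surprise".
    # Encodings are detached here so the encoder is trained by the inverse loss only.
    predicted_next = forward_model(enc_obs.detach(), action_onehot)
    surprise = 0.5 * (predicted_next - enc_next.detach()).pow(2).sum(dim=-1)

    # Bigger surprise -> bigger intrinsic reward, added (scaled) to the extrinsic one.
    total_reward = extrinsic_reward + strength * surprise.detach()
    return total_reward, inverse_loss + surprise.mean()
```

The inverse model is what ties the surprise signal to things the agent can actually influence, which is exactly the point the next paragraph makes.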

By using these two models together, the reward not only captures surprising things, but specifically captures surprising things that the agent has control over, based on its actions. Their approach allows an agent trained without any extrinsic rewards to make progress in Super Mario Bros simply based on its intrinsic reward. See below for a diagram from the paper outlining the process.

Diagram showing Intrinsic Curiosity Module. White boxes correspond to input. Blue boxes correspond to neural network layers and outputs. Filled blue lines correspond to flow of activation in the network. Green dotted lines correspond to comparisons used for loss calculation. Green box corresponds to intrinsic reward calculation.

Pyramids environment

In order to test out curiosity, no ordinary environment will do. Most of the example environments we’ve released through v0.3 of the ML-Agents toolkit contain rewards which are relatively dense and would not benefit much from curiosity or other exploration-enhancement methods. So to put our agent’s newfound curiosity to the test, we created a new sparse-reward environment called Pyramids. In it, there is only a single reward, and random exploration will rarely allow the agent to encounter it. In this environment, our agent takes the form of the familiar blue cube from some of our previous environments. The agent can move forward or backward and turn left or right, and it has access to a view of the surrounding world via a series of ray-casts from the front of the cube.

An agent observing the surroundings using ray-casts (Visualized here in black for illustrative purposes)

This agent is dropped into an enclosed space containing nine rooms. One of these rooms contains a randomly positioned switch, while the others contain randomly placed immovable stone pyramids. When the agent interacts with the switch by colliding with it, the switch turns from red to green. Along with this change of color, a pyramid of movable sand bricks is spawned randomly in one of the many rooms of the environment. On top of this pyramid is a single golden brick. When the agent collides with this brick, it receives a +2 extrinsic reward. The trick is that there are no intermediate rewards for moving to new rooms, flipping the switch, or knocking over the tower. The agent has to learn to perform this sequence without any intermediate help.

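To spell out just how sparse the learning signal is, here is a small sketch of the extrinsic reward logic in Python. It is not the actual Unity environment code; the event flags and the small per-step time penalty are assumptions for illustration (the only reward stated above is the +2 for the golden brick).

```python
# Illustrative sketch of the Pyramids extrinsic reward structure (assumed event
# flags and time penalty -- not the actual Unity environment code).
from dataclasses import dataclass


@dataclass
class StepEvents:
    touched_switch: bool = False       # agent collided with the switch
    entered_new_room: bool = False     # agent moved into a room it had not visited
    knocked_over_pyramid: bool = False
    touched_gold_brick: bool = False   # the single rewarded event


def extrinsic_reward(events: StepEvents, per_step_penalty: float = 0.001) -> float:
    reward = -per_step_penalty  # assumed small time cost per decision
    # Note what is *not* rewarded: exploring rooms, flipping the switch,
    # or toppling the spawned pyramid all yield nothing on their own.
    if events.touched_gold_brick:
        reward += 2.0  # the only positive reward in the environment
    return reward
```

Random exploration almost never triggers the one rewarded event, which is why the vanilla PPO agent described below hovers around chance.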

Agent trained with PPO+Curiosity moving to the pyramid after interacting with the switch.

Agents trained using a vanilla Proximal Policy Optimization (PPO, our default RL algorithm in ML-Agents) on this task do poorly, often failing to do better than chance (average -1 reward), even after 200,000 steps.

(Demo video)

In contrast, agents trained with PPO and the curiosity-driven intrinsic reward consistently solve it within 200,000 steps, and often in half that time.

(Demo video)

Cumulative extrinsic reward over time for PPO+Curiosity agent (blue) and PPO agent (red). Averaged over five runs each.

We also looked at agents trained with only the intrinsic reward signal. While they don’t learn to solve the task, they do learn a qualitatively more interesting policy that takes them between multiple rooms, compared to the extrinsic-only policy, which has the agent moving in small circles within a single room.

(Demo video)

Using Curiosity with PPO

If you’d like to use curiosity to help train agents in your environments, enabling it is easy. First, grab the latest ML-Agents toolkit release, then add the following line to the hyperparameter file of the brain you are interested in training: use_curiosity: true. From there, you can start the training process as usual. If you use TensorBoard, you will notice that there are now a few new metrics being tracked. These include the forward and inverse model loss, along with the cumulative intrinsic reward per episode.

Giving your agent curiosity won’t help in all situations. In particular, if your environment already contains a dense reward function, such as our Crawler and Walker environments, where a non-zero reward is received after most actions, you may not see much improvement. If your environment contains only sparse rewards, then adding an intrinsic reward has the potential to turn these tasks from unsolvable to easily solvable with Reinforcement Learning. This applies particularly to tasks where it makes the most sense to provide simple rewards such as win/lose or completed/failed.

-

If you do use the Curiosity feature, I’d love to hear about your experience. Feel free to reach out to us on our GitHub issues page, or email us directly at ml-agents@unity3d.com.  Happy training!

Translated from: https://blogs.unity3d.com/2018/06/26/solving-sparse-reward-tasks-with-curiosity/
