Introducing ML-Agents Toolkit v0.2: Curriculum Learning, new environments, and more

The Machine Learning team at Unity is happy to announce the release of a new version of the Unity Machine Learning Agents Toolkit – v0.2 Beta! With this release, we are improving the toolkit on all fronts: (1) adding new features to both the Unity SDK and the Python API; (2) providing new example environments to build on top of; and (3) improving our default Reinforcement Learning algorithm (PPO), in addition to many bug fixes and other smaller improvements. In this post, we will highlight some of the major additions, but for a full list, check out the GitHub release notes. Also visit the GitHub page to download the latest release.

We launched v0.1 Beta over two months ago, and it has been wonderful to see projects and games already being made using the Unity ML-Agents Toolkit, as well as all the helpful community feedback. To inspire more creative use cases in machine learning and beyond with the ML-Agents Toolkit, we are excited to announce our very first ML-Agents Community Challenge in this post.

New Continuous Control and Platforming Environments

One of the major requests we received was additional example environments to allow developers a greater variety of baselines from which to start building. We are happy to include four new environments in this release. These environments include two new continuous control environments, plus two platforming environments designed to show off our new Curriculum Learning feature (more on that below).

New Features: Curriculum Learning, Broadcasting, and a More Flexible Monitor

Curriculum Learning – Our Python API now includes a standardized way of utilizing Curriculum Learning during the training process. For those unfamiliar, Curriculum Learning is a way of training a machine learning model in which more difficult aspects of a problem are gradually introduced, so that the model is always optimally challenged. Here is a link to the original paper which formally introduces the idea. More generally, the idea has been around much longer, for it is how we humans typically learn. Think of any primary school education: there is an ordering of classes and topics. Arithmetic is taught before algebra, for example. Likewise, algebra is taught before calculus. The skills and knowledge learned in the earlier subjects provide a scaffolding for later lessons. The same principle can be applied to machine learning, where training on easier tasks can provide a scaffolding for harder tasks in the future.

Example of a mathematics curriculum. Lessons progress from simpler topics to more complex ones, with each building on the last.

When we think about how Reinforcement Learning actually works, the primary learning signal is a scalar reward received occasionally throughout training. In more complex or difficult tasks, this reward can be sparse and rarely achieved. For example, imagine a task in which an agent needs to push a block into place in order to scale a wall and arrive at a goal. The starting point when training an agent to accomplish this task will be a random policy. That starting policy will likely have the agent running in circles, and will likely never, or only very rarely, scale the wall properly to receive the reward. If we instead start with a simpler task, such as moving toward an unobstructed goal, the agent can easily learn to accomplish it. From there, we can slowly increase the difficulty of the task by increasing the size of the wall, until the agent can complete the initially near-impossible task of scaling the wall. We are including just such an environment with Unity ML-Agents Toolkit v0.2, called Wall Area.

Demonstration of a curriculum training scenario in which a progressively taller wall obstructs the path to the goal.

To see this in action, observe the two learning curves below. Each displays the reward over time for a brain trained using PPO with the same set of training hyperparameters and data from 32 simultaneous agents. The difference is that the brain shown in orange was trained on the full-height-wall version of the task, while the blue line corresponds to a brain trained using a curriculum version of the task. As you can see, without curriculum learning the agent has a lot of difficulty, and after 3 million steps it has still not solved the task. We think that by using well-crafted curricula, agents trained with reinforcement learning will be able to accomplish tasks that would otherwise be much more difficult, in much less time.

Two training curves on the Wall Area task. The blue line corresponds to a brain trained using curriculum learning; the orange line corresponds to a brain trained without it. Dotted vertical blue lines mark lesson changes during the curriculum training session.

So how does it work? In order to define a curriculum, the first step is to decide which parameters of the environment will vary. In the case of the Wall Area environment, what varies is the height of the wall. We define the wall height as a reset parameter in the Academy object of our scene, and by doing so it becomes adjustable via the Python API. Rather than adjusting it by hand, we then create a simple JSON file which describes the structure of the curriculum. Within it, we can set the points in the training process at which the wall height will change, based either on the percentage of training steps that have taken place or on the average reward the agent has received in the recent past. Once these are in place, we simply launch ppo.py with the --curriculum-file flag pointing to the JSON file, and PPO will train using Curriculum Learning. Of course, we can then keep track of the current lesson and progress via TensorBoard.

Here’s an example of a JSON file which defines the Curriculum for the Wall Area environment:

{
    "measure" : "reward",
    "thresholds" : [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
    "min_lesson_length" : 2,
    "signal_smoothing" : true,
    "parameters" :
    {
        "min_wall_height" : [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5],
        "max_wall_height" : [1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0]
    }
}
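To make the connection between this file and the environment concrete, here is a minimal sketch of what a lesson change amounts to on the Python side: the curriculum logic picks the wall-height values for the current lesson and passes them as reset parameters when the environment resets. The sketch assumes the unityagents Python package, a reset() call that accepts a config dictionary of reset parameters, and an illustrative environment build named "WallArea"; exact names and signatures may differ between releases.

from unityagents import UnityEnvironment

# Minimal sketch: manually doing what the curriculum trainer does on a lesson change.
# The build name and the config argument of reset() are assumptions for illustration.
env = UnityEnvironment(file_name="WallArea")

# Lesson 0: a low wall, so the agent first learns to reach the goal at all.
env.reset(train_mode=True, config={"min_wall_height": 0.0, "max_wall_height": 1.5})

# ...train until the reward threshold (0.5 above) is crossed, then advance a lesson...
env.reset(train_mode=True, config={"min_wall_height": 0.5, "max_wall_height": 2.0})

env.close()

In normal use you do not advance lessons yourself: ppo.py reads the JSON file and steps through the lessons automatically as the thresholds are met.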

For those in the community who have created environments that their agents have had difficulty solving, we encourage you to try out Curriculum Learning, and we would love to hear your findings.

Broadcasting – The internal, heuristic, and player brains now all include a “Broadcast” feature, which is active by default. When active, the states, actions, and rewards for all agents linked to that brain become accessible from the Python API. This is in contrast to v0.1, where only the external brain could send information to the Python API. This feature can be used to record, analyze, or store information from these brain types in Python. In particular, it makes imitation learning possible: data from a player, heuristic, or internal brain can be used as the supervision signal to train a separate network without needing to define a reward function, or alongside a reward function to augment the training signal. We think this can provide a new avenue for how game developers think about getting intelligent behavior from their systems. In a future blog post, we plan to walk through this scenario and provide an example project.

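As a rough illustration of what broadcasting exposes, the sketch below steps an environment from Python and records the information coming from a broadcasting (for example, player-controlled) brain, as one might do when collecting demonstrations for imitation learning. It is a hedged sketch assuming the unityagents package, where reset() and step() return per-brain info objects keyed by brain name; the brain name "PlayerBrain" and the attribute names are illustrative and may differ between versions.

from unityagents import UnityEnvironment

# Hedged sketch: reading broadcast data from a non-external brain in Python.
# "PlayerBrain" and the attribute names below are illustrative assumptions.
env = UnityEnvironment(file_name="WallArea")
info = env.reset(train_mode=False)

demonstrations = []
for _ in range(1000):
    info = env.step()                 # no actions sent; the broadcasting brain acts on its own
    player = info["PlayerBrain"]      # info object for the broadcasting brain
    demonstrations.append((player.states, player.rewards))  # e.g. input to a supervised learner

env.close()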

Brain Inspector window. In v0.2, the “Broadcast” checkbox has been added.

Flexible Monitor – We have rewritten the Agent Monitor to provide more general usability. Whereas the original Monitor had a fixed set of statistics about an agent which could be displayed, the new Monitor now allows for displaying any desired information related to agents. All you have to do is call Monitor.Log() to display information either on the screen or above an agent within the scene.

Ball balance environment with a variety of training information displayed using the Monitor.

As with any beta release, there will likely be bugs and issues. We encourage you to share feedback with us on the GitHub issues page.

Unity ML-Agents Community Challenge

Last but not least, we are excited to announce that Unity will be hosting an ML-Agents Community Challenge. Whether you’re an expert in Machine Learning or just interested in how ML can be applied to games, this challenge is a great opportunity for you to learn, explore, inspire and get inspired.

We want to see how you apply the new Curriculum Learning method. But we’re not looking for any particular genre or style, so get creative! We’ll send some gifts and surprises to the creators who get the most likes at the end of the challenge.

Enter the ML-Agents Challenge

The first round of the Machine Learning Agents Challenge runs from Dec 7, 2017 to Jan 31, 2018, and it is open to any developer with basic Unity knowledge and experience. Click this link to enter the challenge. If you’re not familiar with how ML-Agents works, please feel free to contact us by mail or on the Unity Machine Learning channel with your questions.

Happy Creating!

[Recommended Readings]

Introducing: Unity Machine Learning Agents Toolkit

Unity AI – Reinforcement Learning with Q-Learning

Unity AI-themed Blog Entries

Using Machine Learning Agents Toolkit in a real game: a beginner’s guide

Translated from: https://blogs.unity3d.com/2017/12/08/introducing-ml-agents-v0-2-curriculum-learning-new-environments-and-more/
