Unity AI-Themed Blog Entries

Welcome to the first of Unity’s new AI-themed blog entries! We have set up this space as a place to share and discuss the work Unity is doing around Artificial Intelligence and Machine Learning. In the past few years, advances in Machine Learning (ML) have allowed for breakthroughs in detecting objects, translating text, recognizing speech, and playing games, to name a few. That last point, the connection between ML and games, is something very close to our hearts here at Unity. We believe that breakthroughs in Deep Learning are going to create a sea change in how games are built, changing everything from how textures and 3D models are generated, to how non-playable characters (NPCs) are programmed, to how we think about animating characters or lighting scenes. These blog entries are a creative space to explore all these emerging developments.

Who are these blog entries for?

It is our objective to inform Unity Game Developers about the power of AI and ML approaches in game development. We also want to show Artists the opportunities for using AI techniques in content creation. We will take this as an opportunity to demonstrate to ML Researchers the potential of Unity as a platform for AI research/development. This includes demonstrating the industry potential of Unity as a simulation platform for robotics and self-driving cars. And finally, we want to get Hobbyists/Students excited about both Unity and ML.

Over the next few months, we hope to use this space to start discussions and build a community around these concepts and use-cases of Unity. Multiple members of the Unity ML team and other related teams within Unity will post here discussing the different connections between Unity and Machine Learning. Whenever possible, we will release open source tools, videos, and example projects to help the different groups mentioned above utilize the ideas, algorithms, and methods we have shared. We will be monitoring this space closely, and we encourage the Unity community to contribute to the commentary as well.

Why Machine Learning?

To begin the conversation, we want to spend this first entry talking specifically about the relationship between ML and game AI. Most game AI that currently exists is hand-coded, consisting of decision trees with sometimes up to thousands of rules, all of which must be maintained by hand and thoroughly tested. In contrast, ML relies on algorithms which can make sense of raw data, without the need for an expert to define how to interpret that data.

Take for example the computer vision problem of classifying the content of an image. Until a few years ago, experts would write filters by hand that would extract useful features for classifying an image as containing a cat or a dog. In contrast, ML, and in particular the newer Deep Learning approaches, needs only the images and class labels, and learns the useful features automatically. We believe that this automated learning can help simplify and speed up the process of creating games for developers both big and small, in addition to opening up the possibilities of the Unity platform being used in a wider array of contexts, such as simulations of ML scenarios.

This automated learning can be applied specifically to game agent behavior, a.k.a. NPCs. We can use Reinforcement Learning (RL) to train agents to estimate the value of taking actions within an environment. Once they have been trained, these agents can take actions to receive the most value, without ever having to be explicitly programmed how to act. The rest of this post consists of a simple introduction to Reinforcement Learning (RL) and a walkthrough of how to implement a simple RL algorithm in Unity! And of course, all the code used in this post is available in the Github repository here. You can also access a WebGL demo.

Reinforcement Learning with Bandits

As mentioned above, a core concept behind RL is the estimation of value, and acting on that value estimate. Before going further, it will be helpful to introduce some terminology. In RL, what performs the acting is called an agent, and what it uses to make decisions about its actions is called a policy. An agent is always embedded within an environment, and at any given moment the agent is in a certain state. From that state, it can take one of a set of actions. The value of a given state refers to how ultimately rewarding it is to be in that state. Taking an action in a state can bring an agent to a new state, provide a reward, or both. The total cumulative reward is what all RL agents try to maximize over time.

The simplest version of an RL problem is called the multi-armed bandit. This name is derived from the problem of optimizing pay-out across multiple slot machines, also referred to as “single-arm bandits” given their propensity for stealing quarters from their users. In this set-up, the environment consists of only a single state, and the agent can take one of n actions. Each action provides an immediate reward to the agent. The agent’s goal is to learn to pick the action that provides the greatest reward.

To make this a little more concrete, let’s imagine a scenario within a dungeon-crawler game. The agent enters a room, and finds a number of chests lined up along the wall. Each of these chests has a certain probability of containing either a diamond (reward +1) or an enemy ghost (reward -1).

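As a concrete reference for the rest of the walkthrough, here is a minimal C# sketch of such a chest room. It is not the code from the Github repository; the class name and probabilities are illustrative assumptions.

```csharp
using System;

// A minimal sketch of the chest room described above: each chest has its own
// probability of holding a diamond (+1) rather than a ghost (-1).
// The probabilities are supplied by the caller and are purely illustrative.
public class ChestRoom
{
    private readonly float[] diamondProbabilities; // one probability per chest
    private readonly Random rng = new Random();

    public ChestRoom(float[] diamondProbabilities)
    {
        this.diamondProbabilities = diamondProbabilities;
    }

    public int NumChests => diamondProbabilities.Length;

    // Opening a chest is the agent's action; the reward is +1 or -1.
    public float OpenChest(int chestIndex)
    {
        return rng.NextDouble() < diamondProbabilities[chestIndex] ? 1f : -1f;
    }
}
```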

The goal of the agent is to learn which chest is the most likely to have the diamond (say, for example, third from the right). The natural way to discover which chest is the most rewarding is to try each of the chests out. Indeed, until the agent has learned enough about the world to act optimally, much of RL consists of simple trial and error. Bringing the example above back to the RL lingo, the “trying out” of each chest corresponds to taking a series of actions (opening each chest multiple times), and the learning corresponds to updating an estimate of the value of each action. Once we are reasonably certain about our value estimations, we can then have the agent always pick the chest with the highest estimated value.

These value estimates can be learned using an iterative process in which we start with an initial series of estimates V(a), and then adjust them each time we take an action and observe the result. Formally, this is written as:

V(a) ← V(a) + α * (r − V(a))

where α is a small learning rate and r is the reward observed after taking action a.

Intuitively, the above equation is stating that we adjust our current value estimate a little bit in the direction of the obtained reward. In this way we ensure we are always changing our estimate to better reflect the true dynamics of the environment. In doing so, we also ensure that our estimates don’t become unreasonably large, as might happen if we simply counted positive outcomes. We can accomplish this in code by keeping a vector of value estimates, and referencing them with the index of the action our agent took.

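Here is one way that vector of value estimates might look in C#. This is a rough sketch rather than the demo project’s implementation; the class and member names, and the learning rate of 0.1, are assumptions made for illustration.

```csharp
using System;

// A minimal sketch of the incremental value update described above.
public class SimpleBandit
{
    private readonly float[] valueEstimates; // one estimate V(a) per action (chest)
    private readonly float learningRate;

    public SimpleBandit(int numActions, float learningRate = 0.1f)
    {
        valueEstimates = new float[numActions]; // starts at 0 for every action
        this.learningRate = learningRate;
    }

    // Nudge the estimate for the chosen action toward the observed reward:
    // V(a) <- V(a) + alpha * (r - V(a))
    public void UpdateEstimate(int action, float reward)
    {
        valueEstimates[action] += learningRate * (reward - valueEstimates[action]);
    }

    public float GetEstimate(int action)
    {
        return valueEstimates[action];
    }
}
```

Paired with the ChestRoom sketch above, a training loop is simply: pick a chest, call OpenChest to observe the reward, and pass that reward to UpdateEstimate for the chosen chest index.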

Contextual Bandits

The situation described above lacks one important aspect of any realistic environment: it only has a single state. In reality (and in any game world), a given environment can have anywhere from dozens (think rooms in a house) to billions (pixel configurations on a screen) of possible states. Each of these states can have its own unique dynamics in terms of how actions provide new rewards or enable movement between states. As such, we need to condition our actions, and by extension our value estimates, on the state as well. Notationally, we will now use Q(s, a) instead of just V(a). Abstractly, this means that the reward we expect to receive is now a function of both the action we take and the state we were in when taking that action. In our dungeon game, the concept of state enables us to have different sets of chests in different rooms. Each of these rooms can have a different ideal chest, and as such our agent needs to learn to pick different actions in different rooms. We can accomplish this in code by keeping a matrix of value estimates, instead of simply a vector. This matrix can be indexed with [state, action].

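A minimal sketch of that matrix in C# might look like the following. Again, this is illustrative rather than the code from the repository; the names and the learning rate are assumptions.

```csharp
using System;

// A minimal sketch of the contextual-bandit extension described above:
// value estimates become a matrix Q[state, action] instead of a vector.
public class ContextualBandit
{
    private readonly float[,] qEstimates; // Q(s, a), indexed by [state, action]
    private readonly float learningRate;

    public ContextualBandit(int numStates, int numActions, float learningRate = 0.1f)
    {
        qEstimates = new float[numStates, numActions];
        this.learningRate = learningRate;
    }

    // Same incremental update as before, but conditioned on the state (room):
    // Q(s, a) <- Q(s, a) + alpha * (r - Q(s, a))
    public void UpdateEstimate(int state, int action, float reward)
    {
        qEstimates[state, action] += learningRate * (reward - qEstimates[state, action]);
    }

    public float GetEstimate(int state, int action)
    {
        return qEstimates[state, action];
    }
}
```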

Exploring and Exploiting

There is one more important piece of the puzzle to getting RL to work. Before our agent has learned a policy for taking the most rewarding actions, it needs a policy that will allow it to learn enough about the world to be sure it knows what optimal actually is. This presents us with the classic dilemma of how to balance exploration (learning about the environment’s value structure through trial and error) and exploitation (acting on the environment’s learned value structure). Sometimes these two goals line up, but often they do not. There are a number of strategies for balancing these two goals. Below we have outlined a few approaches.

  • One simple, yet powerful strategy follows the principle of “optimism in the face of uncertainty.” The idea here is that the agent starts with high value estimates V(a) for each action, so that acting greedily (taking the action with the maximum value) will lead the agent to explore each of the actions at least once. If the action didn’t lead to a good reward, the value estimate will decrease accordingly, but if it did, then the value estimate will remain high, as that action might be a good candidate to try again in the future. By itself though, this heuristic is often not enough, since we might need to keep exploring a given state to find an infrequent, but large reward.

  • Another strategy is to add random noise to the value estimates for each action, and then act greedily based on these new noisy estimates. With this approach, as long as the noise is less than the difference between the true optimal action and the other actions, it should converge to optimal value estimates.

  • We could also go one step further and take advantage of the nature of the value estimates themselves by normalizing them, and taking actions probabilistically. In this case if the value estimates for each action were roughly equal, then we would take actions with equal probability. On the flip side, if one action had a much greater value estimate, then we would pick it more often. By doing this we slowly weed out unrewarding actions by taking them less and less. This is the strategy we use in the demo project; a rough sketch of one way to implement it follows after this list.

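As a rough illustration of that third strategy, the sketch below turns a set of value estimates into a probability distribution (a softmax, one common way to normalize estimates) and samples an action from it. The class name and the temperature parameter are assumptions for illustration; this is not the demo project’s code.

```csharp
using System;
using System.Linq;

// A minimal sketch of probabilistic (softmax-style) action selection.
public static class SoftmaxPolicy
{
    private static readonly Random rng = new Random();

    // Turn value estimates into a probability distribution and sample an action.
    public static int ChooseAction(float[] valueEstimates, float temperature = 1.0f)
    {
        // Exponentiate (scaled) estimates so larger values get larger weights.
        double[] weights = valueEstimates
            .Select(v => Math.Exp(v / temperature))
            .ToArray();
        double total = weights.Sum();

        // Sample an action index in proportion to its weight.
        double sample = rng.NextDouble() * total;
        double cumulative = 0.0;
        for (int a = 0; a < weights.Length; a++)
        {
            cumulative += weights[a];
            if (sample <= cumulative)
                return a;
        }
        return weights.Length - 1; // fallback for floating-point rounding
    }
}
```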

Going Forward

With this blog and the accompanying code, you should now have all the pieces needed to start working with multi-armed and contextual bandits in Unity. This is all just the beginning. In a follow-up post we will go through Q-learning in full RL problems, and from there start to tackle learning policies for increasingly complex agent behavior in visually rich game environments using deep neural networks. Using these more advanced methods, it is possible to train agents which can serve as companions or opponents in genres ranging from fighting and driving games to first-person shooters, or even real-time strategy games. All without writing rules, focusing instead on what you want the agent to achieve rather than how to achieve it.

In the next few postings we will also be providing an early release of tools to allow researchers interested in using Unity for Deep RL research to connect their models written with frameworks such as Tensorflow or PyTorch to environments made in Unity. On top of all that we have a lot more planned for this year beyond agent behavior, and we hope the community will join us as we explore the uncharted territory that is the future of how games are made!

You can read the second part of this blog series here.

Translated from: https://blogs.unity3d.com/2017/06/26/unity-ai-themed-blog-entries/
