Training intelligent adversaries using self-play with ML-Agents

In the latest release of the ML-Agents Toolkit (v0.14), we have added a self-play feature that provides the capability to train competitive agents in adversarial games (as in zero-sum games, where one agent’s gain is exactly the other agent’s loss). In this blog post, we provide an overview of self-play and demonstrate how it enables stable and effective training on the Soccer demo environment in the ML-Agents Toolkit.
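
For concreteness, the zero-sum property in the parenthetical above can be written as a constraint on the two agents' per-step rewards. The notation here (agents A and B with rewards r_A and r_B) is our own shorthand, not something defined by the toolkit:

    r_A(s, a_A, a_B) = -r_B(s, a_A, a_B),   so   r_A + r_B = 0 at every step,

and any return gained by one agent is exactly the other agent's loss.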

The Tennis and Soccer example environments of the Unity ML-Agents Toolkit pit agents against one another as adversaries. Training agents in this type of adversarial scenario can be quite challenging. In fact, in previous releases of the ML-Agents Toolkit, reliably training agents in these environments required significant reward engineering. In version 0.14, we have enabled users to train agents in games via reinforcement learning (RL) from self-play, a mechanism fundamental to a number of the most high-profile results in RL, such as OpenAI Five and DeepMind’s AlphaStar. Self-play uses the agent’s current and past ‘selves’ as opponents. This provides a naturally improving adversary against which our agent can gradually improve using traditional RL algorithms. The fully trained agent can be used as competition for advanced human players.
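
To make the mechanism concrete, the sketch below shows the core idea of self-play opponent sampling: keep a window of frozen snapshots of the learning policy, and at the start of each episode pit the learner against either its current self or a randomly chosen past self. This is an illustrative Python sketch, not the ML-Agents implementation; the class, parameter names (window, play_against_current_ratio, save_steps), and the helpers in the commented usage loop (play_episode, learner.update) are hypothetical.

    import random
    import copy

    class SelfPlayOpponentPool:
        """Minimal sketch of the opponent sampling behind self-play.

        A fixed-size window of past policy snapshots serves as the pool of
        opponents; with some probability the learner plays its current self.
        """

        def __init__(self, window=10, play_against_current_ratio=0.5):
            self.window = window                     # max number of snapshots kept
            self.ratio = play_against_current_ratio  # chance of facing the live policy
            self.snapshots = []                      # frozen past "selves"

        def save_snapshot(self, policy):
            """Freeze a copy of the current policy and add it to the pool."""
            self.snapshots.append(copy.deepcopy(policy))
            if len(self.snapshots) > self.window:
                self.snapshots.pop(0)                # drop the oldest snapshot

        def sample_opponent(self, current_policy):
            """Pick an opponent: the current self or a random past self."""
            if not self.snapshots or random.random() < self.ratio:
                return current_policy
            return random.choice(self.snapshots)

    # Illustrative training loop (environment and learner are placeholders):
    #
    # pool = SelfPlayOpponentPool(window=10, play_against_current_ratio=0.5)
    # for step in range(total_steps):
    #     opponent = pool.sample_opponent(learner.policy)
    #     trajectory = play_episode(env, learner.policy, opponent)  # hypothetical helper
    #     learner.update(trajectory)                                # any standard RL update
    #     if step % save_steps == 0:
    #         pool.save_snapshot(learner.policy)

Because the pool of opponents improves whenever a new snapshot is saved, the learner always faces adversaries of roughly its own strength, which is what lets standard RL updates make gradual, stable progress in these adversarial environments.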

Self-play provides a learning environment analogous to how humans structure competition. For example, a human learning to play tennis would train against opponents of similar skill level because an opponent that is too strong or too weak is not as conducive to learning the game. From the standpoint of improving one’s skills, it would be far more valuable for a beginner-level tennis player to compete against other beginners than, say, against a newborn child or Novak Djokovic. The former couldn’t return the ball, and the latter wouldn’t serve them a ball they could return. When the beginner has achieved sufficient strength, they move on to the next tier of tournament play to compete with stronger opponents.  

In this blog post, we give some technical insight into the dynamics of self-play as well as provide an overview of our Tennis and Soccer example environments that have been refactored to showcase self-play.

History of self-play in games

The notion of self-play has a long history in the practice of building artificial agents to solve and compete with humans in games. One of the earliest uses of this technique was Arthur Samuel’s checkers-playing system, developed in the 1950s.
