Human-level performance in 3D multiplayer games with population-based reinforcement learning

Background

  1. The game is played on both indoor- and outdoor-themed maps that are randomly generated for each game.
  2. The team with the greatest number of flag captures after five minutes wins.
  3. Agents can tag opponents by activating their laser gadget when pointed at an opponent, which sends the opponent back to their base room after a short delay, known as respawning.
  4. Agents interact with the environment and with other agents only through their observations and actions (moving forward and backward, strafing left and right, looking by rotating, jumping, and tagging).
  5. Agents do not have access to models of the environment, the state of other players, or human policy priors, nor can they communicate with each other outside of the game environment.

Learning system

  1. The proposed method is based purely on end-to-end learning and generalization.
  2. The agent receives raw RGB pixel input, and its objective is to maximize cumulative reward.
  3. We used a multistep actor-critic policy gradient algorithm with off-policy correction and auxiliary tasks for RL (a sketch of this kind of off-policy correction follows the list).
  4. Actions in this model were generated conditional on a stochastic latent variable, whose distribution was modulated by a more slowly evolving prior process.
  5. The parameters are optimized both to maximize expected reward and to keep the latent variable's two timescales consistent with each other (a structural sketch of this two-timescale pattern follows the list).
  6. The emphasis is not on explicitly constructing hierarchical goals and skills, but on using hierarchy to build temporal representations: a recurrent latent-variable model of sequential data.
  7. The agent is named “For The Win,” giving us the “FTW agent.”
  8. One alternative training scheme is self-play RL, in which an agent is trained by playing against its own policy. Although self-play variants can prove effective in some multiagent games, these methods can be unstable and, in their basic form, do not support concurrent training, which is crucial for scalability.
  9. Instead, a population of diverse agents is trained in parallel, playing with and against one another.
  10. Matches are formed by randomly pairing agents of similar skill, estimated with Elo scores computed from training outcomes (see the matchmaking sketch after this list).
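
Item 3 refers to an off-policy correction in the actor-critic update. The sketch below assumes a V-trace-style correction in the spirit of IMPALA, which this line of work builds on; the function name, array layout, and clipping constants here are illustrative rather than the paper's implementation. It computes V-trace value targets and policy-gradient advantages for a single rollout from the importance ratios between the target and behaviour policies.

```python
import numpy as np

def vtrace_targets(behaviour_logp, target_logp, rewards, values, bootstrap_value,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """V-trace value targets and policy-gradient advantages for one rollout of length T."""
    rhos = np.exp(target_logp - behaviour_logp)           # importance ratios pi/mu
    clipped_rhos = np.minimum(rho_bar, rhos)
    cs = np.minimum(c_bar, rhos)
    values_tp1 = np.append(values[1:], bootstrap_value)   # V(x_{t+1})
    deltas = clipped_rhos * (rewards + gamma * values_tp1 - values)

    # Backward recursion: v_t - V(x_t) = delta_t + gamma * c_t * (v_{t+1} - V(x_{t+1})).
    acc = 0.0
    vs_minus_v = np.zeros(len(values))
    for t in reversed(range(len(rewards))):
        acc = deltas[t] + gamma * cs[t] * acc
        vs_minus_v[t] = acc
    vs = vs_minus_v + values                               # value-function regression targets

    # Advantages for the policy-gradient term use the next step's V-trace target.
    vs_tp1 = np.append(vs[1:], bootstrap_value)
    pg_advantages = clipped_rhos * (rewards + gamma * vs_tp1 - values)
    return vs, pg_advantages
```

The critic regresses its value estimates toward `vs`, and the policy gradient weights action log-probabilities by `pg_advantages`; clipping the importance ratios keeps the off-policy correction low-variance.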
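
Items 4 and 5 describe the temporally hierarchical policy: a slowly evolving prior over a stochastic latent variable, a fast per-step posterior, and a consistency term between the two timescales. The following is a structural sketch only, with made-up shapes, update rules, and constants chosen to show where the pieces sit; it is not the paper's recurrent architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between diagonal Gaussians."""
    return 0.5 * np.sum(
        logvar_p - logvar_q
        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
        - 1.0)

def rollout(T=20, tau=5, z_dim=4, n_actions=6):
    """Structural sketch of a two-timescale latent policy (illustrative numbers)."""
    policy_w = rng.normal(size=(z_dim, n_actions))    # stand-in for the policy head
    mu_p = np.zeros(z_dim)                            # slow prior mean
    logvar_p = np.zeros(z_dim)                        # slow prior log-variance
    kl_total, actions = 0.0, []
    for t in range(T):
        if t % tau == 0:
            # Slow core ticks every tau steps and refreshes the prior over z.
            mu_p = 0.9 * mu_p + 0.1 * rng.normal(size=z_dim)
        # Fast core runs every step: posterior over z given the new observation.
        obs_feat = rng.normal(size=z_dim)             # stand-in for encoded pixels
        mu_q = mu_p + 0.1 * obs_feat
        logvar_q = logvar_p - 0.5
        z = mu_q + np.exp(0.5 * logvar_q) * rng.normal(size=z_dim)
        # Consistency term between the two timescales (added to the overall loss).
        kl_total += gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)
        # Action is sampled conditional on the latent.
        logits = z @ policy_w
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        actions.append(int(rng.choice(n_actions, p=probs)))
    return actions, kl_total / T

acts, mean_kl = rollout()
print(len(acts), round(mean_kl, 3))
```

In the actual agent, the slow and fast cores are recurrent networks and the consistency term is weighted into the loss alongside the RL objective.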
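
Items 9 and 10 describe population-based matchmaking by skill. Below is a minimal sketch of Elo-based opponent sampling; the win-probability window (`low`, `high`) and the fallback rule are illustrative assumptions, not the paper's exact matchmaking scheme.

```python
import numpy as np

def elo_win_prob(elo_a, elo_b):
    # Standard Elo logistic model for P(A beats B).
    return 1.0 / (1.0 + 10.0 ** ((elo_b - elo_a) / 400.0))

def sample_opponent(agent, elos, rng, low=0.3, high=0.7):
    """Pick an opponent whose predicted win probability against `agent`
    falls in [low, high]; fall back to the closest-to-even match."""
    probs = np.array([elo_win_prob(elos[agent], e) for e in elos])
    probs[agent] = -1.0                               # never match an agent with itself
    candidates = np.where((probs >= low) & (probs <= high))[0]
    if len(candidates) == 0:
        candidates = [int(np.argmin(np.abs(probs - 0.5)))]
    return int(rng.choice(candidates))

rng = np.random.default_rng(0)
population_elos = np.array([1000.0, 1150.0, 1200.0, 980.0, 1100.0])
print(sample_opponent(0, population_elos, rng))
```

Keeping matches between agents of similar estimated skill means games stay informative for learning while the population trains concurrently.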

Tournament evaluation

Even with reaction times made more comparable to those of humans, the agent exhibits human-level performance.

Agent analysis

  1. The analysis uses 200 binary game-state features such as “Do I have the flag?” and “Did I see my teammate recently?”
  2. The agent's internal state clusters in accordance with conjunctions of high-level game-state features: flag status (whether each team's flag is at its base), respawn state (whether the agent is currently respawning), and agent location (whether the agent is in its own base or the opponent's base). A minimal sketch of this kind of clustering analysis follows this list.
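
As a minimal sketch of the kind of correspondence described in item 2, the code below clusters recorded internal states and checks how well the clusters line up with conjunctions of a few binary game-state features. The arrays are synthetic stand-ins and the purity score is an illustrative measure; the paper's own analysis pipeline differs.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
T, H = 5000, 256
hidden = rng.normal(size=(T, H))              # stand-in for recorded recurrent states
flags = rng.integers(0, 2, size=(T, 3))       # e.g. [have_flag, respawning, in_home_base]

# Cluster the internal states, then check how well each cluster lines up with a
# conjunction of the binary features (the 8 possible combinations of 3 bits).
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(hidden)
conjunction = flags[:, 0] * 4 + flags[:, 1] * 2 + flags[:, 2]   # encode each 3-bit combination as 0..7

purity = 0.0
for c in range(8):
    members = conjunction[kmeans.labels_ == c]
    if len(members):
        purity += np.bincount(members, minlength=8).max()
print("cluster/feature agreement:", purity / T)
```

With random stand-in data the agreement stays near chance; the finding in the paper is that for the trained agent's real hidden states, this kind of agreement is high.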

Conclusions

  1. The FTW agent reaches human-level performance in a popular multiplayer first-person video game. This was achieved by combining PBT of agents, internal reward optimization, and temporally hierarchical RL with scalable computational architectures.
  2. Limitations of the current framework, which should be addressed in future work, include the difficulty of maintaining diversity in agent populations, the greedy nature of the meta-optimization performed by PBT, and the variance from temporal credit assignment in the proposed RL updates.