Background
- Played on both indoor- and outdoor-themed maps that are randomly generated for each game.
- The team with the greatest number of flag captures after five minutes wins.
- Agents can tag opponents by activating their laser gadget while pointing at them; a tagged opponent is sent back to their base room after a short delay, a process known as respawning.
- Agents interact with the environment and with other agents only through their observations and actions (moving forward and backward; strafing left and right; looking by rotating; jumping; and tagging).
- Agents do not have access to models of the environment, the state of other players, or human policy priors, nor can they communicate with each other outside of the game environment.
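The factored action interface described above can be sketched as a small discretized action space. The bin values and field names below are illustrative assumptions for this sketch, not the paper's exact settings.

```python
from dataclasses import dataclass

# Hypothetical discretization of the factored action space described above.
# The exact bins are illustrative assumptions, not the paper's settings.
@dataclass(frozen=True)
class Action:
    move: int      # -1 = backward, 0 = none, 1 = forward
    strafe: int    # -1 = left, 0 = none, 1 = right
    look_yaw: int  # rotation step: -1, 0, or 1
    jump: bool
    tag: bool      # fire the laser gadget

def action_space():
    """Enumerate every joint action the agent could emit per step."""
    return [
        Action(m, s, y, j, t)
        for m in (-1, 0, 1)
        for s in (-1, 0, 1)
        for y in (-1, 0, 1)
        for j in (False, True)
        for t in (False, True)
    ]

print(len(action_space()))  # 3 * 3 * 3 * 2 * 2 = 108
```

Even this coarse discretization yields over a hundred joint actions per step, which is why the paper treats learning purely from pixels and reward as a hard problem.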
Learning system
- The proposed method is based purely on end-to-end learning and generalization.
- Input is raw RGB images; the objective is to maximize cumulative reward.
- We used a multistep actor-critic policy gradient algorithm with off-policy correction and auxiliary tasks for RL.
- Actions in this model were generated conditional on a stochastic latent variable, whose distribution was modulated by a more slowly evolving prior process.
- Parameters are optimized both to maximize expected reward and to keep the two timescales consistent with each other.
- The emphasis is not on explicitly constructing hierarchical goals and subroutines, but on learning hierarchical temporal representations and recurrent latent-variable models of sequential data, without imposing an explicit hierarchy.
- The agent is trained to win, hence the name “For The Win” and the “FTW agent.”
- Self-play RL, in which an agent is trained by playing against its own policy. Although self-play variants can prove effective in some multiagent games, these methods can be unstable and in their basic form do not support concurrent training, which is crucial for scalability.
- A population of agents is trained in parallel, with different agents playing against one another.
- Matches are generated by a stochastic matchmaking scheme that pairs agents of similar skill, estimated with Elo scores computed from training outcomes.
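The Elo-based stochastic matchmaking bullet can be illustrated with a minimal sketch. This assumes standard Elo updates and a softmax-style similarity weighting; the exact rating and sampling scheme used for FTW is not reproduced here.

```python
import math
import random

def expected_score(r_a, r_b):
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_elo(r_a, r_b, score_a, k=32):
    """Update both ratings after a game; score_a is 1 (win), 0.5 (draw), or 0."""
    e_a = expected_score(r_a, r_b)
    r_a += k * (score_a - e_a)
    r_b += k * ((1.0 - score_a) - (1.0 - e_a))
    return r_a, r_b

def sample_opponent(agent, ratings, temperature=100.0):
    """Stochastically pick an opponent, favouring similar Elo ratings.

    `temperature` (an illustrative assumption) controls how strongly
    matchmaking prefers close-skill opponents.
    """
    others = [a for a in ratings if a != agent]
    weights = [math.exp(-abs(ratings[agent] - ratings[o]) / temperature)
               for o in others]
    return random.choices(others, weights=weights, k=1)[0]

ratings = {"agent0": 1000.0, "agent1": 1050.0, "agent2": 1400.0}
opp = sample_opponent("agent0", ratings)  # usually "agent1", the closer match
```

Pairing agents of similar skill keeps each game informative for learning, rather than letting a strong agent trivially dominate a weak one.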
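The two-timescale latent structure described above (a fast stochastic latent modulated by a slowly evolving prior) can be sketched in a few lines of numpy. The shapes, update rule, and update period below are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

class TwoTimescaleLatent:
    """Fast latent z_t sampled each step from a Gaussian whose mean is a
    slow state that only updates every `period` steps (toy sketch)."""

    def __init__(self, dim=8, period=10):
        self.dim = dim
        self.period = period
        self.slow_state = np.zeros(dim)
        self.t = 0

    def step(self, obs_embedding):
        # Slow prior process: evolves on a coarser timescale.
        if self.t % self.period == 0:
            self.slow_state = np.tanh(self.slow_state + obs_embedding)
        self.t += 1
        # Fast latent: sampled every step, conditioned on the slow prior.
        return self.slow_state + 0.1 * rng.standard_normal(self.dim)

model = TwoTimescaleLatent()
zs = [model.step(rng.standard_normal(8)) for _ in range(30)]
```

In the real agent the action distribution is conditioned on the fast latent, so the slow process effectively sets longer-horizon behavioral context while the fast one handles moment-to-moment variation.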
Tournament evaluation
This highlights that even when the agent's reaction times are made more human-comparable, it still exhibits human-level performance.
Agent analysis
- The agent's internal state was probed with 200 binary game-state features such as “Do I have the flag?” and “Did I see my teammate recently?”
- Internal agent state clustered in accordance with conjunctions of high-level game-state features: flag status (my flag at base, opponent's flag at base), respawn state (agent is respawning), and agent location (agent in its own base or the opponent's base).
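The analysis above can be mimicked with a toy sketch: cluster internal agent states, then inspect how often each binary game-state feature is active within each cluster. The data, feature count, and plain k-means below are illustrative stand-ins for the paper's activations and analysis method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: internal states (e.g., recurrent activations) and binary
# game-state features per timestep. Shapes and values are illustrative.
n_steps, state_dim, n_features = 300, 16, 4
true_mode = rng.integers(0, 3, n_steps)              # hidden game situation
states = rng.standard_normal((n_steps, state_dim)) + 5.0 * true_mode[:, None]
features = rng.integers(0, 2, (n_steps, n_features))  # e.g. "have flag?"

def kmeans(x, k=3, iters=20):
    """Plain k-means over the internal states."""
    centers = x[rng.choice(len(x), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((x[:, None] - centers) ** 2).sum(-1), axis=1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = x[labels == c].mean(0)
    return labels

labels = kmeans(states)
# For each cluster, report the fraction of steps each binary feature is active.
for c in range(3):
    print(c, features[labels == c].mean(0).round(2))
```

If clusters align with feature conjunctions (as the paper reports for the FTW agent), the per-cluster feature frequencies will be sharply skewed toward 0 or 1 rather than hovering near chance.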
Conclusions
- Agents reached human-level performance in a popular multiplayer first-person video game. This was achieved by combining population-based training (PBT) of agents, internal reward optimization, and temporally hierarchical RL with scalable computational architectures.
- Limitations of the current framework, which should be addressed in future work, include the difficulty of maintaining diversity in agent populations, the greedy nature of the meta-optimization performed by PBT, and the variance from temporal credit assignment in the proposed RL updates.