On Reinforcement Learning for Full-length Game of StarCraft

Research Topic

Using RL to solve the full-length game of StarCraft poses several challenges: a huge state space, a varying action space, a long horizon, etc.

The paper investigates a hierarchical approach that involves two levels of abstraction:

  • macro-actions
    The macro-actions are extracted from experts' demonstration trajectories, which reduces the action space by an order of magnitude while remaining effective.
  • two-layer hierarchical architecture
    The two-layer architecture is modular and easy to scale.

The main contributions of this paper are as follows:

  • We investigate a hierarchical architecture which makes the large-scale SC2 problem easier to handle.
  • A simple yet effective training algorithm for this architecture is also presented.
  • We study in detail the impact of different training settings on our architecture.
  • Experimental results on SC2LE show that our method achieves state-of-the-art results.

Method

Overall Architecture

Hierarchical Architecture

Two kinds of policies (a controller and sub-policies) run on different timescales.
The controller chooses a sub-policy based on the current global observation at a long time interval, and the chosen sub-policy picks a macro-action at a short time interval.

The whole process (a minimal code sketch follows the list):

  1. At time $t_c$, the controller receives its own global observation $s_{t_c}^{c}$ and chooses a sub-policy $i$ based on this state:
    $a_{t_c}^{c} = \Pi(s_{t_c}^{c}), \quad s_{t_c}^{c} \in S_c$
  2. The controller then waits for $K$ time units while the $i$-th sub-policy makes its moves. At its current time $t_i$, with local observation $s_{t_i}^{i}$, the sub-policy gets the macro-action:
    $a_{t_i}^{i} = \pi_i(s_{t_i}^{i})$
  3. After the $i$-th sub-policy executes the macro-action $a_{t_i}^{i}$ in the game, it receives a reward and its next local observation; the tuple $(s_{t_i}^{i}, a_{t_i}^{i}, r_{t_i}^{i}, s_{t_{i+1}}^{i})$ is stored for future training.
    $r_{t_i}^{i} = R_i(s_{t_i}^{i}, a_{t_i}^{i})$
  4. After $K$ moves, control returns to the controller, which waits for its next decision. At the same time, the controller receives the return of the chosen sub-policy $\pi_i$ and computes the reward of its action $a_{t_c}^{c}$ as follows:
    $r_{t_c}^{c} = r_{t_i}^{i} + r_{t_{i+1}}^{i} + \dots + r_{t_{i+K-1}}^{i}$
    The controller also receives the next global state $s_{t_{c+1}}^{c}$, and the tuple is stored in its local buffer.
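
The interaction above can be condensed into a short loop. Below is a minimal Python sketch of this two-timescale process; the environment, the random policy functions, the buffer layout, and the constants (`K`, the number of sub-policies, the macro-action set) are stand-in assumptions for illustration, not the paper's implementation.

```python
import random

# Minimal sketch of the two-timescale decision loop described above.
# Everything here (env_step, the random policies, K, the buffer layout)
# is a stand-in assumption, not the paper's implementation.

K = 8                            # sub-policy steps per controller decision (assumed)
NUM_SUB_POLICIES = 4             # number of sub-policies (assumed)
MACRO_ACTIONS = list(range(10))  # placeholder macro-action space A^eta

def controller_policy(global_obs):
    """Pi: choose a sub-policy index from the global observation (random stub)."""
    return random.randrange(NUM_SUB_POLICIES)

def sub_policy(i, local_obs):
    """pi_i: choose a macro-action from the local observation (random stub)."""
    return random.choice(MACRO_ACTIONS)

def env_step(macro_action):
    """Execute one macro-action; return (reward, next local obs, next global obs)."""
    return random.random(), {"local": macro_action}, {"global": macro_action}

controller_buffer = []
sub_buffers = [[] for _ in range(NUM_SUB_POLICIES)]
global_obs, local_obs = {"global": None}, {"local": None}

for t_c in range(3):                        # a few controller decisions
    a_c = controller_policy(global_obs)     # step 1: pick sub-policy i
    r_c = 0.0
    for t_i in range(K):                    # steps 2-3: sub-policy acts for K steps
        a_i = sub_policy(a_c, local_obs)
        r_i, next_local_obs, next_global_obs = env_step(a_i)
        sub_buffers[a_c].append((local_obs, a_i, r_i, next_local_obs))
        local_obs = next_local_obs
        r_c += r_i                          # step 4: accumulate r^c over K steps
    controller_buffer.append((global_obs, a_c, r_c, next_global_obs))
    global_obs = next_global_obs
```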

Advantages of this hierarchical architecture (a numerical illustration follows the list):

  • Each sub-policy and the high-level controller have their own, different state spaces.
  • The hierarchical structure also splits the tremendous action space A.
  • The hierarchical architecture effectively reduces the decision horizon (execution step size) of the strategy.
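
As a rough numerical illustration of the last two points (all numbers here are assumptions, not figures from the paper): suppose a full game lasts $T$ atomic steps, the controller acts every $K$ steps, and it chooses among $n$ sub-policies, each with its own macro-action set $A^{\eta}_i$.

```latex
% Illustrative only: T, K, n and |A^eta_i| are assumed values.
% Horizon: the controller makes T/K decisions instead of T.
%   T = 10000, K = 8  ->  T/K = 1250 controller decisions.
% Action space: each policy picks from at most max(n, max_i |A^eta_i|)
% options instead of the full atomic action space A.
%   n = 4, |A^eta_i| = 25  ->  at most 25 choices per decision, versus
%   hundreds of raw SC2 actions, each with large argument spaces.
\[
  T \;\longrightarrow\; \frac{T}{K},
  \qquad
  |A| \;\longrightarrow\; \max\bigl(n,\ \max_i |A^{\eta}_i|\bigr) \ll |A|.
\]
```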

Generation of Macro-actions

The generation process of macro-actions is as follows (a code sketch is given after the list):

  1. We collect expert trajectories, which are sequences of operations $a \in A$, from game replays.
  2. The PrefixSpan algorithm is used to mine the relationships between operations and combine related operations into action sequences $a^{seq}$ of maximum length $C$, forming a set $A^{seq}$ defined as
    $A^{seq} = \{ a^{seq} = (a_0, a_1, a_2, \dots, a_i) \mid a_i \in A \ \text{and} \ i \leqslant C \}$
  3. We sort this set by $\mathrm{frequency}(a^{seq})$.
  4. We remove duplicated and meaningless sequences and keep the top $K$ ones. "Meaningless" refers to sequences such as continuous selection or camera movement.
  5. The reduced set is marked as the newly generated macro-action space $A^{\eta}$.
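
A minimal, self-contained sketch of this pipeline is shown below. The toy trajectories, the constants `C` and `TOP_K`, and the list of "meaningless" operations are assumptions for illustration, and a plain contiguous-subsequence frequency count stands in for the actual PrefixSpan mining step.

```python
from collections import Counter

# Simplified sketch of the macro-action generation pipeline.
# Real replays would yield long operation sequences; here we use toy data,
# and a plain contiguous-subsequence count stands in for PrefixSpan.

C = 3          # maximum macro-action length (assumed value)
TOP_K = 4      # number of macro-actions to keep (assumed value)
MEANINGLESS = {"select", "move_camera"}   # operations to filter out (assumed)

# 1. Expert trajectories: sequences of atomic operations a in A.
trajectories = [
    ["select", "build_pylon", "build_gateway", "train_zealot", "move_camera"],
    ["build_pylon", "build_gateway", "train_zealot", "attack"],
    ["select", "build_pylon", "build_gateway", "train_zealot"],
]

# 2-3. Mine action sequences of length <= C and count their frequency.
counts = Counter()
for traj in trajectories:
    for length in range(2, C + 1):
        for start in range(len(traj) - length + 1):
            counts[tuple(traj[start:start + length])] += 1

# 4. Drop meaningless sequences (e.g. pure selection / camera movement),
#    then keep the TOP_K most frequent ones (Counter already deduplicates).
candidates = [
    (seq, freq) for seq, freq in counts.most_common()
    if not all(op in MEANINGLESS for op in seq)
]

# 5. The remaining sequences form the macro-action space A^eta.
macro_actions = [seq for seq, _ in candidates[:TOP_K]]
print(macro_actions)
```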


Training Algorithm

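The original post shows the training algorithm only as a figure, which is not reproduced here. The skeleton below is a rough, assumption-laden sketch of how the stored transitions could be consumed: each sub-policy is updated from its own local buffer and the controller from its buffer of $(s^{c}, a^{c}, r^{c}, s_{t_{c+1}}^{c})$ tuples, with a generic policy-optimization step (e.g. a PPO-style update) as a placeholder.

```python
# Structural sketch only; the update rule, iteration count, and buffer
# contents are assumptions, not the paper's training algorithm.

def update(policy_name, buffer):
    """Placeholder for one policy-optimization step (e.g. PPO-style) on a buffer."""
    print(f"update {policy_name} with {len(buffer)} transitions")
    buffer.clear()

NUM_ITERATIONS = 2
controller_buffer = []                     # filled by the collection loop sketched earlier
sub_buffers = [[] for _ in range(4)]       # one local buffer per sub-policy

for it in range(NUM_ITERATIONS):
    # 1. Collect trajectories with the two-timescale loop sketched earlier,
    #    filling controller_buffer and sub_buffers.
    # 2. Update every sub-policy from its own local buffer.
    for i, buf in enumerate(sub_buffers):
        update(f"sub-policy {i}", buf)
    # 3. Update the controller from its own buffer.
    update("controller", controller_buffer)
```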
