Research Topic
Using RL to solve StarCraft faces several challenges: a huge state space, a varying action space, a long horizon, etc.
The paper studies a hierarchical approach that involves two levels of abstraction:
- macro-actions
The macro-actions are extracted from experts' demonstration trajectories; they reduce the action space by an order of magnitude yet remain effective.
- two-layer hierarchical architecture
The architecture is modular and easy to scale.
The main contributions of this paper are as follows:
- We investigate a hierarchical architecture which makes the large-scale SC2 problem easier to handle.
- A simple yet effective training algorithm for this architecture is also presented.
- We study in detail the impact of different training settings on our architecture.
- Experimental results on SC2LE show that our method achieves state-of-the-art results.
Method
Hierarchical Architecture
Two kinds of policies (a controller and sub-policies) run at different timescales.
The controller chooses a sub-policy based on the current observation at a long time interval, and the chosen sub-policy picks a macro-action at a short time interval.
The whole process:
- At time $t_{c}$, the controller gets its own global observation $s_{t_{c}}^{c}$ and chooses a sub-policy $i$ based on this state:
  $$a_{t_{c}}^{c} = \Pi(s_{t_{c}}^{c}), \quad s_{t_{c}}^{c} \in S_{c}$$
- The controller then waits for K time units while the $i$th sub-policy makes its moves. Assume the current time is $t_{i}$ and the sub-policy's local observation is $s_{t_{i}}^{i}$; it picks the macro-action
  $$a_{t_{i}}^{i} = \pi_{i}(s_{t_{i}}^{i})$$
- After the $i$th sub-policy executes the macro-action $a_{t_{i}}^{i}$ in the game, it gets a reward and its next local observation, and the tuple $(s_{t_{i}}^{i}, a_{t_{i}}^{i}, r_{t_{i}}^{i}, s_{t_{i+1}}^{i})$ is stored for future training, where
  $$r_{t_{i}}^{i} = R_{i}(s_{t_{i}}^{i}, a_{t_{i}}^{i})$$
- After K moves, control returns to the controller, which waits for its next decision. At the same time, the controller receives the return of the chosen sub-policy $\pi_{i}$ and computes the reward of its action $a_{t_{c}}^{c}$ as
  $$r_{t_{c}}^{c} = r_{t_{i}}^{i} + r_{t_{i+1}}^{i} + \ldots + r_{t_{i+K-1}}^{i}$$
Also, the controller will get the next global state $s_{t_{c+1}}^{c}$, and its tuple will be stored in its local buffer.
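A minimal sketch of this two-timescale loop is given below. The environment wrapper, the replay buffers, and the `act` / `step_macro` interfaces are hypothetical stand-ins used for illustration, not the paper's actual implementation.

```python
def run_episode(env, controller, sub_policies, K, controller_buffer, sub_buffers):
    """Hypothetical sketch of one episode of the two-timescale loop above."""
    s_c = env.global_observation()            # controller's global observation s^c_{t_c}
    done = False
    while not done:
        i = controller.act(s_c)               # a^c_{t_c} = Pi(s^c_{t_c}): pick sub-policy i
        r_c = 0.0                             # controller reward accumulated over K steps
        for _ in range(K):
            s_i = env.local_observation(i)    # sub-policy's local observation s^i_{t_i}
            a_i = sub_policies[i].act(s_i)    # macro-action a^i_{t_i} = pi_i(s^i_{t_i})
            r_i, done = env.step_macro(a_i)   # execute the macro-action, get r^i_{t_i}
            s_i_next = env.local_observation(i)
            sub_buffers[i].append((s_i, a_i, r_i, s_i_next))
            r_c += r_i                        # r^c_{t_c} = sum of the K sub-policy rewards
            if done:
                break
        s_c_next = env.global_observation()   # next global state s^c_{t_c+1}
        controller_buffer.append((s_c, i, r_c, s_c_next))
        s_c = s_c_next
```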
Advantages of this hierarchical architecture:
- Each sub-policy and the high-level controller have different state spaces.
- The hierarchical structure also splits the tremendous action space $A$.
- The hierarchical architecture effectively shortens the execution horizon of each policy: the controller only acts once every K time units, and each macro-action bundles several atomic operations.
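As a rough, hypothetical illustration of the last two points (the partition of the action space among sub-policies and the specific factors are assumptions, not figures from the paper): if the action space is split so that sub-policy $i$ only chooses among its own subset $A_{i} \subset A$, and the controller commits to one of $n$ sub-policies for $K$ time units, the decision problem roughly changes as

$$|A| \;\longrightarrow\; n \ \text{(controller)} \ \text{and} \ |A_{i}| \ \text{(active sub-policy)}, \qquad T \ \text{low-level decisions} \;\longrightarrow\; T/K \ \text{controller decisions}.$$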
Generation of Macro-actions
The generation process of macro-actions is as follows:
- We collect expert trajectories, which are sequences of operations $a \in A$, from game replays.
- We use the PrefixSpan algorithm to mine the relationships among the operations and combine related operations into action sequences $a^{seq}$ of maximum length $C$, constructing a set $A^{seq}$ defined as
  $$A^{seq} = \{\, a^{seq} = (a_{0}, a_{1}, a_{2}, ..., a_{i}) \mid a_{i} \in A \ \text{and} \ i \leqslant C \,\}$$
- We sort this set by $\mathrm{frequency}(a^{seq})$.
- We remove duplicated and meaningless sequences and keep the top K ones. Meaningless refers to sequences such as continuous unit selection or camera movement.
- The reduced set is marked as the newly generated macro-action space $A^{\eta}$.
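A simplified sketch of this mining-and-filtering pipeline is shown below. It counts contiguous operation subsequences directly instead of running the full PrefixSpan algorithm, and the replay format, function names, and the meaninglessness filter are assumptions made only for illustration.

```python
from collections import Counter

def mine_macro_actions(trajectories, C, top_k, is_meaningless):
    """Count contiguous operation subsequences of length <= C in expert
    trajectories, then keep the top_k most frequent meaningful ones.
    A simplified stand-in for the PrefixSpan-based mining described above."""
    counts = Counter()
    for traj in trajectories:                       # traj: list of operations a in A
        for start in range(len(traj)):
            for length in range(1, C + 1):
                if start + length > len(traj):
                    break
                counts[tuple(traj[start:start + length])] += 1

    # Sort candidate sequences by frequency, drop meaningless ones
    # (e.g. pure selection or camera-movement sequences), keep the top K.
    ranked = [seq for seq, _ in counts.most_common() if not is_meaningless(seq)]
    return ranked[:top_k]                           # the reduced macro-action set A^eta

# Toy usage (operation ids and the filter are purely illustrative):
replays = [[3, 7, 7, 2, 5], [3, 7, 2, 5, 5]]
macro_actions = mine_macro_actions(
    replays, C=3, top_k=10,
    is_meaningless=lambda seq: len(set(seq)) == 1)  # e.g. repeated identical ops
```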