On Reinforcement Learning for Full-length Game of StarCraft

Research Topic

Using RL to solve the full-length game of StarCraft poses several challenges: a huge state space, a varying action space, a long horizon, etc.

The paper investigates a hierarchical approach that involves two levels of abstraction:

  • macro-actions
    The macro-actions are extracted from experts' demonstration trajectories, which reduces the action space by an order of magnitude while remaining effective.
  • two-layer hierarchical architecture
    The two-layer architecture is modular and easy to scale.

The main contributions of this paper are as follows:

  • We investigate a hierarchical architecture which makes the large-scale SC2 problem easier to handle.
  • A simple yet effective training algorithm for this architecture is also presented.
  • We study in detail the impact of different training settings on our architecture.
  • Experimental results on SC2LE show that our method achieves state-of-the-art results.

Method

Overall Architecture

Hierarchical Architecture

Two kinds of policies (a controller and sub-policies) run on different timescales.
The controller chooses a sub-policy based on the current global observation at a long time interval, and the chosen sub-policy picks a macro-action at a short time interval.

The whole process (a minimal code sketch follows the list):

  1. At time $t_c$, the controller receives its own global observation $s_{t_c}^{c}$ and chooses a sub-policy $i$ based on this state:
    $a_{t_c}^{c} = \Pi(s_{t_c}^{c}), \quad s_{t_c}^{c} \in S_c$
  2. The controller then waits for $K$ time units while the $i$-th sub-policy makes its moves. At its current time $t_i$, with local observation $s_{t_i}^{i}$, the sub-policy gets the macro-action:
    $a_{t_i}^{i} = \pi_i(s_{t_i}^{i})$
  3. After the $i$-th sub-policy executes the macro-action $a_{t_i}^{i}$ in the game, it receives a reward and its next local observation; the tuple $(s_{t_i}^{i}, a_{t_i}^{i}, r_{t_i}^{i}, s_{t_{i+1}}^{i})$ is stored for future training.
    $r_{t_i}^{i} = R_i(s_{t_i}^{i}, a_{t_i}^{i})$
  4. After $K$ moves, control returns to the controller, which waits for its next decision. At the same time, the controller receives the return of the chosen sub-policy $\pi_i$ and computes the reward of its action $a_{t_c}^{c}$ as follows:
    $r_{t_c}^{c} = r_{t_i}^{i} + r_{t_{i+1}}^{i} + \dots + r_{t_{i+K-1}}^{i}$
    The controller also receives the next global state $s_{t_{c+1}}^{c}$, and the tuple is stored in its local buffer.
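
The interaction above can be condensed into a short loop. Below is a minimal Python sketch of this two-timescale process; the environment, the random policy functions, the buffer layout, and the constants (`K`, the number of sub-policies, the macro-action set) are stand-in assumptions for illustration, not the paper's implementation.

```python
import random

# Minimal sketch of the two-timescale decision loop described above.
# Everything here (env_step, the random policies, K, the buffer layout)
# is a stand-in assumption, not the paper's implementation.

K = 8                            # sub-policy steps per controller decision (assumed)
NUM_SUB_POLICIES = 4             # number of sub-policies (assumed)
MACRO_ACTIONS = list(range(10))  # placeholder macro-action space A^eta

def controller_policy(global_obs):
    """Pi: choose a sub-policy index from the global observation (random stub)."""
    return random.randrange(NUM_SUB_POLICIES)

def sub_policy(i, local_obs):
    """pi_i: choose a macro-action from the local observation (random stub)."""
    return random.choice(MACRO_ACTIONS)

def env_step(macro_action):
    """Execute one macro-action; return (reward, next local obs, next global obs)."""
    return random.random(), {"local": macro_action}, {"global": macro_action}

controller_buffer = []
sub_buffers = [[] for _ in range(NUM_SUB_POLICIES)]
global_obs, local_obs = {"global": None}, {"local": None}

for t_c in range(3):                        # a few controller decisions
    a_c = controller_policy(global_obs)     # step 1: pick sub-policy i
    r_c = 0.0
    for t_i in range(K):                    # steps 2-3: sub-policy acts for K steps
        a_i = sub_policy(a_c, local_obs)
        r_i, next_local_obs, next_global_obs = env_step(a_i)
        sub_buffers[a_c].append((local_obs, a_i, r_i, next_local_obs))
        local_obs = next_local_obs
        r_c += r_i                          # step 4: accumulate r^c over K steps
    controller_buffer.append((global_obs, a_c, r_c, next_global_obs))
    global_obs = next_global_obs
```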

Advantages of this hierarchical architecture (a numerical illustration follows the list):

  • Each sub-policy and the high-level controller have their own, different state spaces.
  • The hierarchical structure also splits the tremendous action space A.
  • The hierarchical architecture effectively reduces the decision horizon (execution step size) of the strategy.
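
As a rough numerical illustration of the last two points (all numbers here are assumptions, not figures from the paper): suppose a full game lasts $T$ atomic steps, the controller acts every $K$ steps, and it chooses among $n$ sub-policies, each with its own macro-action set $A^{\eta}_i$.

```latex
% Illustrative only: T, K, n and |A^eta_i| are assumed values.
% Horizon: the controller makes T/K decisions instead of T.
%   T = 10000, K = 8  ->  T/K = 1250 controller decisions.
% Action space: each policy picks from at most max(n, max_i |A^eta_i|)
% options instead of the full atomic action space A.
%   n = 4, |A^eta_i| = 25  ->  at most 25 choices per decision, versus
%   hundreds of raw SC2 actions, each with large argument spaces.
\[
  T \;\longrightarrow\; \frac{T}{K},
  \qquad
  |A| \;\longrightarrow\; \max\bigl(n,\ \max_i |A^{\eta}_i|\bigr) \ll |A|.
\]
```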

Generation of Macro-actions

The generation process of macro-actions is as follows (a code sketch is given after the list):

  1. We collect expert trajectories, which are sequences of operations $a \in A$, from game replays.
  2. The PrefixSpan algorithm is used to mine the relationships between operations and combine related operations into action sequences $a^{seq}$ of maximum length $C$, forming a set $A^{seq}$ defined as
    $A^{seq} = \{ a^{seq} = (a_0, a_1, a_2, \dots, a_i) \mid a_i \in A \ \text{and} \ i \leqslant C \}$
  3. We sort this set by $\mathrm{frequency}(a^{seq})$.
  4. We remove duplicated and meaningless sequences and keep the top $K$ ones. "Meaningless" refers to sequences such as continuous selection or camera movement.
  5. The reduced set is marked as the newly generated macro-action space $A^{\eta}$.
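
A minimal, self-contained sketch of this pipeline is shown below. The toy trajectories, the constants `C` and `TOP_K`, and the list of "meaningless" operations are assumptions for illustration, and a plain contiguous-subsequence frequency count stands in for the actual PrefixSpan mining step.

```python
from collections import Counter

# Simplified sketch of the macro-action generation pipeline.
# Real replays would yield long operation sequences; here we use toy data,
# and a plain contiguous-subsequence count stands in for PrefixSpan.

C = 3          # maximum macro-action length (assumed value)
TOP_K = 4      # number of macro-actions to keep (assumed value)
MEANINGLESS = {"select", "move_camera"}   # operations to filter out (assumed)

# 1. Expert trajectories: sequences of atomic operations a in A.
trajectories = [
    ["select", "build_pylon", "build_gateway", "train_zealot", "move_camera"],
    ["build_pylon", "build_gateway", "train_zealot", "attack"],
    ["select", "build_pylon", "build_gateway", "train_zealot"],
]

# 2-3. Mine action sequences of length <= C and count their frequency.
counts = Counter()
for traj in trajectories:
    for length in range(2, C + 1):
        for start in range(len(traj) - length + 1):
            counts[tuple(traj[start:start + length])] += 1

# 4. Drop meaningless sequences (e.g. pure selection / camera movement),
#    then keep the TOP_K most frequent ones (Counter already deduplicates).
candidates = [
    (seq, freq) for seq, freq in counts.most_common()
    if not all(op in MEANINGLESS for op in seq)
]

# 5. The remaining sequences form the macro-action space A^eta.
macro_actions = [seq for seq, _ in candidates[:TOP_K]]
print(macro_actions)
```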


Training Algorithm

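The original post shows the training algorithm only as a figure, which is not reproduced here. The skeleton below is a rough, assumption-laden sketch of how the stored transitions could be consumed: each sub-policy is updated from its own local buffer and the controller from its buffer of $(s^{c}, a^{c}, r^{c}, s_{t_{c+1}}^{c})$ tuples, with a generic policy-optimization step (e.g. a PPO-style update) as a placeholder.

```python
# Structural sketch only; the update rule, iteration count, and buffer
# contents are assumptions, not the paper's training algorithm.

def update(policy_name, buffer):
    """Placeholder for one policy-optimization step (e.g. PPO-style) on a buffer."""
    print(f"update {policy_name} with {len(buffer)} transitions")
    buffer.clear()

NUM_ITERATIONS = 2
controller_buffer = []                     # filled by the collection loop sketched earlier
sub_buffers = [[] for _ in range(4)]       # one local buffer per sub-policy

for it in range(NUM_ITERATIONS):
    # 1. Collect trajectories with the two-timescale loop sketched earlier,
    #    filling controller_buffer and sub_buffers.
    # 2. Update every sub-policy from its own local buffer.
    for i, buf in enumerate(sub_buffers):
        update(f"sub-policy {i}", buf)
    # 3. Update the controller from its own buffer.
    update("controller", controller_buffer)
```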
