Fan, Zhou, Rui Su, Weinan Zhang, and Yong Yu. "Hybrid Actor-Critic Reinforcement Learning in Parameterized Action Space". arXiv, May 30, 2019. http://arxiv.org/abs/1903.01344
Topic: the paper proposes an actor-critic network architecture for hybrid action spaces, i.e., actions that contain both discrete and continuous components.
Main problem addressed:
In some settings the action space is not flat but hierarchical: the agent first selects a discrete action and then chooses continuous parameters for that action.
However, action space could also have some hierarchical structure instead of being a flat set.
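As a concrete illustration (mine, not an example from the paper), such a parameterized action can be modeled as a discrete choice plus a parameter vector for that choice; the class `ParameterizedAction` and the action names below are hypothetical:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class ParameterizedAction:
    """A hybrid action: one discrete choice plus continuous parameters for it."""
    discrete: int        # index of the chosen discrete action, e.g. 0 = "move", 1 = "kick"
    params: np.ndarray   # continuous parameters of that choice, e.g. (power, direction)


# The agent first picks the discrete action, then fills in its parameters.
kick = ParameterizedAction(discrete=1, params=np.array([0.7, 0.25]))
```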
Definition of the discrete and continuous actors:
The architecture contains two actor networks, a discrete actor network and a continuous actor network:
one discrete actor network learns a stochastic policy $\pi_{\theta_d}$ to select the discrete action $a$, and one continuous actor network learns a stochastic policy $\pi_{\theta_c}$ to choose the continuous parameters $x_{a_1}, x_{a_2}, \ldots, x_{a_k}$ for all discrete actions.
Critic network: the critic learns the state-value function $V(s)$ rather than the state-action value function $Q(s, a)$.
The critic therefore provides the advantage estimates used to update both actors.
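A minimal PyTorch sketch of this layout (my own illustration; the class name `HybridActorCritic`, the shared encoder, and all layer sizes are assumptions, not taken from the paper):

```python
import torch
import torch.nn as nn


class HybridActorCritic(nn.Module):
    """Sketch: a discrete actor head, a continuous actor head, and a V(s) critic head."""

    def __init__(self, state_dim: int, n_discrete: int, param_dim: int, hidden: int = 64):
        super().__init__()
        # Shared state encoder (an assumption; the paper's exact layout may differ).
        self.encoder = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh())
        # Discrete actor: k values, one per discrete action.
        self.discrete_head = nn.Linear(hidden, n_discrete)
        # Continuous actor: a Gaussian mean per parameter; log-std kept as free parameters.
        self.mean_head = nn.Linear(hidden, param_dim)
        self.log_std = nn.Parameter(torch.zeros(param_dim))
        # Critic: scalar state value V(s), not a state-action value Q(s, a).
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, state: torch.Tensor):
        h = self.encoder(state)
        return self.discrete_head(h), self.mean_head(h), self.log_std.exp(), self.value_head(h)
```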
3.1 Hybrid Proximal Policy Optimization
How the actor networks produce the discrete action and the continuous parameters:
The discrete actor network of H-PPO outputs $k$ values that define a probability distribution over the $k$ discrete actions, and the discrete action $a$ to take is randomly sampled from this distribution.
The continuous actor network of H-PPO generates the stochastic policy for continuous parameters by outputting the mean and variance of a Gaussian distribution for each of the parameters.
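A sketch of the sampling step under the same assumptions (it reuses the hypothetical `HybridActorCritic` module above; `torch.distributions` supplies the categorical and Gaussian distributions):

```python
import torch
from torch.distributions import Categorical, Normal


def sample_action(model: "HybridActorCritic", state: torch.Tensor):
    logits, mean, std, value = model(state)
    # Discrete action: sample from the distribution defined by the k outputs.
    dist_d = Categorical(logits=logits)
    a = dist_d.sample()
    # Continuous parameters: one Gaussian per parameter, sampled independently.
    dist_c = Normal(mean, std)
    x = dist_c.sample()
    # The two log-probs are kept separate, matching the separate policy updates.
    return a, x, dist_d.log_prob(a), dist_c.log_prob(x).sum(-1), value
```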
The discrete and continuous policies are updated separately:
The discrete policy and the continuous policy are updated separately by maximizing their respective clipped surrogate objectives.
The objective for the discrete policy is the standard PPO clipped surrogate,

$$L^{CLIP}(\theta_d) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta_d)\,\hat{A}_t,\ \mathrm{clip}\big(r_t(\theta_d),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\right)\right], \quad r_t(\theta_d) = \frac{\pi_{\theta_d}(a_t \mid s_t)}{\pi_{\theta_d}^{\mathrm{old}}(a_t \mid s_t)},$$

and the objective for the continuous policy takes the same clipped form with ratio $r_t(\theta_c) = \pi_{\theta_c}(x_{a_t} \mid s_t)\,/\,\pi_{\theta_c}^{\mathrm{old}}(x_{a_t} \mid s_t)$. In policy optimization, $\pi_{\theta_d}$ and $\pi_{\theta_c}$ are viewed as two separate distributions instead of a joint distribution.
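As a hedged sketch of the update, the same clipped-surrogate loss can be applied twice, once per policy; `clipped_surrogate` is a hypothetical helper, and the advantage estimates $\hat{A}_t$ are assumed to be computed elsewhere from the critic's $V(s)$ (e.g., with generalized advantage estimation):

```python
import torch


def clipped_surrogate(logp_new: torch.Tensor, logp_old: torch.Tensor,
                      advantage: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    """Standard PPO clipped surrogate, negated so it can be minimized as a loss."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    return -torch.min(ratio * advantage, clipped * advantage).mean()


# Separate losses for the two policies, sharing the same advantage estimates:
# loss_d = clipped_surrogate(logp_d, logp_d_old.detach(), adv)
# loss_c = clipped_surrogate(logp_c, logp_c_old.detach(), adv)
```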