RL + Hybrid Parameters: Hybrid Actor-Critic Reinforcement Learning in Parameterized Action Space

Fan, Zhou, Rui Su, Weinan Zhang, and Yong Yu. "Hybrid Actor-Critic Reinforcement Learning in Parameterized Action Space". arXiv, May 30, 2019. http://arxiv.org/abs/1903.01344

 

Topic: the paper proposes an actor-critic network architecture that handles hybrid action spaces, i.e. actions containing both continuous and discrete variables.

Main problem addressed:

In some settings the action space is not flat but has a hierarchical structure.

However, action space could also have some hierarchical structure instead of being a flat set. 

 

How discrete and continuous actions are handled:

The architecture contains two actor networks: a discrete actor network and a continuous actor network.

one discrete actor network learns a stochastic policy $\pi_{\theta_d}$ to select the discrete action $a$ and one continuous actor network learns a stochastic policy $\pi_{\theta_c}$ to choose the continuous parameters $x_{a_1}, x_{a_2}, \dots, x_{a_k}$ for all discrete actions.

Critic network: the critic estimates the state value function rather than the state-action value function.

The critic therefore provides the advantage estimates used to update both actors.
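A minimal PyTorch sketch of this two-actor plus single-critic layout, assuming a flat state vector and one continuous parameter per discrete action; the class name `HybridActorCritic`, the shared encoder, and the layer sizes are illustrative choices, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class HybridActorCritic(nn.Module):
    """Two actor heads (discrete + continuous) and a state-value critic."""

    def __init__(self, state_dim, k_discrete, param_dim, hidden=64):
        super().__init__()
        # Illustrative shared encoder feeding the two actor heads.
        self.encoder = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        # Discrete actor: k logits f_{a_1..a_k}, turned into softmax(f).
        self.discrete_head = nn.Linear(hidden, k_discrete)
        # Continuous actor: Gaussian mean for every continuous parameter;
        # the log-std is a learned state-independent vector in this sketch.
        self.mu_head = nn.Linear(hidden, param_dim)
        self.log_std = nn.Parameter(torch.zeros(param_dim))
        # Critic: state value V(s), used to form advantage estimates.
        self.critic = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state):
        h = self.encoder(state)
        logits = self.discrete_head(h)          # f for softmax(f)
        mu = self.mu_head(h)                    # Gaussian means
        std = self.log_std.exp().expand_as(mu)  # Gaussian stds
        value = self.critic(state).squeeze(-1)  # V(s)
        return logits, mu, std, value
```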

 

3.1 Hybrid Proximal Policy Optimization

 

How the actor networks produce the discrete and continuous actions:

The discrete actor network of H-PPO outputs $k$ values $f_{a_1}, f_{a_2}, \dots, f_{a_k}$ for the $k$ discrete actions, and the discrete action $a$ to take is randomly sampled from the $\mathrm{softmax}(f)$ distribution.

The continuous actor network of H-PPO generates the stochastic policy $\pi_{\theta_c}$ for continuous parameters by outputting the mean and variance of a Gaussian distribution for each of the parameters.
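A sketch of this sampling step, reusing the illustrative `HybridActorCritic` module from the snippet above: the discrete action is drawn from the $\mathrm{softmax}(f)$ categorical distribution and the continuous parameters from the per-parameter Gaussians; only the parameters attached to the chosen discrete action would actually be passed to the environment.

```python
import torch
from torch.distributions import Categorical, Normal

@torch.no_grad()
def select_action(model, state):
    """Sample a parameterized action (a, x) from the two actor heads."""
    logits, mu, std, value = model(state)
    # Discrete action a ~ softmax(f_{a_1}, ..., f_{a_k}).
    dist_d = Categorical(logits=logits)
    a = dist_d.sample()
    # Continuous parameters for all discrete actions; the environment only
    # uses the ones belonging to the selected discrete action a.
    dist_c = Normal(mu, std)
    x = dist_c.sample()
    return a, x, dist_d.log_prob(a), dist_c.log_prob(x).sum(-1), value
```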

The discrete and continuous policies are updated separately:

The discrete policy $\pi_{\theta_d}$ and the continuous policy $\pi_{\theta_c}$ are updated separately by minimizing their respective clipped surrogate objectives.

The objective for the discrete policy $\pi_{\theta_d}$ is the standard PPO clipped surrogate

$$L^{CLIP}(\theta_d) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta_d)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta_d),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right], \qquad r_t(\theta_d) = \frac{\pi_{\theta_d}(a_t \mid s_t)}{\pi_{\theta_d^{\mathrm{old}}}(a_t \mid s_t)}.$$

The objective for the continuous policy $\pi_{\theta_c}$ takes the same clipped form with the ratio

$$r_t(\theta_c) = \frac{\pi_{\theta_c}(x_t \mid s_t)}{\pi_{\theta_c^{\mathrm{old}}}(x_t \mid s_t)},$$

and both objectives use the advantage estimate $\hat{A}_t$ provided by the shared critic.

$\pi_{\theta_d}$ and $\pi_{\theta_c}$ are viewed as two separate distributions instead of a joint distribution in policy optimization.
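A rough sketch of this separate update: the two clipped surrogate losses are computed independently on the same advantage estimates, with the ratios clipped per policy. Here `eps` is the PPO clipping parameter, the "old" log-probabilities are assumed to have been stored during rollouts, and the negation turns each surrogate into a loss for a gradient-descent optimizer; the function name and arguments are illustrative.

```python
import torch

def hppo_policy_losses(logp_d_new, logp_d_old,
                       logp_c_new, logp_c_old, adv, eps=0.2):
    """Clipped surrogate losses for the discrete and continuous policies.

    Both policies share the advantage estimates from the single critic,
    but their probability ratios are formed and clipped independently.
    """
    # Ratio and clipped surrogate for the discrete policy pi_{theta_d}.
    ratio_d = torch.exp(logp_d_new - logp_d_old)
    loss_d = -torch.min(ratio_d * adv,
                        torch.clamp(ratio_d, 1 - eps, 1 + eps) * adv).mean()
    # Ratio and clipped surrogate for the continuous policy pi_{theta_c}.
    ratio_c = torch.exp(logp_c_new - logp_c_old)
    loss_c = -torch.min(ratio_c * adv,
                        torch.clamp(ratio_c, 1 - eps, 1 + eps) * adv).mean()
    return loss_d, loss_c
```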
