RL + Hybrid Parameters: Hybrid Actor-Critic Reinforcement Learning in Parameterized Action Space

Fan, Zhou, Rui Su, Weinan Zhang, and Yong Yu. "Hybrid Actor-Critic Reinforcement Learning in Parameterized Action Space". arXiv, May 30, 2019. http://arxiv.org/abs/1903.01344

 

Topic: the paper proposes an actor-critic network architecture that handles hybrid action spaces, i.e. actions containing both continuous and discrete variables.

Main problem addressed:

In some settings the action space is not flat but has a hierarchical structure.

However, action space could also have some hierarchical structure instead of being a flat set. 

 

How discrete and continuous actions are handled:

The architecture contains two actor networks: a discrete actor network and a continuous actor network.

one discrete actor network learns a stochastic policy $\pi_{\theta_d}$ to select the discrete action $a$ and one continuous actor network learns a stochastic policy $\pi_{\theta_c}$ to choose the continuous parameters $x_{a_1}, x_{a_2}, \dots, x_{a_k}$ for all discrete actions.

Critic network: the critic estimates the state value function rather than the state-action value function.

The critic therefore provides the advantage estimates used to update both actors.
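A minimal PyTorch sketch of this two-actor plus single-critic layout, assuming a flat state vector and one continuous parameter per discrete action; the class name `HybridActorCritic`, the shared encoder, and the layer sizes are illustrative choices, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class HybridActorCritic(nn.Module):
    """Two actor heads (discrete + continuous) and a state-value critic."""

    def __init__(self, state_dim, k_discrete, param_dim, hidden=64):
        super().__init__()
        # Illustrative shared encoder feeding the two actor heads.
        self.encoder = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        # Discrete actor: k logits f_{a_1..a_k}, turned into softmax(f).
        self.discrete_head = nn.Linear(hidden, k_discrete)
        # Continuous actor: Gaussian mean for every continuous parameter;
        # the log-std is a learned state-independent vector in this sketch.
        self.mu_head = nn.Linear(hidden, param_dim)
        self.log_std = nn.Parameter(torch.zeros(param_dim))
        # Critic: state value V(s), used to form advantage estimates.
        self.critic = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state):
        h = self.encoder(state)
        logits = self.discrete_head(h)          # f for softmax(f)
        mu = self.mu_head(h)                    # Gaussian means
        std = self.log_std.exp().expand_as(mu)  # Gaussian stds
        value = self.critic(state).squeeze(-1)  # V(s)
        return logits, mu, std, value
```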

 

3.1 Hybrid Proximal Policy Optimization

 

How the actor networks produce the discrete and continuous actions:

The discrete actor network of H-PPO outputs $k$ values $f_{a_1}, f_{a_2}, \dots, f_{a_k}$ for the $k$ discrete actions, and the discrete action $a$ to take is randomly sampled from the $\mathrm{softmax}(f)$ distribution.

The continuous actor network of H-PPO generates the stochastic policy $\pi_{\theta_c}$ for continuous parameters by outputting the mean and variance of a Gaussian distribution for each of the parameters.
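A sketch of this sampling step, reusing the illustrative `HybridActorCritic` module from the snippet above: the discrete action is drawn from the $\mathrm{softmax}(f)$ categorical distribution and the continuous parameters from the per-parameter Gaussians; only the parameters attached to the chosen discrete action would actually be passed to the environment.

```python
import torch
from torch.distributions import Categorical, Normal

@torch.no_grad()
def select_action(model, state):
    """Sample a parameterized action (a, x) from the two actor heads."""
    logits, mu, std, value = model(state)
    # Discrete action a ~ softmax(f_{a_1}, ..., f_{a_k}).
    dist_d = Categorical(logits=logits)
    a = dist_d.sample()
    # Continuous parameters for all discrete actions; the environment only
    # uses the ones belonging to the selected discrete action a.
    dist_c = Normal(mu, std)
    x = dist_c.sample()
    return a, x, dist_d.log_prob(a), dist_c.log_prob(x).sum(-1), value
```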

The discrete and continuous policies are updated separately:

The discrete policy $\pi_{\theta_d}$ and the continuous policy $\pi_{\theta_c}$ are updated separately by minimizing their respective clipped surrogate objectives.

The objective for the discrete policy $\pi_{\theta_d}$ is the standard PPO clipped surrogate

$$L^{CLIP}(\theta_d) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta_d)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta_d),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right], \qquad r_t(\theta_d) = \frac{\pi_{\theta_d}(a_t \mid s_t)}{\pi_{\theta_d^{\mathrm{old}}}(a_t \mid s_t)}.$$

The objective for the continuous policy $\pi_{\theta_c}$ takes the same clipped form with the ratio

$$r_t(\theta_c) = \frac{\pi_{\theta_c}(x_t \mid s_t)}{\pi_{\theta_c^{\mathrm{old}}}(x_t \mid s_t)},$$

and both objectives use the advantage estimate $\hat{A}_t$ provided by the shared critic.

$\pi_{\theta_d}$ and $\pi_{\theta_c}$ are viewed as two separate distributions instead of a joint distribution in policy optimization.
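A rough sketch of this separate update: the two clipped surrogate losses are computed independently on the same advantage estimates, with the ratios clipped per policy. Here `eps` is the PPO clipping parameter, the "old" log-probabilities are assumed to have been stored during rollouts, and the negation turns each surrogate into a loss for a gradient-descent optimizer; the function name and arguments are illustrative.

```python
import torch

def hppo_policy_losses(logp_d_new, logp_d_old,
                       logp_c_new, logp_c_old, adv, eps=0.2):
    """Clipped surrogate losses for the discrete and continuous policies.

    Both policies share the advantage estimates from the single critic,
    but their probability ratios are formed and clipped independently.
    """
    # Ratio and clipped surrogate for the discrete policy pi_{theta_d}.
    ratio_d = torch.exp(logp_d_new - logp_d_old)
    loss_d = -torch.min(ratio_d * adv,
                        torch.clamp(ratio_d, 1 - eps, 1 + eps) * adv).mean()
    # Ratio and clipped surrogate for the continuous policy pi_{theta_c}.
    ratio_c = torch.exp(logp_c_new - logp_c_old)
    loss_c = -torch.min(ratio_c * adv,
                        torch.clamp(ratio_c, 1 - eps, 1 + eps) * adv).mean()
    return loss_d, loss_c
```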
