Diversity-Driven Exploration Strategy for Deep Reinforcement Learning

HoJ Ray

Category: DRL paper reading notes. Tag: reinforcement learning

Article link: https://blog.csdn.net/qq_19005887/article/details/106059167

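The paper comes from National Tsing Hua University and addresses the exploration problem in RL. Its method is effective on tasks with a large state space, sparse rewards, or deceptive rewards.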

Common RL exploration methods:

$\epsilon$-greedy or entropy regularization;
provide bonus rewards after visiting a novel state;
add random noise to the action/parameter space.

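The paper's approach:
Add a policy-distance term to the loss function of an off-policy or on-policy DRL algorithm, so that the new loss becomes:

$L_D = L - \mathbb{E}_{\pi'\in\Pi'}[\alpha D(\pi,\pi')]$

Here $L$ is the original loss of the DRL algorithm, $\alpha$ is a hyperparameter, $D$ is a distance measure between two policies, $\pi$ is the current policy, and $\Pi'$ is a set of past policies. Intuitively, the smaller the distance between the current policy and the past policies, the larger $L_D$ becomes. This signals to the optimizer that it should look for a current policy that lies farther from the past policies.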
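The intuition above can be checked numerically with a minimal sketch. It assumes the modified loss has the form "original loss minus $\alpha$ times the mean distance to past policies", uses KL divergence as the distance for a discrete action distribution, and evaluates a single state only. The function names `kl_divergence` and `diversity_loss` and all numbers are hypothetical, chosen for illustration.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete action distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def diversity_loss(base_loss, current_probs, past_probs_list, alpha):
    """Assumed form: L_D = L - alpha * mean over past policies of D(pi, pi').

    Evaluated for one state; a real implementation would also average
    over a batch of states.
    """
    distances = [kl_divergence(current_probs, past) for past in past_probs_list]
    mean_distance = sum(distances) / len(distances)
    return base_loss - alpha * mean_distance

current = [0.7, 0.2, 0.1]        # current policy pi(.|s), illustrative values
near_past = [[0.7, 0.2, 0.1]]    # a past policy identical to the current one
far_past = [[0.1, 0.2, 0.7]]     # a past policy that behaves very differently

loss_near = diversity_loss(1.0, current, near_past, alpha=0.5)
loss_far = diversity_loss(1.0, current, far_past, alpha=0.5)
print(loss_near, loss_far)
```

With the same base loss, the current policy that sits on top of a past policy receives the larger $L_D$, so minimizing $L_D$ pushes the current policy away from past behavior.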

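Contributions:
1) It modifies the DRL loss function, which increases the algorithm's exploration.
2) It gives a method for computing the hyperparameter $\alpha$ in the loss function.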

The paper runs experiments with three algorithms: A2C (on-policy), DQN (off-policy, discrete actions), and DDPG (off-policy, continuous actions). The loss functions of the original algorithms are:
$L_{A2C\_actor}=-\mathbb{E}_{s,a\sim \pi}[G_t - V(s) + \beta H(\pi(\cdot|s,\theta))]$
$L_{DQN}=\mathbb{E}_{s,a,r,s'\sim U(z)}[(r(s,a)+\gamma \max_{a'}Q(s',a',\theta^-) -Q(s,a,\theta))^2]$
$L_{DDPG\_actor}=-\mathbb{E}_{s\sim z}[Q(s,\pi(s))]$
(Note: A2C explores through the entropy term $H(\pi(\cdot|s,\theta))$, DQN explores with $\epsilon$-greedy, and DDPG explores with the stochastic policy $\hat\pi(s)=\pi(s)+N$, where $N$ is Ornstein-Uhlenbeck (OU) process noise.)
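To make two of these baseline exploration mechanisms concrete, here is a minimal sketch of $\epsilon$-greedy action selection and of an OU noise process that could be added to a deterministic action. The class and function names, and the default values `theta=0.15` and `sigma=0.2`, are illustrative assumptions and are not taken from this note.

```python
import math
import random

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

class OUNoise:
    """Discretized Ornstein-Uhlenbeck process: temporally correlated noise."""

    def __init__(self, mu=0.0, theta=0.15, sigma=0.2, dt=1.0, seed=None):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.rng = random.Random(seed)
        self.x = mu  # current noise value, starts at the long-run mean

    def sample(self):
        # Pull toward mu (mean reversion) plus a Gaussian perturbation.
        drift = self.theta * (self.mu - self.x) * self.dt
        diffusion = self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
        self.x += drift + diffusion
        return self.x

rng = random.Random(0)
greedy_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0, rng=rng)

noise = OUNoise(seed=0)
deterministic_action = 0.5                 # stands in for pi(s)
noisy_action = deterministic_action + noise.sample()
```

Both mechanisms perturb action selection locally, whereas the distance term in the paper's loss acts on the policy as a whole.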

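Each of the loss functions above is then modified by subtracting the policy-distance term, where $D$ is the distance between the current policy $\pi$ and a past policy $\pi'$ drawn from the set $\Pi'$:
$L_{D,A2C\_actor}=L_{A2C\_actor}-\mathbb{E}_{\pi'\in\Pi'}[\alpha D(\pi,\pi')]$
$L_{D,DQN}=L_{DQN}-\mathbb{E}_{\pi'\in\Pi'}[\alpha D(\pi,\pi')]$
$L_{D,DDPG\_actor}=L_{DDPG\_actor}-\mathbb{E}_{\pi'\in\Pi'}[\alpha D(\pi,\pi')]$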

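Because the added term introduces a hyperparameter, the scaling factor $\alpha$, the paper also gives methods for setting it. Off-policy algorithms only need the distance-based method, while on-policy algorithms combine the distance-based and performance-based methods.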
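The note names the two scaling methods without spelling out their formulas. As a rough sketch of what such rules could look like: a distance-based rule might raise $\alpha$ when the current policy is too close to past policies and lower it otherwise, and a performance-based rule might map each past policy's performance linearly to $[-1, 1]$. The multipliers 1.01 and 0.99, the threshold `delta`, the linear mapping, and both function names are assumptions for illustration, not details confirmed by this note.

```python
def distance_based_alpha(alpha, mean_distance, delta, up=1.01, down=0.99):
    """Assumed rule: grow alpha if the policy is within delta of past
    policies (more push toward diversity), shrink it otherwise."""
    return alpha * up if mean_distance <= delta else alpha * down

def performance_based_alpha(perf, perf_min, perf_max):
    """Assumed rule: map a past policy's performance linearly to [-1, 1].

    The worst past policy gets +1 (move away from it), the best gets -1
    (move toward it).
    """
    if perf_max == perf_min:
        return 0.0  # no performance spread, so no preference
    normalized = (perf - perf_min) / (perf_max - perf_min)
    return -(2.0 * normalized - 1.0)

# Distance-based: policy too similar to the past, so alpha increases.
alpha = distance_based_alpha(1.0, mean_distance=0.05, delta=0.1)

# Performance-based: one alpha per past policy, from illustrative returns.
returns = [10.0, 25.0, 40.0]
alphas = [performance_based_alpha(r, min(returns), max(returns)) for r in returns]
print(alpha, alphas)
```

Under these assumed rules, a negative $\alpha$ turns the distance term into an attraction, so a strong past policy pulls the current policy toward it instead of repelling it.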
