Reading Notes on "Understanding Multi-Step Deep Reinforcement Learning: A Systematic Study of the DQN Target"

hehedadaq

Article link: https://blog.csdn.net/hehedadaq/article/details/112608641


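Foreword:

Lately I've been obsessed with importance sampling (IS) in reinforcement learning and have spent a lot of time on it: first brushing up on probability and statistics, then reading about off-policy versus on-policy methods.

Then I ran into the question: why doesn't Q-learning use IS? Fine, I figured that one out: one-step Q-learning doesn't need IS. But then I got confused about whether multi-step methods should add IS, which is just absurd. The expert 天启 recommended this paper to me, and after reading it I was even more confused.

My previous post, "Why doesn't Q-learning use importance sampling?", already worked out that with a deterministic target policy like this one, the importance sampling ratio behaves very strangely. Given that, like several Zhihu answerers, I had a hunch that adding IS to multi-step returns is not appropriate.
After reading this paper, it turns out that it indeed isn't.
By the way, when it comes to what is "appropriate" in RL, it's best if the theory holds up, but what matters most is that the performance delivers.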
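
To make the "strange ratio" concrete, here is a minimal sketch, assuming a greedy target policy π and an ε-greedy behavior policy μ over the same Q-values. The function names and the numbers are made up for illustration and are not taken from the paper. The per-step ratio π/μ is slightly above 1 for the greedy action and exactly 0 for any exploratory action, so the product over a multi-step trajectory collapses to 0 as soon as one exploratory action appears.

```python
import numpy as np

def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy over q_values."""
    n = len(q_values)
    probs = np.full(n, epsilon / n)
    probs[int(np.argmax(q_values))] += 1.0 - epsilon
    return probs

def is_ratio(q_values, action, epsilon):
    """pi(a|s) / mu(a|s) with pi greedy (assumed) and mu epsilon-greedy (assumed)."""
    pi = epsilon_greedy_probs(q_values, 0.0)[action]   # greedy = epsilon of 0
    mu = epsilon_greedy_probs(q_values, epsilon)[action]
    return float(pi / mu)

def trajectory_ratio(q_values_per_step, actions, epsilon):
    """Product of per-step ratios along a multi-step trajectory."""
    ratios = [is_ratio(q, a, epsilon) for q, a in zip(q_values_per_step, actions)]
    return float(np.prod(ratios))

# Hypothetical Q-values for one state with three actions; action 1 is greedy.
q = np.array([1.0, 2.0, 0.5])
print(is_ratio(q, 1, 0.1))                            # greedy action: a bit above 1
print(is_ratio(q, 0, 0.1))                            # exploratory action: 0
print(trajectory_ratio([q, q, q], [1, 0, 1], 0.1))    # one exploratory step zeroes it
```

With these ratios, a multi-step update either gets cut off entirely or gets scaled up a little, which is one way to see why the correction is awkward for a greedy target policy.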

Reference links:

Paper link

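I. Paper overview

1. Authors:

The first author is J. Fernando Hernandez-Garcia
Department of Computing Science
University of Alberta,
and the second author is, once again, Richard S. Sutton.
The old master is still exploring how the basic components of RL affect performance.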

2. Venue:

NIPS 2018

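3. Citation count:

4. Background and field

The paper makes three points:

(1) using off-policy correction can have an adverse effect on the performance of Sarsa and Q(σ);
That is, adding IS as an off-policy correction to multi-step Sarsa and Q(σ) makes them perform worse.

(2) increasing the backup length n consistently improved performance across all the different algorithms;
That is, the larger n is, the better every n-step algorithm performs, n-step Q-learning included.

(3) the performance of Sarsa and Q-learning was more robust to the effect of the target network update frequency than the performance of Tree Backup, Q(σ), and Retrace in this particular task.
That is, when the target network is updated less often, Sarsa and Q-learning hold up well, while Tree Backup, Q(σ), and Retrace degrade. A possible cause is a mismatch between the networks.

Research background in one sentence:

The paper tests how these three factors affect performance and finds that what earlier work did in practice is consistent with the results.
In short: multi-step returns shouldn't use IS, more steps are better, and the target network shouldn't be updated too slowly.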

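Experimental analysis:

To verify these effects, the authors ran tests on the simple mountain car environment, evaluating Retrace, Q-learning, Tree Backup, Sarsa, and Q(σ) with an architecture analogous to DQN. I only care about Q-learning and DQN.
The authors ran three groups of experiments and analyzed the results statistically.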

Off-Policy vs On-Policy

Our first experiment was motivated by the results obtained with n-step Q-learning without off-policy corrections in the Ape-X architecture (Horgan et al., 2018). In light of those results, we investigated how other algorithms would perform without any off-policy correction.

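The figure for this experiment compares six settings:
On-Policy Sarsa:
Off-Policy Sarsa: with IS
On-Policy Q(σ = 0.5): halfway between Sarsa (σ = 1, pure sampling) and Tree Backup (σ = 0, pure expectation)
Off-Policy Q(σ = 0.5): with IS
On-Policy Decaying σ:
Off-Policy Decaying σ: with IS
In short, every on-policy variant outperforms its off-policy counterpart with IS.
The definition of Q(σ) looks fancy. This was the first time I'd seen it and I haven't expanded the formula myself; roughly, it is an n-step return that mixes sampling and expectation at every step.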
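
For readers who want to see the recursion spelled out, below is a rough sketch of an n-step Q(σ) return. It assumes the recursive form from the n-step Q(σ) section of Sutton & Barto (2018), which may differ in detail from the exact expression in the paper; the function name and argument layout are my own. Passing `rho_next` as all ones corresponds to the "on-policy" settings above (no off-policy correction), and passing the real π/μ ratios corresponds to the "off-policy" settings with IS.

```python
import numpy as np

def q_sigma_return(rewards, q_next, pi_next, actions_next, rho_next, sigma, gamma):
    """n-step Q(sigma) return, computed backwards (assumed textbook form).

    rewards[k]      : R_{t+k+1}
    q_next[k]       : vector Q(S_{t+k+1}, .)
    pi_next[k]      : vector pi(. | S_{t+k+1})
    actions_next[k] : action A_{t+k+1} actually taken
    rho_next[k]     : IS ratio for A_{t+k+1}; use 1.0 to disable the correction
    sigma[k]        : sigma_{t+k+1} in [0, 1]; 1 = sample, 0 = expectation
    Episode termination is not handled in this sketch.
    """
    n = len(rewards)
    # The recursion bottoms out at Q(S_{t+n}, A_{t+n}).
    g = q_next[n - 1][actions_next[n - 1]]
    for k in reversed(range(n)):
        a = actions_next[k]
        v_bar = float(np.dot(pi_next[k], q_next[k]))      # expected value under pi
        weight = sigma[k] * rho_next[k] + (1.0 - sigma[k]) * pi_next[k][a]
        g = rewards[k] + gamma * (weight * (g - q_next[k][a]) + v_bar)
    return float(g)

# Hypothetical two-step example with two actions.
rewards = [1.0, 2.0]
q_next = [np.array([1.0, 3.0]), np.array([0.0, 4.0])]
pi_next = [np.array([0.5, 0.5]), np.array([0.5, 0.5])]
actions_next = [1, 0]
ones = [1.0, 1.0]

sarsa_like = q_sigma_return(rewards, q_next, pi_next, actions_next, ones, [1.0, 1.0], 1.0)
tree_backup = q_sigma_return(rewards, q_next, pi_next, actions_next, ones, [0.0, 0.0], 1.0)
half = q_sigma_return(rewards, q_next, pi_next, actions_next, ones, [0.5, 0.5], 1.0)
print(sarsa_like, tree_backup, half)
```

Setting every σ to 1 gives a Sarsa-style sampled return, every σ to 0 gives Tree Backup, and σ = 0.5 sits between the two.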

Now the second experiment:

Comparison of n-Step Algorithms

Based on results obtained in the linear function approximation case with n-step Sarsa (Sutton & Barto, 2018), we hypothesized that using a backup length greater than 1 would result in better performance, but using too big of a value would have an adverse effect.

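Earlier results with linear function approximation showed that n greater than 1 helps, but too large an n hurts performance. The authors carried the same hypothesis over to the nonlinear model to see whether it still holds.

In the nonlinear model, however, the same values of n show no adverse effect.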
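
As a reminder of what the backup length n actually changes, here is a minimal sketch of an uncorrected n-step Q-learning target (no IS), in the spirit of the Ape-X setup mentioned earlier. The function name and signature are hypothetical; `q_boot` stands for the target network's Q-values at the state reached after n steps.

```python
import numpy as np

def n_step_q_target(rewards, q_boot, gamma, terminal=False):
    """Sum of n discounted rewards plus a discounted greedy bootstrap.

    rewards  : [R_{t+1}, ..., R_{t+n}]
    q_boot   : Q-values of S_{t+n} (assumed to come from the target network)
    terminal : if True, S_{t+n} is terminal and the bootstrap term is dropped
    """
    g = 0.0 if terminal else float(np.max(q_boot))
    # Work backwards so each step applies one factor of gamma.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Mountain-car-style rewards of -1 per step, with made-up bootstrap values.
target = n_step_q_target([-1.0, -1.0, -1.0], np.array([-10.0, -7.0]), gamma=1.0)
print(target)
```

A larger n puts more weight on real rewards and less on the bootstrapped estimate, which is the quantity the second experiment varies.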

Target Network Update Frequency

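The 500 in this experiment's figure is the update interval in steps: the larger it is, the lower the update frequency. You can see that at a low update frequency Sarsa and Q-learning still do fine, while the other algorithms do not.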
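
For context, the "update interval" refers to how often the online network's weights are copied into the target network. A minimal sketch of such a hard update, with a hypothetical class name and plain dictionaries standing in for network parameters:

```python
import copy

class TargetNetworkSync:
    """Copies online parameters into the target every `update_freq` steps."""

    def __init__(self, online_params, update_freq):
        self.update_freq = update_freq
        self.target_params = copy.deepcopy(online_params)
        self.steps = 0

    def step(self, online_params):
        """Call once per training step; returns True when a copy happened."""
        self.steps += 1
        if self.steps % self.update_freq == 0:
            self.target_params = copy.deepcopy(online_params)
            return True
        return False

# Hypothetical parameters; an interval of 3 is used only to keep the demo short.
online = {"w": [0.0]}
sync = TargetNetworkSync(online, update_freq=3)
online["w"][0] = 1.0                      # pretend the online network was trained
history = [sync.step(online) for _ in range(3)]
print(history, sync.target_params)
```

With a large interval such as 500, the targets are computed from increasingly stale weights between copies, which is the effect the third experiment probes.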

Summary:

In RL with a deterministic target policy, IS simply doesn't work well. Confirmed.

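Contact

PS: Fellow RL learners are welcome to join the group and study together:
Deep Reinforcement Learning (DRL) group: 799378128
Follow my Zhihu account: 未入门的炼丹学徒
CSDN account: https://blog.csdn.net/hehedadaq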
