Reading notes: A Minimalist Approach to Offline Reinforcement Learning [TD3+BC]

hehedadaq

Article link: https://blog.csdn.net/hehedadaq/article/details/122161632

Preface:

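Lately I have been curious about which offline RL methods are both reliable and simple.
Several people in my group chat recommended the latest work from the TD3 authors: TD3+BC.
Funnily enough, I had come across this paper earlier while surveying BC, but I did not read it closely. My reaction at the time was "is that all?"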

Only after looking at its experimental results, its code, and the OpenReview discussion did I realize that there really is something to it.

At the very least the method is simple; the code change is just two lines:
[Figure: screenshot of the two-line code change]

The whole codebase has only three main files, with no tangle of nested inheritance folders, which is far better than the neighboring baselines.

Classic moments:

1. Dissing peers:

While there are many proposed approaches to offline RL, we remark that few are truly “simple”, and even the algorithms which claim to work with minor additions to an underlying online RL algorithm make a significant number of implementation-level adjustments.

2. How to battle reviewers over an intuitive idea:

A Minimalist Approach to Offline Reinforcement Learning (OpenReview)
The second reviewer:

First, it seems that the novelty of the method is a bit limited. The authors seem to directly adapt RL+BC to the offline setting except that they add the state normalization, which is also not new. The authors also didn’t theoretically justify the approach. For example, the authors should show that the method can guarantee safe policy improvement and moreover enjoys comparable or better policy improvement guarantees w.r.t. prior approaches. Without the theoretical justification and given the current form of the method, I think the method is a bit incremental.

The reviewer then gave it a 5, a weak reject...
The authors' reply was also clever:

On novelty: We don’t disagree at all that our algorithm is incremental in novelty (we highlight a number of similar algorithms in the related work). However, our main claim/contribution is not so much that this is the best possible offline RL algorithm, or that it is particularly novel, but rather the surprising observation that the use of very simple techniques can match/outperform current algorithms. The hope is that TD3+BC could be used as an easy-to-implement baseline or starting point for other additions (such as S4RL), while eliminating a lot of unnecessary complexity, hyperparameter tuning, or computational cost, required by more sophisticated methods.

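In short: the algorithm is admittedly not very novel, but it is simple and it works well. Perhaps only big names can get away with arguing like this...
I read the paper's related work section but found it confusing: plenty of algorithms have already used BC to keep the policy from drifting away from the dataset's action distribution. Perhaps theirs is the only one that uses BC alone?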

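3. Structure and performance comparison with SOTA algorithms:

[Figure: comparison of implementation structure and performance against SOTA algorithms]
Setting novelty aside and looking only at structure and performance, isn't this method both simple and strong?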

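Background on offline RL:

Since I have written reading notes on this topic before, I will only say a few words about offline RL here.

Normally, RL has to interact with the environment. For (s,a) pairs the agent has not yet experienced, the initial value estimates may be wrong, especially overestimated ones; in principle these are corrected by the rewards actually received from the environment.
In the offline setting, however, there is no interaction with the environment apart from the final evaluation of the policy $\pi$. All data comes from a fixed dataset. So if an (s,a) pair that is absent from the dataset is overestimated, nothing ever corrects it, and a policy network optimized by gradient ascent on this wrong value function will inevitably go astray. This is the so-called distributional shift issue.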
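
A minimal toy sketch of the point above, under made-up assumptions: one state, three actions, hypothetical rewards, and a deliberately overestimated initial value for action 2. The names `fit_q`, `greedy`, and `TRUE_REWARD` are illustrative only and come from no library. It shows that an action missing from a fixed dataset keeps its wrong value, while online interaction corrects it.

```python
# Toy single-state problem with three actions (all values are hypothetical).
TRUE_REWARD = {0: 1.0, 1: 0.5, 2: -1.0}  # action 2 is actually the worst


def fit_q(q_init, samples, lr=0.5, epochs=50):
    """Move Q(a) toward observed rewards; unseen actions are never updated."""
    q = dict(q_init)
    for _ in range(epochs):
        for action, reward in samples:
            q[action] += lr * (reward - q[action])
    return q


def greedy(q):
    """Return the action with the highest estimated value."""
    return max(q, key=q.get)


# A wrong, overestimated initial value for action 2.
q_init = {0: 0.0, 1: 0.0, 2: 5.0}

# Offline: the fixed dataset only covers actions 0 and 1.
offline_data = [(0, TRUE_REWARD[0]), (1, TRUE_REWARD[1])]
q_offline = fit_q(q_init, offline_data)

# Online: the agent tries its greedy action and observes the real reward,
# so the overestimate of action 2 is eventually corrected.
q_online = dict(q_init)
for _ in range(50):
    a = greedy(q_online)
    q_online[a] += 0.5 * (TRUE_REWARD[a] - q_online[a])

print("offline greedy action:", greedy(q_offline))
print("online greedy action:", greedy(q_online))
```

In this sketch the offline learner still prefers action 2 because its estimate was never touched, whereas the online learner settles on action 0.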

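Current offline algorithms all add constraints of one kind or another to keep the policy network's output as close as possible to the actions in the dataset, and this paper is no exception.

Earlier methods did this in all sorts of convoluted ways; some add computation, others add extra hyperparameters.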

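Core of TD3+BC:

I did not fully understand its performance comparison plots, but that does not stop me from taking its performance as SOTA-level...
On that basis, look at Section 5 of the paper: the whole algorithm is explained in a single page, yet the full paper runs to 17 pages... The workload behind a top-conference paper really is heavy...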

The policy network update gains two things: a weight $\lambda$ and a BC loss.

[Figure: the TD3+BC policy update equation]
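
For reference, the policy update as I recall it from the TD3+BC paper can be written as below. Here $\mathcal{D}$ is the dataset, $N$ the mini-batch size, and $\alpha$ the normalization coefficient; treat the exact notation as my own rendering.

$$
\pi = \arg\max_{\pi} \; \mathbb{E}_{(s,a)\sim\mathcal{D}}\left[\lambda\, Q(s,\pi(s)) - \left(\pi(s)-a\right)^2\right],
\qquad
\lambda = \frac{\alpha}{\frac{1}{N}\sum_{(s_i,a_i)} \left|Q(s_i,a_i)\right|}
$$

The first term is the usual TD3 policy objective scaled by $\lambda$; the second term is the behavior cloning penalty.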

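But the two terms should not differ too much in order of magnitude. The BC term depends on the action values, which generally lie in [-1, 1], so the squared error per action dimension is at most [1-(-1)]^2 = 4. The Q part of the loss therefore needs to be constrained as well.
I have recently been combining BC and -Q myself, and had not yet considered weighting the two.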

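The paper weights Q by dividing by the mean absolute Q value of the current mini-batch and then multiplying by a coefficient $\alpha$, set to 2.5 in the paper, which keeps the batch mean of the Q term within [-2.5, 2.5]. But the gradient of Q, after passing back through the critic network to the actor's output, does not seem to be bounded in the same way? I could not figure this out at the time, and I would welcome an explanation from someone more expert...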
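
A minimal sketch of this weighting in plain Python, with no autograd. The function name `td3_bc_actor_loss` and the list-based inputs are my own illustration, and treating $\lambda$ as a constant when differentiating is an assumption here. It only shows how dividing by the batch mean of |Q| and multiplying by $\alpha$ keeps the scalar Q term within [-$\alpha$, $\alpha$]. That bound applies to the loss value; the gradient reaching the actor is rescaled by the same factor but is not bounded by it.

```python
ALPHA = 2.5  # coefficient reported in the paper


def td3_bc_actor_loss(q_values, policy_actions, data_actions, alpha=ALPHA):
    """Return (loss, lam, q_term, bc_term) for one mini-batch.

    q_values:       Q(s, pi(s)) per sample, list of floats
    policy_actions: pi(s) per sample, list of action vectors
    data_actions:   dataset action a per sample, list of action vectors
    """
    n = len(q_values)
    mean_abs_q = sum(abs(q) for q in q_values) / n
    # Assumption: lam is treated as a constant scale factor.
    lam = alpha / mean_abs_q
    q_term = lam * sum(q_values) / n

    # Mean squared error over every action dimension in the batch.
    sq_errors = [
        (p - a) ** 2
        for p_vec, a_vec in zip(policy_actions, data_actions)
        for p, a in zip(p_vec, a_vec)
    ]
    bc_term = sum(sq_errors) / len(sq_errors)

    # Minimizing this maximizes the scaled Q while staying near dataset actions.
    return -q_term + bc_term, lam, q_term, bc_term


# Hypothetical mini-batch: large raw Q values, actions in [-1, 1].
loss, lam, q_term, bc_term = td3_bc_actor_loss(
    q_values=[100.0, 300.0],
    policy_actions=[[1.0, -1.0], [0.0, 0.0]],
    data_actions=[[-1.0, 1.0], [0.0, 0.0]],
)
print(loss, lam, q_term, bc_term)
```

With raw Q values in the hundreds, the scaled Q term ends up on the same order as the BC term, whose per-dimension value cannot exceed 4.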

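State normalization: many algorithms already use this, but the authors say they list it separately so that every modification to TD3 is transparent. What I wonder is whether the mean and variance computed from the offline dataset carry over reliably to the online setting.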
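
A minimal sketch of this kind of normalization, assuming per-dimension statistics computed once from the dataset and then reused unchanged at evaluation time. The helper names `fit_normalizer` and `normalize`, the toy states, and the `EPS` value are illustrative assumptions.

```python
import statistics

EPS = 1e-3  # small constant to avoid division by zero (assumed value)


def fit_normalizer(states):
    """Compute per-dimension mean and std from the offline dataset."""
    dims = list(zip(*states))
    mean = [statistics.fmean(d) for d in dims]
    std = [statistics.pstdev(d) + EPS for d in dims]
    return mean, std


def normalize(state, mean, std):
    """Apply the dataset statistics to a single state."""
    return [(x - m) / s for x, m, s in zip(state, mean, std)]


# Hypothetical dataset of 2-dimensional states; the second dimension is constant.
dataset_states = [[0.0, 10.0], [2.0, 10.0], [4.0, 10.0]]
mean, std = fit_normalizer(dataset_states)

# At evaluation time, states from the environment reuse the same statistics.
env_state = [3.0, 10.0]
print(normalize(env_state, mean, std))
```

Whether these statistics remain representative depends on how far the evaluated policy's state distribution drifts from the dataset, which is exactly the concern raised above.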

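Analysis of experimental results (updated)

I did not understand what the random, medium, and expert datasets meant, and the paper does not explain them in detail.
In the end I opened the D4RL paper, so here is a summary of these datasets: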

The “medium” dataset is generated by first training a policy online using Soft Actor-Critic (Haarnoja et al., 2018a), early-stopping the training, and collecting 1M samples from this partially-trained policy.
That is, everything is medium-level, with no garbage.
The “medium-replay” dataset consists of recording all samples in the replay buffer observed during training until the policy reaches the “medium” level of performance.
That is, everything from garbage up to medium.
The “random” datasets are generated by unrolling a randomly initialized policy on these three domains.
That is, all random garbage.
we further introduce a “medium-expert” dataset by mixing equal amounts of expert demonstrations and suboptimal data, generated via a partially trained policy or by unrolling a uniform-at-random policy.
That is, medium plus expert.
a large amount of expert data from a fine-tuned RL policy (“expert”).
That is, the expert data comes from a well-tuned policy.

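With this background, Figures 1, 2, and 3 become much easier to read:
[Figure: Figures 1-3 of the paper]
In Figure 1, they measure the percent difference in performance after removing all the implementation components from CQL and Fisher-BRC. Many tasks show a significant drop in performance.
Even so, I still cannot work out what "Percent difference" means.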
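
One plausible reading, offered only as an assumption: "percent difference" is the relative change of the simplified algorithm's score with respect to the original algorithm's score. The sketch below uses that conventional definition with made-up scores; it is not taken from the paper's code.

```python
def percent_difference(original_score, simplified_score):
    """Relative change of the simplified variant versus the original, in percent.

    Assumed definition: 100 * (simplified - original) / |original|.
    """
    return 100.0 * (simplified_score - original_score) / abs(original_score)


# Hypothetical scores: the full algorithm gets 80, the stripped-down one gets 60.
drop = percent_difference(80.0, 60.0)
print(drop)
```

Under this reading, a negative value means that removing the components hurt performance, and a value near zero means the components made little difference.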