My Reinforcement Learning Journey (8)_2021-01-08: Reinforcement Learning Resources and Study Advice


Chou_pijiang


Category: Reinforcement Learning - Fundamentals. Tags: reinforcement learning

Article link: https://blog.csdn.net/zyh19980527/article/details/112592306


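As a beginner, I am writing this Reinforcement Learning - Fundamentals column to share my own path through learning reinforcement learning, in the hope that it helps others. The series will keep being updated, and I hope to average one post per day through 2021. It mainly covers the fundamentals of reinforcement learning, and I will later also update a reinforcement learning paper-reading column. I originally wanted each post to cover more material, but I later found that most people come to CSDN mainly to ask questions, so I split much of the content into smaller posts (which also makes each day's workload lighter, hahaha, the lazy way). I still want the material to form a coherent system, so the table of contents is organized systematically by chapter, and I have also written it up in book form on GitHub. If you find it helpful and want to read from the beginning, you are welcome to follow my GitHub. Thank you! I will also share a Deep Learning - Fundamentals column and a Deep Learning paper-reading column, which my friends and I put a lot of effort into writing a long time ago. If you are interested in deep learning, you are welcome to follow those too. Let's learn from each other! There are probably many errors and omissions, and corrections are welcome. Do not overestimate a year of effort, and do not underestimate a decade of accumulation. Let us encourage each other!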

Some Reinforcement Learning Resources

Here I mainly share some courses, books, and code libraries.

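Courses

CS 285 at UC Berkeley: This is the course Sergey Levine taught at UC Berkeley in late 2020. I find it especially detailed, and it includes a lot of recent research results.
David Silver’s course: David Silver released this course in 2015. It is a classic, and I believe many people getting started with reinforcement learning watched it. Much of the material is hard to follow at first, so you may need to listen several times.
Berkeley’s Deep RL Bootcamp: I have not actually watched this one, but I looked through the lectures. It is taught by a group of leading researchers, so the quality should also be high.
Stanford CS234: Reinforcement Learning: I have not watched this one yet, but judging from the reviews it seems quite good.
Stanford CS330: Multi-Task and Meta-Learning, 2019: Finn's course on multi-task learning and meta-learning. Highly recommended.
Hung-yi Lee's machine learning course: Parts of Professor Hung-yi Lee's machine learning course also cover reinforcement learning. He teaches in Chinese and always manages to explain a fairly complex problem in a very simple way.
Bolei Zhou's course (The Chinese University of Hong Kong): If you find English-language courses hard to follow, I suggest starting with Professor Bolei Zhou's course. Its content is relatively simple, which makes it a good way in.
Those are all the courses I can think of for now. I took notes on most of the courses above that I attended, mostly written directly on the slides. If you would like them, leave a comment and let me know.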

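Books

Sutton's reinforcement learning book: This is the most classic book in reinforcement learning, written by Sutton. Almost every instructor teaching a reinforcement learning course recommends it, though most of what it covers is value-based methods.
- English edition
- Chinese edition: this translation actually has quite a few problems, and some sentences and technical terms are not translated well. I suggest buying the printed edition translated by Professor Kai Yu.
- Code
- References cited in the book
- Slides
Algorithms for Reinforcement Learning, Szepesvari: It covers many of the classic reinforcement learning algorithms.
《强化学习精要：核心算法与TensorFlow 实现》 (Essentials of Reinforcement Learning: Core Algorithms and TensorFlow Implementation): I came across this book by chance in a bookcase in my lab, flipped through it, and found it well written. This book and Sutton's are the two I use most.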

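Code Libraries

OpenAI Spinning Up
Baselines
Stable-Baselines
Ray/RLlib
Pytorch-DRL
rlpyt
Tianshou

Some of these libraries are written in TensorFlow and some in PyTorch. For a detailed comparison and evaluation, see the one written by the authors of Tianshou.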

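Advice for Learning Reinforcement Learning

Much of the advice below is summarized from the tips in the Stable Baselines documentation, with some of my own rough opinions added.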

Read about RL and Stable Baselines: Read plenty of baseline code for reinforcement learning algorithms, such as the code libraries listed above.
Do quantitative experiments and hyperparameter tuning if needed. This factor, among others, explains that results in RL may vary from one run to another (i.e., when only the seed of the pseudo-random generator changes). For this reason, you should always do several runs to have quantitative results. In short, results from a reinforcement learning algorithm can differ a lot between runs, so run several experiments to get quantitative results.
Good results in RL are generally dependent on finding appropriate hyperparameters. Recent algorithms (PPO, SAC, TD3) normally require little hyperparameter tuning, however, don’t expect the default ones to work on any environment. In short, although recent algorithms such as PPO, SAC, and TD3 need little tuning, do not expect one default hyperparameter setting to work in every environment. Good results in reinforcement learning generally require some hyperparameter tuning.
When applying RL to a custom problem, you should always normalize the input to the agent (e.g. using VecNormalize for PPO2/A2C) and look at common preprocessing done on other environments (e.g. for Atari, frame-stack, …). Please refer to Tips and Tricks when creating a custom environment paragraph below for more advice related to custom environments. In short, normalize the input in a way that suits the agent.
Evaluate the performance using a separate test environment: During training, exploration is sometimes boosted with something like epsilon-greedy, but at test time the model should be evaluated directly in a separate environment.
As a general advice, to obtain better performances, you should augment the budget of the agent (number of training timesteps). In short, train longer for better results.
In order to achieve the desired behavior, expert knowledge is often required to design an adequate reward function. This reward engineering (or RewArt as coined by Freek Stulp), necessitates several iterations. As a good example of reward shaping, you can take a look at Deep Mimic paper which combines imitation learning and reinforcement learning to do acrobatic moves. In short, bring in some expert knowledge to design a good reward.
Reproducibility: Completely reproducible results are not guaranteed across Tensorflow releases or different platforms. Furthermore, results need not be reproducible between CPU and GPU executions, even when using identical seeds. In order to make computations deterministic on CPU, on your specific problem on one specific platform, you need to pass a seed argument at the creation of a model and set n_cpu_tf_sess=1 (number of cpu for Tensorflow session). If you pass an environment to the model using set_env(), then you also need to seed the environment first. In short, set random seeds to improve the reproducibility of results.
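The advice about running several seeds can be sketched with the standard library alone. In the sketch below, `run_experiment` is a hypothetical stand-in for a full training run (a seeded random process, not a real agent), and `summarize` is an illustrative helper name; neither comes from Stable Baselines.

```python
import random
import statistics


def run_experiment(seed, n_episodes=50):
    """Hypothetical stand-in for one training run; returns a final score.

    A real run would train an agent. Here a seeded random process plays that
    role, so the seed is the only source of run-to-run variation.
    """
    rng = random.Random(seed)  # private generator: same seed, same result
    score = 0.0
    for _ in range(n_episodes):
        score += rng.gauss(1.0, 2.0)  # noisy per-episode contribution
    return score


def summarize(seeds):
    """Run one experiment per seed and return (mean, sample std, scores)."""
    scores = [run_experiment(s) for s in seeds]
    return statistics.mean(scores), statistics.stdev(scores), scores


if __name__ == "__main__":
    mean, std, scores = summarize(range(5))
    print(f"mean={mean:.2f} std={std:.2f} over {len(scores)} seeds")
```

Reporting the mean together with the spread over seeds, instead of a single run, is what makes two algorithms or two hyperparameter settings comparable.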
No algorithm suits every task, so how do you choose one in practice? The first criterion is whether the problem's action space is discrete or continuous: for example, DQN only works with discrete actions, and SAC only with continuous actions. The second is whether you want to parallelize training. The specific recommendations are below:
- Discrete Actions - Single Process: DQN with extensions (double DQN, prioritized replay, …) and ACER are the recommended algorithms. DQN is usually slower to train (regarding wall clock time) but is the most sample efficient (because of its replay buffer). In short, for discrete actions without parallelization, use DQN and its extensions, ACER, and so on. DQN is slow to train, but thanks to its replay buffer it is relatively sample efficient.
- Discrete Actions - Multiprocessed: You should give a try to PPO2, A2C and its successors (ACKTR, ACER). If you can multiprocess the training using MPI, then you should checkout PPO1 and TRPO. In short, for discrete actions with parallelization, use PPO2, or A2C and its successors (ACKTR, ACER).
- Continuous Actions - Single Process: Current State Of The Art (SOTA) algorithms are SAC and TD3. Please use the hyperparameters in the RL zoo for best results. In short, for continuous actions without parallelization, use SAC or TD3.
- Continuous Actions - Multiprocessed: Take a look at PPO2, TRPO or A2C. Again, don’t forget to take the hyperparameters from the RL zoo for continuous actions problems (cf Bullet envs). If you can use MPI, then you can choose between PPO1, TRPO and DDPG. In short, for continuous actions with parallelization, use PPO2, TRPO, or A2C.
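The four cases above amount to a small lookup table. The sketch below simply transcribes the quoted recommendations; the names `RECOMMENDED`, `MPI_EXTRAS`, and `recommend` are illustrative and are not part of Stable Baselines.

```python
# Recommendations transcribed from the Stable Baselines tips quoted above.
# Keys are (action_space, multiprocessed). All names here are illustrative.
RECOMMENDED = {
    ("discrete", False): ["DQN", "ACER"],
    ("discrete", True): ["PPO2", "A2C", "ACKTR", "ACER"],
    ("continuous", False): ["SAC", "TD3"],
    ("continuous", True): ["PPO2", "TRPO", "A2C"],
}

# Extra options the tips mention when training can be multiprocessed with MPI.
MPI_EXTRAS = {
    "discrete": ["PPO1", "TRPO"],
    "continuous": ["PPO1", "TRPO", "DDPG"],
}


def recommend(action_space, multiprocessed, use_mpi=False):
    """Return the algorithms suggested for the given setting, in order."""
    key = (action_space, multiprocessed)
    if key not in RECOMMENDED:
        raise ValueError(f"unknown setting: {key!r}")
    algos = list(RECOMMENDED[key])
    if multiprocessed and use_mpi:
        # Append the MPI-only options without repeating names already listed.
        for name in MPI_EXTRAS[action_space]:
            if name not in algos:
                algos.append(name)
    return algos


if __name__ == "__main__":
    print(recommend("continuous", False))
    print(recommend("discrete", True, use_mpi=True))
```

A table like this is only a starting point; the tips themselves stress that hyperparameters (for example those from the RL zoo) still matter.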
There are also some tricks for creating a custom environment:
- always normalize your observation space when you can, i.e., when you know the boundaries. In short, normalize your state space.
- normalize your action space and make it symmetric when continuous (cf potential issue below). A good practice is to rescale your actions to lie in [-1, 1]. This does not limit you as you can easily rescale the action inside the environment. In short, normalize your action space, for example by scaling actions to [-1, 1].
- start with shaped reward (i.e. informative reward) and simplified version of your problem. In short, use a shaped reward and begin with a simple form of the problem.
- debug with random actions to check that your environment works and follows the gym interface. In short, use random actions to test whether your environment works properly.
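The tricks above can be sketched together on a toy problem. Everything in this example is hypothetical: `ToyPointEnv` is a made-up 1-D environment that imitates the classic gym-style `reset()` / `step()` interface (returning `obs, reward, done, info`) without importing gym, and `rescale_action` and `check_with_random_actions` are illustrative helpers.

```python
import random


def rescale_action(action, low, high):
    """Map an action from [-1, 1] to [low, high], clipping out-of-range input."""
    action = max(-1.0, min(1.0, action))
    return low + (action + 1.0) * 0.5 * (high - low)


class ToyPointEnv:
    """Hypothetical 1-D environment with a classic gym-style interface."""

    POS_LIMIT = 5.0                # physical position lies in [-5, 5]
    VEL_LOW, VEL_HIGH = -0.5, 1.5  # asymmetric physical action range
    MAX_STEPS = 20

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.pos = 0.0
        self.steps = 0

    def _obs(self):
        # Known bounds, so the observation is normalized to [-1, 1].
        return self.pos / self.POS_LIMIT

    def reset(self):
        self.pos = self.rng.uniform(-self.POS_LIMIT, self.POS_LIMIT)
        self.steps = 0
        return self._obs()

    def step(self, action):
        # The agent acts in [-1, 1]; rescaling happens inside the environment.
        velocity = rescale_action(action, self.VEL_LOW, self.VEL_HIGH)
        new_pos = self.pos + velocity
        self.pos = max(-self.POS_LIMIT, min(self.POS_LIMIT, new_pos))
        self.steps += 1
        reward = -abs(self.pos)  # shaped reward: closer to the origin is better
        done = self.steps >= self.MAX_STEPS
        return self._obs(), reward, done, {}


def check_with_random_actions(env, n_episodes=3, seed=0):
    """Drive the environment with random actions and sanity-check its outputs."""
    rng = random.Random(seed)
    total_steps = 0
    for _ in range(n_episodes):
        obs = env.reset()
        done = False
        while not done:
            assert -1.0 <= obs <= 1.0, "observation left the normalized range"
            obs, reward, done, info = env.step(rng.uniform(-1.0, 1.0))
            assert isinstance(reward, float) and isinstance(done, bool)
            total_steps += 1
    return total_steps


if __name__ == "__main__":
    print(check_with_random_actions(ToyPointEnv()))
```

Keeping the rescaling inside `step` means the agent always sees a symmetric [-1, 1] action space, whatever the physical range is.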
Some tricks for implementing reinforcement learning algorithms:
- Read the original paper several times.
- Read existing implementations (if available).
- Try to have some “sign of life” on toy problems. In short, get it working on simple problems first.
- Validate the implementation by making it run on harder and harder envs (you can compare results against the RL zoo). You usually need to run hyperparameter optimization for that step. You need to be particularly careful on the shape of the different objects you are manipulating (a broadcast mistake will fail silently cf issue #75) and when to stop the gradient propagation. In short, validate the implementation by running it in progressively harder environments and comparing against the RL zoo.
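To illustrate the kind of silent broadcast mistake the last tip warns about, here is a minimal NumPy sketch. It is not a reproduction of the referenced issue; the array names and the TD-target setup are assumptions chosen only to make the shapes concrete.

```python
import numpy as np

gamma = 0.99
rewards = np.array([1.0, 0.0, 0.0, 1.0])  # shape (4,), e.g. from a buffer
next_values = np.zeros((4, 1))            # shape (4, 1), e.g. a network output
values = rewards.reshape(-1, 1)           # shape (4, 1), the current estimates

# Mistake: (4,) + (4, 1) broadcasts to (4, 4) and raises no error.
bad_target = rewards + gamma * next_values
bad_loss = np.mean((bad_target - values) ** 2)  # still a scalar, but wrong

# Fix: give both operands the same explicit shape before combining them.
good_target = rewards.reshape(-1, 1) + gamma * next_values
assert good_target.shape == values.shape
good_loss = np.mean((good_target - values) ** 2)

print(bad_target.shape, bad_loss)    # the (4, 4) shape reveals the problem
print(good_target.shape, good_loss)
```

Because the wrong loss is still a valid scalar, training runs without complaint; asserting shapes at the points where tensors are combined is a cheap way to catch this.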
  
The following lists which environments are relatively easy and which are relatively hard.
A personal pick (by @araffin) for environments with gradual difficulty in RL with continuous actions:
- Pendulum (easy to solve)
- HalfCheetahBullet (medium difficulty with local minima and shaped reward)
- BipedalWalkerHardcore (if it works on that one, then you can have a cookie)
In RL with discrete actions:
- CartPole-v1 (easy to be better than random agent, harder to achieve maximal performance)
- LunarLander
- Pong (one of the easiest Atari games)
- other Atari games (e.g. Breakout)