RL Value-Based: off-policy DQN (Deep Q-Learning), on-policy

Value-based methods: the V-value and the Q-value.
The Q-value methods are the useful ones; from here on, "Value-Based" generally refers to Q-value methods.

Q-Learning stands for a whole family of related algorithms.

Q-Learning -> Approximate Q-Learning -> Deep Q-Learning.
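
As a reminder of the tabular starting point of this progression, here is a minimal sketch of the off-policy Q-Learning update; the state/action counts and hyperparameters below are arbitrary assumptions for illustration:

import numpy as np

# Minimal tabular Q-Learning sketch; sizes and hyperparameters are illustrative assumptions.
n_states, n_actions = 16, 4
alpha, gamma = 0.1, 0.99                 # learning rate, discount factor
Q = np.zeros((n_states, n_actions))

def q_learning_update(s, a, r, s_next, done):
    # Off-policy target: bootstrap from max_a' Q(s', a'), regardless of the behavior policy.
    target = r if done else r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

# Example transition with made-up numbers.
q_learning_update(s=0, a=1, r=1.0, s_next=2, done=False)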

DQN (Deep Q-Learning):

Deep Q-Learning was introduced in 2013, and many improvements have been made since.
Here we'll look at four strategies that dramatically improve the training and results of our DQN agents:

  • raw DQN & fixed Q-targets
  • Double DQN
  • Dueling DQN
  • Prioritized Experience Replay (PER)

DeepMind 2013, DQN & Fixed Q-Targets
Playing Atari with Deep Reinforcement Learning, NIPS 2013
https://arxiv.org/abs/1312.5602

Human Level Control Through Deep Reinforcement Learning, Nature 2015
https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf
Code:
https://sites.google.com/a/deepmind.com/dqn/
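
To make the fixed-Q-targets idea concrete, here is a minimal PyTorch-style sketch of the DQN loss with a separate, periodically synced target network, roughly in the spirit of the Nature 2015 setup; the layer sizes, variable names, and sync schedule are assumptions for illustration, not the paper's exact configuration.

import copy
import torch
import torch.nn as nn

# Online network and a frozen copy used for targets (sizes are illustrative assumptions).
obs_dim, n_actions, gamma = 4, 2, 0.99
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = copy.deepcopy(q_net)        # "fixed Q-targets": a delayed copy of q_net
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_loss(s, a, r, s_next, done):
    # s: (B, obs_dim) float, a: (B,) int64, r/done: (B,) float tensors from a replay buffer.
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                # targets come from the frozen network
        max_next = target_net(s_next).max(dim=1).values
        target = r + gamma * (1.0 - done) * max_next
    return nn.functional.mse_loss(q_sa, target)

# Every C gradient steps, copy the online weights into the target network:
# target_net.load_state_dict(q_net.state_dict())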

DeepMind 2015, Double DQN
Deep Reinforcement Learning with Double Q-learning, AAAI 2016
https://arxiv.org/abs/1509.06461
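
Double DQN changes only how the target is computed: the online network picks the greedy action and the target network evaluates it, which reduces the overestimation bias of plain DQN. A hedged sketch of just that target, assuming the same q_net/target_net naming as the sketch above:

import torch

def double_dqn_target(r, s_next, done, q_net, target_net, gamma=0.99):
    # Action selection by the online net, value estimation by the target net.
    with torch.no_grad():
        best_a = q_net(s_next).argmax(dim=1, keepdim=True)
        q_next = target_net(s_next).gather(1, best_a).squeeze(1)
        return r + gamma * (1.0 - done) * q_next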

DeepMind 2015, Prioritized Experience Replay (PER)
Prioritized Experience Replay, ICLR 2016
https://arxiv.org/abs/1511.05952
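
Prioritized Experience Replay samples transitions in proportion to their TD error (instead of uniformly) and corrects the resulting bias with importance-sampling weights. Below is a deliberately simplified proportional-variant sketch; a real implementation would use a sum-tree for O(log N) sampling, and the alpha/beta values are assumptions:

import numpy as np

class SimplePER:
    # Toy proportional PER buffer: O(N) sampling, kept short for clarity.
    def __init__(self, capacity, alpha=0.6, beta=0.4):
        self.capacity, self.alpha, self.beta = capacity, alpha, beta
        self.data, self.priorities = [], []

    def add(self, transition):
        # New transitions get the current max priority so they are seen at least once.
        self.data.append(transition)
        self.priorities.append(max(self.priorities, default=1.0))
        if len(self.data) > self.capacity:
            self.data.pop(0)
            self.priorities.pop(0)

    def sample(self, batch_size):
        p = np.asarray(self.priorities) ** self.alpha
        probs = p / p.sum()
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        weights = (len(self.data) * probs[idx]) ** (-self.beta)   # importance-sampling weights
        weights /= weights.max()
        return idx, [self.data[i] for i in idx], weights

    def update_priorities(self, idx, td_errors, eps=1e-5):
        for i, err in zip(idx, td_errors):
            self.priorities[i] = abs(err) + eps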

DeepMind 2015, Dueling DQN
Dueling Network Architectures for Deep Reinforcement Learning, ICML 2016
https://arxiv.org/abs/1511.06581
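
Dueling DQN changes the network architecture rather than the learning rule: the head splits into a state-value stream V(s) and an advantage stream A(s, a), recombined as Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a'). A minimal PyTorch sketch with illustrative layer sizes:

import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    # Shared trunk, then separate value and advantage heads (sizes are assumptions).
    def __init__(self, obs_dim=4, n_actions=2):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU())
        self.value = nn.Linear(64, 1)              # V(s)
        self.advantage = nn.Linear(64, n_actions)  # A(s, a)

    def forward(self, s):
        h = self.trunk(s)
        v, a = self.value(h), self.advantage(h)
        # Subtract the mean advantage so V and A are identifiable: Q = V + (A - mean A).
        return v + a - a.mean(dim=1, keepdim=True)

q_values = DuelingQNet()(torch.randn(1, 4))        # shape: (1, n_actions)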

=================

AlphaGo, 2016-2017
Mastering the game of Go with deep neural networks and tree search, Nature 2016
https://storage.googleapis.com/deepmind-media/alphago/AlphaGoNaturePaper.pdf

Mastering the game of Go without human knowledge, Nature 2017
https://www.nature.com/articles/nature24270.epdf
http://augmentingcognition.com/assets/Silver2017a.pdf
https://cs.uwaterloo.ca/~ppoupart/teaching/cs885-spring18/slides/cs885-lecture14a.pdf
https://discovery.ucl.ac.uk/id/eprint/10045895/1/agz_unformatted_nature.pdf

Robotics, 2016
Levine, S., Pastor, P., Krizhevsky, A., and Quillen, D. (2016).
Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection
https://arxiv.org/abs/1603.02199

Drone control, 2017
Kahn, G., Zhang, T., Levine, S., and Abbeel, P. (2017).
PLATO: policy learning using adaptive trajectory optimization, in Proceedings of IEEE International Conference on Robotics and Automation (Singapore), 3342–3349.
https://arxiv.org/abs/1603.00622

============================================

Ref:
https://zhuanlan.zhihu.com/p/107172115
https://theaisummer.com/Taking_Deep_Q_Networks_a_step_further/

https://www.freecodecamp.org/news/improvements-in-deep-q-learning-dueling-double-dqn-prioritized-experience-replay-and-fixed-58b130cc5682/

https://cugtyt.github.io/blog/rl-notes/201807201658.html

An informal survey of deep reinforcement learning (DRL): from DQN to AlphaGo
https://blog.csdn.net/jinzhuojun/article/details/52752561

Reinforcement learning: the three key tricks for advanced DQN (PyTorch)
https://blog.csdn.net/MR_kdcon/article/details/111245496
Three major improvements to DQN
https://cloud.tencent.com/developer/article/1092132
https://cloud.tencent.com/developer/article/1092124
https://cloud.tencent.com/developer/article/1092121
A hands-on introduction to Deep Q-Learning with Python and OpenAI Gym
https://blog.csdn.net/tMb8Z9Vdm66wH68VX1/article/details/90306841

What are the concrete differences between the two major RL schools behind DeepMind and OpenAI?
https://www.zhihu.com/question/316626294
