DRL Algorithms
xyp99
[RL 16] Revisiting Fundamentals of Experience Replay (ICML, 2020)
Definitions: replay capacity D: the buffer size; age of the oldest policy N: the number of policy updates made while a transition sits in the buffer; replay ratio K: the number of policy updates per environment step (0.25 for DQN). Relation: N = K * D (when D and N are scaled proportionally). Note: the number of policy updates is independent of batch size. Experimental results: Rainbow with a larger buffer...
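A quick sanity check of the relation above; a minimal sketch with illustrative numbers (assumed, not taken from the paper):

```python
# Illustrative DQN-style numbers (assumptions, not from the paper).
replay_capacity_D = 1_000_000  # transitions the buffer holds
replay_ratio_K = 0.25          # policy updates per environment step

# A transition inserted now is evicted after D environment steps,
# during which the policy has been updated K * D times.
age_of_oldest_policy_N = replay_ratio_K * replay_capacity_D
print(age_of_oldest_policy_N)  # 250000.0
```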
[RL 15] QTRAN (ICML, 2019)
Paper: QTRAN: Learning to Factorize with Transformation for Cooperative Multi-Agent Reinforcement Learning
[RL 14] QMIX (ICML, 2018, Oxford)
Paper: QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. Background: same as VDN. QMIX assumes Qtot and the per-agent Qi are related as in Eq. (4); Eq. (4) can be realized via Eq. (5), and Eq. (5) is implemented by the QMIX network architecture of Fig. 2: agent networks (DRQN, one Qi per agent, making local decisions) and a mixing network (combines the Qi while guaranteeing monotonicity, Eq. (5)); see the sketch below. The way monotonicity is guaranteed...
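A minimal sketch of the monotonic mixing idea (layer sizes and shapes are assumptions for illustration): state-conditioned hypernetworks produce the mixing weights, and taking their absolute value keeps every weight non-negative, which makes Qtot monotonic in each Qi.

```python
import torch
import torch.nn as nn

class MonotonicMixer(nn.Module):
    """QMIX-style mixing sketch: per-agent Qs are combined with
    state-conditioned, non-negative weights, so dQ_tot/dQ_i >= 0."""
    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        # Hypernetworks: map the global state to mixing weights/biases.
        self.w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.b1 = nn.Linear(state_dim, embed_dim)
        self.w2 = nn.Linear(state_dim, embed_dim)
        self.b2 = nn.Linear(state_dim, 1)

    def forward(self, agent_qs, state):  # (B, n_agents), (B, state_dim)
        B, n = agent_qs.shape
        w1 = torch.abs(self.w1(state)).view(B, n, -1)  # non-negative weights
        hidden = torch.relu(
            torch.bmm(agent_qs.unsqueeze(1), w1).squeeze(1) + self.b1(state))
        w2 = torch.abs(self.w2(state))                 # non-negative weights
        return (hidden * w2).sum(1, keepdim=True) + self.b2(state)  # Q_tot
```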
[RL 13] VDN (201706, DeepMind)
Paper: Value-Decomposition Networks For Cooperative Multi-Agent Learning. Background: cooperative setting (all agents receive the same reward). The centralized MARL approach has drawbacks: a lazy agent may emerge, since the lazy agent's exploration can lower the reward. Independent learning also has drawbacks: non-stationarity, spurious rewa...
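The remedy VDN proposes is the additive decomposition Qtot(s, a) = Σ_i Qi(o_i, a_i), each Qi computed from that agent's local observation; a minimal sketch (tensor shapes are assumptions):

```python
import torch

def vdn_q_tot(per_agent_qs: torch.Tensor) -> torch.Tensor:
    """VDN value decomposition: the joint action-value is simply the
    sum of per-agent values Q_i(o_i, a_i).
    per_agent_qs: (batch, n_agents) chosen-action values."""
    return per_agent_qs.sum(dim=1, keepdim=True)  # (batch, 1) Q_tot
```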
[RL 12] Multi-Agent Reinforcement Learning A Selective Overview of Theories and Algorithms
To be continued. Paper: Multi-Agent Reinforcement Learning: A Selective Overview of Theories and Algorithms. 4 MARL Algorithms with Theory. Notation: Markov Game = Stochastic Game; Multi-Agent MDP = Markov Teams. 4.1 Cooperative Setting; 4.1.1 Homogeneous Agents
[RL 11] Asynchronous Methods for Deep Reinforcement Learning (A3C) (ICML, 2016)
1. Introduction: problems with online DRL: correlated samples lead to unstable training. The usual solution, experience replay (ER), works only for off-policy algorithms and costs resources; asynchronous methods are the alternative. 4. Asynchronous RL Framework
[RL 10] Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO (ICLR, 2020)
1 Introduction: references on the brittleness of deep RL. Motivation: how do the multitude of mechanisms used in deep RL training algorithms impact agent behavior and thus performa...
[RL 9] Trust Region Policy Optimization (ICML, 2015)
1 Introduction: policy optimization categories: policy iteration (GPI); policy gradient (e.g., TRPO); derivative-free optimization methods. 2 Preliminaries: consider an infinite-horizon discounted MDP (rather than an average-reward formulation)...
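For reference (the standard form of the TRPO step, not spelled out in the excerpt): each update maximizes a surrogate objective under a KL trust-region constraint:

$$\max_{\theta}\; \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}},\, a \sim \pi_{\theta_{\text{old}}}} \left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, A_{\theta_{\text{old}}}(s, a) \right] \quad \text{s.t.} \quad \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}}}\!\left[ D_{\mathrm{KL}}\!\left( \pi_{\theta_{\text{old}}}(\cdot \mid s) \,\Vert\, \pi_{\theta}(\cdot \mid s) \right) \right] \le \delta$$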
[RL 8] Proximal Policy Optimization Algorithms (arXiv, 1707)
1. Introduction: room for improvement in RL algorithms: scalable (supporting parallel implementations to make use of resources), data efficient, robust (not sensitive to hyperparameters). Problems: A3C has poor data efficiency; TRPO is relatively complicated...
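PPO's central formula, the clipped surrogate objective, for reference (r_t(θ) is the probability ratio, ε the clip range):

$$L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[ \min\!\left( r_t(\theta)\hat{A}_t,\; \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t \right) \right], \qquad r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$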
[RL 7] Deep Deterministic Policy Gradient (DDPG) (ICLR, 2016)
0. Abstract: "end-to-end" learning, directly from raw pixel inputs. 1. Introduction: DQN is not naturally suitable for continuous action spaces. 2. Background: Bellman equation under a stochastic policy: $Q^{\pi}(s_t, a_t) = \mathbb{E}_{r_t, s_{t+1} \sim E}\left[ r(s_t, a_t) + \gamma\, \mathbb{E}_{a_{t+1} \sim \pi}\left[ Q^{\pi}(s_{t+1}, a_{t+1}) \right] \right]$...
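The step the excerpt cuts off (as stated in the DDPG paper): with a deterministic policy μ the inner expectation over actions drops out, which is what permits off-policy learning from replayed transitions:

$$Q^{\mu}(s_t, a_t) = \mathbb{E}_{r_t, s_{t+1} \sim E}\!\left[ r(s_t, a_t) + \gamma\, Q^{\mu}\!\left( s_{t+1}, \mu(s_{t+1}) \right) \right]$$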
[RL 6] Deterministic Policy Gradient Algorithms (ICML, 2014)
Stochastic PGT (SPGT). Theorem:
$$\nabla_{\theta} J(\pi_{\theta}) = \int_{\mathcal{S}} \rho^{\pi}(s) \int_{\mathcal{A}} \nabla_{\theta} \pi_{\theta}(a \mid s)\, Q^{\pi}(s, a)\, \mathrm{d}a\, \mathrm{d}s = \mathbb{E}_{s \sim \rho^{\pi},\, a \sim \pi_{\theta}}\left[ \nabla_{\theta} \log \pi_{\theta}(a \mid s)\, Q^{\pi}(s, a) \right]$$...
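The entry is truncated before the paper's namesake result; for reference, the Deterministic Policy Gradient Theorem reads:

$$\nabla_{\theta} J(\mu_{\theta}) = \int_{\mathcal{S}} \rho^{\mu}(s)\, \nabla_{\theta} \mu_{\theta}(s)\, \nabla_{a} Q^{\mu}(s, a)\big|_{a = \mu_{\theta}(s)}\, \mathrm{d}s = \mathbb{E}_{s \sim \rho^{\mu}}\!\left[ \nabla_{\theta} \mu_{\theta}(s)\, \nabla_{a} Q^{\mu}(s, a)\big|_{a = \mu_{\theta}(s)} \right]$$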
[RL 5] Reinforcement Learning An Introduction: Ch9 On-policy Prediction with Approximation
9.0 Preliminary: function approximation gives generalization: the number of parameters << |S|, so changing V(s) also affects other V(s'). 9.1 Value-function Approximation: problems: no static training set (we learn from an ever-growing stream of data); non-stationarity...
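The chapter's basic update, semi-gradient TD(0) for the weight vector w (standard form, for reference):

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \left[ R_{t+1} + \gamma\, \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t) \right] \nabla \hat{v}(S_t, \mathbf{w}_t)$$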
[RL 4] Reinforcement Learning An Introduction: Ch13 Policy Gradient Algorithm
13.1 Advantages of PG: stochastic policies: PG learns a stochastic policy (the policy outputs a distribution, and actions are obtained by sampling), whereas value-based algorithms use an ε-greedy policy; in some problems the optimal policy is itself stochastic. Exploration: a stochastic policy benefits exploration, and the policy can gradually become deterministic, i.e., exploration is adjusted automatically...
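The chapter's basic algorithm is REINFORCE, whose update uses the sampled return G_t (standard form from the book, for reference):

$$\theta_{t+1} = \theta_t + \alpha\, \gamma^{t}\, G_t\, \nabla_{\theta} \ln \pi(A_t \mid S_t, \theta_t)$$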
[RL 3] Soft Actor-Critic
Papers: Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor; Soft Actor-Critic Algorithms and Applications; Soft Actor-Critic for Discrete Action Settings. Algorithm theory (tabular): motivation: design an actor-critic + max entropy + ...
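The maximum-entropy objective SAC optimizes (standard form; the temperature α weights the entropy bonus against the reward):

$$J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_{\pi}}\!\left[ r(s_t, a_t) + \alpha\, \mathcal{H}\!\left( \pi(\cdot \mid s_t) \right) \right]$$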
[RL 2] Soft Q-learning
The authors aim for a policy that maximizes entropy while remaining expressive, with a distribution of the form $\pi(a \mid s) \propto \exp(Q^{\pi}(s, a))$, i.e., an energy-based policy. In Theorem 1 the authors define the soft Q and V functions. By proving that a policy of the form of Eq. (17) satisfies the policy improvement theorem, they show that the optimal policy is an energy-based policy. Theorem 2 states that...
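For reference, the soft value functions Theorem 1 defines (standard soft Q-learning definitions, written here with temperature α = 1):

$$Q_{\mathrm{soft}}(s_t, a_t) = r_t + \mathbb{E}_{\rho_{\pi}}\!\left[ \sum_{l=1}^{\infty} \gamma^{l} \left( r_{t+l} + \mathcal{H}\!\left( \pi(\cdot \mid s_{t+l}) \right) \right) \right], \qquad V_{\mathrm{soft}}(s_t) = \log \int_{\mathcal{A}} \exp\!\left( Q_{\mathrm{soft}}(s_t, a') \right) \mathrm{d}a'$$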
[RL 1] Deep Recurrent Q-Learning (DRQN)
Paper: Deep Recurrent Q-Learning for Partially Observable MDPs (AAAI'15). Paper source code: https://github.com/mhauskn/dqn/tree/recurrent. Motivation: a limitation of DQN: when solving POMDPs (where taking an action requires history), we lack an effective mechanism for integrating history. DQN's remedy is to stack 4 frames, but that approach only works for integrating image-like...
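DRQN's fix is to carry history in a recurrent state: the first post-convolution fully connected layer of DQN is replaced with an LSTM. A minimal sketch (layer sizes and shapes are assumptions for illustration, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class DRQN(nn.Module):
    """Sketch: DQN-style convolutional trunk followed by an LSTM, so
    Q-values condition on history through the recurrent hidden state."""
    def __init__(self, n_actions, hidden=512):
        super().__init__()
        self.conv = nn.Sequential(               # one 84x84 frame per step
            nn.Conv2d(1, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
        )
        self.lstm = nn.LSTM(64 * 7 * 7, hidden, batch_first=True)
        self.q_head = nn.Linear(hidden, n_actions)

    def forward(self, frames, state=None):       # frames: (B, T, 1, 84, 84)
        B, T = frames.shape[:2]
        feats = self.conv(frames.flatten(0, 1))  # (B*T, 64, 7, 7)
        feats = feats.flatten(1).view(B, T, -1)  # (B, T, 3136)
        out, state = self.lstm(feats, state)     # history lives in `state`
        return self.q_head(out), state           # (B, T, n_actions)
```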