
Original: [RL 16] Revisiting Fundamentals of Experience Replay (ICML, 2020)

Definitions. Replay capacity D: the size of the buffer. Age of the oldest policy N: the number of policy updates performed while a transition stays in the buffer. Replay ratio K: the number of policy updates per environment step (0.25 for DQN). Relation: N = K*D (when D and N are scaled in proportion). Note: the number of policy updates is independent of the batch size. Experimental results: enlarging the buffer for Rainbow...
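As a quick illustration of the N = K*D relation summarized above, here is a tiny Python check; the concrete numbers (a 1M-transition buffer and DQN's default replay ratio of 0.25) are only illustrative.

# Illustrative numbers only: with a replay ratio K of 0.25 policy updates per
# environment step and a buffer holding D = 1,000,000 transitions, the oldest
# transition in the buffer is N = K * D = 250,000 policy updates old.
replay_ratio_K = 0.25          # policy updates per environment step (DQN default)
replay_capacity_D = 1_000_000  # transitions kept in the replay buffer
oldest_policy_age_N = replay_ratio_K * replay_capacity_D
print(oldest_policy_age_N)     # 250000.0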

2021-01-19 16:08:21 247 1

Original: [RL 15] QTRAN (ICML, 2019)

paper: QTRAN: Learning to Factorize with Transformation for Cooperative Multi-Agent Reinforcement Learning

2021-01-19 15:20:53 361

Original: [RL 14] QMIX (ICML, 2018, Oxford)

Paper: QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. Background: same as VDN. 4. QMIX assumes that Qtot and the per-agent Qi are related as in Eq. (4); Eq. (4) can be realized through Eq. (5), and Eq. (5) can be implemented with the QMIX network architecture of Fig. 2. Agent networks: DRQNs that make local decisions and output Qi. Mixing network: combines the Qi while guaranteeing monotonicity (Eq. (5)). Monotonicity is guaranteed by...
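A rough PyTorch sketch of the monotonic-mixing idea described above; this is not the paper's exact architecture, and all layer sizes and names are hypothetical. Monotonicity in each Qi comes from taking the absolute value of the state-conditioned (hypernetwork) mixing weights, so dQtot/dQi >= 0.

import torch
import torch.nn as nn

class MonotonicMixer(nn.Module):
    # Sketch of a QMIX-style mixing network: hypernetworks map the global
    # state to mixing weights, and abs() keeps those weights non-negative,
    # which makes Q_tot monotonic in every agent's Q_i.
    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Linear(state_dim, 1)

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents); state: (batch, state_dim)
        w1 = torch.abs(self.hyper_w1(state)).view(-1, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(-1, 1, self.embed_dim)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(-1, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(-1, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(-1)  # Q_tot: (batch,)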

2021-01-17 16:23:22 262 1

Original: [RL 13] VDN (201706, DeepMind)

paper: Value-Decomposition Networks For Cooperative Multi-Agent Learning. Background: cooperative setting (all agents share the same reward). A fully centralized MARL approach has drawbacks: a lazy agent can emerge, and the lazy agent's exploration may lower the reward. Independent learning also has drawbacks: non-stationarity, spurious rewards...
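The decomposition that gives VDN its name can be sketched in a few lines of PyTorch (a minimal illustration, not the paper's full training setup; the per-agent networks are assumed to exist elsewhere):

import torch
import torch.nn as nn

class VDNMixer(nn.Module):
    # VDN's additive decomposition: Q_tot(s, a_1..a_n) = sum_i Q_i(o_i, a_i).
    def forward(self, agent_qs):
        # agent_qs: (batch, n_agents) chosen-action values from each agent
        return agent_qs.sum(dim=1)  # joint value used for the single TD loss

Because the joint value is a plain sum, the gradient of the shared TD loss flows back into every agent's own network, which is the mechanism the paper uses against the credit-assignment and lazy-agent issues mentioned above.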

2021-01-17 12:50:27 285 1

Original: MuJoCo state/action/observation dimensions (RL test platform)

from paper: CONTINUOUS CONTROL WITH DEEP REINFORCEMENT LEARNING. Dimensions:
task name        dim(s)  dim(a)  dim(o)
blockworld1      18      5       43
blockworld3da    31      9       102
canada           22      7       62
canada2d         14      3       29
cart             2       1       3
cartpole         4       1       14
cartpoleBalance  4       1       14
cartpoleParal...

2021-01-14 10:49:07 1054 2

Original: [RL 12] Multi-Agent Reinforcement Learning: A Selective Overview of Theories and Algorithms

To be continued. paper: Multi-Agent Reinforcement Learning: A Selective Overview of Theories and Algorithms. 4 MARL Algorithms with Theory. Notation: Markov Game = Stochastic Game; Multi-Agent MDP = Markov Teams. 4.1 Cooperative Setting. 4.1.1 Homogeneous Agents

2020-12-12 09:28:35 949 1

Original: 2020-11-14

data segment
;input data segment code here
data ends
stack segment
;input stack segment code here
stack ends
code segment
assume cs:code,ds:data,ss:stack
start:
mov ax,data
mov ds,ax
;set up the 8255 control register
mov dx,386h    ;A0, A1 = 1,1
mov al,10010000b    ;8-bit control word: port B output, port A...

2020-11-14 09:42:16 84

Original: IDEA 2020.2.1 + Tomcat 9.0.39 web page deployment 20201101

IDEA 2020.2.1 + Tomcat 9.0.39 web page deployment, installation steps on Windows 10. Tomcat: download the appropriate archive from the official Tomcat site (the archive itself is the program directory); unzip it; add the bin directory inside the installation directory to the system Path; from the command line, run startup.bat under bin, ... installation succeeded. IDEA: create a new plain Java project; add framework support and choose Web (J2EE); the project structure is generated automatically. Deployment: add a run configuration, choose Tomcat Server, select the Tomcat installation path, and add this project to the deployment!!!

2020-10-31 19:28:14 275

Original: [RL 11] Asynchronous Methods for Deep Reinforcement Learning (A3C) (ICML, 2016)

Asynchronous Methods for Deep Reinforcement Learning (A3C) (ICML, 2016). 1. Introduction. On-line DRL problems: correlated samples lead to unstable training. Solution: ER (experience replay), but it only applies to off-policy algorithms and has a resource cost; hence asynchronous methods. 4. Asynchronous RL Framework

2020-10-30 20:27:25 168

Original: [RL 10] Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO (ICLR, 2020)

Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO (ICLR, 2020). 1 Introduction. References for brittle DRL. Motivation: how do the multitude of mechanisms used in deep RL training algorithms impact agent behavior and thus the performance...

2020-10-30 10:55:47 563

Original: [RL 9] Trust Region Policy Optimization (ICML, 2015)

Trust Region Policy Optimization (ICML, 2015). 1 Introduction. Policy optimization categories: policy iteration (GPI); PG (e.g. TRPO); derivative-free optimization methods. 2 Preliminaries. Consider an infinite-horizon discounted MDP instead of an average...

2020-10-30 09:34:04 151

Original: [RL 8] Proximal Policy Optimization Algorithms (arXiv, 1707)

Proximal Policy Optimization Algorithms (arXiv, 1707). 1. Introduction. Room for improvement in RL: scalable (support for parallel implementations, to make use of resources), data efficient, robust (not sensitive to hyperparameters). Problems: A3C has poor data efficiency; TRPO is c...

2020-10-28 14:59:32 264

Original: [RL 7] Deep Deterministic Policy Gradient (DDPG) (ICLR, 2016)

Deep Deterministic Policy Gradient (ICLR, 2016). 0. Abstract: "end-to-end" learning, directly from raw pixel inputs. 1. Introduction: DQN is not naturally suitable for continuous action spaces. 2. Background: Bellman equation for a stochastic policy, Qπ(st,at) = E_{rt,st+...
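For reference, the stochastic-policy Bellman equation that the excerpt truncates has the standard form (reconstructed here, not quoted from the post):

$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{r_t,\, s_{t+1} \sim E}\left[ r(s_t, a_t) + \gamma\, \mathbb{E}_{a_{t+1} \sim \pi}\left[ Q^{\pi}(s_{t+1}, a_{t+1}) \right] \right]$$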

2020-10-27 17:33:44 269

Original: [RL 6] Deterministic Policy Gradient Algorithms (ICML, 2014)

Deterministic Policy Gradient Algorithms (ICML, 2014). Stochastic PGT (SPGT) Theorem:
$$\nabla_{\theta} J(\pi_{\theta}) = \int_{\mathcal{S}} \rho^{\pi}(s) \int_{\mathcal{A}} \nabla_{\theta} \pi_{\theta}(a \mid s)\, Q^{\pi}(s, a)\, \mathrm{d}a\, \mathrm{d}s = \mathbb{E}_{s \sim \rho^{\pi},\, a \sim \pi_{\theta}}\left[ \nabla_{\theta} \log \pi_{\theta}(a \mid s)\, Q^{\pi}(s, a) \right]$$
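For contrast with the SPGT above, the paper's deterministic policy gradient theorem reads (added here as standard background, not taken from the truncated excerpt):

$$\nabla_{\theta} J(\mu_{\theta}) = \int_{\mathcal{S}} \rho^{\mu}(s)\, \nabla_{\theta} \mu_{\theta}(s)\, \nabla_{a} Q^{\mu}(s, a)\big|_{a=\mu_{\theta}(s)}\, \mathrm{d}s = \mathbb{E}_{s \sim \rho^{\mu}}\left[ \nabla_{\theta} \mu_{\theta}(s)\, \nabla_{a} Q^{\mu}(s, a)\big|_{a=\mu_{\theta}(s)} \right]$$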

2020-10-27 10:08:02 253

Original: [RL 5] Reinforcement Learning: An Introduction, Ch9 On-policy Prediction with Approximation

Chapter 9 On-policy Prediction with Approximation. 9.0 Preliminary: function approximation; generalization (the number of parameters << |S|, so changing V(s) also affects other V(s')). 9.1 Value-function Approximation. Problems: no static training set (learning happens from a growing stream of data); non-stationary...

2020-10-26 09:39:54 144

Original: [RL 4] Reinforcement Learning: An Introduction, Ch13 Policy Gradient Algorithms

Chapter 13 Policy Gradient Algorithms. 13.1 Advantages of PG. Stochastic policy: PG learns a stochastic policy (the policy outputs a distribution, and actions are obtained by sampling), whereas value-based algorithms use an ε-greedy policy; in some problems the optimal policy is itself stochastic. Exploration: a stochastic policy helps exploration, and the policy can gradually become deterministic, i.e. exploration is adjusted automatically.

2020-10-26 08:01:00 140

Original: [RL 3] Soft Actor-Critic

Papers: Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor; Soft Actor-Critic Algorithms and Applications; Soft Actor-Critic for Discrete Action Settings. Algorithm theory (tabular). Motivation: design an actor-critic + max entropy + ...
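The maximum-entropy objective behind the "max entropy" ingredient mentioned above is, in its standard form (added as background, not quoted from the post):

$$J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_{\pi}}\left[ r(s_t, a_t) + \alpha\, \mathcal{H}\left( \pi(\cdot \mid s_t) \right) \right]$$

where the temperature α trades off the entropy bonus against the reward.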

2020-10-07 07:56:43 349

Original: [RL 2] Soft Q-learning

The authors want an entropy-maximizing and expressive policy whose distribution has the form π(a|s) ∼ exp(Q^π(s,a)), i.e. an energy-based form. In Theorem 1 the authors define the Q and V functions; by proving that a policy of the form of Eq. (17) satisfies the policy improvement theorem, they show that the optimal policy has this energy-based form. Theorem 2 states that...
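The Theorem 1 definitions referred to above take roughly the following form in the paper (reproduced from memory with temperature α; check the paper for the exact statement):

$$V_{\mathrm{soft}}(s_t) = \alpha \log \int_{\mathcal{A}} \exp\left( \tfrac{1}{\alpha} Q_{\mathrm{soft}}(s_t, a') \right) \mathrm{d}a', \qquad \pi^{*}(a_t \mid s_t) = \exp\left( \tfrac{1}{\alpha} \left( Q_{\mathrm{soft}}(s_t, a_t) - V_{\mathrm{soft}}(s_t) \right) \right)$$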

2020-10-05 19:28:00 1166

Original: [RL 1] Deep Recurrent Q-Learning (DRQN)

paper: Deep Recurrent Q-Learning for Partially Observable MDPs (AAAI'15). paper source code: https://github.com/mhauskn/dqn/tree/recurrent. Motivation: a limitation of DQN is that, when solving POMDP problems (where taking an action requires history), we lack an effective mechanism for integrating history. The approach used in DQN is to stack 4 frames of images, but that only works for integrating image-like...
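A minimal PyTorch sketch of the recurrent alternative the post motivates: replace frame stacking with a recurrent layer that carries history. All sizes and names below are illustrative, not the paper's configuration (which uses convolutional features plus an LSTM).

import torch
import torch.nn as nn

class RecurrentQNet(nn.Module):
    # DRQN-style idea: an LSTM integrates the observation history, so the
    # Q-values can depend on more than the current (possibly aliased) frame.
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden)              # per-step encoder
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)  # history integration
        self.q_head = nn.Linear(hidden, n_actions)             # Q-value per action

    def forward(self, obs_seq, hidden_state=None):
        # obs_seq: (batch, time, obs_dim) -- a sequence of observations
        x = torch.relu(self.encoder(obs_seq))
        x, hidden_state = self.lstm(x, hidden_state)
        return self.q_head(x), hidden_state                    # (batch, time, n_actions)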

2020-10-02 10:53:09 341

Original: The StarCraft Multi-Agent Challenge (smac): environment installation on Windows 10

Paper: The StarCraft Multi-Agent Challenge, NeurIPS 2019. Installation environment: Windows 10. Steps: install StarCraft II (about 30 GB; installing with all the default options is recommended, otherwise the Python library code has to be modified later); create a conda environment; install torch; install Python...

2020-09-21 21:02:49 2626 5

Original: Anaconda3 common commands

Create:
conda create -n xxx python=x.x
conda create --prefix=/xxx/xx python=x.x
Remove:
conda env remove -n xxx
conda env remove --prefix=/xxx/xx
Activate:
conda activate xxx
List installed packages:
conda list
List environments:
conda env list

2020-09-20 17:55:55 131
