[RL 7] Deep Deterministic Policy Gradient (DDPG) (ICLR, 2016)

0.Abstract

  1. “end-to-end” learning: directly from raw pixel inputs

1.Introduction

  1. DQN is not naturally suitable for continuous action spaces

2.Background

  1. Bellman equation
    1. Stochastic Policy
      $Q^{\pi}(s_{t}, a_{t}) = \mathbb{E}_{r_{t}, s_{t+1} \sim E}\left[r(s_{t}, a_{t}) + \gamma\, \mathbb{E}_{a_{t+1} \sim \pi}\left[Q^{\pi}(s_{t+1}, a_{t+1})\right]\right]$
    2. Deterministic Policy
      $Q^{\mu}(s_{t}, a_{t}) = \mathbb{E}_{r_{t}, s_{t+1} \sim E}\left[r(s_{t}, a_{t}) + \gamma\, Q^{\mu}(s_{t+1}, \mu(s_{t+1}))\right]$
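The deterministic Bellman equation above drops the inner expectation over actions, since the policy picks a single action $\mu(s')$. A minimal numpy sketch of the resulting bootstrapped target (the toy `mu` and `Q` functions are illustrative assumptions, not the paper's networks):

```python
import numpy as np

gamma = 0.99  # discount factor

def mu(s):
    # deterministic policy a = mu(s); here just a toy linear map
    return -0.5 * s

def Q(s, a):
    # toy critic; in DDPG this is a neural network Q(s, a | theta)
    return -(s ** 2) - (a ** 2)

def td_target(r, s_next):
    # deterministic policy: evaluate Q at the single action mu(s'),
    # no expectation over a_{t+1} as in the stochastic case
    return r + gamma * Q(s_next, mu(s_next))

y = td_target(r=1.0, s_next=2.0)
```

This single-action evaluation is what lets DDPG learn off-policy: the target depends only on the environment transition and the current deterministic policy.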

3.Algorithm

  1. Method
    Function approximation for the off-policy actor-critic of DPG:
    1. actor $\mu(s)$
    2. critic $Q(s,a)$
  2. techniques
    1. non-linear approximation
      no convergence guarantee, but essential in order to learn and generalize on large state spaces
    2. experience replay
    3. target (soft update)
      1. required to have stable targets $y$ in order to consistently train the critic without divergence
    4. batch normalization
      1. in low-dimensional case: input + inner layers
      2. allow learn different envs with the same algorithm settings
    5. exploration
      1. Ornstein-Uhlenbeck (OU) process noise
    6. reparametrization trick
      • $\mu(s)$ outputs the mean action
      • $a = \mu(s) + \mathcal{N}$
      • avoids sampling from a distribution
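Two of the techniques above can be sketched in a few lines of numpy. The soft target update and the OU process follow the paper's forms; the coefficient values shown (`tau = 0.001`, `theta = 0.15`, `sigma = 0.2`) are the paper's defaults, and everything else is schematic:

```python
import numpy as np

def soft_update(target_params, online_params, tau=0.001):
    # target <- tau * online + (1 - tau) * target, per parameter;
    # slow tracking keeps the critic's targets y stable
    return [(1 - tau) * t + tau * o
            for t, o in zip(target_params, online_params)]

class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated exploration noise."""
    def __init__(self, dim, mu=0.0, theta=0.15, sigma=0.2):
        self.mu, self.theta, self.sigma = mu, theta, sigma
        self.x = np.ones(dim) * mu

    def sample(self):
        # mean-reverting step plus Gaussian increment
        dx = self.theta * (self.mu - self.x) \
             + self.sigma * np.random.randn(*self.x.shape)
        self.x = self.x + dx
        return self.x

# exploration: a_t = mu(s_t) + N_t
noise = OUNoise(dim=1)
```

The temporal correlation of OU noise suits physical control tasks with inertia, where independent per-step noise would average out.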

4.Results

  1. target network is necessary for good performance
  2. Value estimation
    • It can be challenging to learn accurate value estimates in harder tasks, but DDPG still learns good policies
  3. Batch normalization brings improvements in most experiments
  4. learning speed
    finds solutions with fewer steps of experience than DQN used in the Atari domain

5.Related Work

  1. TRPO
    1. does not require learning an action-value function, and (perhaps as a result) appears to be significantly less data efficient; learning Q-values can be seen as reusing data

Supplementary

  1. EXPERIMENT DETAILS
    1. actor and critic use different learning rates
    2. L2 weight decay TODO
    3. tanh output layer for actor
    4. layer initialization
    5. action input position
  2. MUJOCO ENVIRONMENTS introduction
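Two of the experiment details above can be sketched together: the actor's tanh output layer, which bounds actions, and the final-layer initialization from a small uniform range (the paper uses Uniform(-3e-3, 3e-3)). The 300-unit hidden size matches the paper's low-dimensional setup; the rest of this snippet is schematic:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_final_layer(fan_in, fan_out, scale=3e-3):
    # small uniform init keeps initial actions and Q-values near zero,
    # so early targets are not dominated by random network output
    return rng.uniform(-scale, scale, size=(fan_in, fan_out))

def actor_output(h, W):
    # tanh bounds each action dimension to [-1, 1]
    return np.tanh(h @ W)

W = init_final_layer(300, 1)           # final actor layer weights
a = actor_output(np.ones(300), W)      # toy hidden activations -> action
```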