# Deep Deterministic Policy Gradient (ICLR, 2016)

## 0.Abstract

1. “end-to-end” learning: directly from raw pixel inputs

## 1.Introduction

1. DQN is not naturally suited to continuous action spaces

## 2.Background

1. Bellman equation
1. Stochastic Policy
$$Q^{\pi}(s_{t}, a_{t}) = \mathbb{E}_{r_{t}, s_{t+1} \sim E}\left[r(s_{t}, a_{t}) + \gamma\,\mathbb{E}_{a_{t+1} \sim \pi}\left[Q^{\pi}(s_{t+1}, a_{t+1})\right]\right]$$
2. Deterministic Policy
$$Q^{\mu}(s_{t}, a_{t}) = \mathbb{E}_{r_{t}, s_{t+1} \sim E}\left[r(s_{t}, a_{t}) + \gamma\, Q^{\mu}(s_{t+1}, \mu(s_{t+1}))\right]$$
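Because the policy is deterministic, the inner expectation over actions drops out, so the critic target reduces to a one-step lookup. A minimal sketch (the toy `mu` and `q` functions are purely illustrative stand-ins, not the paper's networks):

```python
def deterministic_bellman_target(r, s_next, mu, q, gamma=0.99, done=False):
    """One-step TD target y_t = r_t + gamma * Q(s_{t+1}, mu(s_{t+1}))."""
    if done:
        return r
    return r + gamma * q(s_next, mu(s_next))

# Toy linear policy and critic for demonstration only.
mu = lambda s: -0.5 * s        # deterministic action
q = lambda s, a: s + a         # stand-in action-value function

y = deterministic_bellman_target(r=1.0, s_next=2.0, mu=mu, q=q, gamma=0.9)
```

This is the quantity the critic is regressed toward in the stochastic-policy case as well, except there the inner expectation over $a_{t+1} \sim \pi$ must be estimated by sampling.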

## 3.Algorithm

1. Method
Approximates the off-policy actor-critic of DPG with deep networks:
1. actor $\mu(s)$
2. critic $Q(s, a)$
2. techniques
1. non-linear function approximation
no convergence guarantee, but essential for learning and generalizing over large state spaces
2. experience replay
3. target networks (soft update)
1. required to have stable targets $y$ in order to consistently train the critic without divergence
4. batch normalization
1. in the low-dimensional case: applied to the input and the inner layers
2. allows learning different environments with the same algorithm settings
5. exploration
1. Ornstein-Uhlenbeck (OU) process noise
6. reparametrization trick
• $\mu(s)$ outputs the mean
• $a = \mu(s) + \mathcal{N}$
• avoids sampling from a distribution
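The soft target update in technique 3 can be sketched in a few lines; the parameters are represented as plain lists of floats here for illustration, and `tau` is the small mixing coefficient from the paper (e.g. $10^{-3}$):

```python
def soft_update(target_params, online_params, tau=1e-3):
    """Polyak averaging: theta' <- tau * theta + (1 - tau) * theta'."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_params, target_params)]

target = [0.0, 0.0]
online = [1.0, 2.0]
target = soft_update(target, online, tau=0.5)  # targets drift slowly toward online params
```

With a small `tau`, the target networks change slowly, which is what keeps the targets $y$ stable during critic training.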
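Exploration (technique 5) adds temporally correlated OU noise to the deterministic action, $a_t = \mu(s_t) + \mathcal{N}_t$. A minimal sketch, assuming illustrative `theta`/`sigma` values and a toy one-dimensional actor:

```python
import random

class OUNoise:
    """Ornstein-Uhlenbeck process: dx = -theta * x * dt + sigma * dW."""
    def __init__(self, theta=0.15, sigma=0.2, dt=1e-2, x0=0.0):
        self.theta, self.sigma, self.dt, self.x = theta, sigma, dt, x0

    def sample(self):
        # Euler-Maruyama step; successive samples are correlated in time.
        self.x += (-self.theta * self.x * self.dt
                   + self.sigma * (self.dt ** 0.5) * random.gauss(0.0, 1.0))
        return self.x

noise = OUNoise()
mu = lambda s: 0.3 * s                 # stand-in deterministic actor
action = mu(1.0) + noise.sample()      # exploratory action
```

The temporal correlation of the OU process is what makes it well suited to physical control tasks with inertia, compared with uncorrelated Gaussian noise.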

## 4.Results

1. target network is necessary for good performance
2. Value estimation
• It can be challenging to learn accurate value estimates in harder tasks, but DDPG still learns good policies
3. Batch normalization brings improvements in most experiments
4. learning speed
DDPG finds solutions with fewer steps of experience than DQN used in the Atari domain

## 5.Related Work

1. TRPO
1. does not require learning an action-value function, and (perhaps as a result) appears to be significantly less data efficient; learning Q values can be seen as a way to reuse data

## Supplementary

1. EXPERIMENT DETAILS
1. actor and critic use different learning rates
2. L2 weight decay on the critic (TODO)
3. tanh output layer for the actor
4. layer initialization
5. position where the action is input to the critic
2. MUJOCO ENVIRONMENTS introduction