DDPG - Deep Deterministic Policy Gradient
INTRODUCTION:
An actor-critic, model-free, off-policy algorithm based on DPG that can operate over continuous action spaces.
DQN is able to learn value functions in a stable and robust way for two reasons:
1. the network is trained off-policy with samples from a replay buffer, to minimize correlations between samples.
2. the network is trained with a target Q network, to give consistent targets during temporal-difference backups.
DDPG makes use of these ideas, along with batch normalization, using low-dimensional observations to achieve high accuracy.
BACKGROUND:
The goal of RL is to learn a policy which maximizes the expected return from the start distribution.
The action-value function describes the expected return after taking an action a in state s and thereafter following policy π.
The action-value function satisfies a recursive relationship known as the Bellman equation: Q^π(s_t, a_t) = E[r(s_t, a_t) + γ E_{a~π}[Q^π(s_{t+1}, a)]].
When the target policy is deterministic, a = μ(s), the inner expectation disappears and the remaining expectation depends only on the environment. This means it is possible to learn the action-value function off-policy, using transitions generated by a different behavior policy.
To optimize the function approximator's parameters θ^Q, we minimize the loss
L(θ^Q) = E[(Q(s_t, a_t | θ^Q) − y_t)²],
where y_t = r(s_t, a_t) + γ Q(s_{t+1}, μ(s_{t+1}) | θ^Q).
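As a minimal numeric sketch of the loss above: given one transition, the TD target y is the reward plus the discounted Q-value of the next state-action pair, and the loss is the squared difference between the current Q estimate and y. The concrete Q-values below are made-up placeholders; in DDPG both Q and μ are neural networks.

```python
# Hypothetical single-transition example of the TD target and squared loss.
gamma = 0.99          # discount factor
r = 1.0               # reward r(s_t, a_t)
q_next = 2.0          # stand-in for Q(s_{t+1}, mu(s_{t+1})) from the target network
q_current = 2.5       # stand-in for Q(s_t, a_t) from the learned network

y = r + gamma * q_next            # TD target y_t
loss = (q_current - y) ** 2       # squared TD error, minimized w.r.t. theta^Q
```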
In order to scale Q-learning, two major changes are introduced:
1. use of a replay buffer
2. a separate target network for calculating the targets y_t
These ideas are employed in DDPG.
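The replay buffer from idea 1 can be sketched in a few lines. This is a minimal, generic implementation (the class and method names are my own, not from the paper): a fixed-capacity FIFO store of transitions, sampled uniformly at random.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity):
        # deque with maxlen discards the oldest transition automatically
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation between
        # consecutive transitions from the same episode.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=3)
for i in range(5):
    buf.add(i)          # transitions stood in for by integers here
batch = buf.sample(2)
```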
ALGORITHM:
DDPG uses an actor-critic approach based on the DPG algorithm.
It is essential to learn in minibatches, rather than fully online.
At each timestep the actor and critic are updated by sampling a minibatch uniformly from the replay buffer; this allows the algorithm to benefit from learning across a set of uncorrelated transitions.
Since the network Q being updated is also used in calculating the target value, the Q update is prone to divergence.
The solution is to make a copy of the actor and critic networks and use "soft" target updates, rather than directly copying the weights.
A copy of the actor and critic networks, Q'(s, a | θ^{Q'}) and μ'(s | θ^{μ'}), is created and used to calculate the target values.
The weights of these target networks are then updated by having them slowly track the learned networks: θ' ← τθ + (1 − τ)θ', with τ ≪ 1.
This means the target values are constrained to change slowly, greatly improving the stability of learning.
When learning from low-dimensional feature-vector observations, the different components of the observation may have different physical units (e.g. positions versus velocities), and the ranges may vary across environments.
One solution to this is batch normalization, which scales all features so they are in similar ranges.
It normalizes each dimension across the samples in a minibatch to have zero mean and unit variance. In addition, it maintains a running average of the mean and variance; at test time there is no minibatch of samples, so these running statistics are used for normalization instead of per-batch statistics.
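The mechanics above can be sketched directly (function names are my own; this omits batch norm's learnable scale and shift parameters for brevity): during training each dimension is normalized with the minibatch statistics while exponential moving averages are accumulated, and at test time those averages are used instead.

```python
import numpy as np

def batchnorm_train(x, running_mean, running_var, momentum=0.99, eps=1e-5):
    """Normalize a minibatch x of shape (N, D) per dimension; update running stats."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance per dim
    # Exponential moving averages of the minibatch statistics.
    running_mean = momentum * running_mean + (1 - momentum) * mean
    running_var = momentum * running_var + (1 - momentum) * var
    return x_hat, running_mean, running_var

def batchnorm_test(x, running_mean, running_var, eps=1e-5):
    # At test time, normalize with the running averages gathered in training.
    return (x - running_mean) / np.sqrt(running_var + eps)

x = np.array([[1.0, 2.0],
              [3.0, 4.0]])
x_hat, rm, rv = batchnorm_train(x, np.zeros(2), np.ones(2))
```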
One major challenge of learning in continuous action spaces is exploration. An advantage of off-policy algorithms such as DDPG is that the problem of exploration can be treated independently from the learning algorithm. An exploration policy μ' can be constructed by adding noise sampled from a noise process N to the actor policy:
μ'(s_t) = μ(s_t | θ^μ) + N.
The noise process can be chosen to suit the environment; an Ornstein-Uhlenbeck process is widely used, since it generates temporally correlated noise well suited to physical control problems with inertia.
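A minimal sketch of a 1-D Ornstein-Uhlenbeck noise process (the parameter values are illustrative defaults, not prescribed by the source): each sample drifts back toward the mean at rate theta while Gaussian diffusion of scale sigma is added, producing smooth, temporally correlated noise rather than independent jitter.

```python
import numpy as np

def ou_noise(n_steps, theta=0.15, sigma=0.2, mu=0.0, dt=1e-2, x0=0.0, seed=0):
    """Sample a 1-D Ornstein-Uhlenbeck path of length n_steps."""
    rng = np.random.default_rng(seed)
    x = np.empty(n_steps)
    x_prev = x0
    for t in range(n_steps):
        # Mean-reverting drift toward mu plus Gaussian diffusion.
        x_prev = (x_prev
                  + theta * (mu - x_prev) * dt
                  + sigma * np.sqrt(dt) * rng.standard_normal())
        x[t] = x_prev
    return x

# Consecutive samples are correlated: each value is a small perturbation of
# the previous one, unlike i.i.d. Gaussian noise.
noise = ou_noise(100)
```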