DRL paper -- DDPG

                                       DDPG - Deep Deterministic Policy Gradient

INTRODUCTION:

DDPG is an actor-critic, model-free, off-policy algorithm based on the deterministic policy gradient (DPG) that can operate over continuous action spaces.

DQN is able to learn value functions in a stable and robust way for two reasons:

1. The network is trained off-policy with samples from a replay buffer, which minimizes correlations between samples.

2. The network is trained with a target Q network to give consistent targets during temporal-difference backups.

DDPG makes use of these ideas, along with batch normalization, allowing it to learn effectively from low-dimensional observations.

BACKGROUND:

The goal of RL is to learn a policy which maximizes the expected return from the start distribution.

The action-value function describes the expected return after taking an action a_t in state s_t and thereafter following policy π.

The action-value function satisfies a recursive relationship known as the Bellman equation.

When the target policy is deterministic, the inner expectation over actions disappears and the remaining expectation depends only on the environment. This means that it is possible to learn the action-value function off-policy.

To optimize the function approximator's parameters θ^Q, we minimize the loss

                                                                                    L(\theta^{Q}) = \mathbb{E}\left[\left(Q(s_{t}, a_{t} \mid \theta^{Q}) - y_{t}\right)^{2}\right]

where                                                                           y_{t} = r(s_{t}, a_{t}) + \gamma\, \mathbb{E}\left[Q^{\pi}(s_{t+1}, a_{t+1})\right]
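
As a concrete illustration, here is a minimal PyTorch-style sketch of this loss on a sampled minibatch. The network and tensor names are placeholders, and `target_critic`/`target_actor` refer to the separate target networks discussed just below; this is a sketch under those assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def critic_loss(critic, target_critic, target_actor, batch, gamma=0.99):
    """Estimate L(theta_Q) = E[(Q(s_t, a_t | theta_Q) - y_t)^2] on a minibatch."""
    states, actions, rewards, next_states, dones = batch  # assumed minibatch layout

    with torch.no_grad():
        # y_t = r(s_t, a_t) + gamma * Q(s_{t+1}, a_{t+1}), with a_{t+1} given by the
        # (target) actor; separate target networks keep the targets consistent.
        next_actions = target_actor(next_states)
        y = rewards + gamma * (1.0 - dones) * target_critic(next_states, next_actions)

    q = critic(states, actions)
    return F.mse_loss(q, y)
```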

In order to scale Q-learning, two major changes are introduced:

1. use of a replay buffer

2. a separate target network for calculating y_{t}

These ideas are employed in DDPG.
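
For reference, a minimal sketch of a uniform replay buffer; the class and method names and the transition layout are assumptions for illustration.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of transitions, sampled uniformly to decorrelate updates."""

    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        # Uniform sampling breaks the temporal correlation between consecutive transitions.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```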

 

ALGORITHM:

DDPG uses an actor-critic approach based on the DPG algorithm.

It is essential to learn in minibatches, rather than online.

At each timestep the actor and critic are updated by sampling a minibatch uniformly from the buffer, which allows the algorithm to benefit from learning across a set of uncorrelated transitions.
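
For the actor, DPG adjusts θ^μ in the direction that increases the critic's value Q(s, μ(s)). A minimal PyTorch-style sketch of one such update on a sampled minibatch of states follows; the network and optimizer objects are placeholders, and this is only one common way to implement the step.

```python
def actor_update(actor, critic, actor_optimizer, states):
    """One actor step: follow the deterministic policy gradient by increasing Q(s, mu(s))."""
    # Maximizing Q(s, mu(s | theta_mu)) is implemented as minimizing its negative mean.
    actor_loss = -critic(states, actor(states)).mean()

    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()
    return actor_loss.item()
```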

Since the network Q being updated is also used in calculating the target value, the Q update is prone to divergence.

The solution is to make copies of the actor and critic networks and use "soft" target updates, rather than directly copying the weights.

A copy of the critic and actor networks, Q'(s, a \mid \theta^{Q'}) and \mu'(s \mid \theta^{\mu'}), is created and used to calculate the target values y_{t}.

The weights of these target networks are then updated by having them slowly track the learned networks:

                                                                                 \theta' \leftarrow \tau\theta + (1 - \tau)\theta', \quad \tau \ll 1

This means the target values are constrained to change slowly, greatly improving the stability of learning.
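
A minimal sketch of this soft update for one network pair, assuming PyTorch modules whose parameters come out in matching order; the same call would be made for both the target critic and the target actor after each learning step.

```python
import torch

@torch.no_grad()
def soft_update(target_net, source_net, tau=0.001):
    """theta' <- tau * theta + (1 - tau) * theta', applied to every parameter pair."""
    for target_param, param in zip(target_net.parameters(), source_net.parameters()):
        target_param.mul_(1.0 - tau).add_(tau * param)
```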

 

When learning from low-dimensional feature-vector observations, the different components of the observation may have different physical units, and their ranges may vary across environments.

One solution to this is batch normalization, which scales all features so they are in similar ranges.

It normalizes each dimension across the samples in a minibatch to have zero mean and unit variance. In addition, it maintains running averages of the mean and variance during training, and these stored statistics are used in place of the minibatch statistics at test time.
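
As an illustration, a minimal PyTorch-style actor that applies batch normalization to the state input and hidden layers; the layer sizes and action bound are assumptions for the sketch rather than requirements.

```python
import torch.nn as nn

class Actor(nn.Module):
    """Actor network with batch normalization on the state features."""

    def __init__(self, state_dim, action_dim, action_bound=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.BatchNorm1d(state_dim),   # rescales raw observations with differing units/ranges
            nn.Linear(state_dim, 400),
            nn.BatchNorm1d(400),
            nn.ReLU(),
            nn.Linear(400, 300),
            nn.BatchNorm1d(300),
            nn.ReLU(),
            nn.Linear(300, action_dim),
            nn.Tanh(),                   # bounds actions to [-1, 1] before scaling
        )
        self.action_bound = action_bound

    def forward(self, state):
        # In train() mode BatchNorm uses minibatch statistics and updates its running
        # averages; in eval() mode those stored running averages are used instead.
        return self.action_bound * self.net(state)
```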

 

One major challenge of learning in continuous action spaces is exploration. An advantage of off-policy algorithms such as DDPG is that the problem of exploration can be treated independently of the learning algorithm. An exploration policy \mu' can be constructed by adding noise sampled from a noise process \mathcal{N} to the actor policy:

                                                                                       \mu'(s_{t}) = \mu(s_{t} \mid \theta_{t}^{\mu}) + \mathcal{N},

The noise process can be chosen to suit the environment; the Ornstein-Uhlenbeck process is widely used because it produces temporally correlated noise, which suits physical control problems with inertia.
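
A minimal sketch of an Ornstein-Uhlenbeck noise process; the parameter values are common defaults, not values prescribed by the text.

```python
import numpy as np

class OrnsteinUhlenbeckNoise:
    """Temporally correlated noise for exploration in continuous action spaces."""

    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2):
        self.mu = mu * np.ones(action_dim)
        self.theta = theta
        self.sigma = sigma
        self.dt = dt
        self.reset()

    def reset(self):
        self.state = np.copy(self.mu)

    def sample(self):
        # dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)
        dx = (self.theta * (self.mu - self.state) * self.dt
              + self.sigma * np.sqrt(self.dt) * np.random.standard_normal(self.mu.shape))
        self.state = self.state + dx
        return self.state
```

At acting time the behaviour policy is then `actor(state) + noise.sample()`, clipped to the valid action range.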

 
