Thanks to J. Peters et al. for their great work, A Survey on Policy Search for Robotics.
Now let’s discuss the different policy update strategies used in policy search. Typical policy update methods for model-free policy search include policy gradient methods, expectation-maximization-based methods, information-theoretic methods, and methods derived from path integral theory.
Policy gradient methods use gradient ascent to maximize the expected return $J_\theta$:
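$$\theta_{k+1} = \theta_k + \alpha \nabla_\theta J_\theta, \qquad J_\theta = \mathbb{E}_{p_\theta(\tau)}\left[R(\tau)\right],$$

where $\alpha$ is the learning rate (the update is shown here in its standard form).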
Finite Difference Methods
The finite difference method is among the simplest ways of obtaining the policy gradient and is typically used with an episode-based evaluation strategy and exploration in parameter space. It estimates the gradient by applying small perturbations $\delta\theta^{[i]}$ to the parameter vector $\theta_k$. We may either perturb each parameter separately or use a probability distribution with small variance to create the perturbations.
The gradient $\nabla^{FD}_\theta J_\theta$ can be obtained by using a first-order Taylor expansion of $J_\theta$ and solving for the gradient in a least-squares sense:
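$$\nabla^{FD}_\theta J_\theta = \left(\Delta\Theta^T \Delta\Theta\right)^{-1} \Delta\Theta^T \Delta\hat{J},$$

where the rows of $\Delta\Theta$ are the perturbations $\delta\theta^{[i]}$ and $\Delta\hat{J}^{[i]} = J(\theta_k + \delta\theta^{[i]}) - J(\theta_k)$. A minimal NumPy sketch of this estimator (the function name and sampling scheme are illustrative, assuming `J` returns a Monte-Carlo estimate of the expected return for a given parameter vector):

```python
import numpy as np

def fd_policy_gradient(J, theta, n_perturb=50, sigma=0.01):
    """Finite-difference policy gradient estimate."""
    # Draw small Gaussian perturbations delta_theta[i] in parameter space
    d_theta = sigma * np.random.randn(n_perturb, theta.shape[0])
    # Return differences relative to the unperturbed parameters theta_k
    J_ref = J(theta)
    d_J = np.array([J(theta + dt) for dt in d_theta]) - J_ref
    # Solve d_theta @ grad ~= d_J in a least-squares sense
    grad, *_ = np.linalg.lstsq(d_theta, d_J, rcond=None)
    return grad
```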
Likelihood-ratio Policy Gradients
Likelihood-ratio methods make use of the so-called likelihood-ratio trick, given by the identity $\nabla_\theta p_\theta(y) = p_\theta(y) \nabla_\theta \log p_\theta(y)$. Applying it to the expected return yields:
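$$\nabla_\theta J_\theta = \nabla_\theta \int p_\theta(\tau) R(\tau)\, d\tau = \mathbb{E}_{p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau)\, R(\tau)\right],$$

which can be estimated from sampled trajectories without differentiating through the system dynamics.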
Due to the inherently noisy Monte-Carlo estimates, the resulting gradient estimates suffer from a large variance. The variance can be reduced by subtracting a baseline $b$:
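$$\nabla^{LR}_\theta J_\theta = \mathbb{E}_{p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau)\,\big(R(\tau) - b\big)\right].$$

The baseline does not bias the estimate, since $\mathbb{E}_{p_\theta}\left[\nabla_\theta \log p_\theta(\tau)\, b\right] = b\, \nabla_\theta \int p_\theta(\tau)\, d\tau = 0$.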
Step-based likelihood-ratio methods
Step-based algorithms exploit the structure of the trajectory distribution:
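$$\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(u_t|x_t, t).$$

The transition-model terms drop out of the score because the transition probabilities do not depend on the policy parameters $\theta$.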
REINFORCE
REINFORCE is one of the first policy gradient algorithms. The REINFORCE policy gradient is given by:
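$$\nabla^{RF}_\theta J_\theta = \mathbb{E}_{p_\theta(\tau)}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(u_t|x_t,t)\,\big(R(\tau) - b\big)\right],$$

where the variance-minimizing baseline can be computed per gradient dimension $h$ as $b_h = \mathbb{E}\big[\big(\sum_t \partial_{\theta_h} \log \pi_\theta\big)^2 R(\tau)\big] \,/\, \mathbb{E}\big[\big(\sum_t \partial_{\theta_h} \log \pi_\theta\big)^2\big]$. A minimal sketch of the resulting Monte-Carlo estimator (function and argument names are illustrative):

```python
import numpy as np

def reinforce_gradient(scores, returns):
    """REINFORCE gradient estimate from N sampled episodes.

    scores:  (N, d) array, per-episode summed score
             sum_t grad_theta log pi_theta(u_t | x_t, t).
    returns: (N,) array of episode returns R(tau).
    """
    # Variance-minimizing baseline, one value per gradient dimension
    s2 = scores ** 2
    b = (s2 * returns[:, None]).mean(axis=0) / (s2.mean(axis=0) + 1e-12)
    # Monte-Carlo average of the likelihood-ratio gradient
    return (scores * (returns[:, None] - b)).mean(axis=0)
```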
G(PO)MDP
Despite using the step-based policy evaluation strategy, REINFORCE uses the return $R(\tau) = r_T(x_T) + \sum_{t=0}^{T-1} r_t(x_t, u_t)$ of the whole episode to evaluate single actions. Note that rewards from the past do not depend on actions in the future, and hence $\mathbb{E}_{p_\theta}\left[\partial_\theta \log \pi_\theta(u_t|x_t,t)\, r_j\right] = 0$ for $j < t$. The policy gradient of G(PO)MDP is therefore given by:
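$$\nabla^{GPOMDP}_\theta J_\theta = \mathbb{E}_{p_\theta(\tau)}\left[\sum_{j=0}^{T-1}\left(\sum_{t=0}^{j} \nabla_\theta \log \pi_\theta(u_t|x_t,t)\right)\big(r_j - b_j\big)\right],$$

with time-dependent baselines $b_j$ (the terminal reward $r_T(x_T)$, which depends on all actions, is paired with the full sum of score terms).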
Policy Gradient Theorem Algorithm
Instead of the full return, we can use the expected reward to come from time step $t$ onwards, i.e., the state-action value function $Q^\pi_t(x_t, u_t)$:
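$$\nabla^{PGT}_\theta J_\theta = \mathbb{E}_{p_\theta(\tau)}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(u_t|x_t,t)\left(Q^\pi_t(x_t,u_t) - b_t(x_t)\right)\right],$$

where $Q^\pi_t$ can be estimated from Monte-Carlo rollouts or, in the actor-critic setting, with a learned value function.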
Episode-based likelihood-ratio methods
Episode-based likelihood-ratio methods directly update the upper-level policy $\pi_\omega(\theta)$ for choosing the parameters $\theta$ of the lower-level policy $\pi_\theta(u_t|x_t,t)$. Refer to here for more details about upper-level policies.
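The corresponding likelihood-ratio gradient of the upper-level policy is

$$\nabla_\omega J_\omega = \mathbb{E}_{\pi_\omega(\theta)}\left[\nabla_\omega \log \pi_\omega(\theta)\,\big(R(\theta) - b\big)\right],$$

where $R(\theta)$ denotes the expected return obtained when executing the lower-level policy with parameters $\theta$.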
Natural Gradients
Natural gradients often achieve faster convergence than traditional gradients. The traditional gradient uses the Euclidean metric $\delta\theta^T \delta\theta$ to determine the direction of the update $\delta\theta$, i.e., it assumes that all parameter dimensions have similarly strong effects on the resulting distribution. However, small changes in $\theta$ might result in large changes of the resulting distribution $p_\theta(y)$. To achieve stable learning, it is desirable to enforce that the distribution $p_\theta(y)$ does not change too much in one update step. This is the key intuition behind the natural gradient, which limits the distance between the distributions $p_\theta(y)$ and $p_{\theta+\delta\theta}(y)$.
The Kullback-Leibler (KL) divergence is used to measure the distance between $p_\theta(y)$ and $p_{\theta+\delta\theta}(y)$. The Fisher information matrix can be used to approximate the KL divergence for sufficiently small $\delta\theta$. Refer to here and here for more details if you are interested in the KL divergence and the Fisher information matrix.
The Fisher information matrix is defined as:
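$$F_\theta = \mathbb{E}_{p_\theta(y)}\left[\nabla_\theta \log p_\theta(y)\, \nabla_\theta \log p_\theta(y)^T\right].$$

For small $\delta\theta$, $\mathrm{KL}\big(p_\theta \,\|\, p_{\theta+\delta\theta}\big) \approx \tfrac{1}{2}\, \delta\theta^T F_\theta\, \delta\theta$, and limiting this distance in each update step leads to the natural gradient

$$\nabla^{NG}_\theta J_\theta = F_\theta^{-1} \nabla_\theta J_\theta.$$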
Step-based Natural Gradient Methods
The Fisher information matrix of the trajectory distribution can be written as the average of the Fisher information matrices for the individual time steps:
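$$F_\theta = \mathbb{E}_{p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)^T\right] = \mathbb{E}_{p_\theta(\tau)}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(u_t|x_t,t)\, \nabla_\theta \log \pi_\theta(u_t|x_t,t)^T\right],$$

where the cross terms between different time steps vanish in expectation.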
Let $\tilde{A}_w(x_t,u_t,t) = \psi_t(x_t,u_t)^T w \approx Q^\pi_t(x_t,u_t) - b_t(x_t)$ be a function approximation of the advantage. A good function approximation does not change the gradient in expectation, i.e., it does not introduce a bias. Using $\psi_t(x_t,u_t) = \nabla_\theta \log \pi_\theta(u_t|x_t,t)$ as basis functions is called compatible function approximation, as the function approximation is compatible with the policy parameterization. The policy gradient using the compatible function approximation can then be written as:
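$$\nabla_\theta J_\theta = \mathbb{E}_{p_\theta(\tau)}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(u_t|x_t,t)\, \psi_t(x_t,u_t)^T\right] w = G_\theta\, w,$$

where $G_\theta$ denotes the expectation term.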
Note that $G_\theta = F_\theta$. Hence, the step-based natural gradient simplifies to:
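$$\nabla^{NG}_\theta J_\theta = F_\theta^{-1} G_\theta\, w = w,$$

i.e., the natural gradient is given directly by the weights of the compatible function approximation.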
Episodic Natural Actor Critic
While the advantage function is easy to learn, as its basis functions $\psi_t(x_t,u_t)$ are given by the compatible function approximation, appropriate basis functions for the value function are more difficult to specify. We would therefore like algorithms that avoid estimating a value function. One such algorithm is the episodic Natural Actor Critic (eNAC), where the estimation of the value function $V_t$ can be avoided by considering whole sample paths.
For simplicity, we omit some internal steps and show the final outcome directly. Rewriting the Bellman equation:
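$$Q^\pi_t(x_t,u_t) = \tilde{A}_w(x_t,u_t,t) + V_t(x_t) = r_t(x_t,u_t) + \mathbb{E}\left[V_{t+1}(x_{t+1})\right].$$

Summing this equation along a sample path $\tau^{[i]}$, the intermediate value functions telescope out, leaving only the value of the start state:

$$\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta\big(u_t^{[i]}\big|x_t^{[i]},t\big)^T w + V_0\big(x_0^{[i]}\big) = R\big(\tau^{[i]}\big).$$

Treating $V_0(x_0^{[i]})$ as a single scalar offset $J_0$ (assuming a fixed start-state distribution), $w$ and $J_0$ can be obtained by linear regression over the sampled paths, and the natural gradient is again given by $w$.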
Natural Actor Critic
eNAC uses the returns $R^{[i]}$ for evaluating the policy and consequently becomes less accurate for long time horizons due to the large variance of the returns. The convergence speed can be improved by directly estimating the value function. To do so, temporal difference methods first have to be adapted to learn the advantage function.
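One way to see the adaptation (a sketch of the NAC-style regression, not a full derivation): the advantage function alone does not admit a Bellman equation, but combining it with the value function gives

$$\tilde{A}_w(x_t,u_t,t) + V_t(x_t) \approx r_t(x_t,u_t) + V_{t+1}(x_{t+1}),$$

so $w$ and the parameters of $V$ can be estimated jointly from single transitions, in the spirit of temporal difference learning.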
Episode-based Natural Policy Gradients
The beneficial properties of the natural gradient can also be exploited for episode-based algorithms. Such methods come from the area of evolutionary algorithms. They perform gradient ascent on a fitness function, which in the reinforcement learning context is the expected long-term reward $J_\omega$ of the upper-level policy:
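$$J_\omega = \int \pi_\omega(\theta)\, R(\theta)\, d\theta, \qquad \nabla^{NG}_\omega J_\omega = F_\omega^{-1} \nabla_\omega J_\omega,$$

where $F_\omega$ is the Fisher information matrix of the upper-level policy $\pi_\omega(\theta)$. Natural Evolution Strategies (NES) is a well-known instance of this scheme.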