Thanks to J. Peters et al. for their great work, A Survey on Policy Search for Robotics.
Now let’s discuss the different policy update strategies used in policy search. Typical policy update methods for model-free policy search include policy gradient methods, expectation-maximization-based methods, information-theoretic methods, and methods derived from path integral theory.
Policy gradient methods use gradient ascent to maximize the expected return $J_\theta$:
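$$\theta_{k+1} = \theta_k + \alpha \nabla_\theta J_\theta, \qquad J_\theta = \mathbb{E}_{p_\theta(\tau)}\left[R(\tau)\right],$$

where $\alpha$ is the learning rate (the update is shown here in its standard form).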
Finite Difference Methods
The finite difference method is among the simplest ways of obtaining the policy gradient and is typically used with an episode-based evaluation strategy and exploration in parameter space. It estimates the gradient by applying small perturbations $\delta\theta^{[i]}$ to the parameter vector $\theta_k$. We may either perturb each parameter separately or use a probability distribution with small variance to create the perturbations.
The gradient $\nabla^{FD}_\theta J_\theta$ can be obtained by using a first-order Taylor expansion of $J_\theta$ and solving for the gradient in a least-squares sense:
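$$\nabla^{FD}_\theta J_\theta = \left(\Delta\Theta^T \Delta\Theta\right)^{-1} \Delta\Theta^T \Delta\hat{J},$$

where the rows of $\Delta\Theta$ are the perturbations $\delta\theta^{[i]}$ and $\Delta\hat{J}^{[i]} = J(\theta_k + \delta\theta^{[i]}) - J(\theta_k)$. A minimal NumPy sketch of this estimator (the function name and sampling scheme are illustrative, assuming `J` returns a Monte-Carlo estimate of the expected return for a given parameter vector):

```python
import numpy as np

def fd_policy_gradient(J, theta, n_perturb=50, sigma=0.01):
    """Finite-difference policy gradient estimate."""
    # Draw small Gaussian perturbations delta_theta[i] in parameter space
    d_theta = sigma * np.random.randn(n_perturb, theta.shape[0])
    # Return differences relative to the unperturbed parameters theta_k
    J_ref = J(theta)
    d_J = np.array([J(theta + dt) for dt in d_theta]) - J_ref
    # Solve d_theta @ grad ~= d_J in a least-squares sense
    grad, *_ = np.linalg.lstsq(d_theta, d_J, rcond=None)
    return grad
```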
Likelihood-ratio Policy Gradients
Likelihood-ratio methods make use of the so-called likelihood-ratio trick, given by the identity $\nabla_\theta p_\theta(y) = p_\theta(y) \nabla_\theta \log p_\theta(y)$. Applying it to the expected return yields:
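$$\nabla_\theta J_\theta = \nabla_\theta \int p_\theta(\tau) R(\tau)\, d\tau = \mathbb{E}_{p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau)\, R(\tau)\right],$$

which can be estimated from sampled trajectories without differentiating through the system dynamics.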
Due to the inherently noisy Monte-Carlo estimates, the resulting gradient estimates suffer from a large variance. The variance can be reduced by subtracting a baseline $b$:
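$$\nabla^{LR}_\theta J_\theta = \mathbb{E}_{p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau)\,\big(R(\tau) - b\big)\right].$$

The baseline does not bias the estimate, since $\mathbb{E}_{p_\theta}\left[\nabla_\theta \log p_\theta(\tau)\, b\right] = b\, \nabla_\theta \int p_\theta(\tau)\, d\tau = 0$.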
Step-based likelihood-ratio methods
Step-based algorithms exploit the structure of the trajectory distribution:
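$$\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(u_t|x_t, t).$$

The transition-model terms drop out of the score because the transition probabilities do not depend on the policy parameters $\theta$.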
REINFORCE
REINFORCE is one of the first policy gradient algorithms. The REINFORCE policy gradient is given by:
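$$\nabla^{RF}_\theta J_\theta = \mathbb{E}_{p_\theta(\tau)}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(u_t|x_t,t)\,\big(R(\tau) - b\big)\right],$$

where the variance-minimizing baseline can be computed per gradient dimension $h$ as $b_h = \mathbb{E}\big[\big(\sum_t \partial_{\theta_h} \log \pi_\theta\big)^2 R(\tau)\big] \,/\, \mathbb{E}\big[\big(\sum_t \partial_{\theta_h} \log \pi_\theta\big)^2\big]$. A minimal sketch of the resulting Monte-Carlo estimator (function and argument names are illustrative):

```python
import numpy as np

def reinforce_gradient(scores, returns):
    """REINFORCE gradient estimate from N sampled episodes.

    scores:  (N, d) array, per-episode summed score
             sum_t grad_theta log pi_theta(u_t | x_t, t).
    returns: (N,) array of episode returns R(tau).
    """
    # Variance-minimizing baseline, one value per gradient dimension
    s2 = scores ** 2
    b = (s2 * returns[:, None]).mean(axis=0) / (s2.mean(axis=0) + 1e-12)
    # Monte-Carlo average of the likelihood-ratio gradient
    return (scores * (returns[:, None] - b)).mean(axis=0)
```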
G(PO)MDP
Despite using the step-based policy evaluation strategy, REINFORCE uses the return $R(\tau) = r_T(x_T) + \sum_{t=0}^{T-1} r_t(x_t, u_t)$ of the whole episode to evaluate single actions. Note that rewards from the past do not depend on actions in the future, and hence $\mathbb{E}_{p_\theta}\left[\partial_\theta \log \pi_\theta(u_t|x_t,t)\, r_j\right] = 0$ for $j < t$. The policy gradient of G(PO)MDP is therefore given by:
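$$\nabla^{GPOMDP}_\theta J_\theta = \mathbb{E}_{p_\theta(\tau)}\left[\sum_{j=0}^{T-1}\left(\sum_{t=0}^{j} \nabla_\theta \log \pi_\theta(u_t|x_t,t)\right)\big(r_j - b_j\big)\right],$$

with time-dependent baselines $b_j$ (the terminal reward $r_T(x_T)$, which depends on all actions, is paired with the full sum of score terms).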
Policy Gradient Theorem Algorithm
Instead of the full return, we can use the expected reward to come from time step $t$ onwards, i.e., the state-action value function $Q^\pi_t(x_t, u_t)$:
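$$\nabla^{PGT}_\theta J_\theta = \mathbb{E}_{p_\theta(\tau)}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(u_t|x_t,t)\left(Q^\pi_t(x_t,u_t) - b_t(x_t)\right)\right],$$

where $Q^\pi_t$ can be estimated from Monte-Carlo rollouts or, in the actor-critic setting, with a learned value function.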
Episode-based likelihood-ratio methods
Episode-based likelihood-ratio methods directly update the upper-level policy $\pi_\omega(\theta)$ for choosing the parameters $\theta$ of the lower-level policy $\pi_\theta(u_t|x_t,t)$. Refer to here for more details about upper-level policies.
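The corresponding likelihood-ratio gradient of the upper-level policy is

$$\nabla_\omega J_\omega = \mathbb{E}_{\pi_\omega(\theta)}\left[\nabla_\omega \log \pi_\omega(\theta)\,\big(R(\theta) - b\big)\right],$$

where $R(\theta)$ denotes the expected return obtained when executing the lower-level policy with parameters $\theta$.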
Natural Gradients
Natural gradients often achieve faster convergence than traditional gradients. The traditional gradient uses the Euclidean metric $\delta\theta^T \delta\theta$ to determine the direction of the update $\delta\theta$, i.e., it assumes that all parameter dimensions have similarly strong effects on the resulting distribution. However, small changes in $\theta$ might result in large changes of the resulting distribution $p_\theta(y)$. To achieve stable learning, it is desirable to enforce that the distribution $p_\theta(y)$ does not change too much in one update step. This is the key intuition behind the natural gradient, which limits the distance between the distributions $p_\theta(y)$ and $p_{\theta+\delta\theta}(y)$.
The Kullback-Leibler (KL) divergence is used to measure the distance between $p_\theta(y)$ and $p_{\theta+\delta\theta}(y)$. The Fisher information matrix can be used to approximate the KL divergence for sufficiently small $\delta\theta$. Refer to here and here for more details if you are interested in the KL divergence and the Fisher information matrix.
The Fisher information matrix is defined as:
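$$F_\theta = \mathbb{E}_{p_\theta(y)}\left[\nabla_\theta \log p_\theta(y)\, \nabla_\theta \log p_\theta(y)^T\right].$$

For small $\delta\theta$, $\mathrm{KL}\big(p_\theta \,\|\, p_{\theta+\delta\theta}\big) \approx \tfrac{1}{2}\, \delta\theta^T F_\theta\, \delta\theta$, and limiting this distance in each update step leads to the natural gradient

$$\nabla^{NG}_\theta J_\theta = F_\theta^{-1} \nabla_\theta J_\theta.$$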
Step-based Natural Gradient Methods
The Fisher information matrix of the trajectory distribution can be written as the average of the Fisher information matrices for the individual time steps:
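$$F_\theta = \mathbb{E}_{p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)^T\right] = \mathbb{E}_{p_\theta(\tau)}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(u_t|x_t,t)\, \nabla_\theta \log \pi_\theta(u_t|x_t,t)^T\right],$$

where the cross terms between different time steps vanish in expectation.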
Let $\tilde{A}_w(x_t,u_t,t) = \psi_t(x_t,u_t)^T w \approx Q^\pi_t(x_t,u_t) - b_t(x_t)$ be a function approximation of the advantage. A good function approximation does not change the gradient in expectation, i.e., it does not introduce a bias. Using $\psi_t(x_t,u_t) = \nabla_\theta \log \pi_\theta(u_t|x_t,t)$ as basis functions is called compatible function approximation, as the function approximation is compatible with the policy parameterization. The policy gradient using the compatible function approximation can then be written as:
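$$\nabla_\theta J_\theta = \mathbb{E}_{p_\theta(\tau)}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(u_t|x_t,t)\, \psi_t(x_t,u_t)^T\right] w = G_\theta\, w,$$

where $G_\theta$ denotes the expectation term.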
Note that $G_\theta = F_\theta$. Hence, the step-based natural gradient simplifies to:
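$$\nabla^{NG}_\theta J_\theta = F_\theta^{-1} G_\theta\, w = w,$$

i.e., the natural gradient is given directly by the weights of the compatible function approximation.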
Episodic Natural Actor Critic
While the advantage function is easy to learn, as its basis functions $\psi_t(x_t,u_t)$ are given by the compatible function approximation, appropriate basis functions for the value function are more difficult to specify. We would therefore like algorithms that avoid estimating a value function. One such algorithm is the episodic Natural Actor Critic (eNAC), where the estimation of the value function $V_t$ can be avoided by considering whole sample paths.
For simplicity, we omit some internal steps and show the final outcome directly. Rewriting the Bellman equation:
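$$Q^\pi_t(x_t,u_t) = \tilde{A}_w(x_t,u_t,t) + V_t(x_t) = r_t(x_t,u_t) + \mathbb{E}\left[V_{t+1}(x_{t+1})\right].$$

Summing this equation along a sample path $\tau^{[i]}$, the intermediate value functions telescope out, leaving only the value of the start state:

$$\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta\big(u_t^{[i]}\big|x_t^{[i]},t\big)^T w + V_0\big(x_0^{[i]}\big) = R\big(\tau^{[i]}\big).$$

Treating $V_0(x_0^{[i]})$ as a single scalar offset $J_0$ (assuming a fixed start-state distribution), $w$ and $J_0$ can be obtained by linear regression over the sampled paths, and the natural gradient is again given by $w$.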
Natural Actor Critic
eNAC uses the returns $R^{[i]}$ for evaluating the policy and consequently becomes less accurate for long time horizons due to the large variance of the returns. The convergence speed can be improved by directly estimating the value function. To do so, temporal difference methods first have to be adapted to learn the advantage function.
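One way to see the adaptation (a sketch of the NAC-style regression, not a full derivation): the advantage function alone does not admit a Bellman equation, but combining it with the value function gives

$$\tilde{A}_w(x_t,u_t,t) + V_t(x_t) \approx r_t(x_t,u_t) + V_{t+1}(x_{t+1}),$$

so $w$ and the parameters of $V$ can be estimated jointly from single transitions, in the spirit of temporal difference learning.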
Episode-based Natural Policy Gradients
The beneficial properties of the natural gradient can also be exploited for episode-based algorithms. Such methods come from the area of evolutionary algorithms. They perform gradient ascent on a fitness function, which in the reinforcement learning context is the expected long-term reward $J_\omega$ of the upper-level policy:
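$$J_\omega = \int \pi_\omega(\theta)\, R(\theta)\, d\theta, \qquad \nabla^{NG}_\omega J_\omega = F_\omega^{-1} \nabla_\omega J_\omega,$$

where $F_\omega$ is the Fisher information matrix of the upper-level policy $\pi_\omega(\theta)$. Natural Evolution Strategies (NES) is a well-known instance of this scheme.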