Chapter 13: Policy Gradient Methods

1 Introduction

All previous methods in this book are action-value methods: the policy is derived from learned action values. In policy gradient methods, the policy itself is parameterized as a function
$$\pi(a\mid s,\theta)=\Pr\{A_t=a \mid S_t=s,\ \theta_t=\theta\}.$$
The parameterization should be (1) differentiable with respect to $\theta$, and (2) never fully deterministic, so that exploration is guaranteed.

If the action space is discrete and not too large, then a natural and common kind of parameterization is to form parameterized numerical preferences $h(s,a,\theta)\in\mathbb{R}$ for each state–action pair; a softmax over these preferences gives the action probabilities. Alternatively, an ANN can be used to parameterize the policy directly.

The objective is to maximize some performance measure $J(\theta)$:
$$\theta_{t+1}=\theta_t+\alpha\,\nabla J(\theta_t)$$

2 Advantages

  1. The approximate policy can approach a deterministic policy, whereas with $\varepsilon$-greedy action selection over action values there is always an $\varepsilon$ probability of selecting a random action.
  2. It enables the selection of actions with arbitrary probabilities.
  3. The policy may be a simpler function to approximate.
  4. The choice of policy parameterization is sometimes a good way of injecting prior knowledge about the desired form of the policy into the reinforcement learning system.
  5. Theoretical advantage: with continuous policy parameterization the action probabilities change smoothly as a function of the learned parameter, whereas with $\varepsilon$-greedy selection the action probabilities may change dramatically for an arbitrarily small change in the estimated action values.

3 Policy approximation

3.1 Parameterized preferences (softmax)

If the action space is discrete and not too large, then a natural and common kind of parameterization is to form parameterized numerical preferences $h(s,a,\theta)\in\mathbb{R}$ for each state–action pair.
$$\pi(a\mid s,\theta)=\frac{e^{h(s,a,\theta)}}{\sum_b e^{h(s,b,\theta)}} \tag{13.2}$$

The action preferences themselves can be parameterized arbitrarily: $h(s,a,\theta)$ can be computed by an ANN, or be linear in features, $h(s,a,\theta)=\theta^\top x(s,a)$.
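As a minimal sketch of (13.2) with linear preferences, assuming a user-supplied feature function `x(state, action)` that returns a vector (the names here are illustrative, not from the book):

```python
import numpy as np

def softmax_policy_probs(theta, x, state, actions):
    """pi(a|s,theta) from a softmax over linear preferences h(s,a,theta) = theta . x(s,a)."""
    prefs = np.array([theta @ x(state, a) for a in actions])
    prefs -= prefs.max()                      # shift for numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def sample_action(theta, x, state, actions, rng=None):
    """Sample an action from the softmax policy."""
    rng = rng or np.random.default_rng()
    return rng.choice(actions, p=softmax_policy_probs(theta, x, state, actions))
```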

With linear preferences, the eligibility vector (the gradient of the log-policy) works out to

$$\begin{aligned}
\nabla_\theta \ln\pi(a\mid s,\theta)
&=\nabla_\theta \ln\frac{e^{h(s,a,\theta)}}{\sum_b e^{h(s,b,\theta)}}\\
&=\nabla_\theta\Big(\theta^\top x(s,a)-\ln\sum_b e^{\theta^\top x(s,b)}\Big)\\
&=x(s,a)-\frac{\sum_b e^{\theta^\top x(s,b)}\,x(s,b)}{\sum_c e^{\theta^\top x(s,c)}}\\
&=x(s,a)-\sum_b \pi(b\mid s,\theta)\,x(s,b).
\end{aligned}$$

(For comparison, the sigmoid $f(x)=\frac{1}{1+e^{-x}}$, i.e. the two-action softmax, has derivative $\frac{df(x)}{dx}=f(x)\big(1-f(x)\big)$.)
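A quick finite-difference check of this identity; the feature matrix and dimensions below are arbitrary, made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, dim = 4, 6
X = rng.normal(size=(n_actions, dim))        # x(s, a) for a fixed state s, one row per action
theta = rng.normal(size=dim)

def log_pi(theta, a):
    """ln pi(a|s, theta) for the softmax over linear preferences."""
    prefs = X @ theta
    return prefs[a] - np.log(np.exp(prefs - prefs.max()).sum()) - prefs.max()

# Analytic eligibility vector: x(s,a) - sum_b pi(b|s) x(s,b)
prefs = X @ theta
pi = np.exp(prefs - prefs.max()); pi /= pi.sum()
a = 2
analytic = X[a] - pi @ X

# Central finite-difference gradient of ln pi(a|s, theta) w.r.t. theta
eps = 1e-6
numeric = np.array([(log_pi(theta + eps * np.eye(dim)[i], a)
                     - log_pi(theta - eps * np.eye(dim)[i], a)) / (2 * eps)
                    for i in range(dim)])
print(np.max(np.abs(analytic - numeric)))    # should be tiny (~1e-9)
```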

3.2 Policy Parameterization for Continuous Actions

Instead of computing learned probabilities for each of the many actions, we instead learn statistics of the probability distribution.
$$p(x)=\frac{1}{\sigma\sqrt{2\pi}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \tag{13.18}$$

Here $p(x)$ is the probability density at $x$, not a probability, so it can exceed 1.
To produce a policy parameterization, the policy can be defined as the normal probability density over a real-valued scalar action, with mean and standard deviation given by parametric function approximators that depend on the state.
$$\pi(a\mid s,\theta)=\frac{1}{\sigma(s,\theta)\sqrt{2\pi}}\exp\!\left(-\frac{\big(a-\mu(s,\theta)\big)^2}{2\sigma(s,\theta)^2}\right)$$
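A minimal sketch of sampling from such a policy, assuming the mean is linear in state features and the standard deviation is the exponential of a linear form (which keeps it positive, as in the book's suggested parameterization); the feature functions `x_mu` and `x_sigma` are placeholders:

```python
import numpy as np

def gaussian_policy_sample(theta_mu, theta_sigma, x_mu, x_sigma, state, rng=None):
    """Sample a real-valued action a ~ N(mu(s,theta), sigma(s,theta)^2).

    mu(s,theta)    = theta_mu . x_mu(s)               (linear in features)
    sigma(s,theta) = exp(theta_sigma . x_sigma(s))    (exponential keeps it positive)
    """
    rng = rng or np.random.default_rng()
    mu = theta_mu @ x_mu(state)
    sigma = np.exp(theta_sigma @ x_sigma(state))
    return rng.normal(mu, sigma), mu, sigma
```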

4 Policy Gradient Theorem

With function approximation it may seem challenging to change the policy parameter in a way that ensures improvement. The problem is that performance depends on both the action selections and the distribution of states in which those selections are made, and that both of these are affected by the policy parameter.
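The policy gradient theorem resolves this: for the episodic case it gives an exact expression for the gradient of performance with respect to the policy parameter that does not involve the derivative of the state distribution,

$$\nabla J(\theta)\ \propto\ \sum_s \mu(s)\sum_a q_\pi(s,a)\,\nabla\pi(a\mid s,\theta),$$

where $\mu$ is the on-policy state distribution under $\pi$.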

4.1 REINFORCE: Monte Carlo Policy Gradient



REINFORCE update without discounting (using the return $G_t$):
$$\theta_{t+1}=\theta_t+\alpha\, G_t\, \nabla\ln\pi(A_t\mid S_t,\theta_t)$$

REINFORCE update with discounting:
$$\theta_{t+1}=\theta_t+\alpha\, \gamma^t\, G_t\, \nabla\ln\pi(A_t\mid S_t,\theta_t)$$
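A hedged sketch of the discounted REINFORCE updates for a softmax-over-linear-preferences policy; the `env.reset()`/`env.step()` interface and the feature function `x` are assumptions for illustration, not part of the book's pseudocode:

```python
import numpy as np

def _softmax_probs(theta, x, state, actions):
    prefs = np.array([theta @ x(state, a) for a in actions])
    prefs -= prefs.max()
    e = np.exp(prefs)
    return e / e.sum()

def reinforce_episode(env, theta, x, actions, alpha=2e-4, gamma=1.0, rng=None):
    """One episode of Monte Carlo REINFORCE with a softmax policy over linear preferences."""
    rng = rng or np.random.default_rng()
    traj = []                                    # (state, action, reward) triples
    s, done = env.reset(), False
    while not done:
        probs = _softmax_probs(theta, x, s, actions)
        a = rng.choice(actions, p=probs)
        s_next, r, done = env.step(a)            # assumed environment interface
        traj.append((s, a, r))
        s = s_next

    G = 0.0
    for t in reversed(range(len(traj))):         # walk backwards so G_t accumulates in O(1)
        s_t, a_t, r_t = traj[t]
        G = r_t + gamma * G
        probs = _softmax_probs(theta, x, s_t, actions)
        grad_ln_pi = x(s_t, a_t) - sum(p * x(s_t, b) for p, b in zip(probs, actions))
        theta = theta + alpha * (gamma ** t) * G * grad_ln_pi   # theta += alpha * gamma^t * G_t * grad ln pi
    return theta
```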

4.2 REINFORCE with Baseline

The idea of the baseline is to reduce variance — by construction it has no impact on the expected update.
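Concretely, with a learned state-value function $\hat v(S_t,\mathbf{w})$ as the baseline, the update becomes

$$\theta_{t+1}=\theta_t+\alpha\,\big(G_t-\hat v(S_t,\mathbf{w})\big)\,\nabla\ln\pi(A_t\mid S_t,\theta_t).$$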

4.3 Actor–Critic Methods

Although the REINFORCE-with-baseline method learns both a policy and a state-value function, we do not consider it to be an actor–critic method because its state-value function is used only as a baseline, not as a critic. That is, it is not used for bootstrapping (updating the value estimate for a state from the estimated values of subsequent states), but only as a baseline for the state whose estimate is being updated.

REINFORCE with baseline is unbiased and will converge asymptotically to a local optimum, but like all Monte Carlo methods it tends to learn slowly (produce estimates of high variance) and to be inconvenient to implement online or for continuing problems.
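One-step actor–critic addresses this by replacing the full return with the one-step return and using the learned state-value function as a critic for bootstrapping. Omitting the $\gamma^t$ factor that appears in the episodic pseudocode, the updates are

$$\begin{aligned}
\delta_t &= R_{t+1}+\gamma\,\hat v(S_{t+1},\mathbf{w})-\hat v(S_t,\mathbf{w}),\\
\mathbf{w} &\leftarrow \mathbf{w}+\alpha^{\mathbf{w}}\,\delta_t\,\nabla\hat v(S_t,\mathbf{w}),\\
\theta &\leftarrow \theta+\alpha^{\theta}\,\delta_t\,\nabla\ln\pi(A_t\mid S_t,\theta).
\end{aligned}$$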

4.4 Policy Gradient for Continuing Problems

In the continuing setting (no episode boundaries), performance is defined in terms of the average rate of reward per time step.
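That is,

$$J(\theta)\doteq r(\pi)\doteq\lim_{h\to\infty}\frac{1}{h}\sum_{t=1}^{h}\mathbb{E}\big[R_t\mid S_0,\,A_{0:t-1}\sim\pi\big]
=\sum_s\mu(s)\sum_a\pi(a\mid s)\sum_{s',r}p(s',r\mid s,a)\,r,$$

and values are defined with respect to the differential return $G_t\doteq R_{t+1}-r(\pi)+R_{t+2}-r(\pi)+\cdots$.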

With these changes the policy gradient theorem remains true.

4.5 Continuous action space

Policy-based methods offer practical ways of dealing with large action spaces, even continuous spaces with an infinite number of actions. Instead of computing learned probabilities for each of the many actions, we instead learn statistics of the probability distribution. For example, the action set might be the real numbers, with actions chosen from a normal (Gaussian) distribution.
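With the Gaussian parameterization from Section 3.2 ($\mu(s,\theta)=\theta_\mu^\top x_\mu(s)$, $\sigma(s,\theta)=\exp(\theta_\sigma^\top x_\sigma(s))$), the eligibility vectors needed by REINFORCE and actor–critic work out to

$$\nabla_{\theta_\mu}\ln\pi(a\mid s,\theta)=\frac{a-\mu(s,\theta)}{\sigma(s,\theta)^2}\,x_\mu(s),\qquad
\nabla_{\theta_\sigma}\ln\pi(a\mid s,\theta)=\left(\frac{\big(a-\mu(s,\theta)\big)^2}{\sigma(s,\theta)^2}-1\right)x_\sigma(s).$$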

0 Questions

Q1: For problems that are not MDPs, is it practical to learn a sequential policy model using a temporal convolutional network?

Q2: Can a parameterized policy focus on the action subspace of interest as well as action-value methods can? For example, how would it perform in Monte Carlo Tree Search, which refines the information the policy needs at decision time?
