Chapter 13: Policy Gradient Methods

1 Introduction

All previous methods in this book are action-value methods: the policy is derived from learned action values. In policy gradient methods, the policy itself is parameterized as a function
$$\pi(a\mid s,\theta)=\Pr\{A_t=a \mid S_t=s,\ \theta_t=\theta\}.$$
The parameterization should be (1) differentiable with respect to $\theta$, and (2) never fully deterministic, so that exploration is guaranteed.

If the action space is discrete and not too large, then a natural and common kind of parameterization is to form parameterized numerical preferences $h(s,a,\theta)\in\mathbb{R}$ for each state–action pair; a softmax over these preferences gives the action probabilities. Alternatively, an ANN can be used to parameterize the policy directly.

The objective is to maximize some performance measure $J(\theta)$:
$$\theta_{t+1}=\theta_t+\alpha\,\nabla J(\theta_t)$$

2 Advantages

  1. The approximate policy can approach a deterministic policy, whereas with $\varepsilon$-greedy action selection over action values there is always an $\varepsilon$ probability of selecting a random action.
  2. It enables the selection of actions with arbitrary probabilities.
  3. The policy may be a simpler function to approximate.
  4. The choice of policy parameterization is sometimes a good way of injecting prior knowledge about the desired form of the policy into the reinforcement learning system.
  5. Theoretical advantage: with continuous policy parameterization the action probabilities change smoothly as a function of the learned parameter, whereas with $\varepsilon$-greedy selection the action probabilities may change dramatically for an arbitrarily small change in the estimated action values.

3 Policy approximation

3.1 Parameterized preferences (softmax)

If the action space is discrete and not too large, then a natural and common kind of parameterization is to form parameterized numerical preferences $h(s,a,\theta)\in\mathbb{R}$ for each state–action pair.
$$\pi(a\mid s,\theta)=\frac{e^{h(s,a,\theta)}}{\sum_b e^{h(s,b,\theta)}} \tag{13.2}$$

The action preferences themselves can be parameterized arbitrarily: $h(s,a,\theta)$ can be computed by an ANN, or be linear in features, $h(s,a,\theta)=\theta^\top x(s,a)$.
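As a minimal sketch of (13.2) with linear preferences, assuming a user-supplied feature function `x(state, action)` that returns a vector (the names here are illustrative, not from the book):

```python
import numpy as np

def softmax_policy_probs(theta, x, state, actions):
    """pi(a|s,theta) from a softmax over linear preferences h(s,a,theta) = theta . x(s,a)."""
    prefs = np.array([theta @ x(state, a) for a in actions])
    prefs -= prefs.max()                      # shift for numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def sample_action(theta, x, state, actions, rng=None):
    """Sample an action from the softmax policy."""
    rng = rng or np.random.default_rng()
    return rng.choice(actions, p=softmax_policy_probs(theta, x, state, actions))
```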

With linear preferences, the eligibility vector (the gradient of the log-policy) works out to

$$\begin{aligned}
\nabla_\theta \ln\pi(a\mid s,\theta)
&=\nabla_\theta \ln\frac{e^{h(s,a,\theta)}}{\sum_b e^{h(s,b,\theta)}}\\
&=\nabla_\theta\Big(\theta^\top x(s,a)-\ln\sum_b e^{\theta^\top x(s,b)}\Big)\\
&=x(s,a)-\frac{\sum_b e^{\theta^\top x(s,b)}\,x(s,b)}{\sum_c e^{\theta^\top x(s,c)}}\\
&=x(s,a)-\sum_b \pi(b\mid s,\theta)\,x(s,b).
\end{aligned}$$

(For comparison, the sigmoid $f(x)=\frac{1}{1+e^{-x}}$, i.e. the two-action softmax, has derivative $\frac{df(x)}{dx}=f(x)\big(1-f(x)\big)$.)
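A quick finite-difference check of this identity; the feature matrix and dimensions below are arbitrary, made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, dim = 4, 6
X = rng.normal(size=(n_actions, dim))        # x(s, a) for a fixed state s, one row per action
theta = rng.normal(size=dim)

def log_pi(theta, a):
    """ln pi(a|s, theta) for the softmax over linear preferences."""
    prefs = X @ theta
    return prefs[a] - np.log(np.exp(prefs - prefs.max()).sum()) - prefs.max()

# Analytic eligibility vector: x(s,a) - sum_b pi(b|s) x(s,b)
prefs = X @ theta
pi = np.exp(prefs - prefs.max()); pi /= pi.sum()
a = 2
analytic = X[a] - pi @ X

# Central finite-difference gradient of ln pi(a|s, theta) w.r.t. theta
eps = 1e-6
numeric = np.array([(log_pi(theta + eps * np.eye(dim)[i], a)
                     - log_pi(theta - eps * np.eye(dim)[i], a)) / (2 * eps)
                    for i in range(dim)])
print(np.max(np.abs(analytic - numeric)))    # should be tiny (~1e-9)
```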

3.2 Policy Parameterization for Continuous Actions

Instead of computing learned probabilities for each of the many actions, we instead learn statistics of the probability distribution.
$$p(x)=\frac{1}{\sigma\sqrt{2\pi}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \tag{13.18}$$

Here $p(x)$ is the probability density at $x$, not a probability, so it can exceed 1.
To produce a policy parameterization, the policy can be defined as the normal probability density over a real-valued scalar action, with mean and standard deviation given by parametric function approximators that depend on the state.
$$\pi(a\mid s,\theta)=\frac{1}{\sigma(s,\theta)\sqrt{2\pi}}\exp\!\left(-\frac{\big(a-\mu(s,\theta)\big)^2}{2\sigma(s,\theta)^2}\right)$$
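A minimal sketch of sampling from such a policy, assuming the mean is linear in state features and the standard deviation is the exponential of a linear form (which keeps it positive, as in the book's suggested parameterization); the feature functions `x_mu` and `x_sigma` are placeholders:

```python
import numpy as np

def gaussian_policy_sample(theta_mu, theta_sigma, x_mu, x_sigma, state, rng=None):
    """Sample a real-valued action a ~ N(mu(s,theta), sigma(s,theta)^2).

    mu(s,theta)    = theta_mu . x_mu(s)               (linear in features)
    sigma(s,theta) = exp(theta_sigma . x_sigma(s))    (exponential keeps it positive)
    """
    rng = rng or np.random.default_rng()
    mu = theta_mu @ x_mu(state)
    sigma = np.exp(theta_sigma @ x_sigma(state))
    return rng.normal(mu, sigma), mu, sigma
```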

4 Policy Gradient Theorem

With function approximation it may seem challenging to change the policy parameter in a way that ensures improvement. The problem is that performance depends on both the action selections and the distribution of states in which those selections are made, and that both of these are affected by the policy parameter.
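The policy gradient theorem resolves this: for the episodic case it gives an exact expression for the gradient of performance with respect to the policy parameter that does not involve the derivative of the state distribution,

$$\nabla J(\theta)\ \propto\ \sum_s \mu(s)\sum_a q_\pi(s,a)\,\nabla\pi(a\mid s,\theta),$$

where $\mu$ is the on-policy state distribution under $\pi$.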

4.1 REINFORCE: Monte Carlo Policy Gradient



REINFORCE update without discounting (using the return $G_t$):
$$\theta_{t+1}=\theta_t+\alpha\, G_t\, \nabla\ln\pi(A_t\mid S_t,\theta_t)$$

REINFORCE update with discounting:
$$\theta_{t+1}=\theta_t+\alpha\, \gamma^t\, G_t\, \nabla\ln\pi(A_t\mid S_t,\theta_t)$$
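A hedged sketch of the discounted REINFORCE updates for a softmax-over-linear-preferences policy; the `env.reset()`/`env.step()` interface and the feature function `x` are assumptions for illustration, not part of the book's pseudocode:

```python
import numpy as np

def _softmax_probs(theta, x, state, actions):
    prefs = np.array([theta @ x(state, a) for a in actions])
    prefs -= prefs.max()
    e = np.exp(prefs)
    return e / e.sum()

def reinforce_episode(env, theta, x, actions, alpha=2e-4, gamma=1.0, rng=None):
    """One episode of Monte Carlo REINFORCE with a softmax policy over linear preferences."""
    rng = rng or np.random.default_rng()
    traj = []                                    # (state, action, reward) triples
    s, done = env.reset(), False
    while not done:
        probs = _softmax_probs(theta, x, s, actions)
        a = rng.choice(actions, p=probs)
        s_next, r, done = env.step(a)            # assumed environment interface
        traj.append((s, a, r))
        s = s_next

    G = 0.0
    for t in reversed(range(len(traj))):         # walk backwards so G_t accumulates in O(1)
        s_t, a_t, r_t = traj[t]
        G = r_t + gamma * G
        probs = _softmax_probs(theta, x, s_t, actions)
        grad_ln_pi = x(s_t, a_t) - sum(p * x(s_t, b) for p, b in zip(probs, actions))
        theta = theta + alpha * (gamma ** t) * G * grad_ln_pi   # theta += alpha * gamma^t * G_t * grad ln pi
    return theta
```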

4.2 REINFORCE with Baseline

The idea of the baseline is to reduce variance — by construction it has no impact on the expected update.
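Concretely, with a learned state-value function $\hat v(S_t,\mathbf{w})$ as the baseline, the update becomes

$$\theta_{t+1}=\theta_t+\alpha\,\big(G_t-\hat v(S_t,\mathbf{w})\big)\,\nabla\ln\pi(A_t\mid S_t,\theta_t).$$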

4.3 Actor–Critic Methods

Although the REINFORCE-with-baseline method learns both a policy and a state-value function, we do not consider it to be an actor–critic method because its state-value function is used only as a baseline, not as a critic. That is, it is not used for bootstrapping (updating the value estimate for a state from the estimated values of subsequent states), but only as a baseline for the state whose estimate is being updated.

REINFORCE with baseline is unbiased and will converge asymptotically to a local optimum, but like all Monte Carlo methods it tends to learn slowly (produce estimates of high variance) and to be inconvenient to implement online or for continuing problems.
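One-step actor–critic addresses this by replacing the full return with the one-step return and using the learned state-value function as a critic for bootstrapping. Omitting the $\gamma^t$ factor that appears in the episodic pseudocode, the updates are

$$\begin{aligned}
\delta_t &= R_{t+1}+\gamma\,\hat v(S_{t+1},\mathbf{w})-\hat v(S_t,\mathbf{w}),\\
\mathbf{w} &\leftarrow \mathbf{w}+\alpha^{\mathbf{w}}\,\delta_t\,\nabla\hat v(S_t,\mathbf{w}),\\
\theta &\leftarrow \theta+\alpha^{\theta}\,\delta_t\,\nabla\ln\pi(A_t\mid S_t,\theta).
\end{aligned}$$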

4.4 Policy Gradient for Continuing Problems

In the continuing setting (no episode boundaries), performance is defined in terms of the average rate of reward per time step.
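That is,

$$J(\theta)\doteq r(\pi)\doteq\lim_{h\to\infty}\frac{1}{h}\sum_{t=1}^{h}\mathbb{E}\big[R_t\mid S_0,\,A_{0:t-1}\sim\pi\big]
=\sum_s\mu(s)\sum_a\pi(a\mid s)\sum_{s',r}p(s',r\mid s,a)\,r,$$

and values are defined with respect to the differential return $G_t\doteq R_{t+1}-r(\pi)+R_{t+2}-r(\pi)+\cdots$.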

With these changes the policy gradient theorem remains true.

4.5 Continuous action space

Policy-based methods offer practical ways of dealing with large action spaces, even continuous spaces with an infinite number of actions. Instead of computing learned probabilities for each of the many actions, we instead learn statistics of the probability distribution. For example, the action set might be the real numbers, with actions chosen from a normal (Gaussian) distribution.
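With the Gaussian parameterization from Section 3.2 ($\mu(s,\theta)=\theta_\mu^\top x_\mu(s)$, $\sigma(s,\theta)=\exp(\theta_\sigma^\top x_\sigma(s))$), the eligibility vectors needed by REINFORCE and actor–critic work out to

$$\nabla_{\theta_\mu}\ln\pi(a\mid s,\theta)=\frac{a-\mu(s,\theta)}{\sigma(s,\theta)^2}\,x_\mu(s),\qquad
\nabla_{\theta_\sigma}\ln\pi(a\mid s,\theta)=\left(\frac{\big(a-\mu(s,\theta)\big)^2}{\sigma(s,\theta)^2}-1\right)x_\sigma(s).$$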

0 Questions

Q1: For problems that are not MDPs, is it practical to learn a sequential policy model using a temporal convolutional network?

Q2: Can a parameterized policy focus on the action subspace of interest as well as action-value methods can? For example, how would it perform in Monte Carlo Tree Search, which refines the information the policy needs at decision time?
