Reinforcement Learning: An Introduction — Chapter 13

Policy Gradient Methods

So far, almost all of the methods we have covered have been action-value methods: they learned the values of actions and then selected actions based on their estimated action values. In this chapter we consider methods that instead learn a parameterized policy that can select actions without consulting a value function. A value function may still be used to learn the policy parameter, but it is not required for action selection.

We use the notation $\theta$ for the policy's parameter vector. Thus we write:

$$\pi(a \mid s, \theta) = \Pr\{A_t = a \mid S_t = s, \theta_t = \theta\} \tag{1}$$

for the probability that action $a$ is taken at time $t$ given that the environment is in state $s$ with parameter $\theta$. If a method uses a learned value function as well, then the value function's weight vector is denoted $w$ as usual, as in $\hat{v}(s, w)$.

In this chapter we consider methods for learning the policy parameter based on the gradient of some scalar performance measure $J(\theta)$ with respect to the policy parameter $\theta$. These methods seek to maximize performance, so their updates approximate gradient ascent in $J$:

$$\theta_{t+1} = \theta_t + \alpha\, \widehat{\nabla J(\theta_t)} \tag{2}$$

where $\widehat{\nabla J(\theta_t)}$ is a stochastic estimate whose expectation approximates the gradient of the performance measure with respect to its argument $\theta_t$. All methods that follow this general scheme we call policy gradient methods. Methods that learn approximations to both policy and value functions are often called actor-critic methods, where "actor" is a reference to the learned policy and "critic" refers to the learned value function, usually a state-value function.

First, we treat the episodic case, in which performance is defined as the value of the start state under the parameterized policy, before going on to consider the continuing case, in which performance is defined as the average reward rate.

Policy Approximation and its Advantages

In policy gradient methods, the policy can be parameterized in any way, as long as $\pi(a \mid s, \theta)$ is differentiable with respect to its parameters, that is, as long as $\nabla \pi(a \mid s, \theta)$ exists and is finite for all $s \in \mathcal{S}$, $a \in \mathcal{A}$, and $\theta \in \mathbb{R}^{d}$. In practice, to ensure exploration we generally require that the policy never becomes deterministic.

In this section we introduce the most common parameterization for discrete action spaces and point out the advantages it offers over action-value methods.

If the action space is discrete and not too large, then a natural and common kind of parameterization is to form parameterized numerical preferences $h(s, a, \theta) \in \mathbb{R}$ for each state-action pair. The actions with the highest preferences in each state are given the highest probabilities of being selected, for example, according to an exponential softmax distribution:

$$\pi(a \mid s, \theta) = \frac{e^{h(s, a, \theta)}}{\sum_b e^{h(s, b, \theta)}} \tag{3}$$

We call this kind of policy parameterization softmax in action preferences. The action preferences themselves can be parameterized arbitrarily. For example, they might be computed by a deep neural network, where $\theta$ is the vector of connection weights of the network, or the preferences could simply be linear in features,

$$h(s, a, \theta) = \theta^\top x(s, a) \tag{4}$$

using feature vectors $x(s, a) \in \mathbb{R}^d$ constructed by any of the methods described in Chapter 9.
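
As a concrete illustration (not from the book), here is a minimal numpy sketch of the softmax distribution (3) over linear action preferences (4); the feature function `x(s, a)` is a hypothetical placeholder supplied by the caller.

```python
import numpy as np

def softmax_policy(theta, x, s, actions):
    """Action probabilities pi(.|s, theta) under softmax in linear action preferences.

    theta   : parameter vector, shape (d,)
    x       : feature function x(s, a) -> np.ndarray of shape (d,)   (hypothetical)
    s       : current state
    actions : iterable of the discrete actions available in s
    """
    prefs = np.array([theta @ x(s, a) for a in actions])  # h(s, a, theta) = theta^T x(s, a)
    prefs -= prefs.max()                                  # subtract max for numerical stability
    exp_prefs = np.exp(prefs)
    return exp_prefs / exp_prefs.sum()                    # equation (3)
```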

One advantage of parameterizing policies according to the softmax in action preferences is that the approximate policy can approach a deterministic policy, whereas with $\epsilon$-greedy action selection over action values there is always an $\epsilon$ probability of selecting a random action.

Of course, one could select actions according to a softmax distribution based on action values, but this alone would not allow the policy to approach a deterministic policy. Instead, the action-value estimates would converge to their corresponding true values, which would differ by a finite amount, translating to specific probabilities other than 0 and 1.

Action preferences are different because they do not approach specific values; instead they are driven to produce the optimal stochastic policy. If the optimal policy is deterministic, then the preferences of the optimal actions will be driven infinitely higher than those of all suboptimal actions.

A second advantage of parameterizing policies according to the softmax in action preferences is that it enables the selection of actions with arbitrary probabilities. In problems with significant function approximation, the best approximate policy may be stochastic. Action-value methods have no natural way of finding stochastic optimal policies, whereas policy-approximating methods can.

Perhaps the simplest advantage that policy parameterization may have over action-value parameterization is that the policy may be a simpler function to approximate. Problems vary in the complexity of their policies and action-value functions. For some, the action-value function is simpler and thus easier to approximate. For others, the policy is simpler. In the latter case a policy-based method will typically learn faster and yield a superior asymptotic policy.

Finally, we note that the choice of policy parameterization is sometimes a good way of injecting prior knowledge about the desired form of policy into the reinforcement learning system. This is often the most important reason for using a policy-based learning method.

The Policy Gradient Theorem

In addition to the practical advantages of policy parameterization over $\epsilon$-greedy action selection, there is also an important theoretical advantage. With continuous policy parameterization the action probabilities change smoothly as a function of the learned parameters, whereas under $\epsilon$-greedy selection the action probabilities can change dramatically for an arbitrarily small change in the estimated action values, if that change results in a different action having the maximal value. Largely because of this, stronger convergence guarantees are available for policy-gradient methods than for action-value methods. In particular, it is the continuity of the policy's dependence on the parameters that enables policy-gradient methods to approximate gradient ascent.

The episodic and continuing cases define the performance measure $J(\theta)$ differently and thus have to be treated separately to some extent. Nevertheless, we will try to present both cases uniformly, and we develop a notation so that the major theoretical results can be described with a single set of equations.

In this section we treat only the episodic case. We define the performance measure $J(\theta)$ as the value of the start state of the episode. We can simplify the notation without losing any meaningful generality by assuming that every episode starts in some particular (non-random) state $s_0$. Then, in the episodic case, we define performance as:

$$J(\theta) = v_{\pi_\theta}(s_0)$$

From here on in our discussion we will assume no discounting for the episodic case.

The problem is that performance depends on both the action selections and the distribution of states in which those selections are made, and that both of these are affected by the policy parameter.

Given a state, the effect of the policy parameter on the actions, and thus on the reward, can be computed in a relatively straightforward way from knowledge of the parameterization. But the effect of the policy on the state distribution is a function of the environment and is generally unknown. How can we estimate the performance gradient with respect to the policy parameter when the gradient depends on the unknown effect of policy changes on the state distribution?

The policy gradient theorem provides an analytic expression for the gradient of performance with respect to the policy parameter that does not involve the derivative of the state distribution. The policy gradient theorem for the episodic case establishes that:

$$\nabla J(\theta) \propto \sum_{s} \mu(s) \sum_{a} q_{\pi}(s, a)\, \nabla \pi(a \mid s, \theta) \tag{5}$$

In the episodic case, the constant of proportionality is the average length of an episode, and in the continuing case it is 1, so that the relationship is actually an equality. The distribution $\mu$ here is the on-policy distribution under $\pi$.

We can prove the policy gradient theorem from first principles. To keep the notation simple, we leave it implicit in all cases that $\pi$ is a function of $\theta$, and all gradients are also implicitly with respect to $\theta$.

REINFORCE: Monte Carlo Policy Gradient

Recall our overall strategy of stochastic gradient ascent, which requires a way to obtain samples such that the expectation of the sample gradient is proportional to the actual gradient of the performance measure as a function of the parameter. The sample gradients need only be proportional to $\nabla J(\theta)$, because any constant of proportionality can be absorbed into the step size $\alpha$. The policy gradient theorem gives an expression proportional to the gradient, so all that is needed is a way of sampling whose expectation equals or approximates that expression.

$$\begin{aligned} \nabla J(\theta) &\propto \sum_s \mu(s) \sum_a \nabla \pi(a \mid s, \theta)\, q_\pi(s, a) \\ &= \mathbb{E}_\pi\!\left[\sum_a \nabla \pi(a \mid S_t, \theta)\, q_\pi(S_t, a)\right] \end{aligned} \tag{6}$$

Therefore, we can use equation (6) in our gradient-ascent algorithm:

$$\theta_{t+1} = \theta_t + \alpha \sum_a \hat{q}(S_t, a)\, \nabla \pi(a \mid S_t, \theta) \tag{7}$$

where $\hat{q}(S_t, a)$ is some learned approximation to $q_\pi$. This algorithm is called an all-actions method because its update involves all of the actions.
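
As a concrete illustration of the all-actions update (7), here is a small sketch; `q_hat(s, a)` stands for some learned approximation to $q_\pi$ and `grad_pi(s, a, theta)` for $\nabla\pi(a \mid s, \theta)$, both hypothetical placeholders supplied by the caller.

```python
import numpy as np

def all_actions_update(theta, s_t, actions, q_hat, grad_pi, alpha=1e-3):
    """One step of equation (7): theta <- theta + alpha * sum_a q_hat(S_t, a) * grad pi(a|S_t, theta)."""
    total = np.zeros_like(theta)
    for a in actions:
        total += q_hat(s_t, a) * grad_pi(s_t, a, theta)   # q_hat approximates q_pi
    return theta + alpha * total
```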

Furthermore, to replace the sum over $a$ in equation (6) with the sampled random variable $A_t$, we rewrite the expression as the expectation of another quantity:

$$\begin{aligned} \nabla J(\theta) &= \mathbb{E}_\pi\!\left[\sum_a \pi(a \mid S_t, \theta)\, \frac{\nabla \pi(a \mid S_t, \theta)}{\pi(a \mid S_t, \theta)}\, q_\pi(S_t, a)\right] \\ &= \mathbb{E}_\pi\!\left[\frac{\nabla \pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta)}\, q_\pi(S_t, A_t)\right] \\ &= \mathbb{E}_\pi\!\left[G_t\, \frac{\nabla \pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta)}\right] \qquad \text{(because } \mathbb{E}_\pi[G_t \mid S_t, A_t] = q_\pi(S_t, A_t)\text{)} \\ &= \mathbb{E}_\pi\!\left[G_t\, \nabla \ln \pi(A_t \mid S_t, \theta)\right] \end{aligned} \tag{8}$$

The final expression in brackets is exactly what is needed: a quantity that can be sampled on each time step and whose expectation is equal to the gradient. Using this sample to instantiate our generic stochastic gradient ascent algorithm yields the REINFORCE update rule:

$$\begin{aligned} \theta_{t+1} &= \theta_t + \alpha\, \frac{\nabla \pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta)}\, G_t \\ &= \theta_t + \alpha\, G_t\, \nabla \ln \pi(A_t \mid S_t, \theta) \end{aligned} \tag{9}$$

Each increment is proportional to the product of a return $G_t$ and a vector, the gradient of the probability of taking the action actually taken divided by the probability of taking that action. The vector is the direction in parameter space that most increases the probability of repeating the action $A_t$ on future visits to state $S_t$. The update increases the parameter vector in this direction in proportion to the return, and in inverse proportion to the action probability. The $G_t$ factor moves the policy parameters most in directions that yield higher returns, while dividing by $\pi(A_t \mid S_t, \theta)$ counteracts the advantage of actions that are selected frequently but yield small returns; it can be viewed as a kind of normalization.

Note that REINFORCE uses the complete return from time $t$, which includes all future rewards up until the end of the episode. In this sense, REINFORCE is a Monte Carlo algorithm and is well defined only for the episodic case, with all updates made in retrospect after the episode is completed.

Pseudocode for REINFORCE is provided below:
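
The following is a minimal Python sketch in the spirit of that pseudocode, for the general discounted case (it keeps the $\gamma^t$ factor discussed next), using the softmax-in-linear-preferences policy from equations (3) and (4). The environment interface (`env.reset`, `env.step`), the feature function `x(s, a)`, and the step-size and episode-count values are hypothetical placeholders.

```python
import numpy as np

def reinforce(env, x, num_actions, d, alpha=2e-4, gamma=0.99, num_episodes=1000):
    """Monte Carlo policy gradient (REINFORCE) with a softmax-linear policy.

    env         : episodic environment with reset() -> s and step(a) -> (s_next, r, done)
    x           : feature function x(s, a) -> np.ndarray of shape (d,)   (hypothetical)
    num_actions : size of the discrete action set (actions are 0 .. num_actions-1)
    d           : dimensionality of theta
    """
    theta = np.zeros(d)

    def pi(s):
        prefs = np.array([theta @ x(s, a) for a in range(num_actions)])
        prefs -= prefs.max()                       # numerical stability
        p = np.exp(prefs)
        return p / p.sum()

    for _ in range(num_episodes):
        # Generate an episode S_0, A_0, R_1, ..., S_{T-1}, A_{T-1}, R_T following pi(.|., theta)
        states, actions, rewards = [], [], []
        s, done = env.reset(), False
        while not done:
            a = np.random.choice(num_actions, p=pi(s))
            s_next, r, done = env.step(a)
            states.append(s); actions.append(a); rewards.append(r)
            s = s_next

        # Returns G_t = R_{t+1} + gamma*R_{t+2} + ... + gamma^{T-t-1}*R_T for every step t
        T = len(rewards)
        returns = np.zeros(T)
        G = 0.0
        for t in reversed(range(T)):
            G = rewards[t] + gamma * G
            returns[t] = G

        # theta <- theta + alpha * gamma^t * G_t * grad ln pi(A_t | S_t, theta), equation (9)
        # with the extra gamma^t factor used in the general discounted case
        for t in range(T):
            probs = pi(states[t])
            # Eligibility vector for softmax-linear: x(s, a) - sum_b pi(b|s, theta) x(s, b)
            grad_ln_pi = x(states[t], actions[t]) - sum(
                probs[b] * x(states[t], b) for b in range(num_actions))
            theta += alpha * (gamma ** t) * returns[t] * grad_ln_pi
    return theta
```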

We will refer to $\nabla \ln \pi(A_t \mid S_t, \theta)$ simply as the eligibility vector. The difference between the pseudocode and the update rule in equation (9) is that the former includes a factor of $\gamma^t$. This is because, as mentioned earlier, in the text we are treating the undiscounted case, while in the boxed algorithms we are giving the algorithms for the general discounted case.

As a stochastic gradient method, REINFORCE has good theoretical convergence properties. By construction, the expected update over an episode is in the same direction as the performance gradient. This assures an improvement in expected performance for sufficiently small $\alpha$, and convergence to a local maximum under standard stochastic approximation conditions for decreasing $\alpha$. However, as a Monte Carlo method, REINFORCE may have high variance and thus produce slow learning.

[Exercise] Consider a policy parameterization using the softmax in action preferences with linear action preferences, that is:

$$\pi(a \mid s) = \frac{e^{\theta^\top x(s, a)}}{\sum_b e^{\theta^\top x(s, b)}}$$

For this parameterization, the eligibility vector is:

$$\begin{aligned} \nabla_\theta \ln \pi(a \mid s, \theta) &= \nabla \left[\ln e^{\theta^\top x(s, a)} - \ln \sum_b e^{\theta^\top x(s, b)}\right] \\ &= \nabla\, \theta^\top x(s, a) - \nabla \ln \sum_b e^{\theta^\top x(s, b)} \\ &= x(s, a) - \frac{\nabla \sum_b e^{\theta^\top x(s, b)}}{\sum_b e^{\theta^\top x(s, b)}} \\ &= x(s, a) - \sum_b \frac{e^{\theta^\top x(s, b)}}{\sum_c e^{\theta^\top x(s, c)}}\, x(s, b) \\ &= x(s, a) - \sum_b \pi(b \mid s, \theta)\, x(s, b) \end{aligned}$$
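
As a quick sanity check of this result, the following sketch compares the analytic eligibility vector with a finite-difference estimate of $\nabla_\theta \ln \pi(a \mid s, \theta)$; the random feature matrix stands in for a hypothetical $x(s, a)$ at one fixed state.

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_actions = 5, 3
theta = rng.normal(size=d)
feats = rng.normal(size=(num_actions, d))      # row a plays the role of x(s, a) for a fixed s

def log_pi(theta, a):
    prefs = feats @ theta
    m = prefs.max()
    return prefs[a] - (m + np.log(np.exp(prefs - m).sum()))   # ln pi(a|s, theta)

a = 1
# Analytic eligibility vector: x(s, a) - sum_b pi(b|s, theta) x(s, b)
prefs = feats @ theta
probs = np.exp(prefs - prefs.max()); probs /= probs.sum()
analytic = feats[a] - probs @ feats

# Central finite-difference estimate of the same gradient
eps = 1e-6
numeric = np.array([(log_pi(theta + eps * np.eye(d)[i], a)
                     - log_pi(theta - eps * np.eye(d)[i], a)) / (2 * eps)
                    for i in range(d)])
print(np.allclose(analytic, numeric, atol=1e-6))   # expected: True
```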

REINFORCE with Baseline

The policy gradient theorem can be generalized to include a comparison of the action value to an arbitrary baseline $b(s)$:

$$\nabla J(\theta) \propto \sum_s \mu(s) \sum_a \bigl[q_\pi(s, a) - b(s)\bigr]\, \nabla \pi(a \mid s, \theta) \tag{10}$$

The baseline can be any function, even a random variable, as long as it does not vary with the action $a$. The equation remains valid because the subtracted quantity is zero:

$$\sum_a b(s)\, \nabla \pi(a \mid s) = b(s)\, \nabla \sum_a \pi(a \mid s) = 0$$

The policy gradient theorem with baseline can be used to derive a new version of REINFORCE that includes a general baseline:

$$\theta_{t+1} = \theta_t + \alpha\, \nabla \ln \pi(A_t \mid S_t, \theta)\, \bigl[G_t - b(S_t)\bigr] \tag{11}$$

In general, the baseline leaves the expected value of the update unchanged, but it can have a large effect on its variance. For MDPs the baseline should vary with the state. In some states all actions have high values, and we need a high baseline to differentiate the higher-valued actions from the less highly valued ones; in other states all actions will have low values and a low baseline is appropriate.

One natural choice for the baseline is an estimate of the state value, $\hat{v}(S_t, w)$, where $w \in \mathbb{R}^m$ is a weight vector learned by one of the value-function approximation methods. Because REINFORCE is a Monte Carlo method for learning the policy parameter $\theta$, it seems natural to also use a Monte Carlo method to estimate the state-value weights $w$.

Complete pseudocode for the REINFORCE-with-baseline algorithm, which uses such a learned state-value function as the baseline, is given below:
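
Here is a minimal Python sketch in the spirit of that pseudocode, assuming a linear state-value baseline $\hat{v}(s, w) = w^\top x_s(s)$ and the softmax-linear policy used earlier; the environment interface, the feature functions `x_sa(s, a)` and `x_s(s)`, and the step sizes are hypothetical placeholders.

```python
import numpy as np

def reinforce_with_baseline(env, x_sa, x_s, num_actions, d_theta, d_w,
                            alpha_theta=2e-4, alpha_w=2e-3, gamma=0.99,
                            num_episodes=1000):
    """REINFORCE with a learned state-value baseline v_hat(s, w) = w^T x_s(s).

    env  : episodic environment with reset() -> s and step(a) -> (s_next, r, done)
    x_sa : state-action feature function for the policy       (hypothetical)
    x_s  : state feature function for the value baseline      (hypothetical)
    """
    theta, w = np.zeros(d_theta), np.zeros(d_w)

    def pi(s):
        prefs = np.array([theta @ x_sa(s, a) for a in range(num_actions)])
        prefs -= prefs.max()
        p = np.exp(prefs)
        return p / p.sum()

    for _ in range(num_episodes):
        # Generate an episode following pi(.|., theta)
        states, actions, rewards = [], [], []
        s, done = env.reset(), False
        while not done:
            a = np.random.choice(num_actions, p=pi(s))
            s_next, r, done = env.step(a)
            states.append(s); actions.append(a); rewards.append(r)
            s = s_next

        # Returns G_t for every step of the episode
        T = len(rewards)
        G = np.zeros(T)
        running = 0.0
        for t in reversed(range(T)):
            running = rewards[t] + gamma * running
            G[t] = running

        for t in range(T):
            delta = G[t] - w @ x_s(states[t])                  # G_t - v_hat(S_t, w)
            w += alpha_w * delta * x_s(states[t])              # Monte Carlo update of the baseline
            probs = pi(states[t])
            grad_ln_pi = x_sa(states[t], actions[t]) - sum(
                probs[b] * x_sa(states[t], b) for b in range(num_actions))
            theta += alpha_theta * (gamma ** t) * delta * grad_ln_pi   # update (11) with gamma^t
    return theta, w
```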

This algorithm has two step sizes, denoted $\alpha^\theta$ and $\alpha^w$. Choosing the step size for the values is relatively easy: in the linear case we have rules of thumb for setting it, such as $\alpha^w = 0.1/\mathbb{E}\bigl[\lVert \nabla \hat{v}(S_t, w)\rVert^2_\mu\bigr]$. It is much less clear how to set the step size for the policy parameters, $\alpha^\theta$, whose best value depends on the range of variation of the rewards and on the policy parameterization.

Actor-Critic Methods

Although REINFORCE-with-baseline learns both a policy and a state-value function, we do not consider it to be an actor-critic method, because its state-value function is used only as a baseline, not as a critic. That is, the state-value function is not used for bootstrapping; it is used only to estimate the value of the current state, $\hat{v}(S_t, w)$, whereas a proper critic also estimates $\hat{v}(S_{t+1}, w)$ as well as $\hat{v}(S_t, w)$. This is a useful distinction, for only through bootstrapping do we introduce bias and an asymptotic dependence on the quality of the function approximation. The bias introduced through bootstrapping and reliance on the state representation is often beneficial because it reduces variance and accelerates learning. REINFORCE with baseline is unbiased and will converge asymptotically to a local optimum, but like all Monte Carlo methods it tends to learn slowly and to be inconvenient to implement online or for continuing problems. With temporal-difference methods we can eliminate these inconveniences, and through multi-step methods we can flexibly choose the degree of bootstrapping. In order to gain these advantages in the case of policy gradient methods, we use actor-critic methods with a bootstrapping critic.

First we consider one-step actor-critic methods, the analog of the TD methods introduced in Chapter 6 such as TD(0), Sarsa(0), and Q-learning. The main appeal of one-step methods is that they are fully online and incremental, yet avoid the complexities of eligibility traces. One-step actor-critic methods replace the full return of REINFORCE with the one-step return as follows:

$$\begin{aligned} \theta_{t+1} &= \theta_t + \alpha\, \nabla \ln \pi(A_t \mid S_t, \theta)\, \bigl[G_{t:t+1} - \hat{v}(S_t, w)\bigr] \\ &= \theta_t + \alpha\, \nabla \ln \pi(A_t \mid S_t, \theta)\, \bigl[R_{t+1} + \gamma\, \hat{v}(S_{t+1}, w) - \hat{v}(S_t, w)\bigr] \\ &= \theta_t + \alpha\, \delta_t\, \nabla \ln \pi(A_t \mid S_t, \theta) \end{aligned} \tag{12}$$

Note that it is now a fully online, incremental algorithm, with states, actions, and rewards processed as they occur and then never revisited.
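
As an illustration, here is a minimal sketch of such a one-step actor-critic, assuming a linear critic updated by semi-gradient TD(0) (a natural pairing, though the critic's learning rule is my assumption here) and the softmax-linear policy from earlier; the environment interface, feature functions, and step sizes are hypothetical placeholders. The variable `I` accumulates the $\gamma^t$ factor used in the discounted episodic case.

```python
import numpy as np

def one_step_actor_critic(env, x_sa, x_s, num_actions, d_theta, d_w,
                          alpha_theta=1e-3, alpha_w=1e-2, gamma=0.99,
                          num_episodes=1000):
    """One-step actor-critic with a linear critic v_hat(s, w) = w^T x_s(s), per equation (12)."""
    theta, w = np.zeros(d_theta), np.zeros(d_w)

    def pi(s):
        prefs = np.array([theta @ x_sa(s, a) for a in range(num_actions)])
        prefs -= prefs.max()
        p = np.exp(prefs)
        return p / p.sum()

    for _ in range(num_episodes):
        s, done = env.reset(), False
        I = 1.0                                          # accumulates gamma^t
        while not done:
            probs = pi(s)
            a = np.random.choice(num_actions, p=probs)
            s_next, r, done = env.step(a)

            # One-step TD error: delta = R + gamma*v_hat(S') - v_hat(S), with v_hat(S') = 0 at terminal
            v_next = 0.0 if done else w @ x_s(s_next)
            delta = r + gamma * v_next - w @ x_s(s)

            w += alpha_w * delta * x_s(s)                # semi-gradient TD(0) critic update
            grad_ln_pi = x_sa(s, a) - sum(
                probs[b] * x_sa(s, b) for b in range(num_actions))
            theta += alpha_theta * I * delta * grad_ln_pi   # actor update, equation (12)
            I *= gamma
            s = s_next
    return theta, w
```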

The generalization to the forward view of $n$-step methods and then to a $\lambda$-return algorithm is straightforward: the one-step return in equation (12) is merely replaced by $G_{t:t+n}$ or $G_t^\lambda$ respectively. The backward view of the $\lambda$-return algorithm is also straightforward, using separate eligibility traces for the actor and critic, each after the patterns in Chapter 12.

Policy Gradient for Continuing Problems

For continuing problems without episode boundaries we need to define performance in terms of the average rate of reward per time step:

$$J(\theta) = r(\pi) = \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h} \mathbb{E}\bigl[R_t \mid S_0,\, A_{0:t-1} \sim \pi\bigr]$$
