Reinforcement Learning: An Introduction — Chapter 13

Policy Gradient Methods

So far, almost all of the methods we have covered have been action-value methods: they learned the values of actions and then selected actions based on their estimated action values. In this chapter we consider methods that instead learn a parameterized policy that can select actions without consulting a value function. A value function may still be used to learn the policy parameter, but it is not required for action selection.

We use the notation $\theta$ for the policy's parameter vector. Thus we write:

$$\pi(a \mid s, \theta) = \Pr\{A_t = a \mid S_t = s, \theta_t = \theta\} \tag{1}$$

for the probability that action $a$ is taken at time $t$ given that the environment is in state $s$ with parameter $\theta$. If a method uses a learned value function as well, then the value function's weight vector is denoted $w$ as usual, as in $\hat{v}(s, w)$.

In this chapter we consider methods for learning the policy parameter based on the gradient of some scalar performance measure $J(\theta)$ with respect to the policy parameter $\theta$. These methods seek to maximize performance, so their updates approximate gradient ascent in $J$:

$$\theta_{t+1} = \theta_t + \alpha\, \widehat{\nabla J(\theta_t)} \tag{2}$$

where $\widehat{\nabla J(\theta_t)}$ is a stochastic estimate whose expectation approximates the gradient of the performance measure with respect to its argument $\theta_t$. All methods that follow this general scheme we call policy gradient methods. Methods that learn approximations to both policy and value functions are often called actor-critic methods, where "actor" is a reference to the learned policy and "critic" refers to the learned value function, usually a state-value function.

First, we treat the episodic case, in which performance is defined as the value of the start state under the parameterized policy, before going on to consider the continuing case, in which performance is defined as the average reward rate.

Policy Approximation and its Advantages

In policy gradient methods, the policy can be parameterized in any way, as long as $\pi(a \mid s, \theta)$ is differentiable with respect to its parameters, that is, as long as $\nabla \pi(a \mid s, \theta)$ exists and is finite for all $s \in \mathcal{S}$, $a \in \mathcal{A}$, and $\theta \in \mathbb{R}^{d}$. In practice, to ensure exploration we generally require that the policy never becomes deterministic.

In this section we introduce the most common parameterization for discrete action spaces and point out the advantages it offers over action-value methods.

If the action space is discrete and not too large, then a natural and common kind of parameterization is to form parameterized numerical preferences $h(s, a, \theta) \in \mathbb{R}$ for each state-action pair. The actions with the highest preferences in each state are given the highest probabilities of being selected, for example, according to an exponential softmax distribution:

$$\pi(a \mid s, \theta) = \frac{e^{h(s, a, \theta)}}{\sum_b e^{h(s, b, \theta)}} \tag{3}$$

We call this kind of policy parameterization softmax in action preferences. The action preferences themselves can be parameterized arbitrarily. For example, they might be computed by a deep neural network, where $\theta$ is the vector of connection weights of the network, or the preferences could simply be linear in features,

$$h(s, a, \theta) = \theta^\top x(s, a) \tag{4}$$

using feature vectors $x(s, a) \in \mathbb{R}^d$ constructed by any of the methods described in Chapter 9.
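
As a concrete illustration (not from the book), here is a minimal numpy sketch of the softmax distribution (3) over linear action preferences (4); the feature function `x(s, a)` is a hypothetical placeholder supplied by the caller.

```python
import numpy as np

def softmax_policy(theta, x, s, actions):
    """Action probabilities pi(.|s, theta) under softmax in linear action preferences.

    theta   : parameter vector, shape (d,)
    x       : feature function x(s, a) -> np.ndarray of shape (d,)   (hypothetical)
    s       : current state
    actions : iterable of the discrete actions available in s
    """
    prefs = np.array([theta @ x(s, a) for a in actions])  # h(s, a, theta) = theta^T x(s, a)
    prefs -= prefs.max()                                  # subtract max for numerical stability
    exp_prefs = np.exp(prefs)
    return exp_prefs / exp_prefs.sum()                    # equation (3)
```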

One advantage of parameterizing policies according to the softmax in action preferences is that the approximate policy can approach a deterministic policy, whereas with $\epsilon$-greedy action selection over action values there is always an $\epsilon$ probability of selecting a random action.

Of course, one could select actions according to a softmax distribution based on action values, but this alone would not allow the policy to approach a deterministic policy. Instead, the action-value estimates would converge to their corresponding true values, which would differ by a finite amount, translating to specific probabilities other than 0 and 1.

Action preferences are different because they do not approach specific values; instead they are driven to produce the optimal stochastic policy. If the optimal policy is deterministic, then the preferences of the optimal actions will be driven infinitely higher than those of all suboptimal actions.

A second advantage of parameterizing policies according to the softmax in action preferences is that it enables the selection of actions with arbitrary probabilities. In problems with significant function approximation, the best approximate policy may be stochastic. Action-value methods have no natural way of finding stochastic optimal policies, whereas policy-approximating methods can.

Perhaps the simplest advantage that policy parameterization may have over action-value parameterization is that the policy may be a simpler function to approximate. Problems vary in the complexity of their policies and action-value functions. For some, the action-value function is simpler and thus easier to approximate. For others, the policy is simpler. In the latter case a policy-based method will typically learn faster and yield a superior asymptotic policy.

Finally, we note that the choice of policy parameterization is sometimes a good way of injecting prior knowledge about the desired form of policy into the reinforcement learning system. This is often the most important reason for using a policy-based learning method.

The Policy Gradient Theorem

In addition to the practical advantages of policy parameterization over $\epsilon$-greedy action selection, there is also an important theoretical advantage. With continuous policy parameterization the action probabilities change smoothly as a function of the learned parameters, whereas under $\epsilon$-greedy selection the action probabilities can change dramatically for an arbitrarily small change in the estimated action values, if that change results in a different action having the maximal value. Largely because of this, stronger convergence guarantees are available for policy-gradient methods than for action-value methods. In particular, it is the continuity of the policy's dependence on the parameters that enables policy-gradient methods to approximate gradient ascent.

The episodic and continuing cases define the performance measure $J(\theta)$ differently and thus have to be treated separately to some extent. Nevertheless, we will try to present both cases uniformly, and we develop a notation so that the major theoretical results can be described with a single set of equations.

In this section we treat only the episodic case. We define the performance measure $J(\theta)$ as the value of the start state of the episode. We can simplify the notation without losing any meaningful generality by assuming that every episode starts in some particular (non-random) state $s_0$. Then, in the episodic case, we define performance as:

$$J(\theta) = v_{\pi_\theta}(s_0)$$

From here on in our discussion we will assume no discounting for the episodic case.

The problem is that performance depends on both the action selections and the distribution of states in which those selections are made, and that both of these are affected by the policy parameter.

Given a state, the effect of the policy parameter on the actions, and thus on the reward, can be computed in a relatively straightforward way from knowledge of the parameterization. But the effect of the policy on the state distribution is a function of the environment and is generally unknown. How can we estimate the performance gradient with respect to the policy parameter when the gradient depends on the unknown effect of policy changes on the state distribution?

The policy gradient theorem provides an analytic expression for the gradient of performance with respect to the policy parameter that does not involve the derivative of the state distribution. The policy gradient theorem for the episodic case establishes that:

$$\nabla J(\theta) \propto \sum_{s} \mu(s) \sum_{a} q_{\pi}(s, a)\, \nabla \pi(a \mid s, \theta) \tag{5}$$

In the episodic case, the constant of proportionality is the average length of an episode, and in the continuing case it is 1, so that the relationship is actually an equality. The distribution $\mu$ here is the on-policy distribution under $\pi$.

We can prove the policy gradient theorem from first principles. To keep the notation simple, we leave it implicit in all cases that $\pi$ is a function of $\theta$, and all gradients are also implicitly with respect to $\theta$.

REINFORCE: Monte Carlo Policy Gradient

Recall our overall strategy of stochastic gradient ascent, which requires a way to obtain samples such that the expectation of the sample gradient is proportional to the actual gradient of the performance measure as a function of the parameter. The sample gradients need only be proportional to $\nabla J(\theta)$, because any constant of proportionality can be absorbed into the step size $\alpha$. The policy gradient theorem gives an expression proportional to the gradient, so all that is needed is a way of sampling whose expectation equals or approximates that expression.

$$\begin{aligned} \nabla J(\theta) &\propto \sum_s \mu(s) \sum_a \nabla \pi(a \mid s, \theta)\, q_\pi(s, a) \\ &= \mathbb{E}_\pi\!\left[\sum_a \nabla \pi(a \mid S_t, \theta)\, q_\pi(S_t, a)\right] \end{aligned} \tag{6}$$

Therefore, we can use equation (6) in our gradient-ascent algorithm:

$$\theta_{t+1} = \theta_t + \alpha \sum_a \hat{q}(S_t, a)\, \nabla \pi(a \mid S_t, \theta) \tag{7}$$

where $\hat{q}(S_t, a)$ is some learned approximation to $q_\pi$. This algorithm is called an all-actions method because its update involves all of the actions.
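
As a concrete illustration of the all-actions update (7), here is a small sketch; `q_hat(s, a)` stands for some learned approximation to $q_\pi$ and `grad_pi(s, a, theta)` for $\nabla\pi(a \mid s, \theta)$, both hypothetical placeholders supplied by the caller.

```python
import numpy as np

def all_actions_update(theta, s_t, actions, q_hat, grad_pi, alpha=1e-3):
    """One step of equation (7): theta <- theta + alpha * sum_a q_hat(S_t, a) * grad pi(a|S_t, theta)."""
    total = np.zeros_like(theta)
    for a in actions:
        total += q_hat(s_t, a) * grad_pi(s_t, a, theta)   # q_hat approximates q_pi
    return theta + alpha * total
```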

Furthermore, to replace the sum over $a$ in equation (6) with the sampled random variable $A_t$, we rewrite the expression as the expectation of another quantity:

$$\begin{aligned} \nabla J(\theta) &= \mathbb{E}_\pi\!\left[\sum_a \pi(a \mid S_t, \theta)\, \frac{\nabla \pi(a \mid S_t, \theta)}{\pi(a \mid S_t, \theta)}\, q_\pi(S_t, a)\right] \\ &= \mathbb{E}_\pi\!\left[\frac{\nabla \pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta)}\, q_\pi(S_t, A_t)\right] \\ &= \mathbb{E}_\pi\!\left[G_t\, \frac{\nabla \pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta)}\right] \qquad \text{(because } \mathbb{E}_\pi[G_t \mid S_t, A_t] = q_\pi(S_t, A_t)\text{)} \\ &= \mathbb{E}_\pi\!\left[G_t\, \nabla \ln \pi(A_t \mid S_t, \theta)\right] \end{aligned} \tag{8}$$

The final expression in brackets is exactly what is needed: a quantity that can be sampled on each time step and whose expectation is equal to the gradient. Using this sample to instantiate our generic stochastic gradient ascent algorithm yields the REINFORCE update rule:

$$\begin{aligned} \theta_{t+1} &= \theta_t + \alpha\, \frac{\nabla \pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta)}\, G_t \\ &= \theta_t + \alpha\, G_t\, \nabla \ln \pi(A_t \mid S_t, \theta) \end{aligned} \tag{9}$$

Each increment is proportional to the product of a return $G_t$ and a vector, the gradient of the probability of taking the action actually taken divided by the probability of taking that action. The vector is the direction in parameter space that most increases the probability of repeating the action $A_t$ on future visits to state $S_t$. The update increases the parameter vector in this direction in proportion to the return, and in inverse proportion to the action probability. The $G_t$ factor moves the policy parameters most in directions that yield higher returns, while dividing by $\pi(A_t \mid S_t, \theta)$ counteracts the advantage of actions that are selected frequently but yield small returns; it can be viewed as a kind of normalization.

Note that REINFORCE uses the complete return from time $t$, which includes all future rewards up until the end of the episode. In this sense, REINFORCE is a Monte Carlo algorithm and is well defined only for the episodic case, with all updates made in retrospect after the episode is completed.

Pseudocode for REINFORCE is provided below:
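
The following is a minimal Python sketch in the spirit of that pseudocode, for the general discounted case (it keeps the $\gamma^t$ factor discussed next), using the softmax-in-linear-preferences policy from equations (3) and (4). The environment interface (`env.reset`, `env.step`), the feature function `x(s, a)`, and the step-size and episode-count values are hypothetical placeholders.

```python
import numpy as np

def reinforce(env, x, num_actions, d, alpha=2e-4, gamma=0.99, num_episodes=1000):
    """Monte Carlo policy gradient (REINFORCE) with a softmax-linear policy.

    env         : episodic environment with reset() -> s and step(a) -> (s_next, r, done)
    x           : feature function x(s, a) -> np.ndarray of shape (d,)   (hypothetical)
    num_actions : size of the discrete action set (actions are 0 .. num_actions-1)
    d           : dimensionality of theta
    """
    theta = np.zeros(d)

    def pi(s):
        prefs = np.array([theta @ x(s, a) for a in range(num_actions)])
        prefs -= prefs.max()                       # numerical stability
        p = np.exp(prefs)
        return p / p.sum()

    for _ in range(num_episodes):
        # Generate an episode S_0, A_0, R_1, ..., S_{T-1}, A_{T-1}, R_T following pi(.|., theta)
        states, actions, rewards = [], [], []
        s, done = env.reset(), False
        while not done:
            a = np.random.choice(num_actions, p=pi(s))
            s_next, r, done = env.step(a)
            states.append(s); actions.append(a); rewards.append(r)
            s = s_next

        # Returns G_t = R_{t+1} + gamma*R_{t+2} + ... + gamma^{T-t-1}*R_T for every step t
        T = len(rewards)
        returns = np.zeros(T)
        G = 0.0
        for t in reversed(range(T)):
            G = rewards[t] + gamma * G
            returns[t] = G

        # theta <- theta + alpha * gamma^t * G_t * grad ln pi(A_t | S_t, theta), equation (9)
        # with the extra gamma^t factor used in the general discounted case
        for t in range(T):
            probs = pi(states[t])
            # Eligibility vector for softmax-linear: x(s, a) - sum_b pi(b|s, theta) x(s, b)
            grad_ln_pi = x(states[t], actions[t]) - sum(
                probs[b] * x(states[t], b) for b in range(num_actions))
            theta += alpha * (gamma ** t) * returns[t] * grad_ln_pi
    return theta
```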

We will refer to $\nabla \ln \pi(A_t \mid S_t, \theta)$ simply as the eligibility vector. The difference between the pseudocode and the update rule in equation (9) is that the former includes a factor of $\gamma^t$. This is because, as mentioned earlier, in the text we are treating the undiscounted case, while in the boxed algorithms we are giving the algorithms for the general discounted case.

As a stochastic gradient method, REINFORCE has good theoretical convergence properties. By construction, the expected update over an episode is in the same direction as the performance gradient. This assures an improvement in expected performance for sufficiently small $\alpha$, and convergence to a local maximum under standard stochastic approximation conditions for decreasing $\alpha$. However, as a Monte Carlo method, REINFORCE may have high variance and thus produce slow learning.

[Exercise] Consider a policy parameterization using the softmax in action preferences with linear action preferences, that is:

$$\pi(a \mid s) = \frac{e^{\theta^\top x(s, a)}}{\sum_b e^{\theta^\top x(s, b)}}$$

For this parameterization, the eligibility vector is:

$$\begin{aligned} \nabla_\theta \ln \pi(a \mid s, \theta) &= \nabla \left[\ln e^{\theta^\top x(s, a)} - \ln \sum_b e^{\theta^\top x(s, b)}\right] \\ &= \nabla\, \theta^\top x(s, a) - \nabla \ln \sum_b e^{\theta^\top x(s, b)} \\ &= x(s, a) - \frac{\nabla \sum_b e^{\theta^\top x(s, b)}}{\sum_b e^{\theta^\top x(s, b)}} \\ &= x(s, a) - \sum_b \frac{e^{\theta^\top x(s, b)}}{\sum_c e^{\theta^\top x(s, c)}}\, x(s, b) \\ &= x(s, a) - \sum_b \pi(b \mid s, \theta)\, x(s, b) \end{aligned}$$
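
As a quick sanity check of this result, the following sketch compares the analytic eligibility vector with a finite-difference estimate of $\nabla_\theta \ln \pi(a \mid s, \theta)$; the random feature matrix stands in for a hypothetical $x(s, a)$ at one fixed state.

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_actions = 5, 3
theta = rng.normal(size=d)
feats = rng.normal(size=(num_actions, d))      # row a plays the role of x(s, a) for a fixed s

def log_pi(theta, a):
    prefs = feats @ theta
    m = prefs.max()
    return prefs[a] - (m + np.log(np.exp(prefs - m).sum()))   # ln pi(a|s, theta)

a = 1
# Analytic eligibility vector: x(s, a) - sum_b pi(b|s, theta) x(s, b)
prefs = feats @ theta
probs = np.exp(prefs - prefs.max()); probs /= probs.sum()
analytic = feats[a] - probs @ feats

# Central finite-difference estimate of the same gradient
eps = 1e-6
numeric = np.array([(log_pi(theta + eps * np.eye(d)[i], a)
                     - log_pi(theta - eps * np.eye(d)[i], a)) / (2 * eps)
                    for i in range(d)])
print(np.allclose(analytic, numeric, atol=1e-6))   # expected: True
```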

REINFORCE with Baseline

The policy gradient theorem can be generalized to include a comparison of the action value to an arbitrary baseline $b(s)$:

$$\nabla J(\theta) \propto \sum_s \mu(s) \sum_a \bigl[q_\pi(s, a) - b(s)\bigr]\, \nabla \pi(a \mid s, \theta) \tag{10}$$

The baseline can be any function, even a random variable, as long as it does not vary with the action $a$. The equation remains valid because the subtracted quantity is zero:

$$\sum_a b(s)\, \nabla \pi(a \mid s) = b(s)\, \nabla \sum_a \pi(a \mid s) = 0$$

The policy gradient theorem with baseline can be used to derive a new version of REINFORCE that includes a general baseline:

$$\theta_{t+1} = \theta_t + \alpha\, \nabla \ln \pi(A_t \mid S_t, \theta)\, \bigl[G_t - b(S_t)\bigr] \tag{11}$$

In general, the baseline leaves the expected value of the update unchanged, but it can have a large effect on its variance. For MDPs the baseline should vary with the state. In some states all actions have high values, and we need a high baseline to differentiate the higher-valued actions from the less highly valued ones; in other states all actions will have low values and a low baseline is appropriate.

One natural choice for the baseline is an estimate of the state value, $\hat{v}(S_t, w)$, where $w \in \mathbb{R}^m$ is a weight vector learned by one of the value-function approximation methods. Because REINFORCE is a Monte Carlo method for learning the policy parameter $\theta$, it seems natural to also use a Monte Carlo method to estimate the state-value weights $w$.

Complete pseudocode for the REINFORCE-with-baseline algorithm, which uses such a learned state-value function as the baseline, is given below:
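
Here is a minimal Python sketch in the spirit of that pseudocode, assuming a linear state-value baseline $\hat{v}(s, w) = w^\top x_s(s)$ and the softmax-linear policy used earlier; the environment interface, the feature functions `x_sa(s, a)` and `x_s(s)`, and the step sizes are hypothetical placeholders.

```python
import numpy as np

def reinforce_with_baseline(env, x_sa, x_s, num_actions, d_theta, d_w,
                            alpha_theta=2e-4, alpha_w=2e-3, gamma=0.99,
                            num_episodes=1000):
    """REINFORCE with a learned state-value baseline v_hat(s, w) = w^T x_s(s).

    env  : episodic environment with reset() -> s and step(a) -> (s_next, r, done)
    x_sa : state-action feature function for the policy       (hypothetical)
    x_s  : state feature function for the value baseline      (hypothetical)
    """
    theta, w = np.zeros(d_theta), np.zeros(d_w)

    def pi(s):
        prefs = np.array([theta @ x_sa(s, a) for a in range(num_actions)])
        prefs -= prefs.max()
        p = np.exp(prefs)
        return p / p.sum()

    for _ in range(num_episodes):
        # Generate an episode following pi(.|., theta)
        states, actions, rewards = [], [], []
        s, done = env.reset(), False
        while not done:
            a = np.random.choice(num_actions, p=pi(s))
            s_next, r, done = env.step(a)
            states.append(s); actions.append(a); rewards.append(r)
            s = s_next

        # Returns G_t for every step of the episode
        T = len(rewards)
        G = np.zeros(T)
        running = 0.0
        for t in reversed(range(T)):
            running = rewards[t] + gamma * running
            G[t] = running

        for t in range(T):
            delta = G[t] - w @ x_s(states[t])                  # G_t - v_hat(S_t, w)
            w += alpha_w * delta * x_s(states[t])              # Monte Carlo update of the baseline
            probs = pi(states[t])
            grad_ln_pi = x_sa(states[t], actions[t]) - sum(
                probs[b] * x_sa(states[t], b) for b in range(num_actions))
            theta += alpha_theta * (gamma ** t) * delta * grad_ln_pi   # update (11) with gamma^t
    return theta, w
```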

This algorithm has two step sizes, denoted $\alpha^\theta$ and $\alpha^w$. Choosing the step size for the values is relatively easy: in the linear case we have rules of thumb for setting it, such as $\alpha^w = 0.1/\mathbb{E}\bigl[\lVert \nabla \hat{v}(S_t, w)\rVert^2_\mu\bigr]$. It is much less clear how to set the step size for the policy parameters, $\alpha^\theta$, whose best value depends on the range of variation of the rewards and on the policy parameterization.

Actor-Critic Methods

Although REINFORCE-with-baseline learns both a policy and a state-value function, we do not consider it to be an actor-critic method, because its state-value function is used only as a baseline, not as a critic. That is, the state-value function is not used for bootstrapping; it is used only to estimate the value of the current state, $\hat{v}(S_t, w)$, whereas a proper critic also estimates $\hat{v}(S_{t+1}, w)$ as well as $\hat{v}(S_t, w)$. This is a useful distinction, for only through bootstrapping do we introduce bias and an asymptotic dependence on the quality of the function approximation. The bias introduced through bootstrapping and reliance on the state representation is often beneficial because it reduces variance and accelerates learning. REINFORCE with baseline is unbiased and will converge asymptotically to a local optimum, but like all Monte Carlo methods it tends to learn slowly and to be inconvenient to implement online or for continuing problems. With temporal-difference methods we can eliminate these inconveniences, and through multi-step methods we can flexibly choose the degree of bootstrapping. In order to gain these advantages in the case of policy gradient methods, we use actor-critic methods with a bootstrapping critic.

First we consider one-step actor-critic methods, the analog of the TD methods introduced in Chapter 6 such as TD(0), Sarsa(0), and Q-learning. The main appeal of one-step methods is that they are fully online and incremental, yet avoid the complexities of eligibility traces. One-step actor-critic methods replace the full return of REINFORCE with the one-step return as follows:

$$\begin{aligned} \theta_{t+1} &= \theta_t + \alpha\, \nabla \ln \pi(A_t \mid S_t, \theta)\, \bigl[G_{t:t+1} - \hat{v}(S_t, w)\bigr] \\ &= \theta_t + \alpha\, \nabla \ln \pi(A_t \mid S_t, \theta)\, \bigl[R_{t+1} + \gamma\, \hat{v}(S_{t+1}, w) - \hat{v}(S_t, w)\bigr] \\ &= \theta_t + \alpha\, \delta_t\, \nabla \ln \pi(A_t \mid S_t, \theta) \end{aligned} \tag{12}$$

Note that it is now a fully online, incremental algorithm, with states, actions, and rewards processed as they occur and then never revisited.
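
As an illustration, here is a minimal sketch of such a one-step actor-critic, assuming a linear critic updated by semi-gradient TD(0) (a natural pairing, though the critic's learning rule is my assumption here) and the softmax-linear policy from earlier; the environment interface, feature functions, and step sizes are hypothetical placeholders. The variable `I` accumulates the $\gamma^t$ factor used in the discounted episodic case.

```python
import numpy as np

def one_step_actor_critic(env, x_sa, x_s, num_actions, d_theta, d_w,
                          alpha_theta=1e-3, alpha_w=1e-2, gamma=0.99,
                          num_episodes=1000):
    """One-step actor-critic with a linear critic v_hat(s, w) = w^T x_s(s), per equation (12)."""
    theta, w = np.zeros(d_theta), np.zeros(d_w)

    def pi(s):
        prefs = np.array([theta @ x_sa(s, a) for a in range(num_actions)])
        prefs -= prefs.max()
        p = np.exp(prefs)
        return p / p.sum()

    for _ in range(num_episodes):
        s, done = env.reset(), False
        I = 1.0                                          # accumulates gamma^t
        while not done:
            probs = pi(s)
            a = np.random.choice(num_actions, p=probs)
            s_next, r, done = env.step(a)

            # One-step TD error: delta = R + gamma*v_hat(S') - v_hat(S), with v_hat(S') = 0 at terminal
            v_next = 0.0 if done else w @ x_s(s_next)
            delta = r + gamma * v_next - w @ x_s(s)

            w += alpha_w * delta * x_s(s)                # semi-gradient TD(0) critic update
            grad_ln_pi = x_sa(s, a) - sum(
                probs[b] * x_sa(s, b) for b in range(num_actions))
            theta += alpha_theta * I * delta * grad_ln_pi   # actor update, equation (12)
            I *= gamma
            s = s_next
    return theta, w
```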

The generalization to the forward view of $n$-step methods and then to a $\lambda$-return algorithm is straightforward: the one-step return in equation (12) is merely replaced by $G_{t:t+n}$ or $G_t^\lambda$ respectively. The backward view of the $\lambda$-return algorithm is also straightforward, using separate eligibility traces for the actor and critic, each after the patterns in Chapter 12.

Policy Gradient for Continuing Problems

For continuing problems without episode boundaries we need to define performance in terms of the average rate of reward per time step:

$$J(\theta) = r(\pi) = \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h} \mathbb{E}\bigl[R_t \mid S_0,\, A_{0:t-1} \sim \pi\bigr]$$
