Chapter 5: Monte Carlo Methods

1 Introduction

Monte Carlo methods are ways of solving the reinforcement learning problem based on averaging sample returns. They are our first learning methods for estimating value functions and discovering optimal policies.

Assumption:

  • Episodic task: experience is divided into episodes, and all episodes eventually terminate.

Advantages:

  1. Learn optimal behavior directly from interaction with the environment, with no model of the environment’s dynamics.
  2. They can be used with simulation or sample models. For surprisingly many applications it is easy to simulate sample episodes even though it is difficult to construct the kind of explicit model of transition probabilities $p(s',r|s,a)$ required by DP methods.
  3. It is easy and efficient to focus Monte Carlo methods on a small subset of the states.
  4. They may be less harmed by violations of the Markov property. This is because they do not update their value estimates on the basis of the value estimates of successor states. In other words, it is because they do not bootstrap.
  5. Computational expense of estimating the value of a single state is independent of the number of states. This can make Monte Carlo methods particularly attractive when one requires the value of only one or a subset of states.

Disadvantages

  • Incremental only in an episode-by-episode sense: learning happens only at the end of an episode, which can make learning extremely slow.
  • The most important way to address this is probably to incorporate temporal-difference (TD) learning.

Different from DP mainly in two ways:

  1. They operate on sample experience, and thus can be used for direct learning without a model.
  2. They do not bootstrap. That is, they do not update their value estimates on the basis of other value estimates.

Convergence

  • This is easy to see for the case of first-visit MC. In this case each return is an independent, identically distributed estimate of $v_{\pi}(s)$ with finite variance. By the law of large numbers the sequence of averages of these estimates converges to their expected value. Each average is itself an unbiased estimate, and the standard deviation of its error falls as $1/\sqrt{n}$, where $n$ is the number of returns averaged. Every-visit MC is less straightforward, but its estimates also converge quadratically to $v_{\pi}(s)$.
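As a quick illustration of this $1/\sqrt{n}$ behaviour (not from the book), one can average i.i.d. simulated "returns" with a known mean and watch the error of the sample average shrink:

```python
import numpy as np

# Illustrative only: i.i.d. samples with a known mean stand in for the
# first-visit returns G_t, whose expectation is v_pi(s).
rng = np.random.default_rng(0)
true_value = 1.0
returns = rng.normal(loc=true_value, scale=2.0, size=100_000)

for n in (100, 1_000, 10_000, 100_000):
    estimate = returns[:n].mean()
    # The error of the average is on the order of sigma / sqrt(n) = 2 / sqrt(n).
    print(n, abs(estimate - true_value), 2.0 / np.sqrt(n))
```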

Characteristics:

  • Learn from sampled trajectories
  • Because all the action selections are undergoing learning, the problem becomes nonstationary from the point of view of the earlier states. This is handled by GPI (generalized policy iteration).

On-policy version

2 Policy evaluation (Monte Carlo Prediction; on-policy)

2.1 State-value prediction

Monte Carlo methods estimate the value of a state by simply averaging the returns observed after visits to that state.

Consider all the trajectories $TR$ in which the desired state is visited:
$$v_b(s)=\mathbb{E}_b[G_t|S_t=s]$$
$$v_b(s)=\sum_{tr \in TR}P(tr|b)Return(tr)$$

First-visit MC prediction (page 92, 114)

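In the spirit of the book's first-visit MC prediction box, here is a minimal Python sketch. The helper `generate_episode(policy)`, assumed to return a list of `(state, action, reward)` tuples for one terminated episode, and the discount `gamma` are placeholders, not part of the book's pseudocode.

```python
from collections import defaultdict

def first_visit_mc_prediction(policy, generate_episode, num_episodes, gamma=1.0):
    """Estimate V ~ v_pi by averaging the returns that follow first visits."""
    returns_sum = defaultdict(float)   # sum of first-visit returns per state
    returns_count = defaultdict(int)   # number of first visits per state
    V = defaultdict(float)

    for _ in range(num_episodes):
        episode = generate_episode(policy)   # [(state, action, reward), ...]
        states = [s for (s, a, r) in episode]
        G = 0.0
        # Work backwards so G accumulates the return following each time step.
        for t in reversed(range(len(episode))):
            s, _, r = episode[t]
            G = gamma * G + r
            # First-visit check: update only if s does not appear earlier.
            if s not in states[:t]:
                returns_sum[s] += G
                returns_count[s] += 1
                V[s] = returns_sum[s] / returns_count[s]
    return V
```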

2.2 Action-value prediction

Without a model, however, state values alone are not sufficient. One must explicitly estimate the value of each action in order for the values to be useful in suggesting a policy. Thus, one of our primary goals for Monte Carlo methods is to estimate $q_*$.

Consider all the trajectories $TR$ in which the desired state–action pair is visited:
$$q_b(s,a)=\mathbb{E}_b[G_t|S_t=s,A_t=a]$$
$$q_b(s,a)=\sum_{tr \in TR}P(tr|b)Return(tr)$$

The only complication is that many state–action pairs may never be visited (the exploration problem).
There are several ways to address this problem:

  • Exploring starts:
    • Definition: Specify that the episodes start in a state–action pair, and that every pair has a nonzero probability of being selected as the start.
    • Disadvantage: However, when learning directly from actual interaction with an environment, the starting conditions are unlikely to be so helpful.
  • Stochastic (soft) policy:
    • Definition: consider only policies that are stochastic, with a nonzero probability of selecting every action in each state: $\pi(a|s)>0$ for all $s\in \mathcal{S}$ and all $a \in \mathcal{A}(s)$.
    • $\varepsilon$-greedy policy: all nongreedy actions are given the minimal probability $\frac{\varepsilon}{|\mathcal{A}(s)|}$, and the remaining bulk of the probability, $1-\varepsilon+\frac{\varepsilon}{|\mathcal{A}(s)|}$, is given to the greedy action (see the sketch after this list).
    • Disadvantage: It is actually a compromise—it learns action values not for the optimal policy, but for a near-optimal policy that still explores.
  • Off-policy methods:
    • The above two methods are on-policy methods: they evaluate a policy using episodes generated by that same policy. Off-policy methods are covered below.
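A minimal sketch of $\varepsilon$-greedy action selection over one row of a tabular Q (the function name and arguments are illustrative, not from the book):

```python
import numpy as np

def epsilon_greedy_action(Q_s, epsilon, rng=None):
    """Pick an action for one state from its action-value row Q_s (a 1-D array).

    Choosing uniformly at random with probability epsilon and greedily otherwise
    gives every non-greedy action probability epsilon/|A(s)| and the greedy
    action 1 - epsilon + epsilon/|A(s)|, matching the definition above.
    """
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(Q_s)))   # explore: uniform over all actions
    return int(np.argmax(Q_s))               # exploit: greedy action
```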

3 Policy improvement (on-policy)

Policy improvement is done by making the policy greedy with respect to the current value function.
$$\pi(s)=\mathop{argmax}\limits_a q(s,a) \tag{5.1}$$
According to the policy improvement theorem (Section 4.2), this guarantees improvement of the values. The improvement step is actually the same for both on-policy and off-policy methods.
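A one-line tabular version of this improvement step, assuming Q is a dict mapping each state to an array of action values (names are illustrative):

```python
import numpy as np

def greedy_policy(Q):
    """Greedy improvement: map each state to the argmax action of its Q row."""
    return {s: int(np.argmax(q_row)) for s, q_row in Q.items()}
```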

4 Generalized policy iteration (GPI; on-policy)

Like that in DP, in GPI one maintains both an approximate policy and an approximate value function. The value function is repeatedly altered to more closely approximate the value function for the current policy, and the policy is repeatedly improved with respect to the current value function, as suggested by the GPI diagram in the book. These two kinds of changes work against each other to some extent, as each creates a moving target for the other, but together they cause both policy and value function to approach optimality.

For Monte Carlo policy iteration it is natural to alternate between evaluation and improvement on an episode-by-episode basis. After each episode, the observed returns are used for policy evaluation, and then the policy is improved at all the states visited in the episode.

4.1 Monte Carlo control with Exploring Starts

Monte Carlo ES (Exploring Starts) (page 99, 121)
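A hedged Python sketch in the spirit of the Monte Carlo ES box; `generate_episode_es(policy)` is an assumed helper that starts from a randomly chosen state–action pair (the exploring start) and follows `policy` afterwards.

```python
from collections import defaultdict
import numpy as np

def monte_carlo_es(n_actions, generate_episode_es, num_episodes, gamma=1.0):
    """Monte Carlo control with exploring starts (first-visit, sample-average Q)."""
    Q = defaultdict(lambda: np.zeros(n_actions))
    counts = defaultdict(lambda: np.zeros(n_actions))
    policy = {}  # greedy policy; unseen states are handled by the episode generator

    for _ in range(num_episodes):
        # One episode from a random (state, action) start, then following `policy`.
        episode = generate_episode_es(policy)
        pairs = [(s, a) for (s, a, r) in episode]
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if (s, a) not in pairs[:t]:                  # first visit of this pair
                counts[s][a] += 1
                Q[s][a] += (G - Q[s][a]) / counts[s][a]  # incremental sample average
                policy[s] = int(np.argmax(Q[s]))         # greedy policy improvement
    return policy, Q
```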
It is easy to see that Monte Carlo ES cannot converge to any suboptimal policy. If it did, then the value function would eventually converge to the value function for that policy, and that in turn would cause the policy to change. Stability is achieved only when both the policy and the value function are optimal. Convergence to this optimal fixed point seems inevitable as the changes to the action-value function decrease over time, but has not yet been formally proved. In our opinion, this is one of the most fundamental open theoretical questions in reinforcement learning.

4.2 Monte Carlo control with $\varepsilon$-greedy policy

On-policy first-visit MC control for $\varepsilon$-soft policies:
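A compact sketch of this on-policy control loop, again with an assumed `generate_episode(behave)` helper that runs one episode, choosing the action in each state by calling `behave(state)`:

```python
from collections import defaultdict
import numpy as np

def on_policy_mc_control(n_actions, generate_episode, num_episodes,
                         gamma=1.0, epsilon=0.1, rng=None):
    """On-policy first-visit MC control for epsilon-soft policies."""
    rng = rng or np.random.default_rng()
    Q = defaultdict(lambda: np.zeros(n_actions))
    counts = defaultdict(lambda: np.zeros(n_actions))

    def behave(s):  # the policy is always epsilon-greedy with respect to Q
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[s]))

    for _ in range(num_episodes):
        episode = generate_episode(behave)      # [(state, action, reward), ...]
        pairs = [(s, a) for (s, a, r) in episode]
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if (s, a) not in pairs[:t]:                  # first visit
                counts[s][a] += 1
                Q[s][a] += (G - Q[s][a]) / counts[s][a]
    return Q   # the learned policy is epsilon-greedy with respect to Q
```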
Now we only achieve the best policy among the $\varepsilon$-soft policies, but on the other hand, we have eliminated the assumption of exploring starts.

Off-policy version

5 Policy evaluation (Monte Carlo Prediction; off-policy)

5.1 Introduction

  • What is off-policy learning?
    • A more straightforward approach is to use two policies, one that is learned about and that becomes the optimal policy, and one that is more exploratory and is used to generate behavior. The policy being learned about is called the target policy $\pi$, and the policy used to generate behavior is called the behavior policy $b$. In this case we say that learning is from data “off” the target policy, and the overall process is termed off-policy learning.
    • Assumption of coverage: we require that every action taken under $\pi$ is also taken, at least occasionally, under $b$. That is, we require that $\pi(a|s) > 0$ implies $b(a|s) > 0$.
  • Comparison with on-policy learning
    • On-policy methods are generally simpler and are considered first.
    • Off-policy methods require additional concepts and notation, and because the data is due to a different policy, off-policy methods are often of greater variance and are slower to converge.
    • Off-policy methods are more powerful and general. They include on-policy methods as the special case in which the target and behavior policies are the same, and they can often be applied to learn from data generated by a conventional non-learning controller, or from a human expert.

5.2 Importance sampling

5.2.1 Importance sampling ratio

  • Definition:
    Relative probability of their trajectories occurring under the target and behavior policies.
    $$\rho_{t:T-1}=\frac{\prod_{k=t}^{T-1} \pi(A_k|S_k)\,p(S_{k+1}|S_k,A_k)}{\prod_{k=t}^{T-1} b(A_k|S_k)\,p(S_{k+1}|S_k,A_k)}=\prod_{k=t}^{T-1} \frac{\pi(A_k|S_k)}{b(A_k|S_k)}$$
    (A code sketch of computing this ratio from a logged episode follows after this list.)
  • Application:
    $$v_b(s)=\mathbb{E}_b[G_t|S_t=s]$$
    $$v_{\pi}(s)=\mathbb{E}_b[\rho_{t:T-1}G_t|S_t=s]$$
    Hint: $v_b(s)=\sum_{tr \in TR}P(tr|b)Return(tr)$ and $v_{\pi}(s)=\sum_{tr \in TR}P(tr|\pi)Return(tr)$.
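As an illustration (not from the book), a minimal sketch of computing $\rho_{t:T-1}$ from a logged episode, assuming `target_prob(s, a)` and `behavior_prob(s, a)` return $\pi(a|s)$ and $b(a|s)$:

```python
def importance_sampling_ratio(episode, t, target_prob, behavior_prob):
    """Compute rho_{t:T-1} for an episode given as (state, action, reward) tuples.

    The transition probabilities p(s'|s,a) cancel, so only the two policies'
    action probabilities along the trajectory are needed.
    """
    rho = 1.0
    for s, a, _ in episode[t:]:              # steps t, t+1, ..., T-1
        rho *= target_prob(s, a) / behavior_prob(s, a)
    return rho
```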

5.2.2 Ordinary importance sampling

Denote by $\Gamma(s)$ the set of all time steps at which state $s$ is visited (only first visits, in the first-visit method), by $T(t)$ the first time of termination following time $t$, and by $G_t$ the return after $t$ up through $T(t)$.
$$V(s)=\frac{\sum_{t\in \Gamma(s)}\rho_{t:T(t)-1}G_t}{|\Gamma(s)|} \tag{5.5}$$

Incremental form, with $W_i=\rho_{t_i:T(t_i)-1}$:
$$V_{n+1}=V_n+\frac{1}{n}\left(W_nG_{t_n}-V_n\right)$$
where $n$ is the current size of $\Gamma(s)$.
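A minimal sketch of this incremental update for a single state, where `V` is the current estimate built from `n` weighted returns and `W`, `G` are the new ratio and return (names as above):

```python
def ordinary_is_update(V, n, W, G):
    """One incremental ordinary importance-sampling update for a single state."""
    n += 1                      # one more return has been observed for this state
    V += (W * G - V) / n        # average of the weighted returns W_k * G_k
    return V, n
```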

5.2.3 Weighted importance sampling

  • Comparison with ordinary importance sampling (first-visit case):
    Ordinary importance sampling is unbiased, but its estimates can be extreme (its variance may even be unbounded). Weighted importance sampling is biased (though the bias converges to zero), but its estimates are far less extreme.
  • In the every-visit case, both estimators are biased.
  • In practice, the weighted estimator usually has dramatically lower variance and is strongly preferred.

With $W_i=\rho_{t_i:T(t_i)-1}$:
$$V_n=\frac{\sum_{k=1}^{n-1}W_kG_{t_k}}{\sum_{k=1}^{n-1}W_k}, \quad n\geq 2 \tag{5.7}$$
Incremental form:
$$V_{n+1}=V_n+\frac{W_n}{C_n}\left[G_{t_n}-V_n\right], \quad n\geq 1 \tag{5.8}$$
$$C_{n+1}=C_n+W_{n+1}, \qquad C_0=0, \quad V_1 \text{ arbitrary}$$
Off-policy MC prediction (policy evaluation) (page 110, 132)
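A hedged sketch in the spirit of the off-policy MC prediction box, using the incremental weighted-importance-sampling update above; `generate_episode(behavior_prob)`, `target_prob(s, a)`, and `behavior_prob(s, a)` are assumed helpers as before.

```python
from collections import defaultdict
import numpy as np

def off_policy_mc_prediction(n_actions, generate_episode, target_prob,
                             behavior_prob, num_episodes, gamma=1.0):
    """Estimate q_pi from behavior-policy episodes via weighted importance sampling."""
    Q = defaultdict(lambda: np.zeros(n_actions))
    C = defaultdict(lambda: np.zeros(n_actions))   # cumulative sums of the weights

    for _ in range(num_episodes):
        episode = generate_episode(behavior_prob)  # episode under the behavior policy b
        G, W = 0.0, 1.0
        for s, a, r in reversed(episode):          # t = T-1 down to 0
            G = gamma * G + r
            C[s][a] += W
            Q[s][a] += (W / C[s][a]) * (G - Q[s][a])       # weighted-IS update
            W *= target_prob(s, a) / behavior_prob(s, a)   # extend the ratio backwards
            if W == 0.0:                                   # pi never takes this action
                break
    return Q
```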

6 Policy improvement (off-policy)

$$\pi(s)=\mathop{argmax}\limits_a q(s,a) \tag{5.1}$$
According to the policy improvement theorem (Section 4.2), this guarantees improvement of the values. The improvement step is actually the same for both on-policy and off-policy methods.

7 Generalized policy iteration (GPI; off-policy control via importance sampling)
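Off-policy MC control combines the weighted-importance-sampling evaluation of Section 5 with the greedy improvement of Section 6: a soft behavior policy $b$ generates the episodes, while the target policy $\pi$ is kept greedy with respect to $Q$ (see the off-policy Monte Carlo control box in the book). The following is only a hedged sketch of that combination, with the same assumed helpers as before.

```python
from collections import defaultdict
import numpy as np

def off_policy_mc_control(n_actions, generate_episode, behavior_prob,
                          num_episodes, gamma=1.0):
    """Off-policy MC control: greedy target policy, weighted importance sampling."""
    Q = defaultdict(lambda: np.zeros(n_actions))
    C = defaultdict(lambda: np.zeros(n_actions))
    target = {}                                    # greedy target policy pi

    for _ in range(num_episodes):
        episode = generate_episode(behavior_prob)  # episode under the soft policy b
        G, W = 0.0, 1.0
        for s, a, r in reversed(episode):
            G = gamma * G + r
            C[s][a] += W
            Q[s][a] += (W / C[s][a]) * (G - Q[s][a])
            target[s] = int(np.argmax(Q[s]))       # keep pi greedy w.r.t. Q
            if a != target[s]:                     # pi(a|s) = 0, so the ratio is 0
                break
            W /= behavior_prob(s, a)               # pi(a|s) = 1 for the greedy action
    return target, Q
```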
