Chapter 5: Monte Carlo Methods

1 Introduction

Monte Carlo methods are ways of solving the reinforcement learning problem based on averaging sample returns. They are our first learning methods for estimating value functions and discovering optimal policies.

Assumption:

  • Episodic task: experience is divided into episodes, and all episodes eventually terminate.

Advantages:

  1. Learn optimal behavior directly from interaction with the environment, with no model of the environment’s dynamics.
  2. They can be used with simulation or sample models. For surprisingly many applications it is easy to simulate sample episodes even though it is difficult to construct the kind of explicit model of transition probabilities $p(s',r|s,a)$ required by DP methods.
  3. It is easy and efficient to focus Monte Carlo methods on a small subset of the states.
  4. They may be less harmed by violations of the Markov property. This is because they do not update their value estimates on the basis of the value estimates of successor states. In other words, it is because they do not bootstrap.
  5. Computational expense of estimating the value of a single state is independent of the number of states. This can make Monte Carlo methods particularly attractive when one requires the value of only one or a subset of states.

Disadvantages

  • Incremental only in an episode-by-episode sense: learning happens only at the end of an episode, which can make learning extremely slow.
  • The most important way to address this is probably to incorporate temporal-difference (TD) learning.

Different from DP mainly in two ways:

  1. They operate on sample experience, and thus can be used for direct learning without a model.
  2. They do not bootstrap. That is, they do not update their value estimates on the basis of other value estimates.

Convergence

  • This is easy to see for the case of first-visit MC. In this case each return is an independent, identically distributed estimate of $v_{\pi}(s)$ with finite variance. By the law of large numbers the sequence of averages of these estimates converges to their expected value. Each average is itself an unbiased estimate, and the standard deviation of its error falls as $1/\sqrt{n}$, where $n$ is the number of returns averaged. Every-visit MC is less straightforward, but its estimates also converge quadratically to $v_{\pi}(s)$.
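As a quick illustration of this $1/\sqrt{n}$ behaviour (not from the book), one can average i.i.d. simulated "returns" with a known mean and watch the error of the sample average shrink:

```python
import numpy as np

# Illustrative only: i.i.d. samples with a known mean stand in for the
# first-visit returns G_t, whose expectation is v_pi(s).
rng = np.random.default_rng(0)
true_value = 1.0
returns = rng.normal(loc=true_value, scale=2.0, size=100_000)

for n in (100, 1_000, 10_000, 100_000):
    estimate = returns[:n].mean()
    # The error of the average is on the order of sigma / sqrt(n) = 2 / sqrt(n).
    print(n, abs(estimate - true_value), 2.0 / np.sqrt(n))
```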

Characteristics:

  • Learn from sampled trajectories
  • Because all the action selections are undergoing learning, the problem becomes nonstationary from the point of view of the earlier states. This is handled by GPI (generalized policy iteration).

On-policy version

2 Policy evaluation (Monte Carlo Prediction; on-policy)

2.1 State-value prediction

Monte Carlo methods estimate the value of a state by simply averaging the returns observed after visits to that state.

Consider all the trajectories $TR$ in which the desired state is visited:
$$v_b(s)=\mathbb{E}_b[G_t|S_t=s]$$
$$v_b(s)=\sum_{tr \in TR}P(tr|b)Return(tr)$$

First-visit MC prediction (page 92, 114)

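In the spirit of the book's first-visit MC prediction box, here is a minimal Python sketch. The helper `generate_episode(policy)`, assumed to return a list of `(state, action, reward)` tuples for one terminated episode, and the discount `gamma` are placeholders, not part of the book's pseudocode.

```python
from collections import defaultdict

def first_visit_mc_prediction(policy, generate_episode, num_episodes, gamma=1.0):
    """Estimate V ~ v_pi by averaging the returns that follow first visits."""
    returns_sum = defaultdict(float)   # sum of first-visit returns per state
    returns_count = defaultdict(int)   # number of first visits per state
    V = defaultdict(float)

    for _ in range(num_episodes):
        episode = generate_episode(policy)   # [(state, action, reward), ...]
        states = [s for (s, a, r) in episode]
        G = 0.0
        # Work backwards so G accumulates the return following each time step.
        for t in reversed(range(len(episode))):
            s, _, r = episode[t]
            G = gamma * G + r
            # First-visit check: update only if s does not appear earlier.
            if s not in states[:t]:
                returns_sum[s] += G
                returns_count[s] += 1
                V[s] = returns_sum[s] / returns_count[s]
    return V
```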

2.2 Action-value prediction

Without a model, however, state values alone are not sufficient. One must explicitly estimate the value of each action in order for the values to be useful in suggesting a policy. Thus, one of our primary goals for Monte Carlo methods is to estimate $q_*$.

Consider all the trajectories $TR$ in which the desired state–action pair is visited:
$$q_b(s,a)=\mathbb{E}_b[G_t|S_t=s,A_t=a]$$
$$q_b(s,a)=\sum_{tr \in TR}P(tr|b)Return(tr)$$

The only complication is that many state–action pairs may never be visited (the exploration problem).
There are several ways to address this problem:

  • Exploring starts:
    • Definition: Specify that the episodes start in a state–action pair, and that every pair has a nonzero probability of being selected as the start.
    • Disadvantage: However, when learning directly from actual interaction with an environment, the starting conditions are unlikely to be so helpful.
  • Stochastic (soft) policy:
    • Definition: consider only policies that are stochastic, with a nonzero probability of selecting every action in each state: $\pi(a|s)>0$ for all $s\in \mathcal{S}$ and all $a \in \mathcal{A}(s)$.
    • $\varepsilon$-greedy policy: all nongreedy actions are given the minimal probability $\frac{\varepsilon}{|\mathcal{A}(s)|}$, and the remaining bulk of the probability, $1-\varepsilon+\frac{\varepsilon}{|\mathcal{A}(s)|}$, is given to the greedy action (see the sketch after this list).
    • Disadvantage: It is actually a compromise—it learns action values not for the optimal policy, but for a near-optimal policy that still explores.
  • Off-policy methods:
    • The above two methods are on-policy methods: they evaluate a policy using episodes generated by that same policy. Off-policy methods are covered below.
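A minimal sketch of $\varepsilon$-greedy action selection over one row of a tabular Q (the function name and arguments are illustrative, not from the book):

```python
import numpy as np

def epsilon_greedy_action(Q_s, epsilon, rng=None):
    """Pick an action for one state from its action-value row Q_s (a 1-D array).

    Choosing uniformly at random with probability epsilon and greedily otherwise
    gives every non-greedy action probability epsilon/|A(s)| and the greedy
    action 1 - epsilon + epsilon/|A(s)|, matching the definition above.
    """
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(Q_s)))   # explore: uniform over all actions
    return int(np.argmax(Q_s))               # exploit: greedy action
```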

3 Policy improvement (on-policy)

Policy improvement is done by making the policy greedy with respect to the current value function.
$$\pi(s)=\mathop{argmax}\limits_a q(s,a) \tag{5.1}$$
According to the policy improvement theorem (Section 4.2), this guarantees improvement of the values. The improvement step is actually the same for both on-policy and off-policy methods.
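A one-line tabular version of this improvement step, assuming Q is a dict mapping each state to an array of action values (names are illustrative):

```python
import numpy as np

def greedy_policy(Q):
    """Greedy improvement: map each state to the argmax action of its Q row."""
    return {s: int(np.argmax(q_row)) for s, q_row in Q.items()}
```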

4 Generalized policy iteration (GPI; on-policy)

Like that in DP, in GPI one maintains both an approximate policy and an approximate value function. The value function is repeatedly altered to more closely approximate the value function for the current policy, and the policy is repeatedly improved with respect to the current value function, as suggested by the GPI diagram in the book. These two kinds of changes work against each other to some extent, as each creates a moving target for the other, but together they cause both policy and value function to approach optimality.

For Monte Carlo policy iteration it is natural to alternate between evaluation and improvement on an episode-by-episode basis. After each episode, the observed returns are used for policy evaluation, and then the policy is improved at all the states visited in the episode.

4.1 Monte Carlo control with Exploring Starts

Monte Carlo ES (Exploring Starts) (page 99, 121)
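A hedged Python sketch in the spirit of the Monte Carlo ES box; `generate_episode_es(policy)` is an assumed helper that starts from a randomly chosen state–action pair (the exploring start) and follows `policy` afterwards.

```python
from collections import defaultdict
import numpy as np

def monte_carlo_es(n_actions, generate_episode_es, num_episodes, gamma=1.0):
    """Monte Carlo control with exploring starts (first-visit, sample-average Q)."""
    Q = defaultdict(lambda: np.zeros(n_actions))
    counts = defaultdict(lambda: np.zeros(n_actions))
    policy = {}  # greedy policy; unseen states are handled by the episode generator

    for _ in range(num_episodes):
        # One episode from a random (state, action) start, then following `policy`.
        episode = generate_episode_es(policy)
        pairs = [(s, a) for (s, a, r) in episode]
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if (s, a) not in pairs[:t]:                  # first visit of this pair
                counts[s][a] += 1
                Q[s][a] += (G - Q[s][a]) / counts[s][a]  # incremental sample average
                policy[s] = int(np.argmax(Q[s]))         # greedy policy improvement
    return policy, Q
```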
It is easy to see that Monte Carlo ES cannot converge to any suboptimal policy. If it did, then the value function would eventually converge to the value function for that policy, and that in turn would cause the policy to change. Stability is achieved only when both the policy and the value function are optimal. Convergence to this optimal fixed point seems inevitable as the changes to the action-value function decrease over time, but has not yet been formally proved. In our opinion, this is one of the most fundamental open theoretical questions in reinforcement learning.

4.2 Monte Carlo control with $\varepsilon$-greedy policy

On-policy first-visit MC control for $\varepsilon$-soft policies:
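A compact sketch of this on-policy control loop, again with an assumed `generate_episode(behave)` helper that runs one episode, choosing the action in each state by calling `behave(state)`:

```python
from collections import defaultdict
import numpy as np

def on_policy_mc_control(n_actions, generate_episode, num_episodes,
                         gamma=1.0, epsilon=0.1, rng=None):
    """On-policy first-visit MC control for epsilon-soft policies."""
    rng = rng or np.random.default_rng()
    Q = defaultdict(lambda: np.zeros(n_actions))
    counts = defaultdict(lambda: np.zeros(n_actions))

    def behave(s):  # the policy is always epsilon-greedy with respect to Q
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[s]))

    for _ in range(num_episodes):
        episode = generate_episode(behave)      # [(state, action, reward), ...]
        pairs = [(s, a) for (s, a, r) in episode]
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if (s, a) not in pairs[:t]:                  # first visit
                counts[s][a] += 1
                Q[s][a] += (G - Q[s][a]) / counts[s][a]
    return Q   # the learned policy is epsilon-greedy with respect to Q
```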
Now we only achieve the best policy among the $\varepsilon$-soft policies, but on the other hand, we have eliminated the assumption of exploring starts.

Off-policy version

5 Policy evaluation (Monte Carlo Prediction; off-policy)

5.1 Introduction

  • What is off-policy learning?
    • A more straightforward approach is to use two policies, one that is learned about and that becomes the optimal policy, and one that is more exploratory and is used to generate behavior. The policy being learned about is called the target policy $\pi$, and the policy used to generate behavior is called the behavior policy $b$. In this case we say that learning is from data “off” the target policy, and the overall process is termed off-policy learning.
    • Assumption of coverage: we require that every action taken under $\pi$ is also taken, at least occasionally, under $b$. That is, we require that $\pi(a|s) > 0$ implies $b(a|s) > 0$.
  • Comparison with on-policy learning
    • On-policy methods are generally simpler and are considered first.
    • Off-policy methods require additional concepts and notation, and because the data is due to a different policy, off-policy methods are often of greater variance and are slower to converge.
    • Off-policy methods are more powerful and general. They include on-policy methods as the special case in which the target and behavior policies are the same, and they can often be applied to learn from data generated by a conventional non-learning controller, or from a human expert.

5.2 Importance sampling

5.2.1 Importance sampling ratio

  • Definition:
    Relative probability of their trajectories occurring under the target and behavior policies.
    $$\rho_{t:T-1}=\frac{\prod_{k=t}^{T-1} \pi(A_k|S_k)\,p(S_{k+1}|S_k,A_k)}{\prod_{k=t}^{T-1} b(A_k|S_k)\,p(S_{k+1}|S_k,A_k)}=\prod_{k=t}^{T-1} \frac{\pi(A_k|S_k)}{b(A_k|S_k)}$$
    (A code sketch of computing this ratio from a logged episode follows after this list.)
  • Application:
    $$v_b(s)=\mathbb{E}_b[G_t|S_t=s]$$
    $$v_{\pi}(s)=\mathbb{E}_b[\rho_{t:T-1}G_t|S_t=s]$$
    Hint: $v_b(s)=\sum_{tr \in TR}P(tr|b)Return(tr)$ and $v_{\pi}(s)=\sum_{tr \in TR}P(tr|\pi)Return(tr)$.
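As an illustration (not from the book), a minimal sketch of computing $\rho_{t:T-1}$ from a logged episode, assuming `target_prob(s, a)` and `behavior_prob(s, a)` return $\pi(a|s)$ and $b(a|s)$:

```python
def importance_sampling_ratio(episode, t, target_prob, behavior_prob):
    """Compute rho_{t:T-1} for an episode given as (state, action, reward) tuples.

    The transition probabilities p(s'|s,a) cancel, so only the two policies'
    action probabilities along the trajectory are needed.
    """
    rho = 1.0
    for s, a, _ in episode[t:]:              # steps t, t+1, ..., T-1
        rho *= target_prob(s, a) / behavior_prob(s, a)
    return rho
```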

5.2.2 Ordinary importance sampling

Denote by $\Gamma(s)$ the set of all time steps at which state $s$ is visited (only first visits, in the first-visit method), by $T(t)$ the first time of termination following time $t$, and by $G_t$ the return after $t$ up through $T(t)$.
$$V(s)=\frac{\sum_{t\in \Gamma(s)}\rho_{t:T(t)-1}G_t}{|\Gamma(s)|} \tag{5.5}$$

Incremental form, with $W_i=\rho_{t_i:T(t_i)-1}$:
$$V_{n+1}=V_n+\frac{1}{n}\left(W_nG_{t_n}-V_n\right)$$
where $n$ is the current size of $\Gamma(s)$.
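A minimal sketch of this incremental update for a single state, where `V` is the current estimate built from `n` weighted returns and `W`, `G` are the new ratio and return (names as above):

```python
def ordinary_is_update(V, n, W, G):
    """One incremental ordinary importance-sampling update for a single state."""
    n += 1                      # one more return has been observed for this state
    V += (W * G - V) / n        # average of the weighted returns W_k * G_k
    return V, n
```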

5.2.3 Weighted importance sampling

  • Comparison with ordinary importance sampling (first-visit case):
    Ordinary importance sampling is unbiased, but its estimates can be extreme (its variance may even be unbounded). Weighted importance sampling is biased (though the bias converges to zero), but its estimates are far less extreme.
  • In the every-visit case, both estimators are biased.
  • In practice, the weighted estimator usually has dramatically lower variance and is strongly preferred.

With $W_i=\rho_{t_i:T(t_i)-1}$:
$$V_n=\frac{\sum_{k=1}^{n-1}W_kG_{t_k}}{\sum_{k=1}^{n-1}W_k}, \quad n\geq 2 \tag{5.7}$$
Incremental form:
$$V_{n+1}=V_n+\frac{W_n}{C_n}\left[G_{t_n}-V_n\right], \quad n\geq 1 \tag{5.8}$$
$$C_{n+1}=C_n+W_{n+1}, \qquad C_0=0, \quad V_1 \text{ arbitrary}$$
Off-policy MC prediction (policy evaluation) (page 110, 132)
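A hedged sketch in the spirit of the off-policy MC prediction box, using the incremental weighted-importance-sampling update above; `generate_episode(behavior_prob)`, `target_prob(s, a)`, and `behavior_prob(s, a)` are assumed helpers as before.

```python
from collections import defaultdict
import numpy as np

def off_policy_mc_prediction(n_actions, generate_episode, target_prob,
                             behavior_prob, num_episodes, gamma=1.0):
    """Estimate q_pi from behavior-policy episodes via weighted importance sampling."""
    Q = defaultdict(lambda: np.zeros(n_actions))
    C = defaultdict(lambda: np.zeros(n_actions))   # cumulative sums of the weights

    for _ in range(num_episodes):
        episode = generate_episode(behavior_prob)  # episode under the behavior policy b
        G, W = 0.0, 1.0
        for s, a, r in reversed(episode):          # t = T-1 down to 0
            G = gamma * G + r
            C[s][a] += W
            Q[s][a] += (W / C[s][a]) * (G - Q[s][a])       # weighted-IS update
            W *= target_prob(s, a) / behavior_prob(s, a)   # extend the ratio backwards
            if W == 0.0:                                   # pi never takes this action
                break
    return Q
```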

6 Policy improvement (off-policy)

$$\pi(s)=\mathop{argmax}\limits_a q(s,a) \tag{5.1}$$
According to the policy improvement theorem (Section 4.2), this guarantees improvement of the values. The improvement step is actually the same for both on-policy and off-policy methods.

7 Generalized policy iteration (GPI; off-policy control via importance sampling)
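Off-policy MC control combines the weighted-importance-sampling evaluation of Section 5 with the greedy improvement of Section 6: a soft behavior policy $b$ generates the episodes, while the target policy $\pi$ is kept greedy with respect to $Q$ (see the off-policy Monte Carlo control box in the book). The following is only a hedged sketch of that combination, with the same assumed helpers as before.

```python
from collections import defaultdict
import numpy as np

def off_policy_mc_control(n_actions, generate_episode, behavior_prob,
                          num_episodes, gamma=1.0):
    """Off-policy MC control: greedy target policy, weighted importance sampling."""
    Q = defaultdict(lambda: np.zeros(n_actions))
    C = defaultdict(lambda: np.zeros(n_actions))
    target = {}                                    # greedy target policy pi

    for _ in range(num_episodes):
        episode = generate_episode(behavior_prob)  # episode under the soft policy b
        G, W = 0.0, 1.0
        for s, a, r in reversed(episode):
            G = gamma * G + r
            C[s][a] += W
            Q[s][a] += (W / C[s][a]) * (G - Q[s][a])
            target[s] = int(np.argmax(Q[s]))       # keep pi greedy w.r.t. Q
            if a != target[s]:                     # pi(a|s) = 0, so the ratio is 0
                break
            W /= behavior_prob(s, a)               # pi(a|s) = 1 for the greedy action
    return target, Q
```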
