Monte Carlo Methods

Monte Carlo methods require only experience—sample sequences of states, actions, and rewards from actual or simulated interaction with an environment—and therefore require no prior knowledge of the environment's dynamics.

To ensure that well-defined returns are available, here we define Monte Carlo methods only for episodic tasks. Monte Carlo methods can thus be incremental in an episode-by-episode sense, but not in a step-by-step (online) sense.
(An episodic task is one in which, no matter what policy is followed, a terminal state is reached and a return is obtained within a finite amount of time.)

The term “Monte Carlo” is often used more broadly for any estimation method whose operation involves a significant random component. Here we use it specifically for methods based on averaging complete returns (as opposed to methods that learn from partial returns, considered in the next chapter).

As in the DP chapter, first we consider the prediction problem (the computation of $v_\pi$ and $q_\pi$ for a fixed arbitrary policy $\pi$), then policy improvement, and, finally, the control problem and its solution by GPI. Each of these ideas taken from DP is extended to the Monte Carlo case in which only sample experience is available.

1. Monte Carlo Prediction

The idea underlying all Monte Carlo methods is to average the returns observed after visits to a state.
The first-visit MC method averages the returns following the first visit to a state in each episode, while the every-visit MC method averages the returns following every visit; the two are very similar but have slightly different theoretical properties.
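A minimal sketch of first-visit MC prediction, assuming a hypothetical `generate_episode(policy)` helper that returns one complete episode as a list of (state, action, reward) tuples:

```python
from collections import defaultdict

def first_visit_mc_prediction(policy, generate_episode, num_episodes, gamma=1.0):
    """First-visit MC prediction of v_pi by averaging sampled returns."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)

    for _ in range(num_episodes):
        # generate_episode is assumed to return [(state, action, reward), ...]
        episode = generate_episode(policy)
        G = 0.0
        # Work backwards so G accumulates the return from each step onward.
        for t in reversed(range(len(episode))):
            state, _, reward = episode[t]
            G = gamma * G + reward
            # First-visit check: only record G at the earliest visit to state.
            if all(s != state for s, _, _ in episode[:t]):
                returns_sum[state] += G
                returns_count[state] += 1
                V[state] = returns_sum[state] / returns_count[state]
    return V
```

The every-visit variant simply drops the first-visit check and records a return at every occurrence of the state.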
Backup diagrams can also be used to represent Monte Carlo methods.
The difference between DP and MC:
An important fact about Monte Carlo methods is that the estimates for each state are independent. The estimate for one state does not build upon the estimate of any other state, as is the case in DP. In other words, Monte Carlo methods do not bootstrap as we defined it in the previous chapter.

2. Monte Carlo Estimation of Action Values

The Monte Carlo methods for action value estimation are essentially the same as just presented for state values, except now we talk about visits to a state-action pair rather than to a state.

The only complication is that many state-action pairs may never be visited.
One way to ensure that every pair is visited is to specify that episodes start in a state-action pair, and that every pair has a nonzero probability of being selected as the start. We call this the assumption of exploring starts. It is sometimes useful.
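A minimal sketch of first-visit MC estimation of $q_\pi$ under the exploring-starts assumption; `run_episode(policy, s0, a0)` is a hypothetical helper that rolls out one episode starting in state `s0`, taking `a0` as the first action, and following `policy` thereafter:

```python
import random
from collections import defaultdict

def mc_q_prediction_es(policy, states, actions, run_episode, num_episodes, gamma=1.0):
    """First-visit MC estimation of q_pi, using exploring starts so that
    every state-action pair can begin an episode."""
    returns = defaultdict(list)
    Q = defaultdict(float)

    for _ in range(num_episodes):
        # Exploring start: every (state, action) pair has a nonzero
        # probability of being selected as the start of an episode.
        s0, a0 = random.choice(states), random.choice(actions)
        episode = run_episode(policy, s0, a0)  # [(state, action, reward), ...]
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            # Average returns following the first visit to each (s, a) pair.
            if all((x, y) != (s, a) for x, y, _ in episode[:t]):
                returns[(s, a)].append(G)
                Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
    return Q
```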

3. Monte Carlo Control

The overall idea is to proceed according to the same pattern as in the DP chapter, that is, according to the idea of generalized policy iteration (GPI).


To easily obtain a guarantee of convergence for the Monte Carlo method, we made two unlikely assumptions above: one was that the episodes have exploring starts, and the other was that policy evaluation could be done with an infinite number of episodes.
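Giving up the infinite-evaluation assumption and alternating evaluation and improvement episode by episode, while keeping exploring starts, gives Monte Carlo ES. A rough sketch, reusing the hypothetical `run_episode` helper from above (here `policy` is a plain dict mapping each state to a deterministic action):

```python
import random
from collections import defaultdict

def monte_carlo_es(states, actions, run_episode, num_episodes, gamma=1.0):
    """Monte Carlo ES (Exploring Starts): evaluation and greedy improvement
    are interleaved episode by episode, following the GPI pattern."""
    Q = defaultdict(float)
    returns = defaultdict(list)
    policy = {s: random.choice(actions) for s in states}  # arbitrary initial policy

    for _ in range(num_episodes):
        s0, a0 = random.choice(states), random.choice(actions)  # exploring start
        episode = run_episode(policy, s0, a0)
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if all((x, y) != (s, a) for x, y, _ in episode[:t]):
                returns[(s, a)].append(G)
                Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
                # Policy improvement: act greedily w.r.t. the updated Q at s.
                policy[s] = max(actions, key=lambda act: Q[(s, act)])
    return policy, Q
```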

4. Monte Carlo Control without Exploring Starts

The only general way to ensure that all actions are selected infinitely often is for the agent to continue to select them.
There are two approaches to ensuring this, resulting in what we call on-policy methods and off-policy methods.
On-policy methods attempt to evaluate or improve the policy that is used to make decisions, whereas off-policy methods evaluate or improve a policy different from that used to generate the data.

On-policy method:
In on-policy control methods, the policy is generally soft, meaning that $\pi(a \mid s) > 0$ for all $s \in \mathcal{S}$ and all $a \in \mathcal{A}(s)$.
The $\varepsilon$-greedy policies are examples of $\varepsilon$-soft policies, defined as policies for which $\pi(a \mid s) \ge \frac{\varepsilon}{|\mathcal{A}(s)|}$ for all states and actions, for some $\varepsilon > 0$.
That any $\varepsilon$-greedy policy with respect to $q_\pi$ is an improvement over any $\varepsilon$-soft policy $\pi$ is assured by the policy improvement theorem.
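A minimal sketch of $\varepsilon$-greedy action selection, assuming `Q` maps (state, action) pairs to value estimates as in the earlier sketches:

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon=0.1):
    """Sample an action from an epsilon-greedy policy with respect to Q.
    With probability epsilon a uniformly random action is chosen, so every
    action keeps probability at least epsilon / |A(s)|, which makes the
    policy epsilon-soft."""
    if random.random() < epsilon:
        return random.choice(actions)                 # exploratory action
    return max(actions, key=lambda a: Q[(state, a)])  # greedy action
```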

5. Off-policy Prediction via Importance Sampling

The policy being learned about is called the target policy, and the policy used to generate behavior is called the behavior policy. In this case we say that learning is from data “off” the target policy, and the overall process is termed off-policy learning.
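One common formulation is the ordinary importance-sampling ratio: the product of action-probability ratios along (the tail of) an episode, used to weight returns generated by the behavior policy. A minimal sketch, assuming hypothetical `target_pi(a, s)` and `behavior_b(a, s)` functions that return action probabilities:

```python
def importance_sampling_ratio(episode, target_pi, behavior_b, t=0):
    """Product of pi(A_k | S_k) / b(A_k | S_k) over the steps of an episode
    from time t onward. Weighting a return generated by the behavior policy
    with this ratio gives it the expected value it would have under the
    target policy."""
    rho = 1.0
    for s, a, _ in episode[t:]:
        # target_pi(a, s) and behavior_b(a, s) are assumed to return the
        # probability of taking action a in state s under each policy.
        rho *= target_pi(a, s) / behavior_b(a, s)
    return rho
```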
