Monte Carlo Methods

Monte Carlo methods require only experience—sample sequences of states, actions, and rewards from actual or simulated interaction with an environment—and therefore require no prior knowledge of the environment's dynamics.

To ensure that well-defined returns are available, here we define Monte Carlo methods only for episodic tasks. Monte Carlo methods can thus be incremental in an episode-by-episode sense, but not in a step-by-step (online) sense.
(An episodic task is one in which, no matter what policy is followed, a terminal state is reached and a return is obtained within a finite amount of time.)

The term “Monte Carlo” is often used more broadly for any estimation method whose operation involves a significant random component. Here we use it specifically for methods based on averaging complete returns (as opposed to methods that learn from partial returns, considered in the next chapter).

As in the DP chapter, first we consider the prediction problem (the computation of $v_\pi$ and $q_\pi$ for a fixed arbitrary policy $\pi$), then policy improvement, and, finally, the control problem and its solution by GPI. Each of these ideas taken from DP is extended to the Monte Carlo case in which only sample experience is available.

1. Monte Carlo Prediction

The idea underlying all Monte Carlo methods is to average the returns observed after visits to a state.
The first-visit MC method averages the returns following the first visit to a state in each episode, while the every-visit MC method averages the returns following every visit; the two are very similar but have slightly different theoretical properties.
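A minimal sketch of first-visit MC prediction, assuming a hypothetical `generate_episode(policy)` helper that returns one complete episode as a list of (state, action, reward) tuples:

```python
from collections import defaultdict

def first_visit_mc_prediction(policy, generate_episode, num_episodes, gamma=1.0):
    """First-visit MC prediction of v_pi by averaging sampled returns."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)

    for _ in range(num_episodes):
        # generate_episode is assumed to return [(state, action, reward), ...]
        episode = generate_episode(policy)
        G = 0.0
        # Work backwards so G accumulates the return from each step onward.
        for t in reversed(range(len(episode))):
            state, _, reward = episode[t]
            G = gamma * G + reward
            # First-visit check: only record G at the earliest visit to state.
            if all(s != state for s, _, _ in episode[:t]):
                returns_sum[state] += G
                returns_count[state] += 1
                V[state] = returns_sum[state] / returns_count[state]
    return V
```

The every-visit variant simply drops the first-visit check and records a return at every occurrence of the state.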
Backup diagrams can also be used to represent Monte Carlo methods.
The difference between DP and MC:
An important fact about Monte Carlo methods is that the estimates for each state are independent. The estimate for one state does not build upon the estimate of any other state, as is the case in DP. In other words, Monte Carlo methods do not bootstrap as we defined it in the previous chapter.

2. Monte Carlo Estimation of Action Values

The Monte Carlo methods for action value estimation are essentially the same as just presented for state values, except now we talk about visits to a state-action pair rather than to a state.

The only complication is that many state-action pairs may never be visited.
One way to ensure that every pair is visited is to specify that episodes start in a state-action pair, and that every pair has a nonzero probability of being selected as the start. We call this the assumption of exploring starts. It is sometimes useful.
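A minimal sketch of first-visit MC estimation of $q_\pi$ under the exploring-starts assumption; `run_episode(policy, s0, a0)` is a hypothetical helper that rolls out one episode starting in state `s0`, taking `a0` as the first action, and following `policy` thereafter:

```python
import random
from collections import defaultdict

def mc_q_prediction_es(policy, states, actions, run_episode, num_episodes, gamma=1.0):
    """First-visit MC estimation of q_pi, using exploring starts so that
    every state-action pair can begin an episode."""
    returns = defaultdict(list)
    Q = defaultdict(float)

    for _ in range(num_episodes):
        # Exploring start: every (state, action) pair has a nonzero
        # probability of being selected as the start of an episode.
        s0, a0 = random.choice(states), random.choice(actions)
        episode = run_episode(policy, s0, a0)  # [(state, action, reward), ...]
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            # Average returns following the first visit to each (s, a) pair.
            if all((x, y) != (s, a) for x, y, _ in episode[:t]):
                returns[(s, a)].append(G)
                Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
    return Q
```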

3. Monte Carlo Control

The overall idea is to proceed according to the same pattern as in the DP chapter, that is, according to the idea of generalized policy iteration (GPI).


To easily obtain a guarantee of convergence for the Monte Carlo method, we made two unlikely assumptions above: one was that the episodes have exploring starts, and the other was that policy evaluation could be done with an infinite number of episodes.
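Giving up the infinite-evaluation assumption and alternating evaluation and improvement episode by episode, while keeping exploring starts, gives Monte Carlo ES. A rough sketch, reusing the hypothetical `run_episode` helper from above (here `policy` is a plain dict mapping each state to a deterministic action):

```python
import random
from collections import defaultdict

def monte_carlo_es(states, actions, run_episode, num_episodes, gamma=1.0):
    """Monte Carlo ES (Exploring Starts): evaluation and greedy improvement
    are interleaved episode by episode, following the GPI pattern."""
    Q = defaultdict(float)
    returns = defaultdict(list)
    policy = {s: random.choice(actions) for s in states}  # arbitrary initial policy

    for _ in range(num_episodes):
        s0, a0 = random.choice(states), random.choice(actions)  # exploring start
        episode = run_episode(policy, s0, a0)
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if all((x, y) != (s, a) for x, y, _ in episode[:t]):
                returns[(s, a)].append(G)
                Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
                # Policy improvement: act greedily w.r.t. the updated Q at s.
                policy[s] = max(actions, key=lambda act: Q[(s, act)])
    return policy, Q
```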

4. Monte Carlo Control without Exploring Starts

The only general way to ensure that all actions are selected infinitely often is for the agent to continue to select them.
There are two approaches to ensuring this, resulting in what we call on-policy methods and off-policy methods.
On-policy methods attempt to evaluate or improve the policy that is used to make decisions, whereas off-policy methods evaluate or improve a policy different from that used to generate the data.

On-policy method:
In on-policy control methods, the policy is generally soft, meaning that $\pi(a \mid s) > 0$ for all $s \in \mathcal{S}$ and all $a \in \mathcal{A}(s)$.
The $\varepsilon$-greedy policies are examples of $\varepsilon$-soft policies, defined as policies for which $\pi(a \mid s) \ge \frac{\varepsilon}{|\mathcal{A}(s)|}$ for all states and actions, for some $\varepsilon > 0$.
That any $\varepsilon$-greedy policy with respect to $q_\pi$ is an improvement over any $\varepsilon$-soft policy $\pi$ is assured by the policy improvement theorem.
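A minimal sketch of $\varepsilon$-greedy action selection, assuming `Q` maps (state, action) pairs to value estimates as in the earlier sketches:

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon=0.1):
    """Sample an action from an epsilon-greedy policy with respect to Q.
    With probability epsilon a uniformly random action is chosen, so every
    action keeps probability at least epsilon / |A(s)|, which makes the
    policy epsilon-soft."""
    if random.random() < epsilon:
        return random.choice(actions)                 # exploratory action
    return max(actions, key=lambda a: Q[(state, a)])  # greedy action
```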

5. Off-policy Prediction via Importance Sampling

The policy being learned about is called the target policy, and the policy used to generate behavior is called the behavior policy. In this case we say that learning is from data “off” the target policy, and the overall process is termed off-policy learning.
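One common formulation is the ordinary importance-sampling ratio: the product of action-probability ratios along (the tail of) an episode, used to weight returns generated by the behavior policy. A minimal sketch, assuming hypothetical `target_pi(a, s)` and `behavior_b(a, s)` functions that return action probabilities:

```python
def importance_sampling_ratio(episode, target_pi, behavior_b, t=0):
    """Product of pi(A_k | S_k) / b(A_k | S_k) over the steps of an episode
    from time t onward. Weighting a return generated by the behavior policy
    with this ratio gives it the expected value it would have under the
    target policy."""
    rho = 1.0
    for s, a, _ in episode[t:]:
        # target_pi(a, s) and behavior_b(a, s) are assumed to return the
        # probability of taking action a in state s under each policy.
        rho *= target_pi(a, s) / behavior_b(a, s)
    return rho
```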
