
Monte Carlo Methods

Deep Reinforcement Learning Explained — 13

In this new post of the Deep Reinforcement Learning Explained series, we will introduce Monte Carlo Methods, another of the classical methods of Reinforcement Learning, alongside Dynamic Programming, which was introduced in the first part of this series, and Temporal Difference Learning, which we will introduce in the following post. In this post we will also introduce how to estimate the optimal policy and the Exploration-Exploitation Dilemma.

Monte Carlo versus Dynamic Programming

In Part 1 of this series, we presented a solution to MDPs called Dynamic Programming, pioneered by Richard Bellman. Remember that the Bellman equation allows us to define the value function recursively, and that it can be solved with the Value Iteration algorithm. To summarize, Dynamic Programming provides a foundation for reinforcement learning, but we need to loop through all the states on every iteration (and the state space can grow exponentially in size, becoming very large or even infinite). Dynamic Programming also requires a model of the Environment, specifically knowing the state-transition probability p(s′,r|s,a).

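To make the contrast concrete, here is a minimal Value Iteration sketch in Python over a made-up two-state model; the dictionary P stands in for the known dynamics p(s′,r|s,a) that Dynamic Programming assumes, and is not code from this series.

    import numpy as np

    # Toy model of the Environment: P[s][a] is a list of
    # (probability, next_state, reward, done) tuples, i.e. the known
    # dynamics p(s', r | s, a) that Dynamic Programming requires.
    P = {
        0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, True)]},
        1: {0: [(1.0, 1, 0.0, True)], 1: [(1.0, 1, 0.0, True)]},
    }

    def value_iteration(P, gamma=0.9, theta=1e-8):
        V = np.zeros(len(P))
        while True:
            delta = 0.0
            for s in P:
                # Bellman optimality backup: loop over every state on every sweep.
                q = [sum(p * (r + gamma * V[s2] * (not done))
                         for p, s2, r, done in P[s][a]) for a in P[s]]
                best = max(q)
                delta = max(delta, abs(best - V[s]))
                V[s] = best
            if delta < theta:
                return V

    print(value_iteration(P))   # state values of the toy two-state model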

In contrast, Monte Carlo methods are all about learning from experience. Any expected value can be approximated by sample means: in other words, all we need to do is play a bunch of episodes, gather the returns, and average them. Monte Carlo methods are actually a set of alternatives to the basic algorithm. They are defined only for episodic tasks, where the interaction stops when the Agent encounters a terminal state. That is, we assume experience is divided into episodes, and that all episodes eventually terminate no matter what actions are selected.

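As a minimal illustration of approximating an expected value by a sample mean (a generic sketch, not code from the series), consider estimating the expectation of a die roll purely from samples:

    import random

    # Estimate E[X] for a fair six-sided die purely from sampled outcomes,
    # the same way Monte Carlo methods estimate expected returns from
    # sampled episodes.
    samples = [random.randint(1, 6) for _ in range(100_000)]
    estimate = sum(samples) / len(samples)
    print(estimate)   # close to the true expected value of 3.5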

It’s important to note that Monte Carlo methods only give us a value for states and actions we’ve encountered, and if we never encounter a state its value is unknown.

Monte Carlo Methods

This post will provide a practical approach to Monte Carlo used in Reinforcement Learning. For a more formal explanation of the methods, I invite the reader to read Chapter 5 of the textbook Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto.

Recall that the optimal policy π∗ specifies, for each Environment state s, how the Agent should select an action a towards its goal of maximising the return G. We also learned that the Agent could structure its search for an optimal policy by first estimating the optimal action-value function q∗; then, once q∗ is known, π∗ is quickly obtained.

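In code, obtaining π∗ from q∗ boils down to a greedy arg-max over the Q-table; a sketch assuming the table is stored as a 2-D array with one row per state and one column per action:

    import numpy as np

    def greedy_policy(Q):
        # pi*(s) = argmax_a q*(s, a): for each state (row of the Q-table),
        # pick the action with the highest estimated action value.
        return np.argmax(Q, axis=1)

    Q = np.array([[0.1, 0.7],    # toy Q-table: 3 states x 2 actions
                  [0.4, 0.2],
                  [0.0, 0.0]])
    print(greedy_policy(Q))      # -> [1 0 0]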

The Agent starts by following a basic policy, such as the equiprobable random policy: a stochastic policy in which, from each state, the Agent selects randomly from the set of available actions, and each action is selected with equal probability. The Agent uses it to collect some episodes, and then consolidates the results to arrive at a better policy.

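A sketch of rolling out one episode with the equiprobable random policy, assuming a Gym-style environment with a discrete action space and the classic reset()/step() API (illustrative code, not the series' own implementation):

    import random

    def generate_episode(env):
        # One roll-out with the equiprobable random policy: every available
        # action is chosen with the same probability.
        # Assumes the classic Gym API: reset() -> state, step() -> 4-tuple.
        episode = []
        state = env.reset()
        done = False
        while not done:
            action = random.randrange(env.action_space.n)
            next_state, reward, done, _ = env.step(action)
            episode.append((state, action, reward))
            state = next_state
        return episode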

The way to do this is to estimate the action-value function with a table that we will call the Q-table. This core table in Monte Carlo methods has a row for each state and a column for each action. The entry corresponding to state s and action a is denoted Q(s,a).

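One convenient way to hold such a Q-table (a sketch, assuming discrete actions and hashable states; the action count here is arbitrary):

    import numpy as np
    from collections import defaultdict

    # Q-table: conceptually one row per state and one column per action.
    # A defaultdict keyed by state is handy when states are tuples rather
    # than plain integers.
    n_actions = 2
    Q = defaultdict(lambda: np.zeros(n_actions))

    state = (14, 7, False)        # any hashable state representation works
    Q[state][1] = 0.5             # set the entry Q(s, a) for action a = 1
    print(Q[state])               # -> [0.  0.5]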

We refer to this as the prediction problem: given a policy, how might the Agent estimate the value function for that policy? We refer to Monte Carlo (MC) approaches to the prediction problem as MC prediction methods.

We will focus our explanation on the action-value function, but “prediction problem” also refers to approaches that can be used to estimate the state-value function.

In the algorithm for MC prediction, we begin by collecting many episodes with the policy. Then, we note that each entry in the Q-table corresponds to a particular state and action. To populate an entry of the Q-table, we use the return that followed when the Agent was in that state and chose the action.

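The return that followed each time step can be computed by sweeping the episode backwards; a minimal sketch, assuming the episode is stored as (state, action, reward) tuples and a discount factor gamma:

    def returns_from_episode(episode, gamma=1.0):
        # episode is a list of (state, action, reward) tuples.
        # Walking backwards gives G_t = r_{t+1} + gamma * G_{t+1}.
        G = 0.0
        returns = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            _, _, reward = episode[t]
            G = reward + gamma * G
            returns[t] = G
        return returns

    print(returns_from_episode([("s0", 0, 0.0), ("s1", 1, 0.0), ("s2", 0, 1.0)]))
    # -> [1.0, 1.0, 1.0] with the default gamma = 1.0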

We define every occurrence of a state-action pair in an episode as a visit to that state-action pair. A state-action pair can be visited more than once within a single episode. This leads us to two versions of the MC prediction algorithm:

  • Every-visit MC Prediction: Average the returns following all visits to each state-action pair, in all episodes.

  • First-visit MC Prediction: For each episode, we only consider the first visit to each state-action pair, and average the returns following those first visits.

Both the first-visit and every-visit methods are guaranteed to converge to the true action-value function as the number of visits to each state-action pair approaches infinity.

In this post, we will implement the first-visit MC prediction method for our working example.
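
A compact sketch of first-visit MC prediction, reusing the hypothetical helpers from the earlier sketches (generate_episode and the backwards return computation); it illustrates the idea rather than reproducing the series' exact code:

    import numpy as np
    from collections import defaultdict

    def mc_prediction_first_visit(env, num_episodes, gamma=1.0):
        # Running averages of the returns observed after the *first* visit
        # to each (state, action) pair across many episodes.
        returns_sum = defaultdict(float)
        returns_count = defaultdict(int)
        Q = defaultdict(lambda: np.zeros(env.action_space.n))

        for _ in range(num_episodes):
            # Roll out one episode with the equiprobable random policy
            # (generate_episode is the hypothetical helper sketched above).
            episode = generate_episode(env)

            # Return G_t that followed each time step, computed backwards.
            G = 0.0
            returns = [0.0] * len(episode)
            for t in reversed(range(len(episode))):
                G = episode[t][2] + gamma * G
                returns[t] = G

            # Only the first visit to each (state, action) pair contributes.
            seen = set()
            for t, (state, action, _) in enumerate(episode):
                if (state, action) in seen:
                    continue
                seen.add((state, action))
                returns_sum[(state, action)] += returns[t]
                returns_count[(state, action)] += 1
                Q[state][action] = (returns_sum[(state, action)]
                                    / returns_count[(state, action)])

        return Q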
