【SuttonBartoIPRLBook2ndEd】 【Tabular Solution Methods】

Part I
Tabular Solution Methods

In this part of the book we describe almost all the core ideas of reinforcement
learning algorithms in their simplest forms: that in which the state and
action spaces are small enough for the approximate action-value function to be
represented as an array, or table. In this case, the methods can often find exact
solutions, that is, they can often find exactly the optimal value function and
the optimal policy. This contrasts with the approximate methods described in
the next part of the book, which only find approximate solutions, but which
in return can be applied effectively to much larger problems.

The first chapter of this part of the book describes solution methods for
the special case of the reinforcement learning problem in which there is only a
single state, called bandit problems. The second chapter describes the general
problem formulation that we treat throughout the rest of the book, finite
Markov decision processes (finite MDPs), and its main ideas including Bellman
equations and value functions.

The next three chapters describe three fundamental classes of methods
for solving finite Markov decision problems: dynamic programming, Monte
Carlo methods, and temporal-difference learning. Each class of methods has
its strengths and weaknesses. Dynamic programming methods are well developed
mathematically, but require a complete and accurate model of the
environment. Monte Carlo methods don't require a model and are conceptually
simple, but are not suited for step-by-step incremental computation.
Finally, temporal-difference methods require no model and are fully incremental,
but are more complex to analyze. The methods also differ in several ways
with respect to their efficiency and speed of convergence.

The remaining two chapters describe how these three classes of methods
can be combined to obtain the best features of each of them. In one chapter we
describe how the strengths of Monte Carlo methods can be combined with the
strengths of temporal-difference methods via the use of eligibility traces. In
the final chapter of this part of the book we show how these two learning methods
can be combined with model learning and planning methods (such as dynamic
programming) for a complete and unified solution to the tabular reinforcement
learning problem.

Chapter 2
Multi-arm Bandits

The most important feature distinguishing reinforcement learning from other
types of learning is that it uses training information that evaluates the actions
taken rather than instructs by giving correct actions. This is what creates
the need for active exploration, for an explicit trial-and-error search for good
behavior. Purely evaluative feedback indicates how good the action taken is,
but not whether it is the best or the worst action possible. Evaluative feedback
is the basis of methods for function optimization, including evolutionary
methods. Purely instructive feedback, on the other hand, indicates the correct
action to take, independently of the action actually taken. This kind
of feedback is the basis of supervised learning, which includes large parts of
pattern classification, artificial neural networks, and system identification. In
their pure forms, these two kinds of feedback are quite distinct: evaluative
feedback depends entirely on the action taken, whereas instructive feedback is
independent of the action taken. There are also interesting intermediate cases
in which evaluation and instruction blend together.


In this chapter we study the evaluative aspect of reinforcement learning in
a simplified setting, one that does not involve learning to act in more than
one situation. This nonassociative setting is the one in which most prior
work involving evaluative feedback has been done, and it avoids much of the
complexity of the full reinforcement learning problem. Studying this case will
enable us to see most clearly how evaluative feedback differs from, and yet can
be combined with, instructive feedback.


The particular nonassociative, evaluative feedback problem that we explore
is a simple version of the n-armed bandit problem. We use this problem to
introduce a number of basic learning methods which we extend in later chapters
to apply to the full reinforcement learning problem. At the end of this chapter,
we take a step closer to the full reinforcement learning problem by discussing
what happens when the bandit problem becomes associative, that is, when
actions are taken in more than one situation.

2.1 An n-Armed Bandit Problem

Consider the following learning problem. You are faced repeatedly with a
choice among n different options, or actions. After each choice you receive a
numerical reward chosen from a stationary probability distribution that depends
on the action you selected. Your objective is to maximize the expected
total reward over some time period, for example, over 1000 action selections,
or time steps.
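As a concrete picture of this setup, here is a minimal sketch of a single bandit task in Python. The class and its names are ours, not the book's, and the Gaussian reward distributions are an assumption that follows the 10-armed testbed described later in this chapter.

```python
import numpy as np

class BanditTask:
    """One n-armed bandit: each arm has a fixed true value q(a), and each pull
    returns that value plus unit-variance Gaussian noise (a testbed assumption)."""
    def __init__(self, n=10, rng=None):
        self.rng = rng or np.random.default_rng()
        self.q = self.rng.normal(0.0, 1.0, n)   # true action values q(a)

    def pull(self, action):
        # Reward for selecting `action`: a stationary distribution centered on q(action).
        return self.rng.normal(self.q[action], 1.0)
```

The learner only ever sees the rewards returned by `pull`; the array `q` stands for the unknown true values it is trying to discover.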

This is the original form of the n-armed bandit problem, so named by analogy
to a slot machine, or "one-armed bandit," except that it has n levers
instead of one. Each action selection is like a play of one of the slot machine's
levers, and the rewards are the payoffs for hitting the jackpot. Through repeated
action selections you are to maximize your winnings by concentrating
your actions on the best levers. Another analogy is that of a doctor choosing
between experimental treatments for a series of seriously ill patients. Each
action selection is a treatment selection, and each reward is the survival or
well-being of the patient. Today the term "n-armed bandit problem" is sometimes
used for a generalization of the problem described above, but in this
book we use it to refer just to this simple case.

In our n-armed bandit problem, each action has an expected or mean
reward given that that action is selected; let us call this the value of that
action. If you knew the value of each action, then it would be trivial to solve
the n-armed bandit problem: you would always select the action with the highest
value. We assume that you do not know the action values with certainty,
although you may have estimates.
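Stated as an equation, in the notation introduced in the next section (At for the action selected on time step t, Rt for the corresponding reward, q(a) for the true value); this is a standard restatement rather than a quotation of the book's own equation:

$$ q(a) = \mathbb{E}\left[\, R_t \mid A_t = a \,\right] $$

that is, the value of an action is the mean reward received when that action is selected.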

If you maintain estimates of the action values, then at any time step there
is at least one action whose estimated value is greatest. We call this a greedy
action. If you select a greedy action, we say that you are exploiting your
current knowledge of the values of the actions. If instead you select one of
the nongreedy actions, then we say you are exploring, because this enables
you to improve your estimate of the nongreedy action's value. Exploitation is
the right thing to do to maximize the expected reward on the one step, but
exploration may produce the greater total reward in the long run. For example,
suppose the greedy action's value is known with certainty, while several other
actions are estimated to be nearly as good but with substantial uncertainty.
The uncertainty is such that at least one of these other actions probably is
actually better than the greedy action, but you don't know which one. If
you have many time steps ahead on which to make action selections, then it
may be better to explore the nongreedy actions and discover which of them
are better than the greedy action. Reward is lower in the short run, during
exploration, but higher in the long run because after you have discovered the
better actions, you can exploit them many times. Because it is not possible
both to explore and to exploit with any single action selection, one often refers
to the "conflict" between exploration and exploitation.

In any specific case, whether it is better to explore or exploit depends in a
complex way on the precise values of the estimates, uncertainties, and the number
of remaining steps. There are many sophisticated methods for balancing
exploration and exploitation for particular mathematical formulations of the
n-armed bandit and related problems. However, most of these methods make
strong assumptions about stationarity and prior knowledge that are either
violated or impossible to verify in applications and in the full reinforcement
learning problem that we consider in subsequent chapters. The guarantees of
optimality or bounded loss for these methods are of little comfort when the
assumptions of their theory do not apply.

In this book we do not worry about balancing exploration and exploitation
in a sophisticated way; we worry only about balancing them at all. In this
chapter we present several simple balancing methods for the n-armed bandit
problem and show that they work much better than methods that always
exploit. The need to balance exploration and exploitation is a distinctive
challenge that arises in reinforcement learning; the simplicity of the n-armed
bandit problem enables us to show this in a particularly clear form.

2.2 Action-Value Methods

We begin by looking more closely at some simple methods for estimating the
values of actions and for using the estimates to make action selection decisions;
collectively we call these action-value methods. In this chapter, we denote the
true (actual) value of action a as q(a), and the estimated value on the t-th
time step as Qt(a). Recall that the true value of an action is the mean reward
received when that action is selected. One natural way to estimate this is by
averaging the rewards actually received when the action was selected. In other
words, if by the t-th time step action a has been chosen Nt(a) times prior to t,
yielding rewards R1, R2, ..., RNt(a), then its value is estimated to be

    Qt(a) = ( R1 + R2 + ··· + RNt(a) ) / Nt(a).

If Nt(a) = 0, then we define Qt(a) instead as some default value, such as
Q1(a) = 0. As Nt(a) → ∞, by the law of large numbers, Qt(a) converges
to q(a). We call this the sample-average method for estimating action values
because each estimate is a simple average of the sample of relevant rewards.
Of course this is just one way to estimate action values, and not necessarily
the best one. Nevertheless, for now let us stay with this simple estimation
method and turn to the question of how the estimates might be used to select
actions.
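As a concrete illustration, here is a minimal sketch of how Qt(a) and Nt(a) might be maintained (the class and method names are ours, not the book's); the update uses the incremental form of the average, which is the subject of Section 2.3:

```python
import numpy as np

class SampleAverageEstimates:
    """Sample-average action-value estimates: Q[a] is the mean of the rewards
    received so far when action a was selected; N[a] counts those selections."""
    def __init__(self, n_actions, default=0.0):
        self.Q = np.full(n_actions, default)     # Q_t(a); `default` is used while N_t(a) = 0
        self.N = np.zeros(n_actions, dtype=int)  # N_t(a)

    def update(self, action, reward):
        # Incremental form of the sample average: NewMean = OldMean + (R - OldMean) / N.
        self.N[action] += 1
        self.Q[action] += (reward - self.Q[action]) / self.N[action]
```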

The simplest action selection rule is to select the action (or one of the
actions) with the highest estimated action value, that is, to select at step t one
of the greedy actions, At, for which Qt(At) = max_a Qt(a). This greedy action
selection method can be written as

    At = argmax_a Qt(a),

where argmax_a denotes the value of a at which the expression that follows
is maximized (with ties broken arbitrarily). Greedy action selection always
exploits current knowledge to maximize immediate reward; it spends no time
at all sampling apparently inferior actions to see if they might really be better.
A simple alternative is to behave greedily most of the time, but every
once in a while, say with small probability ε, instead select randomly from
among all the actions with equal probability, independently of the action-value
estimates. We call methods using this near-greedy action selection rule
ε-greedy methods. An advantage of these methods is that, in the limit as the
number of plays increases, every action will be sampled an infinite number
of times, guaranteeing that Nt(a) → ∞ for all a, and thus ensuring that all
the Qt(a) converge to q(a). This of course implies that the probability of selecting
the optimal action converges to greater than 1 − ε, that is, to near
certainty. These are just asymptotic guarantees, however, and say little about
the practical effectiveness of the methods.
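A minimal sketch of ε-greedy action selection (the function name and the use of NumPy are ours, not the book's): with probability ε it ignores the estimates and picks uniformly at random; otherwise it picks a greedy action, breaking ties randomly as in the experiments below.

```python
import numpy as np

def epsilon_greedy(Q, epsilon, rng=None):
    """Select an action from the estimates Q (an array of Q_t(a)) epsilon-greedily."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return rng.integers(len(Q))                    # explore: any action, equal probability
    return rng.choice(np.flatnonzero(Q == Q.max()))    # exploit: greedy, random tie-breaking
```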


To roughly assess the relative effectiveness of the greedy and ε-greedy methods,
we compared them numerically on a suite of test problems. This was a
set of 2000 randomly generated n-armed bandit tasks with n = 10. For each
bandit, the action values, q(a), a = 1, ..., 10, were selected according to a
normal (Gaussian) distribution with mean 0 and variance 1. On the t-th time step
with a given bandit, the actual reward Rt was q(At) for that bandit (where
At was the action selected) plus a normally distributed noise term with
mean 0 and variance 1. Averaging over bandits, we can plot the performance
and behavior of various methods as they improve with experience over 1000
steps, as in Figure 2.1. We call this suite of test tasks the 10-armed testbed.
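Putting these pieces together, a rough (and deliberately simple, hence slow) sketch of the testbed experiment might look as follows; the loop structure and names are ours, not the book's code:

```python
import numpy as np

def run_testbed(epsilon, n_tasks=2000, n_arms=10, n_steps=1000, seed=0):
    """Average reward per step of an epsilon-greedy learner with sample-average
    estimates on the 10-armed testbed (greedy is the special case epsilon = 0)."""
    rng = np.random.default_rng(seed)
    avg_reward = np.zeros(n_steps)
    for _ in range(n_tasks):
        q = rng.normal(0.0, 1.0, n_arms)   # true action values for this task
        Q = np.zeros(n_arms)               # sample-average estimates
        N = np.zeros(n_arms, dtype=int)
        for t in range(n_steps):
            if rng.random() < epsilon:
                a = rng.integers(n_arms)                      # explore
            else:
                a = rng.choice(np.flatnonzero(Q == Q.max()))  # exploit, random ties
            r = rng.normal(q[a], 1.0)      # reward: q(a) plus unit-variance noise
            N[a] += 1
            Q[a] += (r - Q[a]) / N[a]      # incremental sample average
            avg_reward[t] += r
    return avg_reward / n_tasks

# e.g. compare run_testbed(0.0), run_testbed(0.1), and run_testbed(0.01), as in Figure 2.1
```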

Figure 2.1: Average performance of ε-greedy action-value methods on the
10-armed testbed. These data are averages over 2000 tasks. All methods
used sample averages as their action-value estimates. The detailed structure
at the beginning of these curves depends on how actions are selected when
multiple actions have the same maximal action value. Here such ties were
broken randomly. An alternative that has a similar effect is to add a very
small amount of randomness to each of the initial action values, so that ties
effectively never happen.

 

Figure 2.1 compares a greedy method with two ε-greedy methods (ε = 0.01
and ε = 0.1), as described above, on the 10-armed testbed. All methods
formed their action-value estimates using the sample-average technique. The
upper graph shows the increase in expected reward with experience. The
greedy method improved slightly faster than the other methods at the very
beginning, but then leveled off at a lower level. It achieved a reward per step of
only about 1, compared with the best possible of about 1.55 on this testbed.
The greedy method performs significantly worse in the long run because it
often gets stuck performing suboptimal actions. The lower graph shows that
the greedy method found the optimal action in only approximately one-third of
the tasks. In the other two-thirds, its initial samples of the optimal action were
disappointing, and it never returned to it. The ε-greedy methods eventually
perform better because they continue to explore and to improve their chances
of recognizing the optimal action. The ε = 0.1 method explores more, and
usually finds the optimal action earlier, but never selects it more than 91%
of the time. The ε = 0.01 method improves more slowly, but eventually
performs better than the ε = 0.1 method on both performance measures. It
is also possible to reduce ε over time to try to get the best of both high and
low values.
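One simple way to realize that last suggestion is a decaying-ε schedule; the particular schedule and constants below are illustrative assumptions, not something prescribed by the book:

```python
def epsilon_at(t, eps_start=0.1, eps_min=0.01, decay=1e-3):
    """Illustrative schedule: explore a lot early (eps_start), then drift toward
    eps_min as the step count t grows, keeping a little exploration forever."""
    return max(eps_min, eps_start / (1.0 + decay * t))
```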

The advantage of ε-greedy over greedy methods depends on the task. For
example, suppose the reward variance had been larger, say 10 instead of 1.
With noisier rewards it takes more exploration to find the optimal action, and
ε-greedy methods should fare even better relative to the greedy method. On
the other hand, if the reward variances were zero, then the greedy method
would know the true value of each action after trying it once. In this case the
greedy method might actually perform best because it would soon find the
optimal action and then never explore. But even in the deterministic case,
there is a large advantage to exploring if we weaken some of the other assumptions.
For example, suppose the bandit task were nonstationary, that is,
that the true values of the actions changed over time. In this case exploration
is needed even in the deterministic case to make sure one of the nongreedy
actions has not changed to become better than the greedy one. As we will see
in the next few chapters, effective nonstationarity is the case most commonly
encountered in reinforcement learning. Even if the underlying task is stationary
and deterministic, the learner faces a set of banditlike decision tasks each
of which changes over time due to the learning process itself. Reinforcement
learning requires a balance between exploration and exploitation.
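For a concrete picture of the nonstationary case, one common construction (our illustration; the random-walk drift and its scale are assumptions, not something specified in this section) lets each true action value take a small independent random-walk step between pulls:

```python
import numpy as np

def drift_values(q, rng, sigma=0.01):
    """Nonstationary bandit: every true value q(a) drifts by independent Gaussian
    noise each step, so the identity of the best arm can change over time."""
    return q + rng.normal(0.0, sigma, size=q.shape)
```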

 

2.3 Incremental Implementation

 
