【SuttonBartoIPRLBook2ndEd】【chapter I】

The idea that we learn by interacting with our environment is probably the first to occur to us when we think about the nature of learning. When an infant plays, waves its arms, or looks about, it has no explicit teacher, but it does have a direct sensorimotor connection to its environment. Exercising this connection produces a wealth of information about cause and effect, about the consequences of actions, and about what to do in order to achieve goals. Throughout our lives, such interactions are undoubtedly a major source of knowledge about our environment and ourselves. Whether we are learning to drive a car or to hold a conversation, we are acutely aware of how our environment responds to what we do, and we seek to influence what happens through our behavior. Learning from interaction is a foundational idea underlying nearly all theories of learning and intelligence.

  1. infant:婴儿
  2. explicit:明确的
  3. sensorimotor:感觉
  4. acutely:敏锐地
  5. influence:影响
  6. nearly:几乎
  7. theorize:理论化
  8. effectiveness:效用
  9. exhaustively:详细地
  10. paradigms 范式

The idea of learning by interacting with our environment is probably the first to occur to us when we think about the nature of learning. When an infant plays, waves its arms, or looks around, it has no explicit teacher, but it does have a direct sensorimotor connection to its environment. Exercising this connection produces a wealth of information about cause and effect, about the consequences of actions, and about what to do in order to achieve goals. Throughout our lives, such interactions are without doubt a major source of knowledge about our environment and about ourselves. Whether we are learning to drive a car or to hold a conversation, we are keenly aware of how the environment responds to what we do, and we seek to influence what happens through our behavior. Learning from interaction is a basic idea underlying nearly all theories of learning and intelligence.

 

In this book we explore a computational approach to learning from interaction. Rather than directly theorizing about how people or animals learn, we explore idealized learning situations and evaluate the effectiveness of various learning methods. That is, we adopt the perspective of an artificial intelligence researcher or engineer. We explore designs for machines that are effective in solving learning problems of scientific or economic interest, evaluating the designs through mathematical analysis or computational experiments. The approach we explore, called reinforcement learning, is much more focused on goal-directed learning from interaction than are other approaches to machine learning.

In this book we explore a computational approach to learning from interaction. Rather than directly building theories of how people or animals learn, we mainly study idealized learning situations and evaluate the effectiveness of different learning methods. In other words, we take the perspective of an artificial intelligence researcher or engineer. We explore designs for machines that are effective at solving learning problems of scientific or economic interest, and we evaluate those designs through mathematical analysis and computational experiments. The approach we explore, called reinforcement learning, is much more focused on goal-directed learning from interaction than other machine learning approaches are.

1.1 Reinforcement Learning


Reinforcement learning is like many topics with names ending in -ing, such as machine learning, planning, and mountaineering, in that it is simultaneously a problem, a class of solution methods that work well on the class of problems, and the field that studies these problems and their solution methods. Reinforcement learning problems involve learning what to do, how to map situations to actions, so as to maximize a numerical reward signal. In an essential way they are closed-loop problems, because the learning system's actions influence its later inputs. Moreover, the learner is not told which actions to take, as in many forms of machine learning, but instead must discover which actions yield the most reward by trying them out. In the most interesting and challenging cases, actions may affect not only the immediate reward but also the next situation and, through that, all subsequent rewards. These three characteristics, being closed-loop in an essential way, not having direct instructions as to what actions to take, and having the consequences of actions, including reward signals, play out over extended time periods, are the three most important distinguishing features of reinforcement learning problems.

  1. simultaneously:同时的
  2. specification:格式
  3. extent;程度
  4. formulation:公式
  5. sensation:感觉
  6. trivializing:轻视
  7. extrapolate:推断
  8. generalize:概括
  9. impractical:不切实际的
  10. uncharted:未知的
  11. territory:领域

Reinforcement learning, like many topics whose names end in -ing, such as machine learning, planning, and mountaineering, is at the same time a problem, a class of solution methods that work well on that class of problems, and the field that studies these problems and their solutions. Reinforcement learning problems involve learning what to do, how to map situations to actions, in order to maximize a numerical reward signal. In an essential way they are closed-loop problems, because the learning system's actions influence its later inputs. Moreover, the learner is not told which actions to take, as in many forms of machine learning, but must instead discover which actions yield the most reward by trying them. In the most interesting and challenging cases, actions affect not only the immediate reward but also the next situation and, through it, all subsequent rewards. These three characteristics, an essentially closed-loop structure, no direct instructions about which actions to take, and consequences of actions (including reward signals) that play out over extended time periods, are the most important distinguishing features of reinforcement learning problems.

A full specification of reinforcement learning problems in terms of optimal control of Markov decision processes must wait until Chapter 3, but the basic idea is simply to capture the most important aspects of the real problem facing a learning agent interacting with its environment to achieve a goal. Clearly, such an agent must be able to sense the state of the environment to some extent and must be able to take actions that affect the state. The agent also must have a goal or goals relating to the state of the environment. The formulation is intended to include just these three aspects, sensation, action, and goal, in their simplest possible forms without trivializing any of them.

We formalize the reinforcement learning problem using ideas from dynamical systems theory, specifically as the optimal control of an incompletely known Markov decision process (MDP). The details of this formalization must wait until Chapter 3, but the basic idea is simply to capture the most important aspects of the real problem of a learning agent interacting with its environment over time to achieve a goal. The agent must be able to sense the state of the environment to some extent and must be able to take actions that affect that state. The agent must also have one or more goals relating to the state of the environment. The Markov decision process is meant to include just these three aspects, sensation, action, and goal, in their simplest possible forms without trivializing any of them. Any method that is well suited to solving such problems we consider to be a reinforcement learning method.

Any method that is well suited to solving this kind of problem we consider to be a reinforcement learning method. Reinforcement learning is different from supervised learning, the kind of learning studied in most current research in the field of machine learning. Supervised learning is learning from a training set of labeled examples provided by a knowledgeable external supervisor. Each example is a description of a situation together with a specification, the label, of the correct action the system should take in that situation, which is often to identify a category to which the situation belongs. The object of this kind of learning is for the system to extrapolate, or generalize, its responses so that it acts correctly in situations not present in the training set. This is an important kind of learning, but alone it is not adequate for learning from interaction. In interactive problems it is often impractical to obtain examples of desired behavior that are both correct and representative of all the situations in which the agent has to act. In uncharted territory, where one would expect learning to be most beneficial, an agent must be able to learn from its own experience.

Any method well suited to solving this kind of problem we consider to be a reinforcement learning method. Reinforcement learning differs from supervised learning, the kind of learning studied in most current machine learning research. Supervised learning learns from a training set of labeled examples provided by a knowledgeable external supervisor; each example is a description of a situation together with a specification (the label) of the correct action the system should take in that situation, often the category to which the situation belongs. The goal of this kind of learning is for the system to extrapolate, or generalize, its responses so that it responds correctly in situations not present in the training set. This is an important kind of learning, but on its own it is not enough for learning from interaction. In interactive problems it is impractical to obtain examples of desired behavior that are both correct and representative of all the situations the agent must act in. In uncharted territory, where one would expect learning to be most beneficial, an agent must be able to learn from its own experience.

Reinforcement learning is also different from what machine learning researchers call unsupervised learning, which is typically about finding structure hidden in collections of unlabeled data. The terms supervised learning and unsupervised learning appear to exhaustively classify machine learning paradigms, but they do not. Although one might be tempted to think of reinforcement learning as a kind of unsupervised learning because it does not rely on examples of correct behavior, reinforcement learning is trying to maximize a reward signal instead of trying to find hidden structure. Uncovering structure in an agent's experience can certainly be useful in reinforcement learning, but by itself does not address the reinforcement learning agent's problem of maximizing a reward signal. We therefore consider reinforcement learning to be a third machine learning paradigm, alongside supervised learning, unsupervised learning, and perhaps other paradigms as well.

Reinforcement learning is also different from what machine learning researchers call unsupervised learning, which is typically about finding structure hidden in collections of unlabeled data. The terms supervised learning and unsupervised learning appear to classify machine learning paradigms exhaustively, but they do not. Although one might be tempted to regard reinforcement learning as a kind of unsupervised learning because it does not rely on examples of correct behavior, reinforcement learning is trying to maximize a reward signal rather than to find hidden structure. Uncovering structure in an agent's experience can certainly be useful in reinforcement learning, but by itself it does not address the reinforcement learning agent's problem of maximizing a reward signal. We therefore consider reinforcement learning to be a third machine learning paradigm, alongside supervised learning, unsupervised learning, and perhaps other paradigms as well.

  1. trade-off:交易
  2. exploration and exploitation:探索和开发
  3. pursued 追求的
  4. exclusively 专属的
  5. progressively 逐步地
  6. stochastic 随机的
  7. estimate 估计
  8. intensively 集中地
  9. purist 最纯粹的

One of the challenges that arise in reinforcement learning, and not in other kinds of learning, is the trade-off between exploration and exploitation. To obtain a lot of reward, a reinforcement learning agent must prefer actions that it has tried in the past and found to be effective in producing reward. But to discover such actions, it has to try actions that it has not selected before. The agent has to exploit what it already knows in order to obtain reward, but it also has to explore in order to make better action selections in the future. The dilemma is that neither exploration nor exploitation can be pursued exclusively without failing at the task. The agent must try a variety of actions and progressively favor those that appear to be best. On a stochastic task, each action must be tried many times to gain a reliable estimate of its expected reward. The exploration-exploitation dilemma has been intensively studied by mathematicians for many decades (see Chapter 2). For now, we simply note that the entire issue of balancing exploration and exploitation does not even arise in supervised and unsupervised learning, at least in their purest forms.

One challenge that arises in reinforcement learning, and not in other kinds of learning, is the trade-off between exploration and exploitation. To obtain a lot of reward, a reinforcement learning agent must prefer actions that it has tried before and found to produce high reward. But to discover such actions, it has to try actions it has not selected before. The agent must exploit what it already knows in order to obtain reward, but it must also explore in order to make better action selections in the future. The dilemma is that neither exploration nor exploitation can be pursued exclusively without failing at the task. The agent must try a variety of actions and progressively favor those that appear best. On a stochastic task, each action must be tried many times to obtain a reliable estimate of its expected reward. The exploration-exploitation dilemma has been studied intensively by mathematicians for many decades, yet it remains unresolved. For now, we simply note that the whole issue of balancing exploration and exploitation does not even arise in supervised and unsupervised learning, at least in their purest forms.
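
As a concrete illustration of the trade-off just described, here is a minimal sketch (our own, not from the book) of epsilon-greedy action selection in a simple bandit-style setting; the names `epsilon_greedy`, `update_estimate`, `values`, and `counts` are illustrative assumptions rather than anything defined by the text.

```python
import random

def epsilon_greedy(values, epsilon=0.1):
    """Return an action index: usually the greedy one, occasionally a random one.

    values  -- list of current estimates of each action's expected reward
    epsilon -- probability of taking an exploratory (random) action
    """
    if random.random() < epsilon:
        return random.randrange(len(values))                      # explore
    return max(range(len(values)), key=lambda a: values[a])       # exploit

def update_estimate(values, counts, action, reward):
    """Incrementally average the rewards observed so far for one action."""
    counts[action] += 1
    values[action] += (reward - values[action]) / counts[action]
```

Run in a loop against some source of rewards, this keeps refining each action's estimate while still sampling every action occasionally; Chapter 2 studies this balance in depth in the bandit setting.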

  1. explicitly 明确地
  2. contrast 对比
  3. decisionmaking 做决定的

Another key feature of reinforcement learning is that it explicitly considers the whole problem of a goal-directed agent interacting with an uncertain environment. This is in contrast with many approaches that consider subproblems without addressing how they might fit into a larger picture. For example, we have mentioned that much of machine learning research is concerned with supervised learning without explicitly specifying how such an ability would finally be useful. Other researchers have developed theories of planning with general goals, but without considering planning's role in real-time decision-making, or the question of where the predictive models necessary for planning would come from. Although these approaches have yielded many useful results, their focus on isolated subproblems is a significant limitation.

Another key feature of reinforcement learning is that it explicitly considers the whole problem of a goal-directed agent interacting with an uncertain environment. This contrasts with many approaches that consider subproblems without addressing how they might fit into a larger picture. For example, we have mentioned that much machine learning research concerns supervised learning without explicitly specifying how such an ability would ultimately be useful. Other researchers have developed theories of planning with general goals, but without considering planning's role in real-time decision-making, or the question of where the predictive models needed for planning would come from. Although these approaches have yielded many useful results, their focus on isolated subproblems is a significant limitation.

  1. tack 钉
  2. despite 尽管
  3. interplay 相互作用

Reinforcement learning takes the opposite tack, starting with a complete, interactive, goal-seeking agent. All reinforcement learning agents have explicit goals, can sense aspects of their environments, and can choose actions to influence their environments. Moreover, it is usually assumed from the beginning that the agent has to operate despite significant uncertainty about the environment it faces. When reinforcement learning involves planning, it has to address the interplay between planning and real-time action selection, as well as the question of how environment models are acquired and improved. When reinforcement learning involves supervised learning, it does so for specific reasons that determine which capabilities are critical and which are not. For learning research to make progress, important subproblems have to be isolated and studied, but they should be subproblems that play clear roles in complete, interactive, goal-seeking agents, even if all the details of the complete agent cannot yet be filled in.

Reinforcement learning takes the opposite tack, starting with a complete, interactive, goal-seeking agent. All reinforcement learning agents have explicit goals, can sense aspects of their environments, and can choose actions that influence their environments. Moreover, it is usually assumed from the start that the agent must operate despite significant uncertainty about the environment it faces. When reinforcement learning involves planning, it has to deal with the interplay between planning and real-time action selection, as well as the question of how environment models are acquired and improved. When reinforcement learning involves supervised learning, it does so for specific reasons that determine which capabilities are critical and which are not. For research on learning to make progress, important subproblems have to be isolated and studied, but they should be subproblems that play clear roles in complete, interactive, goal-seeking agents, even if all the details of the complete agent cannot yet be filled in.

One of the most exciting aspects of modern reinforcement learning is its substantive and fruitful interactions with other engineering and scientific disciplines. Reinforcement learning is part of a decades-long trend within artificial intelligence and machine learning toward greater integration with statistics, optimization, and other mathematical subjects. For example, the ability of some reinforcement learning methods to learn with parameterized approximators addresses the classical "curse of dimensionality" in operations research and control theory. More distinctively, reinforcement learning has also interacted strongly with psychology and neuroscience, with substantial benefits going both ways. Of all the forms of machine learning, reinforcement learning is the closest to the kind of learning that humans and other animals do, and many of the core algorithms of reinforcement learning were originally inspired by biological learning systems. And reinforcement learning has also given back, both through a psychological model of animal learning that better matches some of the empirical data, and through an influential model of parts of the brain's reward system. The body of this book develops the ideas of reinforcement learning that pertain to engineering and artificial intelligence, with connections to psychology and neuroscience summarized in later chapters.

One of the most exciting aspects of modern reinforcement learning is its substantive and fruitful interaction with other engineering and scientific disciplines. Reinforcement learning is part of a decades-long trend within artificial intelligence and machine learning toward greater integration with statistics, optimization, and other mathematical subjects. For example, the ability of some reinforcement learning methods to learn with parameterized approximators addresses the classical "curse of dimensionality" of operations research and control theory. More distinctively, reinforcement learning has also interacted strongly with psychology and neuroscience, with substantial benefits flowing in both directions. Of all the forms of machine learning, reinforcement learning is the closest to the kind of learning that humans and other animals do, and many of its core algorithms were originally inspired by biological learning systems. Reinforcement learning has also given something back, through a psychological model of animal learning that better matches some of the empirical data, and through an influential model of part of the brain's reward system. The main body of this book develops the ideas of reinforcement learning that pertain to engineering and artificial intelligence, with the connections to psychology and neuroscience summarized in later chapters.

  1. substantive 实质性
  2. fruitful 卓有成效的
  3. integration 积分
  4. dimensionality 维度
  5. curse 灾难
  6. approximators 逼近
  7. distinctively 独特地
  8. neuroscience 神经学
  9. substantial 大量的
  10. biological生物
  11. empirical 经验
  12. pertain 属于

Finally, reinforcement learning is also part of a larger trend in artificial intelligence back toward simple general principles. Since the late 1960’s, many artificial intelligence researchers presumed that there are no general principles to be discovered, that intelligence is instead due to the possession of vast numbers of special purpose tricks, procedures, and heuristics. It was sometimes said that if we could just get enough relevant facts into a machine, say one million, or one billion, then it would become intelligent. Methods based on general principles, such as search or learning, were characterized as “weak methods,” whereas those based on specific knowledge were called “strong methods.” This view is still common today, but much less dominant. From our point of view, it was simply premature: too little effort had been put into the search for general principles to conclude that there were none. Modern AI now includes much research looking for general principles of learning, search, and decisionmaking, as well as trying to incorporate vast amounts of domain knowledge. It is not clear how far back the pendulum will swing, but reinforcement learning research is certainly part of the swing back toward simpler and fewer general principles of artificial intelligence.

Finally, reinforcement learning is also part of a larger trend in artificial intelligence back toward simple general principles. Since the late 1960s, many artificial intelligence researchers have presumed that there are no general principles to be discovered, and that intelligence is instead due to the possession of a vast number of special-purpose tricks, procedures, and heuristics. It was sometimes said that if we could just get enough relevant facts into a machine, say one million or one billion, it would become intelligent. Methods based on general principles, such as search or learning, were characterized as "weak methods," whereas those based on specific knowledge were called "strong methods." This view is still common today, but much less dominant. From our point of view it was simply premature: too little effort had been put into the search for general principles to conclude that there were none. Modern AI now includes a great deal of research looking for general principles of learning, search, and decision-making, as well as research trying to incorporate vast amounts of domain knowledge. It is not clear how far back the pendulum will swing, but reinforcement learning research is certainly part of the swing back toward simpler and fewer general principles of artificial intelligence.

  1. presumed 假定
  2. possession 所有权
  3. tricks 技巧
  4. heuristics 启发式
  5. procedures 程序
  6. relevant 相应的
  7. facts 事实
  8. whereas 而
  9. dominant 优势
  10. premature 过早
  11. conclude 得出结论
  12. decisionmaking 做决定
  13. incorporate 包括
  14. pendulum 摆

1.2 Examples

  1. informed 通知
  2. anticipating 期待
  3. intuitive 直观的
  4. desirability 可取
  5. petroleum 石油
  6. refinery 炼油厂
  7. yield 产量
  8. marginal 边缘的
  9. stick 坚持
  10. gazelle 羚羊
  11. calf 犊
  12. apparently 显然地
  13. mundane 平凡
  14. reveal 揭示
  15. interlocking 连锁
  16. cupboard 橱柜
  17. retrieving 检索
  18. tuned 可调整的
  19. bowl 碗
  20. jug 坛子
  21. locomotion 运动
  22. ferry 摆渡
  23. refrigerator 冰箱
  24. cereal 谷类
  25. ultimately 最终地
  26. nourishment 营养
  27. nutritional 营养

A good way to understand reinforcement learning is to consider some of the examples and possible applications that have guided its development.

A good way to understand reinforcement learning is to consider some of the examples and possible applications that have guided its development.

• A master chess player makes a move. The choice is informed both by planning—anticipating possible replies and counterreplies—and by immediate, intuitive judgments of the desirability of particular positions and moves.

A master chess player makes a move. The choice is informed both by planning, anticipating possible replies and counterreplies, and by immediate, intuitive judgments of the desirability of particular positions and moves.

 

• An adaptive controller adjusts parameters of a petroleum refinery’s operation in real time. The controller optimizes the yield/cost/quality trade-off on the basis of specified marginal costs without sticking strictly to the set points originally suggested by engineers.

An adaptive controller adjusts the parameters of a petroleum refinery's operation in real time. The controller optimizes the yield/cost/quality trade-off on the basis of specified marginal costs, without sticking strictly to the set points originally suggested by engineers.

 

• A gazelle calf struggles to its feet minutes after being born. Half an hour later it is running at 20 miles per hour.

A gazelle calf struggles to its feet minutes after being born. Half an hour later it is running at 20 miles per hour.

 

• A mobile robot decides whether it should enter a new room in search of more trash to collect or start trying to find its way back to its battery recharging station. It makes its decision based on the current charge level of its battery and how quickly and easily it has been able to find the recharger in the past.

A mobile robot decides whether to enter a new room in search of more trash to collect or to start finding its way back to its battery recharging station. It makes the decision based on the current charge level of its battery and on how quickly and easily it has been able to find the recharger in the past.

 

• Phil prepares his breakfast. Closely examined, even this apparently mundane activity reveals a complex web of conditional behavior and interlocking goal–subgoal relationships: walking to the cupboard, opening it, selecting a cereal box, then reaching for, grasping, and retrieving the box. Other complex, tuned, interactive sequences of behavior are required to obtain a bowl, spoon, and milk jug. Each step involves a series of eye movements to obtain information and to guide reaching and locomotion. Rapid judgments are continually made about how to carry the objects or whether it is better to ferry some of them to the dining table before obtaining others. Each step is guided by goals, such as grasping a spoon or getting to the refrigerator, and is in service of other goals, such as having the spoon to eat with once the cereal is prepared and ultimately obtaining nourishment. Whether he is aware of it or not, Phil is accessing information about the state of his body that determines his nutritional needs, level of hunger, and food preferences.

Phil prepares his breakfast. Closely examined, even this apparently mundane activity reveals a complex web of conditional behavior and interlocking goal–subgoal relationships: walking to the cupboard, opening it, selecting a cereal box, then reaching for, grasping, and retrieving the box. Other complex, tuned, interactive sequences of behavior are needed to obtain a bowl, a spoon, and a jug of milk. Each step involves a series of eye movements to obtain information and to guide reaching and locomotion. Rapid judgments are continually made about how to carry the objects, or whether it is better to ferry some of them to the dining table before fetching the others. Each step is guided by goals, such as grasping a spoon or getting to the refrigerator, and is in service of other goals, such as having the spoon to eat with once the cereal is prepared and ultimately obtaining nourishment. Whether he is aware of it or not, Phil is accessing information about the state of his body that determines his nutritional needs, his level of hunger, and his food preferences.

 

  1. overlook 俯瞰
  2. permitted 允许
  3. reservoirs 水库
  4. thereby 从而
  5. take into 考虑
  6. foresight 前瞻

These examples share features that are so basic that they are easy to overlook. All involve interaction between an active decision-making agent and its environment, within which the agent seeks to achieve a goal despite uncertainty about its environment. The agent’s actions are permitted to affect the future state of the environment (e.g., the next chess position, the level of reservoirs of the refinery, the robot’s next location and the future charge level of its battery), thereby affecting the options and opportunities available to the agent at later times. Correct choice requires taking into account indirect, delayed consequences of actions, and thus may require foresight or planning.

These examples share features that are so basic they are easy to overlook. All involve interaction between an active decision-making agent and its environment, within which the agent seeks to achieve a goal despite uncertainty about that environment. The agent's actions are permitted to affect the future state of the environment (for example, the next chess position, the level of the refinery's reservoirs, the robot's next location and the future charge level of its battery), thereby affecting the options and opportunities available to the agent at later times. Correct choice requires taking into account the indirect, delayed consequences of actions, and thus may require foresight or planning.
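
The common structure these examples share can be written down as a simple interaction loop. The sketch below is our own illustration, not an interface defined by the book; `env`, `reset()`, `step()`, `policy`, and `learn` are assumed names for whatever plays those roles in a given problem.

```python
def run_episode(env, policy, learn):
    """One episode of the agent-environment interaction loop.

    env    -- object with reset() -> state and step(action) -> (state, reward, done)
    policy -- function mapping the sensed state to an action
    learn  -- callback receiving (state, action, reward, next_state)
    """
    state = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = policy(state)                       # the agent chooses
        next_state, reward, done = env.step(action)  # the environment responds
        learn(state, action, reward, next_state)     # delayed consequences feed back into learning
        total_reward += reward
        state = next_state
    return total_reward
```

Because each action changes the state that later actions will face, good behavior here requires exactly the foresight the paragraph above describes.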

 

  1. appropriately 适当地
  2. overflowing 满溢

At the same time, in all these examples the effects of actions cannot be fully predicted; thus the agent must monitor its environment frequently and react appropriately. For example, Phil must watch the milk he pours into his cereal bowl to keep it from overflowing. All these examples involve goals that are explicit in the sense that the agent can judge progress toward its goal based on what it can sense directly. The chess player knows whether or not he wins, the refinery controller knows how much petroleum is being produced, the mobile robot knows when its batteries run down, and Phil knows whether or not he is enjoying his breakfast.

At the same time, in all of these examples the effects of actions cannot be fully predicted, so the agent must monitor its environment frequently and react appropriately. For example, Phil must watch the milk he pours into his cereal bowl to keep it from overflowing. All of these examples also involve goals that are explicit in the sense that the agent can judge progress toward the goal from what it can sense directly: the chess player knows whether or not he wins, the refinery controller knows how much petroleum is being produced, the mobile robot knows when its batteries run down, and Phil knows whether or not he is enjoying his breakfast.

  1. coincide 重合
  2. necessarily 一定
  3. organism 生物
  4. resides 所在
  5. aspirations 愿望
  6. abstract 抽象的

Neither the agent nor its environment may coincide with what we normally think of as an agent and its environment. An agent is not necessarily an entire robot or organism, and its environment is not necessarily only what is outside of a robot or organism. The example robot's battery is part of the environment of its controlling agent, and Phil's degree of hunger and food preferences are features of the environment of his internal decision-making agent. The state of an agent's environment often includes information about the state of the machine or organism in which the agent resides, and this can include memories and even aspirations. Throughout this book we are being abstract in this way when we talk about agents and their environments.

Neither the agent nor its environment need coincide with what we normally think of as an agent and an environment. An agent is not necessarily a whole robot or organism, and its environment is not necessarily only what lies outside a robot or organism. The example robot's battery is part of the environment of its controlling agent, and Phil's degree of hunger and his food preferences are features of the environment of his internal decision-making agent. The state of an agent's environment often includes information about the state of the machine or organism in which the agent resides, which can include memories and even aspirations. Throughout this book we are abstract in this way when we talk about agents and their environments.

 

  1. refines 提炼
  2. intuition 直觉
  3. streamline 简化

In all of these examples the agent can use its experience to improve its performance over time. The chess player refines the intuition he uses to evaluate positions, thereby improving his play; the gazelle calf improves the efficiency with which it can run; Phil learns to streamline making his breakfast. The knowledge the agent brings to the task at the start, either from previous experience with related tasks or built into it by design or evolution, influences what is useful or easy to learn, but interaction with the environment is essential for adjusting behavior to exploit specific features of the task.

In all of these examples the agent can use its experience to improve its performance over time. The chess player refines the intuition he uses to evaluate positions, thereby improving his play; the gazelle calf improves the efficiency with which it can run; Phil learns to streamline the making of his breakfast. The knowledge the agent brings to the task at the start, whether from previous experience with related tasks or built in by design or evolution, influences what is useful or easy to learn, but interaction with the environment is essential for adjusting behavior to exploit the specific features of the task.

 

  1. mapping 映射
  2. perceived 感知
  3. stimulus 刺激
  4. stimuli 刺激
  5. stochastic 随机的
  6. sufficient 足够的

Section 1.2 Examples

A good way to understand reinforcement learning is to consider some examples and some of the possible applications that have guided its development.

• An experienced chess player makes a move. The choice is informed both by planning (anticipating possible replies and the replies to those replies) and by immediate, intuitive judgments of the desirability of particular positions and moves.

• An adaptive controller adjusts the parameters of a petroleum refinery's operation in real time. The controller optimizes the yield/cost/quality trade-off within the specified consumption rates, without sticking strictly to the set points originally suggested by engineers.

• A gazelle calf struggles to its feet minutes after being born. Half an hour later it is running at 20 miles per hour.

• A mobile robot decides whether to enter a new room in search of more trash or to start finding its way back to its battery-charging station. It decides on the basis of the current charge level of its battery and of how quickly and easily it has been able to find the charger in the past.

• Phil prepares his breakfast. Closely examined, even this ordinary activity reveals a complex web of conditional behavior and interlocking goal–subgoal relationships: walking to the cupboard, opening it, selecting a cereal box, then reaching for, grasping, and retrieving the box. Other complex, tuned, interactive sequences of behavior are needed to obtain a bowl, a spoon, and the milk. Each step involves a series of eye movements to obtain information and to guide reaching and other movements. Judgments are made rapidly: how to carry the objects, or whether it is better to take what is in hand to the table before fetching the rest. Each step is guided by goals, such as grasping a spoon or getting to the refrigerator, and each serves further goals; once the spoon has been fetched and the cereal is ready, the spoon can be used to eat it and ultimately to obtain nourishment. Whether or not he is aware of it, Phil is continually accessing information about the state of his body that determines his nutritional needs, his level of hunger, and his food preferences.

These examples share features so basic that they are easy to overlook. All involve interaction between a decision-making agent and its environment, in which the agent seeks to achieve a particular goal despite uncertainty in the environment. The agent's actions can affect the future state of the environment (for example the next chess position, the oil level in the refinery's reservoirs, the robot's next location and the future charge of its battery), and therefore the actions and opportunities available to the agent later on. Correct choice requires taking the indirect, delayed consequences of actions into account, and thus may require prediction and planning.

At the same time, in all of these examples the effects of actions cannot be fully predicted, so the agent must monitor its environment frequently and react appropriately. For example, Phil must watch how much milk he pours into the bowl to keep it from overflowing. All of these examples involve a goal that is explicit in the sense that the agent can judge how close it is to the goal from what it can sense directly. The chess player knows whether he has won, the refinery controller knows how much oil is being produced, the gazelle calf senses when it falls, the robot can tell whether its battery has run down, and Phil can tell whether he is enjoying his breakfast.

In all of these examples the agent can gradually use its experience to improve its performance. The chess player improves the intuition he uses to evaluate positions, and thereby his play; the gazelle calf improves the efficiency with which it runs; Phil learns to streamline the making of his breakfast. The knowledge the agent brings into the task at the start, whether obtained from previous experience with related tasks, built in deliberately, or acquired through biological evolution, influences what is useful or easy to learn, but interaction with the environment is essential for adjusting behavior to exploit the specific features of the task.

1.3 Elements of Reinforcement Learning

Beyond the agent and the environment, one can identify four main subelements of a reinforcement learning system: a policy, a reward signal, a value function, and, optionally, a model of the environment. A policy defines the learning agent’s way of behaving at a given time. Roughly speaking, a policy is a mapping from perceived states of the environment to actions to be taken when in those states. It corresponds to what in psychology would be called a set of stimulus–response rules or associations (provided that stimuli include those that can come from within the animal). In some cases the policy may be a simple function or lookup table, whereas in others it may involve extensive computation such as a search process. The policy is the core of a reinforcement learning agent in the sense that it alone is sufficient to determine behavior. In general, policies may be stochastic.

Beyond the agent and the environment, one can identify four main subelements of a reinforcement learning system: a policy, a reward signal, a value function, and, optionally, a model of the environment. A policy defines the learning agent's way of behaving at a given time. Roughly speaking, a policy is a mapping from perceived states of the environment to the actions to be taken in those states. It corresponds to what in psychology would be called a set of stimulus–response rules or associations (provided that stimuli include those that can come from within the animal). In some cases the policy may be a simple function or a lookup table, whereas in others it may involve extensive computation such as a search process. The policy is the core of a reinforcement learning agent in the sense that it alone is sufficient to determine behavior. In general, policies may be stochastic.

 

  1. sole 唯一的
  2. objective 目标
  3. analogous 类似
  4. alter 改变
  5. mood 心情

A reward signal defines the goal in a reinforcement learning problem. On each time step, the environment sends to the reinforcement learning agent a single number, a reward. The agent's sole objective is to maximize the total reward it receives over the long run. The reward signal thus defines what are the good and bad events for the agent. In a biological system, we might think of rewards as analogous to the experiences of pleasure or pain. They are the immediate and defining features of the problem faced by the agent. The reward sent to the agent at any time depends on the agent's current action and the current state of the agent's environment. The agent cannot alter the process that does this. The only way the agent can influence the reward signal is through its actions, which can have a direct effect on reward, or an indirect effect through changing the environment's state. In our example above of Phil eating breakfast, the reinforcement learning agent directing his behavior might receive different reward signals when he eats his breakfast depending on how hungry he is, what mood he is in, and other features of his body, which is part of his internal reinforcement learning agent's environment. The reward signal is the primary basis for altering the policy. If an action selected by the policy is followed by low reward, then the policy may be changed to select some other action in that situation in the future. In general, reward signals may be stochastic functions of the state of the environment and the actions taken.

A reward signal defines the goal of a reinforcement learning problem. On each time step the environment sends the reinforcement learning agent a single number, a reward. The agent's sole objective is to maximize the total reward it receives over the long run. The reward signal thus defines which events are good and which are bad for the agent. In a biological system we might think of rewards as analogous to experiences of pleasure or pain. They are the immediate, defining features of the problem the agent faces. The reward sent to the agent at any time depends on the agent's current action and on the current state of the agent's environment, and the agent cannot alter the process that generates it. The only way the agent can influence the reward signal is through its actions, which can affect the reward directly, or indirectly by changing the environment's state. In the example of Phil eating breakfast, the reinforcement learning agent directing his behavior might receive different reward signals depending on how hungry he is, what mood he is in, and other features of his body, all of which are part of his internal reinforcement learning agent's environment. The reward signal is the primary basis for altering the policy: if an action selected by the policy is followed by low reward, the policy may be changed so that some other action is selected in that situation in the future. In general, reward signals may be stochastic functions of the state of the environment and of the actions taken.

 

  1. indicates指示
  2. specifies 指定
  3. accumulate 累积
  4. intrinsic 固有的
  5. desirability 可取
  6. account 账 take into account 考虑在内
  7. reverse 相反
  8. analogy 比喻
  9. refined 精致
  10. formalize 形式化

Whereas the reward signal indicates what is good in an immediate sense, a value function specifies what is good in the long run. Roughly speaking, the value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state. Whereas rewards determine the immediate, intrinsic desirability of environmental states, values indicate the long-term desirability of states after taking into account the states that are likely to follow, and the rewards available in those states. For example, a state might always yield a low immediate reward but still have a high value because it is regularly followed by other states that yield high rewards. Or the reverse could be true. To make a human analogy, rewards are somewhat like pleasure (if high) and pain (if low), whereas values correspond to a more refined and farsighted judgment of how pleased or displeased we are that our environment is in a particular state. Expressed this way, we hope it is clear that value functions formalize a basic and familiar idea.

Whereas the reward signal indicates what is good in an immediate sense, a value function specifies what is good in the long run. Roughly speaking, the value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state. Whereas rewards determine the immediate, intrinsic desirability of environmental states, values indicate the long-term desirability of states after taking into account the states that are likely to follow and the rewards available in those states. For example, a state might always yield a low immediate reward but still have a high value because it is regularly followed by other states that yield high rewards; or the reverse could be true. To make a human analogy, rewards are somewhat like pleasure (if high) and pain (if low), whereas values correspond to a more refined and farsighted judgment of how pleased or displeased we are that the environment is in a particular state. Expressed this way, we hope it is clear that value functions formalize a basic and familiar idea.
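
Stated informally as a formula (this notation is only introduced properly in Chapter 3; here S_t and R_t simply stand for the state and reward at step t), the value of a state is the expected accumulation of future reward starting from that state:

```latex
V(s) \;=\; \mathbb{E}\left[\, R_{t+1} + R_{t+2} + R_{t+3} + \cdots \;\Big|\; S_t = s \,\right]
```

Chapter 3 makes this precise, including how the sum is kept well defined for tasks that go on indefinitely.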

 

  1. Nevertheless 虽然
  2. Make over 结束
  3. Derived 派生的
  4. Quantity 数量
  5. Arguably 按理说

Rewards are in a sense primary, whereas values, as predictions of rewards, are secondary. Without rewards there could be no values, and the only purpose of estimating values is to achieve more reward. Nevertheless, it is values with which we are most concerned when making and evaluating decisions. Action choices are made based on value judgments. We seek actions that bring about states of highest value, not highest reward, because these actions obtain the greatest amount of reward for us over the long run. In decision-making and planning, the derived quantity called value is the one with which we are most concerned. Unfortunately, it is much harder to determine values than it is to determine rewards. Rewards are basically given directly by the environment, but values must be estimated and re-estimated from the sequences of observations an agent makes over its entire lifetime. In fact, the most important component of almost all reinforcement learning algorithms we consider is a method for efficiently estimating values. The central role of value estimation is arguably the most important thing we have learned about reinforcement learning over the last few decades.

Rewards are in a sense primary, whereas values, as predictions of rewards, are secondary. Without rewards there could be no values, and the only purpose of estimating values is to achieve more reward. Nevertheless, it is values with which we are most concerned when making and evaluating decisions. Action choices are made on the basis of value judgments: we seek actions that bring about states of highest value, not highest reward, because these actions obtain the greatest amount of reward for us over the long run. In decision-making and planning, the derived quantity called value is the one we care about most. Unfortunately, it is much harder to determine values than to determine rewards. Rewards are basically given directly by the environment, but values must be estimated and re-estimated from the sequences of observations an agent makes over its entire lifetime. In fact, the most important component of almost all the reinforcement learning algorithms we consider is a method for efficiently estimating values. The central role of value estimation is arguably the most important thing that has been learned about reinforcement learning over the last few decades.

 

  1. mimics 模仿
  2. inferences 推论
  3. behave 表现
  4. simultaneously 同时
  5. spans 跨度
  6. spectrum 光谱
  7. deliberative 审议

The fourth and final element of some reinforcement learning systems is a model of the environment. This is something that mimics the behavior of the environment, or more generally, that allows inferences to be made about how the environment will behave. For example, given a state and action, the model might predict the resultant next state and next reward. Models are used for planning, by which we mean any way of deciding on a course of action by considering possible future situations before they are actually experienced. Methods for solving reinforcement learning problems that use models and planning are called model-based methods, as opposed to simpler model-free methods that are explicitly trial-and-error learners, viewed as almost the opposite of planning. In Chapter 9 we explore reinforcement learning systems that simultaneously learn by trial and error, learn a model of the environment, and use the model for planning. Modern reinforcement learning spans the spectrum from low-level, trial-and-error learning to high-level, deliberative planning.

The fourth and final element of some reinforcement learning systems is a model of the environment. A model is something that mimics the behavior of the environment, or more generally, allows inferences to be made about how the environment will behave. For example, given a state and an action, the model might predict the resulting next state and next reward. Models are used for planning, by which we mean any way of deciding on a course of action by considering possible future situations before they are actually experienced. Methods that use models and planning to solve reinforcement learning problems are called model-based methods, as opposed to the simpler model-free methods, which are explicitly trial-and-error learners and can be viewed as almost the opposite of planning. Later in the book we look at reinforcement learning systems that simultaneously learn by trial and error, learn a model of the environment, and use the model for planning. Modern reinforcement learning spans the spectrum from low-level trial-and-error learning to high-level, deliberative planning.

 

Section 1.3 Elements of Reinforcement Learning

Looking beyond the agent and the environment, we can identify four sub-elements of a reinforcement learning system: a policy, a reward signal, a value function, and, optionally, a model of the environment.

A policy defines the agent's way of making decisions at a given time. Roughly speaking, a policy is a mapping from perceived states of the environment to the actions to be taken in those states. It corresponds to what in psychology would be called a set of stimulus-response rules or associations. In some cases it may be a simple function or a lookup table; in others it may involve extensive computation such as a search process. The policy is the core of a reinforcement learning agent in the sense that the policy alone is sufficient to determine behavior. In general, policies are stochastic, specifying a probability for each action.

The reward signal defines the goal of the reinforcement learning problem. On each time step, the environment sends the reinforcement learning agent a real number called the reward. The agent's sole objective is to maximize its long-run cumulative reward. The reward signal thus defines which events are good for the agent and which are bad. For a biological system, we can think of rewards as analogous to experiences of pleasure or pain. The reward signal is the immediate, defining feature of the problem facing the agent. It is also the basis for changing the policy: if an action selected by the policy brings only low reward, the policy is adjusted so that, in the same situation in the future, some other action is selected. In general, the reward signal is a stochastic function of the state of the environment and the actions the agent takes.

Whereas the reward signal shows only immediate goodness or badness, the value function indicates what is good in the long run. Roughly speaking, the value of a state is the expected cumulative sum of all future rewards starting from that state. Rewards determine only the immediate, intrinsic desirability of environmental states; values indicate the long-term desirability of states, taking into account the states likely to follow and the rewards obtainable from those future states. For example, a state might yield only a low immediate reward but still have a high value if it is regularly followed by states that produce high rewards. To make an analogy with people, rewards are like pleasure (if high) or pain (if low), whereas values correspond to a more refined and farsighted assessment of how pleased or displeased we are when the environment is in a particular state.

In a sense rewards are primary, while values, as predictions of rewards, are secondary. Without rewards there are no values, and the only purpose of estimating values is to obtain more reward. Nevertheless, it is values that we care about most when evaluating and making decisions. Action choices are based on value estimates: we seek actions that lead to states of highest value, not highest reward, because in the long run the former bring the greatest total reward. Unfortunately, it is much harder to determine values than to determine rewards. Fundamentally, rewards are given directly by the environment, whereas values must be estimated and re-estimated, over the agent's whole lifetime, from the sequences of observations it makes. In fact, the most important component of almost all the reinforcement learning algorithms we describe is a method for efficiently estimating values. That value estimation plays a central role is perhaps the most important lesson learned about reinforcement learning over the past six decades.

The fourth and final element of a reinforcement learning system is a model of the environment. A model mimics the response of the environment, or more generally allows inferences about how the environment will respond. For example, given a state and an action, a model can predict the resulting next state and reward. Models are used for planning, by which we mean deciding on a course of action by considering possible future situations before they are actually experienced. Reinforcement learning methods that use models and planning are called model-based methods; in contrast, the simpler model-free methods use trial and error, which can be viewed as the opposite of planning. In Chapter 8 we introduce a reinforcement learning system that simultaneously learns by trial and error, learns a model of the environment, and uses the model for planning. Modern reinforcement learning spans low-level trial-and-error learning as well as high-level, deliberative planning.
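
To make the four elements concrete, here is a toy sketch of our own (not code from the book): the policy is a mapping from states to actions, the reward is a number delivered each step, the value function is a table of long-run estimates, and the optional model records what the environment did. The class name `TabularAgent`, the undiscounted update, and the crude policy-improvement step are all illustrative assumptions.

```python
from collections import defaultdict
import random

class TabularAgent:
    """Toy container for the four elements named in this section (illustrative only)."""

    def __init__(self, actions, step_size=0.1):
        self.actions = list(actions)
        self.policy = {}                  # policy: perceived state -> action to take
        self.value = defaultdict(float)   # value function: state -> estimate of long-run reward
        self.model = {}                   # optional model: (state, action) -> (next state, reward)
        self.step_size = step_size

    def act(self, state):
        # Follow the stored policy where one exists; otherwise pick any action.
        return self.policy.get(state, random.choice(self.actions))

    def observe(self, state, action, reward, next_state):
        # The reward is the immediate feedback; nudge the value estimate toward the
        # reward plus the estimated value of what follows (undiscounted here).
        target = reward + self.value[next_state]
        self.value[state] += self.step_size * (target - self.value[state])
        # Record what the environment did so the model could later be used for planning.
        self.model[(state, action)] = (next_state, reward)
        # Crude policy improvement: in this state, prefer the action whose recorded
        # outcome currently looks best according to the value table.
        known = [(a, self.model[(state, a)]) for a in self.actions if (state, a) in self.model]
        if known:
            self.policy[state] = max(known, key=lambda kv: kv[1][1] + self.value[kv[1][0]])[0]
```

The point of the sketch is only the division of labor: the reward arrives from outside, the value table summarizes its long-run consequences, the policy chooses actions, and the model supports planning.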

 

1.4 Limitations and Scope

Most of the reinforcement learning methods we consider in this book are structured around estimating value functions, but it is not strictly necessary to do this to solve reinforcement learning problems. For example, methods such as genetic algorithms, genetic programming, simulated annealing, and other optimization methods have been used to approach reinforcement learning problems without ever appealing to value functions. These methods evaluate the "lifetime" behavior of many non-learning agents, each using a different policy for interacting with its environment, and select those that are able to obtain the most reward. We call these evolutionary methods because their operation is analogous to the way biological evolution produces organisms with skilled behavior even when they do not learn during their individual lifetimes. If the space of policies is sufficiently small, or can be structured so that good policies are common or easy to find, or if a lot of time is available for the search, then evolutionary methods can be effective. In addition, evolutionary methods have advantages on problems in which the learning agent cannot accurately sense the state of its environment.

Most of the reinforcement learning methods we consider in this book are built around estimating value functions, but doing so is not strictly necessary for solving reinforcement learning problems. For example, methods such as genetic algorithms, genetic programming, simulated annealing, and other optimization methods have been used on reinforcement learning problems without ever appealing to value functions. These methods evaluate the "lifetime" behavior of many non-learning agents, each using a different policy to interact with its environment, and select those that obtain the most reward. We call these evolutionary methods because their operation is analogous to the way biological evolution produces organisms with skilled behavior even though the individuals do not learn during their lifetimes. If the space of policies is sufficiently small, or can be structured so that good policies are common or easy to find, or if a lot of time is available for the search, then evolutionary methods can be effective. Evolutionary methods also have advantages on problems in which the learning agent cannot accurately sense the state of its environment.

 

  1. misperceived 误以为的

Our focus is on reinforcement learning methods that involve learning while interacting with the environment, which evolutionary methods do not do (unless they evolve learning algorithms, as in some of the approaches that have been studied). It is our belief that methods able to take advantage of the details of individual behavioral interactions can be much more efficient than evolutionary methods in many cases. Evolutionary methods ignore much of the useful structure of the reinforcement learning problem: they do not use the fact that the policy they are searching for is a function from states to actions; they do not notice which states an individual passes through during its lifetime, or which actions it selects. In some cases this information can be misleading (e.g., when states are misperceived), but more often it should enable more efficient search. Although evolution and learning share many features and naturally work together, we do not consider evolutionary methods by themselves to be especially well suited to reinforcement learning problems. For simplicity, in this book when we use the term "reinforcement learning method" we do not include evolutionary methods.

Our focus is on reinforcement learning methods that learn while interacting with the environment, which evolutionary methods do not do (unless they evolve learning algorithms, as in some approaches that have been studied). We believe that methods able to take advantage of the details of individual behavioral interactions can in many cases be much more efficient than evolutionary methods. Evolutionary methods ignore much of the useful structure of the reinforcement learning problem: they do not use the fact that the policy being searched for is a function from states to actions, and they do not notice which states an individual passes through during its lifetime or which actions it selects. In some cases this information can be misleading (for example when states are misperceived), but more often it should make the search more efficient. Although evolution and learning share many features and naturally work together, we do not consider evolutionary methods by themselves to be especially well suited to reinforcement learning problems, and for simplicity, when we use the term "reinforcement learning method" in this book we do not include them.

However, we do include some methods that, like evolutionary methods, do not appeal to value functions. These methods search in spaces of policies defined by a collection of numerical parameters. They estimate the directions the parameters should be adjusted in order to most rapidly improve a policy's performance. Unlike evolutionary methods, however, they produce these estimates while the agent is interacting with its environment and so can take advantage of the details of individual behavioral interactions. Methods like this, called policy gradient methods, have proven useful in many problems, and some of the simplest reinforcement learning methods fall into this category. In fact, some of these methods take advantage of value function estimates to improve their gradient estimates. Overall, the distinction between policy gradient methods and other methods we include as reinforcement learning methods is not sharply defined.

However, we do include some methods that, like evolutionary methods, do not appeal to value functions. These methods search spaces of policies defined by a collection of numerical parameters. They estimate the directions in which the parameters should be adjusted to improve a policy's performance most rapidly. Unlike evolutionary methods, they produce these estimates while the agent is interacting with its environment, and so they can take advantage of the details of individual behavioral interactions. Methods like this, called policy gradient methods, have proven useful on many problems, and some of the simplest reinforcement learning methods fall into this category. In fact, some of them use value function estimates to improve their gradient estimates. Overall, the distinction between policy gradient methods and the other methods we include as reinforcement learning methods is not sharply defined.

 

Reinforcement learning's connection to optimization methods deserves some additional comment, because it is a source of a common misunderstanding. When we say that a reinforcement learning agent's goal is to maximize a numerical reward signal, we of course are not insisting that the agent has to actually achieve the goal of maximum reward. Trying to maximize a quantity does not mean that that quantity is ever maximized. The point is that a reinforcement learning agent is always trying to increase the amount of reward it receives. Many factors can prevent it from achieving the maximum, even if one exists. In other words, optimization is not the same as optimality.

 

Reinforcement learning's connection to optimization deserves an additional comment, because the two are often confused. When we say that a reinforcement learning agent's goal is to maximize a numerical reward signal, we are of course not insisting that the agent must actually achieve the maximum reward. Trying to maximize a quantity does not mean that the quantity is ever maximized. The point is that a reinforcement learning agent is always trying to increase the amount of reward it receives; many factors can prevent it from reaching the maximum, even if one exists. In other words, optimization is not the same as optimality.

 

Section 1.4 Limitations and Scope

Reinforcement learning relies heavily on the concept of state, both as input to the policy and the value function and as input to and output from the model. Informally, we can think of the state as a signal conveying to the agent some aspects of the environment at a particular point in time. The formal definition of state that we use is given in Chapter 3 within the framework of Markov decision processes. More broadly, however, we encourage the reader to keep the informal meaning in mind and to think of the state as whatever information about the environment is available to the agent. In effect, we assume that the state signal is produced by some preprocessing system that is nominally part of the agent's environment. This book does not address the construction, change, or learning of the state signal (except briefly in Section 17.3). We take this approach not because we consider state representation unimportant, but in order to focus full attention on the decision-making problem. In other words, our concern in this book is not with designing the state signal, but with deciding what action to take as a function of whatever state signal is available.

Most of the reinforcement learning methods we discuss in this book are built around estimating value functions, but this is not strictly required for solving reinforcement learning problems. For example, methods such as genetic algorithms, genetic programming, simulated annealing, and other optimization methods never estimate value functions. These methods apply multiple static policies, each interacting over an extended period of time with a separate instance of the environment. The policies obtaining the most reward, and random variations of them, are carried over to the next generation of policies, and the process repeats. We call these evolutionary methods because their operation is analogous to biological evolution, in which a species can evolve skilled behavior even if individual organisms do not improve during their lifetimes. If the space of policies is sufficiently small, or can be structured so that good policies are common or easy to find, or if a lot of time is available for the search, then evolutionary methods can be effective. In addition, evolutionary methods have advantages on problems in which the agent cannot sense the complete state of its environment.

Our focus is on reinforcement learning methods that learn while interacting with the environment, which evolutionary methods do not do. Methods able to take advantage of the details of individual behavioral interactions can be much more efficient than evolutionary methods in most cases. Evolutionary methods ignore much of the useful structure of the reinforcement learning problem: they do not exploit the fact that the policy being searched for is a function from states to actions, and they do not notice which states an individual passes through during its lifetime or which actions it selects. In some cases this information can be misleading (for example when states are misperceived), but more often it should make the search more efficient. Although evolutionary methods and learning share many features and can naturally work together, we do not consider evolutionary methods on their own to be especially well suited to reinforcement learning problems, and we do not cover them in this book.

 

1.5 An Extended Example: Tic-Tac-Toe

To illustrate the general idea of reinforcement learning and contrast it with
other approaches, we next consider a single example in more detail.

1、illustrate 说明

2、tic-tac-toe 井字棋

To illustrate the general idea of reinforcement learning and to contrast it with other approaches, let us look at a single example in more detail.

Consider the familiar child's game of tic-tac-toe. Two players take turns
playing on a three-by-three board. One player plays Xs and the other Os until
one player wins by placing three marks in a row, horizontally, vertically, or
diagonally, as the X player has in this game:

Consider the familiar children's game of tic-tac-toe. Two players take turns playing on a three-by-three board, one playing Xs and the other Os, until one player wins by placing three marks in a row, horizontally, vertically, or diagonally.

1、diagonally 对角线上

2、as 正如

3、draw 平局

4、as never 从来没有

5、For the moment 目前来看

6、construct 创建

7、specification 说明

8、vast majority 绝大多数地

If the board fills up with neither player getting three in a row, the game is a draw. Because a skilled player can play so as never to lose, let us assume that we are playing against an imperfect player, one whose play is sometimes incorrect and allows us to win. For the moment, in fact, let us consider draws and losses to be equally bad for us. How might we construct a player that will find the imperfections in its opponent's play and learn to maximize its chances of winning?

If the board fills up with neither player getting three in a row, the game is a draw. Because a skilled player can play so as never to lose, let us assume we are playing against an imperfect player, one whose play is sometimes incorrect and allows us to win. For the moment, let us also consider draws and losses to be equally bad for us. How might we construct a player that will find the imperfections in its opponent's play and learn to maximize its chances of winning?

 

Although this is a simple problem, it cannot readily be solved in a satisfactory way through classical techniques. For example, the classical "minimax" solution from game theory is not correct here because it assumes a particular way of playing by the opponent. For example, a minimax player would never reach a game state from which it could lose, even if in fact it always won from that state because of incorrect play by the opponent. Classical optimization methods for sequential decision problems, such as dynamic programming, can compute an optimal solution for any opponent, but require as input a complete specification of that opponent, including the probabilities with which the opponent makes each move in each board state. Let us assume that this information is not available a priori for this problem, as it is not for the vast majority of problems of practical interest. On the other hand, such information can be estimated from experience, in this case by playing many games against the opponent. About the best one can do on this problem is first to learn a model of the opponent's behavior, up to some level of confidence, and then apply dynamic programming to compute an optimal solution given the approximate opponent model. In the end, this is not that different from some of the reinforcement learning methods we examine later in this book.

1、readily 容易地

2、classical 古典的

3、opponent. 对手

4、conguration 组态

5、evaluation 估计

6、incremental 

7、Literally 此外

Although this is a simple problem, it cannot readily be solved in a satisfactory way by classical techniques. For example, the classical "minimax" solution from game theory is not correct here because it assumes a particular way of playing by the opponent: a minimax player would never reach a game state from which it could lose, even if, because of the opponent's mistakes, that state would in fact lead to a win. Classical optimization methods for sequential decision problems, such as dynamic programming, can compute a solution for any opponent, but they require as input a complete specification of that opponent, including the probability of each of the opponent's moves in every board state. Let us assume, as for the vast majority of problems of practical interest, that this information is not available a priori. On the other hand, such information can be estimated from experience, in this case by playing many games against the same opponent. About the best that classical methods can do on this problem is first to learn a model of the opponent's behavior, up to some level of confidence, and then apply dynamic programming to compute an optimal solution given the approximate opponent model. In the end, this is not very different from some of the reinforcement learning methods we examine later in this book.

An evolutionary method applied to this problem would directly search the space of possible policies for one with a high probability of winning against the opponent. Here, a policy is a rule that tells the player what move to make for every state of the game, every possible configuration of Xs and Os on the three-by-three board. For each policy considered, an estimate of its winning probability would be obtained by playing some number of games against the opponent. This evaluation would then direct which policy or policies were considered next. A typical evolutionary method would hill-climb in policy space, successively generating and evaluating policies in an attempt to obtain incremental improvements. Or, perhaps, a genetic-style algorithm could be used that would maintain and evaluate a population of policies. Literally hundreds of different optimization methods could be applied.

An evolutionary method applied to this problem would search directly in the space of policies for one that wins against the opponent with high probability. Here a policy is a rule telling the player what move to make in every state of the game, that is, for every legal configuration of Xs and Os on the three-by-three board. For each policy considered, an estimate of its winning probability would be obtained by playing a number of games against the opponent, and this evaluation would then direct which policy or policies to consider next. A typical evolutionary method would hill-climb in policy space, successively generating and evaluating policies in an attempt to obtain incremental improvements. Alternatively, a genetic-style algorithm that maintains and evaluates a population of policies could be used. Literally hundreds of different optimization methods could be applied.
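
For contrast with the value-function approach described next, a hill-climbing policy search of the kind just sketched might look as follows. This is purely illustrative; `random_policy`, `mutate`, and `win_rate` are assumed helper functions that would have to be supplied for a real tic-tac-toe setup.

```python
def hill_climb(random_policy, mutate, win_rate, n_iterations=1000, games_per_eval=100):
    """Evolutionary-style search: evaluate whole policies, keep the better one.

    random_policy()      -> a complete mapping from board states to moves
    mutate(policy)       -> a slightly perturbed copy of the policy
    win_rate(policy, n)  -> fraction of n games won while holding the policy fixed
    """
    best = random_policy()
    best_score = win_rate(best, games_per_eval)
    for _ in range(n_iterations):
        candidate = mutate(best)
        score = win_rate(candidate, games_per_eval)   # only final outcomes are used
        if score > best_score:                        # what happened inside each game is ignored
            best, best_score = candidate, score
    return best
```

Note that each policy change comes only after many whole games, which is exactly the contrast with value-function learning drawn later in this section.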

Here is how the tic-tac-toe problem would be approached with a method
making use of a value function. First we set up a table of numbers, one for
each possible state of the game. Each number will be the latest estimate of
the probability of our winning from that state. We treat this estimate as the
state's value, and the whole table is the learned value function. State A has
higher value than state B, or is considered "better" than state B, if the current
estimate of the probability of our winning from A is higher than it is from B.

Figure 1.1: A sequence of tic-tac-toe moves. The solid lines represent the
moves taken during a game; the dashed lines represent moves that we (our
reinforcement learning player) considered but did not make. Our second move
was an exploratory move, meaning that it was taken even though another
sibling move, the one leading to e, was ranked higher. Exploratory moves do
not result in any learning, but each of our other moves does, causing backups
as suggested by the curved arrows and detailed in the text.

Here is how the tic-tac-toe problem would be approached using a value function. First we set up a table of numbers, one for each possible state of the game. Each number will be the latest estimate of the probability of our winning from that state. We treat this estimate as the state's value, and the whole table is the learned value function. State A has a higher value than state B, or is considered "better" than state B, if the current estimate of the probability of winning from A is higher than it is from B.

1、solid 实体的

2、dashed 虚线的

3、exploratory 探索的

4、 even though 即使

5、 sibling 兄弟的

6、curved 弯曲的

7、arrows 箭头

Assuming we always play Xs, then for all states with three Xs in a row the probability of winning is 1, because we have already won. Similarly, for all states with three Os in a row, or that are "filled up," the correct probability is 0, as we cannot win from them. We set the initial values of all the other states to 0.5, representing a guess that we have a 50% chance of winning.

Assuming we always play Xs, then for every state with three Xs in a row the probability of winning is 1, because we have already won. Similarly, for every state with three Os in a row, or in which the board is filled up, the correct probability is 0, because we cannot win from it. We set the initial values of all other states to 0.5, representing a guess that we have a 50% chance of winning from them.

 

We play many games against the opponent. To select our moves we examine
the states that would result from each of our possible moves (one for each blank
space on the board) and look up their current values in the table. Most of the
time we move greedily, selecting the move that leads to the state with greatest
value, that is, with the highest estimated probability of winning. Occasionally,
however, we select randomly from among the other moves instead. These are
called exploratory moves because they cause us to experience states that we
might otherwise never see. A sequence of moves made and considered during
a game can be diagrammed as in Figure 1.1.

We then play many games against the opponent. To select a move, we examine the states that would result from each possible move (one for each blank space on the board) and look up their current values in the table. Most of the time we move greedily, selecting the move that leads to the state with the greatest value, that is, with the highest estimated probability of winning. Occasionally, however, we select randomly from among the other moves instead. These are called exploratory moves because they cause us to experience states that we might otherwise never see. A sequence of moves made and considered during a game is diagrammed in Figure 1.1.

While we are playing, we change the values of the states in which we find ourselves during the game. We attempt to make them more accurate estimates of the probabilities of winning. To do this, we "back up" the value of the state after each greedy move to the state before the move, as suggested by the arrows in Figure 1.1. More precisely, the current value of the earlier state is adjusted to be closer to the value of the later state. This can be done by moving the earlier state's value a fraction of the way toward the value of the later state. If we let s denote the state before the greedy move, and s′ the state after the move, then the update to the estimated value of s, denoted V(s), can be written as

V(s) ← V(s) + α [ V(s′) − V(s) ],

where α is a small positive fraction called the step-size parameter, which influences the rate of learning. This update rule is an example of a temporal-difference learning method, so called because its changes are based on a difference, V(s′) − V(s), between estimates at two different times.

While we are playing, we change the values of the states we encounter, trying to make them more accurate estimates of the probability of winning. To do this, after each greedy move we "back up" the value of the resulting state to the state before the move, as shown in Figure 1.1. More precisely, the current value of the earlier state is updated to move closer to the value of the later state, by shifting the earlier state's value a fraction of the way toward the later state's value. If St denotes the state before the greedy move and St+1 the state after it, then the update to the estimated value of St, written V(St), is

V(St) ← V(St) + α [ V(St+1) − V(St) ],

where α is the step-size parameter, a small positive number that influences the rate of learning. This update rule is an instance of temporal-difference (TD) learning, so named because the change is based on the difference V(St+1) − V(St) between estimates at two successive times.
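
Putting the last few paragraphs together, a minimal sketch of the value-table player might look as follows. This is our own illustration under simplifying assumptions: states are assumed to be hashable board descriptions, `possible_moves(state)` and `result(state, move)` are assumed helper functions, and terminal states are assumed to have been seeded with value 1 (win) or 0 (loss/draw) elsewhere.

```python
import random

values = {}       # state -> current estimate of the probability of winning
ALPHA = 0.1       # step-size parameter
EPSILON = 0.1     # fraction of exploratory moves

def value_of(state):
    # Unseen non-terminal states start at 0.5, the initial 50% guess.
    return values.setdefault(state, 0.5)

def choose_move(state, possible_moves, result):
    """Mostly greedy move selection, with occasional exploratory moves."""
    moves = possible_moves(state)
    if random.random() < EPSILON:
        return random.choice(moves), True                            # exploratory move
    best = max(moves, key=lambda m: value_of(result(state, m)))
    return best, False                                               # greedy move

def td_backup(state, next_state):
    """Temporal-difference update: V(s) <- V(s) + alpha * (V(s') - V(s))."""
    values[state] = value_of(state) + ALPHA * (value_of(next_state) - value_of(state))
```

After each greedy move the state before the move is backed up toward the state after it; exploratory moves, as the figure caption notes, do not trigger learning in this simple scheme.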

1、arrows 箭头

2、fraction 分数

3、denote 表示

The method described above performs quite well on this task. For example,
if the step-size parameter is reduced properly over time, this method converges,
for any fixed opponent, to the true probabilities of winning from each state
given optimal play by our player. Furthermore, the moves then taken (except
on exploratory moves) are in fact the optimal moves against the opponent. In
other words, the method converges to an optimal policy for playing the game.
If the step-size parameter is not reduced all the way to zero over time, then
this player also plays well against opponents that slowly change their way of
playing.

The method described above performs quite well on this task. For example, if the step-size parameter is reduced appropriately over time, then for any fixed opponent the estimated value of each state converges to the true probability of winning from that state given optimal play by our player. Furthermore, the moves then taken (except for exploratory moves) are in fact the optimal moves against this (imperfect) opponent; in other words, the method converges to an optimal policy against this opponent. If the step-size parameter is not reduced all the way to zero over time, the player also plays well against opponents that slowly change their way of playing.

1、converges 收敛

This example illustrates the differences between evolutionary methods and the methods that learn value functions. To evaluate a policy an evolutionary method holds the policy fixed and plays many games against the opponent, or simulates many games using a model of the opponent. The frequency of wins gives an unbiased estimate of the probability of winning with that policy, and can be used to direct the next policy selection. But each policy change is made only after many games, and only the final outcome of each game is used: what happens during the games is ignored. For example, if the player wins, then all of its behavior in the game is given credit, independently of how specific moves might have been critical to the win. Credit is even given to moves that never occurred! Value function methods, in contrast, allow individual states to be evaluated. In the end, evolutionary and value function methods both search the space of policies, but learning a value function takes advantage of information available during the course of play.

This example illustrates the difference between evolutionary methods and methods that use value functions. To evaluate a policy, an evolutionary method holds the policy fixed and plays many games against the opponent, or simulates many games using a model of the opponent. The frequency of wins gives an unbiased estimate of that policy's probability of winning and can be used to direct the next policy selection. But each policy improvement comes only after many games, and only the final outcome of each game is used; everything that happens during the games is ignored. For example, if the program wins, all of its behavior in that game receives credit, regardless of how important particular moves were to the win. Credit is even given to moves that were never made. Value function methods, in contrast, allow individual states to be evaluated. In the end, evolutionary methods and value function methods both search the space of policies, but learning a value function takes advantage of the information available during the course of play.

1、illustrates 说明

2、evolutionary 发展的

This simple example illustrates some of the key features of reinforcement learning methods. First, there is the emphasis on learning while interacting with an environment, in this case with an opponent player. Second, there is a clear goal, and correct behavior requires planning or foresight that takes into account delayed effects of one's choices. For example, the simple reinforcement learning player would learn to set up multi-move traps for a shortsighted opponent. It is a striking feature of the reinforcement learning solution that it can achieve the effects of planning and lookahead without using a model of the opponent and without conducting an explicit search over possible sequences of future states and actions.

This simple example illustrates some of the key features of reinforcement learning methods. First, there is the emphasis on learning while interacting with an environment, here the opponent. Second, there is a clear goal, and correct behavior requires planning or foresight that takes the delayed effects of one's choices into account. For example, the simple reinforcement learning player would learn to set up multi-move traps for a shortsighted opponent. It is a striking feature of the reinforcement learning solution that it achieves the effects of planning and lookahead without a model of the opponent and without an explicit search over possible sequences of future states and actions.

1、emphasis 重点

2、striking 显著的

3、explicit 明确的

While this example illustrates some of the key features of reinforcement
learning, it is so simple that it might give the impression that reinforcement
learning is more limited than it really is. Although tic-tac-toe is a two-person
game, reinforcement learning also applies in the case in which there is no external
adversary, that is, in the case of a "game against nature." Reinforcement
learning also is not restricted to problems in which behavior breaks down into
separate episodes, like the separate games of tic-tac-toe, with reward only at
the end of each episode. It is just as applicable when behavior continues indefinitely
and when rewards of various magnitudes can be received at any time.
Reinforcement learning is also applicable to problems that do not even break
down into discrete time steps, like the plays of tic-tac-toe. The general principles
apply to continuous-time problems as well, although the theory gets more
complicated and we omit it from this introductory treatment.

While this example illustrates some key features of reinforcement learning, it is so simple that it might give the impression that reinforcement learning is more limited than it really is. Although tic-tac-toe is a two-person game, reinforcement learning also applies when there is no external adversary, that is, in the case of a "game against nature." Nor is reinforcement learning restricted to problems in which behavior breaks down into separate episodes, like separate games of tic-tac-toe with reward only at the end of each episode; it applies just as well when behavior continues indefinitely and rewards of various magnitudes can arrive at any time. Reinforcement learning is even applicable to problems that do not break down into discrete time steps the way plays of tic-tac-toe do. The general principles apply to continuous-time problems as well, although the theory becomes more complicated and we omit it from this introductory treatment.

1、adversary 对手

2、restricted 限制的

3、discrete 离散的

Tic-tac-toe has a relatively small, finite state set, whereas reinforcement learning can be used when the state set is very large, or even infinite. For example, Gerry Tesauro (1992, 1995) combined the algorithm described above with an artificial neural network to learn to play backgammon, which has approximately 10^20 states. With this many states it is impossible ever to experience more than a small fraction of them. Tesauro's program learned to play far better than any previous program, and now plays at the level of the world's best human players (see Chapter 15). The neural network provides the program with the ability to generalize from its experience, so that in new states it selects moves based on information saved from similar states faced in the past, as determined by its network. How well a reinforcement learning system can work in problems with such large state sets is intimately tied to how appropriately it can generalize from past experience. It is in this role that we have the greatest need for supervised learning methods with reinforcement learning. Neural networks are not the only, or necessarily the best, way to do this.

Although the tic-tac-toe state set is finite and fairly small, reinforcement learning can be used when the state set is very large, or even infinite. For example, Tesauro (1992, 1995) combined the algorithm described above with an artificial neural network to learn to play backgammon, which has roughly 10^20 states. With that many states it is impossible ever to experience more than a tiny fraction of them. Tesauro's program performed far better than any previous program and eventually surpassed the best human players (Section 16.1). The artificial neural network gives the program the ability to generalize from past experience, so that when it faces a new state it can choose a move, via the network, on the basis of similar states encountered before. How well a reinforcement learning system works on problems with such huge state sets is closely tied to how well it can generalize from past experience. It is in this role that reinforcement learning most needs supervised learning methods; artificial neural networks and deep learning (Section 9.6) are neither the only nor necessarily the best way to do this.

In this tic-tac-toe example, learning started with no prior knowledge beyond the rules of the game, but reinforcement learning by no means entails a tabula rasa view of learning and intelligence. On the contrary, prior information can be incorporated into reinforcement learning in a variety of ways that can be critical for efficient learning. We also had access to the true state in the tic-tac-toe example, whereas reinforcement learning can also be applied when part of the state is hidden, or when different states appear to the learner to be the same. That case, however, is substantially more difficult, and we do not cover it significantly in this book.

In the tic-tac-toe example, learning began with no prior knowledge beyond the rules of the game, but reinforcement learning by no means has to start from a blank slate. On the contrary, prior information can be incorporated into reinforcement learning in a variety of ways, and this is sometimes essential for efficient learning. In the tic-tac-toe example we also had access to the true state, whereas reinforcement learning can also be applied when part of the state is hidden, or when different states appear the same to the learner.

1、incorporated 合并

Finally, the tic-tac-toe player was able to look ahead and know the states that would result from each of its possible moves. To do this, it had to have a model of the game that allowed it to "think about" how its environment would change in response to moves that it might never make. Many problems are like this, but in others even a short-term model of the effects of actions is lacking. Reinforcement learning can be applied in either case. No model is required, but models can easily be used if they are available or can be learned.

Finally, the tic-tac-toe player was able to look ahead and know the states that would result from each of its possible moves. To do this it needed a model of the game, one that let it predict how its environment would change in response to moves it might never make. Many problems are like this, but in others even a short-term model of the effects of actions is lacking. Reinforcement learning can be applied in either case: no model is required, but if a model is available or can be learned, it can easily be used (Chapter 8).

On the other hand, there are reinforcement learning methods that do not need any kind of environment model at all. Model-free systems cannot even think about how their environments will change in response to a single action. The tic-tac-toe player is model-free in this sense with respect to its opponent: it has no model of its opponent of any kind. Because models have to be reasonably accurate to be useful, model-free methods can have advantages over more complex methods when the real bottleneck in solving a problem is the difficulty of constructing a sufficiently accurate environment model. Model-free methods are also important building blocks for model-based methods. In this book we devote several chapters to model-free methods before we discuss how they can be used as components of more complex model-based methods.

On the other hand, there are reinforcement learning methods that need no environment model at all. Model-free systems cannot even think about how their environment will change in response to a single action. The TD tic-tac-toe player is model-free in this sense with respect to its opponent: it has no model of its opponent of any kind. Because a model has to be reasonably accurate to be useful, model-free methods can have an advantage over more complex methods when the real bottleneck in solving a problem is the difficulty of constructing a sufficiently accurate environment model. Model-free methods are also important building blocks for model-based methods; in this book we spend several chapters on model-free methods before discussing how they can be used as components of more complex model-based methods.

But reinforcement learning can be used at both high and low levels in a system.
Although the tic-tac-toe player learned only about the basic moves of the
game, nothing prevents reinforcement learning from working at higher levels
where each of the "actions" may itself be the application of a possibly elaborate
problem-solving method. In hierarchical learning systems, reinforcement
learning can work simultaneously on several levels.

Reinforcement learning can be used at both high and low levels in a system. Although the tic-tac-toe player learned only the basic moves of the game, nothing prevents reinforcement learning from working at higher levels, where each "action" may itself be the application of a possibly elaborate problem-solving method. In hierarchical learning systems, reinforcement learning can work on several levels simultaneously.

1.6 Summary


Reinforcement learning is a computational approach to understanding and automating goal-directed learning and decision-making. It is distinguished from other computational approaches by its emphasis on learning by an agent from direct interaction with its environment, without relying on exemplary supervision or complete models of the environment. In our opinion, reinforcement learning is the first field to seriously address the computational issues that arise when learning from interaction with an environment in order to achieve long-term goals.

Reinforcement learning is a computational approach to understanding and automating goal-directed learning and decision-making. It differs from other computational approaches in its emphasis on learning by an agent from direct interaction with its environment, without requiring exemplary supervision or a complete model of the environment. In our opinion, reinforcement learning is the first field to seriously address the computational issues that arise when learning from interaction with an environment in order to achieve long-term goals.

Reinforcement learning uses a formal framework defining the interaction between a learning agent and its environment in terms of states, actions, and rewards. This framework is intended to be a simple way of representing essential features of the artificial intelligence problem. These features include a sense of cause and effect, a sense of uncertainty and nondeterminism, and the existence of explicit goals.

Reinforcement learning uses the framework of Markov decision processes to define the interaction between a learning agent and its environment in terms of states, actions, and rewards. This framework is intended to be a simple way of representing essential features of the artificial intelligence problem, including a sense of cause and effect, a sense of uncertainty and nondeterminism, and the existence of explicit goals.

The concepts of value and value functions are the key features of most of the reinforcement learning methods that we consider in this book. We take the position that value functions are important for efficient search in the space of policies. Their use of value functions distinguishes reinforcement learning methods from evolutionary methods that search directly in policy space guided by scalar evaluations of entire policies.

The concepts of value and value function are key to most of the reinforcement learning methods presented in this book. We take the position that value functions are important for efficient search in the space of policies. The use of value functions distinguishes reinforcement learning methods from evolutionary methods, which search directly in policy space guided by evaluations of entire policies.

 
