Chapter 3 Finite Markov Decision Processes
This is my second pass through the book; along the way I am excerpting the key points so I can review them later.
Contents
Chapter 3 Finite Markov Decision Processes
3.1 The Agent–Environment Interface
3.2 Goals and Rewards
3.3 Returns and Episodes
3.4 Unified Notation for Episodic and Continuing Tasks
3.5 Policies and Value Functions
3.6 Optimal Policies and Optimal Value Functions
3.7 Optimality and Approximation
3.8 Summary
MDPs are a classical formalization of sequential decision making. They are meant to be a straightforward framing of the problem of learning from interaction to achieve a goal.
3.1 The Agent–Environment Interface
The learner and decision maker is called the agent. The thing the agent interacts with, comprising everything outside the agent, is called the environment. These interact continually, the agent selecting actions and the environment responding to these actions and presenting new situations to the agent. The environment also gives rise to rewards, special numerical values that the agent seeks to maximize over time through its choice of actions.
The MDP and agent together thereby give rise to a sequence or trajectory that begins like this:
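In the book's notation, this trajectory is
\[ S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \dots \]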
In a finite MDP, the sets of states, actions, and rewards (S, A, and R) all have a finite number of elements. In this case, the random variables R_t and S_t have well defined discrete probability distributions dependent only on the preceding state and action.
The function p defines the dynamics of the MDP. The dynamics function p is an ordinary deterministic function of four arguments, and it specifies a probability distribution for each choice of s and a, that is:
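In the book's notation, the four-argument dynamics function and the normalization it satisfies are
\[ p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}, \]
\[ \sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} p(s', r \mid s, a) = 1, \quad \text{for all } s \in \mathcal{S},\ a \in \mathcal{A}(s). \]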
The Markov property
In a Markov decision process, the probabilities given by p completely characterize the environment's dynamics. That is, the probability of each possible value for S_t and R_t depends only on the immediately preceding state and action, S_{t-1} and A_{t-1}, and, given them, not at all on earlier states and actions. This is best viewed as a restriction not on the decision process, but on the state. The state must include information about all aspects of the past agent–environment interaction that make a difference for the future. If it does, then the state is said to have the Markov property.
Other formulas that are sometimes useful:
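From the four-argument dynamics p one can compute anything else one might want to know about the environment, for example the state-transition probabilities and the expected rewards for state–action pairs and for state–action–next-state triples:
\[ p(s' \mid s, a) \doteq \Pr\{S_t = s' \mid S_{t-1} = s, A_{t-1} = a\} = \sum_{r \in \mathcal{R}} p(s', r \mid s, a), \]
\[ r(s, a) \doteq \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a] = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p(s', r \mid s, a), \]
\[ r(s, a, s') \doteq \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a, S_t = s'] = \sum_{r \in \mathcal{R}} r\, \frac{p(s', r \mid s, a)}{p(s' \mid s, a)}. \]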
The MDP framework is abstract and flexible and can be applied to many different problems in many different ways. The time steps need not refer to fixed intervals of real time; they can refer to arbitrary successive stages of decision making and acting. The actions can be low-level controls, such as the voltages applied to the motors of a robot arm, or high-level decisions, such as whether or not to have lunch or to go to graduate school. The states, likewise, can take a wide variety of forms. They can be completely determined by low-level sensations, such as direct sensor readings, or they can be more high-level and abstract, such as symbolic descriptions of objects in a room. Some of what makes up a state could be based on memory of past sensations or even be entirely mental or subjective. For example, an agent could be in the state of not being sure where an object is, or of having just been surprised in some clearly defined sense. Similarly, some actions might be totally mental or computational. For example, some actions might control what an agent chooses to think about, or where it focuses its attention. In general, actions can be any decisions we want to learn how to make, and the states can be anything we can know that might be useful in making them.
The boundary between agent and environment: the general rule we follow is that anything that cannot be changed arbitrarily by the agent is considered to be outside of it and thus part of its environment.
The MDP framework is a considerable abstraction of the problem of goal-directed learning from interaction.
Any problem of learning goal-directed behavior can be reduced to three signals passing back and forth between an agent and its environment: one signal to represent the choices made by the agent (the actions), one signal to represent the basis on which the choices are made (the states), and one signal to define the agent's goal (the rewards). This framework may not be sufficient to represent all decision-learning problems usefully, but it has proved to be widely useful and applicable.
Example 3.3 Recycling Robot
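A minimal sketch of the recycling robot's dynamics table written as a Python dictionary. The structure follows the book's Example 3.3 (states high/low, actions search/wait/recharge, parameters alpha, beta, r_search, r_wait); the numeric values assigned to those parameters below are illustrative assumptions, not values from the book.

```python
# Recycling robot MDP as a dynamics table.
# Keys are (state, action); values are lists of (probability, next_state, reward).
alpha, beta = 0.8, 0.6          # prob. the battery stays high / stays low while searching (assumed values)
r_search, r_wait = 2.0, 1.0     # expected rewards for searching and waiting (assumed values)

dynamics = {
    ('high', 'search'):   [(alpha, 'high', r_search), (1 - alpha, 'low', r_search)],
    ('high', 'wait'):     [(1.0, 'high', r_wait)],
    ('low',  'search'):   [(beta, 'low', r_search), (1 - beta, 'high', -3.0)],  # -3: depleted, must be rescued
    ('low',  'wait'):     [(1.0, 'low', r_wait)],
    ('low',  'recharge'): [(1.0, 'high', 0.0)],
}

# Sanity check: each (state, action) entry forms a proper probability distribution.
for sa, outcomes in dynamics.items():
    assert abs(sum(p for p, _, _ in outcomes) - 1.0) < 1e-12, sa
```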
3.2 Goals and Rewards
In reinforcement learning, the purpose or goal of the agent is formalized in terms of a special signal, called the reward, passing from the environment to the agent. At each time step, the reward is a simple number. The agent's goal is to maximize the total amount of reward it receives. This means maximizing not immediate reward, but cumulative reward in the long run. The reward signal is your way of communicating to the robot what you want it to achieve, not how you want it achieved.
3.3 Returns and Episodes
If the sequence of rewards received after time step t is denoted R_{t+1}, R_{t+2}, R_{t+3}, ..., then we seek to maximize the expected return, where the return, denoted G_t, is defined as some specific function of the reward sequence. In the simplest case the return is the sum of the rewards:
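With T the final time step of the episode, this undiscounted return is
\[ G_t \doteq R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T. \]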
When the agent–environment interaction breaks naturally into subsequences, which we call episodes, each episode ends in a special state called the terminal state, followed by a reset to a standard starting state or to a sample from a standard distribution of starting states. The next episode begins independently of how the previous one ended. Thus the episodes can all be considered to end in the same terminal state, with different rewards for the different outcomes. Tasks with episodes of this kind are called episodic tasks. In many cases, however, the agent–environment interaction does not break naturally into identifiable episodes, but goes on continually without limit. For example, this would be the natural way to formulate an on-going process-control task, or an application to a robot with a long life span. We call these continuing tasks.
The discount rate γ determines the present value of future rewards:
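In the discounted setting the agent chooses actions to maximize the expected discounted return (0 ≤ γ ≤ 1), which also satisfies a simple recursion:
\[ G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad G_t = R_{t+1} + \gamma\, G_{t+1}. \]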
3.4 Unified Notation for Episodic and Continuing Tasks
In one kind of task the agent–environment interaction naturally breaks down into a sequence of separate episodes (episodic tasks), and in the other it does not (continuing tasks). These two can be unified by considering episode termination to be the entering of a special absorbing state that transitions only to itself and that generates only rewards of zero.
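With this convention, the return for both kinds of tasks can be written with a single expression, allowing either T = ∞ or γ = 1 (but not both):
\[ G_t \doteq \sum_{k=t+1}^{T} \gamma^{\,k-t-1} R_k. \]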
3.5 Policies and Value Functions
Value functions are functions of states (or of state–action pairs) that estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state). The notion of "how good" here is defined in terms of future rewards that can be expected, or, to be precise, in terms of expected return. Of course the rewards the agent can expect to receive in the future depend on what actions it will take. Accordingly, value functions are defined with respect to particular ways of acting, called policies.
A policy is a mapping from states to probabilities of selecting each possible action. Reinforcement learning methods specify how the agent's policy is changed as a result of its experience.
The value function of a state s under a policy π, denoted v_π(s), is the expected return when starting in s and following π thereafter. For MDPs, we can define v_π formally by:
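Here E_π[·] denotes the expected value of a random variable given that the agent follows policy π:
\[ v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right], \quad \text{for all } s \in \mathcal{S}. \]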
We call the function v_π the state-value function for policy π.
Similarly, we define the value of taking action a in state s under a policy π, denoted q_π(s, a), as the expected return starting from s, taking the action a, and thereafter following policy π:
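That is, with the same notation as above:
\[ q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right]. \]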
We call q_π the action-value function for policy π.
The value functions v_π and q_π can be estimated from experience. For example, if an agent follows policy π and maintains an average, for each state encountered, of the actual returns that have followed that state, then the average will converge to the state's value, v_π(s), as the number of times that state is encountered approaches infinity. If separate averages are kept for each action taken in each state, then these averages will similarly converge to the action values, q_π(s, a). We call estimation methods of this kind Monte Carlo methods because they involve averaging over many random samples of actual returns.
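A minimal Python sketch of that return-averaging idea. It assumes a hypothetical generate_episode(policy) helper that runs one episode and returns a list of (state, reward) steps, where reward is the reward received after leaving that state; this is an illustration, not the book's code.

```python
from collections import defaultdict

def mc_state_values(generate_episode, policy, gamma=0.9, n_episodes=10_000):
    """Estimate v_pi by averaging, for each state, the returns observed after it."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for _ in range(n_episodes):
        episode = generate_episode(policy)      # hypothetical helper: [(state, reward), ...]
        g = 0.0
        # Walk the episode backwards so the return from each state accumulates incrementally.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            returns_sum[state] += g
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```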
Of course, if there are very many states, then it may not be practical to keep separate averages for each state individually. Instead, the agent would have to maintain v_π and q_π as parameterized functions (with fewer parameters than states) and adjust the parameters to better match the observed returns. This can also produce accurate estimates, although much depends on the nature of the parameterized function approximator.
A fundamental property of value functions used throughout reinforcement learning and dynamic programming is that they satisfy recursive relationships similar to the one we have already established for the return. For any policy π and any state s, the following consistency condition holds between the value of s and the value of its possible successor states:
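In the book's notation, with the sums ranging over all actions a, next states s', and rewards r:
\[ v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s',\, r} p(s', r \mid s, a)\bigl[r + \gamma\, v_\pi(s')\bigr], \quad \text{for all } s \in \mathcal{S}. \]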
The formula above is the Bellman equation for v_π.
It expresses a relationship between the value of a state and the values of its successor states. Think of looking ahead from a state to its possible successor states. Starting from state s, the root node at the top of the backup diagram, the agent could take any of some set of actions (three are shown in the diagram) based on its policy π. From each of these, the environment could respond with one of several next states, s' (two are shown in the figure), along with a reward, r, depending on its dynamics given by the function p.
The Bellman equation averages over all the possibilities, weighting each by its probability of occurring.
Gridworld example: code implementation
Implementation:
First, the state values are computed iteratively; a sketch follows.
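A minimal Python sketch of iterative evaluation of the equiprobable random policy on the 5x5 gridworld of Example 3.5 (from states A and B every action teleports to A' and B' with rewards +10 and +5, moves off the grid leave the state unchanged with reward -1, all other moves give reward 0, γ = 0.9). This is my sketch, not the book's code.

```python
import numpy as np

SIZE, GAMMA = 5, 0.9
A, A_PRIME, B, B_PRIME = (0, 1), (4, 1), (0, 3), (2, 3)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]        # north, south, west, east

def step(state, action):
    """One deterministic transition of the gridworld: returns (next_state, reward)."""
    if state == A:
        return A_PRIME, 10.0
    if state == B:
        return B_PRIME, 5.0
    i, j = state[0] + action[0], state[1] + action[1]
    if 0 <= i < SIZE and 0 <= j < SIZE:
        return (i, j), 0.0
    return state, -1.0                              # bumped into the wall

def evaluate_random_policy(theta=1e-4):
    """Iterate the Bellman equation for v_pi under the equiprobable random policy."""
    v = np.zeros((SIZE, SIZE))
    while True:
        new_v = np.zeros_like(v)
        for i in range(SIZE):
            for j in range(SIZE):
                for action in ACTIONS:
                    (ni, nj), r = step((i, j), action)
                    new_v[i, j] += 0.25 * (r + GAMMA * v[ni, nj])
        if np.abs(new_v - v).max() < theta:
            return new_v
        v = new_v

print(np.round(evaluate_random_policy(), 1))
```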
Next, the drawing of the value table is set up; a sketch follows.
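One simple way to render the value array as a table, assuming matplotlib is available; the figure layout choices here are my own.

```python
import matplotlib.pyplot as plt

def draw_value_table(values, filename='gridworld_values.png'):
    """Render a 2-D array of state values as a table image and save it to disk."""
    fig, ax = plt.subplots()
    ax.set_axis_off()
    cell_text = [[f"{v:.1f}" for v in row] for row in values]
    table = ax.table(cellText=cell_text, loc='center', cellLoc='center')
    table.scale(1, 2)                 # make the rows a bit taller for readability
    fig.savefig(filename, bbox_inches='tight')
    plt.close(fig)
```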
Finally, the optimal state values are computed by iterating the update given by the Bellman optimality equation until convergence; a sketch follows.
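A sketch of computing the optimal state values with the same gridworld conventions as above (again my sketch, written to be self-contained).

```python
import numpy as np

SIZE, GAMMA = 5, 0.9
A, A_PRIME, B, B_PRIME = (0, 1), (4, 1), (0, 3), (2, 3)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]        # north, south, west, east

def step(state, action):
    """Same deterministic gridworld transition as in the sketch above."""
    if state == A:
        return A_PRIME, 10.0
    if state == B:
        return B_PRIME, 5.0
    i, j = state[0] + action[0], state[1] + action[1]
    if 0 <= i < SIZE and 0 <= j < SIZE:
        return (i, j), 0.0
    return state, -1.0

def optimal_state_values(theta=1e-4):
    """Iterate v(s) <- max_a [r + gamma * v(s')] until the values stop changing."""
    v = np.zeros((SIZE, SIZE))
    while True:
        new_v = np.zeros_like(v)
        for i in range(SIZE):
            for j in range(SIZE):
                candidates = []
                for action in ACTIONS:
                    (ni, nj), r = step((i, j), action)
                    candidates.append(r + GAMMA * v[ni, nj])
                new_v[i, j] = max(candidates)
        if np.abs(new_v - v).max() < theta:
            return new_v
        v = new_v

print(np.round(optimal_state_values(), 1))
```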
3.6 Optimal Policies and Optimal Value Functions
There is always at least one policy that is better than or equal to all other policies. This is an optimal policy. Although there may be more than one, we denote all the optimal policies by π*. They share the same state-value function, called the optimal state-value function, denoted v*. Optimal policies also share the same optimal action-value function, denoted q*, and defined as:
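In symbols, for all s ∈ S and a ∈ A(s):
\[ v_*(s) \doteq \max_\pi v_\pi(s), \qquad q_*(s, a) \doteq \max_\pi q_\pi(s, a) = \mathbb{E}\bigl[R_{t+1} + \gamma\, v_*(S_{t+1}) \mid S_t = s, A_t = a\bigr]. \]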
The Bellman optimality equation expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action from that state:
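Written out:
\[ v_*(s) = \max_{a} q_{\pi_*}(s, a) = \max_{a} \mathbb{E}\bigl[R_{t+1} + \gamma\, v_*(S_{t+1}) \mid S_t = s, A_t = a\bigr] = \max_{a} \sum_{s',\, r} p(s', r \mid s, a)\bigl[r + \gamma\, v_*(s')\bigr]. \]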
(That is, the maximum action value under an optimal policy.)
The last two equations are two forms of the Bellman optimality equation for v*. The Bellman optimality equation for q* is:
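In the same notation:
\[ q_*(s, a) = \mathbb{E}\Bigl[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \,\Bigm|\, S_t = s, A_t = a\Bigr] = \sum_{s',\, r} p(s', r \mid s, a)\Bigl[r + \gamma \max_{a'} q_*(s', a')\Bigr]. \]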
By means of v*, the optimal expected long-term return is turned into a quantity that is locally and immediately available for each state. Hence, a one-step-ahead search yields the long-term optimal actions.
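A small sketch of that one-step-ahead search. The environment model and the value table are passed in as arguments, so it can be used, for example, with the step function and optimal values from the gridworld sketches above; this is an illustration, not the book's code.

```python
def greedy_action(state, v, actions, step, gamma=0.9):
    """One-step-ahead search: pick the action maximizing r + gamma * v[next_state]."""
    best_action, best_value = None, float('-inf')
    for action in actions:
        next_state, reward = step(state, action)
        value = reward + gamma * v[next_state]
        if value > best_value:
            best_action, best_value = action, value
    return best_action
```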
Many different decision-making methods can be viewed as ways of approximately solving the Bellman optimality equation. The methods of dynamic programming can be related even more closely to the Bellman optimality equation. Many reinforcement learning methods can be clearly understood as approximately solving the Bellman optimality equation, using actual experienced transitions in place of knowledge of the expected transitions. We consider a variety of such methods in the following chapters.
3.7 Optimality and Approximation
It is possible to form these approximations using arrays or tables with one entry for each state (or state–action pair). This we call the tabular case, and the corresponding methods we call tabular methods. In many cases of practical interest, however, there are far more states than could possibly be entries in a table. In these cases the functions must be approximated, using some sort of more compact parameterized function representation. The online nature of reinforcement learning makes it possible to approximate optimal policies in ways that put more effort into learning to make good decisions for frequently encountered states, at the expense of less effort for infrequently encountered states. This is one key property that distinguishes reinforcement learning from other approaches to approximately solving MDPs.
3.8 Summary
Let us summarize the elements of the reinforcement learning problem that we have presented in this chapter. Reinforcement learning is about learning from interaction how to behave in order to achieve a goal. The reinforcement learning agent and its environment interact over a sequence of discrete time steps. The specification of their interface defines a particular task: the actions are the choices made by the agent; the states are the basis for making the choices; and the rewards are the basis for evaluating the choices. Everything inside the agent is completely known and controllable by the agent; everything outside is incompletely controllable but may or may not be completely known.
A policy is a stochastic rule by which the agent selects actions as a function of states. The agent's objective is to maximize the amount of reward it receives over time. When the reinforcement learning setup described above is formulated with well defined transition probabilities it constitutes a Markov decision process (MDP). A finite MDP is an MDP with finite state, action, and (as we formulate it here) reward sets. Much of the current theory of reinforcement learning is restricted to finite MDPs, but the methods and ideas apply more generally.
The return is the function of future rewards that the agent seeks to maximize (in expected value). It has several different definitions depending upon the nature of the task and whether one wishes to discount delayed reward. The undiscounted formulation is appropriate for episodic tasks, in which the agent–environment interaction breaks naturally into episodes; the discounted formulation is appropriate for continuing tasks, in which the interaction does not naturally break into episodes but continues without limit. We try to define the returns for the two kinds of tasks such that one set of equations can apply to both the episodic and continuing cases.
A policy's value functions assign to each state, or state–action pair, the expected return from that state, or state–action pair, given that the agent uses the policy. The optimal value functions assign to each state, or state–action pair, the largest expected return achievable by any policy. A policy whose value functions are optimal is an optimal policy. Whereas the optimal value functions for states and state–action pairs are unique for a given MDP, there can be many optimal policies. Any policy that is greedy with respect to the optimal value functions must be an optimal policy. The Bellman optimality equations are special consistency conditions that the optimal value functions must satisfy and that can, in principle, be solved for the optimal value functions, from which an optimal policy can be determined with relative ease.
A reinforcement learning problem can be posed in a variety of different ways depending on assumptions about the level of knowledge initially available to the agent. In problems of complete knowledge, the agent has a complete and accurate model of the environment's dynamics. If the environment is an MDP, then such a model consists of the complete four-argument dynamics function p. In problems of incomplete knowledge, a complete and perfect model of the environment is not available.
Even if the agent has a complete and accurate environment model, the agent is typically unable to perform enough computation per time step to fully use it. The memory available is also an important constraint. Memory may be required to build up accurate approximations of value functions, policies, and models. In most cases of practical interest there are far more states than could possibly be entries in a table, and approximations must be made. A well-defined notion of optimality organizes the approach to learning we describe in this book and provides a way to understand the theoretical properties of various learning algorithms, but it is an ideal that reinforcement learning agents can only approximate to varying degrees. In reinforcement learning we are very much concerned with cases in which optimal solutions cannot be found but must be approximated in some way.