Chapter 3 Finite Markov Decision Processes
This is my second pass through the book; along the way I am excerpting the key points so I can review them later.
Contents
Chapter 3 Finite Markov Decision Processes
3.1 The Agent–Environment Interface
3.2 Goals and Rewards
3.3 Returns and Episodes
3.4 Unified Notation for Episodic and Continuing Tasks
3.5 Policies and Value Functions
3.6 Optimal Policies and Optimal Value Functions
3.7 Optimality and Approximation
3.8 Summary
MDPs are a classical formalization of sequential decision making. They are meant to be a straightforward framing of the problem of learning from interaction to achieve a goal.
3.1 The Agent–Environment Interface
The learner and decision maker is called the agent. The thing the agent interacts with, comprising everything outside the agent, is called the environment. These interact continually, the agent selecting actions and the environment responding to these actions and presenting new situations to the agent. The environment also gives rise to rewards, special numerical values that the agent seeks to maximize over time through its choice of actions.
The MDP and agent together thereby give rise to a sequence or trajectory that begins like this:
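In the book's notation, this trajectory is
\[ S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \dots \]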
In a finite MDP, the sets of states, actions, and rewards (S, A, and R) all have a finite number of elements. In this case, the random variables R_t and S_t have well defined discrete probability distributions dependent only on the preceding state and action.
The function p defines the dynamics of the MDP. The dynamics function p is an ordinary deterministic function of four arguments, and it specifies a probability distribution for each choice of s and a, that is:
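In the book's notation, the four-argument dynamics function and the normalization it satisfies are
\[ p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}, \]
\[ \sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} p(s', r \mid s, a) = 1, \quad \text{for all } s \in \mathcal{S},\ a \in \mathcal{A}(s). \]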
The Markov property
In a Markov decision process, the probabilities given by p completely characterize the environment's dynamics. That is, the probability of each possible value for S_t and R_t depends only on the immediately preceding state and action, S_{t-1} and A_{t-1}, and, given them, not at all on earlier states and actions. This is best viewed as a restriction not on the decision process, but on the state. The state must include information about all aspects of the past agent–environment interaction that make a difference for the future. If it does, then the state is said to have the Markov property.
Other formulas that are sometimes useful:
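From the four-argument dynamics p one can compute anything else one might want to know about the environment, for example the state-transition probabilities and the expected rewards for state–action pairs and for state–action–next-state triples:
\[ p(s' \mid s, a) \doteq \Pr\{S_t = s' \mid S_{t-1} = s, A_{t-1} = a\} = \sum_{r \in \mathcal{R}} p(s', r \mid s, a), \]
\[ r(s, a) \doteq \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a] = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p(s', r \mid s, a), \]
\[ r(s, a, s') \doteq \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a, S_t = s'] = \sum_{r \in \mathcal{R}} r\, \frac{p(s', r \mid s, a)}{p(s' \mid s, a)}. \]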
The MDP framework is abstract and flexible and can be applied to many different problems in many different ways. The time steps need not refer to fixed intervals of real time; they can refer to arbitrary successive stages of decision making and acting. The actions can be low-level controls, such as the voltages applied to the motors of a robot arm, or high-level decisions, such as whether or not to have lunch or to go to graduate school. The states, likewise, can take a wide variety of forms. They can be completely determined by low-level sensations, such as direct sensor readings, or they can be more high-level and abstract, such as symbolic descriptions of objects in a room. Some of what makes up a state could be based on memory of past sensations or even be entirely mental or subjective. For example, an agent could be in the state of not being sure where an object is, or of having just been surprised in some clearly defined sense. Similarly, some actions might be totally mental or computational. For example, some actions might control what an agent chooses to think about, or where it focuses its attention. In general, actions can be any decisions we want to learn how to make, and the states can be anything we can know that might be useful in making them.
The boundary between agent and environment: the general rule we follow is that anything that cannot be changed arbitrarily by the agent is considered to be outside of it and thus part of its environment.
The MDP framework is a considerable abstraction of the problem of goal-directed learning from interaction.
Any problem of learning goal-directed behavior can be reduced to three signals passing back and forth between an agent and its environment: one signal to represent the choices made by the agent (the actions), one signal to represent the basis on which the choices are made (the states), and one signal to define the agent's goal (the rewards). This framework may not be sufficient to represent all decision-learning problems usefully, but it has proved to be widely useful and applicable.
Example 3.3 Recycling Robot
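A minimal sketch of the recycling robot's dynamics table written as a Python dictionary. The structure follows the book's Example 3.3 (states high/low, actions search/wait/recharge, parameters alpha, beta, r_search, r_wait); the numeric values assigned to those parameters below are illustrative assumptions, not values from the book.

```python
# Recycling robot MDP as a dynamics table.
# Keys are (state, action); values are lists of (probability, next_state, reward).
alpha, beta = 0.8, 0.6          # prob. the battery stays high / stays low while searching (assumed values)
r_search, r_wait = 2.0, 1.0     # expected rewards for searching and waiting (assumed values)

dynamics = {
    ('high', 'search'):   [(alpha, 'high', r_search), (1 - alpha, 'low', r_search)],
    ('high', 'wait'):     [(1.0, 'high', r_wait)],
    ('low',  'search'):   [(beta, 'low', r_search), (1 - beta, 'high', -3.0)],  # -3: depleted, must be rescued
    ('low',  'wait'):     [(1.0, 'low', r_wait)],
    ('low',  'recharge'): [(1.0, 'high', 0.0)],
}

# Sanity check: each (state, action) entry forms a proper probability distribution.
for sa, outcomes in dynamics.items():
    assert abs(sum(p for p, _, _ in outcomes) - 1.0) < 1e-12, sa
```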
3.2 Goals and Rewards
In reinforcement learning, the purpose or goal of the agent is formalized in terms of a special signal, called the reward, passing from the environment to the agent. At each time step, the reward is a simple number. The agent's goal is to maximize the total amount of reward it receives. This means maximizing not immediate reward, but cumulative reward in the long run. The reward signal is your way of communicating to the robot what you want it to achieve, not how you want it achieved.
3.3 Returns and Episodes
If the sequence of rewards received after time step t is denoted R_{t+1}, R_{t+2}, R_{t+3}, ..., then we seek to maximize the expected return, where the return, denoted G_t, is defined as some specific function of the reward sequence. In the simplest case the return is the sum of the rewards:
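With T the final time step of the episode, this undiscounted return is
\[ G_t \doteq R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T. \]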
When the agent–environment interaction breaks naturally into subsequences, which we call episodes, each episode ends in a special state called the terminal state, followed by a reset to a standard starting state or to a sample from a standard distribution of starting states. The next episode begins independently of how the previous one ended. Thus the episodes can all be considered to end in the same terminal state, with different rewards for the different outcomes. Tasks with episodes of this kind are called episodic tasks. In many cases, however, the agent–environment interaction does not break naturally into identifiable episodes, but goes on continually without limit. For example, this would be the natural way to formulate an on-going process-control task, or an application to a robot with a long life span. We call these continuing tasks.
The discount rate γ determines the present value of future rewards:
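In the discounted setting the agent chooses actions to maximize the expected discounted return (0 ≤ γ ≤ 1), which also satisfies a simple recursion:
\[ G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad G_t = R_{t+1} + \gamma\, G_{t+1}. \]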
3.4 Unified Notation for Episodic and Continuing Tasks
In one kind of task the agent–environment interaction naturally breaks down into a sequence of separate episodes (episodic tasks), and in the other it does not (continuing tasks). These two can be unified by considering episode termination to be the entering of a special absorbing state that transitions only to itself and that generates only rewards of zero.
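With this convention, the return for both kinds of tasks can be written with a single expression, allowing either T = ∞ or γ = 1 (but not both):
\[ G_t \doteq \sum_{k=t+1}^{T} \gamma^{\,k-t-1} R_k. \]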
3.5 Policies and Value Functions
Value functions are functions of states (or of state–action pairs) that estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state). The notion of "how good" here is defined in terms of future rewards that can be expected, or, to be precise, in terms of expected return. Of course the rewards the agent can expect to receive in the future depend on what actions it will take. Accordingly, value functions are defined with respect to particular ways of acting, called policies.
A policy is a mapping from states to probabilities of selecting each possible action. Reinforcement learning methods specify how the agent's policy is changed as a result of its experience.
The value function of a state s under a policy π, denoted v_π(s), is the expected return when starting in s and following π thereafter. For MDPs, we can define v_π formally by:
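Here E_π[·] denotes the expected value of a random variable given that the agent follows policy π:
\[ v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right], \quad \text{for all } s \in \mathcal{S}. \]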
We call the function v_π the state-value function for policy π.
Similarly, we define the value of taking action a in state s under a policy π, denoted q_π(s, a), as the expected return starting from s, taking the action a, and thereafter following policy π:
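That is, with the same notation as above:
\[ q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right]. \]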
We call q_π the action-value function for policy π.
The value functions v_π and q_π can be estimated from experience. For example, if an agent follows policy π and maintains an average, for each state encountered, of the actual returns that have followed that state, then the average will converge to the state's value, v_π(s), as the number of times that state is encountered approaches infinity. If separate averages are kept for each action taken in each state, then these averages will similarly converge to the action values, q_π(s, a). We call estimation methods of this kind Monte Carlo methods because they involve averaging over many random samples of actual returns.
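A minimal Python sketch of that return-averaging idea. It assumes a hypothetical generate_episode(policy) helper that runs one episode and returns a list of (state, reward) steps, where reward is the reward received after leaving that state; this is an illustration, not the book's code.

```python
from collections import defaultdict

def mc_state_values(generate_episode, policy, gamma=0.9, n_episodes=10_000):
    """Estimate v_pi by averaging, for each state, the returns observed after it."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for _ in range(n_episodes):
        episode = generate_episode(policy)      # hypothetical helper: [(state, reward), ...]
        g = 0.0
        # Walk the episode backwards so the return from each state accumulates incrementally.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            returns_sum[state] += g
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```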
Of course, if there are very many states, then it may not be practical to keep separate averages for each state individually. Instead, the agent would have to maintain v_π and q_π as parameterized functions (with fewer parameters than states) and adjust the parameters to better match the observed returns. This can also produce accurate estimates, although much depends on the nature of the parameterized function approximator.
A fundamental property of value functions used throughout reinforcement learning and dynamic programming is that they satisfy recursive relationships similar to the one we have already established for the return. For any policy π and any state s, the following consistency condition holds between the value of s and the value of its possible successor states:
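In the book's notation, with the sums ranging over all actions a, next states s', and rewards r:
\[ v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s',\, r} p(s', r \mid s, a)\bigl[r + \gamma\, v_\pi(s')\bigr], \quad \text{for all } s \in \mathcal{S}. \]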
The formula above is the Bellman equation for v_π.
It expresses a relationship between the value of a state and the values of its successor states. Think of looking ahead from a state to its possible successor states. Starting from state s, the root node at the top of the backup diagram, the agent could take any of some set of actions (three are shown in the diagram) based on its policy π. From each of these, the environment could respond with one of several next states, s' (two are shown in the figure), along with a reward, r, depending on its dynamics given by the function p.
The Bellman equation averages over all the possibilities, weighting each by its probability of occurring.
Gridworld example: code implementation
Implementation:
First, the state values are computed iteratively; a sketch follows.
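A minimal Python sketch of iterative evaluation of the equiprobable random policy on the 5x5 gridworld of Example 3.5 (from states A and B every action teleports to A' and B' with rewards +10 and +5, moves off the grid leave the state unchanged with reward -1, all other moves give reward 0, γ = 0.9). This is my sketch, not the book's code.

```python
import numpy as np

SIZE, GAMMA = 5, 0.9
A, A_PRIME, B, B_PRIME = (0, 1), (4, 1), (0, 3), (2, 3)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]        # north, south, west, east

def step(state, action):
    """One deterministic transition of the gridworld: returns (next_state, reward)."""
    if state == A:
        return A_PRIME, 10.0
    if state == B:
        return B_PRIME, 5.0
    i, j = state[0] + action[0], state[1] + action[1]
    if 0 <= i < SIZE and 0 <= j < SIZE:
        return (i, j), 0.0
    return state, -1.0                              # bumped into the wall

def evaluate_random_policy(theta=1e-4):
    """Iterate the Bellman equation for v_pi under the equiprobable random policy."""
    v = np.zeros((SIZE, SIZE))
    while True:
        new_v = np.zeros_like(v)
        for i in range(SIZE):
            for j in range(SIZE):
                for action in ACTIONS:
                    (ni, nj), r = step((i, j), action)
                    new_v[i, j] += 0.25 * (r + GAMMA * v[ni, nj])
        if np.abs(new_v - v).max() < theta:
            return new_v
        v = new_v

print(np.round(evaluate_random_policy(), 1))
```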
Next, the drawing of the value table is set up; a sketch follows.
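One simple way to render the value array as a table, assuming matplotlib is available; the figure layout choices here are my own.

```python
import matplotlib.pyplot as plt

def draw_value_table(values, filename='gridworld_values.png'):
    """Render a 2-D array of state values as a table image and save it to disk."""
    fig, ax = plt.subplots()
    ax.set_axis_off()
    cell_text = [[f"{v:.1f}" for v in row] for row in values]
    table = ax.table(cellText=cell_text, loc='center', cellLoc='center')
    table.scale(1, 2)                 # make the rows a bit taller for readability
    fig.savefig(filename, bbox_inches='tight')
    plt.close(fig)
```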
Finally, the optimal state values are computed by iterating the update given by the Bellman optimality equation until convergence; a sketch follows.
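A sketch of computing the optimal state values with the same gridworld conventions as above (again my sketch, written to be self-contained).

```python
import numpy as np

SIZE, GAMMA = 5, 0.9
A, A_PRIME, B, B_PRIME = (0, 1), (4, 1), (0, 3), (2, 3)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]        # north, south, west, east

def step(state, action):
    """Same deterministic gridworld transition as in the sketch above."""
    if state == A:
        return A_PRIME, 10.0
    if state == B:
        return B_PRIME, 5.0
    i, j = state[0] + action[0], state[1] + action[1]
    if 0 <= i < SIZE and 0 <= j < SIZE:
        return (i, j), 0.0
    return state, -1.0

def optimal_state_values(theta=1e-4):
    """Iterate v(s) <- max_a [r + gamma * v(s')] until the values stop changing."""
    v = np.zeros((SIZE, SIZE))
    while True:
        new_v = np.zeros_like(v)
        for i in range(SIZE):
            for j in range(SIZE):
                candidates = []
                for action in ACTIONS:
                    (ni, nj), r = step((i, j), action)
                    candidates.append(r + GAMMA * v[ni, nj])
                new_v[i, j] = max(candidates)
        if np.abs(new_v - v).max() < theta:
            return new_v
        v = new_v

print(np.round(optimal_state_values(), 1))
```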
3.6 Optimal Policies and Optimal Value Functions
There is always at least one policy that is better than or equal to all other policies. This is an optimal policy. Although there may be more than one, we denote all the optimal policies by π*. They share the same state-value function, called the optimal state-value function, denoted v*. Optimal policies also share the same optimal action-value function, denoted q*, and defined as:
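In symbols, for all s ∈ S and a ∈ A(s):
\[ v_*(s) \doteq \max_\pi v_\pi(s), \qquad q_*(s, a) \doteq \max_\pi q_\pi(s, a) = \mathbb{E}\bigl[R_{t+1} + \gamma\, v_*(S_{t+1}) \mid S_t = s, A_t = a\bigr]. \]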
The Bellman optimality equation expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action from that state:
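Written out:
\[ v_*(s) = \max_{a} q_{\pi_*}(s, a) = \max_{a} \mathbb{E}\bigl[R_{t+1} + \gamma\, v_*(S_{t+1}) \mid S_t = s, A_t = a\bigr] = \max_{a} \sum_{s',\, r} p(s', r \mid s, a)\bigl[r + \gamma\, v_*(s')\bigr]. \]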
(That is, the maximum action value under an optimal policy.)
The last two equations are two forms of the Bellman optimality equation for v*. The Bellman optimality equation for q* is:
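In the same notation:
\[ q_*(s, a) = \mathbb{E}\Bigl[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \,\Bigm|\, S_t = s, A_t = a\Bigr] = \sum_{s',\, r} p(s', r \mid s, a)\Bigl[r + \gamma \max_{a'} q_*(s', a')\Bigr]. \]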
By means of v*, the optimal expected long-term return is turned into a quantity that is locally and immediately available for each state. Hence, a one-step-ahead search yields the long-term optimal actions.
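A small sketch of that one-step-ahead search. The environment model and the value table are passed in as arguments, so it can be used, for example, with the step function and optimal values from the gridworld sketches above; this is an illustration, not the book's code.

```python
def greedy_action(state, v, actions, step, gamma=0.9):
    """One-step-ahead search: pick the action maximizing r + gamma * v[next_state]."""
    best_action, best_value = None, float('-inf')
    for action in actions:
        next_state, reward = step(state, action)
        value = reward + gamma * v[next_state]
        if value > best_value:
            best_action, best_value = action, value
    return best_action
```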
Many different decision-making methods can be viewed as ways of approximately solving the Bellman optimality equation. The methods of dynamic programming can be related even more closely to the Bellman optimality equation. Many reinforcement learning methods can be clearly understood as approximately solving the Bellman optimality equation, using actual experienced transitions in place of knowledge of the expected transitions. We consider a variety of such methods in the following chapters.
3.7 Optimality and Approximation
It is possible to form these approximations using arrays or tables with one entry for each state (or state–action pair). This we call the tabular case, and the corresponding methods we call tabular methods. In many cases of practical interest, however, there are far more states than could possibly be entries in a table. In these cases the functions must be approximated, using some sort of more compact parameterized function representation. The online nature of reinforcement learning makes it possible to approximate optimal policies in ways that put more effort into learning to make good decisions for frequently encountered states, at the expense of less effort for infrequently encountered states. This is one key property that distinguishes reinforcement learning from other approaches to approximately solving MDPs.
3.8 Summary
Let us summarize the elements of the reinforcement learning problem that we have presented in this chapter. Reinforcement learning is about learning from interaction how to behave in order to achieve a goal. The reinforcement learning agent and its environment interact over a sequence of discrete time steps. The specification of their interface defines a particular task: the actions are the choices made by the agent; the states are the basis for making the choices; and the rewards are the basis for evaluating the choices. Everything inside the agent is completely known and controllable by the agent; everything outside is incompletely controllable but may or may not be completely known.
A policy is a stochastic rule by which the agent selects actions as a function of states. The agent's objective is to maximize the amount of reward it receives over time. When the reinforcement learning setup described above is formulated with well defined transition probabilities it constitutes a Markov decision process (MDP). A finite MDP is an MDP with finite state, action, and (as we formulate it here) reward sets. Much of the current theory of reinforcement learning is restricted to finite MDPs, but the methods and ideas apply more generally.
The return is the function of future rewards that the agent seeks to maximize (in expected value). It has several different definitions depending upon the nature of the task and whether one wishes to discount delayed reward. The undiscounted formulation is appropriate for episodic tasks, in which the agent–environment interaction breaks naturally into episodes; the discounted formulation is appropriate for continuing tasks, in which the interaction does not naturally break into episodes but continues without limit. We try to define the returns for the two kinds of tasks such that one set of equations can apply to both the episodic and continuing cases.
A policy's value functions assign to each state, or state–action pair, the expected return from that state, or state–action pair, given that the agent uses the policy. The optimal value functions assign to each state, or state–action pair, the largest expected return achievable by any policy. A policy whose value functions are optimal is an optimal policy. Whereas the optimal value functions for states and state–action pairs are unique for a given MDP, there can be many optimal policies. Any policy that is greedy with respect to the optimal value functions must be an optimal policy. The Bellman optimality equations are special consistency conditions that the optimal value functions must satisfy and that can, in principle, be solved for the optimal value functions, from which an optimal policy can be determined with relative ease.
A reinforcement learning problem can be posed in a variety of different ways depending on assumptions about the level of knowledge initially available to the agent. In problems of complete knowledge, the agent has a complete and accurate model of the environment's dynamics. If the environment is an MDP, then such a model consists of the complete four-argument dynamics function p. In problems of incomplete knowledge, a complete and perfect model of the environment is not available.
Even if the agent has a complete and accurate environment model, the agent is typically unable to perform enough computation per time step to fully use it. The memory available is also an important constraint. Memory may be required to build up accurate approximations of value functions, policies, and models. In most cases of practical interest there are far more states than could possibly be entries in a table, and approximations must be made. A well-defined notion of optimality organizes the approach to learning we describe in this book and provides a way to understand the theoretical properties of various learning algorithms, but it is an ideal that reinforcement learning agents can only approximate to varying degrees. In reinforcement learning we are very much concerned with cases in which optimal solutions cannot be found but must be approximated in some way.