RL (Chapter 3): Finite Markov Decision Processes

These are reinforcement learning notes, based mainly on Sutton & Barto, Reinforcement Learning: An Introduction (2nd edition), Chapter 3.

Finite MDPs

  • MDPs are a classical formalization of sequential decision making, where actions influence not just immediate rewards but also subsequent situations, or states. Thus MDPs involve delayed reward and the need to trade off immediate and delayed reward.
  • Whereas in bandit problems we estimated the value $q_*(a)$ of each action $a$, in MDPs we estimate the value $q_*(s, a)$ of each action $a$ in each state $s$, or we estimate the value $v_*(s)$ of each state given optimal action selections.

The Agent–Environment Interface


  • MDPs are meant to be a straightforward framing of the problem of learning from interaction to achieve a goal.
    (Figure 3.1: the agent–environment interaction in a Markov decision process.)
  • The MDP and agent together give rise to a sequence or trajectory that begins like this:
    $$S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \dots \tag{3.1}$$

At some time $t$ the agent receives the environment's state $S_t\in\mathcal S$ and, based on $S_t$, selects an action $A_t\in\mathcal A(s)$. At the next time step $t+1$ it receives a reward $R_{t+1}\in\mathcal R$ together with the new state $S_{t+1}$. Repeating this process produces the trajectory above.

  • In a finite MDP, the sets of states, actions, and rewards ($\mathcal S$, $\mathcal A$, and $\mathcal R$) all have a finite number of elements. In this case, the random variables $R_t$ and $S_t$ have well-defined discrete probability distributions that depend only on the preceding state and action. The function $p$ defines the dynamics of the MDP: the probability of each possible value of $S_t$ and $R_t$ depends only on the immediately preceding state and action, $S_{t-1}$ and $A_{t-1}$.
    $$p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\} \tag{3.2}$$
    • $p$ specifies a probability distribution for each choice of $s$ and $a$, that is,
      $$\sum_{s'\in\mathcal S}\sum_{r\in\mathcal R} p(s', r \mid s, a) = 1, \quad \text{for all } s\in\mathcal S,\ a\in\mathcal A(s) \tag{3.3}$$
  • This is best viewed as a restriction not on the decision process, but on the state. The state must include information about all aspects of the past agent–environment interaction that make a difference for the future. If it does, then the state is said to have the Markov property.
    • For example, when you play a game against an opponent whose moves you cannot observe, the state is influenced by both you and the opponent; such a process is not an MDP.

An MDP reduces the problem of goal-directed learning from interaction to three signals passing back and forth between agent and environment:
rewards, actions, and states.
Note, of course, that every state must satisfy the Markov property.


From the four-argument dynamics function $p$ one can compute anything else one might want to know about the environment, such as the quantities below (a small Python sketch after the transition-graph example shows how each can be computed from a stored $p$):

  • the state-transition probabilities:

$$p(s' \mid s, a) \doteq \Pr\{S_t = s' \mid S_{t-1} = s, A_{t-1} = a\} = \sum_{r\in\mathcal R} p(s', r \mid s, a) \tag{3.4}$$

  • the expected rewards for state–action pairs:

$$r(s, a) \doteq \mathbb E[R_t \mid S_{t-1} = s, A_{t-1} = a] = \sum_{r\in\mathcal R} r \sum_{s'\in\mathcal S} p(s', r \mid s, a) \tag{3.5}$$

  • the expected rewards for state–action–next-state triples:

$$r(s, a, s') \doteq \mathbb E[R_t \mid S_{t-1} = s, A_{t-1} = a, S_t = s'] = \sum_{r\in\mathcal R} r\,\frac{p(s', r \mid s, a)}{p(s' \mid s, a)} \tag{3.6}$$


  • e.g., two ways to summarize the dynamics of a finite MDP: a transition table and a transition graph:
    (Figure: a transition table and a transition graph summarizing the dynamics of an example finite MDP.)
    • In the transition graph, there is a state node for each possible state (a large open circle labeled by the name of the state), and an action node for each state–action pair (a small solid circle labeled by the name of the action and connected by a line to the state node). Each arrow corresponds to a triple $(s, s', a)$, and we label the arrow with the transition probability, $p(s' \mid s, a)$, and the expected reward for that transition, $r(s, a, s')$.
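Here is a minimal Python sketch, not from the book, that stores the four-argument dynamics $p(s', r \mid s, a)$ of a small hypothetical MDP as a dictionary and derives $p(s' \mid s, a)$, $r(s, a)$, and $r(s, a, s')$ from it exactly as in (3.4)–(3.6); all state and action names are made up for illustration.

```python
# Hypothetical dynamics p(s', r | s, a) for a tiny two-state MDP,
# stored as {(s, a): {(s_next, r): prob}}.  For each (s, a) pair the
# probabilities must sum to 1, as required by (3.3).
p = {
    ("s0", "a0"): {("s0", 0.0): 0.5, ("s1", 1.0): 0.5},
    ("s0", "a1"): {("s1", 2.0): 1.0},
    ("s1", "a0"): {("s0", 0.0): 1.0},
}

def state_transition_prob(s, a, s_next):
    """p(s'|s,a) = sum_r p(s', r | s, a)  -- equation (3.4)."""
    return sum(prob for (sn, r), prob in p[(s, a)].items() if sn == s_next)

def expected_reward(s, a):
    """r(s,a) = sum_{s', r} r * p(s', r | s, a)  -- equation (3.5)."""
    return sum(r * prob for (sn, r), prob in p[(s, a)].items())

def expected_reward_triple(s, a, s_next):
    """r(s,a,s') = sum_r r * p(s', r | s, a) / p(s' | s, a)  -- equation (3.6)."""
    denom = state_transition_prob(s, a, s_next)
    num = sum(r * prob for (sn, r), prob in p[(s, a)].items() if sn == s_next)
    return num / denom if denom > 0 else 0.0

print(state_transition_prob("s0", "a0", "s1"))   # 0.5
print(expected_reward("s0", "a0"))               # 0.5
print(expected_reward_triple("s0", "a0", "s1"))  # 1.0
```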

Goals and Rewards

(The reward hypothesis: that all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal, called the reward.)


The reward signal is your way of communicating to the agent what you want achieved, not how you want it achieved.

  • It is thus critical that the rewards we set up truly indicate what we want accomplished. In particular, the reward signal is not the place to impart to the agent prior knowledge about how to achieve what we want it to do. (Better places for imparting this kind of prior knowledge are the initial policy or the initial value function.)
    • For example, a chess-playing agent should be rewarded only for actually winning, not for achieving subgoals such as taking its opponent's pieces or gaining control of the center of the board. If achieving these sorts of subgoals were rewarded, then the agent might find a way to achieve them without achieving the real goal. For example, it might find a way to take the opponent's pieces even at the cost of losing the game.

Returns and Episodes


  • In general, we seek to maximize the expected return $G_t$, which is defined as some specific function of the reward sequence.

Episodic tasks

  • In the simplest case the return is the sum of the rewards:
    $$G_t \doteq R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T \tag{3.7}$$
    where $T$ is a final time step.
  • This approach makes sense in applications in which there is a natural notion of final time step, that is, when the agent–environment interaction breaks naturally into subsequences, which we call episodes. Each episode ends in a terminal state, followed by a reset to a standard starting state or to a sample from a standard distribution of starting states. Thus the episodes can all be considered to end in the same terminal state, with different rewards for the different outcomes.

In episodic tasks we sometimes need to distinguish the set of all nonterminal states, denoted $\mathcal S$, from the set of all states plus the terminal state, denoted $\mathcal S^+$.


Continuing tasks

  • On the other hand, in many cases the agent–environment interaction does not break naturally into identifiable episodes, but goes on continually without limit.
  • The return formulation (3.7) is problematic for continuing tasks because the final time step would be $T = \infty$, and the return could easily be infinite.
  • Thus, in this book we usually use a definition of return that is slightly more complex conceptually but much simpler mathematically.

Discounted return:

  • The agent tries to select actions so that the sum of the discounted rewards it receives over the future is maximized:
    $$G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty}\gamma^k R_{t+k+1} \tag{3.8}$$
    where $0\leq\gamma\leq1$ is called the discount rate.
    • If $\gamma < 1$, the infinite sum in (3.8) has a finite value as long as the reward sequence $\{R_k\}$ is bounded.
    • If $\gamma = 0$, the agent is "myopic," concerned only with maximizing immediate rewards.
    • As $\gamma$ approaches 1, the return objective takes future rewards into account more strongly; the agent becomes more farsighted.
  • Returns at successive time steps are related to each other as follows:
    $$G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \cdots = R_{t+1} + \gamma\bigl(R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + \cdots\bigr) = R_{t+1} + \gamma G_{t+1} \tag{3.9}$$
    • Note that this works for all time steps $t < T$, even if termination occurs at $t+1$, provided we define $G_T = 0$. This often makes it easy to compute returns from reward sequences.

Note that although the return (3.8) is a sum of an infinite number of terms, it is still finite if the reward is a nonzero constant and $\gamma < 1$. For example, if the reward is a constant $+1$, then the return is
$$G_t = \sum_{k=0}^{\infty}\gamma^k = \frac{1}{1-\gamma} \tag{3.10}$$
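The recursion (3.9) also gives a convenient way to compute returns in code: sweep backward over a reward sequence. The following is a minimal sketch (the reward values are made up) that does this and checks the constant-reward case against (3.10):

```python
# Compute discounted returns for a finite reward sequence by sweeping backward
# with G_t = R_{t+1} + gamma * G_{t+1}, i.e. equation (3.9), using G_T = 0.
def discounted_returns(rewards, gamma):
    """rewards[t] is R_{t+1}; returns [G_0, G_1, ..., G_{T-1}]."""
    G = 0.0
    returns = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        returns[t] = G
    return returns

gamma = 0.9
print(discounted_returns([1.0, 0.0, 2.0], gamma))  # G_0 = 1 + 0.9*0 + 0.81*2 = 2.62

# With a long run of constant +1 rewards, G_0 approaches 1 / (1 - gamma) = 10, as in (3.10).
print(discounted_returns([1.0] * 200, gamma)[0])   # ~10.0
print(1 / (1 - gamma))                             # 10.0
```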


Example 3.4: Pole-Balancing

  • The objective in this task is to apply forces to a cart moving along a track so as to keep a pole hinged to the cart from falling over: A failure is said to occur if the pole falls past a given angle from vertical or if the cart runs off the track. The pole is reset to vertical after each failure.

(Figure: the pole-balancing (cart-pole) task.)

  • This task could be treated as episodic, where the natural episodes are the repeated attempts to balance the pole. The reward in this case could be $+1$ for every time step on which failure did not occur, so that the return at each time would be the number of steps until failure. In this case, successful balancing forever would mean a return of infinity.
  • Alternatively, we could treat pole-balancing as a continuing task, using discounting. In this case the reward would be $-1$ on each failure and zero at all other times. The return at each time would then be related to $-\gamma^{K-1}$, where $K$ is the number of time steps before failure (see the quick check after this list).
  • In either case, the return is maximized by keeping the pole balanced for as long as possible.
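A quick numerical check of the continuing-task formulation (assuming, as above, a reward of $-1$ at the failure step and zero otherwise): the discounted return $-\gamma^{K-1}$ moves toward $0$ as the number of steps before failure $K$ grows, so longer balancing always yields a higher return.

```python
# Return under the continuing formulation: -gamma**(K-1), where K is the number
# of time steps before failure.  Larger K gives a less negative (higher) return.
gamma = 0.9
for K in (1, 10, 100):
    print(K, -gamma ** (K - 1))   # -1.0, about -0.387, about -3e-05
```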

Exercise 3.7

Imagine that you are designing a robot to run a maze. You decide to give it a reward of $+1$ for escaping from the maze and a reward of zero at all other times. You decide to treat it as an episodic task, where the goal is to maximize expected total reward (3.7). After running the learning agent for a while, you find that it is showing no improvement in escaping from the maze. What is going wrong? Have you effectively communicated to the agent what you want it to achieve?

ANSWER

Because the task is undiscounted and the only nonzero reward is the $+1$ for escaping, every episode that ends in escape has return exactly $+1$ no matter how many steps it takes. The return therefore carries no information about how quickly the robot escapes, so there is nothing for it to improve. To communicate the goal of escaping quickly, one could instead give a reward of $-1$ on every time step before escape (or use discounting).

Unified Notation for Episodic and Continuing Tasks


  • $S_{t,i}$: the state representation at time $t$ of episode $i$
    (and similarly for $A_{t,i}$, $R_{t,i}$, $\pi_{t,i}$, $T_i$, etc.).

  • In fact, when we discuss episodic tasks we almost never have to distinguish between different episodes. We are almost always considering a particular episode, or stating something that is true for all episodes.
  • Accordingly, in practice we write $S_t$ to refer to $S_{t,i}$, and so on.

  • We have defined the return as a sum over a finite number of terms in one case (3.7) and as a sum over an infinite number of terms in the other (3.8). These two can be unified by considering episode termination to be the entering of a special absorbing state that transitions only to itself and that generates only rewards of zero.
    • For example, consider the state transition diagram:
      (Figure: a state-transition diagram in which episode termination enters an absorbing state that loops back to itself with zero reward.) Summing the reward sequence $+1, +1, +1, 0, 0, 0, \dots$, we get the same return whether we sum over the first $T$ rewards (here $T = 3$) or over the full infinite sequence. This remains true even if we introduce discounting (see the quick check after this list).
  • Thus, we can define the return, in general, according to (3.8):
    $$G_t \doteq \sum_{k=0}^{\infty}\gamma^k R_{t+k+1}$$
    Alternatively, we can write
    $$G_t \doteq \sum_{k=t+1}^{T}\gamma^{k-t-1}R_k \tag{3.11}$$
    including the possibility that $T = \infty$ or $\gamma = 1$ (but not both).
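A quick check of this unification (the three $+1$ rewards are taken from the example above; padding with zeros stands in for the absorbing state):

```python
# The episodic sum over the first T = 3 rewards equals the "continuing" discounted
# sum over the sequence padded with the absorbing state's zero rewards.
gamma = 0.5
episode = [1.0, 1.0, 1.0]
padded = episode + [0.0] * 100            # absorbing state: zero reward forever
g_episodic = sum(gamma**k * r for k, r in enumerate(episode))
g_continuing = sum(gamma**k * r for k, r in enumerate(padded))
print(g_episodic, g_continuing)           # both 1.75
```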

Policies and Value Functions


  • Value functions: functions of states (or of state–action pairs) that estimate how good it is for the agent to be in a given state.
    • Of course the rewards the agent can expect to receive in the future depend on what actions it will take. Accordingly, value functions are defined with respect to particular ways of acting, called policies.
  • Policy: a mapping from states to probabilities of selecting each possible action.
    • If the agent is following policy $\pi$ at time $t$, then $\pi(a \mid s)$ is the probability that $A_t = a$ if $S_t = s$ (a small sketch of such a tabular policy follows below).
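As a concrete picture of a stochastic policy, here is a minimal Python sketch (the states and actions are made up) that stores $\pi(a \mid s)$ as nested dictionaries and samples $A_t \sim \pi(\cdot \mid S_t)$:

```python
import random

# A tabular stochastic policy pi(a|s): for each state, a dict of action -> probability.
policy = {
    "s0": {"left": 0.3, "right": 0.7},
    "s1": {"left": 1.0},                 # deterministic in s1
}

def sample_action(pi, state, rng=random):
    """Draw an action a with probability pi[state][a]."""
    actions, probs = zip(*pi[state].items())
    return rng.choices(actions, weights=probs, k=1)[0]

print(sample_action(policy, "s0"))       # "right" about 70% of the time
```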

Exercise 3.11

If the current state is $S_t$, and actions are selected according to a stochastic policy $\pi$, then what is the expectation of $R_{t+1}$ in terms of $\pi$ and the four-argument function $p$ (3.2)?

ANSWER

$$\mathbb E_\pi[R_{t+1} \mid S_t = s] = \sum_a \pi(a \mid s)\sum_{s'}\sum_{r} r\, p(s', r \mid s, a) = \sum_a \pi(a \mid s)\, r(s, a)$$


  • The value function of a state $s$ under a policy $\pi$, denoted $v_\pi(s)$, is the expected return when starting in $s$ and following $\pi$ thereafter. We call the function $v_\pi$ the state-value function for policy $\pi$.
  • For MDPs, we can define $v_\pi$ formally by
    $$v_\pi(s) \doteq \mathbb E_\pi[G_t \mid S_t = s] = \mathbb E_\pi\!\left[\sum_{k=0}^{\infty}\gamma^k R_{t+k+1} \,\middle|\, S_t = s\right], \quad \text{for all } s\in\mathcal S \tag{3.12}$$
    where $\mathbb E_\pi[\cdot]$ denotes the expected value of a random variable given that the agent follows policy $\pi$.
  • Similarly, we define the value of taking action $a$ in state $s$ under a policy $\pi$, denoted $q_\pi(s, a)$, as the expected return starting from $s$, taking the action $a$, and thereafter following $\pi$. We call $q_\pi$ the action-value function for policy $\pi$:
    $$q_\pi(s, a) \doteq \mathbb E_\pi[G_t \mid S_t = s, A_t = a] = \mathbb E_\pi\!\left[\sum_{k=0}^{\infty}\gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right] \tag{3.13}$$
    (A small Monte Carlo sketch of estimating $v_\pi$ follows below.)
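Definition (3.12) can be estimated directly by sampling: average the discounted return over many episodes that start in $s$ and follow $\pi$. The following Monte Carlo sketch uses a made-up two-state episodic MDP and a trivial policy, purely for illustration:

```python
import random

# Hypothetical dynamics: (s, a) -> list of ((s_next, reward), probability); "T" is terminal.
p = {
    ("s0", "go"): [(("s1", 1.0), 0.8), (("T", 0.0), 0.2)],
    ("s1", "go"): [(("T", 5.0), 1.0)],
}
pi = {"s0": {"go": 1.0}, "s1": {"go": 1.0}}   # only one action here, for brevity
gamma = 0.9

def run_episode(s, rng):
    """Sample one episode starting from s under pi; return its discounted return G."""
    G, discount = 0.0, 1.0
    while s != "T":
        actions, probs = zip(*pi[s].items())
        a = rng.choices(actions, weights=probs, k=1)[0]
        outcomes, out_probs = zip(*p[(s, a)])
        s, r = rng.choices(outcomes, weights=out_probs, k=1)[0]
        G += discount * r
        discount *= gamma
    return G

rng = random.Random(0)
n = 100_000
print(sum(run_episode("s0", rng) for _ in range(n)) / n)  # approximately v_pi(s0)
# Exact value for comparison: 0.8 * (1 + 0.9 * 5) = 4.4
```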

Exercise 3.12

Give an equation for $v_\pi$ in terms of $q_\pi$ and $\pi$.

ANSWER
$$\begin{aligned} v_\pi(s) &= \mathbb E_\pi[G_t \mid S_t = s] \\ &= \sum_a \pi(a \mid s)\,\mathbb E_\pi[G_t \mid S_t = s, A_t = a] \\ &= \sum_a \pi(a \mid s)\, q_\pi(s, a) \end{aligned}$$


Exercise 3.13

Give an equation for $q_\pi$ in terms of $v_\pi$ and the four-argument $p$.

ANSWER
$$\begin{aligned} q_\pi(s,a) &= \mathbb E_\pi[G_t \mid S_t = s, A_t = a] \\ &= \mathbb E[R_{t+1} \mid S_t = s, A_t = a] + \gamma\,\mathbb E_\pi[G_{t+1} \mid S_t = s, A_t = a] \\ &= \sum_{s',r} p(s', r \mid s, a)\, r + \gamma \sum_{s',r} p(s', r \mid s, a)\,\mathbb E_\pi[G_{t+1} \mid S_{t+1} = s'] \\ &= \sum_{s',r} p(s', r \mid s, a)\, r + \gamma \sum_{s',r} p(s', r \mid s, a)\, v_\pi(s') \\ &= \sum_{s',r} p(s', r \mid s, a)\bigl[r + \gamma v_\pi(s')\bigr] \end{aligned}$$


Bellman equation

  • A fundamental property of value functions is that they satisfy recursive relationships:
    $$v_\pi(s) \doteq \mathbb E_\pi[G_t \mid S_t = s] = \mathbb E_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s] = \sum_a \pi(a \mid s)\sum_{s'}\sum_{r} p(s', r \mid s, a)\bigl[r + \gamma v_\pi(s')\bigr], \quad \text{for all } s\in\mathcal S \tag{3.14}$$
    It is really a sum over all values of the three variables, $a$, $s'$, and $r$. For each triple, we compute its probability, $\pi(a \mid s)\,p(s', r \mid s, a)$, weight the quantity in brackets by that probability, then sum over all possibilities to get an expected value.
  • Equation (3.14) is the Bellman equation for $v_\pi$. It expresses a relationship between the value of a state and the values of its successor states.

The existence and uniqueness of $v_\pi$ are guaranteed as long as either $\gamma < 1$ or eventual termination is guaranteed from all states under the policy $\pi$.
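In the finite case, the Bellman equation (3.14) can be solved numerically by treating it as a fixed-point equation and iterating (iterative policy evaluation). The MDP below is a made-up example, not one from the book:

```python
# Iterative policy evaluation: repeatedly apply the right-hand side of (3.14)
# until the values stop changing.  p maps (s, a) to ((s_next, reward), prob) pairs;
# "T" denotes the terminal state, whose value is fixed at 0.
p = {
    ("s0", "stay"):  [(("s0", 1.0), 0.9), (("T", 0.0), 0.1)],
    ("s0", "leave"): [(("s1", 0.0), 1.0)],
    ("s1", "stay"):  [(("s1", 2.0), 0.5), (("T", 0.0), 0.5)],
}
pi = {"s0": {"stay": 0.5, "leave": 0.5}, "s1": {"stay": 1.0}}
gamma = 0.9

def policy_evaluation(p, pi, gamma, tol=1e-10):
    v = {s: 0.0 for s in pi}
    v["T"] = 0.0
    while True:
        delta = 0.0
        for s in pi:
            new_v = sum(
                pi[s][a] * prob * (r + gamma * v[s_next])   # (3.14)
                for a in pi[s]
                for (s_next, r), prob in p[(s, a)]
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v

print(policy_evaluation(p, pi, gamma))   # {'s0': ..., 's1': ..., 'T': 0.0}
```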


Backup diagram:
(Figure: the backup diagram for $v_\pi$.)

  • Starting from state s s s, the agent could take any of some set of actions—three are shown in the diagram—based on its policy π \pi π. From each of these, the environment could respond with one of several next states, s ′ s' s (two are shown in the figure), along with a reward, r r r, depending on its dynamics given by the function p p p.
  • The Bellman equation (3.14) averages over all the possibilities, weighting each by its probability of occurring. It states that the value of the start state must equal the (discounted) value of the expected next state, plus the reward expected along the way.
  • We call diagrams like the one above backup diagrams because they diagram relationships that form the basis of the update or backup operations. These operations transfer value information back to a state (or a state–action pair) from its successor states (or state–action pairs).

Note that, unlike transition graphs, the state nodes of backup diagrams do not necessarily represent distinct states; for example, a state might be its own successor.


Exercise 3.17

What is the Bellman equation for action values, that is, for $q_\pi$?

ANSWER

$$q_\pi(s, a) = \sum_{s'}\sum_{r} p(s', r \mid s, a)\Bigl[r + \gamma\sum_{a'}\pi(a' \mid s')\,q_\pi(s', a')\Bigr]$$


Exercise 3.18

The value of a state depends on the values of the actions possible in that state and on how likely each action is to be taken under the current policy.

(Backup diagram: the state $s$ at the root, with the actions available under $\pi$ as leaf nodes.)
Give the equation corresponding to this intuition and diagram for the value at the root node, $v_\pi(s)$, in terms of the value at the expected leaf node, $q_\pi(s, a)$, given $S_t = s$.

ANSWER

$$v_\pi(s) = \mathbb E_\pi[q_\pi(S_t, A_t) \mid S_t = s] = \sum_a \pi(a \mid s)\,q_\pi(s, a)$$


Exercise 3.19

The value of an action, q π ( s , a ) q_\pi(s, a) qπ(s,a), depends on the expected next reward and the expected sum of the remaining rewards.

(Backup diagram: the state–action pair $(s, a)$ at the root, with the possible next states $s'$ as leaf nodes.)
Give the equation corresponding to this intuition and diagram for the action value, $q_\pi(s, a)$, in terms of the expected next reward, $R_{t+1}$, and the expected next state value, $v_\pi(S_{t+1})$, given that $S_t = s$ and $A_t = a$.

ANSWER

$$q_\pi(s, a) = \mathbb E[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a] = \sum_{s'}\sum_{r} p(s', r \mid s, a)\bigl[r + \gamma v_\pi(s')\bigr]$$

Optimal Policies and Optimal Value Functions

  • A policy $\pi$ is defined to be better than or equal to a policy $\pi'$ if its expected return is greater than or equal to that of $\pi'$ for all states. In other words, $\pi \geq \pi'$ if and only if $v_\pi(s) \geq v_{\pi'}(s)$ for all $s\in\mathcal S$. (This is in fact a partial order on policies.)
  • There is always at least one policy that is better than or equal to all other policies. This is an optimal policy. We denote all the optimal policies by $\pi_*$. They share the same state-value function, called the optimal state-value function, denoted $v_*$, and defined as
    $$v_*(s) = \max_\pi v_\pi(s), \quad \text{for all } s\in\mathcal S \tag{3.15}$$

  • The following is the Bellman equation for $v_*$, or the Bellman optimality equation. Intuitively, it expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action from that state:
    $$\begin{aligned} v_*(s) &= \max_{a\in\mathcal A(s)} q_{\pi_*}(s, a) \\ &= \max_a \mathbb E_{\pi_*}[G_t \mid S_t = s, A_t = a] \\ &= \max_a \mathbb E_{\pi_*}[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a] \\ &= \max_a \mathbb E[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a] \qquad (3.18) \\ &= \max_a \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma v_*(s')\bigr] \qquad (3.19) \end{aligned}$$
    The last two equations are two forms of the Bellman optimality equation for $v_*$.
  • The Bellman optimality equation for $q_*$ is
    $$q_*(s, a) = \mathbb E\Bigl[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \,\Big|\, S_t = s, A_t = a\Bigr] = \sum_{s', r} p(s', r \mid s, a)\Bigl[r + \gamma \max_{a'} q_*(s', a')\Bigr] \tag{3.20}$$
    (Figure: backup diagrams for $v_*$ and $q_*$.)
    • These are the same as the backup diagrams for $v_\pi$ and $q_\pi$ presented earlier, except that arcs have been added at the agent's choice points to represent that the maximum over that choice is taken rather than the expected value given some policy.

  • For finite MDPs, the Bellman optimality equation for $v_*$ (3.19) has a unique solution. The Bellman optimality equation is actually a system of equations, one for each state, so if there are $n$ states, then there are $n$ equations in $n$ unknowns. If the dynamics $p$ of the environment are known, then in principle one can solve this system of equations for $v_*$ using any one of a variety of methods for solving systems of nonlinear equations. One can solve a related set of equations for $q_*$. (A small value-iteration sketch below shows one way to do this numerically.)
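One standard way to solve (3.19) numerically is to iterate it as a fixed-point update (value iteration) and then read off a greedy policy. The sketch below uses a made-up two-state MDP; it is an illustration of the idea, not the book's pseudocode:

```python
# Value iteration: repeatedly apply the right-hand side of (3.19) until convergence,
# then extract a greedy (hence optimal) policy from the resulting v_*.
p = {  # (s, a) -> list of ((s_next, reward), probability)
    ("s0", "a0"): [(("s0", 0.0), 0.5), (("s1", 2.0), 0.5)],
    ("s0", "a1"): [(("s1", 1.0), 1.0)],
    ("s1", "a0"): [(("s0", 0.0), 1.0)],
    ("s1", "a1"): [(("s1", 3.0), 1.0)],
}
actions = {"s0": ["a0", "a1"], "s1": ["a0", "a1"]}
gamma = 0.9

def q_from_v(s, a, v):
    """One-step lookahead: sum_{s', r} p(s', r | s, a) [r + gamma * v(s')]."""
    return sum(prob * (r + gamma * v[s_next]) for (s_next, r), prob in p[(s, a)])

def value_iteration(tol=1e-10):
    v = {s: 0.0 for s in actions}
    while True:
        delta = 0.0
        for s in actions:
            new_v = max(q_from_v(s, a, v) for a in actions[s])   # (3.19)
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v

v_star = value_iteration()
pi_star = {s: max(actions[s], key=lambda a: q_from_v(s, a, v_star)) for s in actions}
print(v_star)    # approximate v_*
print(pi_star)   # a policy that is greedy with respect to v_*
```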

Example 3.9: Bellman Optimality Equations for the Recycling Robot

  • The robot has two states, high and low battery level. In the high state it can actively search for cans or wait for someone to bring it a can; in the low state it can also choose to go back and recharge. If it runs its battery down while searching, it receives a negative reward (the robot is then rescued and recharged, so the next state is high); otherwise it receives the reward corresponding to its action.
    (Figure: the transition graph for the recycling robot example.)
  • Using (3.19), we can explicitly give the Bellman optimality equation for the recycling robot example. To make things more compact, we abbreviate the states high and low, and the actions search, wait, and recharge, respectively, by h, l, s, w, and re. Because there are only two states, the Bellman optimality equation consists of two equations.
    $$v_*(\text h) = \max\Bigl\{\, r_{\text s} + \gamma\bigl[\alpha\, v_*(\text h) + (1-\alpha)\, v_*(\text l)\bigr],\;\; r_{\text w} + \gamma v_*(\text h) \,\Bigr\}$$
    $$v_*(\text l) = \max\Bigl\{\, \beta r_{\text s} - 3(1-\beta) + \gamma\bigl[(1-\beta)\, v_*(\text h) + \beta\, v_*(\text l)\bigr],\;\; r_{\text w} + \gamma v_*(\text l),\;\; \gamma v_*(\text h) \,\Bigr\}$$
    For any choice of $r_{\text s}$, $r_{\text w}$, $\alpha$, $\beta$, and $\gamma$, with $0 \leq \gamma < 1$ and $0 \leq \alpha, \beta \leq 1$, there is exactly one pair of numbers, $v_*(\text h)$ and $v_*(\text l)$, that simultaneously satisfy these two nonlinear equations (a numerical illustration follows below).
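To see the "exactly one pair" claim concretely, one can pick some parameter values (the numbers below are assumptions chosen only for illustration) and iterate the two right-hand sides until they stop changing:

```python
# Fixed-point iteration on the two Bellman optimality equations above,
# with hypothetical parameter values (not from the book).
r_s, r_w = 2.0, 0.5          # assumed rewards for searching and waiting
alpha, beta = 0.8, 0.6       # assumed probabilities of keeping enough charge
gamma = 0.9

v_h = v_l = 0.0
for _ in range(10_000):
    v_h, v_l = (
        max(r_s + gamma * (alpha * v_h + (1 - alpha) * v_l),
            r_w + gamma * v_h),
        max(beta * r_s - 3 * (1 - beta) + gamma * ((1 - beta) * v_h + beta * v_l),
            r_w + gamma * v_l,
            gamma * v_h),
    )

print(v_h, v_l)              # the unique solution for these parameter values
```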

  • Once one has $v_*$, it is relatively easy to determine an optimal policy.
    • For each state $s$, there will be one or more actions at which the maximum is obtained in the Bellman optimality equation. Any policy that assigns nonzero probability only to these actions is an optimal policy. You can think of this as a one-step search: if you have the optimal value function $v_*$, then the actions that appear best after a one-step search will be optimal actions.
    • Another way of saying this is that any policy that is greedy with respect to the optimal evaluation function $v_*$ is an optimal policy.
    • The beauty of $v_*$ is that if one uses it to evaluate the short-term consequences of actions (specifically, the one-step consequences), then a greedy policy is actually optimal in the long-term sense in which we are interested, because $v_*$ already takes into account the reward consequences of all possible future behavior.
  • Having $q_*$ makes choosing optimal actions even easier. With $q_*$, the agent does not even have to do a one-step-ahead search: for any state $s$, it can simply find any action that maximizes $q_*(s, a)$.

  • Explicitly solving the Bellman optimality equation provides one route to finding an optimal policy, and thus to solving the reinforcement learning problem. However, this solution is rarely directly useful. It is akin to an exhaustive search, looking ahead at all possibilities, computing their probabilities of occurrence and their desirabilities in terms of expected rewards. This solution relies on at least three assumptions that are rarely true in practice:
    • the dynamics of the environment are accurately known;
    • computational resources are sufficient to complete the calculation (in particular, extensive memory may be required to build up accurate approximations of value functions, policies, and models; in most cases of practical interest there are far more states than could possibly be entries in a table, and approximations must be made);
    • the states have the Markov property.

  • In reinforcement learning one typically has to settle for approximate solutions. Many reinforcement learning methods can be clearly understood as approximately solving the Bellman optimality equation, using actual experienced transitions in place of knowledge of the expected transitions.
    • For example, heuristic search methods can be viewed as expanding the right-hand side of (3.19) several times, up to some depth, forming a "tree" of possibilities, and then using a heuristic evaluation function to approximate $v_*$ at the "leaf" nodes.

Exercise 3.22

Consider the continuing MDP shown below. The only decision to be made is in the top state, where two actions are available, left and right. The numbers show the rewards that are received deterministically after each action. There are exactly two deterministic policies, $\pi_{\text{left}}$ and $\pi_{\text{right}}$. Which policy is optimal if $\gamma = 0$? If $\gamma = 0.9$? If $\gamma = 0.5$?

(Figure: the continuing MDP of Exercise 3.22, with a top state and the two actions left and right.)
ANSWER

From the top state, $\pi_{\text{left}}$ produces the reward sequence $+1, 0, +1, 0, \dots$ and $\pi_{\text{right}}$ produces $0, +2, 0, +2, \dots$, so
$$v_{\pi_{\text{left}}}(\text{top}) = \frac{1}{1-\gamma^2}, \qquad v_{\pi_{\text{right}}}(\text{top}) = \frac{2\gamma}{1-\gamma^2}.$$
For $\gamma = 0$: $1 > 0$, so $\pi_{\text{left}}$ is optimal. For $\gamma = 0.9$: $1/(1-\gamma^2) \approx 5.26 < 2\gamma/(1-\gamma^2) \approx 9.47$, so $\pi_{\text{right}}$ is optimal. For $\gamma = 0.5$: both returns equal $4/3$, so both policies are optimal.

  • Based on the return formulas above (compare $1$ with $2\gamma$), $\gamma = 0.5$ is the borderline: if $\gamma > 0.5$, right is optimal; if $\gamma < 0.5$, left is optimal; if $\gamma = 0.5$, both are optimal (the quick check below confirms the three cases).
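A quick numerical check of the answer (assuming the reward sequences $1, 0, 1, 0, \dots$ for $\pi_{\text{left}}$ and $0, 2, 0, 2, \dots$ for $\pi_{\text{right}}$ from the top state, as above):

```python
# Closed-form returns from the top state for the two deterministic policies.
def v_left(gamma):
    return 1 / (1 - gamma**2)

def v_right(gamma):
    return 2 * gamma / (1 - gamma**2)

for gamma in (0.0, 0.5, 0.9):
    print(gamma, v_left(gamma), v_right(gamma))
# 0.0 -> 1.0   vs 0.0    (left optimal)
# 0.5 -> 1.33  vs 1.33   (tie)
# 0.9 -> 5.26  vs 9.47   (right optimal)
```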