Finite Markov Decision Processes

A Markov process is a memoryless random process, i.e. a sequence of random states $S_1, S_2, \dots$ with the Markov property.

  • Definition
    A Markov Process (or Markov Chain) is a tuple $(\mathcal{S}, \mathcal{P})$:
    • $\mathcal{S}$ is a (finite) set of states.
    • $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}_{ss'} = \Pr[S_{t+1} = s' \mid S_t = s]$.
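To make the definition concrete, here is a small sketch (my own example, not from the source) that simulates a chain by repeatedly sampling the next state from the row of $\mathcal{P}$ for the current state:

```python
import numpy as np

# Hypothetical 3-state Markov chain: row s of P gives Pr[S_{t+1} = s' | S_t = s].
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.5, 0.5],
              [0.2, 0.0, 0.8]])

def sample_chain(P, start, steps, seed=0):
    """Sample a trajectory S_0, S_1, ..., S_steps from the chain."""
    rng = np.random.default_rng(seed)
    states = [start]
    for _ in range(steps):
        # Markov property: the next state depends only on the current state.
        states.append(int(rng.choice(len(P), p=P[states[-1]])))
    return states

print(sample_chain(P, start=0, steps=10))
```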

Markov Decision Processes are a classical formalization of sequential decision making. MDPs are a mathematically idealized form of the reinforcement learning problem for which precise theoretical statements can be made. As in all of artificial intelligence, there is a tension between breadth of applicability and mathematical tractability.

An MDP formally describes an environment for RL in which the environment is fully observable.

1. The Agent-Environment Interface

The learner and decision maker is called the agent.
The thing it interacts with, comprising everything outside the agent, is called the environment.
(We use the terms agent, environment, and action instead of the engineers’ terms controller, controlled system (or plant), and control signal because they are meaningful to a wider audience.)

In particular, the boundary between agent and environment is typically not the same as the physical boundary of a robot’s or animal’s body. The general rule we follow is that anything that cannot be changed arbitrarily by the agent is considered to be outside of it and thus part of its environment.



Figure 1: The agent-environment interaction in a Markov decision process.

The probabilities given by the four-argument function $p$ completely characterize the dynamics of a finite MDP:

$$p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}$$

From it, we can compute anything else one might want to know about the environment.
For example (written out in the formulas just after this list):

  • the state-transition probabilities
  • the expected rewards for state-action pairs
  • the expected rewards for state-action-next-state triples
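Concretely, each of these follows from the four-argument $p$ (these are the standard definitions, written in the same notation as the dynamics function above):

$$p(s' \mid s, a) \doteq \Pr\{S_t = s' \mid S_{t-1} = s, A_{t-1} = a\} = \sum_{r} p(s', r \mid s, a)$$

$$r(s, a) \doteq \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a] = \sum_{r} r \sum_{s'} p(s', r \mid s, a)$$

$$r(s, a, s') \doteq \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a, S_t = s'] = \sum_{r} r\, \frac{p(s', r \mid s, a)}{p(s' \mid s, a)}$$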

The MDP framework is a considerable abstraction of the problem of goal-directed learning from interaction. Any problem of learning goal-directed behavior can be reduced to three signals passing back and forth between an agent and its environment: one signal to represent the choices made by the agent (the actions), one signal to represent the basis on which the choices are made (the states), and one signal to define the agent’s goal (the rewards).

2. Goals and Rewards

All of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called the reward).

It is thus critical that the rewards we set up truly indicate what we want accomplished. In particular, the reward signal is not the place to impart to the agent prior knowledge about how to achieve what we want it to do.

3. Returns and Episodes

In general, we seek to maximize the expected return, where the return, denoted $G_t$, is defined as some specific function of the reward sequence.

$$G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

The discount rate determines the present value of future rewards: a reward received $k$ time steps in the future is worth only $\gamma^{k-1}$ times what it would be worth if it were received immediately.
If $\gamma < 1$, the infinite sum above has a finite value as long as the reward sequence $\{R_k\}$ is bounded.
If $\gamma = 0$, the agent is “myopic” in being concerned only with maximizing immediate rewards.

Returns at successive time steps are related to each other in a way that is important for the theory and algorithms of reinforcement learning:

$$G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = R_{t+1} + \gamma \left( R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + \cdots \right) = R_{t+1} + \gamma G_{t+1}$$
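This recursion also gives a simple backward pass for computing every return of a finite episode from its reward sequence. A minimal sketch (my own example, not from the text):

```python
def discounted_returns(rewards, gamma=0.9):
    """Compute G_t for every step of a finite episode via G_t = R_{t+1} + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    g = 0.0  # the return after the terminal step is zero
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g   # rewards[t] plays the role of R_{t+1}
        returns[t] = g
    return returns

print(discounted_returns([1.0, 0.0, 0.0, 5.0], gamma=0.9))  # ~[4.645, 4.05, 4.5, 5.0]
```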

5. Policies and Value Functions

Almost all reinforcement learning algorithms involve estimating value functions: functions of states (or of state-action pairs) that estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state).

Formally, a policy is a mapping from states to probabilities of selecting each possible action.

$$\pi(a \mid s) = \Pr(A_t = a \mid S_t = s)$$
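For a finite MDP, such a policy can be stored as a table. A minimal sketch (the numbers are made up for illustration):

```python
import numpy as np

# Hypothetical tabular policy for 3 states and 2 actions:
# pi[s, a] = pi(a | s); each row sums to 1.
pi = np.array([[0.8, 0.2],
               [0.5, 0.5],
               [0.1, 0.9]])

rng = np.random.default_rng(0)
state = 1
action = rng.choice(pi.shape[1], p=pi[state])  # sample A_t ~ pi(. | S_t = state)
```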

A value function is a prediction of future reward, used to evaluate the goodness or badness of states.

The value of a state $s$ under a policy $\pi$, denoted $v_\pi(s)$, is the expected return when starting in $s$ and following $\pi$ thereafter.
The state-value function for policy $\pi$:

$$v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right]$$

Similarly, we define the value of taking action $a$ in state $s$ under a policy $\pi$, denoted $q_\pi(s, a)$, as the expected return starting from $s$, taking the action $a$, and thereafter following policy $\pi$.
The action-value function for policy $\pi$:

$$q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right]$$
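Both definitions can be approximated directly by averaging sampled returns. A minimal Monte Carlo sketch for $v_\pi(s)$ (my own code, assuming a hypothetical env object with reset(state) -> state and step(action) -> (next_state, reward, done)):

```python
import numpy as np

def mc_state_value(env, pi, s, gamma=0.9, episodes=1000, seed=0):
    """Monte Carlo estimate of v_pi(s): average the returns of episodes
    that start in state s and follow the tabular policy pi thereafter."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(episodes):
        state, done, g, discount = env.reset(s), False, 0.0, 1.0
        while not done:
            a = rng.choice(len(pi[state]), p=pi[state])  # A_t ~ pi(. | S_t)
            state, r, done = env.step(a)                 # hypothetical interface
            g += discount * r                            # add gamma^k * R_{t+k+1}
            discount *= gamma
        total += g
    return total / episodes
```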

The relationship between the state-value function and the action-value function:

$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, q_\pi(s, a)$$

For any policy $\pi$ and any state $s$, the following consistency condition holds between the value of $s$ and the value of its possible successor states:

$$\begin{aligned}
v_\pi(s) &\doteq \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s] = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s] \\
&= \sum_a \pi(a \mid s) \sum_{s'} \sum_r p(s', r \mid s, a) \Big[ r + \gamma\, \mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s'] \Big] \\
&= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma\, v_\pi(s') \big]
\end{aligned}$$

The equation above is the Bellman equation for $v_\pi$. It expresses a relationship between the value of a state and the values of its successor states.
The Bellman equation forms the basis of a number of ways to compute, approximate, and learn $v_\pi$. The existence and uniqueness of $v_\pi$ are guaranteed as long as either $\gamma < 1$ or eventual termination is guaranteed from all states under the policy $\pi$.
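For example, the Bellman equation can be turned directly into an iterative update. A minimal sketch of iterative policy evaluation (my own code; it assumes the dynamics are available as a dictionary p[(s, a)] of (next_state, reward, probability) triples and that pi[s][a] gives $\pi(a \mid s)$; these names are illustrative, not from the text):

```python
def policy_evaluation(states, actions, p, pi, gamma=0.9, tol=1e-8):
    """Repeatedly apply the Bellman equation for v_pi until the values stop changing."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # v(s) <- sum_a pi(a|s) sum_{s', r} p(s', r | s, a) [r + gamma * v(s')]
            new_v = sum(
                pi[s][a] * sum(prob * (r + gamma * v[s2]) for s2, r, prob in p[(s, a)])
                for a in actions
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v
```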

A fundamental property of value functions used throughout RL is that they satisfy recursive relationships similar to the one we have already established for the return.

6. Optimal Policies and Optimal Value Functions

Solving a reinforcement learning task means, roughly, finding a policy that achieves a lot of reward over the long run.
There is always at least one policy that is better than or equal to all other policies; this is called an optimal policy, denoted $\pi_*$.
The definition of “better”: $\pi \geq \pi'$ if and only if $v_\pi(s) \geq v_{\pi'}(s)$ for all $s \in \mathcal{S}$.
All optimal policies share the same state-value function, called the optimal state-value function, denoted $v_*$, and defined as

$$v_*(s) \doteq \max_\pi v_\pi(s), \quad \text{for all } s \in \mathcal{S}$$

Optimal policies also share the same optimal action-value function, denoted $q_*$, and defined as

$$q_*(s, a) \doteq \max_\pi q_\pi(s, a), \quad \text{for all } s \in \mathcal{S},\ a \in \mathcal{A}$$

We can write $q_*$ in terms of $v_*$ as follows:

$$q_*(s, a) = \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a]$$

The Bellman optimality equation expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action from that state:

$$\begin{aligned}
v_*(s) &= \max_{a \in \mathcal{A}(s)} q_{\pi_*}(s, a) \\
&= \max_a \mathbb{E}_{\pi_*}[G_t \mid S_t = s, A_t = a] \\
&= \max_a \mathbb{E}_{\pi_*}[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a] \\
&= \max_a \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a] \\
&= \max_a \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma v_*(s') \big]
\end{aligned}$$

The Bellman optimality equation for $q_*$ is

$$q_*(s, a) = \mathbb{E}\!\left[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \,\middle|\, S_t = s, A_t = a\right] = \sum_{s', r} p(s', r \mid s, a) \Big[ r + \gamma \max_{a'} q_*(s', a') \Big]$$
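Likewise, the Bellman optimality equation can be turned into an update rule; this is value iteration. A minimal sketch (same hypothetical p[(s, a)] encoding as the policy-evaluation sketch above), after which acting greedily with respect to the computed values gives an optimal policy:

```python
def value_iteration(states, actions, p, gamma=0.9, tol=1e-8):
    """Apply v(s) <- max_a sum_{s', r} p(s', r | s, a) [r + gamma * v(s')] until convergence."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = max(
                sum(prob * (r + gamma * v[s2]) for s2, r, prob in p[(s, a)])
                for a in actions
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            break
    # Greedy policy with respect to v_* is optimal.
    pi_star = {s: max(actions, key=lambda a: sum(prob * (r + gamma * v[s2])
                                                 for s2, r, prob in p[(s, a)]))
               for s in states}
    return v, pi_star
```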
